Parallelizing BallTree Construction #132

Closed

SebastianAment wants to merge 2 commits into KristofferC:master from SebastianAment:parallel-ball-tree

Conversation

SebastianAment commented on Jan 7, 2022

Overview

This PR parallelizes the construction of BallTree structures, achieving a speedup of a factor of 5 for n = 1_000_000 points with 8 threads.

The implementation uses @spawn and @sync, which requires raising the Julia compatibility entry to 1.3 and incrementing the minor version of this package.
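
For orientation, here is a minimal sketch of the pattern (illustrative only; build_node_parallel! and build_node_sequential! are hypothetical stand-ins, not this PR's actual internals): the recursion spawns a task for one half of a node while building the other half on the current task, and falls back to the sequential builder once a node covers few enough points.

using Base.Threads: @spawn

# Recurse in parallel while the node spans more than `parallel_size` points;
# below that threshold, hand off to the (hypothetical) sequential builder.
function build_node_parallel!(tree, low, high, parallel_size)
    if high - low + 1 <= parallel_size
        return build_node_sequential!(tree, low, high)
    end
    mid = (low + high) >>> 1
    @sync begin
        @spawn build_node_parallel!(tree, low, mid, parallel_size)
        build_node_parallel!(tree, mid + 1, high, parallel_size)
    end
    return tree
end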

Benchmarks

Setup

using NearestNeighbors
using BenchmarkTools
d = 100
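
To reproduce the parallel numbers, Julia needs to be started with multiple threads; for example (the -t flag requires Julia ≥ 1.5, otherwise the JULIA_NUM_THREADS environment variable can be used):

julia -t 8              # or: JULIA_NUM_THREADS=8 julia
Threads.nthreads()      # should report 8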

On Master

n = 100;
X = randn(d, n);
@btime T = BallTree(X);
  1.244 ms (23 allocations: 174.83 KiB)

n = 10_000;
X = randn(d, n);
@btime T = BallTree(X);
  372.398 ms (26 allocations: 16.95 MiB)

n = 100_000;
X = randn(d, n);
@btime T = BallTree(X);
  7.989 s (26 allocations: 169.53 MiB)

n = 1_000_000;
X = randn(d, n);
@btime T = BallTree(X);
  161.170 s (26 allocations: 1.66 GiB)

With this PR (numbers updated after further edits that improved allocations)

n = 100;
X = randn(d, n);
@btime T = BallTree(X);
  813.417 μs (244 allocations: 189.97 KiB)

n = 10_000;
X = randn(d, n);
@btime T = BallTree(X);
  101.158 ms (25348 allocations: 18.70 MiB)

n = 100_000;
X = randn(d, n);
@btime T = BallTree(X);
  2.816 s (253697 allocations: 187.03 MiB)

n = 1_000_000;
X = randn(d, n);
@btime T = BallTree(X);
  33.461 s (2527680 allocations: 2.13 GiB)

Further, the PR still allows for sequential execution with the parallel = false keyword:

n = 100;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  1.090 ms (24 allocations: 174.06 KiB)

n = 10_000;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  362.205 ms (27 allocations: 16.95 MiB)

n = 100_000;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  8.262 s (27 allocations: 169.53 MiB)

n = 1_000_000;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  150.437 s (25 allocations: 1.66 GiB)

Summary

  • The parallel implementation yields a speedup even for small datasets of n = 100 data points, and achieves a speedup of a factor of 3 for n = 100_000 points.

  • Compared to the sequential code, memory allocation is up by about 10-20% in size and considerably in count. This is because the parallel code has to allocate temporary arrays per task to avoid race conditions, while the sequential code reuses a single temporary (see the sketch after this list). If allocations, rather than execution speed, are the concern, one can always use the parallel = false flag this PR provides.

  • The sequential option parallel = false keeps the same allocation behavior as the master branch and comparable performance. Notably, the sequential path of this PR is consistently about 20% faster than master on the n = 100 test case.
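
A toy sketch of that allocation trade-off (illustrative only, not the package's code): a single scratch buffer can be reused when the work runs sequentially, but concurrently running tasks each need their own copy, which is exactly where the extra allocations come from.

using Base.Threads: @spawn

# Sequential: one scratch vector, reused for every column (no race possible).
function column_sums_sequential(data::Matrix{Float64})
    scratch = zeros(size(data, 1))
    return [sum(copyto!(scratch, col)) for col in eachcol(data)]
end

# Parallel: each task makes its own temporary copy; sharing a single
# `scratch` across tasks would be a data race.
function column_sums_parallel(data::Matrix{Float64})
    tasks = map(eachcol(data)) do col
        @spawn sum(Vector{Float64}(col))  # per-task temporary
    end
    return fetch.(tasks)
end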

The experiments were run on a 2021 MacBook Pro with an M1 Pro and 8 threads.

SebastianAment force-pushed the parallel-ball-tree branch 2 times, most recently from e40c3f0 to f6acba9 on January 8, 2022 at 11:25.
KristofferC (Owner) left a comment

Thanks for working on this.

The parallel implementation yields a speedup even for small datasets of n = 100 data points,

But from what I understand, the parallel building only happens if the size is larger than DEFAULT_BALLTREE_MIN_PARALLEL_SIZE, which is 1024? What gives the speed improvement for small trees?


Since the structure of creating a BallTree and a KDTree is pretty much the same, couldn't the same be applied there?


You seem to have an extra commit not related to the tree building in this PR.

return HyperSphere(SVector{N,T}(center), rad)
end

@inline function interpolate(::M, c1::V, c2::V, x, d) where {V <: AbstractVector, M <: NormMetric}
KristofferC (Owner)

Why move this function?

SebastianAment (Author)

I had two versions locally: the previous one, and this one without the array buffer variable ab. It turns out that in the sequential code the compiler is able to get rid of the allocations without explicitly pre-allocating an array buffer. In the parallel code, sharing an array buffer across tasks leads to race conditions, which is why I wrote this modification.

I can move it back to where it was in the file.
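
A hedged sketch of the buffer-free idea (simplified signature, not the PR's exact code): with StaticArrays the interpolated center can be produced as a stack-allocated SVector, so no mutable scratch array exists to be shared between tasks.

using StaticArrays

# Interpolate between two centers at distance x out of total distance d,
# returning a new SVector instead of writing into a pre-allocated buffer.
function interpolate_center(c1::SVector{N,T}, c2::SVector{N,T}, x, d) where {N,T}
    alpha = x / d
    return (1 - alpha) * c1 + alpha * c2
end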

high::Int,
tree_data::TreeData,
reorder::Bool,
parallel::Val{true},
KristofferC (Owner)

Using a Val and a separate function like this feels a bit awkward. Couldn't one just look at parallel_size in the original build_BallTree function and then decide whether to call the parallel function or the serial one?

SebastianAment (Author)

Using type dispatch on the parallel variable is important, because the compiler is able to get rid of temporary allocations during sequential execution. I can isolate the recursive component of the function though, and only use the Val(true) dispatch for that. If we only use a regular if statement on a Bool, performance during sequential execution will take a hit compared to the status quo.
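
A sketch of the dispatch pattern under discussion (hypothetical names, not the PR's actual functions): the Bool keyword is converted to a Val once at the entry point, so each body is compiled separately and the serial method keeps its allocation-free code path.

# Entry point: the runtime Bool becomes a compile-time Val exactly once.
build!(args...; parallel::Bool = true) = _build!(Val(parallel), args...)

# Serial method: the compiler can reuse/elide temporaries as on master.
function _build!(::Val{false}, args...)
    # ... sequential recursion with a single reused buffer
end

# Parallel method: @spawn/@sync recursion with per-task temporaries.
function _build!(::Val{true}, args...)
    # ... spawning recursion
end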

SebastianAment (Author)

The parallel implementation yields a speedup even for small datasets of n = 100 data points,

But from what I understand, the parallel building only happens if the size is larger than DEFAULT_BALLTREE_MIN_PARALLEL_SIZE, which is 1024? What gives the speed improvement for small trees?

This was run with a prior version where parallel_size = 0, so even small trees were built in parallel. A larger parallel_size seems beneficial for larger problems, where parallelization plays a bigger role.

Since the structure of creating a BallTree and a KDTree is pretty much the same, the same could be applied there?

I have a parallelized KDTree implementation locally too, but wanted to finish this one first. Do you prefer having everything in the same PR?

You seem to have an extra commit not related to the tree building in this PR.

Yes, maybe this wasn't smart in retrospect. I thought at the time that this PR would be easy to merge and just built on top of it. Would you like me to edit the commit history of the current PR?

KristofferC (Owner)

Implemented in #216 for KDTree + BallTree
