Parallelizing BallTree Construction#132
Parallelizing BallTree Construction#132SebastianAment wants to merge 2 commits intoKristofferC:masterfrom
BallTree Construction#132Conversation
e40c3f0 to
f6acba9
Compare
f6acba9 to
18b93cb
Compare
There was a problem hiding this comment.
Thanks for working on this.
The parallel implementation yields a speed up for even small datasets of n = 100 data points,
But from what I understand, the parallel building only happens if the size is smaller than DEFAULT_BALLTREE_MIN_PARALLEL_SIZE which is 1024? What gives the speed improvement for small trees?
Since the structure of creating a BallTree and a KDTree is pretty much the same, the same could be applied there?
You seem to have an extra commit not related to the tree building in this PR.
| return HyperSphere(SVector{N,T}(center), rad) | ||
| end | ||
|
|
||
| @inline function interpolate(::M, c1::V, c2::V, x, d) where {V <: AbstractVector, M <: NormMetric} |
There was a problem hiding this comment.
I had two versions locally, the previous one, and this one without the array buffer variable ab. It turns out that in the sequential code, the compiler is able to get rid of the allocations without explicitly pre-allocating an ArrayBuffer variable. In the parallel code, having an array buffer leads to race conditions, which is why I wrote this modification.
I can move it back to where it was in the file.
| high::Int, | ||
| tree_data::TreeData, | ||
| reorder::Bool, | ||
| parallel::Val{true}, |
There was a problem hiding this comment.
Using a Val and use a separate function like this feels a bit awkward. Couldn't one just look at parallel_size in the original build_BallTree function and then decide whether to call the parallel function or the serial one?
There was a problem hiding this comment.
Using type dispatch on the parallel variable is important, because the compiler is able to get rid of temporary allocations during sequential execution. I can isolate the recursive component of the function though, and only use the Val(true) dispatch for that. If we only use a regular if statement on a Bool, performance during sequential execution will take a hit compared to the status quo.
This was run with a prior version where
I have a parallelized KDTree implementation locally too, but wanted to finish this one first. Do you prefer having everything in the same PR?
Yes, maybe this wasn't smart in retrospect. I thought at the time that this PR would be easy to merge and just built on top of it. Would you like me to edit the commit history of the current PR? |
|
Implemented in #216 for KDTree + BallTree |
Overview
This PR parallelizes the construction of
BallTreestructures, achieving a speedup of a factor of 5 forn = 1_000_000points with 8 threads.The implementation uses
@spawnand@sync, which requires raising the Julia compatibility entry to 1.3 and incrementing the minor version of this package.Benchmarks
Setup
On Master
With this PR (updated after further edits with improved allocations)
Further, the PR still allows for sequential execution with the
parallel = falsekeyword:Summary
The parallel implementation yields a speed up for even small datasets of
n = 100data points, and achieves a speedup of a factor of 3 forn = 100_000points.Compared to the sequential code, the memory allocation is up by about 10-20% in size and considerably in number, which is due to the parallel code needing to allocate temporary arrays to avoid race conditions, while the sequential code reuses a single temporary. If allocations, rather than execution speed are the concern, one can always use the
parallel = falseflag this PR provides.The sequential option
parallel = falsemaintains the same allocation behavior and comparable performance as the master branch. Notably, the sequential branch of this PR is consistently 20% faster on then = 100test case compared to master.The experiments were run on a 2021 MacBook Pro with an M1 Pro and 8 threads.