[RFC] Asynchronous multithreaded evaluator #141
Conversation
- to use with `update_fitness!()`
- check that a rebuild is necessary every time a candidate is inserted, not only upon restarts
- since they run in the same thread, that just creates overhead
- because of `Threads`
Wow @alyst, this looks great. I will put it through its paces and familiarise myself with the code, and then hopefully merge in a few days. This will be super useful to many users. Big thanks and a Happy New Year to you. :)
I've taken only a brief look so far, and my testing has been positive. I'm currently only on a 2-core laptop (over the holidays), so benchmarking isn't meaningful... ;) One concern is that this looks quite heavy/complex, so it might be harder to maintain going forward. Can you clarify what the main benefits are compared to something simple, like an evaluator that simply …? When I'm on a 4-core machine I'll try to add some simple examples/benchmarks to see what kind of benefits can be had.
It's a "warm" vs "cold" start. There's quite some work done behind the curtains of …
It's already much simpler than #46, since we don't have to send the candidates or the fitnesses across different processes. :)
Thank you, that would be very helpful!
- `num_eval` or `job_id` updates are done exclusively on the master thread
Thanks, that makes sense. I tried the default and dxnes optimizers, comparing single- and multi-threaded performance on a 4-core (and a 2-core) laptop, but I don't see increased performance, or even more function calls performed. I'll also try BorgMOEA, since maybe there we can expect better performance. But if you can help me understand why I'm not seeing more func evals on at least dxnes (I tried setting a higher lambda to ensure more samples per "round"), that would help evaluate this. Here is the start of an example script for multi-threaded optimization:
Ok, forget my previous comment. I realized the function I optimized in the examples was way too fast to benefit from the thread switching. I still want to do more testing on a larger/faster 8-core/16-thread machine, but it does look solid so far.
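A self-contained illustration of that effect (Base Julia only; all names below are made up for this sketch): threaded evaluation only pays off once each fitness call is expensive enough to amortize the scheduling overhead, and for a nanosecond-scale function the overhead dominates.

```julia
# Evaluate a fitness function over a batch of candidates using Threads.@threads.
function eval_all(f, xs)
    fitness = Vector{Float64}(undef, length(xs))
    Threads.@threads for i in eachindex(xs)
        fitness[i] = f(xs[i])
    end
    return fitness
end

cheap(x)  = sum(abs2, x)                   # nanoseconds per call: overhead dominates
costly(x) = (sleep(0.001); sum(abs2, x))   # ~1 ms per call: threads can help

xs = [rand(16) for _ in 1:100]
eval_all(cheap, xs)                        # warm-up / compilation
t_cheap  = @elapsed eval_all(cheap, xs)
t_costly = @elapsed eval_all(costly, xs)
```

With `JULIA_NUM_THREADS=1` the two timings scale the same way; with more threads only `t_costly` shrinks noticeably.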
@robertfeldt Thanks for your script! I'm currently using it to profile the `MultithreadEvaluator`. It's true that these problems are too fast (also, dxnes uses multithreaded BLAS in my case, which may affect the results a bit), but the script still highlighted some problems with my approach. I was too naive in assuming that all the threads are fully utilized when I saw 100% CPU usage. I've made some progress, but I hope to improve things further soon. So it's definitely better to wait for the revised version.
- to prevent rescheduling to different threads
- hint which worker to wake up to process the new job
I've pushed the revised version. Unfortunately, it doesn't solve the issue of optimization of simple functions actually being slower than single-threaded.

The previous design was a legacy of the multi-process parallelization implemented in #46: the master thread was both dispatching the jobs to the worker threads and storing the results in the archive, while the worker threads were listening for job requests coming from the master. I've also replaced the Channel with SpinLock-based synchronization to minimize worker-thread idle times and Julia task switching (which, unfortunately, makes the code more complicated).

As I've said before, a single thread is still faster for simpler problems. But when the fitness calculation gets more expensive (one can easily model that by adding `sleep(0.001)` to the optimized function), the effects of multithreading become more visible. The effects improve further if lambda (the population size in dxnes) is made larger. For a 19-worker setup with `sleep(0.001)` in `rastrigin()` and lambda=1000, I get 5000 num_evals for res_single and 54000 num_evals for res_multi.

But for algorithms like DiffEvo or Borg, having more than 4 threads should not bring a big improvement, as it's limited by the number of individuals that need to be recalculated at each step of the algorithm. I would like to improve the situation for simpler problems too, but I don't know how to proceed further.
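The SpinLock-based master/worker split described above can be sketched in plain Base Julia. The types and names below are illustrative, not the PR's actual ones; the `yield()` in the spin loop keeps the sketch runnable even with a single Julia thread, whereas the PR spins on dedicated threads precisely to avoid such task switches.

```julia
# A SpinLock-guarded job queue: the master pushes candidates, a worker pops
# and evaluates them, and results are collected under the same lock.
mutable struct JobQueue
    lock::Threads.SpinLock
    jobs::Vector{Vector{Float64}}     # candidates waiting for evaluation
    results::Vector{Float64}          # computed fitnesses
    done::Threads.Atomic{Bool}        # master signals shutdown
end
JobQueue() = JobQueue(Threads.SpinLock(), Vector{Float64}[], Float64[],
                      Threads.Atomic{Bool}(false))

function run_worker(q::JobQueue, fitness)
    while true
        job = nothing
        lock(q.lock)
        isempty(q.jobs) || (job = pop!(q.jobs))
        unlock(q.lock)
        if job === nothing
            q.done[] && return
            yield()                   # a dedicated thread would spin instead
        else
            y = fitness(job)
            lock(q.lock)
            push!(q.results, y)
            unlock(q.lock)
        end
    end
end

function wait_results(q::JobQueue, n::Int)
    while true
        lock(q.lock); k = length(q.results); unlock(q.lock)
        k >= n && return
        yield()
    end
end

q = JobQueue()
w = Threads.@spawn run_worker(q, x -> sum(abs2, x))
for i in 1:5                          # master dispatches 5 jobs
    lock(q.lock); push!(q.jobs, fill(Float64(i), 3)); unlock(q.lock)
end
wait_results(q, 5)
q.done[] = true
wait(w)
sort(q.results)                       # [3.0, 12.0, 27.0, 48.0, 75.0]
```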
Thanks for the updates, this looks very nice. I'm ready to merge, but I did notice this in the weekly Julia news summary: … which looks useful and might simplify things for us. I guess having our own queue and scheduling to it might still give benefits, but the simplicity of using something like ThreadPools is appealing...
I've also come across ThreadPools.jl. It's definitely very useful, but IIUC every fitness evaluation would require creating and dispatching a Task object, which implies some overhead.
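That overhead is easy to demonstrate in Base Julia (names here are made up for the sketch): dispatching one Task per fitness call, as a ThreadPools-style design would, adds per-call cost that dominates when the fitness function itself is cheap.

```julia
# Compare plain in-thread calls against one spawned Task per evaluation.
f(x) = sum(abs2, x)
xs = [rand(8) for _ in 1:1_000]

direct() = map(f, xs)                      # plain function calls

function spawned()                         # one Task object per evaluation
    tasks = [Threads.@spawn(f(x)) for x in xs]
    return map(fetch, tasks)
end

direct(); spawned()                        # warm-up / compilation
t_direct  = @elapsed direct()
t_spawned = @elapsed spawned()             # typically much slower for cheap f
```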
Julia 1.3.0 finally added support for multithreading, so I've implemented MultithreadEvaluator, which makes use of it.
The PR adds the concept of AbstractAsynchronousEvaluator. The idea is that one can submit fitness-calculation jobs to such an evaluator via an async_update_fitness() call, which immediately returns the job id so that the optimization algorithm can continue while the fitness calculation is done in the background. When required, the results can be collected/waited for via sync_update_fitness!(). Support for asynchronous evaluation was added to BorgMOEA: it can generate recombined individuals and then immediately proceed to the next step while their fitnesses are being calculated (the synchronization happens before the next recombination).
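A minimal self-contained sketch of that submit/collect pattern; the function names mirror the description above, but this Task-per-job implementation is illustrative, not the PR's actual AbstractAsynchronousEvaluator.

```julia
# Toy asynchronous evaluator: submitting a job returns an id immediately;
# the fitness is computed in a background Task and fetched on demand.
struct AsyncEvaluator
    f::Function
    pending::Dict{Int,Task}
    next_id::Base.RefValue{Int}
end
AsyncEvaluator(f) = AsyncEvaluator(f, Dict{Int,Task}(), Ref(0))

# submit a fitness-calculation job; immediately returns the job id
function async_update_fitness(ev::AsyncEvaluator, candidate)
    id = (ev.next_id[] += 1)
    ev.pending[id] = Threads.@spawn ev.f(candidate)
    return id
end

# wait for the given job to finish and return its fitness
sync_update_fitness!(ev::AsyncEvaluator, id::Int) = fetch(pop!(ev.pending, id))

ev = AsyncEvaluator(x -> sum(abs2, x))
id = async_update_fitness(ev, [1.0, 2.0])   # returns right away
# ... the optimizer can do other work here ...
sync_update_fitness!(ev, id)                # == 5.0
```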
The PR also generalizes update_fitness!() to support any iterator over Candidate objects. This allows generic support for fitness calculation over the whole population (via PopulationCandidatesIterator) or over Borg mutants (BorgMutantsIterator). The benefit over AbstractVector{<:Candidate} is that the candidates are only created when they are required for fitness calculation, reducing the memory footprint. I haven't checked, but single-objective algorithms, such as the NES family, which require massive fitness recalculations, might benefit from using update_fitness!() as well.
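A simplified stand-in for the lazy-candidates idea: each candidate is materialized on demand as the consumer iterates, instead of allocating the whole vector up front. (PopulationCandidatesIterator is the PR's real version; the types and function names below are illustrative.)

```julia
# A candidate pairs a parameter vector with a mutable fitness slot.
struct LazyCandidate
    params::Vector{Float64}
    fitness::Base.RefValue{Float64}
end

# A generator: no LazyCandidate exists until the consumer asks for it.
candidates_iterator(population::AbstractMatrix) =
    (LazyCandidate(copy(x), Ref(NaN)) for x in eachcol(population))

# Works with any iterator of candidates, not just an AbstractVector.
function eval_fitness!(f, candidates)
    for c in candidates
        c.fitness[] = f(c.params)
    end
    return candidates
end

population = [1.0 3.0;
              2.0 4.0]                         # two individuals, one per column
cs = collect(candidates_iterator(population))  # collected only to inspect results
eval_fitness!(x -> sum(abs2, x), cs)
[c.fitness[] for c in cs]                      # [5.0, 25.0]
```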
The main component of the PR is MultithreadEvaluator, which implements the AbstractAsynchronousEvaluator interface and (on top of it) the parallelized version of update_fitness!(). It's loosely based on #46, but since it uses threads within the same process operating on the same data, the communication is much simpler:
So far I've been testing BorgMOEA with 36 threads, and it runs fine. I haven't done benchmarks, but hopefully the overhead of communication between the threads is minimal (in comparison to #43). However, with many threads I see that the Pareto frontier update becomes the main bottleneck. Although it uses an R*-tree for efficient indexing, maintaining a large frontier (~8000 points) is quite expensive, so using 36 threads is not much more efficient than, say, 18. ATM I don't know how to address this efficiently. Hopefully, single-objective algorithms that can benefit from parallel fitness calculation (NES) will not have this issue.
make_evaluator() (and thus bboptimize() as well) is taught to create a MultithreadEvaluator when called with the NThreads=n keyarg.
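Usage then reduces to passing the new keyarg. This snippet assumes the PR is merged and Julia was started with several threads (e.g. `JULIA_NUM_THREADS=4`); the option values are arbitrary examples, and only `NThreads` is new in this PR.

```julia
using BlackBoxOptim

res = bboptimize(x -> sum(abs2, x);
                 SearchRange = (-5.0, 5.0),
                 NumDimensions = 8,
                 NThreads = 3)          # the new keyarg: use a MultithreadEvaluator
best_fitness(res)
```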
Any optimization algorithm that uses update_fitness!(..., candidatesIterator, ...) should be effectively supported by MultithreadEvaluator.
Of course, problem-specific fitness() methods have to be made multithread-compatible. If any (temporary) objects are modified during fitness calculation, one needs to ensure that each worker thread operates on its own objects. There's an example of using array pools and Threads.SpinLock in OptEnrichedSetCover.jl, but probably a clean and simple example should be added to BBO.
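One simple way to achieve that, as an alternative to the array-pool + Threads.SpinLock approach mentioned above, is to give every thread its own scratch buffer (all names below are illustrative, not from the PR):

```julia
# A callable fitness object holding one private scratch buffer per thread,
# so temporary work never crosses thread boundaries.
struct BufferedFitness
    scratch::Vector{Vector{Float64}}       # one buffer per thread
end
BufferedFitness(n::Int) =
    BufferedFitness([zeros(n) for _ in 1:Threads.nthreads()])

function (bf::BufferedFitness)(x::AbstractVector{<:Real})
    buf = bf.scratch[Threads.threadid()]   # this thread's buffer, never shared
    @. buf = x^2                           # temporary computation stays thread-local
    return sum(buf)
end

fit = BufferedFitness(3)
fit([1.0, 2.0, 3.0])      # == 14.0
```

(Indexing by `Threads.threadid()` is the straightforward pattern for the Julia 1.3-era threading model used in this PR.)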