
Conversation

@alyst (Contributor) commented Dec 30, 2019

Julia 1.3.0 finally supports threads, so I've implemented a MultithreadEvaluator that makes use of them.

The PR adds the concept of an AbstractAsynchronousEvaluator. The idea is that one can submit fitness calculation jobs to such an evaluator via an async_update_fitness() call, which immediately returns the job id, so the optimization algorithm can continue while the fitness calculation runs in the background. When required, the results can be collected/waited for via sync_update_fitness!(). Support for asynchronous evaluation was added to BorgMOEA: it can generate recombined individuals and then immediately proceed to the next step while their fitnesses are calculated (the synchronization happens before the next recombination).
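
Roughly, the intended usage pattern looks like this (a minimal sketch based on the description above; the exact signatures in the PR may differ):

```julia
# Sketch of the asynchronous evaluation flow described above;
# `evaluator` is an AbstractAsynchronousEvaluator, `candidate` a Candidate.
job_id = async_update_fitness(evaluator, candidate)  # returns immediately

# ... the algorithm continues, e.g. generating the next recombined individual ...

# block until the background fitness calculation for this job is finished
sync_update_fitness!(evaluator, job_id)
```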

The PR also generalizes update_fitness!() to support any iterator over Candidate objects. This allows generic support for fitness calculation over the whole population (via PopulationCandidatesIterator) or over Borg mutants (BorgMutantsIterator). The benefit over AbstractVector{<:Candidate} is that the candidates are only created when they are required for fitness calculation, thereby reducing the memory footprint. I haven't checked, but single-objective algorithms, such as the NES family, which require massive fitness recalculations, might benefit from using update_fitness!() as well.
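
The iterator idea in a minimal sketch (the popsize/make_candidate helpers are hypothetical stand-ins for the actual population API):

```julia
# A lazy iterator over a population's candidates: each Candidate object is
# constructed only when the evaluator actually pulls it for evaluation.
struct PopulationCandidatesIterator{P}
    population::P
end

Base.length(it::PopulationCandidatesIterator) = popsize(it.population)

function Base.iterate(it::PopulationCandidatesIterator, i::Int = 1)
    i > length(it) && return nothing
    return make_candidate(it.population, i), i + 1  # created on demand
end

# update_fitness!() can then consume any such iterator:
# update_fitness!(evaluator, PopulationCandidatesIterator(popn))
```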

The main component of the PR is MultithreadEvaluator, which implements the AbstractAsynchronousEvaluator interface and (on top of it) a parallelized version of update_fitness!(). It's loosely based on #46, but since it uses threads within the same process and over the same data, the communication is much simpler (see the sketch after the list):

  • The evaluator spawns N worker tasks in separate threads. (Julia's public API doesn't yet allow controlling which threads tasks are assigned to, so I had to use non-public API; see the MTEvaluatorWorker() ctor. Maybe it's worth submitting an issue to the Julia project.)
  • The master task (where the optimization algorithm runs) notifies the workers of new jobs via a jobid_chan::Channel{Int} that is uniquely bound to each worker task.
  • The workers notify the master about the completion of a fitness calculation via the shared done_workers::Channel{Int} channel.
  • To efficiently keep track of completed and pending fitness jobs, there's a SlidingBitSet type.
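
A rough sketch of this wiring (illustrative, not the exact PR code; the thread pinning uses the non-public jl_set_task_tid call mentioned above, so treat it as unstable API):

```julia
# Each worker is a Task pinned to its own thread that blocks on its private
# jobid_chan and reports completions on the shared done_workers channel.
function spawn_worker(threadid::Int, jobid_chan::Channel{Int},
                      done_workers::Channel{Int}, run_job!::Function)
    task = Task() do
        for job_id in jobid_chan          # blocks until the master sends a job
            run_job!(job_id)              # calculate the fitness for this job
            put!(done_workers, threadid)  # notify the master of completion
        end
    end
    task.sticky = true                    # keep the task on a single thread
    # non-public API: pin the task to a specific thread (0-based id)
    ccall(:jl_set_task_tid, Cvoid, (Any, Cint), task, threadid - 1)
    schedule(task)
    return task
end
```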

So far I'm testing BorgMOEA with 36 threads and it runs fine. I haven't done the benchmarks, but hopefully the overhead of communication between the threads is minimal (in comparison to #43). However, with many threads I see that the Pareto frontier update becomes the main bottleneck. Although it uses an R*-tree for efficient indexing, maintaining a large frontier (~8000 points) is quite expensive, so it looks like using 36 threads is not much more efficient than, say, 18. At the moment I don't know how to address this efficiently. Hopefully, single-objective algorithms that can benefit from parallel fitness calculation (NES) will not have this issue.

make_evaluator() (and so bboptimize() as well) is taught to create a MultithreadEvaluator when called with the NThreads=n keyword argument.
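
For example (assuming Julia was started with several threads, e.g. JULIA_NUM_THREADS=4):

```julia
using BlackBoxOptim

# any thread-safe fitness function (see the note on thread-compatibility below)
sphere(x) = sum(abs2, x)

# NThreads makes make_evaluator() construct a MultithreadEvaluator;
# one thread is typically left for the master/optimizer task
res = bboptimize(sphere; SearchRange = (-5.0, 5.0), NumDimensions = 8,
                 NThreads = Threads.nthreads() - 1)
```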

Any optimization algorithm that uses update_fitness!(..., candidatesIterator, ...) should be effectively supported by MultithreadEvaluator.

Of course, problem-specific fitness() methods have to be made multithread-compatible. If any (temporary) objects are modified during fitness calculation, one needs to make sure that each worker thread operates on its own objects. There's an example of using array pools and Threads.SpinLock in OptEnrichedSetCover.jl, but probably some clean and simple example should be added to BBO.
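
As a minimal illustration (not the OptEnrichedSetCover.jl pool implementation), one can give each thread its own scratch buffer:

```julia
# Per-thread scratch buffers: each worker thread mutates only its own slot,
# so the temporaries need no locking (sketch; assumes tasks stay pinned to
# their threads, as MultithreadEvaluator's workers do).
const SCRATCH = [Float64[] for _ in 1:Threads.nthreads()]

function my_fitness(x::AbstractVector{Float64})
    buf = SCRATCH[Threads.threadid()]  # this thread's private buffer
    resize!(buf, length(x))
    buf .= abs2.(x)                    # temporary work happens in the buffer
    return sum(buf)
end
```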

@robertfeldt (Owner)

Wow @alyst, this looks great. I will put it through its paces and familiarise myself with the code, and then hopefully merge in a few days. This will be super useful to many users. Big thanks and a Happy New Year to you. :)

@robertfeldt (Owner)

I've taken only a brief look so far and my testing has been positive. I'm currently only on a 2-core laptop (over the holidays), so it's not meaningful to benchmark... ;)

One concern is that this looks quite heavy/complex, so it might be harder to maintain going forward. Can you clarify what the main benefits are compared to something simple like an evaluator that simply Threads.@threads loops over the fitness calculations of a set of candidates?

When I'm on a 4-core machine I'll try to add some simple examples/benchmarks to see what kind of benefits can be had.

@alyst (Contributor, Author) commented Jan 2, 2020

Can you clarify what the main benefits are compared to something simple like an evaluator that simply Threads.@threads loops over the fitness calculations of a set of candidates?

It's "warm" vs "cold" start. There's quite some work done behind the curtain of @threads for: the tasks are created, assigned to threads, scheduled and waited for. It's OK if this is done once per invocation of the algorithm, but in our case it's done millions of times (at each optimization step). It's still worth benchmarking, but I suspect that for simple fitness functions @threads for might be even slower than no multithreading at all.
With MultithreadEvaluator the creation of the tasks is done once per optimization; [async_]update_fitness!() just notifies the existing tasks. I hope the Channel type that I use for inter-thread communication implements the most efficient synchronization (which, I guess, should be SpinLock-based, without any involvement of the Julia task scheduler).
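
For contrast, the per-call pattern under discussion would look roughly like this (the fitness/params field names are illustrative, not the actual Candidate fields):

```julia
# "Cold start" on every call: Threads.@threads creates, schedules and joins
# a fresh batch of tasks on each invocation, i.e. at every optimization step.
function update_fitness_naive!(fitfn, candidates::AbstractVector)
    Threads.@threads for i in eachindex(candidates)
        candidates[i].fitness = fitfn(candidates[i].params)
    end
    return candidates
end
```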

One concern is that this looks quite heavy/complex so might be harder to maintain going forward.

It's already much simpler than #46, since we don't have to send the candidates or the fitnesses across different processes. :)
There's the MTEvaluatorWorker() constructor that uses non-public API, but besides that I think it's quite straightforward.
Maybe there are packages that implement efficient high-level threading primitives, but so far I haven't come across any.

When I'm on a 4-core machine I'll try to add some simple examples/benchmarks to see what kind of benefits can be had.

Thank you, that would be very helpful!

Commit: num_eval or job_id updates are exclusively done on the master thread
@robertfeldt (Owner)

Thanks, that makes sense.

I compared single- and multi-threaded performance of the default and dxnes optimizers on a 4-core (and a 2-core) laptop, but I don't see increased performance, or even more function calls performed. I'll also try BorgMOEA, since maybe there we can expect better performance, but if you can help me understand why I'm not seeing more func evals on at least dxnes (I tried setting a higher lambda to ensure more samples per "round"), that would help evaluate this.

Here is the start of an example script for multi-threaded optimization:
https://gist.github.com/robertfeldt/753922ccc4a3ad46e59e257fb86763e4

@robertfeldt (Owner)

OK, forget my previous comment. I realized the function I optimized in the examples was way too fast to benefit from the thread switching. I want to do some more testing on a larger/faster 8-core/16-thread machine, though, but it does look solid so far.

@alyst (Contributor, Author) commented Jan 4, 2020

@robertfeldt Thanks for your script! I'm currently using it to profile the MultithreadEvaluator. It's true that these problems are too fast (also, dxnes uses multithreaded BLAS in my case, which may affect the results a bit), but it still highlighted some problems with my approach. I was too naive in assuming that all the threads were fully utilized when I saw 100% CPU usage. I've made some progress, and I hope to improve things more soon. So it's definitely better to wait for the revised version.

@alyst (Contributor, Author) commented Jan 6, 2020

I've pushed the revised version. Unfortunately, it doesn't solve the issue of the optimization of simple functions actually being slower than single-threaded.
But it improves the design of how the inter-thread communication is organized.

The previous design was a legacy of the multi-process parallelization implemented in #46: the master thread was both dispatching jobs to the worker threads and storing the results in the archive, while the worker threads were listening for job requests coming from the master.
The new design benefits from the workers actually being threads of the same process. Now the master thread only puts jobs into the queue, while the worker threads actively take jobs from that queue. That solves the problem that no job dispatch was possible during the archive update (which can be expensive in multi-objective optimization). Now the archive update can happen in parallel with the fitness calculation, and the threads should get new jobs much faster.

I've also replaced the Channel with SpinLock-based synchronization to minimize worker thread idle times and Julia task switching (which, unfortunately, makes the code more complicated).
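
A condensed sketch of the revised scheme (not the exact PR code; run_job! and the Int job ids stand in for the actual job representation):

```julia
# The master pushes job ids onto a shared queue; each worker polls it under
# a SpinLock and busy-waits instead of yielding to the Julia task scheduler.
const queue_lock = Threads.SpinLock()
const job_queue  = Int[]

function worker_loop!(run_job!::Function, shutdown::Ref{Bool})
    while !shutdown[]
        job = 0
        lock(queue_lock)
        try
            isempty(job_queue) || (job = popfirst!(job_queue))
        finally
            unlock(queue_lock)
        end
        # jl_cpu_pause is an internal CPU spin hint; run the job if we got one
        job == 0 ? ccall(:jl_cpu_pause, Cvoid, ()) : run_job!(job)
    end
end
```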

As I've said before, a single thread is still faster for simpler problems. But when the fitness calculation gets more expensive (one can easily model that by adding sleep(0.001) to the optimized function), the effects of multithreading become more visible. The effects improve further if lambda (the population size in dxnes) is made larger. For a 19-worker setup with sleep(0.001) in rastrigin() and lambda=1000, I get 5000 num_evals for res_single and 54000 num_evals for res_multi. But for algorithms like DiffEvo or Borg, having more than 4 threads should not make a big improvement, as they are limited by the number of individuals that need to be recalculated at each step of the algorithm.
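
The slowdown modeling mentioned here, for reference (a sketch; the actual benchmark script may differ):

```julia
# Rastrigin with an artificial 1 ms cost per evaluation, to emulate an
# expensive fitness function when benchmarking the multithreaded evaluator.
function slow_rastrigin(x)
    sleep(0.001)  # simulate an expensive fitness computation
    return 10length(x) + sum(xi -> xi^2 - 10cos(2π * xi), x)
end
```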

I would like to improve the situation for simpler problems too, but I don't know how to proceed further.
For example, I see that sometimes a worker tries to get a job, but by the time it acquires the lock on the jobs queue, the job is already gone. I suspect that's because the OS scheduler randomly gives less priority to that specific thread (I now report the number of function evaluations per worker, and one can often see an imbalance). I've tried to profile the code, but it looks like the @profile bboptimize() results are not 100% accurate in the multithreaded case: I see that most of the time is spent waiting for the next job and no time at all computing the fitness, which doesn't look right.

@robertfeldt (Owner) commented Jan 21, 2020

Thanks for the updates, this looks very nice. I'm ready to merge, but I did notice this in the weekly Julia news summary:

https://discourse.julialang.org/t/ann-threadpools-jl-improved-thread-management-for-background-and-nonuniform-tasks/33592

which looks useful and might simplify things for us. I guess having our own queue and scheduling for it might still give us benefits, but the simplicity of using something like ThreadPools is appealing...

@alyst (Contributor, Author) commented Jan 21, 2020

I've also come across ThreadPools.jl. It's definitely very useful, but IIUC every fitness evaluation would require creating and dispatching a Task object, which implies some overhead.
It would be nice if at some point there were a package providing a generic framework for the kind of multithreading we need in BBO: tasks constrained to evaluating f(x::T), where f and T are fixed, but using synchronization mechanisms with minimal extra cost.
(Well, actually, we could try extracting this code from MultithreadEvaluator and making such a package ourselves. Maybe there would be some interest from other people.)
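
The per-evaluation overhead in question, schematically (fitfn and x are arbitrary placeholders; Threads.@spawn stands in for whatever task dispatch a pool package would do under the hood):

```julia
# A task-per-evaluation design pays for Task creation, scheduling and fetch()
# on every fitness call; a long-lived worker pays only for the (much cheaper)
# job-queue synchronization.
t = Threads.@spawn fitfn(x)  # new Task allocated and scheduled per call
y = fetch(t)                 # synchronize on completion
```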

@robertfeldt merged commit aec4b96 into robertfeldt:master on Feb 2, 2020