
Conversation

@alyst (Contributor) commented Dec 30, 2019

Julia 1.3.0 finally supports threads, so I've implemented a MultithreadEvaluator that makes use of them.

The PR adds the concept of an AbstractAsynchronousEvaluator. The idea is that one can submit fitness calculation jobs to such an evaluator via an async_update_fitness() call, which immediately returns the job id, so the optimization algorithm can continue while the fitness calculation runs in the background. When required, the results can be collected/waited for via sync_update_fitness!(). Support for asynchronous evaluation was added to BorgMOEA: it can generate recombined individuals and then immediately proceed to the next step while their fitnesses are calculated (the synchronization happens before the next recombination).
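
Roughly, the intended usage pattern looks like this (a minimal sketch based on the description above; the exact signatures in the PR may differ):

```julia
# Sketch of the asynchronous evaluation flow described above;
# `evaluator` is an AbstractAsynchronousEvaluator, `candidate` a Candidate.
job_id = async_update_fitness(evaluator, candidate)  # returns immediately

# ... the algorithm continues, e.g. generating the next recombined individual ...

# block until the background fitness calculation for this job is finished
sync_update_fitness!(evaluator, job_id)
```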

The PR also generalizes update_fitness!() to support any iterator over Candidate objects. This allows generic support for fitness calculation over the whole population (via PopulationCandidatesIterator) or over Borg mutants (BorgMutantsIterator). The benefit over AbstractVector{<:Candidate} is that the candidates are only created when they are required for fitness calculation, thereby reducing the memory footprint. I haven't checked, but single-objective algorithms, such as the NES family, which require massive fitness recalculations, might benefit from using update_fitness!() as well.
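
The iterator idea in a minimal sketch (the popsize/make_candidate helpers are hypothetical stand-ins for the actual population API):

```julia
# A lazy iterator over a population's candidates: each Candidate object is
# constructed only when the evaluator actually pulls it for evaluation.
struct PopulationCandidatesIterator{P}
    population::P
end

Base.length(it::PopulationCandidatesIterator) = popsize(it.population)

function Base.iterate(it::PopulationCandidatesIterator, i::Int = 1)
    i > length(it) && return nothing
    return make_candidate(it.population, i), i + 1  # created on demand
end

# update_fitness!() can then consume any such iterator:
# update_fitness!(evaluator, PopulationCandidatesIterator(popn))
```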

The main component of the PR is MultithreadEvaluator, which implements the AbstractAsynchronousEvaluator interface and (on top of it) a parallelized version of update_fitness!(). It's loosely based on #46, but since it uses threads within the same process and over the same data, the communication is much simpler (see the sketch after the list):

  • The evaluator spawns N worker tasks in separate threads. (Julia's public API doesn't yet allow controlling which threads tasks are assigned to, so I had to use non-public API; see the MTEvaluatorWorker() ctor. Maybe it's worth submitting an issue to the Julia project.)
  • The master task (where the optimization algorithm runs) notifies the workers of new jobs via a jobid_chan::Channel{Int} that is uniquely bound to each worker task.
  • The workers notify the master about the completion of a fitness calculation via the shared done_workers::Channel{Int} channel.
  • To efficiently keep track of completed and pending fitness jobs, there's a SlidingBitSet type.
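
A rough sketch of this wiring (illustrative, not the exact PR code; the thread pinning uses the non-public jl_set_task_tid call mentioned above, so treat it as unstable API):

```julia
# Each worker is a Task pinned to its own thread that blocks on its private
# jobid_chan and reports completions on the shared done_workers channel.
function spawn_worker(threadid::Int, jobid_chan::Channel{Int},
                      done_workers::Channel{Int}, run_job!::Function)
    task = Task() do
        for job_id in jobid_chan          # blocks until the master sends a job
            run_job!(job_id)              # calculate the fitness for this job
            put!(done_workers, threadid)  # notify the master of completion
        end
    end
    task.sticky = true                    # keep the task on a single thread
    # non-public API: pin the task to a specific thread (0-based id)
    ccall(:jl_set_task_tid, Cvoid, (Any, Cint), task, threadid - 1)
    schedule(task)
    return task
end
```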

So far I'm testing BorgMOEA with 36 threads and it runs fine. I haven't done the benchmarks, but hopefully the overhead of communication between the threads is minimal (in comparison to #43). However, with many threads I see that the Pareto frontier update becomes the main bottleneck. Although it uses an R*-tree for efficient indexing, maintaining a large frontier (~8000 points) is quite expensive, so it looks like using 36 threads is not much more efficient than, say, 18. At the moment I don't know how to address this efficiently. Hopefully, single-objective algorithms that can benefit from parallel fitness calculation (NES) will not have this issue.

make_evaluator() (and so bboptimize() as well) is taught to create a MultithreadEvaluator when called with the NThreads=n keyword argument.
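
For example (assuming Julia was started with several threads, e.g. JULIA_NUM_THREADS=4):

```julia
using BlackBoxOptim

# any thread-safe fitness function (see the note on thread-compatibility below)
sphere(x) = sum(abs2, x)

# NThreads makes make_evaluator() construct a MultithreadEvaluator;
# one thread is typically left for the master/optimizer task
res = bboptimize(sphere; SearchRange = (-5.0, 5.0), NumDimensions = 8,
                 NThreads = Threads.nthreads() - 1)
```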

Any optimization algorithm that uses update_fitness!(..., candidatesIterator, ...) should be effectively supported by MultithreadEvaluator.

Of course, problem-specific fitness() methods have to be made multithread-compatible. If any (temporary) objects are modified during fitness calculation, one needs to make sure that each worker thread operates on its own objects. There's an example of using array pools and Threads.SpinLock in OptEnrichedSetCover.jl, but probably some clean and simple example should be added to BBO.
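
As a minimal illustration (not the OptEnrichedSetCover.jl pool implementation), one can give each thread its own scratch buffer:

```julia
# Per-thread scratch buffers: each worker thread mutates only its own slot,
# so the temporaries need no locking (sketch; assumes tasks stay pinned to
# their threads, as MultithreadEvaluator's workers do).
const SCRATCH = [Float64[] for _ in 1:Threads.nthreads()]

function my_fitness(x::AbstractVector{Float64})
    buf = SCRATCH[Threads.threadid()]  # this thread's private buffer
    resize!(buf, length(x))
    buf .= abs2.(x)                    # temporary work happens in the buffer
    return sum(buf)
end
```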

@robertfeldt (Owner)

Wow @alyst, this looks great. I will put it through its paces and familiarise myself with the code, and then hopefully merge in a few days. This will be super useful to many users. Big thanks and a Happy New Year to you. :)

@robertfeldt (Owner)

I've taken only a brief look so far and my testing has been positive. I'm currently only on a 2-core laptop (over the holidays), so it's not meaningful to benchmark... ;)

One concern is that this looks quite heavy/complex, so it might be harder to maintain going forward. Can you clarify what the main benefits are compared to something simple like an evaluator that simply Threads.@threads loops over the fitness calculations of a set of candidates?

When I'm on a 4-core machine I'll try to add some simple examples/benchmarks to see what kind of benefits can be had.

@alyst (Contributor, Author) commented Jan 2, 2020

Can you clarify what the main benefits are compared to something simple like an evaluator that simply Threads.@threads loops over the fitness calculations of a set of candidates?

It's "warm" vs "cold" start. There's quite some work done behind the curtain of @threads for: the tasks are created, assigned to threads, scheduled and waited for. It's OK if this is done once per invocation of the algorithm, but in our case it's done millions of times (at each optimization step). It's still worth benchmarking, but I suspect that for simple fitness functions @threads for might be even slower than no multithreading at all.
With MultithreadEvaluator the creation of the tasks is done once per optimization; [async_]update_fitness!() just notifies the existing tasks. I hope the Channel type that I use for inter-thread communication implements the most efficient synchronization (which, I guess, should be SpinLock-based, without any involvement of the Julia task scheduler).
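
For contrast, the per-call pattern under discussion would look roughly like this (the fitness/params field names are illustrative, not the actual Candidate fields):

```julia
# "Cold start" on every call: Threads.@threads creates, schedules and joins
# a fresh batch of tasks on each invocation, i.e. at every optimization step.
function update_fitness_naive!(fitfn, candidates::AbstractVector)
    Threads.@threads for i in eachindex(candidates)
        candidates[i].fitness = fitfn(candidates[i].params)
    end
    return candidates
end
```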

One concern is that this looks quite heavy/complex so might be harder to maintain going forward.

It's already much simpler than #46, since we don't have to send the candidates or the fitnesses across different processes. :)
There's the MTEvaluatorWorker() constructor that uses non-public API, but besides that I think it's quite straightforward.
Maybe there are packages that implement efficient high-level threading primitives, but so far I haven't come across any.

When I'm on a 4-core machine I'll try to add some simple examples/benchmarks to see what kind of benefits can be had.

Thank you, that would be very helpful!

Commit: num_eval or job_id updates are exclusively done on the master thread
@robertfeldt (Owner)

Thanks, that makes sense.

I compared single- and multi-threaded performance of the default and dxnes optimizers on a 4-core (and a 2-core) laptop, but I don't see increased performance, or even more function calls performed. I'll also try BorgMOEA, since maybe there we can expect better performance, but if you can help me understand why I'm not seeing more func evals on at least dxnes (I tried setting a higher lambda to ensure more samples per "round"), that would help evaluate this.

Here is the start of an example script for multi-threaded optimization:
https://gist.github.com/robertfeldt/753922ccc4a3ad46e59e257fb86763e4

@robertfeldt (Owner)

OK, forget my previous comment. I realized the function I optimized in the examples was way too fast to benefit from the thread switching. I want to do some more testing on a larger/faster 8-core/16-thread machine, though, but it does look solid so far.

@alyst (Contributor, Author) commented Jan 4, 2020

@robertfeldt Thanks for your script! I'm currently using it to profile the MultithreadEvaluator. It's true that these problems are too fast (also, dxnes uses multithreaded BLAS in my case, which may affect the results a bit), but it still highlighted some problems with my approach. I was too naive in assuming that all the threads were fully utilized when I saw 100% CPU usage. I've made some progress, and I hope to improve things more soon. So it's definitely better to wait for the revised version.

@alyst (Contributor, Author) commented Jan 6, 2020

I've pushed the revised version. Unfortunately, it doesn't solve the issue of the optimization of simple functions actually being slower than single-threaded.
But it improves the design of how the inter-thread communication is organized.

The previous design was a legacy of the multi-process parallelization implemented in #46: the master thread was both dispatching jobs to the worker threads and storing the results in the archive, while the worker threads were listening for job requests coming from the master.
The new design benefits from the workers actually being threads of the same process. Now the master thread only puts jobs into the queue, while the worker threads actively take jobs from that queue. That solves the problem that no job dispatch was possible during the archive update (which can be expensive in multi-objective optimization). Now the archive update can happen in parallel with the fitness calculation, and the threads should get new jobs much faster.

I've also replaced the Channel with SpinLock-based synchronization to minimize worker thread idle times and Julia task switching (which, unfortunately, makes the code more complicated).
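
A condensed sketch of the revised scheme (not the exact PR code; run_job! and the Int job ids stand in for the actual job representation):

```julia
# The master pushes job ids onto a shared queue; each worker polls it under
# a SpinLock and busy-waits instead of yielding to the Julia task scheduler.
const queue_lock = Threads.SpinLock()
const job_queue  = Int[]

function worker_loop!(run_job!::Function, shutdown::Ref{Bool})
    while !shutdown[]
        job = 0
        lock(queue_lock)
        try
            isempty(job_queue) || (job = popfirst!(job_queue))
        finally
            unlock(queue_lock)
        end
        # jl_cpu_pause is an internal CPU spin hint; run the job if we got one
        job == 0 ? ccall(:jl_cpu_pause, Cvoid, ()) : run_job!(job)
    end
end
```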

As I've said before, a single thread is still faster for simpler problems. But when the fitness calculation gets more expensive (one can easily model that by adding sleep(0.001) to the optimized function), the effects of multithreading become more visible. The effects improve further if lambda (the population size in dxnes) is made larger. For a 19-worker setup with sleep(0.001) in rastrigin() and lambda=1000, I get 5000 num_evals for res_single and 54000 num_evals for res_multi. But for algorithms like DiffEvo or Borg, having more than 4 threads should not make a big improvement, as they are limited by the number of individuals that need to be recalculated at each step of the algorithm.
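
The slowdown modeling mentioned here, for reference (a sketch; the actual benchmark script may differ):

```julia
# Rastrigin with an artificial 1 ms cost per evaluation, to emulate an
# expensive fitness function when benchmarking the multithreaded evaluator.
function slow_rastrigin(x)
    sleep(0.001)  # simulate an expensive fitness computation
    return 10length(x) + sum(xi -> xi^2 - 10cos(2π * xi), x)
end
```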

I would like to improve the situation for simpler problems too, but I don't know how to proceed further.
For example, I see that sometimes a worker tries to get a job, but by the time it acquires the lock on the jobs queue, the job is already gone. I suspect that's because the OS scheduler randomly gives less priority to that specific thread (I now report the number of function evaluations per worker, and one can often see an imbalance). I've tried to profile the code, but it looks like the @profile bboptimize() results are not 100% accurate in the multithreaded case: I see that most of the time is spent waiting for the next job and no time at all computing the fitness, which doesn't look right.

@robertfeldt (Owner) commented Jan 21, 2020

Thanks for the updates, this looks very nice. I'm ready to merge, but I did notice this in the weekly Julia news summary:

https://discourse.julialang.org/t/ann-threadpools-jl-improved-thread-management-for-background-and-nonuniform-tasks/33592

which looks useful and might simplify things for us. I guess having our own queue and scheduling for it might still give us benefits, but the simplicity of using something like ThreadPools is appealing...

@alyst (Contributor, Author) commented Jan 21, 2020

I've also come across ThreadPools.jl. It's definitely very useful, but IIUC every fitness evaluation would require creating and dispatching a Task object, which implies some overhead.
It would be nice if at some point there were a package providing a generic framework for the kind of multithreading we need in BBO: tasks constrained to evaluating f(x::T), where f and T are fixed, but using synchronization mechanisms with minimal extra cost.
(Well, actually, we could try extracting this code from MultithreadEvaluator and making such a package ourselves. Maybe there would be some interest from other people.)
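
The per-evaluation overhead in question, schematically (fitfn and x are arbitrary placeholders; Threads.@spawn stands in for whatever task dispatch a pool package would do under the hood):

```julia
# A task-per-evaluation design pays for Task creation, scheduling and fetch()
# on every fitness call; a long-lived worker pays only for the (much cheaper)
# job-queue synchronization.
t = Threads.@spawn fitfn(x)  # new Task allocated and scheduled per call
y = fetch(t)                 # synchronize on completion
```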

@robertfeldt merged commit aec4b96 into robertfeldt:master on Feb 2, 2020