Improve multiprocessing #73

@ddale

An alternative implementation for multiprocessing of paintGrid exists in hexrd.fitgrains. The approach works as follows:

  1. Create a Worker class that implements the multiprocessing.Process interface, but is not a subclass of that class. This will be used when multiprocessing is disabled, for example during profiling. The worker exits when the queue is empty.

  2. Create a WorkerMP class that subclasses both Worker and multiprocessing.Process. Essentially all this class needs to do is call 'Process.__init__' to enable multiprocessing.

  3. Create a multiprocessing.JoinableQueue and populate it with the job-specific information, which tends to be very small (for example, a single quaternion).

  4. Pack all of the contextual data into a dictionary, to be passed to the individual workers during instantiation.

  5. Create a managed list with multiprocessing.Manager().list() to hold the results.

  6. Start the multiprocessing workers sequentially:

    for i in range(n_cpus):
        w = WorkerMP(queue, results, params)  # the Process subclass from step 2
        w.start()

    Each worker begins processing immediately; it is possible that processing completes before all workers have been spun up.

  7. Wait until the results list is complete, updating progress bars based on its length.
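
The Worker/WorkerMP pair from steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the code in hexrd.fitgrains: the process method, the "scale" parameter, and the run_serial helper are hypothetical names introduced for the example. One detail worth noting is that once Worker defines start() and join() itself, WorkerMP has to take those two methods from Process explicitly, because Worker comes first in the method resolution order.

```python
import multiprocessing
import queue


class Worker:
    """Serial worker exposing the start/run/join part of the Process interface."""

    def __init__(self, jobs, results, params):
        self.jobs = jobs        # queue of small job descriptions
        self.results = results  # list-like container for the output
        self.params = params    # dict of contextual data

    def process(self, job):
        # Hypothetical stand-in for the real per-job computation.
        return job * self.params["scale"]

    def run(self):
        while True:
            try:
                # A short timeout instead of get_nowait(): items put on a
                # multiprocessing queue can take a moment to become visible.
                job = self.jobs.get(timeout=0.1)
            except queue.Empty:
                return  # queue drained, so the worker exits
            self.results.append(self.process(job))
            self.jobs.task_done()

    def start(self):
        self.run()  # serial mode: do the work in the calling process

    def join(self, timeout=None):
        pass  # nothing to wait for in serial mode


class WorkerMP(Worker, multiprocessing.Process):
    """The same worker, run in a child process."""

    # Worker precedes Process in the MRO so that run() is Worker.run;
    # start() and join() must therefore be taken from Process explicitly.
    start = multiprocessing.Process.start
    join = multiprocessing.Process.join

    def __init__(self, *args, **kwargs):
        multiprocessing.Process.__init__(self)
        Worker.__init__(self, *args, **kwargs)


def run_serial(data, params):
    """Drain a plain queue with a single Worker, e.g. while profiling."""
    jobs = queue.Queue()
    for item in data:
        jobs.put(item)
    results = []
    w = Worker(jobs, results, params)
    w.start()
    w.join()
    return results
```

Calling run_serial([1, 2, 3], {"scale": 10}) exercises the worker loop without spawning any processes, which is the profiling use case mentioned in step 1.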

Improvements to be made:

  • Refactor this multiprocessing approach into a separate module containing abstract base classes to avoid code duplication.
  • Implement a custom map function that takes a Worker class (not an instance), the contextual information, the number of cpus, a list of data to iterate over, and a progress callback. It creates a queue to pass to the workers, creates a managed list to hold the results, spins up the workers sequentially so that each begins processing at once, and then loops to report progress until processing is complete. The function returns the list of results.
  • Consider breaking this into a custom Pool class with a map method. This implementation is cleaner: Pool would be instantiated with the Worker class, the contextual data dict, and the number of cpus, and map would be called with the list of data over which to iterate and the callback. The problem is that for smaller datasets, much of the processing time appears to be consumed by spinning up the Workers themselves, so we want each worker to begin processing immediately rather than wait until the entire pool is ready. We should time the initialization step, though; perhaps it is not such a big issue.
  • Convert paintGrid multiprocessing to use this approach.
  • Refactor fitgrains multiprocessing to use this new approach.
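
One possible shape for the custom map function is sketched below. It is a rough starting point under stated assumptions rather than a proposed final API: the name worker_map, the progress(done, total) callback signature, the poll interval, and the "offset" parameter are all hypothetical. Results are appended in completion order, so the returned list is not ordered like the input.

```python
import multiprocessing
import queue
import time


class Worker:
    """Minimal serial worker; process() is a hypothetical placeholder."""

    def __init__(self, jobs, results, params):
        self.jobs, self.results, self.params = jobs, results, params

    def process(self, job):
        return job + self.params["offset"]

    def run(self):
        while True:
            try:
                job = self.jobs.get(timeout=0.1)
            except queue.Empty:
                return  # queue drained, so the worker exits
            self.results.append(self.process(job))
            self.jobs.task_done()

    def start(self):
        self.run()

    def join(self, timeout=None):
        pass


class WorkerMP(Worker, multiprocessing.Process):
    # Take start/join from Process; run() still resolves to Worker.run.
    start = multiprocessing.Process.start
    join = multiprocessing.Process.join

    def __init__(self, *args, **kwargs):
        multiprocessing.Process.__init__(self)
        Worker.__init__(self, *args, **kwargs)


def worker_map(worker_cls, params, n_cpus, data, progress=None, poll=0.05):
    """Apply worker_cls to every item of data; results come back unordered."""
    data = list(data)
    report = progress or (lambda done, total: None)

    if not issubclass(worker_cls, multiprocessing.Process):
        # Serial path: one worker drains a plain queue in this process.
        jobs = queue.Queue()
        for item in data:
            jobs.put(item)
        results = []
        worker_cls(jobs, results, params).start()
        report(len(results), len(data))
        return results

    with multiprocessing.Manager() as manager:
        jobs = multiprocessing.JoinableQueue()
        for item in data:
            jobs.put(item)
        results = manager.list()
        workers = []
        for _ in range(n_cpus):
            w = worker_cls(jobs, results, params)
            w.daemon = True
            w.start()  # each worker starts pulling jobs right away
            workers.append(w)
        # Report progress until every job has produced a result, or until
        # all workers have exited (which guards against waiting forever).
        while len(results) < len(data) and any(w.is_alive() for w in workers):
            report(len(results), len(data))
            time.sleep(poll)
        for w in workers:
            w.join()
        report(len(results), len(data))
        return list(results)  # copy out before the manager shuts down
```

For a parallel run, worker_map(WorkerMP, params, n_cpus, data) would be called from under an `if __name__ == "__main__":` guard so that it also works with the spawn start method. Wrapping the worker start-up loop in time.perf_counter() calls would be one way to measure the initialization cost raised in the Pool item above.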
