https://www.overleaf.com/read/hwwmfvgdqnny#01433e
- Install Poetry following the documentation: https://python-poetry.org/docs/#installing-with-the-official-installer
- Initialize a virtual environment by running the command `poetry init`.
- Install the latest version of `chimera` by running the command `poetry add chimera-distributed-ml`.
- Start the Docker Daemon. You can do this either by opening Docker Desktop or by starting the Daemon via the CLI (on Linux: `sudo systemctl start docker`). The Docker Daemon makes the Docker REST APIs available, so commands such as `docker build` and `docker run`, which are called internally by `chimera`, can be executed.
- Create and run distributed models with `chimera`!
- Install Poetry following the documentation: https://python-poetry.org/docs/#installing-with-the-official-installer
- Clone the `chimera` project via either HTTPS or SSH:
    - HTTPS: `git clone https://github.com/Samirnunes/chimera.git`
    - SSH: `git clone git@github.com:Samirnunes/chimera.git`
- Go to the project's root directory (where `pyproject.toml` is located) and run `poetry install`. This will generate a `.venv` directory in the root with the installed dependencies, along with a `poetry.lock` file.
- Start the Docker Daemon. You can do this either by opening Docker Desktop or by starting the Daemon via the CLI (on Linux: `sudo systemctl start docker`). The Docker Daemon makes the Docker REST APIs available, so commands such as `docker build` and `docker run`, which are called internally by `chimera`, can be executed.
- Create and run distributed models with `chimera`!
The chimera framework is a Python package for distributed machine learning (DML), designed for educational and prototyping purposes. It provides a structured environment for experimenting with key DML techniques, including Data Parallelism, Model Parallelism, and Hybrid Parallelism.

As a distributed computing framework, chimera aims to simplify the creation of distributed machine learning models in a local environment by streamlining the construction of a Master node on the host machine and Worker nodes on separate virtual machines using Docker containers. By providing a standardized API-based communication framework, chimera enables researchers and practitioners to test, evaluate, and optimize distributed learning algorithms with minimal configuration effort. The framework supports Data, Model, and Hybrid Parallelism, whose algorithms are described below:
- Data Parallelism: Distributed SGD for models such as linear regression, logistic regression, and others, depending on the loss function.
- Model Parallelism: Distributed Bagging using generic weak learners from the scikit-learn package, with the same dataset on each Worker node.
- Hybrid Parallelism: Distributed Bagging using generic weak learners from the scikit-learn package, with different datasets on each Worker node.
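To build intuition for the Data Parallelism case, here is a minimal, self-contained sketch of distributed SGD via gradient averaging (plain Python, not chimera's API): each "worker" holds a data shard and computes a local gradient, and the "master" averages them and updates the shared parameters.

```python
# Illustrative sketch (not chimera's API): distributed SGD via gradient
# averaging for linear regression, with each "worker" holding a data shard.
def gradient(w, b, shard):
    # Mean-squared-error gradient for y ≈ w*x + b on one worker's shard.
    gw = gb = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    n = len(shard)
    return gw / n, gb / n

# Two workers, each holding a local shard of points from y = 3x + 1.
shards = [[(0.0, 1.0), (1.0, 4.0)], [(2.0, 7.0), (3.0, 10.0)]]
w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    grads = [gradient(w, b, s) for s in shards]    # computed on workers
    gw = sum(g[0] for g in grads) / len(grads)     # master averages gradients
    gb = sum(g[1] for g in grads) / len(grads)
    w, b = w - lr * gw, b - lr * gb                # master updates and broadcasts

print(round(w, 2), round(b, 2))  # 3.0 1.0
```

The averaged gradient equals the full-batch gradient when shards are equal-sized, which is why the distributed run converges to the same solution as a single-machine one.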
Docker containers act as Workers. To run the created distributed system, chimera provides a standardized function named `run`, to which a Master type and a port for the server on the host machine must be passed. The `run` function starts the chimera master server and manages the worker containers, then initializes the components necessary for the distributed system to work.

The client-master and master-workers communications happen via REST APIs.
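The master-worker exchange can be pictured with a stdlib-only sketch. The `/fit` endpoint name and JSON payload below are assumptions for illustration, not chimera's documented API; the point is only the REST pattern of a master POSTing training work to a worker's HTTP server.

```python
# Sketch of REST-style master->worker communication. The /fit endpoint and
# payload shape are hypothetical, standing in for a worker container's API.
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

class WorkerHandler(BaseHTTPRequestHandler):
    """Stands in for a worker container exposing a REST API."""

    def do_POST(self):
        if self.path == "/fit":
            length = int(self.headers["Content-Length"])
            payload = json.loads(self.rfile.read(length))
            body = json.dumps({"status": "trained", "n_rows": len(payload["X"])}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), WorkerHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "master" sends training data to the "worker" over HTTP.
req = Request(
    f"http://127.0.0.1:{server.server_port}/fit",
    data=json.dumps({"X": [[1.0], [2.0]], "y": [0, 1]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result)  # e.g. {'status': 'trained', 'n_rows': 2}
```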
Figure: Example of Chimera files.
- After installing `chimera`, you need to create a `Master` and its `Workers`:
    - Master: create a `.py` file in your root directory. This file must specify the environment variables necessary to run the code in string format (in the case of lists, you must follow the JSON string format for lists) and run a `chimera` master server with `chimera.run`. For example: `chimera.run(AggregationMaster(), 8080)`. The available configuration environment variables are in the classes `NetworkConfig` and `WorkersConfig`, inside `src/chimera/containers/config.py`.

Figure: Example of a master's file.
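A minimal sketch of such a master file follows. The environment-variable names come from the configuration section below; the import path for `AggregationMaster` is an assumption (check your installed package), so the chimera-specific lines are shown commented.

```python
# Sketch of a master file. List-valued variables must be JSON-encoded
# strings; the AggregationMaster import path is an assumption.
import json
import os

os.environ["CHIMERA_NETWORK_NAME"] = "chimera-network"
os.environ["CHIMERA_WORKERS_NODES_NAMES"] = json.dumps(["worker1", "worker2"])
os.environ["CHIMERA_WORKERS_MAPPED_PORTS"] = json.dumps([5001, 5002])
os.environ["CHIMERA_WORKERS_CPU_SHARES"] = json.dumps([2, 2])

# With the configuration in place, start the master server on port 8080:
# import chimera
# chimera.run(AggregationMaster(), 8080)

# The JSON strings round-trip back to Python lists:
assert json.loads(os.environ["CHIMERA_WORKERS_NODES_NAMES"]) == ["worker1", "worker2"]
```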
    - Workers: create a folder called `chimera_workers` and create `.py` files which are going to represent your workers. Each file must initialize a `chimera` worker and call `worker.serve()` inside an `if __name__ == "__main__":` block, which will initialize the worker server when `chimera.run` is called in the master's file. Note that the environment variable `CHIMERA_WORKERS_NODES_NAMES` in the master's file must contain all the workers' file names, without the `.py` suffix.

Figure: Example of a worker's file.
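Since `CHIMERA_WORKERS_NODES_NAMES` must mirror the file names in `chimera_workers`, one way to keep them in sync is to derive the names from the folder itself. A sketch (using a temporary directory here; in practice this would be your project root):

```python
# Sketch: derive CHIMERA_WORKERS_NODES_NAMES from the chimera_workers
# folder so the env var and the worker file names stay in sync.
import json
import os
import tempfile

root = tempfile.mkdtemp()
workers_dir = os.path.join(root, "chimera_workers")
os.makedirs(workers_dir)
for name in ("worker1", "worker2"):
    with open(os.path.join(workers_dir, name + ".py"), "w") as f:
        f.write("# worker file: init a chimera worker, call worker.serve()\n")

# File names without the .py suffix become the node names.
names = sorted(f[:-3] for f in os.listdir(workers_dir) if f.endswith(".py"))
os.environ["CHIMERA_WORKERS_NODES_NAMES"] = json.dumps(names)
print(names)  # ['worker1', 'worker2']
```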
- Before running the master's file, you must specify the local training dataset for each worker. This is done by creating a folder called `chimera_train_data` containing folders with the same names as the workers' files (without the `.py` suffix). Each folder must have an `X_train.csv` file containing the features and a `y_train.csv` file containing the labels. Whether `X_train.csv` and `y_train.csv` are the same for all the workers is up to you. Keep in mind which algorithm you want to create in the distributed environment!
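The expected layout can be generated programmatically. A sketch that writes the `chimera_train_data/<worker>/{X_train.csv, y_train.csv}` structure for two workers (into a temporary directory here; use your project root in practice, and your own feature/label columns):

```python
# Sketch: lay out chimera_train_data/<worker>/{X_train.csv, y_train.csv}.
import csv
import os
import tempfile

root = tempfile.mkdtemp()
for worker in ("worker1", "worker2"):
    d = os.path.join(root, "chimera_train_data", worker)  # one folder per worker file
    os.makedirs(d)
    with open(os.path.join(d, "X_train.csv"), "w", newline="") as f:
        csv.writer(f).writerows([["x1", "x2"], [1.0, 2.0], [3.0, 4.0]])  # features
    with open(os.path.join(d, "y_train.csv"), "w", newline="") as f:
        csv.writer(f).writerows([["y"], [0], [1]])  # labels

created = sorted(os.listdir(os.path.join(root, "chimera_train_data")))
print(created)  # ['worker1', 'worker2']
```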
- Finally, you can run the master's file using `poetry run python {your_master_filename.py}`. This should initialize all the workers' containers in your Docker environment and the master server on the host machine (the machine running the code).
Figure: General Architecture for a Chimera Distributed System. It summarizes how to create a distributed model with Chimera.
The following environment variables allow users to configure the chimera distributed machine learning system. These variables define network settings, worker configurations, and resource allocations, ensuring flexibility across different environments.
The following variables define the Docker network settings for chimera:
- `CHIMERA_NETWORK_NAME` (default: `"chimera-network"`): the name of the Docker network where `chimera` runs.
- `CHIMERA_NETWORK_PREFIX` (default: `"192.168.10"`): the IP network prefix for the Docker network. Must be a valid IPv4 network prefix (e.g., `"192.168.10"`).
- `CHIMERA_NETWORK_SUBNET_MASK` (default: `24`): the subnet mask for the Docker network, defining how many bits are reserved for the network. Must be an integer between `0` and `32`.
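The prefix and mask combine into an ordinary CIDR network, which you can sanity-check with the stdlib `ipaddress` module before launching (values below are the defaults listed above):

```python
# Sketch: validate the network prefix and subnet mask with ipaddress.
import ipaddress

prefix = "192.168.10"         # CHIMERA_NETWORK_PREFIX
mask = 24                     # CHIMERA_NETWORK_SUBNET_MASK
assert 0 <= mask <= 32        # the documented constraint on the mask

network = ipaddress.ip_network(f"{prefix}.0/{mask}")
print(network)                # 192.168.10.0/24
print(network.num_addresses)  # 256
```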
The following variables control the behavior of worker nodes in chimera:
- `CHIMERA_WORKERS_NODES_NAMES`: a list of worker node names.
    - Must be unique across all workers.
    - Example: `["worker1", "worker2", "worker3"]`.
- `CHIMERA_WORKERS_CPU_SHARES` (default: `[2]`): a list of CPU shares assigned to each worker.
    - Each value must be an integer ≥ `2`.
    - Example: `[2, 4, 4]` assigns different CPU shares to three workers.
- `CHIMERA_WORKERS_MAPPED_PORTS` (default: `[101]`): a list of host ports mapped to each worker's container.
    - Must be unique across all workers.
    - Example: `[5001, 5002, 5003]` assigns distinct ports to three workers.
- `CHIMERA_WORKERS_HOST` (default: `"0.0.0.0"`): the host IP address that binds worker ports.
    - `"0.0.0.0"` allows connections from any IP address.
- `CHIMERA_WORKERS_PORT` (default: `80`): the internal container port that workers listen on.
    - This is the port inside the worker's container, not the exposed host port.
- `CHIMERA_WORKERS_ENDPOINTS_MAX_RETRIES` (default: `0`): the maximum number of retry attempts when communicating with worker nodes.
- `CHIMERA_WORKERS_ENDPOINTS_TIMEOUT` (default: `100.0`): the timeout (in seconds) for worker API endpoints.
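The constraints above (unique names, unique ports, CPU shares ≥ 2, one entry per worker) can be checked in plain Python before launching. A sketch using the example values from this section:

```python
# Sketch: check the documented worker-list constraints before launching.
names = ["worker1", "worker2", "worker3"]   # CHIMERA_WORKERS_NODES_NAMES
cpu_shares = [2, 4, 4]                      # CHIMERA_WORKERS_CPU_SHARES
ports = [5001, 5002, 5003]                  # CHIMERA_WORKERS_MAPPED_PORTS

assert len(names) == len(set(names)), "worker names must be unique"
assert len(ports) == len(set(ports)), "mapped ports must be unique"
assert all(s >= 2 for s in cpu_shares), "each CPU share must be an integer >= 2"
assert len(names) == len(cpu_shares) == len(ports), "one entry per worker"
print("worker configuration is consistent")
```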
These environment variables give users full control over how chimera distributes models, manages worker nodes, and configures networking in a flexible and simple manner.
The framework uses two dedicated loggers to track the system's behavior and latency metrics:

- Status Logger (`chimera_status`): logs general status messages related to the system's operations, such as workflow progress, key events, and high-level actions. The logs are saved in the file `chimera_status.log`.
- Time Logger (`chimera_time`): logs latency metrics, making it useful for monitoring and debugging time efficiency. These logs are stored in the file `chimera_time.log`.

Both loggers are configured using Python's built-in `logging` module and log messages at the INFO level. Each logger writes to its respective log file through a `FileHandler`.
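A logger configured in that style looks roughly as follows (the format string is an assumption; chimera's actual formatter may differ):

```python
# Sketch: an INFO-level logger writing to chimera_status.log via FileHandler,
# mirroring the setup described above (format string is an assumption).
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "chimera_status.log")
logger = logging.getLogger("chimera_status")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("training started on 2 workers")
handler.flush()
with open(log_path) as f:
    line = f.read()
print("training started" in line)  # True
```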
For more examples, see: https://github.com/Samirnunes/chimera-examples




