Dev/distributed #2
base: main
Conversation
Pull Request Overview
This PR introduces distributed multi-GPU training capabilities to the RL pipeline, along with supporting infrastructure and updated training results from recent runs.
- Adds distributed training support via PyTorch DDP for multi-GPU parallelization (see the sketch after this list)
- Includes comprehensive documentation explaining the distributed training architecture
- Updates configuration and results from recent training runs
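For orientation, here is a minimal sketch of the DDP wiring such a pipeline typically uses. The function name and launch assumptions (torchrun exporting RANK, WORLD_SIZE, and LOCAL_RANK) are illustrative and not taken from distributed_pipeline.py.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    """Initialize the default process group and wrap the model in DDP.

    Assumes launch via torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK.
    """
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        model = model.to(local_rank)
        return DDP(model, device_ids=[local_rank])
    return DDP(model)
```

Launched with `torchrun --nproc_per_node=<num_gpus>` on the training entry point, each process then owns one GPU and gradients are averaged across ranks automatically.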
Reviewed Changes
Copilot reviewed 17 out of 62 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| RL/distributed_pipeline.py | New file implementing distributed training pipeline with DDP support |
| RL/readme_distributed.md | Documentation explaining distributed training concepts and usage |
| RL/config.json | Configuration update reducing max_generations from 20 to 1 |
| RL/pyproject.toml | New project configuration file with basic metadata |
| RL/main.py | Simple entry point with placeholder implementation |
| RL/.gitignore | Standard Python gitignore file |
| RL/.python-version | Python version specification (3.12) |
| RL/stl_outputs/generation_000_best_design_params.json | Updated design parameters from recent training run |
| RL/neural_networks/training_metrics_gen_0.json | Updated training metrics and timestamp |
| RL/checkpoints/*.json | Updated checkpoint files with new training data |
| RL/f1_wing_output/*.json | Updated CFD analysis parameters for wing designs |
| RL/cfd_results/*.json | New CFD simulation results for generation 0 individuals |
```
self.logger.info("training...")
```

only rank 0 logs. this avvoids spam from all processes.
Copilot AI (Oct 21, 2025)
Corrected spelling of 'avvoids' to 'avoids'.
- only rank 0 logs. this avvoids spam from all processes.
+ only rank 0 logs. this avoids spam from all processes.
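A common way to get the rank-0-only behavior described in the readme is to silence the logger on every non-zero rank at setup time. The helper names below (setup_logger, get_rank) are hypothetical; only the "AlphaDesign" logger name comes from the PR itself.

```python
import logging

import torch.distributed as dist


def get_rank() -> int:
    """Global rank, or 0 when torch.distributed is not initialized."""
    return dist.get_rank() if dist.is_available() and dist.is_initialized() else 0


def setup_logger(name: str = "AlphaDesign") -> logging.Logger:
    """Return a logger that only emits records on rank 0."""
    logger = logging.getLogger(name)
    if get_rank() == 0:
        logging.basicConfig(level=logging.INFO)
    else:
        # Non-zero ranks keep a logger object but never print anything.
        logger.addHandler(logging.NullHandler())
        logger.setLevel(logging.CRITICAL + 1)
    return logger
```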
)
self.logger = logging.getLogger("AlphaDesign")

def setup_directories(self):
This doesn't need to be done; I think the directories get created automatically when we save files like checkpoints or best designs.
return result

def broadcast_population(self, population: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
Is this for distributed processes?
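For context, broadcasting a population is specifically a cross-process operation: rank 0 typically holds the authoritative population and sends it to every other rank. A sketch of the common torch.distributed pattern (the actual method in this PR may differ):

```python
from typing import Any, Dict, List

import torch.distributed as dist


def broadcast_population(population: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return rank 0's population on every rank; other ranks' input is ignored."""
    if not (dist.is_available() and dist.is_initialized()):
        return population  # single-process fallback
    container = [population if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(container, src=0)
    return container[0]
```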
I just want to test this once. Let me host it on my laptop and test it, with my PC able to connect. Can you add a function to just test the distributed connection, instead of directly integrating this into the pipeline?
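A standalone connection test along those lines could look like the sketch below. The function name test_distributed_connection and the TCP init method are illustrative, not the actual helper in this PR; it only initializes the process group and does a single all-reduce as a smoke test, without touching the training pipeline.

```python
import argparse

import torch
import torch.distributed as dist


def test_distributed_connection(master_addr: str, master_port: int,
                                rank: int, world_size: int) -> None:
    """Smoke-test connectivity between machines: init, one all-reduce, teardown."""
    dist.init_process_group(
        backend="gloo",  # gloo works on CPU-only machines (e.g. a laptop)
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )
    payload = torch.tensor([rank], dtype=torch.float32)
    dist.all_reduce(payload)  # sum over all ranks; proves every process is reachable
    print(f"rank {rank}/{world_size} connected, all_reduce sum = {payload.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--master-addr", required=True)
    parser.add_argument("--master-port", type=int, default=29500)
    parser.add_argument("--rank", type=int, required=True)
    parser.add_argument("--world-size", type=int, required=True)
    args = parser.parse_args()
    test_distributed_connection(args.master_addr, args.master_port,
                                args.rank, args.world_size)
```

Run it on the laptop with `--rank 0` (it acts as the master) and on the PC with `--rank 1 --master-addr <laptop IP>`; if both print the all_reduce sum, the connection works.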
Keep this minimal during cleanup: we won't be using marimo-like tools in any part of the pipeline, so trim this file down to just the necessary entries. I do like the partitioning, though, i.e. a comment heading saying what a section belongs to, followed by the entries to be ignored.
Name the project AlphaDesign-RL and add an appropriate description to match.
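Something along these lines in RL/pyproject.toml would cover both; the version number and description text here are placeholders, not the author's wording.

```toml
[project]
name = "alphadesign-rl"
version = "0.1.0"
description = "Reinforcement learning pipeline for AlphaDesign: distributed multi-GPU design generation, CFD evaluation, and reward-driven optimization."
requires-python = ">=3.12"
```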
The theme of this file is correct, but the requirement is slightly different. The idea is that different groups of training run on the different GPUs (or CPUs) we have, and each one has a queue of its own; when a group completes, it appends its result to its queue. A common CPU (or GPU) that runs the compliance test on the different components then takes from the head of each queue, checks the design, and assigns a reward accordingly. That reward is queued again, so each worker must have two queues, and the rewards in the queue are folded into the next run. If possible, please understand this and implement it accordingly.
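A minimal sketch of that two-queue layout, just to pin down the understanding: each worker owns a design queue and a reward queue, the central evaluator pops the head of every design queue, runs the compliance check, and pushes a reward back, which the worker folds into its next run. All names (worker, evaluator, the placeholder compliance score) are illustrative, and multiprocessing queues stand in for whatever transport the distributed backend ends up using.

```python
import multiprocessing as mp


def worker(worker_id, design_q, reward_q, n_rounds):
    """One training group per GPU/CPU: train, append the result to its own queue,
    then wait for the reward before starting the next run."""
    pending_reward = 0.0
    for round_idx in range(n_rounds):
        # Placeholder for one local training round; the previous reward is
        # folded into the next run, as described above.
        design = {"worker": worker_id, "round": round_idx, "prev_reward": pending_reward}
        design_q.put(design)
        pending_reward = reward_q.get()  # block until the evaluator scores this design


def evaluator(design_qs, reward_qs, n_rounds):
    """Central process: take the head of each worker's design queue, run the
    compliance test, and queue the reward back to that worker."""
    for _ in range(n_rounds):
        for design_q, reward_q in zip(design_qs, reward_qs):
            design = design_q.get()          # head of this worker's queue
            reward = float(design["round"])  # placeholder for the compliance test
            reward_q.put(reward)


if __name__ == "__main__":
    n_workers, n_rounds = 2, 3
    design_qs = [mp.Queue() for _ in range(n_workers)]
    reward_qs = [mp.Queue() for _ in range(n_workers)]
    procs = [
        mp.Process(target=worker, args=(i, design_qs[i], reward_qs[i], n_rounds))
        for i in range(n_workers)
    ]
    for p in procs:
        p.start()
    evaluator(design_qs, reward_qs, n_rounds)
    for p in procs:
        p.join()
```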
Just as in pyproject.toml, name this package alphadesign-rl.
We need to make this a bit faster: we have to complete this within the year, which leaves us roughly two months.
Mantissagithub left a comment
Please check the comments and make the changes accordingly.
…epts and methodologies
- Introduced a comprehensive overview of decentralized training, including key takeaways and methodologies such as DiLoCo, DiPaCo, and SWARM Parallelism.
- Included diagrams to illustrate distributed training architecture and issues.
- Discussed the importance of resource pooling and communication overhead reduction in training models across different machines.
- Outlined the need for a central queueing server and potential monitoring strategies for training processes.
Add detailed notes on PCCL, its architecture, and operational principles.
Added a mental model image to the PCCL documentation.