Dev/distributed #2
base: main
Conversation
Pull Request Overview
This PR introduces distributed multi-GPU training capabilities to the RL pipeline, along with supporting infrastructure and updated training results from recent runs.
- Adds distributed training support via PyTorch DDP for multi-GPU parallelization (see the sketch after this list)
- Includes comprehensive documentation explaining the distributed training architecture
- Updates configuration and results from recent training runs
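For orientation, here is a minimal sketch of the DDP wiring such a pipeline typically uses. The function name and launch assumptions (torchrun exporting RANK, WORLD_SIZE, and LOCAL_RANK) are illustrative and not taken from distributed_pipeline.py.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    """Initialize the default process group and wrap the model in DDP.

    Assumes launch via torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK.
    """
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        model = model.to(local_rank)
        return DDP(model, device_ids=[local_rank])
    return DDP(model)
```

Launched with `torchrun --nproc_per_node=<num_gpus>` on the training entry point, each process then owns one GPU and gradients are averaged across ranks automatically.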
Reviewed Changes
Copilot reviewed 17 out of 62 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| RL/distributed_pipeline.py | New file implementing distributed training pipeline with DDP support |
| RL/readme_distributed.md | Documentation explaining distributed training concepts and usage |
| RL/config.json | Configuration update reducing max_generations from 20 to 1 |
| RL/pyproject.toml | New project configuration file with basic metadata |
| RL/main.py | Simple entry point with placeholder implementation |
| RL/.gitignore | Standard Python gitignore file |
| RL/.python-version | Python version specification (3.12) |
| RL/stl_outputs/generation_000_best_design_params.json | Updated design parameters from recent training run |
| RL/neural_networks/training_metrics_gen_0.json | Updated training metrics and timestamp |
| RL/checkpoints/*.json | Updated checkpoint files with new training data |
| RL/f1_wing_output/*.json | Updated CFD analysis parameters for wing designs |
| RL/cfd_results/*.json | New CFD simulation results for generation 0 individuals |
```
self.logger.info("training...")
```

only rank 0 logs. this avvoids spam from all processes.
Copilot AI (Oct 21, 2025)
Corrected spelling of 'avvoids' to 'avoids'.
- only rank 0 logs. this avvoids spam from all processes.
+ only rank 0 logs. this avoids spam from all processes.
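A common way to get the rank-0-only behavior described in the readme is to silence the logger on every non-zero rank at setup time. The helper names below (setup_logger, get_rank) are hypothetical; only the "AlphaDesign" logger name comes from the PR itself.

```python
import logging

import torch.distributed as dist


def get_rank() -> int:
    """Global rank, or 0 when torch.distributed is not initialized."""
    return dist.get_rank() if dist.is_available() and dist.is_initialized() else 0


def setup_logger(name: str = "AlphaDesign") -> logging.Logger:
    """Return a logger that only emits records on rank 0."""
    logger = logging.getLogger(name)
    if get_rank() == 0:
        logging.basicConfig(level=logging.INFO)
    else:
        # Non-zero ranks keep a logger object but never print anything.
        logger.addHandler(logging.NullHandler())
        logger.setLevel(logging.CRITICAL + 1)
    return logger
```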
)
self.logger = logging.getLogger("AlphaDesign")

def setup_directories(self):
This doesn't need to be done; I think the directories get created automatically when we save files like checkpoints or best designs.
return result

def broadcast_population(self, population: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
Is this for distributed processes?
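For context, broadcasting a population is specifically a cross-process operation: rank 0 typically holds the authoritative population and sends it to every other rank. A sketch of the common torch.distributed pattern (the actual method in this PR may differ):

```python
from typing import Any, Dict, List

import torch.distributed as dist


def broadcast_population(population: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return rank 0's population on every rank; other ranks' input is ignored."""
    if not (dist.is_available() and dist.is_initialized()):
        return population  # single-process fallback
    container = [population if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(container, src=0)
    return container[0]
```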
I just want to test this once. Let me host it on my laptop and test it, with my PC able to connect. Can you add a function to just test the distributed connection, instead of directly integrating this into the pipeline?
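A standalone connection test along those lines could look like the sketch below. The function name test_distributed_connection and the TCP init method are illustrative, not the actual helper in this PR; it only initializes the process group and does a single all-reduce as a smoke test, without touching the training pipeline.

```python
import argparse

import torch
import torch.distributed as dist


def test_distributed_connection(master_addr: str, master_port: int,
                                rank: int, world_size: int) -> None:
    """Smoke-test connectivity between machines: init, one all-reduce, teardown."""
    dist.init_process_group(
        backend="gloo",  # gloo works on CPU-only machines (e.g. a laptop)
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )
    payload = torch.tensor([rank], dtype=torch.float32)
    dist.all_reduce(payload)  # sum over all ranks; proves every process is reachable
    print(f"rank {rank}/{world_size} connected, all_reduce sum = {payload.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--master-addr", required=True)
    parser.add_argument("--master-port", type=int, default=29500)
    parser.add_argument("--rank", type=int, required=True)
    parser.add_argument("--world-size", type=int, required=True)
    args = parser.parse_args()
    test_distributed_connection(args.master_addr, args.master_port,
                                args.rank, args.world_size)
```

Run it on the laptop with `--rank 0` (it acts as the master) and on the PC with `--rank 1 --master-addr <laptop IP>`; if both print the all_reduce sum, the connection works.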
Keep this minimal during cleanup: we won't be using marimo-like tools in any part of the pipeline, so trim this file down to just the necessary entries. I do like the partitioning, though, i.e. a comment heading saying what a section belongs to, followed by the entries to be ignored.
Name the project AlphaDesign-RL and add an appropriate description to match.
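Something along these lines in RL/pyproject.toml would cover both; the version number and description text here are placeholders, not the author's wording.

```toml
[project]
name = "alphadesign-rl"
version = "0.1.0"
description = "Reinforcement learning pipeline for AlphaDesign: distributed multi-GPU design generation, CFD evaluation, and reward-driven optimization."
requires-python = ">=3.12"
```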
The theme of this file is correct, but the requirement is slightly different. The idea is that different groups of training run on the different GPUs (or CPUs) we have, and each one has a queue of its own; when a group completes, it appends its result to its queue. A common CPU (or GPU) that runs the compliance test on the different components then takes from the head of each queue, checks the design, and assigns a reward accordingly. That reward is queued again, so each worker must have two queues, and the rewards in the queue are folded into the next run. If possible, please understand this and implement it accordingly.
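A minimal sketch of that two-queue layout, just to pin down the understanding: each worker owns a design queue and a reward queue, the central evaluator pops the head of every design queue, runs the compliance check, and pushes a reward back, which the worker folds into its next run. All names (worker, evaluator, the placeholder compliance score) are illustrative, and multiprocessing queues stand in for whatever transport the distributed backend ends up using.

```python
import multiprocessing as mp


def worker(worker_id, design_q, reward_q, n_rounds):
    """One training group per GPU/CPU: train, append the result to its own queue,
    then wait for the reward before starting the next run."""
    pending_reward = 0.0
    for round_idx in range(n_rounds):
        # Placeholder for one local training round; the previous reward is
        # folded into the next run, as described above.
        design = {"worker": worker_id, "round": round_idx, "prev_reward": pending_reward}
        design_q.put(design)
        pending_reward = reward_q.get()  # block until the evaluator scores this design


def evaluator(design_qs, reward_qs, n_rounds):
    """Central process: take the head of each worker's design queue, run the
    compliance test, and queue the reward back to that worker."""
    for _ in range(n_rounds):
        for design_q, reward_q in zip(design_qs, reward_qs):
            design = design_q.get()          # head of this worker's queue
            reward = float(design["round"])  # placeholder for the compliance test
            reward_q.put(reward)


if __name__ == "__main__":
    n_workers, n_rounds = 2, 3
    design_qs = [mp.Queue() for _ in range(n_workers)]
    reward_qs = [mp.Queue() for _ in range(n_workers)]
    procs = [
        mp.Process(target=worker, args=(i, design_qs[i], reward_qs[i], n_rounds))
        for i in range(n_workers)
    ]
    for p in procs:
        p.start()
    evaluator(design_qs, reward_qs, n_rounds)
    for p in procs:
        p.join()
```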
Just as in pyproject.toml, name this package alphadesign-rl.
We need to make this a bit faster: we have to complete this within the year, which leaves us roughly two months.
Mantissagithub left a comment
Please check the comments and make the changes accordingly.
…epts and methodologies
- Introduced a comprehensive overview of decentralized training, including key takeaways and methodologies such as DiLoCo, DiPaCo, and SWARM Parallelism.
- Included diagrams to illustrate distributed training architecture and issues.
- Discussed the importance of resource pooling and communication overhead reduction in training models across different machines.
- Outlined the need for a central queueing server and potential monitoring strategies for training processes.
Add detailed notes on PCCL, its architecture, and operational principles.
Added a mental model image to the PCCL documentation.