
Conversation

@HARISH20205
Member

No description provided.

@vercel

vercel bot commented Oct 21, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| alpha-design | Ready | Ready (Preview) | Comment | Oct 29, 2025 10:44pm |


Copilot AI left a comment


Pull Request Overview

This PR introduces distributed multi-GPU training capabilities to the RL pipeline, along with supporting infrastructure and updated training results from recent runs.

  • Adds distributed training support via PyTorch DDP for multi-GPU parallelization (a generic DDP sketch follows this list)
  • Includes comprehensive documentation explaining the distributed training architecture
  • Updates configuration and results from recent training runs
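
For readers unfamiliar with DDP, the standard wrapping pattern looks roughly like the sketch below. This is a generic illustration (the helper name `wrap_model_for_ddp` is ours), not the PR's actual code:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model_for_ddp(model: torch.nn.Module, rank: int) -> torch.nn.Module:
    """Generic DDP wrapping: assumes init_process_group was already called and one GPU per process."""
    torch.cuda.set_device(rank)
    model = model.to(rank)
    # DDP synchronizes gradients across processes during backward().
    return DDP(model, device_ids=[rank])
```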

Reviewed Changes

Copilot reviewed 17 out of 62 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| RL/distributed_pipeline.py | New file implementing the distributed training pipeline with DDP support |
| RL/readme_distributed.md | Documentation explaining distributed training concepts and usage |
| RL/config.json | Configuration update reducing max_generations from 20 to 1 |
| RL/pyproject.toml | New project configuration file with basic metadata |
| RL/main.py | Simple entry point with placeholder implementation |
| RL/.gitignore | Standard Python gitignore file |
| RL/.python-version | Python version specification (3.12) |
| RL/stl_outputs/generation_000_best_design_params.json | Updated design parameters from recent training run |
| RL/neural_networks/training_metrics_gen_0.json | Updated training metrics and timestamp |
| RL/checkpoints/*.json | Updated checkpoint files with new training data |
| RL/f1_wing_output/*.json | Updated CFD analysis parameters for wing designs |
| RL/cfd_results/*.json | New CFD simulation results for generation 0 individuals |


```
self.logger.info("training...")
```

only rank 0 logs. this avvoids spam from all processes.

Copilot AI Oct 21, 2025


Corrected spelling of 'avvoids' to 'avoids'.

Suggested change
only rank 0 logs. this avvoids spam from all processes.
only rank 0 logs. this avoids spam from all processes.

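As a side note, the rank-0-only logging pattern the documentation describes typically looks something like the sketch below (a minimal illustration; the helper name `setup_logging` is assumed, not the PR's code):

```python
import logging


def setup_logging(rank: int) -> logging.Logger:
    """Only rank 0 emits INFO logs; other ranks are effectively silenced."""
    logger = logging.getLogger("AlphaDesign")
    logger.setLevel(logging.INFO if rank == 0 else logging.ERROR)
    logger.addHandler(logging.StreamHandler())
    return logger
```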
```
)
self.logger = logging.getLogger("AlphaDesign")

def setup_directories(self):
```
Member


This isn't needed: when we save files like checkpoints or best designs, the directories get created automatically, I think.
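
For reference, directory creation can indeed happen lazily at save time, along the lines of the sketch below (illustrative only; `save_checkpoint` is not a function from the PR):

```python
import json
import os


def save_checkpoint(path: str, data: dict) -> None:
    # Create parent directories on demand instead of a separate setup step.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
```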


```
return result

def broadcast_population(self, population: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
```
Member


is this for distributed processes?
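
For context, broadcasting a population from rank 0 to the other processes with torch.distributed typically looks like the sketch below (the method name follows the diff, but the body is an assumption, not the PR's implementation):

```python
import torch.distributed as dist
from typing import Any, Dict, List


def broadcast_population(population: List[Dict[str, Any]], rank: int) -> List[Dict[str, Any]]:
    # Rank 0 holds the authoritative population; every other rank receives a copy.
    payload = [population if rank == 0 else None]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]
```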

Member


I just want to test this once. Let me host it on my laptop and test it, with my PC connecting to it. Can you add a function that just tests the distributed connection, instead of directly integrating this into the pipeline?
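
A minimal standalone connection test could look roughly like this, assuming the standard RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT environment variables point at the hosting machine (a sketch, not part of the PR):

```python
import os
import torch
import torch.distributed as dist


def test_distributed_connection() -> None:
    """Smoke test: join the process group and run a single all_reduce."""
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # gloo works on CPU, so a laptop and a PC can run this without GPUs.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    tensor = torch.ones(1)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size}: all_reduce gave {tensor.item()} (expected {world_size})")

    dist.destroy_process_group()


if __name__ == "__main__":
    test_distributed_connection()
```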

Member


Keep this minimal for cleanup: we won't be using marimo or similar tools in any part of the pipeline, so clean this file down to just the necessary entries. I do like the partitioning, though: a comment heading saying what a group belongs to, followed by the entries to be ignored.

Member


Add a description: name the project AlphaDesign-RL and write an appropriate description to go with it.

Member


The theme of this file is correct, but the requirement is slightly different. The idea is that different groups of training run on the different GPUs (or CPUs) we have. Each one has a queue of its own: when it completes, it appends its result to its queue. Then the common CPU (or GPU) that runs the compliance test on the different components takes from the head of each queue, checks the design, and gives a reward accordingly. That reward is queued again, so each worker must have two queues, and the rewards in its queue are consumed in the next run. If possible, please understand this and implement it accordingly.
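
A rough sketch of the two-queue pattern described above, using multiprocessing on one machine for illustration; the names (`worker`, `evaluator`, `compliance_test`) and the single-machine queues are assumptions, and a cross-machine version would need a networked queue instead:

```python
from multiprocessing import Process, Queue
from typing import Any, Dict, List, Tuple

GENERATIONS = 3


def compliance_test(design: Dict[str, Any]) -> float:
    # Placeholder for the real compliance / CFD check.
    return float(design["generation"]) + 0.5


def worker(worker_id: int, design_q: Queue, reward_q: Queue) -> None:
    """One training group (one GPU/CPU): push each finished design, then wait for its reward."""
    reward = None
    for generation in range(GENERATIONS):
        design = {"worker": worker_id, "generation": generation, "prev_reward": reward}
        design_q.put(design)     # append to this worker's own design queue
        reward = reward_q.get()  # reward comes back on the second queue, used in the next run


def evaluator(queues: List[Tuple[Queue, Queue]]) -> None:
    """Central process: take the head of each worker's queue, score it, queue the reward back."""
    for _ in range(GENERATIONS):
        for design_q, reward_q in queues:
            design = design_q.get()
            reward_q.put(compliance_test(design))


if __name__ == "__main__":
    queues = [(Queue(), Queue()) for _ in range(2)]
    workers = [Process(target=worker, args=(i, dq, rq)) for i, (dq, rq) in enumerate(queues)]
    central = Process(target=evaluator, args=(queues,))
    for p in workers + [central]:
        p.start()
    for p in workers + [central]:
        p.join()
```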

Member


Just as with pyproject.toml, name this package alphadesign-rl.

@Mantissagithub
Member

We need to move a bit faster on this: we need to complete it within this year, which leaves us roughly two months.

Member

@Mantissagithub left a comment


Please check the comments and make the changes accordingly.

…epts and methodologies

- Introduced a comprehensive overview of decentralized training, including key takeaways and methodologies such as DiLoCo, DiPaCo, and SWARM Parallelism.
- Included diagrams to illustrate distributed training architecture and issues.
- Discussed the importance of resource pooling and communication overhead reduction in training models across different machines.
- Outlined the need for a central queueing server and potential monitoring strategies for training processes.
Add detailed notes on PCCL, its architecture, and operational principles.
Added a mental model image to the PCCL documentation.

Labels

enhancement New feature or request
