Distributed training launch patterns with DDP and FSDP across PyTorch, PyTorch Lightning, MONAI, NVIDIA Modulus, NVIDIA NeMo, DeepSpeed, and Ray Train, tailored for SLURM and other multi-GPU environments.
| Framework / Method | Launch Strategy | Notes |
|---|---|---|
| PyTorch (torchrun / torch.distributed) | `torchrun` (or the legacy `torch.distributed.launch`) invoked once per node; spawns one worker per GPU | DDP utility, CycleGAN, MONAI examples |
| PyTorch (mp.spawn) | Script runs once per node; worker processes spawned with `mp.spawn` | Handy for script-based launches |
| PyTorch (srun) | One process per GPU via `srun` | Clean integration with SLURM |
| PyTorch (FSDP) | `torchrun` invoked once per node | Standard FSDP |
| PyTorch Lightning | One process per GPU via `srun` | Uses the Lightning `Trainer` class |
| MONAI (SLURM, DDP) | `torch.distributed.launch` and `srun` | Best practices for medical ML |
| NVIDIA Modulus | One process per GPU via `srun` | PDE-oriented scientific workloads |
| NVIDIA NeMo Megatron | Per-process launch (one process per GPU) | Megatron model parallelism for LLM training; runs under Singularity |
| Ray Train (SLURM) | One process per GPU via `srun` | General distributed computing framework |
| DeepSpeed (SLURM) | One process per GPU via `srun` | DeepSpeed distributed training configuration |
Each folder in this repo contains example scripts and job templates tailored to its framework; minimal sketches of the most common launch patterns from the table follow below.
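The sketches are illustrative, not drop-in scripts: file names like `train_ddp.py`, GPU counts, and model definitions are placeholders. First, DDP under a `torchrun` launch, where `torchrun` exports `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, and the rendezvous variables for every worker:

```python
# Launched once per node from the job script, e.g.:
#   torchrun --nnodes=$SLURM_NNODES --nproc_per_node=4 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR/PORT, and LOCAL_RANK,
    # so the default env:// rendezvous needs no extra arguments.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 8).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```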
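For the `mp.spawn` pattern, the script itself runs once per node and forks one worker per local GPU. The rank arithmetic below assumes the SLURM variables shown and that the job script exports `MASTER_ADDR`/`MASTER_PORT`:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size, node_rank, gpus_per_node):
    # Global rank = node offset + position among this node's workers.
    rank = node_rank * gpus_per_node + local_rank
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # expects MASTER_ADDR / MASTER_PORT in the environment
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DDP, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    gpus_per_node = torch.cuda.device_count()
    node_rank = int(os.environ.get("SLURM_NODEID", 0))
    world_size = gpus_per_node * int(os.environ.get("SLURM_NNODES", 1))
    # mp.spawn passes the worker index (0..nprocs-1) as the first argument.
    mp.spawn(worker, args=(world_size, node_rank, gpus_per_node), nprocs=gpus_per_node)
```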
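For the `srun` pattern, SLURM already starts one task per GPU, so the script only translates SLURM's environment variables into a process group. `MASTER_ADDR`/`MASTER_PORT` must be exported in the job script (e.g. derived from `scontrol show hostnames "$SLURM_JOB_NODELIST"`):

```python
# Launched as, e.g.:
#   srun --ntasks-per-node=4 --gpus-per-node=4 python train_srun.py
import os

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)
# ... wrap the model in DDP and train as usual ...
```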
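FSDP uses the same `torchrun` launch as DDP; only the model wrapper changes. A minimal sketch with the default settings (full sharding, no auto-wrap policy):

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(  # stand-in model
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).cuda(local_rank)
model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks

# ... training loop, then dist.destroy_process_group() ...
```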
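PyTorch Lightning detects the SLURM environment on its own, so the sketch reduces to configuring the `Trainer` to match the job script; the device and node counts here are placeholders:

```python
import lightning.pytorch as pl  # older releases: import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,     # should match --ntasks-per-node / GPUs per node
    num_nodes=2,   # should match --nodes in the job script
    strategy="ddp",
)
# trainer.fit(model, datamodule)  # model is a LightningModule
```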
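For Ray Train (assuming a recent Ray release), `srun` or the cluster launcher starts the Ray processes, and `TorchTrainer` sets up the per-worker process group itself; `num_workers` below is a placeholder for the total GPU count:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Ray initializes torch.distributed for each worker;
    # write an ordinary DDP training loop here.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```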
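DeepSpeed can likewise run one process per GPU under `srun`, provided the job script maps the SLURM variables onto `RANK`/`LOCAL_RANK`/`WORLD_SIZE` and `MASTER_ADDR`/`MASTER_PORT` before `deepspeed.init_distributed()` is called; the ZeRO stage and optimizer settings below are illustrative, not recommendations:

```python
import deepspeed
import torch

# Reads RANK / LOCAL_RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT
# from the environment (exported from SLURM variables in the job script).
deepspeed.init_distributed(dist_backend="nccl")

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 2},  # illustrative ZeRO stage
}
model = torch.nn.Linear(1024, 10)  # stand-in model
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# In the training loop: model_engine.backward(loss); model_engine.step()
```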