feat(profiling): Add compute-communication overlap diagnostic for FSDP2 #2418
Open
pbpatre wants to merge 4 commits into pytorch:main
Conversation
wconstab
reviewed
Feb 23, 2026
Contributor
What part of the code is specific to FSDP2? It looks pretty generic to me.
Contributor

Yes, you're right. It is actually specific to NCCL kernel names, but it was tested for FSDP2 and may work in general.
tianyu-l reviewed Feb 24, 2026
Contributor
I feel this could be useful, and the code complexity doesn't seem to be crazy.
Meanwhile, I would love to see a refactor of:
- the profiling file into "config build style" following the recent refactor, so ProfilingConfig becomes Profiler.Config
- a generic (list of) profile analyzers, of which this OverlapAnalyzer is the first example

Let me know if this makes sense or not.
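A rough sketch of the nesting the reviewer is describing, where the standalone config class becomes a class attribute of the profiler itself. This is illustrative only; the field names and defaults are assumptions, not torchtitan's actual API.

```python
from dataclasses import dataclass

class Profiler:
    # "Config build style": the config lives inside the component it configures,
    # so ProfilingConfig becomes Profiler.Config.
    @dataclass
    class Config:
        enable_profiling: bool = False
        profile_freq: int = 10
        experimental_diagnostics: bool = False  # gates OverlapAnalyzer

    def __init__(self, config: "Profiler.Config"):
        self.config = config

cfg = Profiler.Config(enable_profiling=True, experimental_diagnostics=True)
profiler = Profiler(cfg)
print(profiler.config.experimental_diagnostics)  # True
```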
Author

@tianyu-l I agree that we can do the quick refactoring for config build style as part of this change. I have updated the PR accordingly and added test cases. Let me know if it looks good.
Summary
Adds an automated compute-communication overlap diagnostic to help identify communication bottlenecks in FSDP2 training.
Implements #2408
Motivation
Achieving high MFU in large-scale training requires effective overlap of compute and NCCL communication. Currently, identifying "communication bubbles" requires manually downloading multi-GB trace files and inspecting them in Chrome Trace Viewer. This feature provides an immediate diagnostic signal directly in the logs.
Changes
New Feature:
- OverlapAnalyzer
- --profiling.experimental_diagnostics flag (off by default)

Implementation Details
- prof.key_averages() for kernel time aggregation
- prof.events() for accurate trace duration calculation
- Overlap formula: (compute + nccl - trace_duration) / nccl * 100
- Kernel classification: nccl for communication; gemm, aten, cublas, cudnn, cutlass, triton, flash for compute

Example Output
Llama 3 8B (8B params) on 2x H200:
[OverlapAnalyzer] Compute-Communication Overlap Report
Total Compute Time : 370.05 ms
Total NCCL Time    : 324.26 ms
Total Trace Time   : 370.05 ms
Overlap Efficiency : 100.0% (conservative lower bound)
Status             : OPTIMAL
Debug model (6M params) on 2x H200:
[OverlapAnalyzer] Compute-Communication Overlap Report
Total Compute Time : 16.27 ms
Total NCCL Time    : 68.40 ms
Total Trace Time   : 68.40 ms
Overlap Efficiency : 23.8% (conservative lower bound)
Status             : COMMUNICATION BOUND
The results correctly show that larger models achieve better overlap (FSDP2 hides communication behind compute), while small models are communication-bound.
Testing
Tested on 2x NVIDIA H200 with:
NGPU=2 ./run_train.sh --model.flavor 8B --training.local_batch_size 1 --training.seq_len 2048 --training.steps 20 --profiling.enable_profiling --profiling.experimental_diagnostics --profiling.profile_freq 10 2>&1 | tee 8B_output.log

NGPU=2 ./run_train.sh --job.config_file torchtitan/models/llama3/train_configs/llama3_8b.toml --profiling.enable_profiling --profiling.profile_freq=10 --profiling.experimental_diagnostics --training.steps=20 2>&1 | tee 6M_output.log

Limitations
The overlap efficiency is reported as a "conservative lower bound" because key_averages() sums kernel durations. When multiple kernels run concurrently on different CUDA streams, actual overlap may be higher. For precise analysis, inspect the Chrome trace directly.
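To see why summed durations understate overlap: merging the (start, end) intervals of individual kernel events gives the true busy wall-clock time when kernels run concurrently on different streams. A hypothetical helper, not part of the PR:

```python
def merged_busy_time(intervals):
    """Total wall-clock time covered by possibly-overlapping (start, end) intervals."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_start is None or start > cur_end:
            # Gap before this interval: close out the current merged run.
            if cur_start is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged run.
            cur_end = max(cur_end, end)
    if cur_start is not None:
        total += cur_end - cur_start
    return total

# Two kernels on different streams overlapping by 5 ms: summing durations
# gives 20 ms, but the wall-clock busy time is only 15 ms.
print(merged_busy_time([(0, 10), (5, 15)]))  # 15.0
```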