feat(profiling): Add compute-communication overlap diagnostic for FSDP2 #2418
Open
pbpatre wants to merge 4 commits into pytorch:main
Conversation
wconstab
reviewed
Feb 23, 2026
Contributor
What part of the code is specific to FSDP2? It looks pretty generic to me.
Contributor

Yes, you're right. It is actually specific to NCCL kernel names, but it was tested for FSDP2 and may work in general.
tianyu-l reviewed Feb 24, 2026
Contributor
I feel this could be useful, and the code complexity doesn't seem to be crazy.
Meanwhile, I would love to see a refactor of:
- the profiling file into "config build style" following the recent refactor, so ProfilingConfig becomes Profiler.Config
- a generic (list of) profile analyzers, of which this OverlapAnalyzer is the first example

Let me know if this makes sense or not.
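A rough sketch of the nesting the reviewer is describing, where the standalone config class becomes a class attribute of the profiler itself. This is illustrative only; the field names and defaults are assumptions, not torchtitan's actual API.

```python
from dataclasses import dataclass

class Profiler:
    # "Config build style": the config lives inside the component it configures,
    # so ProfilingConfig becomes Profiler.Config.
    @dataclass
    class Config:
        enable_profiling: bool = False
        profile_freq: int = 10
        experimental_diagnostics: bool = False  # gates OverlapAnalyzer

    def __init__(self, config: "Profiler.Config"):
        self.config = config

cfg = Profiler.Config(enable_profiling=True, experimental_diagnostics=True)
profiler = Profiler(cfg)
print(profiler.config.experimental_diagnostics)  # True
```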
Author

@tianyu-l I agree that we can do the quick refactoring for config build style as part of this change. I have updated the PR accordingly and added test cases. Let me know if it looks good.
Summary
Adds an automated compute-communication overlap diagnostic to help identify communication bottlenecks in FSDP2 training.
Implements #2408
Motivation
Achieving high MFU in large-scale training requires effective overlap of compute and NCCL communication. Currently, identifying "communication bubbles" requires manually downloading multi-GB trace files and inspecting them in Chrome Trace Viewer. This feature provides an immediate diagnostic signal directly in the logs.
Changes
New Feature:
- OverlapAnalyzer
- --profiling.experimental_diagnostics flag (off by default)

Implementation Details
- prof.key_averages() for kernel time aggregation
- prof.events() for accurate trace duration calculation
- Overlap formula: (compute + nccl - trace_duration) / nccl * 100
- Kernel classification: nccl for communication; gemm, aten, cublas, cudnn, cutlass, triton, flash for compute

Example Output
Llama 3 8B (8B params) on 2x H200:
[OverlapAnalyzer] Compute-Communication Overlap Report
Total Compute Time : 370.05 ms
Total NCCL Time    : 324.26 ms
Total Trace Time   : 370.05 ms
Overlap Efficiency : 100.0% (conservative lower bound)
Status             : OPTIMAL
Debug model (6M params) on 2x H200:
[OverlapAnalyzer] Compute-Communication Overlap Report
Total Compute Time : 16.27 ms
Total NCCL Time    : 68.40 ms
Total Trace Time   : 68.40 ms
Overlap Efficiency : 23.8% (conservative lower bound)
Status             : COMMUNICATION BOUND
The results correctly show that larger models achieve better overlap (FSDP2 hides communication behind compute), while small models are communication-bound.
Testing
Tested on 2x NVIDIA H200 with:
NGPU=2 ./run_train.sh --model.flavor 8B --training.local_batch_size 1 --training.seq_len 2048 --training.steps 20 --profiling.enable_profiling --profiling.experimental_diagnostics --profiling.profile_freq 10 2>&1 | tee 8B_output.log

NGPU=2 ./run_train.sh --job.config_file torchtitan/models/llama3/train_configs/llama3_8b.toml --profiling.enable_profiling --profiling.profile_freq=10 --profiling.experimental_diagnostics --training.steps=20 2>&1 | tee 6M_output.log

Limitations
The overlap efficiency is reported as a "conservative lower bound" because key_averages() sums kernel durations. When multiple kernels run concurrently on different CUDA streams, actual overlap may be higher. For precise analysis, inspect the Chrome trace directly.
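To see why summed durations understate overlap: merging the (start, end) intervals of individual kernel events gives the true busy wall-clock time when kernels run concurrently on different streams. A hypothetical helper, not part of the PR:

```python
def merged_busy_time(intervals):
    """Total wall-clock time covered by possibly-overlapping (start, end) intervals."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_start is None or start > cur_end:
            # Gap before this interval: close out the current merged run.
            if cur_start is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged run.
            cur_end = max(cur_end, end)
    if cur_start is not None:
        total += cur_end - cur_start
    return total

# Two kernels on different streams overlapping by 5 ms: summing durations
# gives 20 ms, but the wall-clock busy time is only 15 ms.
print(merged_busy_time([(0, 10), (5, 15)]))  # 15.0
```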