feat(profiling): Add compute-communication overlap diagnostic for FSDP2#2418

Open
pbpatre wants to merge 4 commits into pytorch:main from pbpatre:feature/rfc-2408-compute-comms-overlap

Conversation

@pbpatre

@pbpatre pbpatre commented Feb 21, 2026

Summary

Adds an automated compute-communication overlap diagnostic to help identify communication bottlenecks in FSDP2 training.

Implements #2408

Motivation

Achieving high MFU in large-scale training requires effective overlap of compute and NCCL communication. Currently, identifying "communication bubbles" requires manually downloading multi-GB trace files and inspecting them in Chrome Trace Viewer. This feature provides an immediate diagnostic signal directly in the logs.

Changes

New Feature: OverlapAnalyzer

  • Analyzes profiler traces to compute overlap efficiency between compute kernels and NCCL operations
  • Reports Total Compute Time, Total NCCL Time, Total Trace Time, and Overlap Efficiency
  • Classifies workloads as "OPTIMAL" (≥50% overlap) or "COMMUNICATION BOUND" (<50%)
  • Enabled via --profiling.experimental_diagnostics flag (off by default)

Implementation Details

  • Uses prof.key_averages() for kernel time aggregation
  • Uses prof.events() for accurate trace duration calculation
  • Computes overlap: (compute + nccl - trace_duration) / nccl * 100
  • Classifies kernels by name patterns: nccl for communication; gemm, aten, cublas, cudnn, cutlass, triton, flash for compute
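The classification and overlap formula listed above can be sketched as a few pure functions. This is a minimal illustration, not the PR's actual code: the function names, the case-insensitive substring matching, the "other" bucket, the clamping to the 0-100 range, and the handling of zero NCCL time are all assumptions. Only the name patterns, the formula, and the 50% threshold come from the description.

```python
# Name patterns as listed in the PR description.
NCCL_PATTERNS = ("nccl",)
COMPUTE_PATTERNS = ("gemm", "aten", "cublas", "cudnn", "cutlass", "triton", "flash")


def classify_kernel(name: str) -> str:
    """Bucket a kernel name as 'nccl', 'compute', or 'other'.

    Assumption: matching is a case-insensitive substring test, and
    communication patterns are checked before compute patterns.
    """
    lowered = name.lower()
    if any(pattern in lowered for pattern in NCCL_PATTERNS):
        return "nccl"
    if any(pattern in lowered for pattern in COMPUTE_PATTERNS):
        return "compute"
    return "other"


def overlap_efficiency(compute_ms: float, nccl_ms: float, trace_ms: float) -> float:
    """Apply (compute + nccl - trace_duration) / nccl * 100 from the description."""
    if nccl_ms <= 0:
        # Assumption: with no communication there is nothing left to hide.
        return 100.0
    pct = (compute_ms + nccl_ms - trace_ms) / nccl_ms * 100.0
    # Assumption: clamp so summed durations cannot push the result out of range.
    return max(0.0, min(100.0, pct))


def overlap_status(pct: float, threshold: float = 50.0) -> str:
    """Map an efficiency percentage to the status labels used in the report."""
    return "OPTIMAL" if pct >= threshold else "COMMUNICATION BOUND"


if __name__ == "__main__":
    pct = overlap_efficiency(compute_ms=16.27, nccl_ms=68.40, trace_ms=68.40)
    print(f"Overlap Efficiency : {pct:.1f}%  Status : {overlap_status(pct)}")
```

Fed the figures from the debug-model report in Example Output (16.27 ms compute, 68.40 ms NCCL, 68.40 ms trace), the formula reduces to 16.27 / 68.40, which rounds to the 23.8% reported there.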

Example Output

Llama 3 8B (8B params) on 2x H200:
[OverlapAnalyzer] Compute-Communication Overlap Report
  Total Compute Time : 370.05 ms
  Total NCCL Time    : 324.26 ms
  Total Trace Time   : 370.05 ms
  Overlap Efficiency : 100.0% (conservative lower bound)
  Status             : OPTIMAL

Debug model (6M params) on 2x H200:
[OverlapAnalyzer] Compute-Communication Overlap Report
  Total Compute Time : 16.27 ms
  Total NCCL Time    : 68.40 ms
  Total Trace Time   : 68.40 ms
  Overlap Efficiency : 23.8% (conservative lower bound)
  Status             : COMMUNICATION BOUND

These results match expectations: the larger model achieves better overlap (FSDP2 hides communication behind compute), while the small debug model is communication-bound.

Testing

Tested on 2x NVIDIA H200 with:

  • Llama 3 8B: NGPU=2 ./run_train.sh --model.flavor 8B --training.local_batch_size 1 --training.seq_len 2048 --training.steps 20 --profiling.enable_profiling --profiling.experimental_diagnostics --profiling.profile_freq 10 2>&1 | tee 8B_output.log
  • Debug model: NGPU=2 ./run_train.sh --job.config_file torchtitan/models/llama3/train_configs/llama3_8b.toml --profiling.enable_profiling --profiling.profile_freq=10 --profiling.experimental_diagnostics --training.steps=20 2>&1 | tee 6M_output.log

Limitations

The overlap efficiency is reported as a "conservative lower bound" because key_averages() sums kernel durations. When multiple kernels run concurrently on different CUDA streams, actual overlap may be higher. For precise analysis, inspect the Chrome trace directly.
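To make the limitation concrete, the sketch below contrasts summed kernel durations with wall-clock coverage computed from per-kernel time intervals. It is not part of this PR. The `(start, end)` tuples are hypothetical inputs in milliseconds, and extracting them from a profiler trace is left out. The point is only that two kernels running concurrently on different streams count twice in a sum but once in an interval union.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # hypothetical (start_ms, end_ms) of one kernel


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[List[float]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_time(intervals: List[Interval]) -> float:
    """Wall-clock time covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def intersection_time(a: List[Interval], b: List[Interval]) -> float:
    """Wall-clock time during which both groups have an active kernel."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def interval_overlap_pct(compute: List[Interval], nccl: List[Interval]) -> float:
    """Share of NCCL wall-clock time that coincides with compute."""
    nccl_time = covered_time(nccl)
    if nccl_time == 0:
        return 100.0  # assumption: no communication means nothing to hide
    return intersection_time(compute, nccl) / nccl_time * 100.0


if __name__ == "__main__":
    compute = [(0.0, 10.0), (5.0, 15.0)]  # two kernels on different streams
    nccl = [(8.0, 18.0)]
    print("summed compute :", sum(e - s for s, e in compute))
    print("covered compute:", covered_time(compute))
    print("overlap pct    :", interval_overlap_pct(compute, nccl))
```

In this made-up example the summed compute time is 20 ms while the compute kernels cover only 15 ms of wall-clock time, so the two approaches can report different overlap figures for the same trace.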

@pbpatre pbpatre force-pushed the feature/rfc-2408-compute-comms-overlap branch from 3804227 to bdb76ef Compare February 24, 2026 17:56
@wconstab
Contributor

wonder what others think about adding this to torchtitan. seems nice, but also, its specific to FSDP2 and not sure if it belongs in torchtitan. cc @fegin @tianyu-l @wwwjn

@tianyu-l
Contributor

@wconstab

its specific to FSDP2

What part of the code is specific to FSDP2? It looks pretty generic to me.

@wconstab
Contributor

yes, you're right. it is actually specific to nccl kernel names, but was tested for FSDP2, maybe works in general.

Contributor

@tianyu-l tianyu-l left a comment

I feel this could be useful, and the code complexity doesn't seem to be crazy.

Meanwhile I would love to see refactor of

  • Profiling file into "config build style" following recent refactor, so ProfilingConfig becomes Profiler.Config
  • provide a generic (list of) profile analyzer which this OverlapAnalyzer is the first example

Let me know if this makes sense or not.

@pbpatre
Author

pbpatre commented Feb 25, 2026

@tianyu-l I agree that we can do the quick refactoring to the config build style as part of this change. I have updated the PR with that refactor and added test cases. Let me know if it looks good.

Meanwhile I would love to see refactor of

  • Profiling file into "config build style" following recent refactor, so ProfilingConfig becomes Profiler.Config
  • provide a generic (list of) profile analyzer which this OverlapAnalyzer is the first example



Development

Successfully merging this pull request may close these issues.

[RFC] Feature: Automated Compute-Communication Overlap Diagnostic in Profiler

3 participants