
Orchestrate runtime and memory profile measurement in simple_rl and simple_rl_multiprocess#2398

Draft
Lucaskabela wants to merge 2 commits into pytorch:main from Lucaskabela:lucaskabela/metric_measurement

Conversation


@Lucaskabela Lucaskabela commented Feb 19, 2026

Summary

This PR instruments the RL code with timing and memory logging in order to evaluate subsequent changes (compile enablement) in an apples-to-apples manner.

Test

vllm_compat - Simple RL

Execute CUDA_VISIBLE_DEVICES=7 with-proxy VLLM_BATCH_INVARIANT=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN python3 torchtitan/experiments/rl/vllm_compat/simple_rl.py

Step   0 | Loss: -0.0036 | Reward: +0.075 | Samples: 160
  Phase                Time    Peak Mem
  --------------------------------------
  rollout:            6.35s,  14.00 GiB  (14.7%)
  train:             26.85s,  51.70 GiB  (54.4%)
  optimizer:          0.10s,  23.62 GiB  (24.8%)
  weight_sync:        8.13s,  24.20 GiB  (25.5%)
  total:             41.44s
  Sample:  48 + 24 = 72.
Let me check the reasoning.
...
================================================================================
Training Summary (5 steps):
  Total wall-clock:               196.29s
  Cumul. rollout:                  31.32s  ( 16.0%)
  Cumul. train:                   126.13s  ( 64.3%)
  Cumul. optimizer:                 0.39s  (  0.2%)
  Cumul. weight_sync:              38.38s  ( 19.6%)
  Peak mem (rollout):             20.41 GiB
  Peak mem (train):               58.13 GiB
  Peak mem (optimizer):           23.62 GiB
  Peak mem (weight_sync):         24.20 GiB
================================================================================

unified - Simple multiprocess RL

Execute VLLM_BATCH_INVARIANT=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN python3 torchtitan/experiments/rl/unified/simple_rl_multiprocess.py

Step   0 | Loss: -0.0196 | Reward: +1.177
  Phase                Time    Peak Mem
  --------------------------------------
  rollout:            1.49s,   4.09 GiB  (4.3%)
  train:              6.04s,   6.78 GiB  (7.1%)
  optimizer:          0.06s,   8.43 GiB  (8.9%)
  weight_sync:        7.30s
  total:             14.83s
...
Step   9 | Loss: 0.1330 | Reward: +1.177
  Phase                Time    Peak Mem
  --------------------------------------
  rollout:            0.74s,   4.09 GiB  (4.3%)
  train:              4.60s,   9.58 GiB  (10.1%)
  optimizer:          0.03s,   8.44 GiB  (8.9%)
  weight_sync:        7.61s
  total:             12.95s
...
================================================================================
Training Summary (10 steps):
  Total wall-clock:               136.32s
  Cumul. rollout:                   8.29s  (  6.1%)
  Cumul. train:                    48.24s  ( 35.4%)
  Cumul. optimizer:                 0.32s  (  0.2%)
  Cumul. weight_sync:              79.79s  ( 58.5%)
  Peak mem (rollout):              4.09 GiB
  Peak mem (train):                9.58 GiB
  Peak mem (optimizer):            8.44 GiB
================================================================================
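The cumulative summary above can be produced by rolling up per-step phase timings against total wall-clock time. A minimal sketch of such an aggregator (all names here are hypothetical illustrations, not the PR's actual code):

```python
from collections import defaultdict


def summarize(step_timings, wall_clock_s):
    """Aggregate per-step phase timings (dicts of phase -> seconds) into
    cumulative totals and each phase's share of total wall-clock time."""
    totals = defaultdict(float)
    for step in step_timings:
        for phase, seconds in step.items():
            totals[phase] += seconds
    lines = [
        f"Training Summary ({len(step_timings)} steps):",
        f"  Total wall-clock: {wall_clock_s:.2f}s",
    ]
    for phase, total in totals.items():
        pct = 100.0 * total / wall_clock_s
        lines.append(f"  Cumul. {phase}: {total:.2f}s ({pct:.1f}%)")
    return "\n".join(lines)
```

Note that the phase percentages need not sum to 100%, since phases may omit untimed work (logging, data movement) included in the wall-clock total.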

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 19, 2026
@Lucaskabela Lucaskabela force-pushed the lucaskabela/metric_measurement branch 6 times, most recently from d3ef4d9 to 8985a6a Compare February 19, 2026 23:15
lambda: self.state == GeneratorState.READY_TO_GENERATE
)

# Generate samples using vLLM
Contributor Author

Purely an indentation change here (moved inside the gpu_timer context manager).

@Lucaskabela Lucaskabela force-pushed the lucaskabela/metric_measurement branch from 8985a6a to 07acc32 Compare February 20, 2026 00:01
@Lucaskabela Lucaskabela marked this pull request as ready for review February 20, 2026 00:05


@contextmanager
def gpu_timer(sync: bool = True, enabled: bool = True):
Contributor

It is likely that the current torchtitan logging / profiling tools cannot cover new use cases, but I would hope we have a more systematic approach to it.
cc @allenwang28 @felipemello1

Contributor Author

Yeah, I looked at the torchtitan logger and it doesn't seem to offer fine-grained enough control for timing individual stages of the trainer.

Contributor

hey @Lucaskabela thanks for looking into this!

@felipemello1 built similar tooling here: https://github.com/meta-pytorch/torchforge/tree/main/src/forge/observability

We plan to propose more logging capabilities soon-ish into Titan, but in the meantime I feel like a starting point we can edit later is:

  1. Prefer context manager approaches for the regions you want measured
  2. Restrict these changes to the simple_rl folder where we plan to iterate quickly
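A context-manager approach like the one under review could be sketched roughly as follows. This is a minimal illustration only, not the PR's actual implementation: the result-dict keys and the CPU-only fallback are assumptions, and only the signature `gpu_timer(sync, enabled)` comes from the diff shown above.

```python
import time
from contextlib import contextmanager


@contextmanager
def gpu_timer(sync: bool = True, enabled: bool = True):
    """Time a code region and, when CUDA is available, record its peak
    allocated memory in GiB. Degrades to a plain wall-clock timer on CPU."""
    result = {"time_s": 0.0, "peak_gib": None}
    if not enabled:
        yield result
        return
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        cuda = False
    if cuda:
        # Start from a clean high-water mark and drain pending kernels so
        # the measured window only covers work launched inside the block.
        torch.cuda.reset_peak_memory_stats()
        if sync:
            torch.cuda.synchronize()
    start = time.perf_counter()
    try:
        yield result
    finally:
        if cuda and sync:
            torch.cuda.synchronize()
        result["time_s"] = time.perf_counter() - start
        if cuda:
            result["peak_gib"] = torch.cuda.max_memory_allocated() / 2**30
```

The `enabled` flag lets callers leave the instrumentation in place but compile it out of hot paths, and `sync=True` trades a small stall for accurate boundaries around asynchronous CUDA work.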

Contributor Author

Sounds good - I think this PR has been updated to align with this goal :) let me know any specific code we feel needs changing

Contributor

If it's only used in RL folder, no need to change core.

Contributor Author

That's fair - will move this to the RL-specific metrics file

Contributor Author

@tianyu-l done :)

Contributor

I'll let @wwwjn decide

Contributor

@wwwjn wwwjn left a comment

Thanks for the change; observability is a very important area. I want to postpone this PR for a while for the following reasons:

  1. I have a stack of PRs which optimize / change the API signature a lot; I'd prefer to reconsider the metrics/observability once we have a relatively stable RL loop.
  2. What infra metrics do we really need? Are these timers really enough to analyze RL runs?
  3. Can we do the metrics logging more cleanly (today in main torchtitan, it's "almost" just one config), without changing a lot of APIs everywhere in the generator and trainer?

If you need this data for performance comparison before and after some changes, you can put up a script to do so for now.

Comment on lines +473 to +476
gen_time_s=gen_time_s,
gen_peak_active_gib=gen_peak_active_gib,
gen_peak_active_pct=gen_peak_active_pct,
gen_peak_reserved_gib=gen_peak_reserved_gib,
Contributor

These metrics should not be put into the Trajectory data.

Contributor

@allenwang28 allenwang28 left a comment

Thanks for making these changes! As I look through them more though, I want to be upfront - @felipemello1 has been thinking about this since the Forge days, and we've been discussing how we want to iterate and check these changes into Titan.

Rather than going back and forth in review, I think it'd be more efficient to propose a logging RFC that addresses these concerns holistically. I'd hold off on further changes here until then. Let me know if you'd like to be part of the discussions or if this is blocking you now!

@Lucaskabela
Contributor Author

Sure :) In the meantime I will leave this PR as a reference for some subsequent PRs I am putting up, so that we have timing data.

@Lucaskabela Lucaskabela force-pushed the lucaskabela/metric_measurement branch from 04428a5 to 8caac90 Compare February 24, 2026 19:24
@Lucaskabela Lucaskabela marked this pull request as draft February 24, 2026 23:08

Labels

ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot.


4 participants