
Orchestrate runtime and memory profile measurement in simple_rl and simple_rl_multiprocess#2398

Draft
Lucaskabela wants to merge 2 commits into pytorch:main from Lucaskabela:lucaskabela/metric_measurement

Conversation


@Lucaskabela Lucaskabela commented Feb 19, 2026

Summary

This PR instruments the RL code with timing and memory logging in order to evaluate subsequent changes (compile enablement) in an apples-to-apples manner.

Test

vllm_compat - Simple RL

Execute CUDA_VISIBLE_DEVICES=7 with-proxy VLLM_BATCH_INVARIANT=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN python3 torchtitan/experiments/rl/vllm_compat/simple_rl.py

Step   0 | Loss: -0.0036 | Reward: +0.075 | Samples: 160
  Phase                Time    Peak Mem
  --------------------------------------
  rollout:            6.35s,  14.00 GiB  (14.7%)
  train:             26.85s,  51.70 GiB  (54.4%)
  optimizer:          0.10s,  23.62 GiB  (24.8%)
  weight_sync:        8.13s,  24.20 GiB  (25.5%)
  total:             41.44s
  Sample:  48 + 24 = 72.
Let me check the reasoning.
...
================================================================================
Training Summary (5 steps):
  Total wall-clock:               196.29s
  Cumul. rollout:                  31.32s  ( 16.0%)
  Cumul. train:                   126.13s  ( 64.3%)
  Cumul. optimizer:                 0.39s  (  0.2%)
  Cumul. weight_sync:              38.38s  ( 19.6%)
  Peak mem (rollout):             20.41 GiB
  Peak mem (train):               58.13 GiB
  Peak mem (optimizer):           23.62 GiB
  Peak mem (weight_sync):         24.20 GiB
================================================================================

unified - Simple multiprocess RL

Execute VLLM_BATCH_INVARIANT=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN python3 torchtitan/experiments/rl/unified/simple_rl_multiprocess.py

Step   0 | Loss: -0.0196 | Reward: +1.177
  Phase                Time    Peak Mem
  --------------------------------------
  rollout:            1.49s,   4.09 GiB  (4.3%)
  train:              6.04s,   6.78 GiB  (7.1%)
  optimizer:          0.06s,   8.43 GiB  (8.9%)
  weight_sync:        7.30s
  total:             14.83s
...
Step   9 | Loss: 0.1330 | Reward: +1.177
  Phase                Time    Peak Mem
  --------------------------------------
  rollout:            0.74s,   4.09 GiB  (4.3%)
  train:              4.60s,   9.58 GiB  (10.1%)
  optimizer:          0.03s,   8.44 GiB  (8.9%)
  weight_sync:        7.61s
  total:             12.95s
...
================================================================================
Training Summary (10 steps):
  Total wall-clock:               136.32s
  Cumul. rollout:                   8.29s  (  6.1%)
  Cumul. train:                    48.24s  ( 35.4%)
  Cumul. optimizer:                 0.32s  (  0.2%)
  Cumul. weight_sync:              79.79s  ( 58.5%)
  Peak mem (rollout):              4.09 GiB
  Peak mem (train):                9.58 GiB
  Peak mem (optimizer):            8.44 GiB
================================================================================
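The cumulative summary above can be produced by rolling up per-step phase timings against total wall-clock time. A minimal sketch of such an aggregator (all names here are hypothetical illustrations, not the PR's actual code):

```python
from collections import defaultdict


def summarize(step_timings, wall_clock_s):
    """Aggregate per-step phase timings (dicts of phase -> seconds) into
    cumulative totals and each phase's share of total wall-clock time."""
    totals = defaultdict(float)
    for step in step_timings:
        for phase, seconds in step.items():
            totals[phase] += seconds
    lines = [
        f"Training Summary ({len(step_timings)} steps):",
        f"  Total wall-clock: {wall_clock_s:.2f}s",
    ]
    for phase, total in totals.items():
        pct = 100.0 * total / wall_clock_s
        lines.append(f"  Cumul. {phase}: {total:.2f}s ({pct:.1f}%)")
    return "\n".join(lines)
```

Note that the phase percentages need not sum to 100%, since phases may omit untimed work (logging, data movement) included in the wall-clock total.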

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 19, 2026
@Lucaskabela Lucaskabela force-pushed the lucaskabela/metric_measurement branch 6 times, most recently from d3ef4d9 to 8985a6a Compare February 19, 2026 23:15
lambda: self.state == GeneratorState.READY_TO_GENERATE
)

# Generate samples using vLLM
Contributor Author

Purely an indentation change here (moved inside the gpu_timer context manager).

@Lucaskabela Lucaskabela force-pushed the lucaskabela/metric_measurement branch from 8985a6a to 07acc32 Compare February 20, 2026 00:01
@Lucaskabela Lucaskabela marked this pull request as ready for review February 20, 2026 00:05


@contextmanager
def gpu_timer(sync: bool = True, enabled: bool = True):
Contributor

It is likely that the current torchtitan logging / profiling tools cannot cover new use cases, but I would hope we have a more systematic approach to it.
cc @allenwang28 @felipemello1

Contributor Author

Yeah, I looked at the torchtitan logger and it doesn't seem to offer fine-grained enough control for timing individual stages of the trainer.

Contributor

hey @Lucaskabela thanks for looking into this!

@felipemello1 built similar tooling here: https://github.com/meta-pytorch/torchforge/tree/main/src/forge/observability

We plan to propose more logging capabilities soon-ish into Titan, but in the meantime I feel like a starting point we can edit later is:

  1. Prefer context manager approaches for the regions you want measured
  2. Restrict these changes to the simple_rl folder where we plan to iterate quickly
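A context-manager approach like the one under review could be sketched roughly as follows. This is a minimal illustration only, not the PR's actual implementation: the result-dict keys and the CPU-only fallback are assumptions, and only the signature `gpu_timer(sync, enabled)` comes from the diff shown above.

```python
import time
from contextlib import contextmanager


@contextmanager
def gpu_timer(sync: bool = True, enabled: bool = True):
    """Time a code region and, when CUDA is available, record its peak
    allocated memory in GiB. Degrades to a plain wall-clock timer on CPU."""
    result = {"time_s": 0.0, "peak_gib": None}
    if not enabled:
        yield result
        return
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        cuda = False
    if cuda:
        # Start from a clean high-water mark and drain pending kernels so
        # the measured window only covers work launched inside the block.
        torch.cuda.reset_peak_memory_stats()
        if sync:
            torch.cuda.synchronize()
    start = time.perf_counter()
    try:
        yield result
    finally:
        if cuda and sync:
            torch.cuda.synchronize()
        result["time_s"] = time.perf_counter() - start
        if cuda:
            result["peak_gib"] = torch.cuda.max_memory_allocated() / 2**30
```

The `enabled` flag lets callers leave the instrumentation in place but compile it out of hot paths, and `sync=True` trades a small stall for accurate boundaries around asynchronous CUDA work.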

Contributor Author

Sounds good - I think this PR has been updated to align with this goal :) let me know any specific code we feel needs changing

Contributor

If it's only used in RL folder, no need to change core.

Contributor Author

That's fair - will move this to the RL-specific metrics file

Contributor Author

@tianyu-l done :)

Contributor

I'll let @wwwjn decide

Contributor

@wwwjn wwwjn left a comment

Thanks for the change; observability is a very important area. I want to postpone this PR for a while for the following reasons:

  1. I have a stack of PRs which optimize / change the API signature a lot; I'd prefer to reconsider the metrics/observability once we have a relatively stable RL loop.
  2. What infra metrics do we really need? Are these timers really enough to analyze RL runs?
  3. Can we do the metrics logging more cleanly (today in main torchtitan, it's "almost" just one config), without changing a lot of APIs everywhere in the generator and trainer?

If you need this data for performance comparison before and after some changes, you can put up a script to do so for now.

Comment on lines +473 to +476
gen_time_s=gen_time_s,
gen_peak_active_gib=gen_peak_active_gib,
gen_peak_active_pct=gen_peak_active_pct,
gen_peak_reserved_gib=gen_peak_reserved_gib,
Contributor

These metrics should not be put into the Trajectory data.

Contributor

@allenwang28 allenwang28 left a comment

Thanks for making these changes! As I look through them more though, I want to be upfront - @felipemello1 has been thinking about this since the Forge days, and we've been discussing how we want to iterate and check these changes into Titan.

Rather than going back and forth in review, I think it'd be more efficient to propose a logging RFC that addresses these concerns holistically. I'd hold off on further changes here until then. Let me know if you'd like to be part of the discussions or if this is blocking you now!

@Lucaskabela
Contributor Author

Sure :) In the meantime I will leave this PR as a reference for some subsequent PRs I am putting up, so that we have timing data.

@Lucaskabela Lucaskabela force-pushed the lucaskabela/metric_measurement branch from 04428a5 to 8caac90 Compare February 24, 2026 19:24
@Lucaskabela Lucaskabela marked this pull request as draft February 24, 2026 23:08

Labels

ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot.


4 participants