
[rl] Generator enables TP, using torchtitan as Trainer, add grader for reward calculation#2244

Open
wwwjn wants to merge 20 commits into gh/wwwjn/7/base from gh/wwwjn/7/head

Conversation


@wwwjn wwwjn commented Jan 16, 2026

Stack from ghstack (oldest at bottom):

Current status:

  • Applied the same parallelism on trainer / generator, e.g., Trainer (TP=2) and Generator (TP=2)
    • NOTE: for now we have a strong assumption that the trainer and generator must use the same parallelism and be collocated. This is because we unwrap the DTensor and assume we can always wrap it back with the same device mesh and placements.
    • We should be able to remove this constraint once we have more powerful weight sync. cc @daniellepintz
  • Weight transfer: simply unwrap the DTensor in the trainer before sending
  • Collocated trainer and generator
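The unwrap-and-rewrap contract described above can be sketched as follows. This is a hypothetical illustration (pack_for_send / unpack_after_recv are made-up names, not code from this PR), run here with plain tensors so no device mesh is needed; comments mark where DTensor.to_local / DTensor.from_local would apply in the real flow.

```python
import torch

def pack_for_send(state_dict):
    # Unwrap any DTensor to its local shard before transfer.
    # A real DTensor exposes .to_local(); plain tensors pass through.
    return {
        name: (t.to_local() if hasattr(t, "to_local") else t)
        for name, t in state_dict.items()
    }

def unpack_after_recv(local_state, mesh=None, placements=None):
    # Re-wrap each shard with the SAME device mesh and placements.
    # This is the strong collocation assumption noted above: with a real
    # mesh this would be DTensor.from_local(t, mesh, placements).
    if mesh is None:
        return dict(local_state)  # plain-tensor fallback for this sketch
    from torch.distributed.tensor import DTensor
    return {
        name: DTensor.from_local(t, mesh, placements)
        for name, t in local_state.items()
    }

# Round-trip with plain tensors: send side unwraps, receive side rewraps.
sd = {"w": torch.arange(4.0).reshape(2, 2)}
restored = unpack_after_recv(pack_for_send(sd))
assert torch.equal(restored["w"], sd["w"])
```

Once the trainer and generator can differ in parallelism, the rewrap step would need a resharding path instead of reusing the sender's mesh and placements.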

Next Step:

  • Patch the batch-invariant kernel; test whether the trainer / generator is batch-invariant
  • Investigate collective communication kernel when
  • CI as a guard

Toy GRPO training:

INFO:__main__:[actor=<root>] 
Step   8 | Loss: -0.0078 | Reward: +1.352
INFO:__main__:[actor=<root>]   Sample:  Paris, the capital of Italy is Rome, and the capital of Spain is Madrid. The ca...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] os.getpid()=3397374 Generating start generate (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Starting generation: 5 prompts, n_samples_per_prompt=8, max_tokens=100, temp=0.8
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] os.getpid()=3397877 Generating start generate (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Starting generation: 5 prompts, n_samples_per_prompt=8, max_tokens=100, temp=0.8
INFO 02-10 17:06:33 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:33 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:44 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 26.9 tokens/s, Avg generation throughput: 77.0 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:44 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 26.9 tokens/s, Avg generation throughput: 76.7 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:54 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.2 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:54 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.4 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
INFO 02-10 17:07:05 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.5 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
INFO 02-10 17:07:05 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.4 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
INFO 02-10 17:07:15 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.3 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
INFO 02-10 17:07:15 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.3 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Generated 40 completions for scoring
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] os.getpid()=3397877 Generating finish generate (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Generated 40 completions for scoring
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] os.getpid()=3397374 Generating finish generate (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.grader:[actor=<root>.<torchtitan.experiments.rl.unified.actors.grader.Grader grader{'gpus': 1/2}>] Grader scoring trajectory (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.grader:[actor=<root>.<torchtitan.experiments.rl.unified.actors.grader.Grader grader{'gpus': 0/2}>] Grader scoring trajectory (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.grader:[actor=<root>.<torchtitan.experiments.rl.unified.actors.grader.Grader grader{'gpus': 1/2}>] Grader finished scoring: reward_mean=1.3548, reward_std=0.7215
INFO:torchtitan.experiments.rl.unified.actors.grader:[actor=<root>.<torchtitan.experiments.rl.unified.actors.grader.Grader grader{'gpus': 0/2}>] Grader finished scoring: reward_mean=1.3548, reward_std=0.7215
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 0/2}>] os.getpid()=3397374 Trainer starts to train 9 on traj:
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 1/2}>] os.getpid()=3397877 Trainer starts to train 9 on traj:
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 0/2}>] Logprob verification: bitwise_identical=False, max_delta=1.729112e-01, avg_delta=2.079452e-02, tokens_checked=4000
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 1/2}>] Logprob verification: bitwise_identical=False, max_delta=1.729112e-01, avg_delta=2.079452e-02, tokens_checked=4000
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 0/2}>] os.getpid()=3397374 Trainer finish step 10
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 1/2}>] os.getpid()=3397877 Trainer finish step 10
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Updated weights into vLLM engine actor model. Number of parameters: 311
INFO:torchtitan.experiments.rl.unified.models.vllm_wrapper:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Weights already loaded during model initialization.             Returning 311 loaded parameter names to satisfy vLLM safety check.
INFO 02-10 17:07:41 [gpu_model_runner.py:4381] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Reloading and processing weights took 0.01 seconds
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] os.getpid()=3397374 Generator updating weights to policy v10...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Updated weights into vLLM engine actor model. Number of parameters: 311
INFO:torchtitan.experiments.rl.unified.models.vllm_wrapper:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Weights already loaded during model initialization.             Returning 311 loaded parameter names to satisfy vLLM safety check.
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] os.getpid()=3397877 Generator updating weights to policy v10...
INFO:__main__:[actor=<root>] 
Step   9 | Loss: -0.0082 | Reward: +1.355
INFO:__main__:[actor=<root>]   Sample:  Paris, the capital of Italy is Rome, and the capital of Spain is Madrid. The ca...
INFO:__main__:[actor=<root>] 
================================================================================
INFO:__main__:[actor=<root>] RL Training complete
INFO:__main__:[actor=<root>] ================================================================================
/home/jianiw/.conda/envs/pytorch-3.12/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 16, 2026
wwwjn added a commit that referenced this pull request Jan 16, 2026
ghstack-source-id: ba69db2
Pull Request resolved: #2244
@wwwjn wwwjn changed the title refactor scorer and trainer generator actor [rl] refactor scorer and trainer generator actor Jan 16, 2026
@wwwjn wwwjn changed the title [rl] refactor scorer and trainer generator actor [rl] refactor grader and trainer generator actor Jan 20, 2026
wwwjn added a commit that referenced this pull request Jan 20, 2026
ghstack-source-id: f2a8d93
Pull Request resolved: #2244
rewards: torch.Tensor


class Grader(Actor):
Contributor:

Usage of score / grade / reward is a bit arbitrary right now.
Claude tells me:

For general RL work: Stick with "reward function" or "reward model"
For RLHF pipelines: "Reward model" is standard
If choosing between scorer/grader: "Scorer" is slightly more aligned with ML conventions, as it emphasizes the quantitative nature of the output

Comparing scorer and grader, scorer sounds better to me because grader (more or less) suggests discrete reward values only.

Contributor (author):

I don't have a strong opinion here. In our current task the reward is a simple discrete value, but actual rewards can be continuous or more complicated.

metrics = trainer.step.call(batch).get().item(gpus=0)
# Fully sync RL loop with separate scoring step
# 1. Generator produces episode (without rewards)
episode = generator.generate.call().get().item(gpus=0)
Contributor:

There should be a distinction between episode vs. episodes. The generate call seems to return an Episodes object?

Contributor (author):

Yes, the class is named "Episodes", but we are not doing batching here, so each instance of "Episodes" contains only one training prompt, completion, etc. I will rename it to "Episode".

wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: 214011c
Pull Request resolved: #2244
wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: 983404a
Pull Request resolved: #2244
wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: c71835d
Pull Request resolved: #2244
wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: b02cd33
Pull Request resolved: #2244
@wwwjn wwwjn changed the title [rl] refactor grader and trainer generator actor [WIP][rl] refactor grader and trainer generator actor Jan 30, 2026
wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: b91976c
Pull Request resolved: #2244


This looks like a loss file. We should probably put it under /rl/loss. Here is the work I did in Forge; maybe we can just copy/paste? https://github.com/meta-pytorch/torchforge/tree/main/src/forge/rl/loss

If the number of losses looks overwhelming, we can keep just GRPO/DAPO.

Member:

+1 to not having a catchall utils here. For now, even just putting all this in the trainer.py file would be fine IMO while we figure out the best structure for losses.

return total_loss, metrics, batch_token_log_probs


def verify_logprob_identity(


If we log the ratio (i.e., logprob_train / logprob_generator), we can check whether it is always equal to 1. This is present in all losses in Forge.
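The suggested ratio check fits in a few lines. check_logprob_ratio is a hypothetical helper, not code from this PR: since logprobs live in log space, the per-token ratio is exp(logprob_train - logprob_gen), and a ratio far from 1 flags trainer/generator numeric divergence. Given the max_delta of ~1.7e-1 reported in the logs above, a strict check like this would currently fail, which is exactly why logging the ratio is informative.

```python
import math

def check_logprob_ratio(train_logprobs, gen_logprobs, atol=1e-5):
    # Per-token importance ratio: exp(lp_train - lp_gen) should be ~1
    # when trainer and generator numerics agree on the same tokens.
    max_dev = 0.0
    for lp_t, lp_g in zip(train_logprobs, gen_logprobs):
        ratio = math.exp(lp_t - lp_g)
        max_dev = max(max_dev, abs(ratio - 1.0))
    return max_dev <= atol, max_dev

ok, dev = check_logprob_ratio([-1.25, -0.50, -2.00], [-1.25, -0.50, -2.00])
assert ok and dev == 0.0
```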



@dataclass
class Episodes:


I think that this should be in some types.py file
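A minimal types.py along these lines might look like the following. The field set is hypothetical, inferred from the surrounding discussion (one prompt plus its sampled completions and rewards, and a policy version); plain lists stand in for torch.Tensor to keep the sketch dependency-free.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    """One training prompt with its sampled completions and rewards.

    Singular name per the review discussion: no batching happens here.
    """
    prompt: str
    completions: List[str] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)
    policy_version: int = 0

ep = Episode(prompt="The capital of France is", policy_version=9)
ep.completions.append(" Paris")
ep.rewards.append(1.0)
assert len(ep.completions) == len(ep.rewards) == 1
```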

@wwwjn wwwjn requested review from fegin and wconstab as code owners February 14, 2026 00:51
wwwjn added a commit that referenced this pull request Feb 14, 2026
ghstack-source-id: 7ddd39f
Pull Request resolved: #2244
@wwwjn wwwjn changed the title [WIP][rl] refactor grader and trainer generator actor [rl] refactor grader and trainer generator actor Feb 14, 2026
@wwwjn wwwjn changed the title [rl] refactor grader and trainer generator actor [rl] Generator enables TP, using torchtitan as Trainer, add grader for reward calculation Feb 19, 2026
# Parallelism configuration
tensor_parallel_size=generation_config.parallelism.tensor_parallel_degree,
distributed_executor_backend=generation_config.distributed_executor_backend,
distributed_executor_backend="external_launcher",
Contributor:

maybe add comment on why we hardcode this?

Contributor:

yeah we need better documentation everywhere, e.g. to have justifications on why certain things are done -- you can use claude

@tianyu-l (Contributor) left a comment:

At this moment, honestly, the file organization and naming are stopping me from giving any high-quality reviews. I would suggest we add charts and documentation to help everyone understand the structure.


@endpoint
async def update(self, version: int, vllm_compat_state: dict) -> None:
"""Update generate weights.
async def update(self, version: int, all_weights: dict) -> None:
Contributor:

now another update(... dict)

Contributor (author):

This update() updates the Generator's internal state (version number + model weights).

data_parallel_shard_degree = -1
fsdp_reshard_after_forward = "default" # default / never / always
tensor_parallel_degree = 1
tensor_parallel_degree = 2
Contributor:

revert

wwwjn added a commit that referenced this pull request Feb 22, 2026
…ight tying (#2410)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.13.0)
(oldest at bottom):
* #2395
* #2244
* #2221
* #2194
* #2191
* __->__ #2410


This is an alternative fix to
#2402 (comment).

Weight updating between trainer and generator is totally broken because
we called "reload_weights" when updating the weights. reload_weights has
the following steps:

- initialize_layerwise_reload(model): saves the current real GPU tensors
as info.kernel_tensors, and replaces all parameters with meta tensors.
- Call model.load_weights(weights_iter): this function is written by us
and calls set_model_state_dict. Internally, set_model_state_dict tries
to do param.data.copy_(loaded_weight) for each parameter. When the
parameters are meta tensors, the copy is a no-op, so the weights never
get updated.


In this PR:

- Totally bypass reload_weights, and don't load from a file when we update the weights
- Get the model via self.engine.model_executor.driver_worker.get_model()
- Iterate over model.named_parameters() to find the matching parameter by name
- Do param.data.copy_(new_tensor) directly
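The fix described above, copying into live parameters matched by name instead of going through reload_weights, can be sketched as follows. update_weights_inplace is a hypothetical helper demonstrated on a toy nn.Linear, not the actual vLLM wrapper code.

```python
import torch
from torch import nn

def update_weights_inplace(model: nn.Module, new_weights: dict) -> list:
    # Copy new tensors directly into the live parameters, matched by name.
    # Copying into real (non-meta) tensors avoids the silent no-op that
    # broke the set_model_state_dict path described above.
    updated = []
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in new_weights:
                param.data.copy_(new_weights[name])
                updated.append(name)
    return updated

model = nn.Linear(4, 4, bias=False)
names = update_weights_inplace(model, {"weight": torch.ones(4, 4)})
assert names == ["weight"]
assert torch.equal(model.weight.data, torch.ones(4, 4))
```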
device_module.set_device(self.device)

# Initialize distributed
# When running under Monarch, setup_env_for_distributed already
Member:

This is a Monarch Actor - under what circumstances would you not be "running under Monarch"?

return trajectory

@endpoint
async def set_reward_fn(self, reward_fn: Callable) -> None:
Member:

Curious under what circumstances the Grader would get a new reward function?

advantages = self._compute_advantages(episode.rewards)

# Compute reference log probs using frozen ref_model
ref_token_log_probs = []
Member:

Ideally the reference model should be separable from the trainer, due to both memory and performance constraints. We can do that in a follow-up PR if you'd like.
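For context, a group-normalized advantage computation of the kind _compute_advantages likely performs (hypothetical; the actual implementation is not shown in this thread) is just a per-group standardization of rewards in GRPO:

```python
import statistics

def compute_advantages(rewards, eps=1e-6):
    # GRPO-style: standardize rewards within one prompt's sample group,
    # so advantages are zero-mean and roughly unit-scale per group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

adv = compute_advantages([2.0, 1.0, 0.0, 1.0])
assert abs(sum(adv)) < 1e-9  # zero-mean by construction
```

The eps guard matters when all completions in a group receive the same reward, which would otherwise divide by zero.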


wwwjn added a commit that referenced this pull request Feb 25, 2026
ghstack-source-id: 8e9ab95
Pull Request resolved: #2244
wwwjn added a commit that referenced this pull request Feb 26, 2026
ghstack-source-id: 9f5bb36
Pull Request resolved: #2244