
[rl] Generator enables TP, using torchtitan as Trainer, add grader for reward calculation#2244

Open
wwwjn wants to merge 20 commits into gh/wwwjn/7/base from gh/wwwjn/7/head

Conversation


@wwwjn wwwjn commented Jan 16, 2026

Stack from ghstack (oldest at bottom):

Current status:

  • Applied the same parallelism on trainer / generator, e.g., Trainer (TP=2) and Generator (TP=2)
    • NOTE: for now we have a strong assumption that the trainer and generator must use the same parallelism and be collocated. This is because we unwrap the DTensor and assume we can always wrap it back with the same device mesh and placements.
    • We should be able to remove this constraint once we have more powerful weight sync. cc @daniellepintz
  • Weight transfer: simply unwrap the DTensor in the trainer before sending
  • Collocated trainer and generator
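The unwrap-and-rewrap contract described above can be sketched as follows. This is a hypothetical illustration (pack_for_send / unpack_after_recv are made-up names, not code from this PR), run here with plain tensors so no device mesh is needed; comments mark where DTensor.to_local / DTensor.from_local would apply in the real flow.

```python
import torch

def pack_for_send(state_dict):
    # Unwrap any DTensor to its local shard before transfer.
    # A real DTensor exposes .to_local(); plain tensors pass through.
    return {
        name: (t.to_local() if hasattr(t, "to_local") else t)
        for name, t in state_dict.items()
    }

def unpack_after_recv(local_state, mesh=None, placements=None):
    # Re-wrap each shard with the SAME device mesh and placements.
    # This is the strong collocation assumption noted above: with a real
    # mesh this would be DTensor.from_local(t, mesh, placements).
    if mesh is None:
        return dict(local_state)  # plain-tensor fallback for this sketch
    from torch.distributed.tensor import DTensor
    return {
        name: DTensor.from_local(t, mesh, placements)
        for name, t in local_state.items()
    }

# Round-trip with plain tensors: send side unwraps, receive side rewraps.
sd = {"w": torch.arange(4.0).reshape(2, 2)}
restored = unpack_after_recv(pack_for_send(sd))
assert torch.equal(restored["w"], sd["w"])
```

Once the trainer and generator can differ in parallelism, the rewrap step would need a resharding path instead of reusing the sender's mesh and placements.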

Next Step:

  • Patch the batch-invariant kernel; test whether the trainer / generator is batch-invariant
  • Investigate collective communication kernel when
  • CI as a guard

Toy GRPO training:

INFO:__main__:[actor=<root>] 
Step   8 | Loss: -0.0078 | Reward: +1.352
INFO:__main__:[actor=<root>]   Sample:  Paris, the capital of Italy is Rome, and the capital of Spain is Madrid. The ca...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] os.getpid()=3397374 Generating start generate (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Starting generation: 5 prompts, n_samples_per_prompt=8, max_tokens=100, temp=0.8
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] os.getpid()=3397877 Generating start generate (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Starting generation: 5 prompts, n_samples_per_prompt=8, max_tokens=100, temp=0.8
INFO 02-10 17:06:33 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:33 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:44 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 26.9 tokens/s, Avg generation throughput: 77.0 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:44 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 26.9 tokens/s, Avg generation throughput: 76.7 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:54 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.2 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
INFO 02-10 17:06:54 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.4 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
INFO 02-10 17:07:05 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.5 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
INFO 02-10 17:07:05 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.4 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
INFO 02-10 17:07:15 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.3 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
INFO 02-10 17:07:15 [loggers.py:259] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 88.3 tokens/s, Running: 40 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Generated 40 completions for scoring
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] os.getpid()=3397877 Generating finish generate (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Generated 40 completions for scoring
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] os.getpid()=3397374 Generating finish generate (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.grader:[actor=<root>.<torchtitan.experiments.rl.unified.actors.grader.Grader grader{'gpus': 1/2}>] Grader scoring trajectory (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.grader:[actor=<root>.<torchtitan.experiments.rl.unified.actors.grader.Grader grader{'gpus': 0/2}>] Grader scoring trajectory (policy v9)...
INFO:torchtitan.experiments.rl.unified.actors.grader:[actor=<root>.<torchtitan.experiments.rl.unified.actors.grader.Grader grader{'gpus': 1/2}>] Grader finished scoring: reward_mean=1.3548, reward_std=0.7215
INFO:torchtitan.experiments.rl.unified.actors.grader:[actor=<root>.<torchtitan.experiments.rl.unified.actors.grader.Grader grader{'gpus': 0/2}>] Grader finished scoring: reward_mean=1.3548, reward_std=0.7215
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 0/2}>] os.getpid()=3397374 Trainer starts to train 9 on traj:
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 1/2}>] os.getpid()=3397877 Trainer starts to train 9 on traj:
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 0/2}>] Logprob verification: bitwise_identical=False, max_delta=1.729112e-01, avg_delta=2.079452e-02, tokens_checked=4000
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 1/2}>] Logprob verification: bitwise_identical=False, max_delta=1.729112e-01, avg_delta=2.079452e-02, tokens_checked=4000
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 0/2}>] os.getpid()=3397374 Trainer finish step 10
INFO:torchtitan.experiments.rl.unified.actors.trainer:[actor=<root>.<torchtitan.experiments.rl.unified.actors.trainer.Trainer trainer{'gpus': 1/2}>] os.getpid()=3397877 Trainer finish step 10
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Updated weights into vLLM engine actor model. Number of parameters: 311
INFO:torchtitan.experiments.rl.unified.models.vllm_wrapper:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Weights already loaded during model initialization.             Returning 311 loaded parameter names to satisfy vLLM safety check.
INFO 02-10 17:07:41 [gpu_model_runner.py:4381] [actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] Reloading and processing weights took 0.01 seconds
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 0/2}>] os.getpid()=3397374 Generator updating weights to policy v10...
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Updated weights into vLLM engine actor model. Number of parameters: 311
INFO:torchtitan.experiments.rl.unified.models.vllm_wrapper:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] Weights already loaded during model initialization.             Returning 311 loaded parameter names to satisfy vLLM safety check.
INFO:torchtitan.experiments.rl.unified.actors.generator:[actor=<root>.<torchtitan.experiments.rl.unified.actors.generator.Generator generator{'gpus': 1/2}>] os.getpid()=3397877 Generator updating weights to policy v10...
INFO:__main__:[actor=<root>] 
Step   9 | Loss: -0.0082 | Reward: +1.355
INFO:__main__:[actor=<root>]   Sample:  Paris, the capital of Italy is Rome, and the capital of Spain is Madrid. The ca...
INFO:__main__:[actor=<root>] 
================================================================================
INFO:__main__:[actor=<root>] RL Training complete
INFO:__main__:[actor=<root>] ================================================================================
/home/jianiw/.conda/envs/pytorch-3.12/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 16, 2026
wwwjn added a commit that referenced this pull request Jan 16, 2026
ghstack-source-id: ba69db2
Pull Request resolved: #2244
@wwwjn wwwjn changed the title refactor scorer and trainer generator actor [rl] refactor scorer and trainer generator actor Jan 16, 2026
@wwwjn wwwjn changed the title [rl] refactor scorer and trainer generator actor [rl] refactor grader and trainer generator actor Jan 20, 2026
wwwjn added a commit that referenced this pull request Jan 20, 2026
ghstack-source-id: f2a8d93
Pull Request resolved: #2244
rewards: torch.Tensor


class Grader(Actor):
Contributor:

Usage of score / grade / reward is a bit arbitrary right now.
Claude tells me:

For general RL work: Stick with "reward function" or "reward model"
For RLHF pipelines: "Reward model" is standard
If choosing between scorer/grader: "Scorer" is slightly more aligned with ML conventions, as it emphasizes the quantitative nature of the output

Comparing scorer and grader, scorer sounds better to me because grader (more or less) suggests discrete reward values only.

Contributor (author):

I don't have a strong opinion here. In our current task the reward is a simple discrete value, but actual rewards can be continuous or more complicated.

metrics = trainer.step.call(batch).get().item(gpus=0)
# Fully sync RL loop with separate scoring step
# 1. Generator produces episode (without rewards)
episode = generator.generate.call().get().item(gpus=0)
Contributor:

There should be a distinction between episode vs. episodes. The generate call seems to return an Episodes object?

Contributor (author):

Yes, the class is named "Episodes", but we are not doing batching here, so each instance of "Episodes" contains only one training prompt, completion, etc. I will rename it to "Episode".

wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: 214011c
Pull Request resolved: #2244
wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: 983404a
Pull Request resolved: #2244
wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: c71835d
Pull Request resolved: #2244
wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: b02cd33
Pull Request resolved: #2244
@wwwjn wwwjn changed the title [rl] refactor grader and trainer generator actor [WIP][rl] refactor grader and trainer generator actor Jan 30, 2026
wwwjn added a commit that referenced this pull request Jan 30, 2026
ghstack-source-id: b91976c
Pull Request resolved: #2244


This looks like a loss file. We should probably put it under /rl/loss. Here is the work I did in Forge; maybe we can just copy/paste? https://github.com/meta-pytorch/torchforge/tree/main/src/forge/rl/loss

If the number of losses looks overwhelming, we can keep just GRPO/DAPO.

Member:

+1 to not having a catchall utils here. For now, even just putting all this in the trainer.py file would be fine IMO while we figure out the best structure for losses.

return total_loss, metrics, batch_token_log_probs


def verify_logprob_identity(


If we log the ratio (i.e., logprob_train / logprob_generator), we can check whether it is always equal to 1. This is present in all losses in Forge.
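The suggested ratio check fits in a few lines. check_logprob_ratio is a hypothetical helper, not code from this PR: since logprobs live in log space, the per-token ratio is exp(logprob_train - logprob_gen), and a ratio far from 1 flags trainer/generator numeric divergence. Given the max_delta of ~1.7e-1 reported in the logs above, a strict check like this would currently fail, which is exactly why logging the ratio is informative.

```python
import math

def check_logprob_ratio(train_logprobs, gen_logprobs, atol=1e-5):
    # Per-token importance ratio: exp(lp_train - lp_gen) should be ~1
    # when trainer and generator numerics agree on the same tokens.
    max_dev = 0.0
    for lp_t, lp_g in zip(train_logprobs, gen_logprobs):
        ratio = math.exp(lp_t - lp_g)
        max_dev = max(max_dev, abs(ratio - 1.0))
    return max_dev <= atol, max_dev

ok, dev = check_logprob_ratio([-1.25, -0.50, -2.00], [-1.25, -0.50, -2.00])
assert ok and dev == 0.0
```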



@dataclass
class Episodes:


I think that this should be in some types.py file
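A minimal types.py along these lines might look like the following. The field set is hypothetical, inferred from the surrounding discussion (one prompt plus its sampled completions and rewards, and a policy version); plain lists stand in for torch.Tensor to keep the sketch dependency-free.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    """One training prompt with its sampled completions and rewards.

    Singular name per the review discussion: no batching happens here.
    """
    prompt: str
    completions: List[str] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)
    policy_version: int = 0

ep = Episode(prompt="The capital of France is", policy_version=9)
ep.completions.append(" Paris")
ep.rewards.append(1.0)
assert len(ep.completions) == len(ep.rewards) == 1
```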

@wwwjn wwwjn requested review from fegin and wconstab as code owners February 14, 2026 00:51
wwwjn added a commit that referenced this pull request Feb 14, 2026
ghstack-source-id: 7ddd39f
Pull Request resolved: #2244
@wwwjn wwwjn changed the title [WIP][rl] refactor grader and trainer generator actor [rl] refactor grader and trainer generator actor Feb 14, 2026
@wwwjn wwwjn changed the title [rl] refactor grader and trainer generator actor [rl] Generator enables TP, using torchtitan as Trainer, add grader for reward calculation Feb 19, 2026
# Parallelism configuration
tensor_parallel_size=generation_config.parallelism.tensor_parallel_degree,
distributed_executor_backend=generation_config.distributed_executor_backend,
distributed_executor_backend="external_launcher",
Contributor:

maybe add comment on why we hardcode this?

Contributor:

yeah we need better documentation everywhere, e.g. to have justifications on why certain things are done -- you can use claude

@tianyu-l (Contributor) left a comment:

At this moment, honestly, the file organization and naming are stopping me from giving any high-quality reviews. I would suggest we add charts and documentation to help everyone understand the structure.


@endpoint
async def update(self, version: int, vllm_compat_state: dict) -> None:
"""Update generate weights.
async def update(self, version: int, all_weights: dict) -> None:
Contributor:

now another update(... dict)

Contributor (author):

This update() updates the Generator's internal state (version number + model weights).

data_parallel_shard_degree = -1
fsdp_reshard_after_forward = "default" # default / never / always
tensor_parallel_degree = 1
tensor_parallel_degree = 2
Contributor:

revert

wwwjn added a commit that referenced this pull request Feb 22, 2026
…ight tying (#2410)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.13.0)
(oldest at bottom):
* #2395
* #2244
* #2221
* #2194
* #2191
* __->__ #2410


This is an alternative fix to
#2402 (comment).

Weight updating between trainer and generator is totally broken because
we called "reload_weights" when updating the weights. reload_weights has
the following steps:

- initialize_layerwise_reload(model): saves the current real GPU tensors
as info.kernel_tensors, and replaces all parameters with meta tensors.
- Call model.load_weights(weights_iter): this function is written by us
and calls set_model_state_dict. Internally, set_model_state_dict tries
to do param.data.copy_(loaded_weight) for each parameter. When the
parameters are meta tensors, the copy is a no-op, so the weights never
get updated.


In this PR:

- Totally bypass reload_weights, and don't load from a file when we update the weights
- Get the model via self.engine.model_executor.driver_worker.get_model()
- Iterate over model.named_parameters() to find the matching parameter by name
- Do param.data.copy_(new_tensor) directly
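The fix described above, copying into live parameters matched by name instead of going through reload_weights, can be sketched as follows. update_weights_inplace is a hypothetical helper demonstrated on a toy nn.Linear, not the actual vLLM wrapper code.

```python
import torch
from torch import nn

def update_weights_inplace(model: nn.Module, new_weights: dict) -> list:
    # Copy new tensors directly into the live parameters, matched by name.
    # Copying into real (non-meta) tensors avoids the silent no-op that
    # broke the set_model_state_dict path described above.
    updated = []
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in new_weights:
                param.data.copy_(new_weights[name])
                updated.append(name)
    return updated

model = nn.Linear(4, 4, bias=False)
names = update_weights_inplace(model, {"weight": torch.ones(4, 4)})
assert names == ["weight"]
assert torch.equal(model.weight.data, torch.ones(4, 4))
```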
device_module.set_device(self.device)

# Initialize distributed
# When running under Monarch, setup_env_for_distributed already
Member:

This is a Monarch Actor - under what circumstances would you not be "running under Monarch"?

return trajectory

@endpoint
async def set_reward_fn(self, reward_fn: Callable) -> None:
Member:

Curious under what circumstances the Grader would get a new reward function?

advantages = self._compute_advantages(episode.rewards)

# Compute reference log probs using frozen ref_model
ref_token_log_probs = []
Member:

Ideally the reference model should be separable from the trainer, due to both memory and performance constraints. We can do that in a follow-up PR if you'd like.
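For context, a group-normalized advantage computation of the kind _compute_advantages likely performs (hypothetical; the actual implementation is not shown in this thread) is just a per-group standardization of rewards in GRPO:

```python
import statistics

def compute_advantages(rewards, eps=1e-6):
    # GRPO-style: standardize rewards within one prompt's sample group,
    # so advantages are zero-mean and roughly unit-scale per group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

adv = compute_advantages([2.0, 1.0, 0.0, 1.0])
assert abs(sum(adv)) < 1e-9  # zero-mean by construction
```

The eps guard matters when all completions in a group receive the same reward, which would otherwise divide by zero.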


wwwjn added a commit that referenced this pull request Feb 25, 2026
ghstack-source-id: 8e9ab95
Pull Request resolved: #2244
wwwjn added a commit that referenced this pull request Feb 26, 2026
ghstack-source-id: 9f5bb36
Pull Request resolved: #2244