
[rl][combo] Refactor simple RL loop with torchtitan components#2443

Open
wwwjn wants to merge 4 commits into main from refactor-rl

Conversation


@wwwjn wwwjn commented Feb 26, 2026

This PR is a combination of the following 4 PRs:
#2244
#2221
#2194
#2191

Current status:

  • Applied the same parallelism on trainer / generator, e.g., Trainer (TP=2) and Generator (TP=2)
    • NOTE: we now have a strong assumption that the trainer and generator must use the same parallelism and be collocated. This is because we unwrap the DTensor and assume we can always wrap it back with the same device mesh and placement.
    • We should be able to remove this constraint once we have more powerful weight sync. cc @daniellepintz
  • Weight transfer: simply unwrap the DTensor in the trainer before sending
  • Collocated trainer and generator

Next Step:

  • Patch the batch-invariant kernel; test whether the trainer / generator is batch-invariant
  • Support different parallelisms on the trainer / generator sides
  • Add CI as a guard

ghstack-source-id: d49aef3
Pull Request resolved: #2191

config sys for simple_grpo v1

ghstack-source-id: d49aef3
Pull Request resolved: #2192

config sys for simple_grpo v2

ghstack-source-id: d49aef3
Pull Request resolved: #2193
ghstack-source-id: 19d7ad4
Pull Request resolved: #2194
ghstack-source-id: f826258
Pull Request resolved: #2221
ghstack-source-id: 9f5bb36
Pull Request resolved: #2244
@meta-cla bot added the CLA Signed label Feb 26, 2026
@wwwjn changed the title from "[rl" to "[rl] Refactor simple RL loop with torchtitan components" Feb 26, 2026
@wwwjn changed the title from "[rl] Refactor simple RL loop with torchtitan components" to "[rl][combo] Refactor simple RL loop with torchtitan components" Feb 26, 2026
@wwwjn wwwjn requested review from Lucaskabela, acisseJZhong and tianyu-l and removed request for tianyu-l February 26, 2026 00:09


@dataclass(kw_only=True, slots=True)
class PolicyOptimizationConfig:
Contributor:

If it's GRPO-specific, the name should reflect that. Also, a better place to put this might be simple_grpo.py.



@dataclass(kw_only=True, slots=True)
class VLLMSamplingConfig:
Contributor:

This could be put into generator.py if that's the only place using it.


# Model-agnostic name used for vLLM model registration.
# Must match the hf_overrides["architectures"] value passed to EngineArgs.
VLLM_MODEL_NAME = "TorchTitanForCausalLM"
Contributor:

What does the "For" mean?

Suggested change
VLLM_MODEL_NAME = "TorchTitanForCausalLM"
VLLM_MODEL_NAME = "TorchTitanCausalLM"

Usage:
from torchtitan.experiments.rl.unified.plugin import register
register(model_spec)
Contributor:

need to update


config = ConfigManager().parse_args()

# Patch model_spec to use the RL-specific parallelize function.
Contributor:

Most things below should be put into RLTrainer, after config.build() gives you an instance of RLTrainer.

from torchtitan.experiments.rl.vllm_compat.weights_vllm_compat import (
vllm_compat_to_torchtitan,
)
vllm_attention_backend: str = "FLASH_ATTN"
Contributor:

If you call it VLLMGenerator.Config, you no longer need the "vllm_" prefix.

)
self.config = model_spec.model
logger.debug(f"Creating model with config: {self.config}")
self.model = self.config.build()
Contributor:

You may need to do meta init when the model is large; OK to put a TODO.
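For context, the meta-device init suggested here typically looks like the following sketch (the model and state dict are illustrative, not the PR's code):

```python
import torch
import torch.nn as nn

# Build the model on the meta device: no memory is allocated for weights,
# so even a very large model can be constructed instantly.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

# Later, materialize empty (uninitialized) storage on the real device...
model = model.to_empty(device="cpu")

# ...and fill it from a checkpoint instead of random init.
# Here a dummy all-zeros state dict stands in for real checkpoint weights.
state_dict = {k: torch.zeros_like(v) for k, v in model.state_dict().items()}
model.load_state_dict(state_dict)
```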

max_position = 0

rope_cache = self._extend_rope_cache_if_needed(rope_attr, max_position)
rope_cache = self._extend_rope_cache_if_needed(self.model.freqs_cis, max_position)
Contributor:

I think we should build this capability into torchtitan models. cc @fegin @shuhuayu

Contributor:

@wwwjn maybe make an issue
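For context, a Llama-style complex freqs_cis cache and the kind of growth _extend_rope_cache_if_needed implies can be sketched as follows. Both function bodies are illustrative assumptions, not the PR's implementation:

```python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard Llama-style complex rotary cache of shape (end, dim // 2).
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(end, dtype=torch.float32)
    return torch.polar(torch.ones(end, dim // 2), torch.outer(t, freqs))

def extend_rope_cache_if_needed(
    freqs_cis: torch.Tensor, max_position: int, dim: int, theta: float = 10000.0
) -> torch.Tensor:
    # Recompute a longer cache when generation runs past the cached length.
    if max_position < freqs_cis.shape[0]:
        return freqs_cis
    return precompute_freqs_cis(dim, max_position + 1, theta)

cache = precompute_freqs_cis(dim=64, end=128)      # covers positions 0..127
cache = extend_rope_cache_if_needed(cache, max_position=256, dim=64)
```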

def load_weights(self, weights_iter):
"""
vLLM required API.
Load weights from HF checkpoint using the provided state dict adapter.
Contributor:

Is this comment meaningful?

self.config.layer.attention.n_heads % tp_size == 0
), "Only supported when n_heads is divisible by tp_size"

replace_with_vllm_attention(self.model, tp_degree=tp_size)
Contributor:

Ideally we should do this on the config, rather than on the model. The vllm attention module needs to support build from config.



class VLLMRolloutEngine:
class Generator(Actor, Configurable):
Member:

nit -> VllmGenerator


titan_state = vllm_compat_to_torchtitan(vllm_compat_state)
self._direct_weight_update(titan_state)
vllm_gpu_memory_limit: float = 0.5
Member:

I think in another comment, you said the trainer and generator were not co-located, in which case, why are we limiting this to 0.5? The default is at least 0.9 IIRC.

Contributor (Author):

In the current PR, the trainer and generator are collocated.

token_log_probs_list,
prompt_token_ids_list,
# Register TorchTitan model with vLLM before any engine creation
from torchtitan.experiments.rl.unified.plugin import (
Member:

Why is this a nested import?

async def generate(self) -> None:
"""Generate trajectories and compute rewards/advantages."""
logger.info(
async def generate(self) -> Episode:
Member:

Just a note: this will change in #2423 to allow proper passing of the prompt to the generate function, so no one needs to call this out.

lambda: self.state == GeneratorState.READY_TO_GENERATE
)

with torch.no_grad():
Member:

Is it actually necessary to do this under torch.no_grad? I don't see that pattern in other vLLM use cases.
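For context, the practical effect of torch.no_grad here would just be to skip autograd graph construction during the rollout forward pass; a minimal illustration:

```python
import torch

model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

y = model(x)
assert y.requires_grad  # a normal forward builds the autograd graph

with torch.no_grad():
    y = model(x)
assert not y.requires_grad  # no graph is kept, saving activation memory
```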

return episode

@endpoint
async def set_reward_fn(self, reward_fn: Callable) -> None:
Member:

Apologies for the repeat comment, but when might we want to use this function?

model_spec: ModelSpec,
policy_optimization: PolicyOptimizationConfig,
batch_invariant_mode: bool,
hf_assets_path: str = "./tests/assets/tokenizer",
Member:

Is there a way to default to None or to a common location rather than a test asset?

dcp.load(hf_state_dict, storage_reader=storage_reader)
torchtitan_state_dict = self.sd_adapter.from_hf(hf_state_dict)

from torch.distributed.checkpoint.state_dict import (
Member:

Why is this a nested import?


# Compute loss
loss, loss_metrics = compute_policy_gradient_loss_vllm(
loss, loss_metrics, batch_token_log_probs = compute_policy_gradient_loss(
Member:

Can we compute the forward_backward first and then pass the result to the loss, so the loss function computes only the loss itself? This would match the eventual pattern we want with Trainer.

Member:

We can even put it directly in trainer.py for now until we have a more generic loss abstraction
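For context, the group-relative advantage at the heart of a GRPO-style policy-gradient loss is a small, separable computation; a hedged sketch (grpo_advantages is illustrative, not the PR's compute_policy_gradient_loss):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -- one group of rollouts per prompt.
    Normalize each reward against its own group's mean and std."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled completions with binary rewards.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)
```

Keeping this as a pure function of the rewards is what makes it easy to move the forward/backward out of the loss, as suggested above.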

Labels: ciflow/8gpu, CLA Signed