
[rl] Use torchtitan config system for inference and simple GRPO#2191

Open
wwwjn wants to merge 17 commits into gh/wwwjn/2/base from gh/wwwjn/2/head

Conversation

@wwwjn
Contributor

@wwwjn wwwjn commented Jan 2, 2026

Stack from ghstack (oldest at bottom):

  1. Add job_config.py to extend the current JobConfig. One open issue: the trainer's config and the generator's config are not symmetric, e.g. Parallelism vs. Generation.parallelism.
  2. Use the job config system as the centralized, source-of-truth config, loading from the run_configs/qwen3_0.6b.toml file.
  3. Refactor the generator to use EngineArgs() and LLMEngine() instead of LLM().
  4. Rename simple_rl_multiprocess -> simple_grpo to be more descriptive.
  5. Clean up unused code branches.

Test (trainer ddp = 2, n_generator = 1):
(screenshot: 2025-12-30 at 7:34 PM)

Follow-up refactors:

  • Refactor 2: vLLM model registration - use setup.py and a plugin instead of an import
  • Refactor 3: weight updater - directly pass the state_dict (DTensor) between trainer and generator
  • Refactor 4: use the torchtitan Trainer and modularize each component

[ghstack-poisoned]
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 2, 2026
@wwwjn wwwjn closed this Jan 2, 2026
@wwwjn wwwjn reopened this Jan 2, 2026
@wwwjn wwwjn changed the title config sys v1 [rl] Using JobConfig as the centralized config system for inference and simple GRPO Jan 2, 2026
Contributor

@allenwang28 allenwang28 left a comment


I like this direction, thanks! Mostly nits here

wwwjn added 2 commits January 12, 2026 16:13
wwwjn added a commit that referenced this pull request Feb 22, 2026
…ight tying (#2410)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.13.0)
(oldest at bottom):
* #2395
* #2244
* #2221
* #2194
* #2191
* __->__ #2410


This is an alternative fix to
#2402 (comment).

Weight updating between the trainer and generator is completely broken because we called reload_weights when updating the weights. reload_weights has the following steps:

- initialize_layerwise_reload(model): saves the current real GPU tensors as info.kernel_tensors and replaces all parameters with meta tensors.
- model.load_weights(weights_iter): this function is written by us and calls set_model_state_dict. Internally, set_model_state_dict does param.data.copy_(loaded_weight) for each parameter. When the parameters are meta tensors, the copy is a no-op, so the weights never get updated.


In this PR:

- Bypass reload_weights entirely; don't load from a file when updating the weights
- Get the model via self.engine.model_executor.driver_worker.get_model()
- Iterate over model.named_parameters() to find the matching parameter by name
- Do param.data.copy_(new_tensor) directly
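The direct-copy approach above can be sketched as follows. This is a minimal illustration on a toy model, not the actual PR code; `update_weights_in_place` is a hypothetical helper name, and in the PR the target model would come from the vLLM engine's driver worker.

```python
# Sketch of updating an existing model's weights in place by matching
# parameter names, as described above. The helper name is hypothetical;
# in the PR the model is obtained via
# self.engine.model_executor.driver_worker.get_model().
import torch
from torch import nn


def update_weights_in_place(model: nn.Module, new_state: dict) -> None:
    params = dict(model.named_parameters())
    with torch.no_grad():
        for name, new_tensor in new_state.items():
            param = params.get(name)
            if param is None:
                continue  # skip names the target model doesn't have
            # Direct copy into the live tensor: no meta-tensor reload
            # path, so the update actually takes effect.
            param.data.copy_(new_tensor.to(dtype=param.dtype))


# Toy usage: push fresh "trainer" weights into a "generator" model.
model = nn.Linear(4, 2, bias=False)
update_weights_in_place(model, {"weight": torch.ones(2, 4)})
assert torch.equal(model.weight.data, torch.ones(2, 4))
```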
[ghstack-poisoned]
@wwwjn wwwjn changed the title [rl] Using JobConfig as the centralized config system for inference and simple GRPO [rl] Use torchtitan config system for inference and simple GRPO Feb 23, 2026
@wwwjn wwwjn mentioned this pull request Feb 23, 2026
token_log_probs: List of per-token log prob lists for each completion
prompt_token_ids: List of prompt token ID lists for each completion
"""
logger.info(
Member

nit: maybe keep this at debug level. It seems like @daniellepintz pointed out these can flood the terminal

Member

Bunch of these throughout generation

Contributor Author

Good point, cleaned up

ddp_size: int = 1,
tp_size: int = 1,
config: Config,
policy_optimization: PolicyOptimizationConfig,
Member

kwargs only after config position

selective_ac_option="op",
),
),
batch_invariant_mode=True,
Member

This is only applied in generator therefore it should be part of the Generator config, not the top-level config

Contributor Author

My understanding is that to achieve true "on-policy" RL, both the trainer and generator sides need to use batch-invariant kernels (so that even if the trainer's and generator's batch sizes differ, they produce exactly the same logprob for each sample). So I made this a top-level config: it's a "mode" we want to switch on/off globally.

self,
config: Config,
*,
model_spec: ModelSpec,
Contributor

Same here - curious what the criteria is for keeping fields here vs. in Config? Should we align with how the rest of torchtitan behaves?

Contributor Author

I passed it here because it is set by the upper-level class's config (here the upper-level class is RLTrainer).

@acisseJZhong
Contributor

thanks for making the changes and rebasing to main, overall it looks good to me! just have a few questions around naming and config structure.

# NOTE: We need to apply parallelize within model.__init__ because vllm
# doesn't separate model creation and parallelism application and instead
# requires parallelization to be done inside model constructor.
cfg = self.parallel_config
Member

This is the trainer config, not the parallel_config, no?

Contributor Author

Nice catch, fixed!


# Ensure weights stay in bfloat16
vllm_state = {
k: v.to(torch.bfloat16) if v.dtype == torch.float32 else v
Contributor

In VLLMEngine.Config you have a dtype field. Isn't it meant to be used for this cast?

Contributor Author

This dtype conversion is from the original code, and the VLLMEngine.Config.dtype field is used to set EngineArgs(dtype=...), which is also bfloat16. The latter field controls the weight dtype of the vLLM model.

My understanding is that a vLLM model's weights are usually set to bfloat16 because: 1) FA kernels only take bfloat16 inputs, and 2) the kv_cache needs to be in bfloat16/float16 (the kv_cache does not support float32 because it takes too much memory).

Once we have control of the forward path, we can control weight/activation dtypes at a finer granularity, but we will need to figure out where to cast to bfloat16 so it stays compatible with vLLM's kv-cache mechanism. For now I simply set the generator model's weights to bfloat16 (and the trainer's weights to bfloat16 when checking bit-wise identity) to bypass this issue, which needs more careful investigation of vLLM internals.

Contributor

I still don't see why they wouldn't agree.

When the engine doesn't exist, you use EngineArgs to create one, with VLLMEngine.Config.dtype. Here the state dict comes from PolicyTrainer to update the existing engine, so it has to align with what's set above.

from torchtitan.experiments.rl.vllm_compat.weights import vllm_to_torchtitan

titan_state = vllm_compat_to_torchtitan(vllm_compat_state)
titan_state = vllm_to_torchtitan(vllm_state)
Contributor

maybe leave a TODO on renaming vllm_to_torchtitan to hint that it's for state-dict conversion.

But actually, I wonder why we don't use our state_dict adapter / checkpointer utils for this conversion.

Contributor Author

wonder why we don't use our state_dict adapter / checkpointer utils for this conversion.

This is achieved by the 3rd PR in this stack (which focuses on removing fqn conversion and saving to file during weight transfer). In this PR I tried to focus on the config system changes and keep each PR easy to read.


self.llm = None
self.tp_size = tp_size
self.engine = None
Contributor

why would VLLMEngine own another engine?

model = model_args.build()
# Set global default dtype to bfloat16. This is needed because vLLM's Attention
# layer uses torch.get_default_dtype() and it doesn't support float32
torch.set_default_dtype(torch.bfloat16)
Contributor

hmm, so the trainer can also only use torch.bfloat16 master weights?

Contributor Author

This is mainly for the bit-wise identity check (vLLM uses bfloat16 weights for the reasons mentioned above). I removed it when bit-wise identity mode is not enabled.


Labels

ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot.


6 participants