
[RL] Add sum digits task and pre and post training evaluation#2423

Open
daniellepintz wants to merge 8 commits into main from
dp/sum_digits

Conversation

@daniellepintz (Contributor):

  • Add a sum-digits task that the model can learn and improve on within 12 training steps
  • Add an evaluate function and run evaluation before and after training
  • Convert the per-step INFO logging to DEBUG logging so loss and rewards are easier to visualize
  • Add per-token normalization of log probs so that long completions do not cause the loss to explode
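The per-token normalization in the last bullet can be sketched as follows; this is an illustrative pure-Python version, and the names (`policy_loss`, `masks`, `advantages`) are assumptions rather than the PR's actual code:

```python
def policy_loss(logprobs, masks, advantages):
    """REINFORCE-style loss with per-token log-prob normalization.

    logprobs:   per-completion lists of sampled-token log probs
    masks:      parallel 0/1 lists (1 = real completion token, 0 = padding)
    advantages: one scalar advantage per completion

    Dividing each completion's summed log prob by its token count keeps
    long completions from contributing outsized log-prob sums and
    blowing up the loss.
    """
    total = 0.0
    for lp, mask, adv in zip(logprobs, masks, advantages):
        n_tokens = max(sum(mask), 1)
        seq_logprob = sum(l * m for l, m in zip(lp, mask)) / n_tokens
        total += -seq_logprob * adv
    return total / len(advantages)
```

With this normalization, a completion twice as long but with the same per-token log probs contributes the same loss, instead of twice as much.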

Final result:

[2026-02-23 06:44:14] INFO simple_rl_multiprocess.py:205: [actor=<root>] Evaluating pre-training baseline...
[2026-02-23 06:44:24] INFO simple_rl_multiprocess.py:87: [actor=<root>] Eval: Accuracy=20% (4/20) Format=30% (6/20)
[2026-02-23 06:44:24] INFO simple_rl_multiprocess.py:209: [actor=<root>] ================================================================================
[2026-02-23 06:44:24] INFO simple_rl_multiprocess.py:210: [actor=<root>] Starting RL training for 12 steps
[2026-02-23 06:44:24] INFO simple_rl_multiprocess.py:211: [actor=<root>] ================================================================================
NCCL version 2.28.9+cuda12.9
/home/daniellepintz/torchtitan/titan-rl-env/lib/python3.12/site-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  return func(*args, **kwargs)
[2026-02-23 06:45:06] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   0 | Loss: -0.0057 | Reward: -0.450 | Correct: 11/40
[2026-02-23 06:45:41] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   1 | Loss: -0.0051 | Reward: -0.600 | Correct: 8/40
[2026-02-23 06:46:19] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   2 | Loss: -0.0049 | Reward: -0.100 | Correct: 18/40
[2026-02-23 06:46:55] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   3 | Loss: -0.0047 | Reward: +0.205 | Correct: 24/40
[2026-02-23 06:47:31] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   4 | Loss: -0.0051 | Reward: +0.100 | Correct: 22/40
[2026-02-23 06:48:07] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   5 | Loss: -0.0046 | Reward: +0.000 | Correct: 20/40
[2026-02-23 06:48:42] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   6 | Loss: -0.0045 | Reward: +0.100 | Correct: 22/40
[2026-02-23 06:49:17] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   7 | Loss: -0.0043 | Reward: +0.100 | Correct: 22/40
[2026-02-23 06:49:52] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   8 | Loss: -0.0039 | Reward: +0.450 | Correct: 29/40
[2026-02-23 06:50:28] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step   9 | Loss: -0.0039 | Reward: +0.400 | Correct: 28/40
[2026-02-23 06:51:03] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step  10 | Loss: -0.0035 | Reward: +0.650 | Correct: 33/40
[2026-02-23 06:51:39] INFO simple_rl_multiprocess.py:222: [actor=<root>] Step  11 | Loss: -0.0034 | Reward: +0.550 | Correct: 31/40
[2026-02-23 06:51:39] INFO simple_rl_multiprocess.py:249: [actor=<root>] RL Training complete
[2026-02-23 06:51:39] INFO simple_rl_multiprocess.py:250: [actor=<root>] Evaluating post-training performance...
[2026-02-23 06:51:50] INFO simple_rl_multiprocess.py:87: [actor=<root>] Eval: Accuracy=70% (14/20) Format=100% (20/20)
[2026-02-23 06:51:50] INFO simple_rl_multiprocess.py:253: [actor=<root>] ================================================================================
[2026-02-23 06:51:50] INFO simple_rl_multiprocess.py:254: [actor=<root>] Pre-training:  Accuracy=20% (4/20) Format=30% (6/20)
[2026-02-23 06:51:50] INFO simple_rl_multiprocess.py:258: [actor=<root>] Post-training: Accuracy=70% (14/20) Format=100% (20/20)
[2026-02-23 06:51:50] INFO simple_rl_multiprocess.py:262: [actor=<root>] ================================================================================
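The Accuracy/Format numbers in the eval lines above could be computed roughly as in this sketch; the `evaluate` name matches the log, but the `Answer: <n>` parsing convention is an assumption for illustration:

```python
import re

def evaluate(completions, targets):
    """Score completions on the digit-sum task.

    format:   fraction of replies containing a parseable final answer
    accuracy: fraction of replies whose parsed answer matches the target
    """
    correct = 0
    formatted = 0
    for text, target in zip(completions, targets):
        match = re.search(r"Answer:\s*(-?\d+)", text)
        if match:
            formatted += 1
            if int(match.group(1)) == target:
                correct += 1
    n = len(targets)
    return {"accuracy": correct / n, "format": formatted / n}
```

Accuracy is bounded above by format, which matches the logs: a reply that cannot be parsed can never be counted correct.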

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 23, 2026
@daniellepintz daniellepintz changed the title Dp/sum digits [RL] Add sum digits task and pre and post training evaluation Feb 23, 2026
-outputs = self.llm.generate(prompt_texts, sampling_params)
+outputs = self.llm.generate(prompt_texts, sampling_params, use_tqdm=False)
daniellepintz (author):

changed to avoid spamming terminal on each step


# Vibe coding
.claude

daniellepintz (author):

will remove this change before landing

 # only generator rank 0 saves the weight
 if torch.distributed.get_rank() == 0:
-    logger.info(f"Saving weights to {checkpoint_path}")
+    logger.debug(f"Saving weights to {checkpoint_path}")
daniellepintz (author):

changed to avoid spamming terminal on each step

-num_steps = 10
+num_steps = 12
 learning_rate = 1e-5
 max_new_tokens = 20
Member:

What's the reasoning behind this exact number of steps?

daniellepintz (author):

This is what I experimented with and it worked well, but it is somewhat arbitrary; I can change it to something else if you prefer.

task_spec = SumDigitsSpec(seed=42)
system_prompt = task_spec.get_system_prompt()

prompt_texts = []
@joecummings (Member), Feb 23, 2026:

Here's my understanding of the flow here:

  1. prompt_text of size num_prompt is gathered. Here it's 5.
  2. These prompt texts are passed to the generator at instantiation time
  3. Every step, the generator uses the SAME prompt texts to generate completions. No new prompts are sampled from the dataloader (task)

This means that while completions will vary slightly from step to step because of the high temperature, the model is only ever trained to respond to the same 5 prompts. So num_steps is really the number of PPO epochs, and the effective number of training steps is 1. I don't think this is what we want, as it doesn't follow the vanilla GRPO formulation.

It looks as though this has been here since before these changes, so please let me know if I'm missing anything here @wwwjn, but I don't immediately see where new prompts are passed to the generator.
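A minimal sketch of the fix being requested here: sample a fresh prompt batch on every step instead of reusing the batch passed at generator construction time. All names (`task.sample`, `generator.generate`, `trainer.step`) are hypothetical stand-ins:

```python
def train(task, generator, trainer, num_steps, num_prompts):
    """Vanilla GRPO-style outer loop: new prompts on every step."""
    for _ in range(num_steps):
        # Draw fresh (prompt, target) pairs from the task each step,
        # rather than training repeatedly on one fixed prompt set.
        batch = [task.sample() for _ in range(num_prompts)]
        prompts = [prompt for prompt, _target in batch]
        completions = generator.generate(prompts)
        trainer.step(batch, completions)
```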

daniellepintz (author):

Great catch, thanks Joe!

I fixed this, and luckily results are still good:

[2026-02-23 07:59:38] INFO simple_rl_multiprocess.py:237: [actor=<root>] Pre-training:  Accuracy=45% (9/20) Format=45% (9/20)
[2026-02-23 07:59:38] INFO simple_rl_multiprocess.py:241: [actor=<root>] Post-training: Accuracy=70% (14/20) Format=95% (19/20)

@allenwang28 (Contributor) left a comment:

Thanks @daniellepintz!

@@ -0,0 +1,116 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor:

A nit on file placement, but would it make more sense to put this in torchtitan/experimental/rl rather than only in unified?

@daniellepintz (author), Feb 23, 2026:

Yes, that makes sense; updated.

My thinking was that we are eventually going to collapse rl/unified and rl/vllm_compat anyway to live just under rl/. @wwwjn, is that the ultimate direction or not?

@wwwjn (Contributor):

I actually plan to remove vllm_compat once we support bit-wise identical models on the unified path, so I guess it makes sense to put sum_digits.py under unified.

Contributor:

Similar to @joecummings' comment, I would expect that we iterate over the prompts as a dataloader and pass prompts to the generator here.

@wconstab (Contributor):

nit: this caught my eye; ping me on workchat if you want help figuring out how to squelch it:

/home/daniellepintz/torchtitan/titan-rl-env/lib/python3.12/site-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify device_id in init_process_group to mute this warning.

@wwwjn (Contributor) left a comment:

Thanks! This PR might need a rebase after #2191 landed.


logger.info(f"Loaded {len(prompt_texts)} prompts")
# Task spec
task_spec = SumDigitsSpec(seed=42)
system_prompt = task_spec.get_system_prompt()
Contributor:

The concept of a task is actually the same as a dataloader plus reward calculation. Once we rebase onto the config system change, we will need a proper config for Task as well. I think we are moving toward the "everything is configurable" idea.
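The "task = dataloader + reward calculation" idea can be sketched as one object that both samples prompts and scores completions. This interface and the reward magnitudes are illustrative and only loosely mirror the PR's `SumDigitsSpec`:

```python
import random
import re

class SumDigitsSpec:
    """Bundles prompt sampling (the dataloader role) with reward
    calculation. Illustrative sketch, not the PR's actual class."""

    def __init__(self, seed=42):
        self.rng = random.Random(seed)

    def get_system_prompt(self):
        return "Reply with the digit sum in the form 'Answer: <n>'."

    def sample(self):
        """Return a fresh (prompt, target) pair."""
        n = self.rng.randint(100, 99999)
        return f"Sum the digits of {n}.", sum(int(d) for d in str(n))

    def reward(self, completion, target):
        """Illustrative magnitudes: format failure is penalized harder
        than a wrong but well-formed answer."""
        match = re.search(r"Answer:\s*(-?\d+)", completion)
        if match is None:
            return -1.0
        return 1.0 if int(match.group(1)) == target else -0.5
```

Making such a spec a config-selectable component would fit the "everything is configurable" direction described above.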


Labels

ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot.


5 participants