[rl] Generator enables TP, using torchtitan as Trainer, add grader for reward calculation #2244
wwwjn wants to merge 20 commits into gh/wwwjn/7/base
Conversation
| rewards: torch.Tensor |
| class Grader(Actor): |
Usage of score / grade / reward is a bit arbitrary right now.
claude tells me:
For general RL work: Stick with "reward function" or "reward model"
For RLHF pipelines: "Reward model" is standard
If choosing between scorer/grader: "Scorer" is slightly more aligned with ML conventions, as it emphasizes the quantitative nature of the output
Comparing scorer and grader, "scorer" sounds better to me because "grader" (more or less) suggests a discrete reward value only.
I don't have a strong opinion here. In our current task the reward value is a simple discrete value, but in general a reward can be continuous or more complicated.
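To make the discrete-vs-continuous distinction concrete, here is a minimal sketch of a reward-function interface. The `RewardFn` alias and both helpers are illustrative placeholders, not this PR's actual API:

```python
from typing import Callable

# RewardFn maps (prompt, completion) -> scalar reward.
# These names are illustrative, not the API in this PR.
RewardFn = Callable[[str, str], float]

def exact_match_reward(prompt: str, completion: str, target: str = "42") -> float:
    """Discrete reward: 1.0 iff the completion contains the target string."""
    return 1.0 if target in completion else 0.0

def length_penalized_reward(prompt: str, completion: str, max_len: int = 256) -> float:
    """Continuous reward in [0, 1]: longer completions are penalized linearly."""
    return max(0.0, 1.0 - len(completion) / max_len)
```

Either shape fits the same `RewardFn` signature, which is why the scorer/grader naming question is mostly cosmetic.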
| metrics = trainer.step.call(batch).get().item(gpus=0) |
| # Fully sync RL loop with separate scoring step |
| # 1. Generator produces episode (without rewards) |
| episode = generator.generate.call().get().item(gpus=0) |
There should be a difference between episode vs. episodes. The generate call seems to return an Episodes object?
Yes, the class is named "Episodes", but we are not doing batching here, so each instance of "Episodes" only contains one training prompt + completion + etc. I would rename it to "Episode".
This looks like a loss file. We should probably put it under /rl/loss. Here is the work I did in Forge; maybe we can just copy/paste? https://github.com/meta-pytorch/torchforge/tree/main/src/forge/rl/loss
If the number of losses looks overwhelming, we can just keep GRPO/DAPO.
+1 to not having a catch-all utils here. For now, even just putting all of this in the trainer.py file would be fine IMO while we figure out the best structure for losses.
| return total_loss, metrics, batch_token_log_probs |
| def verify_logprob_identity( |
If we log the ratio (i.e. logprob trainer / logprob generator), we can check whether it is always equal to 1. This check is present in all losses in Forge.
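A hedged sketch of that check, assuming per-token log-probs from the trainer and the generator (function names are mine, not Forge's). Since log-probs are logged, the ratio of probabilities is `exp(lp_trainer - lp_generator)`, which should be ~1 when the two models agree:

```python
import math

def logprob_ratio(trainer_logprobs, generator_logprobs):
    """Per-token importance ratios exp(lp_trainer - lp_generator).

    Both inputs are sequences of per-token log-probabilities; the names
    here are illustrative placeholders.
    """
    return [math.exp(t - g) for t, g in zip(trainer_logprobs, generator_logprobs)]

def check_logprob_identity(trainer_logprobs, generator_logprobs, atol=1e-4):
    """True iff every ratio is within atol of 1.0, i.e. trainer and
    generator assign (numerically) identical probabilities."""
    ratios = logprob_ratio(trainer_logprobs, generator_logprobs)
    return all(abs(r - 1.0) <= atol for r in ratios)
```

Logging the ratios (rather than just the boolean) also makes drift between the trainer and generator weights visible over time.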
| @dataclass |
| class Episodes: |
I think that this should be in some types.py file
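For illustration, a single-sample `Episode` (per the rename suggested earlier) living in a types.py could look like this; the field names beyond `rewards` are assumptions, not the PR's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One training prompt + completion + rewards (not batched).

    Sketch only: real episodes would likely also carry token ids,
    log-probs, and a version number for the generator weights.
    """
    prompt: str
    completion: str
    rewards: list[float] = field(default_factory=list)
```

Keeping plain-data types like this in a dedicated types.py avoids circular imports between the generator, grader, and trainer modules.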
| # Parallelism configuration |
| tensor_parallel_size=generation_config.parallelism.tensor_parallel_degree, |
| - distributed_executor_backend=generation_config.distributed_executor_backend, |
| + distributed_executor_backend="external_launcher", |
Maybe add a comment on why we hardcode this?
Yeah, we need better documentation everywhere, e.g. justifications for why certain things are done -- you can use Claude.
tianyu-l left a comment
At this moment, honestly, the file organization and naming are stopping me from giving any high-quality reviews. I would suggest we add charts and documentation to help everyone understand the structure.
| @endpoint |
| - async def update(self, version: int, vllm_compat_state: dict) -> None: |
| + async def update(self, version: int, all_weights: dict) -> None: |
| """Update generate weights. |
now another update(... dict)
This update() updates the Generator's internal state (version number + model weights).
| data_parallel_shard_degree = -1 |
| fsdp_reshard_after_forward = "default" # default / never / always |
| - tensor_parallel_degree = 1 |
| + tensor_parallel_degree = 2 |
…ight tying (#2410)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.13.0) (oldest at bottom): #2395, #2244, #2221, #2194, #2191, __->__ #2410

This is an alternative fix to #2402 (comment). Weight updating between the trainer and generator is totally broken because we called `reload_weights` when updating the weights. `reload_weights` has the following steps:
- `initialize_layerwise_reload(model)`: saves the current real GPU tensors as `info.kernel_tensors`, and replaces all parameters with meta tensors.
- Call `model.load_weights(weights_iter)`: this function is written by us and calls `set_model_state_dict`. Internally, `set_model_state_dict` tries to do `param.data.copy_(loaded_weight)` for each parameter. When the parameters are meta tensors, this is a no-op, so the weights never get updated.

In this PR:
- Totally bypass `reload_weights`, and don't load from a file when we update the weights.
- Get the model via `self.engine.model_executor.driver_worker.get_model()`.
- Iterate over `model.named_parameters()` to find the matching parameter by name.
- Do `param.data.copy_(new_tensor)` directly.
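The direct-copy path described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `copy_matching_weights` is a hypothetical helper, and the real implementation first obtains the model via `self.engine.model_executor.driver_worker.get_model()`:

```python
def copy_matching_weights(model, new_weights):
    """Copy new_weights[name] in-place into each matching parameter.

    Sketch: `model` only needs to expose named_parameters() like a
    torch.nn.Module, and each parameter's .data must support .copy_().
    Returns the names of the parameters that were updated.
    """
    updated = []
    for name, param in model.named_parameters():
        if name in new_weights:
            # In-place copy into the existing (real) storage; this avoids
            # the meta-tensor no-op that broke the reload_weights path.
            param.data.copy_(new_weights[name])
            updated.append(name)
    return updated
```

The key property is that the copy is in-place: the generator keeps using the same parameter storages, so no reload or re-initialization is needed.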
| device_module.set_device(self.device) |
| # Initialize distributed |
| # When running under Monarch, setup_env_for_distributed already |
This is a Monarch Actor - under what circumstances would you not be "running under Monarch"?
| return trajectory |
| @endpoint |
| async def set_reward_fn(self, reward_fn: Callable) -> None: |
Curious when this would happen: in what situation does the Grader get a new reward function?
| advantages = self._compute_advantages(episode.rewards) |
| # Compute reference log probs using frozen ref_model |
| ref_token_log_probs = [] |
Ideally the reference model should be separable from the trainer due to both memory and performance constraints. We can do that in a follow-up PR if you'd like.
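For context on the `_compute_advantages` call in the snippet above, a minimal group-relative (GRPO-style) normalization might look like this; it is a sketch, and the actual implementation in this PR may differ:

```python
import statistics

def compute_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of completions.

    GRPO-style: advantage = (r - mean) / (std + eps), computed over the
    group so that above-average completions get positive advantages.
    Sketch only; not the PR's _compute_advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the normalization only needs the group's rewards, it is independent of the reference model, which is one reason the reference log-prob computation can be moved out of the trainer later.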
Stack from ghstack (oldest at bottom):
Current status:
Next Step:
Toy GRPO training: