[data] feat: TransferQueue - Unify the return of reward #4902
base: main
Conversation
Code Review
This pull request aims to unify the return values of the reward function, which is a good refactoring goal. However, the current implementation introduces a critical bug related to the REMAX baseline calculation and also makes the function's type hint and docstring incorrect and misleading. I've provided two review comments with high and critical severity to address these issues. Please review them carefully.
Updated return type annotation for the reward computation method.
Pull request overview
This PR refactors the _compute_or_extract_reward method in the PPO ray trainer to simplify and unify its return value logic. The parameter return_dict is renamed to reward_for_val to better reflect its purpose, and the return value handling is streamlined.
Changes:
- Renamed parameter `return_dict` to `reward_for_val` to clarify that it controls the validation vs training reward computation paths
- Simplified return logic: the method now returns either a tensor (for REMAX with `sum_reward`) or a tuple (all other cases), removing the dictionary return type (see the sketch after this list)
- Updated all call sites to match the new return signature (tuple unpacking instead of dictionary access)
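For illustration, a minimal sketch of the call-site change described above. `compute_reward_stub` and its shapes are made up for this example and are not the actual verl code:

```python
import torch

def compute_reward_stub(batch_size: int = 4, resp_len: int = 8):
    """Stand-in with the unified return contract: (reward_tensor, reward_extra_info)."""
    reward_tensor = torch.zeros(batch_size, resp_len)      # token-level rewards, 2D
    reward_extra_info = {"acc": [1.0] * batch_size}
    return reward_tensor, reward_extra_info

# Old call sites used dictionary access:
#     result = _compute_or_extract_reward(...)
#     reward_tensor = result["reward_tensor"]
#     reward_extra_info = result["reward_extra_info"]
# New call sites unpack the tuple directly:
reward_tensor, reward_extra_info = compute_reward_stub()
print(reward_tensor.shape, list(reward_extra_info))        # torch.Size([4, 8]) ['acc']
```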
verl/trainer/ppo/ray_trainer.py (Outdated)
    - If return_dict=True: dict with "reward_tensor" and "reward_extra_info"
    - If return_dict=False and sum_reward=True: summed reward_tensor (1D tensor)
    - If return_dict=False and sum_reward=False: reward_tensor (2D tensor)
    + If reward_for_val=True: reward_tensor (2D tensor)
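Based on the overview and the docstring change above, the updated return-type annotation might look roughly like the sketch below; the parameter list and docstring wording are assumptions for illustration, not the exact verl signature:

```python
from typing import Union

import torch

# Hypothetical signature sketch; only reward_for_val / sum_reward come from this PR's description.
def _compute_or_extract_reward_sketch(
    batch, reward_for_val: bool = False, sum_reward: bool = False
) -> Union[torch.Tensor, tuple[torch.Tensor, dict]]:
    """Return a summed 1D reward tensor on the sum_reward path, otherwise a
    (reward_tensor, reward_extra_info) tuple."""
    ...
```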
Add descriptions for reward_extra_infos
fixed
verl/trainer/ppo/ray_trainer.py (Outdated)
      reward_tensor = reward_tensor.sum(dim=-1)
      reward_extra_info = result.get("reward_extra_info", {})
    - return {"reward_tensor": reward_tensor, "reward_extra_info": reward_extra_info}
    + return reward_tensor, reward_extra_info
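To make the shape change in the snippet above concrete, here is a standalone example (made-up values, not the verl code) of summing token-level rewards over the response dimension and returning them with the extra info as a tuple:

```python
import torch

# Token-level rewards for a batch of 4 responses of length 16 (made-up values).
result = {
    "reward_tensor": torch.randn(4, 16),
    "reward_extra_info": {"format_score": [1, 0, 1, 1]},
}

reward_tensor = result["reward_tensor"].sum(dim=-1)      # shape (4,): one reward per sequence
reward_extra_info = result.get("reward_extra_info", {})  # defaults to {} if absent

# Returned as a tuple instead of a {"reward_tensor": ..., "reward_extra_info": ...} dict.
print(reward_tensor.shape, list(reward_extra_info))      # torch.Size([4]) ['format_score']
```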
unify with reward_extra_info_dict?
fixed
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
please run
done
What does this PR do?
Unify the return values of the reward function to make the logic clearer
Checklist Before Starting
- Write the PR title in the format `[{modules}] {type}: {description}` (this will be checked by the CI).
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
- If the PR breaks any API, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
API and Usage Example
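The snippet below is an illustration only, filling in the template placeholder based on the behaviour described in the overview; the stub class, batch, and shapes are placeholders, not the real `RayPPOTrainer` API:

```python
import torch

class _TrainerStub:
    """Placeholder mimicking the return contract described in this PR (not the real trainer)."""

    def _compute_or_extract_reward(self, batch, reward_for_val: bool = False, sum_reward: bool = False):
        reward_tensor = torch.ones(len(batch), 8)     # token-level rewards, 2D
        if sum_reward:
            return reward_tensor.sum(dim=-1)          # summed per-sequence reward, 1D (REMAX baseline path)
        return reward_tensor, {"acc": [1.0] * len(batch)}

trainer, batch = _TrainerStub(), ["sample"] * 4

# Training rewards: unpack the (reward_tensor, reward_extra_info) tuple.
reward_tensor, reward_extra_info = trainer._compute_or_extract_reward(batch)

# Validation rewards: reward_for_val=True selects the validation path.
val_reward_tensor, val_extra_info = trainer._compute_or_extract_reward(batch, reward_for_val=True)

# Summed per-sequence rewards, e.g. for a REMAX baseline: a single 1D tensor.
baseline = trainer._compute_or_extract_reward(batch, sum_reward=True)
print(reward_tensor.shape, baseline.shape)            # torch.Size([4, 8]) torch.Size([4])
```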
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review; otherwise, the reviewer might deprioritize this PR.
- Run pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once the PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR touches the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.