From 97a8265e735caf3a06f3b9a111a74774ba8726ce Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Quentin=20Gallou=C3=A9dec?= <45557362+qgallouedec@users.noreply.github.com>
Date: Thu, 29 Jan 2026 13:21:14 -0600
Subject: [PATCH] Update reward terminology from 'shaped' to 'dense'

In this context, "dense" is more widely used
---
 .../preference_optimization/grpo_rlvr/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/0_model_customization_recipes/preference_optimization/grpo_rlvr/README.md b/0_model_customization_recipes/preference_optimization/grpo_rlvr/README.md
index 5229ae4..81ba1ae 100644
--- a/0_model_customization_recipes/preference_optimization/grpo_rlvr/README.md
+++ b/0_model_customization_recipes/preference_optimization/grpo_rlvr/README.md
@@ -169,7 +169,7 @@ def accuracy_reward(completions: List[List[Dict]], answer: List[str], **kwargs)
 ### Reward Design Tips
 
 - **Sparse rewards** (0.0 or 1.0): Simple but can be slow to learn
-- **Shaped rewards** (0.0 to 1.0): Provide intermediate feedback
+- **Dense rewards** (0.0 to 1.0): Provide intermediate feedback
   - Partial credit for correct tool selection
   - Partial credit for correct argument types
   - Full credit for correct final answer
@@ -366,7 +366,7 @@ The `GRPOTrainer` (in `grpo_trainer_v2.py`):
 ## Tips
 
 1. **Start simple**: Begin with 2-3 tools and exact-match rewards
-2. **Iterate on rewards**: Experiment with shaped rewards for faster learning
+2. **Iterate on rewards**: Experiment with dense rewards for faster learning
 3. **Validate tools**: Test your tool functions independently before training
 4. **Monitor rewards**: Watch mean reward per batch to track learning
 5. **Use clear docstrings**: The model sees your function docstrings as tool descriptions
@@ -383,7 +383,7 @@ The `GRPOTrainer` (in `grpo_trainer_v2.py`):
 
 **Low rewards throughout training**
 - Check that expected answers match tool output format exactly
-- Try shaped rewards with partial credit
+- Try dense rewards with partial credit
 - Verify tools are being called (check logs)
 
 **Model not calling tools**