WIP branch to support on-policy distillation on main
## Overview
On-policy distillation (OPD) is already implemented in verl: point the reference model at the teacher and use the KL loss as the distillation signal. For a Qwen-2.5-0.5B-Instruct student and a Qwen-2.5-7B-Instruct teacher:

```bash
python -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    +actor_rollout_ref.ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.kl_loss_coef=1
```

Note that this will still use rewards for the student-model rollouts, but these can be set to zero if you want pure distillation. See the Results section below for an example.
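For pure distillation, one way to zero out the task reward is a custom reward function. This is only a sketch: the `custom_reward_function.path` / `custom_reward_function.name` options and the `compute_score` signature below are my reading of verl's custom-reward hook and should be checked against the verl version in use.

```python
# zero_reward.py -- hedged sketch of an all-zero reward for pure distillation.
# The signature follows verl's custom reward-function convention as I understand it;
# verify it against your verl version before relying on it.
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 0 for every rollout so only the KL/distillation term drives the update."""
    return 0.0
```

It would then be passed on the command line with something like `custom_reward_function.path=zero_reward.py custom_reward_function.name=compute_score`.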
This implementation can be made more efficient using async, so OPD will be implemented as a part of `verl.experimental.fully_async_policy`.

## Async RL
In RL, there are two main phases:

1. Rollout (generation): the current policy generates responses to a batch of prompts.
2. Training: the policy is updated on those rollouts.
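As a rough illustration of how these phases can be decoupled, here is a toy sketch with a bounded queue between a rollout worker and a trainer. This is not verl's `fully_async_policy`; every name in it is a made-up placeholder.

```python
"""Toy sketch of decoupling rollout and training with a queue (illustrative only)."""
import asyncio
import random


async def rollout_worker(queue: asyncio.Queue, num_batches: int) -> None:
    # Phase 1: keep generating rollout batches without waiting for training to finish.
    for step in range(num_batches):
        await asyncio.sleep(random.uniform(0.01, 0.05))  # stand-in for generation time
        await queue.put({"step": step, "rollouts": [f"sample-{step}-{i}" for i in range(4)]})
    await queue.put(None)  # sentinel: no more batches


async def trainer(queue: asyncio.Queue) -> None:
    # Phase 2: consume rollout batches as they become available and update the policy.
    while True:
        batch = await queue.get()
        if batch is None:
            break
        await asyncio.sleep(0.03)  # stand-in for a gradient step
        print(f"trained on rollout batch {batch['step']}")


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounds how far generation runs ahead
    await asyncio.gather(rollout_worker(queue, num_batches=5), trainer(queue))


if __name__ == "__main__":
    asyncio.run(main())
```

In this toy version, the bounded queue is what limits how far generation can run ahead of training.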
To make this more efficient, `fully_async_policy` implements various degrees of asynchrony between the rollout and training phases, roughly along the lines of the toy sketch above.

## Async OPD Implementation Plan
In OPD, there is an additional phase in which the teacher model scores the rollouts from the student model; those teacher scores are then used by a distillation loss to update the student model.
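As a sketch of what "scoring" means here (a hedged illustration, not verl's code; the helper name and the toy usage are assumptions), the teacher runs a forward pass over the student's rollout and returns its per-token log-probabilities on the tokens the student actually sampled:

```python
"""Hedged sketch of teacher scoring: teacher log-probs on student rollout tokens."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def teacher_logprobs(teacher, input_ids: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Per-token log-probs the teacher assigns to the tokens the student sampled."""
    logits = teacher(input_ids).logits[:, :-1, :]      # position t predicts token t+1
    targets = input_ids[:, 1:]                          # the tokens the student actually produced
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps * response_mask[:, 1:]           # keep only response (non-prompt) positions


if __name__ == "__main__":
    # Toy usage; in the trainer this would run on batches of student rollouts.
    # Loads the full 7B teacher, so it is only meant as a shape/API check.
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
    teacher = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16)
    enc = tok("What is 2 + 2? The answer is 4.", return_tensors="pt")
    mask = torch.ones_like(enc.input_ids)               # toy mask: treat every position as response
    print(teacher_logprobs(teacher, enc.input_ids, mask))
```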
To implement OPD in an async manner, we will add a new phase for the teacher model scoring:
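A toy sketch of what that three-stage pipeline could look like, extending the earlier two-stage sketch with a scorer in the middle. Again, this is purely illustrative and not verl's `fully_async_policy` code.

```python
"""Toy three-stage async pipeline: rollout -> teacher scoring -> training (illustrative only)."""
import asyncio


async def rollout_worker(raw_queue: asyncio.Queue, num_batches: int) -> None:
    for step in range(num_batches):
        await asyncio.sleep(0.02)                       # stand-in for student generation
        await raw_queue.put({"step": step, "tokens": list(range(4))})
    await raw_queue.put(None)                           # sentinel: no more batches


async def teacher_scorer(raw_queue: asyncio.Queue, scored_queue: asyncio.Queue) -> None:
    # New phase: score student rollouts with the teacher while generation and training continue.
    while True:
        batch = await raw_queue.get()
        if batch is None:
            await scored_queue.put(None)
            break
        await asyncio.sleep(0.02)                       # stand-in for a teacher forward pass
        batch["teacher_logprobs"] = [0.0 for _ in batch["tokens"]]
        await scored_queue.put(batch)


async def trainer(scored_queue: asyncio.Queue) -> None:
    while True:
        batch = await scored_queue.get()
        if batch is None:
            break
        await asyncio.sleep(0.02)                       # stand-in for a distillation update
        print(f"distillation step on batch {batch['step']}")


async def main() -> None:
    raw_queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    scored_queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    await asyncio.gather(
        rollout_worker(raw_queue, num_batches=5),
        teacher_scorer(raw_queue, scored_queue),
        trainer(scored_queue),
    )


if __name__ == "__main__":
    asyncio.run(main())
```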
## Results
### Sync OPD
These results are for OPD using `main`. The only change to the code is to make the policy loss equal to the KL loss; the reason for doing it this way rather than just using an all-zero reward is so that the logged rewards still reflect performance on the training task. The change is made by editing this line. Note that OPD can be combined with RL rewards, but to isolate the effect of OPD, RL rewards aren't used in these experiments.
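Concretely, "policy loss = KL loss" means the per-token objective is just the KL penalty between the student's log-probs and the teacher's (reference) log-probs on the student's own rollout. Below is a minimal standalone sketch, not verl's actual actor code; the k3-style estimator and the response masking are assumptions chosen for illustration.

```python
"""Standalone sketch of a KL-only (pure distillation) policy loss (illustrative only)."""
import torch


def distillation_policy_loss(
    log_probs: torch.Tensor,        # (batch, seq_len) student log-probs on its sampled tokens
    ref_log_probs: torch.Tensor,    # (batch, seq_len) teacher log-probs on the same tokens
    response_mask: torch.Tensor,    # (batch, seq_len) 1 on response tokens, 0 elsewhere
    kl_loss_coef: float = 1.0,
) -> torch.Tensor:
    # k3-style estimator of KL(student || teacher): exp(d) - d - 1, d = log p_teacher - log p_student.
    delta = ref_log_probs - log_probs
    kl = torch.exp(delta) - delta - 1.0
    # Mean over response tokens only, scaled by the KL loss coefficient.
    kl_mean = (kl * response_mask).sum() / response_mask.sum().clamp(min=1)
    return kl_loss_coef * kl_mean


if __name__ == "__main__":
    torch.manual_seed(0)
    student = -torch.rand(2, 5)     # toy per-token log-probs
    teacher = -torch.rand(2, 5)
    mask = torch.ones(2, 5)
    print(distillation_policy_loss(student, teacher, mask))
```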
Plots (not reproduced here): distillation loss, train accuracy, and eval accuracy over training.
### Script