Conversation

@JacobHelwig (Contributor) commented Jan 12, 2026

WIP branch to support on-policy distillation on main

Overview

OPD is already implemented in verl. For a Qwen2.5-0.5B-Instruct student and a Qwen2.5-7B-Instruct teacher:

python -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    +actor_rollout_ref.ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.kl_loss_coef=1

Note that this will still use rewards for the student model rollouts, but these can be set to zero if you want to do pure distillation. See the Results section below for an example.

This implementation can be made more efficient with asynchronous execution, so OPD will be implemented as part of verl.experimental.fully_async_policy.

Async RL

In RL, there are two main phases:

[diagram: the rollout (generation) and training (policy update) phases of RL]

To make this more efficient, fully_async_policy implements various degrees of asynchrony between the rollout and training phases:

[diagram: degrees of asynchrony between the rollout and training phases in fully_async_policy]
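
To make the overlap concrete, here is a minimal sketch of the two patterns (not verl's actual API; generate_rollouts and train_step are hypothetical stand-ins). In the synchronous loop, rollout and training strictly alternate; in the fully async version, a rollout worker keeps producing batches with slightly stale weights while the trainer consumes them from a bounded queue, and the queue size caps the allowed staleness.

import queue
import threading

def generate_rollouts(policy_version):
    # Hypothetical stand-in for the rollout phase (e.g. vLLM generation).
    return {"policy_version": policy_version, "trajectories": []}

def train_step(batch):
    # Hypothetical stand-in for the training phase (actor update).
    pass

def sync_loop(num_steps):
    # Synchronous RL: each training step waits for its rollouts and vice versa.
    for step in range(num_steps):
        batch = generate_rollouts(policy_version=step)
        train_step(batch)

def async_loop(num_steps, max_staleness=2):
    # Fully async RL: rollouts and training overlap; the bounded queue limits
    # how far the rollout policy may lag behind the trained policy.
    batches = queue.Queue(maxsize=max_staleness)

    def rollout_worker():
        for version in range(num_steps):
            batches.put(generate_rollouts(policy_version=version))

    threading.Thread(target=rollout_worker, daemon=True).start()
    for _ in range(num_steps):
        train_step(batches.get())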

Async OPD Implementation Plan

In OPD, there is an additional phase in which the teacher model scores the rollouts from the student model; these scores are then used by a distillation loss to update the student model:

[diagram: OPD rollout, teacher scoring, and training phases]

To implement OPD asynchronously, we will add a new phase for teacher-model scoring:

[diagram: async OPD with a separate teacher-scoring phase]
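
As a rough illustration of how the teacher scores feed the update, below is a sketch of a masked per-token distillation loss built from log-probs, using a low-variance KL estimator similar in spirit to verl's kl_loss_type=low_var_kl (illustrative names and shapes, not verl's actual code).

import torch

def low_var_kl(student_logprobs, teacher_logprobs):
    # k3-style estimator of KL(student || teacher) per token:
    # exp(d) - d - 1 with d = log p_teacher - log p_student, clamped for stability.
    delta = teacher_logprobs - student_logprobs
    return torch.clamp(torch.exp(delta) - delta - 1.0, min=-10.0, max=10.0)

def distillation_loss(student_logprobs, teacher_logprobs, response_mask):
    # The teacher only scores the student's rollouts, so its log-probs carry no gradient;
    # the loss is differentiated with respect to the student log-probs alone.
    per_token_kl = low_var_kl(student_logprobs, teacher_logprobs.detach())
    return (per_token_kl * response_mask).sum() / response_mask.sum().clamp(min=1)

# Toy usage with [batch, response_len] tensors of sampled-token log-probs.
student_lp = torch.randn(2, 8, requires_grad=True)
teacher_lp = torch.randn(2, 8)
mask = torch.ones(2, 8)
distillation_loss(student_lp, teacher_lp, mask).backward()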

Results

Sync OPD

These results are for OPD on main. The only code change is to set the policy loss equal to the KL loss; doing this rather than zeroing out the reward means the logged rewards still reflect performance on the training task. The change replaces this line with

policy_loss = kl_loss

Note that OPD can be combined with RL rewards, although RL rewards aren't used in these experiments in order to isolate the effect of OPD; a sketch of the combination is shown below.
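
For reference, here is a sketch of how the two terms could be combined in the actor update (illustrative names, not verl's exact variables; with use_kl_loss=True, verl adds the KL term to the policy-gradient loss weighted by kl_loss_coef).

def actor_loss(pg_loss, kl_loss, kl_loss_coef=1.0, pure_distillation=False):
    if pure_distillation:
        # These experiments: only the distillation (KL) term drives the update,
        # so logged rewards track task performance without affecting the gradient.
        return kl_loss
    # OPD + RL: PPO policy-gradient loss plus the KL-to-teacher distillation term.
    return pg_loss + kl_loss_coef * kl_loss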

Distillation loss

[plot: distillation loss]

Train accuracy

[plot: train accuracy]

Eval accuracy

[plot: eval accuracy]

Script

# Paths to preprocessed GSM8K parquet files (assumed to be set earlier in the script)
train_files="['$gsm8k_train_path']"
test_files="['$gsm8k_test_path']"

FAMILY=Qwen
STUDENT=Qwen2.5-0.5B
TEACHER=Qwen2.5-7B-Instruct

# OPD setup: the ref model path points at the teacher, with use_kl_loss=True and kl_loss_coef=1
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=256 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=$FAMILY/$STUDENT \
    +actor_rollout_ref.ref.model.path=$FAMILY/$TEACHER \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=1 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=1 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console", "wandb"]' \
    trainer.project_name='math_opd' \
    trainer.experiment_name=$FAMILY-$STUDENT-$TEACHER \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 \
    trainer.resume_mode=disable 


@gemini-code-assist bot left a comment


Code Review

This pull request appears to be a work-in-progress for adding on-policy distillation support. The only change present is a minor whitespace modification in the README.md file. As this is a stylistic change with no functional impact, and I am configured to only report issues of high or critical severity, I have no specific comments on the current state of the pull request. I look forward to reviewing more substantial changes as they are added.

@JacobHelwig mentioned this pull request on Jan 12, 2026
@JacobHelwig force-pushed the jhelwig/onPolicyDistillation branch from 7ee07a9 to 2e3a783 on January 12, 2026 at 20:10
@JacobHelwig force-pushed the jhelwig/onPolicyDistillation branch from 4245789 to 840aca3 on January 13, 2026 at 01:24
@yubin1991

Why is the reward so small (almost 0) before step 40?
