WIP branch to support on-policy distillation on main
## Overview
On-policy distillation (OPD) is already implemented in verl: point the reference model at the teacher and use the KL loss as the distillation signal. For a Qwen-2.5-0.5B-Instruct student and a Qwen-2.5-7B-Instruct teacher:

```bash
python -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    +actor_rollout_ref.ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.kl_loss_coef=1
```

Note that this will still use rewards for the student-model rollouts, but these can be set to zero if you want pure distillation. See the Results section below for an example.
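For pure distillation, one way to zero out the task reward is a custom reward function. This is only a sketch: the `custom_reward_function.path` / `custom_reward_function.name` options and the `compute_score` signature below are my reading of verl's custom-reward hook and should be checked against the verl version in use.

```python
# zero_reward.py -- hedged sketch of an all-zero reward for pure distillation.
# The signature follows verl's custom reward-function convention as I understand it;
# verify it against your verl version before relying on it.
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 0 for every rollout so only the KL/distillation term drives the update."""
    return 0.0
```

It would then be passed on the command line with something like `custom_reward_function.path=zero_reward.py custom_reward_function.name=compute_score`.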
This implementation can be made more efficient using async, so OPD will be implemented as a part of `verl.experimental.fully_async_policy`.

## Async RL
In RL, there are two main phases:

1. Rollout (generation): the current policy generates responses to a batch of prompts.
2. Training: the policy is updated on those rollouts.
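As a rough illustration of how these phases can be decoupled, here is a toy sketch with a bounded queue between a rollout worker and a trainer. This is not verl's `fully_async_policy`; every name in it is a made-up placeholder.

```python
"""Toy sketch of decoupling rollout and training with a queue (illustrative only)."""
import asyncio
import random


async def rollout_worker(queue: asyncio.Queue, num_batches: int) -> None:
    # Phase 1: keep generating rollout batches without waiting for training to finish.
    for step in range(num_batches):
        await asyncio.sleep(random.uniform(0.01, 0.05))  # stand-in for generation time
        await queue.put({"step": step, "rollouts": [f"sample-{step}-{i}" for i in range(4)]})
    await queue.put(None)  # sentinel: no more batches


async def trainer(queue: asyncio.Queue) -> None:
    # Phase 2: consume rollout batches as they become available and update the policy.
    while True:
        batch = await queue.get()
        if batch is None:
            break
        await asyncio.sleep(0.03)  # stand-in for a gradient step
        print(f"trained on rollout batch {batch['step']}")


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounds how far generation runs ahead
    await asyncio.gather(rollout_worker(queue, num_batches=5), trainer(queue))


if __name__ == "__main__":
    asyncio.run(main())
```

In this toy version, the bounded queue is what limits how far generation can run ahead of training.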
To make this more efficient, `fully_async_policy` implements various degrees of asynchrony between the rollout and training phases, roughly along the lines of the toy sketch above.

## Async OPD Implementation Plan
In OPD, there is an additional phase in which the teacher model scores the rollouts from the student model; those teacher scores are then used by a distillation loss to update the student model.
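As a sketch of what "scoring" means here (a hedged illustration, not verl's code; the helper name and the toy usage are assumptions), the teacher runs a forward pass over the student's rollout and returns its per-token log-probabilities on the tokens the student actually sampled:

```python
"""Hedged sketch of teacher scoring: teacher log-probs on student rollout tokens."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def teacher_logprobs(teacher, input_ids: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Per-token log-probs the teacher assigns to the tokens the student sampled."""
    logits = teacher(input_ids).logits[:, :-1, :]      # position t predicts token t+1
    targets = input_ids[:, 1:]                          # the tokens the student actually produced
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps * response_mask[:, 1:]           # keep only response (non-prompt) positions


if __name__ == "__main__":
    # Toy usage; in the trainer this would run on batches of student rollouts.
    # Loads the full 7B teacher, so it is only meant as a shape/API check.
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
    teacher = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16)
    enc = tok("What is 2 + 2? The answer is 4.", return_tensors="pt")
    mask = torch.ones_like(enc.input_ids)               # toy mask: treat every position as response
    print(teacher_logprobs(teacher, enc.input_ids, mask))
```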
To implement OPD in an async manner, we will add a new phase for the teacher model scoring:
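A toy sketch of what that three-stage pipeline could look like, extending the earlier two-stage sketch with a scorer in the middle. Again, this is purely illustrative and not verl's `fully_async_policy` code.

```python
"""Toy three-stage async pipeline: rollout -> teacher scoring -> training (illustrative only)."""
import asyncio


async def rollout_worker(raw_queue: asyncio.Queue, num_batches: int) -> None:
    for step in range(num_batches):
        await asyncio.sleep(0.02)                       # stand-in for student generation
        await raw_queue.put({"step": step, "tokens": list(range(4))})
    await raw_queue.put(None)                           # sentinel: no more batches


async def teacher_scorer(raw_queue: asyncio.Queue, scored_queue: asyncio.Queue) -> None:
    # New phase: score student rollouts with the teacher while generation and training continue.
    while True:
        batch = await raw_queue.get()
        if batch is None:
            await scored_queue.put(None)
            break
        await asyncio.sleep(0.02)                       # stand-in for a teacher forward pass
        batch["teacher_logprobs"] = [0.0 for _ in batch["tokens"]]
        await scored_queue.put(batch)


async def trainer(scored_queue: asyncio.Queue) -> None:
    while True:
        batch = await scored_queue.get()
        if batch is None:
            break
        await asyncio.sleep(0.02)                       # stand-in for a distillation update
        print(f"distillation step on batch {batch['step']}")


async def main() -> None:
    raw_queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    scored_queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    await asyncio.gather(
        rollout_worker(raw_queue, num_batches=5),
        teacher_scorer(raw_queue, scored_queue),
        trainer(scored_queue),
    )


if __name__ == "__main__":
    asyncio.run(main())
```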
## Results
### Sync OPD
These results are for OPD using `main`. The only change to the code is to make the policy loss equal to the KL loss; the reason for doing it this way rather than just using an all-zero reward is so that the logged rewards still reflect performance on the training task. The change is made by editing this line. Note that OPD can be combined with RL rewards, but to isolate the effect of OPD, RL rewards aren't used in these experiments.
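Concretely, "policy loss = KL loss" means the per-token objective is just the KL penalty between the student's log-probs and the teacher's (reference) log-probs on the student's own rollout. Below is a minimal standalone sketch, not verl's actual actor code; the k3-style estimator and the response masking are assumptions chosen for illustration.

```python
"""Standalone sketch of a KL-only (pure distillation) policy loss (illustrative only)."""
import torch


def distillation_policy_loss(
    log_probs: torch.Tensor,        # (batch, seq_len) student log-probs on its sampled tokens
    ref_log_probs: torch.Tensor,    # (batch, seq_len) teacher log-probs on the same tokens
    response_mask: torch.Tensor,    # (batch, seq_len) 1 on response tokens, 0 elsewhere
    kl_loss_coef: float = 1.0,
) -> torch.Tensor:
    # k3-style estimator of KL(student || teacher): exp(d) - d - 1, d = log p_teacher - log p_student.
    delta = ref_log_probs - log_probs
    kl = torch.exp(delta) - delta - 1.0
    # Mean over response tokens only, scaled by the KL loss coefficient.
    kl_mean = (kl * response_mask).sum() / response_mask.sum().clamp(min=1)
    return kl_loss_coef * kl_mean


if __name__ == "__main__":
    torch.manual_seed(0)
    student = -torch.rand(2, 5)     # toy per-token log-probs
    teacher = -torch.rand(2, 5)
    mask = torch.ones(2, 5)
    print(distillation_policy_loss(student, teacher, mask))
```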
Plots (not reproduced here): distillation loss, train accuracy, and eval accuracy over training.
### Script