Hi, thank you for releasing the code and paper — I really appreciate the work!
I'm having difficulty reproducing the results reported in the paper and would appreciate any guidance you could provide.
I followed the instructions in README.md exactly, with two modifications due to hardware constraints:
- Reduced the number of GPUs (passed as `trainer.n_gpus_per_node=${gpus}` in the command below)
- Switched FAISS to CPU due to GPU memory limitations
I disabled validation during training. Evaluation results were obtained using the following script:
```bash
PYTHONUNBUFFERED=1 python -m verl.trainer.main_ppo \
    --config-name='search_multiturn_grpo' \
    data.train_files="$TRAIN_DATA" \
    data.val_files="$VAL_DATA" \
    data.train_batch_size=256 \
    data.val_batch_size=256 \
    actor_rollout_ref.model.path=${model} \
    actor_rollout_ref.model.enable_gradient_checkpointing=true \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${tp} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.n=1 \
    actor_rollout_ref.rollout.multi_turn.tool_config_path="$TOOL_CONFIG" \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.adv_estimator=grpo \
    trainer.logger='[]' \
    trainer.val_only=true \
    trainer.val_before_train=true \
    trainer.n_gpus_per_node=${gpus} \
    trainer.nnodes=1 \
    trainer.validation_data_dir=${ROLLOUT_DIR} \
    "$@"
```
My results for Qwen-2.5-3b-iter2 and Qwen-2.5-7b-iter1 differ substantially from the numbers reported in the paper:

| Model | NQ | TriviaQA | PopQA | HotpotQA | 2WikiMQA | Musique | Bamboogle | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-3b-iter1 | 37.26 | 54.42 | 37.82 | 28.87 | 26.46 | 10.47 | 28.80 | 32.01 |
| Qwen-2.5-3b-iter2 | 37.20 | 54.31 | 38.62 | 26.62 | 19.01 | 6.79 | 16.00 | 28.36 |
| Qwen-2.5-7b-iter1 | 40.14 | 57.09 | 40.50 | 28.25 | 19.95 | 6.79 | 16.80 | 29.93 |

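
As a quick sanity check on the table, the Avg column is consistent with an unweighted mean over the seven datasets. The sketch below assumes that definition (the paper may aggregate differently, for example by weighting on dataset size); the names `scores` and `macro_avg` are illustrative only and do not come from the repository.

```python
# Per-dataset scores copied from the table above, in column order:
# NQ, TriviaQA, PopQA, HotpotQA, 2WikiMQA, Musique, Bamboogle
scores = {
    "Qwen-2.5-3b-iter1": [37.26, 54.42, 37.82, 28.87, 26.46, 10.47, 28.80],
    "Qwen-2.5-3b-iter2": [37.20, 54.31, 38.62, 26.62, 19.01, 6.79, 16.00],
    "Qwen-2.5-7b-iter1": [40.14, 57.09, 40.50, 28.25, 19.95, 6.79, 16.80],
}


def macro_avg(values):
    """Unweighted mean over datasets, rounded to two decimals (assumed definition)."""
    return round(sum(values) / len(values), 2)


averages = {model: macro_avg(vals) for model, vals in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

Under this assumption the recomputed averages match the Avg column, so the gap to the paper would come from the per-dataset scores themselves rather than from how they are aggregated.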
Full training logs are available here: https://wandb.ai/cyruskwan/dr-zero-public. I would greatly appreciate any guidance or suggestions that could help me reproduce the reported results. Thank you!