Unable to reproduce results #5

@KwanWaiChung

Hi, thank you for releasing the code and paper — I really appreciate the work!

I'm having difficulty reproducing the results reported in the paper and would appreciate any guidance you could provide.

I followed the instructions in README.md exactly, with two modifications due to hardware constraints:

  • Reduced the number of GPUs
  • Switched FAISS to CPU because of GPU memory limitations

I disabled validation during training. Evaluation results were obtained using the following script:

PYTHONUNBUFFERED=1 python -m verl.trainer.main_ppo \
    --config-name='search_multiturn_grpo' \
    data.train_files="$TRAIN_DATA" \
    data.val_files="$VAL_DATA" \
    data.train_batch_size=256 \
    data.val_batch_size=256 \
    actor_rollout_ref.model.path=${model} \
    actor_rollout_ref.model.enable_gradient_checkpointing=true \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${tp} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.n=1 \
    actor_rollout_ref.rollout.multi_turn.tool_config_path="$TOOL_CONFIG" \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.adv_estimator=grpo \
    trainer.logger='[]' \
    trainer.val_only=true \
    trainer.val_before_train=true \
    trainer.n_gpus_per_node=${gpus} \
    trainer.nnodes=1 \
    trainer.validation_data_dir=${ROLLOUT_DIR} \
    "$@"

My results for Qwen-2.5-3b-iter2 and Qwen-2.5-7b-iter1 differ substantially from the numbers reported in the paper.

Model              NQ     TriviaQA  PopQA  HotpotQA  2WikiMQA  Musique  Bamboogle  Avg
Qwen-2.5-3b-iter1  37.26  54.42     37.82  28.87     26.46     10.47    28.80      32.01
Qwen-2.5-3b-iter2  37.20  54.31     38.62  26.62     19.01      6.79    16.00      28.36
Qwen-2.5-7b-iter1  40.14  57.09     40.50  28.25     19.95      6.79    16.80      29.93
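
As a sanity check on the numbers above, the Avg column appears to be the unweighted mean of the seven per-dataset scores. The sketch below recomputes it from the values in the table; the names `scores`, `reported_avg`, and `unweighted_mean` are illustrative and do not come from the repository.

```python
# Per-dataset scores copied from the results above, in the order:
# NQ, TriviaQA, PopQA, HotpotQA, 2WikiMQA, Musique, Bamboogle.
scores = {
    "Qwen-2.5-3b-iter1": [37.26, 54.42, 37.82, 28.87, 26.46, 10.47, 28.80],
    "Qwen-2.5-3b-iter2": [37.20, 54.31, 38.62, 26.62, 19.01, 6.79, 16.00],
    "Qwen-2.5-7b-iter1": [40.14, 57.09, 40.50, 28.25, 19.95, 6.79, 16.80],
}

# Avg column as listed in the results above.
reported_avg = {
    "Qwen-2.5-3b-iter1": 32.01,
    "Qwen-2.5-3b-iter2": 28.36,
    "Qwen-2.5-7b-iter1": 29.93,
}


def unweighted_mean(values):
    """Plain arithmetic mean; every dataset counts equally (an assumption)."""
    return sum(values) / len(values)


for model, vals in scores.items():
    avg = unweighted_mean(vals)
    print(f"{model}: recomputed {avg:.2f}, reported {reported_avg[model]:.2f}")
```

If the paper weights datasets by size or uses a different metric, the comparison would need to be adjusted accordingly.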

Full training logs are available here: https://wandb.ai/cyruskwan/dr-zero-public. I would greatly appreciate any guidance or suggestions that could help me reproduce the reported results. Thank you!
