Unable to reproduce results

Hi, thank you for releasing the code and paper — I really appreciate the work!

I'm having difficulty reproducing the results reported in the paper and would appreciate any guidance you could provide.

I followed the instructions in README.md exactly, with two modifications due to hardware constraints:
- Reduced number of GPUs (please specify how many you used if relevant)
- Switched FAISS to CPU due to GPU memory limitations

I disabled validation during training. Evaluation results were obtained using the following script:

```bash
PYTHONUNBUFFERED=1 python -m verl.trainer.main_ppo \
    --config-name='search_multiturn_grpo' \
    data.train_files="$TRAIN_DATA" \
    data.val_files="$VAL_DATA" \
    data.train_batch_size=256 \
    data.val_batch_size=256 \
    actor_rollout_ref.model.path=${model} \
    actor_rollout_ref.model.enable_gradient_checkpointing=true \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${tp} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.n=1 \
    actor_rollout_ref.rollout.multi_turn.tool_config_path="$TOOL_CONFIG" \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.adv_estimator=grpo \
    trainer.logger='[]' \
    trainer.val_only=true \
    trainer.val_before_train=true \
    trainer.n_gpus_per_node=${gpus} \
    trainer.nnodes=1 \
    trainer.validation_data_dir=${ROLLOUT_DIR} \
    "$@"
```

My results for Qwen-2.5-3b-iter2 and Qwen-2.5-7b-iter1 differ substantially from the numbers reported in the paper.

Model | NQ | TriviaQA | PopQA | HotpotQA | 2WikiMQA | Musique | Bamboogle | Avg
-- | -- | -- | -- | -- | -- | -- | -- | --
Qwen-2.5-3b-iter1 | 37.26 | 54.42 | 37.82 | 28.87 | 26.46 | 10.47 | 28.80 | 32.01
Qwen-2.5-3b-iter2 | 37.20 | 54.31 | 38.62 | 26.62 | 19.01 | 6.79 | 16.00 | 28.36
Qwen-2.5-7b-iter1 | 40.14 | 57.09 | 40.50 | 28.25 | 19.95 | 6.79 | 16.80 | 29.93
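
In case it helps when comparing against the paper, here is a small sketch of how I computed the Avg column. It assumes Avg is the unweighted mean over the seven datasets; I have not confirmed that the paper aggregates the same way. The `scores` dict and the `macro_average` helper are my own names for illustration, not part of the repo.

```python
# Per-dataset scores copied from the reported results above, in column order:
# NQ, TriviaQA, PopQA, HotpotQA, 2WikiMQA, Musique, Bamboogle
scores = {
    "Qwen-2.5-3b-iter1": [37.26, 54.42, 37.82, 28.87, 26.46, 10.47, 28.80],
    "Qwen-2.5-3b-iter2": [37.20, 54.31, 38.62, 26.62, 19.01, 6.79, 16.00],
    "Qwen-2.5-7b-iter1": [40.14, 57.09, 40.50, 28.25, 19.95, 6.79, 16.80],
}


def macro_average(values):
    """Unweighted mean over datasets (assumed aggregation), rounded to 2 decimals."""
    return round(sum(values) / len(values), 2)


averages = {name: macro_average(vals) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

If the paper weights datasets by their number of examples instead, the Avg values would not be directly comparable, although the per-dataset columns still would be.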


Full training logs are available at https://wandb.ai/cyruskwan/dr-zero-public. Any suggestions that could help me close this gap would be greatly appreciated. Thank you!


