2025.9.22: GSM8K 93.5%, miniF2F 4.5%. Qwen2.5-Instruct-7B converges after one epoch on GSM8K. The data (8k GSM8K problems, 100 miniF2F problems) is not enough, and high-quality data is even scarcer.
TODO:
- Add more miniF2F data.
- 93% on GSM8K is not good enough; temperature, top-k, prompt, and model hyperparameters all need tuning.
- Use synthetic theorem-proving data instead of GSM8K.
- Reproduce on DeepSeek-Math-7B for comparison with DeepSeek-Prover-1.5-Base.
- Add more medium-difficulty theorem-proving tasks for RL.
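The decoding-parameter tuning mentioned above is naturally organized as a grid sweep. A minimal sketch, assuming the sweep is scored offline on a held-out GSM8K split; the grid values and function name are illustrative, not tuned results:

```python
from itertools import product

# Hypothetical sweep grid; values are placeholders, not tuned results.
TEMPERATURES = [0.2, 0.6, 1.0]
TOP_KS = [20, 50]

def decoding_configs(temps, ks):
    """Enumerate (temperature, top_k) pairs to score on a held-out split."""
    return [{"temperature": t, "top_k": k} for t, k in product(temps, ks)]

configs = decoding_configs(TEMPERATURES, TOP_KS)
print(len(configs))  # 6 combinations to evaluate
```

Each config would then be passed to the generation call, and the best pair kept by validation accuracy.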
Setup and evaluation:
- Set up the environment: bash setup_env.sh
- Download the LoRA weights from Google Drive.
- Run: bash run.sh
- Evaluate on miniF2F: python eval.py
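For the miniF2F evaluation, the headline number is presumably a pass rate over sampled proof attempts. A minimal sketch of that aggregation, assuming per-problem boolean results; the function name and data shape are assumptions, not the actual eval.py interface:

```python
def pass_rate(results):
    """Fraction of problems with at least one verified proof attempt.

    `results` maps problem name -> list of bools (one per sampled attempt).
    Assumed shape, not necessarily what eval.py produces.
    """
    if not results:
        return 0.0
    solved = sum(1 for attempts in results.values() if any(attempts))
    return solved / len(results)

# Toy example: 1 of 2 problems solved -> 0.5
demo = {"mathd_algebra_1": [False, True], "imo_1964_p2": [False, False]}
print(pass_rate(demo))  # 0.5
```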