Paper: https://arxiv.org/pdf/2510.11967
Note: this is an open-source re-implementation based on `agent_loop` in verl, and it may differ from the code we used to train the model in our paper.
## Key Files

```
FoldAgent/
├── verl/
│   ├── experimental/
│   │   └── agent_loop/       # Base agent loop implementations
│   └── trainer/
│       └── ppo/              # PPO training with FoldGRPO algorithm
├── agents/
│   └── fold_agent.py         # Core agent logic (process_item)
├── envs/
│   └── local_search.py       # Local search environment
└── scripts/
    └── train_fold.py         # Training script
```
## Training

### 1. Start Search Server
Start the search server on a separate machine. This will download the corpus (Tevatron/browsecomp-plus-corpus), pre-computed embeddings (miaolu3/browsecomp-plus), and load the Qwen3-Embedding-8B model on available GPUs.
```bash
cd envs && python search_server.py \
  --model Qwen/Qwen3-Embedding-8B \
  --corpus Tevatron/browsecomp-plus-corpus \
  --corpus-embedding-dataset miaolu3/browsecomp-plus \
  --host 0.0.0.0 \
  --port 8010
```

Set environment variables:

```bash
# URL of the local search server (for BrowseComp-Plus)
export LOCAL_SEARCH_URL="http://[IP-of-search-server]:8010"

# For LLM-based answer grading
export OPENAI_API_KEY="your-api-key"
```

### 2. Download Training Data

Download and decompress the BrowseComp dataset: https://drive.google.com/file/d/1aX5xXAN5R-gLKd8A0AY-troxXJRawyAM/view?usp=sharing
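With the server running and `LOCAL_SEARCH_URL` set, client code can assemble retrieval requests against it. The sketch below is illustrative only: the `/search` endpoint path and payload keys are assumptions, not the actual `search_server.py` API (check `envs/local_search.py` for the real interface).

```python
import os

def build_search_request(query: str, top_k: int = 10):
    """Assemble a retrieval request for the local search server.

    The endpoint path and payload keys below are illustrative
    assumptions; see search_server.py / local_search.py for the real API.
    """
    base_url = os.environ.get("LOCAL_SEARCH_URL", "http://localhost:8010")
    url = f"{base_url.rstrip('/')}/search"  # hypothetical endpoint name
    payload = {"query": query, "top_k": top_k}
    return url, payload

url, payload = build_search_request("example query", top_k=5)
# With a live server, send with e.g.: requests.post(url, json=payload)
```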
### 3. Train on BrowseComp
Example script to train Qwen3-8B:
```bash
bash scripts/train_bc_qwen3_8b.sh
```

## Evaluation

### 1. Start Search Server
```bash
cd envs && python search_server.py \
  --model Qwen/Qwen3-Embedding-8B \
  --corpus Tevatron/browsecomp-plus-corpus \
  --corpus-embedding-dataset miaolu3/browsecomp-plus \
  --host 0.0.0.0 \
  --port 8000
```

### 2. Evaluate on BrowseComp
- Download and decompress data: https://drive.google.com/file/d/1aX5xXAN5R-gLKd8A0AY-troxXJRawyAM/view?usp=sharing
- Fold Agent (`workflow=search_branch`):

```bash
export OPENAI_API_KEY='your-key'
python scripts/eval_bc.py \
  --data_path data/bc_test.parquet \
  --model_name gpt-5-nano \
  --num_workers 150 \
  --workflow search_branch \
  --prompt_length 16384 \
  --response_length 32768 \
  --max_turn 200 \
  --val_max_turn 200 \
  --max_session 10 \
  --val_max_session 10 \
  --local_search_url http://localhost:8000 \
  --output_dir results
```

Output:
```
Evaluating: 100%|█████████████| 150/150 [32:52<00:00, 13.15s/item, avg_score=0.407, id=122]
============================================================
Overall - Avg Score: 0.4067, Success: 150/150
By Data Source:
  bc_test_easy: 0.8200 (50 items)
  bc_test_hard: 0.0400 (50 items)
  bc_test_medium: 0.3600 (50 items)
```
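The overall score is the item-weighted mean of the three 50-item splits, which can be checked directly from the per-split numbers above:

```python
# Per-split (average score, item count) from the evaluation output above
splits = {
    "bc_test_easy":   (0.82, 50),
    "bc_test_hard":   (0.04, 50),
    "bc_test_medium": (0.36, 50),
}

total_items = sum(n for _, n in splits.values())
overall = sum(score * n for score, n in splits.values()) / total_items
print(f"{overall:.4f}")  # → 0.4067, matching the reported Avg Score
```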
- ReAct Agent (`workflow=search`):

```bash
python scripts/eval_bc.py --workflow search [...]
```

- Summary Agent (`workflow=search`, `enable_summary`):

```bash
python scripts/eval_bc.py --workflow search --enable_summary [...]
```

### 3. Using Local LLMs (e.g., vLLM)
Start the vLLM server:

```bash
vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct --port 8001 --max-model-len 131072
```

Point the evaluation script at the local OpenAI-compatible endpoint:

```bash
export OPENAI_API_KEY='dummy'
export OPENAI_BASE_URL='http://localhost:8001/v1'
```
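These two variables work because vLLM exposes an OpenAI-compatible `/chat/completions` route, presumably consumed by `eval_bc.py` through the standard OpenAI client. A stdlib-only sketch of the equivalent raw request (the prompt and sampling parameters are placeholders, not values used by the script):

```python
import json
import os
import urllib.request

base_url = os.environ.get("OPENAI_BASE_URL", "http://localhost:8001/v1")
url = f"{base_url.rstrip('/')}/chat/completions"

# Placeholder prompt/parameters for illustration only
payload = {
    "model": "ByteDance-Seed/Seed-OSS-36B-Instruct",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', 'dummy')}",
    },
)
# With the server running: urllib.request.urlopen(req) returns the completion JSON.
```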
Run the evaluation:

```bash
python scripts/eval_bc.py \
  --model_name ByteDance-Seed/Seed-OSS-36B-Instruct \
  --workflow search_branch \
  --num_workers 32 \
  --prompt_length 16384 \
  --response_length 32768 \
  --max_turn 200 \
  --val_max_turn 200 \
  --max_session 10 \
  --val_max_session 10 \
  --output_dir results
```

## Citation

```bibtex
@article{sun2025scaling,
  title   = {Scaling Long-Horizon LLM Agent via Context-Folding},
  author  = {Sun, Weiwei and Lu, Miao and Ling, Zhan and Liu, Kang and Yao, Xuesong and Yang, Yiming and Chen, Jiecao},
  journal = {arXiv preprint arXiv:2510.11967},
  year    = {2025}
}
```
## Acknowledgement

This implementation is based on verl.