Paper: https://arxiv.org/pdf/2510.11967
Note: this is an open-source re-implementation based on `agent_loop` in verl, and it may differ from the code we used to train the model in our paper.
## Key Files

```
FoldAgent/
├── verl/
│   ├── experimental/
│   │   └── agent_loop/       # Base agent loop implementations
│   └── trainer/
│       └── ppo/              # PPO training with FoldGRPO algorithm
├── agents/
│   └── fold_agent.py         # Core agent logic (process_item)
├── envs/
│   └── local_search.py       # Local search environment
└── scripts/
    └── train_fold.py         # Training script
```
## Training

### 1. Start Search Server
Start the search server on a separate machine. This will download the corpus (Tevatron/browsecomp-plus-corpus), pre-computed embeddings (miaolu3/browsecomp-plus), and load the Qwen3-Embedding-8B model on available GPUs.
```bash
cd envs && python search_server.py \
  --model Qwen/Qwen3-Embedding-8B \
  --corpus Tevatron/browsecomp-plus-corpus \
  --corpus-embedding-dataset miaolu3/browsecomp-plus \
  --host 0.0.0.0 \
  --port 8010
```

Set environment variables:

```bash
# URL of the local search server (for BrowseComp-Plus)
export LOCAL_SEARCH_URL="http://[IP-of-search-server]:8010"

# For LLM-based answer grading
export OPENAI_API_KEY="your-api-key"
```

### 2. Download Training Data

Download and decompress the BrowseComp dataset: https://drive.google.com/file/d/1aX5xXAN5R-gLKd8A0AY-troxXJRawyAM/view?usp=sharing
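With the server running and `LOCAL_SEARCH_URL` set, client code can assemble retrieval requests against it. The sketch below is illustrative only: the `/search` endpoint path and payload keys are assumptions, not the actual `search_server.py` API (check `envs/local_search.py` for the real interface).

```python
import os

def build_search_request(query: str, top_k: int = 10):
    """Assemble a retrieval request for the local search server.

    The endpoint path and payload keys below are illustrative
    assumptions; see search_server.py / local_search.py for the real API.
    """
    base_url = os.environ.get("LOCAL_SEARCH_URL", "http://localhost:8010")
    url = f"{base_url.rstrip('/')}/search"  # hypothetical endpoint name
    payload = {"query": query, "top_k": top_k}
    return url, payload

url, payload = build_search_request("example query", top_k=5)
# With a live server, send with e.g.: requests.post(url, json=payload)
```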
### 3. Train on BrowseComp
Example script to train Qwen3-8B:
```bash
bash scripts/train_bc_qwen3_8b.sh
```

## Evaluation

### 1. Start Search Server
```bash
cd envs && python search_server.py \
  --model Qwen/Qwen3-Embedding-8B \
  --corpus Tevatron/browsecomp-plus-corpus \
  --corpus-embedding-dataset miaolu3/browsecomp-plus \
  --host 0.0.0.0 \
  --port 8000
```

### 2. Evaluate on BrowseComp
- Download and decompress data: https://drive.google.com/file/d/1aX5xXAN5R-gLKd8A0AY-troxXJRawyAM/view?usp=sharing
- Fold Agent (`workflow=search_branch`):

```bash
export OPENAI_API_KEY='your-key'
python scripts/eval_bc.py \
  --data_path data/bc_test.parquet \
  --model_name gpt-5-nano \
  --num_workers 150 \
  --workflow search_branch \
  --prompt_length 16384 \
  --response_length 32768 \
  --max_turn 200 \
  --val_max_turn 200 \
  --max_session 10 \
  --val_max_session 10 \
  --local_search_url http://localhost:8000 \
  --output_dir results
```

Output:
```
Evaluating: 100%|█████████████| 150/150 [32:52<00:00, 13.15s/item, avg_score=0.407, id=122]
============================================================
Overall - Avg Score: 0.4067, Success: 150/150
By Data Source:
  bc_test_easy: 0.8200 (50 items)
  bc_test_hard: 0.0400 (50 items)
  bc_test_medium: 0.3600 (50 items)
```
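The overall score is the item-weighted mean of the three 50-item splits, which can be checked directly from the per-split numbers above:

```python
# Per-split (average score, item count) from the evaluation output above
splits = {
    "bc_test_easy":   (0.82, 50),
    "bc_test_hard":   (0.04, 50),
    "bc_test_medium": (0.36, 50),
}

total_items = sum(n for _, n in splits.values())
overall = sum(score * n for score, n in splits.values()) / total_items
print(f"{overall:.4f}")  # → 0.4067, matching the reported Avg Score
```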
- ReAct Agent (`workflow=search`):

```bash
python scripts/eval_bc.py --workflow search [...]
```

- Summary Agent (`workflow=search`, `enable_summary`):

```bash
python scripts/eval_bc.py --workflow search --enable_summary [...]
```

### 3. Using Local LLMs (e.g., vLLM)
Start the vLLM server:

```bash
vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct --port 8001 --max-model-len 131072
```

Point the evaluation script at the local OpenAI-compatible endpoint:

```bash
export OPENAI_API_KEY='dummy'
export OPENAI_BASE_URL='http://localhost:8001/v1'
```
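These two variables work because vLLM exposes an OpenAI-compatible `/chat/completions` route, presumably consumed by `eval_bc.py` through the standard OpenAI client. A stdlib-only sketch of the equivalent raw request (the prompt and sampling parameters are placeholders, not values used by the script):

```python
import json
import os
import urllib.request

base_url = os.environ.get("OPENAI_BASE_URL", "http://localhost:8001/v1")
url = f"{base_url.rstrip('/')}/chat/completions"

# Placeholder prompt/parameters for illustration only
payload = {
    "model": "ByteDance-Seed/Seed-OSS-36B-Instruct",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', 'dummy')}",
    },
)
# With the server running: urllib.request.urlopen(req) returns the completion JSON.
```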
Run the evaluation:

```bash
python scripts/eval_bc.py \
  --model_name ByteDance-Seed/Seed-OSS-36B-Instruct \
  --workflow search_branch \
  --num_workers 32 \
  --prompt_length 16384 \
  --response_length 32768 \
  --max_turn 200 \
  --val_max_turn 200 \
  --max_session 10 \
  --val_max_session 10 \
  --output_dir results
```

## Citation

```bibtex
@article{sun2025scaling,
  title   = {Scaling Long-Horizon LLM Agent via Context-Folding},
  author  = {Sun, Weiwei and Lu, Miao and Ling, Zhan and Liu, Kang and Yao, Xuesong and Yang, Yiming and Chen, Jiecao},
  journal = {arXiv preprint arXiv:2510.11967},
  year    = {2025}
}
```
## Acknowledgement

This implementation is based on verl.