Official Repo for "EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies"

OPPO-PersonalAI/EcoGym


EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

arXiv Paper Hugging Face Dataset Python 3.10

A Generalizable Benchmark for Continuous Plan-and-Execute Decision Making in Interactive Economies

🌟 Overview

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies.

EcoGym comprises three diverse environments: Vending, Freelance, and Operation. All three are implemented as a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). Evaluation is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity.

Figure: EcoGym overview. EcoGym's design principles and three economic environments: Vending, Freelance, and Operation.
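
The day-loop with a per-day action budget described above can be pictured with a toy environment. This is a minimal sketch under stated assumptions, not EcoGym's code, which is not yet released. The class `ToyVendingEnv`, its methods `observe`, `act`, and `end_day`, the budget of 3 actions per day, the unit cost, and the demand model are all hypothetical and chosen only for illustration.

```python
import random

UNIT_COST = 1.0  # hypothetical wholesale cost per item


class ToyVendingEnv:
    """Illustrative toy economy; NOT the EcoGym implementation."""

    def __init__(self, actions_per_day=3, cash=100.0, seed=0):
        self.actions_per_day = actions_per_day  # assumed daily action budget
        self.actions_left = actions_per_day
        self.cash = cash
        self.stock = 0
        self.price = 2.0
        self.day = 0
        self.rng = random.Random(seed)

    def observe(self):
        # Partial observability: the demand model is never exposed.
        return {"day": self.day, "cash": self.cash, "stock": self.stock,
                "price": self.price, "actions_left": self.actions_left}

    def act(self, name, value):
        # Every action consumes one unit of the daily budget.
        if self.actions_left <= 0:
            raise RuntimeError("daily action budget exhausted")
        self.actions_left -= 1
        if name == "restock":
            units = min(int(value), int(self.cash // UNIT_COST))
            self.stock += units
            self.cash -= units * UNIT_COST
        elif name == "set_price":
            self.price = max(0.1, float(value))
        else:
            raise ValueError(f"unknown action: {name}")

    def end_day(self):
        # Stochastic demand whose upper bound shrinks as the price rises.
        demand = self.rng.randint(0, max(0, int(20 - 4 * self.price)))
        sold = min(demand, self.stock)
        self.stock -= sold
        self.cash += sold * self.price
        self.day += 1
        self.actions_left = self.actions_per_day  # budget resets each day

    def net_worth(self):
        return self.cash + self.stock * UNIT_COST


def simple_policy(obs):
    # Hypothetical rule-based stand-in for an LLM agent.
    if obs["stock"] < 10:
        return ("restock", 10)
    return ("set_price", 2.0)


def run_episode(env, policy, days=365):
    """Run `days` day-loops and return the number of actions taken."""
    steps = 0
    for _ in range(days):
        while env.actions_left > 0:
            action = policy(env.observe())
            if action is None:  # the agent may end its day early
                break
            env.act(*action)
            steps += 1
        env.end_day()
    return steps


if __name__ == "__main__":
    env = ToyVendingEnv()
    steps = run_episode(env, simple_policy)
    print(f"days={env.day} steps={steps} net_worth={env.net_worth():.2f}")
```

With the assumed budget of 3 actions per day, a 365-day run takes 365 × 3 = 1,095 actions. That is one way a year of day-loops can exceed 1000 steps; the actual per-day budgets in EcoGym may differ.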

Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models are significantly suboptimal in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability–utility trade-offs in realistic economic settings.

📊 Experimental Results

Our empirical evaluation on EcoGym reveals a significant performance gap in current LLMs: no single model consistently achieves superior performance across all scenarios, highlighting the inherent difficulty of long-horizon economic decision-making. Critically, we find that models are significantly suboptimal in either high-level strategy or efficient action execution. Furthermore, we conduct a suite of eight diagnostic experiments and case studies covering factors such as context window length, agent behavior patterns, additional memory modules, and human baselines.

Figure: Performance comparison across eleven leading LLMs in the three EcoGym environments (panels: Vending, Freelance, Operation).

📦 Code Availability

Note: The source code for EcoGym is currently undergoing an internal compliance review and approval process. We will make the code publicly available as soon as the review is complete. Thank you for your patience and understanding.

For updates on code release, please check back later or watch this repository for notifications.

📝 Citation

If you find this work useful, please cite:

@misc{hu2026ecogymevaluatingllmslonghorizon,
      title={EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies}, 
      author={Xavier Hu and Jinxiang Xia and Shengze Xu and Kangqi Song and Yishuo Yuan and Guibin Zhang and Jincheng Ren and Boyu Feng and Li Lu and Tieyong Zeng and Jiaheng Liu and Minghao Liu and Yuchen Elenor Jiang and Wei Wang and He Zhu and Wangchunshu Zhou},
      year={2026},
      eprint={2602.09514},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.09514}, 
}

🙏 Acknowledgements

This project is adapted from Agno, a framework for building multi-agent systems that learn and improve with every interaction.
