EasyLocomo is a streamlined, easy-to-use version of the evaluation framework for the LoCoMo (Long-term Conversational Memory) benchmark.
This repository adapts the original logic and data from the paper "Evaluating Very Long-Term Conversational Memory of LLM Agents" (ACL 2024). Evaluation results remain consistent with the original authors' repository, while the workflow is much simpler: any LLM reachable through an OpenAI-compatible API can be tested.
- Result Consistency: Uses the same data and evaluation logic as the original LoCoMo project. Consistency of results has been verified using GPT-4o-mini. See release 0.1.0 for details.
- Simplified Setup: No complex bash scripts or environment setup. Optimized for uv and standard Python environments.
- OpenAI API Compatibility: Call any LLM that supports the OpenAI API format (e.g., GPT-4o, GPT-4o-mini, Claude via proxy, DeepSeek, or local models via Ollama/vLLM).
- Flexible Configuration: Easily set your API key, base URL, and model name.
- Breakpoint Resumption: Automatically saves progress after each sample/batch and skips already predicted samples, allowing for reliable long-running evaluations.
- JSON Mode & Robust Parsing: Uses OpenAI's JSON mode for structured outputs and includes cleaning logic (stripping reasoning traces and markdown code fences) to ensure high parsing success rates. A sketch of how these two features fit together follows this list.
- Error Logging: Detailed parsing errors are logged to a separate *_errors.jsonl file for easy debugging and model output analysis.
- Automatic Reporting: Automatically generates performance statistics (Accuracy, BERTScore, etc.) and summaries of the results.
- Token Estimation: Includes a utility script to estimate the token count of the evaluation dataset to help manage costs.
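To make the Breakpoint Resumption and JSON Mode & Robust Parsing features concrete, here is a minimal sketch of how such a loop could look. It is not the repository's actual code: the helper names (`clean_json_output`, `evaluate`), the sample fields (`id`, `prompt`), and the output path are illustrative assumptions; only the `chat.completions.create` call with `response_format={"type": "json_object"}` reflects the real OpenAI client API.

```python
# Illustrative sketch only -- not EasyLocomo's actual implementation.
import json
import re
from pathlib import Path

from openai import OpenAI  # official openai>=1.x client

client = OpenAI()  # or OpenAI(api_key=..., base_url=...) as shown in the configuration section below


def clean_json_output(raw: str) -> dict:
    """Strip reasoning traces and markdown fences before parsing JSON."""
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)  # drop reasoning blocks
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())    # drop code fences
    return json.loads(text)


def evaluate(samples, model="gpt-4o-mini", out_path="outputs/predictions.json"):
    """Skip already-predicted samples and checkpoint after every prediction."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    done = json.loads(path.read_text()) if path.exists() else {}

    for sample in samples:
        key = str(sample["id"])
        if key in done:  # resume: this sample was answered in a previous run
            continue
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": sample["prompt"]}],
            response_format={"type": "json_object"},  # JSON mode (the prompt must mention JSON)
        )
        done[key] = clean_json_output(resp.choices[0].message.content)
        path.write_text(json.dumps(done, indent=2))  # checkpoint after each sample
```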
Clone the repository and install the dependencies. We recommend using uv for extremely fast setup:
```bash
# Using uv (Recommended)
uv sync

# Or using standard pip
pip install -r requirements.txt
```

You can configure your API credentials by creating a .env file in the root directory:
```
OPENAI_API_KEY=your_api_key_here
OPENAI_API_BASE=https://api.openai.com/v1
```

Or you can pass them directly in the run_evaluation.py script.
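For example, a client pointed at any OpenAI-compatible endpoint could be constructed from those two variables as follows. This is a hedged sketch, not a copy of run_evaluation.py; it assumes python-dotenv and the openai>=1.x package are installed:

```python
# Illustrative sketch: build an OpenAI-compatible client from the .env values above.
import os

from dotenv import load_dotenv  # python-dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY and OPENAI_API_BASE from .env

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_API_BASE"),  # e.g. an Ollama, vLLM, or proxy endpoint
)
```

Changing OPENAI_API_BASE is all that is needed to target local models served via Ollama or vLLM, or hosted providers that speak the OpenAI format.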
Simply run the run_evaluation.py script:
```bash
# Using uv
uv run run_evaluation.py

# Or using standard python
python run_evaluation.py
```

By default, this will evaluate the model on the data/locomo10.json dataset. Results, including predictions and statistical reports, will be saved in the outputs/ directory.
After running the evaluation, you will find the following files in the outputs/ directory:
- [model_name]_qa.json: The model's predictions.
- [model_name]_qa_stats.json: Detailed accuracy metrics (Overall, Session-level, etc.).
- [model_name]_qa_summary.json: A human-readable summary of the evaluation results.
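The exact schema of the stats file is easiest to discover by loading it. A minimal sketch, assuming gpt-4o-mini as the model name and the default outputs/ directory:

```python
# Illustrative sketch: load and pretty-print the generated report.
import json
from pathlib import Path

stats = json.loads(Path("outputs/gpt-4o-mini_qa_stats.json").read_text())
print(json.dumps(stats, indent=2))  # accuracy metrics, per the file pattern above
```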
This project is built upon the work by Maharana et al. (ACL 2024). Please cite the original paper if you use this benchmark:
```bibtex
@inproceedings{maharana2024locomo,
  title={Evaluating Very Long-Term Conversational Memory of LLM Agents},
  author={Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2024}
}
```

Original Repository: snap-research/locomo
You can customize the evaluation parameters in run_evaluation.py:
```python
run_test(
    model_name="gpt-4o-mini",
    batch_size=15,
    max_context=65536,
    data_file="data/locomo10.json",
    category=1,
    overwrite=False
)
```

- model_name: The identifier of the model to test.
- batch_size: Number of concurrent API calls.
- max_context: Maximum context length (tokens) passed to the model.
- data_file: Path to the evaluation dataset.
- category: (Optional) Filter evaluation for a specific category (1-5). Useful for re-testing specific subsets.
- overwrite: Whether to re-run evaluations for already predicted samples.
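For example, to re-check a single category after a change, you could edit the run_test(...) call in run_evaluation.py like this. The parameter values are just one plausible configuration, not a recommendation from the original project:

```python
# Illustrative configuration: re-run only category 3 and overwrite its existing predictions.
run_test(
    model_name="gpt-4o-mini",
    batch_size=5,             # lower concurrency for rate-limited endpoints
    max_context=65536,
    data_file="data/locomo10.json",
    category=3,
    overwrite=True,           # regenerate predictions that already exist
)
```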
This project follows the licensing of the original LoCoMo repository. See LICENSE for details.