LoCoMo Benchmark Results - Significant Accuracy Gap
Issue: EverMemOS achieves 38.38% accuracy vs paper's claimed 93% on LoCoMo benchmark
Environment
- OS: Windows 10
- Python: 3.12.7
- Docker: 28.1.1 (MongoDB, Elasticsearch, Milvus, Redis)
- Dependencies: installed via `uv sync --group evaluation`
Configuration
Models:
- LLM: `openai/gpt-4.1-mini` (OpenRouter, temp=0.3)
- Embedding: `Qwen/Qwen3-Embedding-4B` (DeepInfra, dim=1024)
- Reranker: `Qwen/Qwen3-Reranker-4B` (DeepInfra)
- Search mode: `agentic`
Commands Run
# Start services
docker-compose up -d
# Smoke test (30 questions, 10 messages/conv)
uv run python -m evaluation.cli --dataset locomo --system evermemos --smoke
# Full conv-26 (152 questions, 419 messages)
uv run python -m evaluation.cli --dataset locomo --system evermemos --from-conv 0 --to-conv 1
Results
| Test | Messages | Questions | Accuracy | vs Paper |
|---|---|---|---|---|
| Paper (LoCoMo) | All | 1,986 | 93.0% | - |
| Smoke test | 10/conv | 30 | 52.22% | -40.78% |
| Conv-26 (full) | 419 | 152 | 38.38% | -54.62% |
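For clarity, the "vs Paper" column is simply the difference from the paper's 93.0% overall accuracy; a quick sanity check of the arithmetic:

```python
# Sanity-check the "vs Paper" deltas against the paper's reported 93.0% overall accuracy.
PAPER_ACCURACY = 93.0

measured = {"smoke": 52.22, "conv26": 38.38}
for name, acc in measured.items():
    delta = round(acc - PAPER_ACCURACY, 2)
    print(f"{name}: {delta:+.2f}%")
# smoke: -40.78%, conv26: -54.62%
```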
Category Breakdown (Smoke Test)
- Single-hop: 41.67% (vs 96.08% in paper)
- Multi-hop: 54.76% (vs 91.13% in paper)
- Temporal: 66.67% (vs 89.72% in paper)
- Open domain: 100% (vs 70.83% in paper)
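The per-category numbers above come from grouping per-question judgments by category. A minimal sketch of that aggregation, assuming a hypothetical `records` list with `category` and `correct` fields (the real evaluation output format may differ):

```python
from collections import defaultdict

# Hypothetical per-question judgments; the actual evaluation output schema may differ.
records = [
    {"category": "single_hop", "correct": True},
    {"category": "single_hop", "correct": False},
    {"category": "temporal", "correct": True},
]

totals = defaultdict(lambda: [0, 0])  # category -> [correct count, total count]
for r in records:
    totals[r["category"]][0] += int(r["correct"])
    totals[r["category"]][1] += 1

breakdown = {cat: 100 * c / t for cat, (c, t) in totals.items()}
print(breakdown)  # {'single_hop': 50.0, 'temporal': 100.0}
```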
Key Findings
- Performance degrades with more context:
  - 10 messages: 52.22%
  - 419 messages: 38.38% (-13.84%)
- Only 1/10 conversations tested so far (7.66% of the full benchmark)
- Possible causes:
  - Retrieval struggles with large memory banks
  - Memory consolidation losing information
  - Different evaluation methodology or configuration
Questions for Authors
Could you share more details on how to reproduce the paper's results (e.g. the exact evaluation configuration and scoring)? Thanks!