TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
📄 Paper | 🤗 Dataset | 🏠 Project Website
Fangxu Yu1, Xingang Guo2, Lingzhi Yuan1, Haoqiang Kang3, Hongyu Zhao1, Lianhui Qin3, Furong Huang1, Bin Hu2, Tianyi Zhou4
1University of Maryland, College Park 2University of Illinois Urbana-Champaign 3University of California, San Diego 4Mohamed Bin Zayed University of Artificial Intelligence
TSRBench is a large-scale, comprehensive benchmark designed to stress-test the time series understanding and reasoning capabilities of generalist models (LLMs, VLMs, and TSLLMs). Time series data pervades real-world environments and underpins decision-making in high-stakes domains such as finance, healthcare, and industrial systems. However, existing benchmarks often treat time series as isolated numerical sequences, stripping away the semantic context essential for complex problem-solving, or focus solely on surface-level pattern recognition.
TSRBench is more than a benchmark: it is a multifaceted, standardized evaluation platform that uncovers the current challenges in time series reasoning and provides actionable insights for pushing the field forward.
- 🛠️ Comprehensive Taxonomy & Scale: TSRBench organizes capabilities into 4 major dimensions (Perception, Reasoning, Prediction, Decision-Making) spanning 15 specific tasks, with 4,125 problems drawn from 13 diverse domains.
- 🎯 Native Multi-Modal Support: Designed for generalist models, TSRBench supports four distinct modalities: text, image, text-image interleaved, and time series embeddings.
- 🏹 Unified Evaluation Pipeline (API & Local): We provide a standardized setup to evaluate a wide range of models effortlessly:
  - Proprietary Models: Seamless integration with APIs (e.g., GPT-5, Gemini-2.5, DeepSeek).
  - Open-Source Models: Local execution support via vLLM for efficient inference.
- 🔍 Fine-Grained Capability Assessment: TSRBench reports performance per dimension and per task, exposing exactly which complex cognitive abilities a model lacks rather than a single aggregate score.
| Benchmark | Multi-Domain | # Tasks | # Questions | Multivariate | Perception | Reasoning | Prediction | Decision | Modality |
|---|---|---|---|---|---|---|---|---|---|
| TimeMMD | ✅ | 1 | 16K | ✅ | ❌ | ❌ | ✅ | ❌ | T |
| CiK | ✅ | 1 | 0.3K | ❌ | ❌ | ❌ | ✅ | ❌ | T |
| TimeSeriesExam | ❌ | 5 | 0.7K | ❌ | ✅ | ❌ | ❌ | ❌ | T, V |
| MTBench | ✅ | 4 | 2.4K | ❌ | ❌ | ✅ | ❌ | ❌ | T |
| EngineMT-QA | ❌ | 4 | 11K | ✅ | ✅ | ✅ | ❌ | ✅ | T |
| SciTS | ✅ | 7 | 51K | ✅ | ✅ | ❌ | ✅ | ❌ | T |
| TimeMQA | ✅ | 5 | 200K | ❌ | ✅ | ❌ | ❌ | ❌ | T |
| TSR-SUITE | ✅ | 4 | 4K | ❌ | ❌ | ✅ | ✅ | ✅ | T |
| TSRBench (Ours) | ✅ | 15 | 4.1K | ✅ | ✅ | ✅ | ✅ | ✅ | T, V, T+V |

Modality abbreviations: T = text, V = vision (images), T+V = interleaved text and images.
Download repo

```bash
git clone git@github.com:tianyi-lab/TSRBench.git
cd TSRBench
```

Install vLLM for local inference

```bash
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
uv pip install vllm
```
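The open-source inference scripts below drive vLLM for you; purely as an orientation, a minimal sketch of vLLM offline inference looks like the following (the model name and prompt are placeholders, not necessarily what the benchmark uses):

```python
# Minimal vLLM offline-inference sketch; model name and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # hypothetical example model
params = SamplingParams(temperature=0.0, max_tokens=512)

prompt = "Given the series [1.2, 1.5, 1.9, 2.4], is the trend increasing or decreasing?"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```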
Install openai for API inference

```bash
pip install openai==2.2.0
```
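The API scripts take a base URL and key as arguments; a minimal sketch of the kind of call they issue (the model name and prompt here are illustrative, not the benchmark's exact format):

```python
# Minimal OpenAI-compatible API call; base URL, key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="your_oai_api_base_url", api_key="your_oai_api_key")
resp = client.chat.completions.create(
    model="gpt-5",  # any model served behind the endpoint
    messages=[{"role": "user", "content": "Describe the trend of [3.1, 2.8, 2.2, 1.7]."}],
)
print(resp.choices[0].message.content)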
To evaluate textual time series with LLMs, run:

```bash
bash inference/text_gpt/text_inference.sh "your_oai_api_base_url" "your_oai_api_key"
```
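In the textual setting, the raw series is serialized into the prompt. As an illustration only, one common serialization pattern (TSRBench's actual prompt format may differ) is:

```python
# Illustrative only: one common way to serialize a series into a textual prompt.
# The benchmark's actual prompt template may differ.
values = [102.3, 101.8, 99.5, 97.2, 96.9]

prompt = (
    "You are given a daily time series: "
    + ", ".join(f"{v:.1f}" for v in values)
    + ". Answer: is the overall trend upward or downward?"
)
print(prompt)
```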
To evaluate visual time series with VLMs, run:

```bash
bash inference/vision_gpt/vision_inference.sh "your_oai_api_base_url" "your_oai_api_key"
```
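The vision track feeds the model a rendered plot of the series. A sketch of producing such an image (using matplotlib and a base64 data URL is our assumption; the benchmark ships its own images):

```python
# Sketch: render a series to a PNG and base64-encode it for a vision API.
# matplotlib is an assumption here; the benchmark provides its own plot images.
import base64
import io

import matplotlib.pyplot as plt

values = [102.3, 101.8, 99.5, 97.2, 96.9]
fig, ax = plt.subplots()
ax.plot(values)
ax.set_xlabel("time step")
ax.set_ylabel("value")

buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
image_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
```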
To input both textual and visual time series to VLMs, run:

```bash
bash inference/multimodal_gpt/multimodal_inference.sh "your_oai_api_base_url" "your_oai_api_key"
```
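In the interleaved setting, a single request mixes text and image parts. With the OpenAI chat format, such a request looks roughly like the following (endpoint, model, and the image data URL are placeholders):

```python
# Sketch of a text-image interleaved request in the OpenAI chat format.
# The data URL placeholder would carry a base64 PNG, e.g. built as in the vision sketch.
from openai import OpenAI

client = OpenAI(base_url="your_oai_api_base_url", api_key="your_oai_api_key")
image_url = "data:image/png;base64,..."  # placeholder

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "The numeric series is: 102.3, 101.8, 99.5, 97.2, 96.9."},
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": "Using both the numbers and the plot, describe the trend."},
        ],
    }
]
resp = client.chat.completions.create(model="gpt-5", messages=messages)
print(resp.choices[0].message.content)
```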
To evaluate textual time series with open-source LLMs, run:

```bash
bash inference/text_opensource/text_inference.sh
```

To evaluate visual time series with open-source VLMs, run:

```bash
bash inference/vision_opensource/vision_inference.sh
```
You can add more models by extending the model lists in the `*.sh` files.
Citation

```bibtex
@article{yu2026tsrbench,
  title={TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models},
  author={Yu, Fangxu and Guo, Xingang and Yuan, Lingzhi and Kang, Haoqiang and Zhao, Hongyu and Qin, Lianhui and Huang, Furong and Hu, Bin and Zhou, Tianyi},
  journal={arXiv preprint arXiv:2601.18744},
  year={2026}
}
```
