TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
📄 Paper | 🤗 Dataset | 🏠 Project Website
Fangxu Yu1, Xingang Guo2, Lingzhi Yuan1, Haoqiang Kang3, Hongyu Zhao1, Lianhui Qin3, Furong Huang1, Bin Hu2, Tianyi Zhou4
1University of Maryland, College Park 2University of Illinois Urbana-Champaign 3University of California, San Diego 4Mohamed Bin Zayed University of Artificial Intelligence
TSRBench is a large-scale, comprehensive benchmark designed to stress-test the time series understanding and reasoning capabilities of generalist models (LLMs, VLMs, and TSLLMs). Time series data pervades real-world environments and underpins decision-making in high-stakes domains such as finance, healthcare, and industrial systems. However, existing benchmarks often treat time series as isolated numerical sequences, stripping away the semantic context essential for complex problem-solving, or focus solely on surface-level pattern recognition.
TSRBench is more than a benchmark: it is a multifaceted, standardized evaluation platform that uncovers the current challenges in time series reasoning and provides actionable insights for pushing the field forward.
- 🛠️ Comprehensive Taxonomy & Scale: TSRBench organizes capabilities into 4 major dimensions (Perception, Reasoning, Prediction, Decision-Making) spanning 15 specific tasks, with 4,125 problems drawn from 13 diverse domains.
- 🎯 Native Multi-Modal Support: Designed for generalist models, TSRBench supports four distinct modalities: text, image, text-image interleaved, and time series embeddings.
- 🏹 Unified Evaluation Pipeline (API & Local): We provide a standardized setup to evaluate a wide range of models effortlessly:
  - Proprietary Models: Seamless integration with APIs (e.g., GPT-5, Gemini-2.5, DeepSeek).
  - Open-Source Models: Local execution support via vLLM for efficient inference.
- 🔍 Fine-Grained Capability Assessment: TSRBench reports performance per dimension and per task, exposing exactly which complex cognitive abilities a model lacks rather than a single aggregate score.
| Benchmark | Multi-Domain | # Tasks | # Questions | Multivariate | Perception | Reasoning | Prediction | Decision | Modality |
|---|---|---|---|---|---|---|---|---|---|
| TimeMMD | ✅ | 1 | 16K | ✅ | ❌ | ❌ | ✅ | ❌ | T |
| CiK | ✅ | 1 | 0.3K | ❌ | ❌ | ❌ | ✅ | ❌ | T |
| TimeSeriesExam | ❌ | 5 | 0.7K | ❌ | ✅ | ❌ | ❌ | ❌ | T, V |
| MTBench | ✅ | 4 | 2.4K | ❌ | ❌ | ✅ | ❌ | ❌ | T |
| EngineMT-QA | ❌ | 4 | 11K | ✅ | ✅ | ✅ | ❌ | ✅ | T |
| SciTS | ✅ | 7 | 51K | ✅ | ✅ | ❌ | ✅ | ❌ | T |
| TimeMQA | ✅ | 5 | 200K | ❌ | ✅ | ❌ | ❌ | ❌ | T |
| TSR-SUITE | ✅ | 4 | 4K | ❌ | ❌ | ✅ | ✅ | ✅ | T |
| TSRBench (Ours) | ✅ | 15 | 4.1K | ✅ | ✅ | ✅ | ✅ | ✅ | T, V, T+V |

Modality abbreviations: T = text, V = vision (images), T+V = interleaved text and images.
Download repo

```bash
git clone git@github.com:tianyi-lab/TSRBench.git
cd TSRBench
```

Install vLLM for local inference

```bash
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
uv pip install vllm
```
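The open-source inference scripts below drive vLLM for you; purely as an orientation, a minimal sketch of vLLM offline inference looks like the following (the model name and prompt are placeholders, not necessarily what the benchmark uses):

```python
# Minimal vLLM offline-inference sketch; model name and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # hypothetical example model
params = SamplingParams(temperature=0.0, max_tokens=512)

prompt = "Given the series [1.2, 1.5, 1.9, 2.4], is the trend increasing or decreasing?"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```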
Install openai for API inference

```bash
pip install openai==2.2.0
```
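The API scripts take a base URL and key as arguments; a minimal sketch of the kind of call they issue (the model name and prompt here are illustrative, not the benchmark's exact format):

```python
# Minimal OpenAI-compatible API call; base URL, key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="your_oai_api_base_url", api_key="your_oai_api_key")
resp = client.chat.completions.create(
    model="gpt-5",  # any model served behind the endpoint
    messages=[{"role": "user", "content": "Describe the trend of [3.1, 2.8, 2.2, 1.7]."}],
)
print(resp.choices[0].message.content)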
To evaluate textual time series with LLMs, run:

```bash
bash inference/text_gpt/text_inference.sh "your_oai_api_base_url" "your_oai_api_key"
```
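In the textual setting, the raw series is serialized into the prompt. As an illustration only, one common serialization pattern (TSRBench's actual prompt format may differ) is:

```python
# Illustrative only: one common way to serialize a series into a textual prompt.
# The benchmark's actual prompt template may differ.
values = [102.3, 101.8, 99.5, 97.2, 96.9]

prompt = (
    "You are given a daily time series: "
    + ", ".join(f"{v:.1f}" for v in values)
    + ". Answer: is the overall trend upward or downward?"
)
print(prompt)
```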
To evaluate visual time series with VLMs, run:

```bash
bash inference/vision_gpt/vision_inference.sh "your_oai_api_base_url" "your_oai_api_key"
```
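The vision track feeds the model a rendered plot of the series. A sketch of producing such an image (using matplotlib and a base64 data URL is our assumption; the benchmark ships its own images):

```python
# Sketch: render a series to a PNG and base64-encode it for a vision API.
# matplotlib is an assumption here; the benchmark provides its own plot images.
import base64
import io

import matplotlib.pyplot as plt

values = [102.3, 101.8, 99.5, 97.2, 96.9]
fig, ax = plt.subplots()
ax.plot(values)
ax.set_xlabel("time step")
ax.set_ylabel("value")

buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
image_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
```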
To input both textual and visual time series to VLMs, run:

```bash
bash inference/multimodal_gpt/multimodal_inference.sh "your_oai_api_base_url" "your_oai_api_key"
```
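In the interleaved setting, a single request mixes text and image parts. With the OpenAI chat format, such a request looks roughly like the following (endpoint, model, and the image data URL are placeholders):

```python
# Sketch of a text-image interleaved request in the OpenAI chat format.
# The data URL placeholder would carry a base64 PNG, e.g. built as in the vision sketch.
from openai import OpenAI

client = OpenAI(base_url="your_oai_api_base_url", api_key="your_oai_api_key")
image_url = "data:image/png;base64,..."  # placeholder

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "The numeric series is: 102.3, 101.8, 99.5, 97.2, 96.9."},
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": "Using both the numbers and the plot, describe the trend."},
        ],
    }
]
resp = client.chat.completions.create(model="gpt-5", messages=messages)
print(resp.choices[0].message.content)
```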
To evaluate textual time series with open-source LLMs, run:

```bash
bash inference/text_opensource/text_inference.sh
```

To evaluate visual time series with open-source VLMs, run:

```bash
bash inference/vision_opensource/vision_inference.sh
```
You can add more models by extending the model lists in the `*.sh` files.
Citation

```bibtex
@article{yu2026tsrbench,
  title={TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models},
  author={Yu, Fangxu and Guo, Xingang and Yuan, Lingzhi and Kang, Haoqiang and Zhao, Hongyu and Qin, Lianhui and Huang, Furong and Hu, Bin and Zhou, Tianyi},
  journal={arXiv preprint arXiv:2601.18744},
  year={2026}
}
```
