
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

🌐 Project Page    |    📑 Paper    |    🤗 Model


📰 News

  • [2026.01.04] Released visualization utilities for inspecting attention scores in both the attention and Mamba layers of the LLM (see visualize/nano_attention_visualization_cookbook.ipynb).
  • [2025.12.11] Added support for combining any two visual encoders, e.g., DINOv2, InternVideo2, and SigLIP2.
  • [2025.11.25] Released models with Nano or Qwen backbones, along with evaluation code for MCQ (VideoMME, LVBench, MLVU, LongVideoBench, EgoSchema, MVBench, TempCompass, CGBench), TVG (Charades, ActivityNet, TVGBench), VDC (VDC), and DVC (YouCook2) benchmarks.
  • [2025.11.21] 🚀 Initial release of the TimeViper repository. The paper is available on arXiv.

📖 Introduction

We present TimeViper, a hybrid Mamba-Transformer vision-language model for efficient long video understanding. We introduce TransV, the first token-transfer module that compresses vision tokens into text tokens inside the LLM, enabling the model to process over 10,000 frames.
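
The precise TransV design is described in the paper; as a rough conceptual illustration only (not the repository's implementation, and with hypothetical names, shapes, and attention form), token transfer can be pictured as text tokens pulling in vision-token information once, after which the long vision sequence is discarded:

# Conceptual sketch of a token-transfer step. This is NOT the actual TransV
# implementation; shapes, names, and the attention form are illustrative only.
import torch

def transfer_vision_to_text(vision_tokens, text_tokens):
    """Fold vision-token information into text tokens, then discard the
    vision tokens so deeper layers see a much shorter sequence.

    vision_tokens: [num_vision, dim]  e.g. tens of thousands of frame tokens
    text_tokens:   [num_text, dim]    e.g. a few hundred prompt tokens
    returns:       [num_text, dim]
    """
    scale = text_tokens.shape[-1] ** -0.5
    # Text tokens act as queries; vision tokens act as keys and values.
    attn = torch.softmax(text_tokens @ vision_tokens.T * scale, dim=-1)
    return text_tokens + attn @ vision_tokens

# 10,000 frames x 4 tokens/frame collapsed into 256 text tokens (toy numbers).
vision = torch.randn(10_000 * 4, 256)
text = torch.randn(256, 256)
print(transfer_vision_to_text(vision, text).shape)  # torch.Size([256, 256])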

✨ Key Features

  • Hybrid MLLM Architecture
    Integrates a native hybrid Mamba-Transformer LLM for long video understanding.

  • Efficient Long Video Processing
    Capable of handling over 10K frames with significantly lower memory cost compared to standard Video-LLMs.

  • Flexible Backbones
    Supports MLLM construction with Transformer-based LLM backbones such as Qwen2.5 or hybrid LLMs like Nanov2.

  • Advanced Techniques
    Includes token dropping (TD) and token transfer (TransV) for token compression during training (see the back-of-the-envelope sketch after this list).
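
To make the scale concrete, here is a back-of-the-envelope token budget. The tokens-per-frame, keep-ratio, and prompt-length figures are assumptions for illustration, not the model's actual values (see USAGE.md and the paper for the real configuration):

# Rough token budget for a 10K-frame video. All per-frame/ratio numbers below
# are hypothetical; the real values depend on the encoder and configuration.
frames = 10_000
tokens_per_frame = 16                      # hypothetical
visual_tokens = frames * tokens_per_frame
print(f"raw visual tokens:    {visual_tokens:,}")                    # 160,000
keep_ratio = 0.25                          # hypothetical token-dropping ratio
print(f"after token dropping: {int(visual_tokens * keep_ratio):,}")  # 40,000
# With token transfer (TransV), vision-token information is folded into the
# text tokens inside the LLM, so deep layers only carry the short text sequence.
text_tokens = 256                          # hypothetical prompt length
print(f"after token transfer: {text_tokens:,}")                      # 256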


📝 TODO List

  • Add inference code
  • Add model code
  • Add training code
  • Release model weights
  • Add detailed instructions for preparing data, environment, evaluation, and training
  • Support training with Qwen and Nano backbones
  • Support pdrop and TransV for both training and evaluation

🐍 Model Zoo

Model                   Backbone    Max Frames   Checkpoint
TimeViper-9B            Nanov2-9B   5k           Coming Soon
TimeViper-9B-w/TransV   Nanov2-9B   10k+         Coming Soon

🛠️ Installation

We provide comprehensive documentation for setting up TimeViper. Please follow these guides in order:

  1. INSTALL.md - Environment Setup

    • Install dependencies and required packages
    • Configure CUDA, PyTorch, and other system requirements
    • Set up the Python virtual environment
  2. MODEL.md - Model Checkpoint Download

    • Download ViT backbone and LLM backbone checkpoints from Hugging Face
    • Automated download script for all required models
    • Verification of checkpoint integrity
  3. DATA.md - Dataset Preparation

    • Prepare training and evaluation datasets
    • Instructions for downloading benchmark datasets
    • Data directory structure and format specifications
  4. USAGE.md - Usage Guide

    • Configure visual encoders, MLPs, and LLM backbones
    • Enable training-free token dropping (pdrop)
    • Advanced model configuration options

🚀 Quick Start

Training

Coming Soon.

Evaluation

# evaluate
python evaluate.py \
    --dataset videomme \
    --split test \
    --output_dir ./output/timeviper_base/videomme \
    --curr_idx 0 --total_idx 1
# calculate metrics 
python eval/vllm_inference/eval_all.py --model_name $MODEL_NAME --split $CURRENT_SPLIT --dataset $CURRENT_EVAL_DATASET --max_num_frames $MAX_FRAME_NUM \
        --eval_root $PART_OUTPUT_DIR
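
The --curr_idx / --total_idx flags appear to select one shard of the benchmark out of a total number of shards (our reading of the command above, not documented here). Under that assumption, a minimal Python launcher for running all shards in parallel, one per GPU, could look like this sketch:

# Sketch of a sharded-evaluation launcher. ASSUMPTION: --curr_idx / --total_idx
# pick one slice of the benchmark per process; adjust the paths and shard count
# to your setup before use.
import os
import subprocess

NUM_SHARDS = 8  # e.g. one shard per GPU
procs = []
for idx in range(NUM_SHARDS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(idx))  # pin shard idx to GPU idx
    procs.append(subprocess.Popen(
        ["python", "evaluate.py",
         "--dataset", "videomme",
         "--split", "test",
         "--output_dir", "./output/timeviper_base/videomme",
         "--curr_idx", str(idx),
         "--total_idx", str(NUM_SHARDS)],
        env=env,
    ))
for p in procs:
    p.wait()
# Then aggregate the shard outputs with eval/vllm_inference/eval_all.py as shown above.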

🔍 Visualization: Attention Scores in Attention and Mamba-2 Layers

We release utilities to visualize internal token interactions inside the LLM, covering both Transformer attention layers and Mamba/SSM layers.

This is useful for understanding how the model allocates computation across video frames, visual tokens, and text tokens.

Notebook

  • visualize/nano_attention_visualization_cookbook.ipynb

The notebook provides:

  • Extraction of per-layer signals for attention layers and Mamba/SSM layers.
  • Side-by-side plots across depth (early → mid → late layers).
  • Examples that highlight how interactions change with long video inputs.
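
The notebook is the authoritative reference for extracting these signals; the snippet below is only a generic sketch of plotting one extracted interaction map, with hypothetical variable names and shapes:

# Generic plotting sketch. `attn_map` stands in for a [num_query, num_key]
# array produced by the notebook's extraction utilities; here it is random
# placeholder data so the snippet runs standalone.
import numpy as np
import matplotlib.pyplot as plt

attn_map = np.random.rand(64, 512)  # placeholder: text queries x (visual + text) keys

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(attn_map, aspect="auto", interpolation="nearest", cmap="viridis")
ax.set_xlabel("key position (visual tokens, then text tokens)")
ax.set_ylabel("query position")
ax.set_title("Per-layer interaction map (attention or Mamba/SSM)")
fig.colorbar(im, ax=ax, label="score")
plt.show()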

📄 License

This project is released under the Apache 2.0 License.

📚 Citation

If you find TimeViper useful for your research and applications, please cite our paper:

@article{xu2025timeviper,
  title={TimeViper: A Hybrid Mamba-Transformer Model for Efficient Long Video Understanding},
  author={Xu, Boshen and Xiao, Zihan and Li, Jiaze and Ju, Jianzhong and Luo, Zhenbo and Luan, Jian and Jin, Qin},
  journal={arXiv preprint arXiv:2511.16595},
  year={2025}
}

🙏 Acknowledgement

We thank the following open-source projects for their contributions: Cobra, Vamba, transformers, vllm, mamba, Time-R1, VideoChat-Flash.
