
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

🌐 Project Page    |    📑 Paper    |    🤗 Model


📰 News

  • [2026.01.04] Released visualization utilities for inspecting attention scores in both the attention and Mamba layers of the LLM (see visualize/nano_attention_visualization_cookbook.ipynb).
  • [2025.12.11] Added support for combining any two visual encoders, e.g., DINOv2, InternVideo2, and SigLIP2.
  • [2025.11.25] Released models with Nano or Qwen backbones, along with evaluation code for MCQ (VideoMME, LVBench, MLVU, LongVideoBench, EgoSchema, MVBench, TempCompass, CGBench), TVG (Charades, ActivityNet, TVGBench), VDC (VDC), and DVC (YouCook2) benchmarks.
  • [2025.11.21] 🚀 Initial release of the TimeViper repository. The paper is available on arXiv.

📖 Introduction

We present TimeViper, a hybrid Mamba-Transformer vision-language model for efficient long video understanding. We introduce TransV, the first token-transfer module that compresses vision tokens into text tokens inside the LLM, enabling the model to process over 10,000 frames.
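
The precise TransV design is described in the paper; as a rough conceptual illustration only (not the repository's implementation, and with hypothetical names, shapes, and attention form), token transfer can be pictured as text tokens pulling in vision-token information once, after which the long vision sequence is discarded:

# Conceptual sketch of a token-transfer step. This is NOT the actual TransV
# implementation; shapes, names, and the attention form are illustrative only.
import torch

def transfer_vision_to_text(vision_tokens, text_tokens):
    """Fold vision-token information into text tokens, then discard the
    vision tokens so deeper layers see a much shorter sequence.

    vision_tokens: [num_vision, dim]  e.g. tens of thousands of frame tokens
    text_tokens:   [num_text, dim]    e.g. a few hundred prompt tokens
    returns:       [num_text, dim]
    """
    scale = text_tokens.shape[-1] ** -0.5
    # Text tokens act as queries; vision tokens act as keys and values.
    attn = torch.softmax(text_tokens @ vision_tokens.T * scale, dim=-1)
    return text_tokens + attn @ vision_tokens

# 10,000 frames x 4 tokens/frame collapsed into 256 text tokens (toy numbers).
vision = torch.randn(10_000 * 4, 256)
text = torch.randn(256, 256)
print(transfer_vision_to_text(vision, text).shape)  # torch.Size([256, 256])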

✨ Key Features

  • Hybrid MLLM Architecture
    Integrates a native hybrid Mamba-Transformer LLM for long video understanding.

  • Efficient Long Video Processing
    Capable of handling over 10K frames with significantly lower memory cost compared to standard Video-LLMs.

  • Flexible Backbones
    Supports MLLM construction with Transformer-based LLM backbones such as Qwen2.5 or hybrid LLMs like Nanov2.

  • Advanced Techniques
    Includes token dropping (TD) and token transfer (TransV) for token compression during training (see the back-of-the-envelope sketch after this list).
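
To make the scale concrete, here is a back-of-the-envelope token budget. The tokens-per-frame, keep-ratio, and prompt-length figures are assumptions for illustration, not the model's actual values (see USAGE.md and the paper for the real configuration):

# Rough token budget for a 10K-frame video. All per-frame/ratio numbers below
# are hypothetical; the real values depend on the encoder and configuration.
frames = 10_000
tokens_per_frame = 16                      # hypothetical
visual_tokens = frames * tokens_per_frame
print(f"raw visual tokens:    {visual_tokens:,}")                    # 160,000
keep_ratio = 0.25                          # hypothetical token-dropping ratio
print(f"after token dropping: {int(visual_tokens * keep_ratio):,}")  # 40,000
# With token transfer (TransV), vision-token information is folded into the
# text tokens inside the LLM, so deep layers only carry the short text sequence.
text_tokens = 256                          # hypothetical prompt length
print(f"after token transfer: {text_tokens:,}")                      # 256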


📝 TODO List

  • Add inference code
  • Add model code
  • Add training code
  • Release model weights
  • Add detailed instructions for preparing data, environment, evaluation, and training
  • Support training with Qwen and Nano backbones
  • Support pdrop and TransV for both training and evaluation

🐍 Model Zoo

Model                   Backbone    Max Frames   Checkpoint
TimeViper-9B            Nanov2-9B   5k           Coming Soon
TimeViper-9B-w/TransV   Nanov2-9B   10k+         Coming Soon

🛠️ Installation

We provide comprehensive documentation for setting up TimeViper. Please follow these guides in order:

  1. INSTALL.md - Environment Setup

    • Install dependencies and required packages
    • Configure CUDA, PyTorch, and other system requirements
    • Set up the Python virtual environment
  2. MODEL.md - Model Checkpoint Download

    • Download ViT backbone and LLM backbone checkpoints from Hugging Face
    • Automated download script for all required models
    • Verification of checkpoint integrity
  3. DATA.md - Dataset Preparation

    • Prepare training and evaluation datasets
    • Instructions for downloading benchmark datasets
    • Data directory structure and format specifications
  4. USAGE.md - Usage Guide

    • Configure visual encoders, MLPs, and LLM backbones
    • Enable training-free token dropping (pdrop)
    • Advanced model configuration options

🚀 Quick Start

Training

Coming Soon.

Evaluation

# evaluate
python evaluate.py \
    --dataset videomme \
    --split test \
    --output_dir ./output/timeviper_base/videomme \
    --curr_idx 0 --total_idx 1
# calculate metrics 
python eval/vllm_inference/eval_all.py --model_name $MODEL_NAME --split $CURRENT_SPLIT --dataset $CURRENT_EVAL_DATASET --max_num_frames $MAX_FRAME_NUM \
        --eval_root $PART_OUTPUT_DIR
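
The --curr_idx / --total_idx flags appear to select one shard of the benchmark out of a total number of shards (our reading of the command above, not documented here). Under that assumption, a minimal Python launcher for running all shards in parallel, one per GPU, could look like this sketch:

# Sketch of a sharded-evaluation launcher. ASSUMPTION: --curr_idx / --total_idx
# pick one slice of the benchmark per process; adjust the paths and shard count
# to your setup before use.
import os
import subprocess

NUM_SHARDS = 8  # e.g. one shard per GPU
procs = []
for idx in range(NUM_SHARDS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(idx))  # pin shard idx to GPU idx
    procs.append(subprocess.Popen(
        ["python", "evaluate.py",
         "--dataset", "videomme",
         "--split", "test",
         "--output_dir", "./output/timeviper_base/videomme",
         "--curr_idx", str(idx),
         "--total_idx", str(NUM_SHARDS)],
        env=env,
    ))
for p in procs:
    p.wait()
# Then aggregate the shard outputs with eval/vllm_inference/eval_all.py as shown above.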

🔍 Visualization: Attention Scores in Attention and Mamba-2 Layers

We release utilities to visualize internal token interactions inside the LLM, covering both Transformer attention layers and Mamba/SSM layers.

This is useful for understanding how the model allocates computation across video frames, visual tokens, and text tokens.

Notebook

  • visualize/nano_attention_visualization_cookbook.ipynb

The notebook provides:

  • Extraction of per-layer signals for attention layers and Mamba/SSM layers.
  • Side-by-side plots across depth (early → mid → late layers).
  • Examples that highlight how interactions change with long video inputs.
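
The notebook is the authoritative reference for extracting these signals; the snippet below is only a generic sketch of plotting one extracted interaction map, with hypothetical variable names and shapes:

# Generic plotting sketch. `attn_map` stands in for a [num_query, num_key]
# array produced by the notebook's extraction utilities; here it is random
# placeholder data so the snippet runs standalone.
import numpy as np
import matplotlib.pyplot as plt

attn_map = np.random.rand(64, 512)  # placeholder: text queries x (visual + text) keys

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(attn_map, aspect="auto", interpolation="nearest", cmap="viridis")
ax.set_xlabel("key position (visual tokens, then text tokens)")
ax.set_ylabel("query position")
ax.set_title("Per-layer interaction map (attention or Mamba/SSM)")
fig.colorbar(im, ax=ax, label="score")
plt.show()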

📄 License

This project is released under the Apache 2.0 License.

📚 Citation

If you find TimeViper useful for your research and applications, please cite our paper:

@article{xu2025timeviper,
  title={TimeViper: A Hybrid Mamba-Transformer Model for Efficient Long Video Understanding},
  author={Xu, Boshen and Xiao, Zihan and Li, Jiaze and Ju, Jianzhong and Luo, Zhenbo and Luan, Jian and Jin, Qin},
  journal={arXiv preprint arXiv:2511.16595},
  year={2025}
}

🙏 Acknowledgement

We thank the following open-source projects for their contributions: Cobra, Vamba, transformers, vllm, mamba, Time-R1, VideoChat-Flash.
