🌐 Project Page | 📑 Paper | 🤗 Model
- [2026.01.04] Released visualization utilities for inspecting attention scores in both attention and Mamba layers inside the LLM (see `visualize/nano_attention_visualization_cookbook.ipynb`).
- [2025.12.11] Added support for any combination of two visual encoders, such as DINOv2, InternVideo2, and SigLIP2.
- [2025.11.25] Released models with Nano or Qwen backbones, along with evaluation code for MCQ (VideoMME, LVBench, MLVU, LongVideoBench, EgoSchema, MVBench, TempCompass, CGBench), TVG (Charades, ActivityNet, TVGBench), VDC (VDC), and DVC (YouCook2) benchmarks.
- [2025.11.21] 🚀 Initial release of the TimeViper repository. The paper is available on arXiv.
We present TimeViper, a hybrid Mamba-Transformer vision-language model for efficient long video understanding. We introduce TransV, the first token-transfer module that compresses vision tokens into text tokens inside the LLM, enabling the model to process over 10,000 frames.
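TransV's exact design is described in the paper; purely as a rough illustration of the token-transfer idea (text tokens absorb vision-token information, after which the vision tokens are dropped), a single cross-attention step could look like the sketch below. All names here are hypothetical and are not the repository's API:

```python
import torch
import torch.nn as nn

class TokenTransferSketch(nn.Module):
    """Illustrative sketch only -- not the actual TransV module.
    Text tokens attend to vision tokens once, absorbing their content;
    the vision tokens can then be dropped so that deeper layers run on
    the much shorter text-only sequence."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Queries of shape (batch, n_text, dim) attend over (batch, n_vision, dim).
        transferred, _ = self.cross_attn(text_tokens, vision_tokens, vision_tokens)
        # Residual update; the vision tokens are discarded afterwards.
        return text_tokens + transferred
```

After such a transfer point the sequence length no longer scales with the number of frames, which is what makes 10K+ frame inputs tractable.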
- **Hybrid MLLM Architecture**: Integrates a native hybrid Mamba-Transformer LLM for long video understanding.
- **Efficient Long Video Processing**: Handles over 10K frames with significantly lower memory cost than standard Video-LLMs.
- **Flexible Backbones**: Supports MLLM construction with Transformer-based LLM backbones such as Qwen2.5 or hybrid LLMs like Nanov2.
- **Advanced Techniques**: Includes token dropping (TD) and token transfer (TransV) for training compression; a rough sketch of token dropping follows this list.
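As an illustration of the token-dropping idea only (the actual TD criterion and schedule are defined in the code and paper), a score-based top-k drop over visual tokens might look like this; `scores` is an assumed importance signal, e.g. attention received from text tokens:

```python
import torch

def drop_visual_tokens(hidden: torch.Tensor, scores: torch.Tensor,
                       keep_ratio: float = 0.5) -> torch.Tensor:
    """Sketch: keep only the visual tokens with the highest importance scores.
    hidden: (batch, n_visual, dim); scores: (batch, n_visual).
    The criterion TimeViper actually uses may differ."""
    keep = max(1, int(hidden.size(1) * keep_ratio))
    # Pick the top-k tokens, then re-sort indices to preserve temporal order.
    idx = scores.topk(keep, dim=1).indices.sort(dim=1).values
    return hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
```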
- Add inference code
- Add model code
- Add training code
- Release model weights
- Add detailed instructions for data preparation, environment setup, evaluation, and training
- Support training with Qwen and Nano backbones
- Support pdrop and TransV for both training and evaluation
| Model | Backbone | Max Frames | Checkpoint |
|---|---|---|---|
| TimeViper-9B | Nanov2-9B | 5k | Coming Soon |
| TimeViper-9B-w/TransV | Nanov2-9B | 10k+ | Coming Soon |
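Once the weights are published, loading should follow the usual Hugging Face pattern; the repo id below is a placeholder, not a released checkpoint:

```python
from transformers import AutoModel, AutoProcessor

# Placeholder repo id -- the real one will be listed in the table above
# once the checkpoints are released.
repo_id = "TimeViper/TimeViper-9B"  # hypothetical
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
```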
We provide comprehensive documentation for setting up TimeViper. Please follow these guides in order:
- **INSTALL.md** - Environment Setup
  - Install dependencies and required packages
  - Configure CUDA, PyTorch, and other system requirements
  - Set up the Python virtual environment
- **MODEL.md** - Model Checkpoint Download
  - Download ViT backbone and LLM backbone checkpoints from Hugging Face
  - Automated download script for all required models
  - Verification of checkpoint integrity
- **DATA.md** - Dataset Preparation
  - Prepare training and evaluation datasets
  - Instructions for downloading benchmark datasets
  - Data directory structure and format specifications
- **USAGE.md** - Usage Guide
  - Configure visual encoders, MLPs, and LLM backbones
  - Enable training-free token dropping (pdrop); an illustrative configuration sketch follows this list
  - Advanced model configuration options
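As a hedged illustration of the kind of options USAGE.md covers (the actual option names may differ; consult the guide), a configuration might look like:

```python
# Hypothetical option names -- USAGE.md documents the real ones.
config = {
    "vision_encoders": ["siglip2", "dinov2"],  # any two-encoder combination
    "llm_backbone": "nanov2-9b",               # or a Transformer LLM such as "qwen2.5"
    "pdrop": {                                 # training-free token dropping
        "enabled": True,
        "keep_ratio": 0.5,                     # fraction of visual tokens kept
    },
}
```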
Coming Soon.
```bash
# Run evaluation (single shard; a multi-GPU sharded launcher is sketched below)
python evaluate.py \
    --dataset videomme \
    --split test \
    --output_dir ./output/timeviper_base/videomme \
    --curr_idx 0 --total_idx 1

# Calculate metrics
python eval/vllm_inference/eval_all.py \
    --model_name $MODEL_NAME \
    --split $CURRENT_SPLIT \
    --dataset $CURRENT_EVAL_DATASET \
    --max_num_frames $MAX_FRAME_NUM \
    --eval_root $PART_OUTPUT_DIR
```
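The `--curr_idx`/`--total_idx` flags split the dataset into shards for parallel evaluation. A minimal launcher sketch, assuming one GPU per shard (adjust to your hardware):

```python
import os
import subprocess

# Sketch: run 4 shards in parallel, one per GPU. Each process evaluates
# slice i of total via --curr_idx/--total_idx.
total = 4
procs = []
for i in range(total):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i))
    procs.append(subprocess.Popen(
        ["python", "evaluate.py",
         "--dataset", "videomme", "--split", "test",
         "--output_dir", "./output/timeviper_base/videomme",
         "--curr_idx", str(i), "--total_idx", str(total)],
        env=env,
    ))
for p in procs:
    p.wait()
```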
We release utilities to visualize internal token interactions inside the LLM, covering both Transformer attention layers and Mamba (SSM) layers.
These are useful for understanding how the model allocates computation across video frames / visual tokens and text tokens. A generic attention-extraction sketch follows the notebook overview below.

`visualize/nano_attention_visualization_cookbook.ipynb`
The notebook provides:
- Extraction of per-layer signals for attention layers and Mamba/SSM layers.
- Side-by-side plots across depth (early → mid → late layers).
- Examples that highlight how interactions change with long video inputs.
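For the Transformer attention layers, the standard `transformers` mechanism is `output_attentions=True` (Mamba/SSM layers need custom hooks, which the notebook handles). A generic sketch, using a stand-in checkpoint rather than the TimeViper backbone:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model for illustration; the notebook does the equivalent
# (plus Mamba/SSM signal extraction) for TimeViper's hybrid LLM.
name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
# output_attentions requires the eager attention implementation.
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

inputs = tok("a long video caption ...", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per attention layer,
# ready to average over heads and plot across depth (early -> mid -> late).
print(len(out.attentions), out.attentions[0].shape)
```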
This project is released under the Apache 2.0 License.
If you find TimeViper useful for your research and applications, please cite our paper:
```bibtex
@article{xu2025timeviper,
  title={TimeViper: A Hybrid Mamba-Transformer Model for Efficient Long Video Understanding},
  author={Xu, Boshen and Xiao, Zihan and Li, Jiaze and Ju, Jianzhong and Luo, Zhenbo and Luan, Jian and Jin, Qin},
  journal={arXiv preprint arXiv:2511.16595},
  year={2025}
}
```
We thank the following open-source projects for their contributions: Cobra, Vamba, transformers, vllm, mamba, Time-R1, VideoChat-Flash.
