🔥 Code for SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models.
- [2025/12/12] We are in the process of preparing the data. Please wait a moment.
- [2025/9/21] SAMA is accepted to NeurIPS 2025🔥! See you in San Diego!😉
If you find SAMA useful for your work, please cite it using the following BibTeX entry 🙏🙏🙏:
@inproceedings{sun2025sama,
title={SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models},
author={Sun, Ye and Zhang, Hao and Ding, Henghui and Zhang, Tiehua and Ma, Xingjun and Jiang, Yu-Gang},
booktitle={NeurIPS},
year={2025}
}
- Installation
- Model Weights
- Training Data preparation
- Training
- Evaluation & Benchmark
- Acknowledgments
Installation
- Please install Python and PyTorch first:
> conda create -n vlm python=3.10
> conda activate vlm
> conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 cuda -c pytorch -c "nvidia/label/cuda-12.1.0" -c "nvidia/label/cuda-12.1.1"
- Install mmcv:
> pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.3/index.html
- Install other dependencies:
> pip install -r requirements.txt
Data Preparation
- Please first download the Sa2VA training datasets and place them in the data directory. The download link is here.
- To support training on SAMA239K, please also download the LVVIS dataset and the RefYoutube-VOS dataset into the sama239k_data folder.
- Create symbolic links in the sama239k_data folder for the mevis dataset and the sav_train dataset (sam_v_full). These two datasets can be obtained from the Sa2VA training data.
- For the VidSTG dataset, we have performed frame extraction. Please download this dataset first and conduct frame extraction using our provided /tools/vidstg_process.py.
- Download our json files here and put them into the sama239k_data folder.
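The symbolic-link step above can be sketched in Python as follows. This is only an illustration, not part of the released code: the helper name `link_datasets` and the `ASSUMED_LINKS` mapping are hypothetical, and the source locations (`video_datas/mevis` and `video_datas/sam_v_full`) are assumptions based on where the Sa2VA data sits in the directory tree, so please adjust them to your own layout.

```python
from pathlib import Path

# Assumed mapping: link name inside sama239k_data -> source folder under the data root.
# These source paths are guesses based on the Sa2VA layout; change them if yours differ.
ASSUMED_LINKS = {
    "mevis": "video_datas/mevis",
    "sav_train": "video_datas/sam_v_full",
}


def link_datasets(data_root, links=ASSUMED_LINKS):
    """Create symlinks in <data_root>/sama239k_data and return the links created."""
    data_root = Path(data_root).resolve()
    target_dir = data_root / "sama239k_data"
    target_dir.mkdir(parents=True, exist_ok=True)

    created = []
    for name, source in links.items():
        src = data_root / source
        dst = target_dir / name
        if not src.is_dir():
            raise FileNotFoundError(f"source dataset not found: {src}")
        if dst.is_symlink() or dst.exists():
            continue  # leave existing links or folders untouched
        dst.symlink_to(src, target_is_directory=True)
        created.append(dst)
    return created
```

Calling `link_datasets("data")` from the repository root would create `data/sama239k_data/mevis` and `data/sama239k_data/sav_train`. Running it a second time creates nothing new, since existing entries are skipped.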
The final data structure should look like this:
data/
├── sama239k_data
|   ├── mevis
|   |   └── train
|   ├── lvvis
|   |   └── train
|   ├── ref_youtube_vos
|   |   └── train
|   ├── sav_train
|   |   └── sav_000
|   |   └── .....
|   ├── VidSTG
|   |   └── train
|   |       └── 2399224635
|   |           └── frame_0.jpg
|   |           └── frame_4.jpg
|   |           └── .....
├── video_datas
|   ├── revos
|   ├── mevis
|   └── davis17
|   └── chat_univi
|   └── sam_v_full # [!important] please download this from sam-2 directly.
|   └── Ref-SAV.json
├── ref_seg
|   ├── refclef
|   ├── refcoco
|   ├── refcoco+
|   ├── refcocog
|   ├──
├── glamm_data
|   ├── images
|   ├── annotations
├── osprey-724k
|   ├── Osprey-724K
|   ├── coco
├── llava_data
|   ├── llava_images
|   ├── LLaVA-Instruct-150K
|   ├── LLaVA-Pretrain
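Before starting training, it may help to check that the layout above is in place. The snippet below is a minimal sketch and not part of the released code: `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, and the list only covers a subset of the folders shown in the tree.

```python
from pathlib import Path

# A subset of the folders from the tree above, relative to the data root.
EXPECTED_DIRS = [
    "sama239k_data/mevis/train",
    "sama239k_data/lvvis/train",
    "sama239k_data/ref_youtube_vos/train",
    "sama239k_data/sav_train",
    "sama239k_data/VidSTG/train",
    "video_datas",
    "ref_seg",
    "glamm_data",
    "osprey-724k",
    "llava_data",
]


def missing_dirs(data_root, expected=EXPECTED_DIRS):
    """Return the expected folders that do not exist under data_root."""
    root = Path(data_root)
    # Path.is_dir() follows symbolic links, so linked datasets count as present.
    return [rel for rel in expected if not (root / rel).is_dir()]
```

For example, `missing_dirs("data")` returns an empty list when every listed folder exists, and otherwise returns the relative paths that still need to be downloaded or linked.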