A unified image retrieval system based on large multimodal models, supporting general conversational image retrieval tasks.
UniChatIR is the official implementation of the ChatSearch paper, a generative image retrieval system built on the Emu/LLaVA architecture. It leverages large multimodal models (LMMs) for high-quality text-to-image retrieval and conversational image retrieval. Following a generative retrieval paradigm, the system uses a large language model to produce unified image representations and supports a range of retrieval tasks and datasets, including:
- Standard Image Retrieval: Flickr30K, COCO, etc.
- Composed Image Retrieval: CIRR (Composed Image Retrieval on Real-life Images)
- Fashion Image Retrieval: Fashion-IQ
- Visual Story Retrieval: VIST (Visual Storytelling)
- Conversational Image Retrieval: Supports multi-turn conversational context understanding
- 🎯 Multi-task Support: Supports various image retrieval tasks and datasets
- 🚀 Generative Retrieval: Adopts a generative retrieval paradigm, leveraging large language models to generate unified image representations
- 💬 Conversational Retrieval: Supports multi-turn conversational context understanding for general conversational image retrieval
- 🔧 Easy to Use: Provides simple command-line interfaces and a Gradio demo interface
- 📊 Flexible Configuration: Supports various model configurations and evaluation metrics
- 🎨 Multimodal Fusion: Unified architecture based on CLIP visual encoder and LLaMA language model
- Python >= 3.8
- PyTorch >= 1.12.0
- CUDA >= 11.0 (recommended)
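
A quick check that the environment matches these requirements can save debugging time later. The snippet below is only illustrative and is not part of the repository; it simply verifies the versions listed above:

```python
# Illustrative environment check; thresholds mirror the requirement list above.
import sys
import torch

assert sys.version_info >= (3, 8), "Python >= 3.8 is required"
print("PyTorch:", torch.__version__)            # expected >= 1.12.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)  # >= 11.0 recommended
```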
- Clone the repository:

```bash
git clone https://github.com/CASIA-IVA-Lab/ChatSearch.git
cd ChatSearch
```

- Create a virtual environment and install dependencies:

```bash
conda create -n unichatir python=3.10 -y
conda activate unichatir
pip install --upgrade pip
pip install -e .
```

- Install training-related dependencies (optional):

```bash
pip install ninja
pip install flash-attn --no-build-isolation
```

Before running the demo or evaluation, you need to prepare the following data:
- Image Feature Files: Pre-computed image features (.pt format)
- Annotation Files: JSON files containing image IDs and metadata
- Image Directory: Root directory of image files
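
The exact contents of these files depend on the dataset and the extraction script that produced them, so the sketch below only shows how they might be loaded and inspected; the file names are placeholders:

```python
# Placeholder paths; the real files are produced by the feature extraction scripts
# and the dataset-specific annotation preparation.
import json
import torch

image_feats = torch.load("image_features.pt", map_location="cpu")  # pre-computed image features
annotations = json.load(open("annotations.json"))                  # image IDs and metadata

print(type(image_feats))            # typically a tensor or a dict keyed by image ID
print(len(annotations), "annotation entries")
```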
Use the Gradio interface for interactive image retrieval:
```bash
python demo.py \
    --model-cfg emu_models/Emu-8B_frozenvis_cliploss.json \
    --checkpoint /path/to/checkpoint.pth \
    --image-feat-path /path/to/image_features.pt \
    --annotation-path /path/to/annotations.json \
    --image-root /path/to/images
```

Evaluate model performance on standard datasets:
```bash
python utils/retrieval_new.py \
    --checkpoint /path/to/checkpoint.pth \
    --model-cfg emu_models/Emu-8B_frozenvis_cliploss_vitl.json \
    --vis-roots /path/to/images1,/path/to/images2 \
    --ann-paths /path/to/ann1.json,/path/to/ann2.json \
    --bs 16 \
    --evaluate
```
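
Conceptually, both the demo and the evaluation scripts rank gallery images by the similarity between a query representation produced by the model and the pre-computed image features. The function below is a minimal sketch of that ranking step with assumed shapes, not the project's actual retrieval code:

```python
import torch
import torch.nn.functional as F

def rank_images(query_emb: torch.Tensor, image_feats: torch.Tensor, k: int = 10):
    """query_emb: (d,) query representation; image_feats: (N, d) gallery features."""
    q = F.normalize(query_emb, dim=-1)          # cosine similarity via L2 normalization
    g = F.normalize(image_feats, dim=-1)
    scores = g @ q                              # (N,) similarity of each gallery image
    topk = torch.topk(scores, k=k)
    return topk.indices, topk.values            # best-matching gallery indices first
```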
The repository is organized as follows:

```
unichatir/
├── demo.py                      # Gradio demo interface
├── utils/
│   ├── retrieval_new.py         # Standard image retrieval evaluation
│   ├── retrieval_new_cirr.py    # CIRR dataset evaluation
│   ├── retrieval_new_fashion.py # Fashion-IQ dataset evaluation
│   ├── retrieval_new_vist.py    # VIST dataset evaluation
│   └── extract_vitfeat_*.py     # Image feature extraction scripts
├── emu_models/                  # Model definitions
│   ├── modeling_uniir.py        # Unified image retrieval model (Emu_clip_VIT)
│   ├── modeling_llama.py        # LLaMA language model (supports classification and regression)
│   ├── eva_vit.py               # EVA ViT visual encoder
│   └── ...
├── llava/                       # LLaVA related code
│   ├── dataset_finetune.py      # Dataset definitions
│   ├── dataset_cirr.py          # CIRR dataset
│   ├── processors/              # Data processors
│   └── train/                   # Training scripts
└── scripts/                     # Training and evaluation scripts
```
Before running retrieval, you need to extract image features. The system provides feature extraction scripts for various datasets:
```bash
# Flickr30K dataset
python utils/extract_vitfeat_flickr.py \
    --data-dir /path/to/images \
    --save-pt-path /path/to/features.pt \
    --save-url-path /path/to/urls.json

# COCO dataset
python utils/extract_vitfeat_coco.py \
    --data-dir /path/to/images \
    --save-pt-path /path/to/features.pt \
    --save-url-path /path/to/urls.json
```
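
Under the hood, these scripts encode every image with the frozen visual encoder and save the features in a .pt file alongside a JSON list of image paths. The sketch below illustrates that loop with a generic encoder and preprocess function standing in for the project's EVA/CLIP ViT and its transforms:

```python
import json
from pathlib import Path

import torch
from PIL import Image

@torch.no_grad()
def extract_features(image_dir, encoder, preprocess, save_pt_path, save_url_path):
    # "encoder" and "preprocess" are stand-ins for the frozen ViT and its image transforms.
    paths, feats = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(encoder(image).squeeze(0).cpu())
        paths.append(str(path))
    torch.save(torch.stack(feats), save_pt_path)      # (N, d) feature matrix
    json.dump(paths, open(save_url_path, "w"))        # image order matching the features
```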
Model training supports various configurations and dataset combinations. The main training scripts are located in the scripts/ directory:

- run_frozenvis_cliploss.sh: Pre-training script using the CLIP loss
- run_uniir.sh: Unified image retrieval training script
- run-ft.sh: Fine-tuning script

The training process supports:
- Multi-node distributed training
- Mixed precision training (bf16)
- Gradient checkpointing
- Flash Attention acceleration
Please refer to each script file for detailed configurations.
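
Mixed precision and gradient checkpointing are standard PyTorch mechanisms; the training scripts enable them through their own arguments, but the snippet below shows the underlying calls on a toy module (illustrative only, requires a CUDA device):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy module only; in practice these settings apply to the LMM being trained.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU()).cuda()
x = torch.randn(4, 16, device="cuda", requires_grad=True)

# bf16 mixed precision: run the forward pass in bfloat16 where it is safe to do so.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)

# Gradient checkpointing: recompute the forward pass during backward instead of
# storing activations, trading compute for memory.
out = checkpoint(model, x, use_reentrant=False)
out.sum().backward()
```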
The system supports the following evaluation metrics:
- Recall@K (R@K): Proportion of correct answers in the top K results
- Median Rank: Median rank of the correct image over all queries
- Mean Rank: Mean rank of the correct image over all queries
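
For reference, all three metrics can be computed directly from the rank of the correct image for each query. A minimal sketch (not the project's evaluation code):

```python
import statistics

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """ranks: 1-based rank of the correct image for each query."""
    n = len(ranks)
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["MedianRank"] = statistics.median(ranks)
    metrics["MeanRank"] = sum(ranks) / n
    return metrics

print(retrieval_metrics([1, 3, 12, 2, 7]))
# {'R@1': 0.2, 'R@5': 0.6, 'R@10': 0.8, 'MedianRank': 3, 'MeanRank': 5.0}
```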
The following datasets are supported:
- Flickr30K: ~31,000 images, each with 5 captions
- COCO: Microsoft COCO dataset
- CIRR: Composed Image Retrieval on Real-life Images dataset
- Fashion-IQ: Fashion image retrieval dataset
- VIST: Visual Storytelling dataset
The system adopts a generative retrieval architecture with the following main components:
- Visual Encoder: Frozen visual encoder based on CLIP ViT
- Language Model: LLaMA-based decoder supporting generative retrieval
- Projection Layer: Projects visual features into language model space
- Retrieval Heads: Text head and vision head for computing similarity
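
A schematic of how these components fit together is shown below. Module names, dimensions, and the smoke test are invented for illustration; the actual model is implemented in emu_models/modeling_uniir.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeRetriever(nn.Module):
    """Schematic only: frozen ViT -> projection -> LLM decoder -> retrieval heads."""
    def __init__(self, visual_encoder, language_model, vis_dim, llm_dim, embed_dim=768):
        super().__init__()
        self.visual_encoder = visual_encoder.eval()       # frozen CLIP/EVA ViT
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        self.projection = nn.Linear(vis_dim, llm_dim)     # visual features -> LLM space
        self.language_model = language_model              # LLaMA-based decoder
        self.text_head = nn.Linear(llm_dim, embed_dim)    # text/query embedding head
        self.vision_head = nn.Linear(llm_dim, embed_dim)  # image embedding head

    def similarity(self, text_states, image_states):
        # Cosine similarity between the outputs of the two retrieval heads.
        t = F.normalize(self.text_head(text_states), dim=-1)
        v = F.normalize(self.vision_head(image_states), dim=-1)
        return t @ v.T

# Smoke test with stand-in modules (the real model uses an EVA/CLIP ViT and LLaMA).
retriever = GenerativeRetriever(nn.Identity(), nn.Identity(), vis_dim=32, llm_dim=64, embed_dim=16)
print(retriever.similarity(torch.randn(2, 64), torch.randn(5, 64)).shape)  # torch.Size([2, 5])
```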
The project supports various model configurations. Configuration files are located in the emu_models/ directory:
- Emu-8B_frozenvis_cliploss.json: Base configuration (ViT-B)
- Emu-8B_frozenvis_cliploss_vitl.json: Configuration with a ViT-L visual encoder
This project is based on the following open-source projects:
- LLaVA: Large Language and Vision Assistant
- Emu: Multimodal foundation model
- CLIP: Vision-language pre-trained model
- LLaMA: Large language model
If you use this project, please cite the following paper:
```bibtex
@article{zhao2025chatsearch,
  title={Chatsearch: A dataset and a generative retrieval model for general conversational image retrieval},
  author={Zhao, Zijia and Guo, Longteng and Yue, Tongtian and Hu, Erdong and Shao, Shuai and Yuan, Zehuan and Huang, Hua and Liu, Jing},
  journal={Pattern Recognition},
  year={2025},
  publisher={Elsevier}
}
```