ChatSearch: A Dataset and a Generative Retrieval Model for General Conversational Image Retrieval

A unified image retrieval system based on large multimodal models, supporting general conversational image retrieval tasks.

Introduction

UniChatIR is the official implementation of the ChatSearch paper: a generative image retrieval system built on the Emu/LLaVA architecture. It leverages large multimodal models (LMMs) for high-quality text-to-image retrieval and conversational image retrieval. Following a generative retrieval paradigm, the language model produces unified image representations, and the system supports a range of retrieval tasks and datasets, including:

  • Standard Image Retrieval: Flickr30K, COCO, etc.
  • Composed Image Retrieval: CIRR (Composed Image Retrieval on Real-life Images)
  • Fashion Image Retrieval: Fashion-IQ
  • Visual Story Retrieval: VIST (Visual Storytelling)
  • Conversational Image Retrieval: Supports multi-turn conversational context understanding

Key Features

  • 🎯 Multi-task Support: Supports various image retrieval tasks and datasets
  • 🚀 Generative Retrieval: Adopts a generative retrieval paradigm, leveraging large language models to generate unified image representations
  • 💬 Conversational Retrieval: Supports multi-turn conversational context understanding for general conversational image retrieval
  • 🔧 Easy to Use: Provides simple command-line interfaces and a Gradio demo
  • 📊 Flexible Configuration: Supports various model configurations and evaluation metrics
  • 🎨 Multimodal Fusion: Unified architecture based on CLIP visual encoder and LLaMA language model

Installation

Requirements

  • Python >= 3.8
  • PyTorch >= 1.12.0
  • CUDA >= 11.0 (recommended)

Installation Steps

  1. Clone the repository:
git clone https://github.com/CASIA-IVA-Lab/ChatSearch.git
cd ChatSearch
  2. Create a virtual environment and install dependencies:
conda create -n unichatir python=3.10 -y
conda activate unichatir
pip install --upgrade pip
pip install -e .
  3. Install training-related dependencies (optional):
pip install ninja
pip install flash-attn --no-build-isolation
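
After installation, a short sanity check can confirm that PyTorch, CUDA, and the optional flash-attn build are usable. This snippet is illustrative only and is not part of the repository:

# sanity_check.py -- verify PyTorch, CUDA, and the optional flash-attn build
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")
    print(f"bf16 supported:  {torch.cuda.is_bf16_supported()}")

try:
    import flash_attn  # only needed for training with Flash Attention
    print(f"flash-attn:      {flash_attn.__version__}")
except ImportError:
    print("flash-attn not installed (optional, training only)")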

Quick Start

1. Prepare Data

First, you need to prepare the following data:

  • Image Feature Files: Pre-computed image features (.pt format)
  • Annotation Files: JSON files containing image IDs and metadata
  • Image Directory: Root directory of image files
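
The exact layout of these files is defined by the extraction scripts and dataset loaders in this repository. As a minimal sketch (the paths and the assumed tensor/JSON layout are illustrative only), the prepared inputs can be inspected like this:

# Inspect pre-computed image features and annotations before running the demo.
# The tensor/JSON layout assumed here is illustrative; check the
# utils/extract_vitfeat_*.py scripts for the exact format they write.
import json
import torch

feats = torch.load("/path/to/image_features.pt", map_location="cpu")
print(type(feats))                        # typically a tensor or a dict of tensors
if torch.is_tensor(feats):
    print("feature shape:", feats.shape)  # e.g. [num_images, feature_dim]

with open("/path/to/annotations.json") as f:
    anns = json.load(f)
print("number of annotation entries:", len(anns))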

2. Run Demo

Use the Gradio interface for interactive image retrieval:

python demo.py \
    --model-cfg emu_models/Emu-8B_frozenvis_cliploss.json \
    --checkpoint /path/to/checkpoint.pth \
    --image-feat-path /path/to/image_features.pt \
    --annotation-path /path/to/annotations.json \
    --image-root /path/to/images

3. Evaluate Model

Evaluate model performance on standard datasets:

python utils/retrieval_new.py \
    --checkpoint /path/to/checkpoint.pth \
    --model-cfg emu_models/Emu-8B_frozenvis_cliploss_vitl.json \
    --vis-roots /path/to/images1,/path/to/images2 \
    --ann-paths /path/to/ann1.json,/path/to/ann2.json \
    --bs 16 \
    --evaluate

Project Structure

unichatir/
├── demo.py                        # Gradio demo interface
├── utils/
│   ├── retrieval_new.py           # Standard image retrieval evaluation
│   ├── retrieval_new_cirr.py      # CIRR dataset evaluation
│   ├── retrieval_new_fashion.py   # Fashion-IQ dataset evaluation
│   ├── retrieval_new_vist.py      # VIST dataset evaluation
│   └── extract_vitfeat_*.py       # Image feature extraction scripts
├── emu_models/                    # Model definitions
│   ├── modeling_uniir.py          # Unified image retrieval model (Emu_clip_VIT)
│   ├── modeling_llama.py          # LLaMA language model (supports classification and regression)
│   ├── eva_vit.py                 # EVA ViT visual encoder
│   └── ...
├── llava/                         # LLaVA related code
│   ├── dataset_finetune.py        # Dataset definitions
│   ├── dataset_cirr.py            # CIRR dataset
│   ├── processors/                # Data processors
│   └── train/                     # Training scripts
└── scripts/                       # Training and evaluation scripts

Usage

Image Feature Extraction

Before running retrieval, you need to extract the image features. The system supports feature extraction for several datasets:

# Flickr30K dataset
python utils/extract_vitfeat_flickr.py \
    --data-dir /path/to/images \
    --save-pt-path /path/to/features.pt \
    --save-url-path /path/to/urls.json

# COCO dataset
python utils/extract_vitfeat_coco.py \
    --data-dir /path/to/images \
    --save-pt-path /path/to/features.pt \
    --save-url-path /path/to/urls.json
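
Conceptually, feature extraction runs every image through a frozen vision encoder and stacks the outputs into a single tensor. The sketch below uses a generic CLIP vision tower from Hugging Face transformers purely for illustration; the repository's extract_vitfeat_*.py scripts use the model's own ViT and define the actual output format:

# Illustrative sketch: precompute image features with a generic CLIP vision encoder.
# The checkpoint name and paths are assumptions; this is not the repo's extraction code.
import glob, json, torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()

paths, feats = sorted(glob.glob("/path/to/images/*.jpg")), []
with torch.no_grad():
    for p in paths:
        inputs = processor(images=Image.open(p).convert("RGB"), return_tensors="pt")
        feats.append(model(**inputs).image_embeds.squeeze(0))  # one embedding per image

torch.save(torch.stack(feats), "/path/to/features.pt")
json.dump(paths, open("/path/to/urls.json", "w"))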

Model Training

Model training supports various configurations and dataset combinations. The main training scripts are located in the scripts/ directory:

  • run_frozenvis_cliploss.sh: Pre-training script using CLIP Loss
  • run_uniir.sh: Unified image retrieval training script
  • run-ft.sh: Fine-tuning training script

The training process supports:

  • Multi-node distributed training
  • Mixed precision training (bf16)
  • Gradient checkpointing
  • Flash Attention acceleration

Please refer to each script file for detailed configurations.
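
Since the training code lives under llava/train, it presumably follows the LLaVA recipe built on the Hugging Face Trainer; if so, the options above roughly map to arguments like the following. This is a hedged sketch, not the repository's exact configuration; the shell scripts in scripts/ are authoritative:

# Rough mapping of the training features listed above to Hugging Face TrainingArguments.
# The exact argument names used by this repo's llava/train code may differ.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints/unichatir",  # assumed output path
    bf16=True,                             # mixed precision training
    gradient_checkpointing=True,           # trade compute for memory
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=1,
)
# Multi-node distributed training is launched via the scripts/ directory;
# Flash Attention is enabled when the model is loaded.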

Evaluation Metrics

The system supports the following evaluation metrics:

  • Recall@K (R@K): Proportion of queries for which the correct image appears in the top K results
  • Median Rank (MedR): Median rank of the correct image across all queries
  • Mean Rank (MeanR): Mean rank of the correct image across all queries
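
Given the 1-based rank of the ground-truth image for each query, these metrics reduce to a few lines. The following is a minimal standalone sketch, not the repository's evaluation code (see utils/retrieval_new.py for that):

# Minimal metric computation from the 1-based rank of the ground-truth image per query.
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    ranks = np.asarray(ranks, dtype=np.float64)
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    metrics["MeanR"] = float(np.mean(ranks))
    return metrics

print(retrieval_metrics([1, 3, 12, 2, 7]))
# {'R@1': 20.0, 'R@5': 60.0, 'R@10': 80.0, 'MedR': 3.0, 'MeanR': 5.0}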

Dataset Support

  • Flickr30K: 31,783 images with 5 captions per image
  • COCO: Microsoft COCO dataset
  • CIRR: Composed Image Retrieval on Real-life Images dataset
  • Fashion-IQ: Fashion image retrieval dataset
  • VIST: Visual Storytelling dataset

Model Architecture

The system adopts a generative retrieval architecture with the following main components:

  • Visual Encoder: Frozen visual encoder based on CLIP ViT
  • Language Model: LLaMA-based decoder supporting generative retrieval
  • Projection Layer: Projects visual features into language model space
  • Retrieval Heads: Text head and vision head for computing similarity
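
At inference time, retrieval amounts to comparing the query embedding produced by the language-model head against the pre-computed image embeddings. The sketch below is conceptual; the function and tensor names are placeholders rather than the actual Emu_clip_VIT interface:

# Conceptual sketch of the retrieval step: the LMM encodes the (multi-turn) query
# into a single embedding, and gallery images are ranked by cosine similarity
# against pre-computed image embeddings. Names are placeholders, not the repo's API.
import torch
import torch.nn.functional as F

def rank_images(query_embedding: torch.Tensor, image_embeddings: torch.Tensor, top_k: int = 10):
    """query_embedding: [dim]; image_embeddings: [num_images, dim]."""
    q = F.normalize(query_embedding, dim=-1)
    g = F.normalize(image_embeddings, dim=-1)
    scores = g @ q                      # cosine similarity to every gallery image
    return torch.topk(scores, k=top_k)  # (values, indices) of the best matches

# Example with random tensors standing in for real model outputs:
scores, ids = rank_images(torch.randn(768), torch.randn(1000, 768))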

Model Configuration

The project supports various model configurations. Configuration files are located in the emu_models/ directory:

  • Emu-8B_frozenvis_cliploss.json: Base configuration (ViT-B)
  • Emu-8B_frozenvis_cliploss_vitl.json: Using ViT-L visual encoder
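
The configuration files are plain JSON, so one can be inspected directly before passing it via --model-cfg (the snippet below does not assume any particular fields):

# Inspect a model configuration before passing it via --model-cfg.
import json

with open("emu_models/Emu-8B_frozenvis_cliploss.json") as f:
    cfg = json.load(f)
print(sorted(cfg.keys()))  # top-level configuration fields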

Acknowledgments

This project is based on the following open-source projects:

  • LLaVA: Large Language and Vision Assistant
  • Emu: Multimodal foundation model
  • CLIP: Vision-language pre-trained model
  • LLaMA: Large language model

Citation

If you use this project, please cite the following paper:

@article{zhao2025chatsearch,
  title={Chatsearch: A dataset and a generative retrieval model for general conversational image retrieval},
  author={Zhao, Zijia and Guo, Longteng and Yue, Tongtian and Hu, Erdong and Shao, Shuai and Yuan, Zehuan and Huang, Hua and Liu, Jing},
  journal={Pattern Recognition},
  year={2025},
  publisher={Elsevier}
}
