Guanning Zeng · Xiang Zhang · Zirui Wang · Haiyang Xu · Zeyuan Chen · Bingnan Li · Zhuowen Tu
ICCV 2025
This repository contains the official implementation of YOLO-Count, a fully differentiable and open-vocabulary object counting model. YOLO-Count is designed to provide accurate object count estimation and enable fine-grained quantity control for text-to-image (T2I) generation models.
We recommend using Conda to set up the environment.
conda create -n yolocnt python=3.12
conda activate yolocnt
pip install -r requirements.txt

YOLO-Count is trained and evaluated on multiple object counting benchmarks. Please download and organize each dataset as follows.
- Download FSC147 from https://github.com/cvlab-stonybrook/LearningToCountEverything
- Place the following folders under:

data/FSC/
├── gt_density_map_adaptive_384_VarV2
└── images_384_VarV2
Download Open Images v7 using:
python -m scripts.download_oimgv7

Download the Objects365 validation images with:
python -m scripts.download_o365

Then organize the data as:
data/Obj365/objects365/val
- Download LVIS
- Place all files under:
data/LVIS/
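
Before running evaluation, it may help to confirm that the folders described above are in place. The following is a minimal sketch and not part of the repository: the helper `missing_dataset_dirs` is hypothetical, and the path list covers only the locations named in this README (the Open Images v7 location is not stated here, so it is left out).

```python
from pathlib import Path

# Directories named in this README; adjust if your layout differs.
EXPECTED_DIRS = [
    "data/FSC/gt_density_map_adaptive_384_VarV2",
    "data/FSC/images_384_VarV2",
    "data/Obj365/objects365/val",
    "data/LVIS",
]


def missing_dataset_dirs(root="."):
    """Return the expected dataset directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing dataset directories:")
        for d in missing:
            print(f"  {d}")
    else:
        print("All expected dataset directories are present.")
```

Run it from the repository root so that the relative paths resolve against the same `data/` folder the scripts use.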
Pre-trained model weights are available at
https://huggingface.co/zx1239856/yolo-count/tree/main
Please download the weights and place them in the checkpoints/ directory.
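
As an alternative to downloading through the browser, the weights can likely be fetched programmatically. This sketch assumes the `huggingface_hub` package is installed (it is not confirmed to be listed in `requirements.txt`) and uses its `snapshot_download` function. The wrapper `download_weights` is a hypothetical helper; it mirrors the whole model repository into `checkpoints/`, which may include files other than weights.

```python
REPO_ID = "zx1239856/yolo-count"  # taken from the Hugging Face URL above
CHECKPOINT_DIR = "checkpoints"    # directory this README asks for


def download_weights(repo_id=REPO_ID, local_dir=CHECKPOINT_DIR):
    """Download every file in the model repo into local_dir and return its path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_weights()` once from the repository root should leave the files under `checkpoints/`.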
Evaluation can be performed using the eval_*.py scripts in the scripts folder.
For example, to evaluate on FSC147:
python -m scripts.eval_fsc

The table below reports counting performance using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
| Dataset | Split | MAE | RMSE |
|---|---|---|---|
| FSC | Test | 15.6745 | 96.3807 |
| FSC | Validation | 14.8297 | 59.6979 |
| LVIS | Validation | 1.5379 | 5.6076 |
| OImgv7 | Validation | 3.7087 | 12.0285 |
| Obj365 | Validation | 3.2749 | 9.2181 |
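
For reference, both metrics are computed over per-image counts. The sketch below uses the standard definitions: MAE is the mean of |pred - gt| and RMSE is the square root of the mean of (pred - gt)^2. The function name `counting_errors` and the sample numbers are illustrative only and are not taken from the repository's evaluation scripts.

```python
import math


def counting_errors(pred_counts, gt_counts):
    """Return (MAE, RMSE) for paired predicted and ground-truth counts."""
    if len(pred_counts) != len(gt_counts) or len(pred_counts) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    # Signed per-image error between predicted and true counts.
    diffs = [p - g for p, g in zip(pred_counts, gt_counts)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


if __name__ == "__main__":
    # Made-up counts for three images, purely for illustration.
    mae, rmse = counting_errors([10.0, 4.0, 7.0], [12.0, 4.0, 3.0])
    print(f"MAE={mae:.4f} RMSE={rmse:.4f}")
```

Because RMSE squares each error before averaging, a few images with very large miscounts raise it far more than they raise MAE, which may be why the gap between the two metrics is widest on the FSC test split.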
If you find this work useful in your research, please consider citing:
@InProceedings{zeng2025yolocount,
author = {Zeng, Guanning and Zhang, Xiang and Wang, Zirui and Xu, Haiyang and Chen, Zeyuan and Li, Bingnan and Tu, Zhuowen},
title = {YOLO-Count: Differentiable Object Counting for Text-to-Image Generation},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {16765--16775}
}

This repository is released under the CC-BY-SA 4.0 license.
