100M Chinese image-text pairs | 12TB dataset | 2024-2025 web data

DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Hengyu Shen∗, Tiancheng Gu∗, Bin Qin, Lan Wu, Yuling Wu, Shuo Tan, Zelong Sun, Jun Wang, Nan Wu, Xiang An, Weidong Cai, Ziyong Feng‡, Kaicheng Yang†

∗ Equal Contribution | ‡ Team Leader | † Project Leader

Paper Hugging Face ModelScope License: CC BY 4.0

📣 News

  • [2026/01/16] ✨ We release the DanQing paper.
  • [2026/01/15] 🔥 We release the DanQing dataset (images and captions, about 12TB) on ModelScope.
  • [2026/01/13] ✨ We release the DanQing dataset (image URLs and captions) on 🤗 Hugging Face.

⚠️ Note: Due to Hugging Face storage and transmission limits, we only release the URLs corresponding to the images on Hugging Face. To access the complete dataset, please download it from ModelScope. We also provide synthetic short captions (generated by GLM4.1-base-9B) for the DanQing100M dataset in the recaption column.



💡 Highlights

In this paper, we propose the DanQing dataset, which contains 100 million image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, yielding higher data quality. Moreover, DanQing is built primarily from 2024–2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility.

We compare DanQing with existing datasets by continually pre-training the SigLIP2 model on each. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations.


💻 Dataset Information

Data Preview

Topic Assessment

We implement a topic modeling pipeline based on BERTopic. We randomly sample 10M image-text pairs and extract text embeddings using Chinese-CLIP-L/14. To make clustering tractable in the high-dimensional embedding space, we apply UMAP for dimensionality reduction, followed by HDBSCAN to identify semantic clusters, using a minimum cluster size of 1,000 for stability and noise reduction. Finally, we use class-based TF-IDF to extract representative keywords for each topic.
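
The sketch below mirrors this pipeline under stated assumptions: texts and embeddings are the 10M sampled captions and their precomputed Chinese-CLIP-L/14 text features, and the bertopic, umap-learn, and hdbscan packages are installed.

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# texts: list of 10M sampled captions (assumed precomputed)
# embeddings: (10M, dim) float32 array of Chinese-CLIP-L/14 text features
umap_model = UMAP(n_components=5, metric="cosine")   # reduce the high-dimensional space
hdbscan_model = HDBSCAN(min_cluster_size=1000)       # stable clusters, noise filtered out

# BERTopic applies class-based TF-IDF internally to extract per-topic keywords
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(texts, embeddings=embeddings)
print(topic_model.get_topic_info().head(20))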

Image Resolution and Text Length Distribution

We analyze image resolutions by width, height, and minimum dimension, showing a wide range of visual scales. We also report the distribution of text lengths across the dataset's 2.2B Chinese words.
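
A minimal sketch for reproducing the resolution statistics on a locally downloaded image folder (the folder path is hypothetical):

from pathlib import Path

import numpy as np
from PIL import Image

widths, heights = [], []
for path in Path("DanQing100M-images").rglob("*.jpg"):  # hypothetical local folder
    with Image.open(path) as img:                       # lazy open, no full decode needed
        w, h = img.size
    widths.append(w)
    heights.append(h)

widths, heights = np.asarray(widths), np.asarray(heights)
min_dims = np.minimum(widths, heights)
for name, arr in (("width", widths), ("height", heights), ("min dim", min_dims)):
    print(f"{name}: median={np.median(arr):.0f}  p5={np.percentile(arr, 5):.0f}  p95={np.percentile(arr, 95):.0f}")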

Text Quality

We evaluate the text quality of DanQing using two metrics: semantic word density and perplexity (PPL). We randomly sample 10M texts from DanQing, Wukong, and Zero for comparison. Semantic words (nouns, verbs, adjectives) are identified using the jieba toolkit, and their proportion in each sentence is calculated as semantic density. Sentence-level perplexity is computed with a pre-trained Chinese BERT model.
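
A minimal sketch of the semantic density metric, assuming the jieba toolkit is installed (the PPL metric is analogous: score each sentence with a pre-trained Chinese BERT masked language model):

import jieba.posseg as pseg

def semantic_density(sentence: str) -> float:
    """Proportion of semantic words (nouns, verbs, adjectives) in a sentence."""
    words = [w for w in pseg.cut(sentence) if w.word.strip()]
    if not words:
        return 0.0
    # jieba POS tags: nouns start with 'n', verbs with 'v', adjectives with 'a'
    semantic = [w for w in words if w.flag[:1] in ("n", "v", "a")]
    return len(semantic) / len(words)

print(semantic_density("丹青是一个大规模的中文图文预训练数据集"))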

Cosine Similarity and Semantic Distribution

We analyze 10M-sample subsets of DanQing and Wukong by plotting the distributions of image-text cosine similarities computed with FG-CLIP2-L/16@256. For the semantic distribution comparison, 10M images from each dataset are clustered into 10K groups using FAISS, and clusters are ranked by sample count.
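
A sketch of both analyses under stated assumptions: img_embs and txt_embs are precomputed, L2-normalized FG-CLIP2-L/16@256 features, and faiss is installed.

import numpy as np
import faiss

# cosine similarity of matched pairs (embeddings are L2-normalized)
pair_sims = (img_embs * txt_embs).sum(axis=1)
print(f"mean image-text similarity: {pair_sims.mean():.3f}")

# cluster 10M images into 10K groups, then rank clusters by sample count
d = img_embs.shape[1]
kmeans = faiss.Kmeans(d, 10_000, niter=20, spherical=True, verbose=True)
kmeans.train(img_embs.astype(np.float32))
_, assign = kmeans.index.search(img_embs.astype(np.float32), 1)
counts = np.bincount(assign.ravel(), minlength=10_000)
ranked = np.argsort(-counts)  # cluster ids from largest to smallest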


📊 Performance Comparison

Zero-Shot Classification

Cross-Modal Retrieval (Short Caption)

Cross-Modal Retrieval (Long Caption)

Chinese-Centric Large Multimodal Model Tasks


🧠 Analysis

Data and Model Scaling

We compare the data and model scaling capabilities of DanQing and Wukong, reporting average zero-shot classification and retrieval (long & short caption) performance in the figure below.

New Concept Understanding

We evaluate SigLIP2-L/16 models pre-trained on various Chinese datasets on new concept understanding, and find that the model trained on DanQing consistently assigns the highest confidence to the correct pairs.
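
Scoring such pairs follows the usual SigLIP recipe; a hedged sketch with transformers, where the checkpoint id, image path, and captions are illustrative stand-ins for the model and concepts under test:

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-large-patch16-256"  # illustrative; swap in the model being probed
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")          # hypothetical image of a new concept
texts = ["正确描述", "干扰描述"]           # correct caption vs. distractor (placeholders)
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits_per_image
probs = torch.sigmoid(logits)              # SigLIP confidence: sigmoid, not softmax
print(probs)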


📥 Download

🤗 Hugging Face

Python API

from datasets import load_dataset

ds = load_dataset("DeepGlint-AI/DanQing100M")
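
For a corpus of this size, the standard streaming mode avoids materializing everything locally; a minimal sketch, assuming a single train split and the columns described above:

from datasets import load_dataset

# iterate lazily instead of downloading all 100M rows up front
ds = load_dataset("DeepGlint-AI/DanQing100M", streaming=True)
for sample in ds["train"]:  # assumes a "train" split
    print(sample["url"], sample["alt_text"], sample["recaption"])
    break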

Command Line

# Install dependencies
# macOS:
#   brew install git-xet
#   git xet install
# Ubuntu/Debian:
#   sudo apt update
#   sudo apt install aria2

# Install git-lfs
# curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
# sudo apt-get install git-lfs
# git lfs install

# Download dataset URLs and captions
bash hfd.sh DeepGlint-AI/DanQing100M --dataset --tool aria2c -x 10

# Download images using img2dataset
# pip install img2dataset
# For better performance, it's highly recommended to set up a fast dns resolver
# See: https://github.com/rom1504/img2dataset#setting-up-a-high-performance-dns-resolver
img2dataset --url_list DanQing100M/data \
        --input_format "parquet" \
        --url_col "url" \
        --caption_col "alt_text" \
        --output_format webdataset \
        --output_folder DanQing100M-webdataset \
        --processes_count 16 \
        --thread_count 32 \
        --image_size 256 \
        --resize_only_if_bigger=True \
        --resize_mode="keep_ratio" \
        --skip_reencode=True \
        --save_additional_columns '["recaption"]' \
        --enable_wandb False
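
After the shards are written, they can be loaded with the webdataset package; a minimal sketch, assuming the default img2dataset shard naming (adjust the brace range to the actual number of shards):

import webdataset as wds

# img2dataset writes .jpg / .txt / .json entries per sample into numbered tar shards
ds = (
    wds.WebDataset("DanQing100M-webdataset/{00000..00099}.tar")
    .decode("pil")
    .to_tuple("jpg", "txt", "json")
)
for image, caption, meta in ds:
    print(image.size, caption, meta.get("recaption"))
    break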

ModelScope

Python API

from modelscope.msdatasets import MsDataset

ds = MsDataset.load('deepglint/DanQing')

Command Line

pip install modelscope

# ref: https://modelscope.cn/docs/datasets/download
# Access applications are approved instantly
export YOUR_MODELSCOPE_ACCESS_TOKEN=''  # paste your ModelScope access token
export LOCAL_DIR=''                     # local directory to download into
modelscope login --token $YOUR_MODELSCOPE_ACCESS_TOKEN
modelscope --token $YOUR_MODELSCOPE_ACCESS_TOKEN download --dataset deepglint/DanQing --local_dir $LOCAL_DIR

📄 License

The DanQing dataset is licensed under the CC BY 4.0 License; the full license text can be found in the LICENSE.cc-by-4.0 file. The dataset is collected from Common Crawl web pages and may contain biased or sensitive content. Each collected item remains subject to the license of its original source, and users are solely responsible for ensuring compliance with ethical and legal standards in their research or applications.


πŸ“ Citation

If you find this repository useful, please cite it with the following BibTeX entry.

@misc{danqing,
      title={DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset}, 
      author={Hengyu Shen and Tiancheng Gu and Bin Qin and Lan Wu and Yuling Wu and Shuo Tan and Zelong Sun and Jun Wang and Nan Wu and Xiang An and Weidong Cai and Ziyong Feng and Kaicheng Yang},
      year={2026},
      eprint={2601.10305},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.10305}, 
}

⭐ Don't forget to star this repository if you find it helpful!