
Think-Then-Generate:
Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

arXiv | Hugging Face | Project Page

More results can be found in the gallery.


🧠 How it Works

Pipeline Architecture

Most existing Text-to-Image (T2I) models act as simple text-pixel mappers: they encode text without truly understanding it. To bridge the gap between abstract user prompts and concrete visual pixels, we propose the Think-Then-Generate paradigm:

  1. Phase I: Reasoning Activation

We first activate the reasoning potential of the LLM-based text encoder via lightweight SFT. Instead of directly passing the raw prompt to the generator, the LLM is encouraged to reason about the user's intent and rewrite the prompt into a detailed, structured description that serves as conditioning for the DiT backbone.
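The Phase I data flow can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `<think>`/`<prompt>` tags and helper names are hypothetical placeholders for whatever template the actual SFT data uses.

```python
def build_sft_example(raw_prompt: str, reasoning: str, refined_prompt: str) -> dict:
    """Format one supervised example that teaches the LLM encoder to reason
    about the user's intent before emitting a detailed, structured rewrite.
    NOTE: the tag format here is an assumption for illustration only."""
    target = f"<think>{reasoning}</think>\n<prompt>{refined_prompt}</prompt>"
    return {"input": raw_prompt, "target": target}

def extract_refined_prompt(llm_output: str) -> str:
    """Pull the rewritten prompt out of the LLM output; this string (or its
    hidden states) is what conditions the DiT backbone instead of the raw prompt."""
    start = llm_output.index("<prompt>") + len("<prompt>")
    end = llm_output.index("</prompt>")
    return llm_output[start:end]
```

The key point is that the generator never sees the raw user prompt directly; it is conditioned on the LLM's refined description.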

  2. Phase II: Co-Evolution via Dual-GRPO

To ensure the reasoning actually improves image quality, we employ Dual-GRPO to co-optimize both the "Brain" (LLM) and the "Painter" (DiT Backbone):

  • For the LLM Encoder: It is reinforced using image-grounded rewards focusing on semantic alignment. This forces the model to activate latent world knowledge and infer precise visual details that are critical for accurate generation.

  • For the DiT Backbone: It is simultaneously trained with visual realism and aesthetic quality rewards conditioned on the refined prompts. This aligns the generator's capabilities with the complex, detailed instructions produced by the LLM.
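At its core, GRPO scores each sample relative to its group rather than against a learned value baseline; Dual-GRPO applies this with two reward channels, one per module. The sketch below shows only the group-relative advantage computation under that assumption; the actual reward models and policy-update code are not part of this snippet.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each sample's reward by the mean and
    standard deviation of its own sampled group (no critic network needed)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dual_grpo_advantages(alignment_rewards: list[float],
                         aesthetic_rewards: list[float]) -> tuple[list[float], list[float]]:
    """Dual-GRPO: separate reward channels for the two co-trained modules.
    - "Brain" (LLM encoder): image-grounded semantic-alignment rewards.
    - "Painter" (DiT backbone): visual realism / aesthetic-quality rewards."""
    llm_adv = group_relative_advantages(alignment_rewards)
    dit_adv = group_relative_advantages(aesthetic_rewards)
    return llm_adv, dit_adv
```

Because each module is scored by its own reward while sharing the same rollouts, the LLM is pushed toward rewrites that measurably help the image, and the DiT is pushed to follow those richer rewrites.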

๐Ÿ› ๏ธ Installation

Clone the repository and install the necessary dependencies:

git clone https://github.com/SJTU-DENG-Lab/Think-Then-Generate.git
cd Think-Then-Generate
pip install torch transformers diffusers accelerate

🚀 Inference

Run the model on a single GPU for reasoning-aware generation:

python inference.py \
  --model_path "SJTU-Deng-Lab/Think-Then-Generate-T2I" \
  --prompt "A multi-panel illustration showing the story of marking the boat to find a sword, with clear steps from dropping the sword to carving a mark on the boat." \
  --output "sword_result.jpg"

๐Ÿ–Š๏ธ Citation

If you find our work helpful, please consider citing:

@article{kou2026think,
  title={Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders},
  author={Siqi Kou and Jiachun Jin and Zetong Zhou and Ye Ma and Yugang Wang and Quan Chen and Peng Jiang and Xiao Yang and Jun Zhu and Kai Yu and Zhijie Deng},
  journal={arXiv preprint},
  year={2026}
}
