
Think-Then-Generate:
Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

arXiv | Hugging Face | Project Page

More results can be found in the gallery.


🧠 How it Works

Pipeline Architecture

Most existing Text-to-Image (T2I) models act as simple text-pixel mappers: they encode text without truly understanding it. To bridge the gap between abstract user prompts and concrete visual pixels, we propose the Think-Then-Generate paradigm:

  1. Phase I: Reasoning Activation

We first activate the reasoning potential of the LLM-based text encoder via lightweight SFT. Instead of directly passing the raw prompt to the generator, the LLM is encouraged to reason about the user's intent and rewrite the prompt into a detailed, structured description that serves as conditioning for the DiT backbone.
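The Phase I data flow can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `<think>`/`<prompt>` tags and helper names are hypothetical placeholders for whatever template the actual SFT data uses.

```python
def build_sft_example(raw_prompt: str, reasoning: str, refined_prompt: str) -> dict:
    """Format one supervised example that teaches the LLM encoder to reason
    about the user's intent before emitting a detailed, structured rewrite.
    NOTE: the tag format here is an assumption for illustration only."""
    target = f"<think>{reasoning}</think>\n<prompt>{refined_prompt}</prompt>"
    return {"input": raw_prompt, "target": target}

def extract_refined_prompt(llm_output: str) -> str:
    """Pull the rewritten prompt out of the LLM output; this string (or its
    hidden states) is what conditions the DiT backbone instead of the raw prompt."""
    start = llm_output.index("<prompt>") + len("<prompt>")
    end = llm_output.index("</prompt>")
    return llm_output[start:end]
```

The key point is that the generator never sees the raw user prompt directly; it is conditioned on the LLM's refined description.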

  2. Phase II: Co-Evolution via Dual-GRPO

To ensure the reasoning actually improves image quality, we employ Dual-GRPO to co-optimize both the "Brain" (LLM) and the "Painter" (DiT Backbone):

  • For the LLM Encoder: It is reinforced using image-grounded rewards focusing on semantic alignment. This forces the model to activate latent world knowledge and infer precise visual details that are critical for accurate generation.

  • For the DiT Backbone: It is simultaneously trained with visual realism and aesthetic quality rewards conditioned on the refined prompts. This aligns the generator's capabilities with the complex, detailed instructions produced by the LLM.
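At its core, GRPO scores each sample relative to its group rather than against a learned value baseline; Dual-GRPO applies this with two reward channels, one per module. The sketch below shows only the group-relative advantage computation under that assumption; the actual reward models and policy-update code are not part of this snippet.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each sample's reward by the mean and
    standard deviation of its own sampled group (no critic network needed)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dual_grpo_advantages(alignment_rewards: list[float],
                         aesthetic_rewards: list[float]) -> tuple[list[float], list[float]]:
    """Dual-GRPO: separate reward channels for the two co-trained modules.
    - "Brain" (LLM encoder): image-grounded semantic-alignment rewards.
    - "Painter" (DiT backbone): visual realism / aesthetic-quality rewards."""
    llm_adv = group_relative_advantages(alignment_rewards)
    dit_adv = group_relative_advantages(aesthetic_rewards)
    return llm_adv, dit_adv
```

Because each module is scored by its own reward while sharing the same rollouts, the LLM is pushed toward rewrites that measurably help the image, and the DiT is pushed to follow those richer rewrites.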

๐Ÿ› ๏ธ Installation

Clone the repository and install the necessary dependencies:

git clone https://github.com/SJTU-DENG-Lab/Think-Then-Generate.git
cd Think-Then-Generate
pip install torch transformers diffusers accelerate

🚀 Inference

Run the model on a single GPU for reasoning-aware generation:

python inference.py \
  --model_path "SJTU-Deng-Lab/Think-Then-Generate-T2I" \
  --prompt "A multi-panel illustration showing the story of marking the boat to find a sword, with clear steps from dropping the sword to carving a mark on the boat." \
  --output "sword_result.jpg"

๐Ÿ–Š๏ธ Citation

If you find our work helpful, please consider citing:

@article{kou2026think,
  title={Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders},
  author={Siqi Kou and Jiachun Jin and Zetong Zhou and Ye Ma and Yugang Wang and Quan Chen and Peng Jiang and Xiao Yang and Jun Zhu and Kai Yu and Zhijie Deng},
  journal={arXiv preprint},
  year={2026}
}
