This repository explores how to apply Reinforcement Learning from Human Feedback (RLHF) techniques to flow-based diffusion models. It discusses the theory behind RLHF in the context of diffusion, highlights popular RLHF methods, and describes the challenges and modifications required to integrate RLHF with flow-matching schedulers.
## Contents

- Introduction to RLHF for Diffusion
- Existing RLHF Techniques
- Challenges with Flow-Based Models
- Implementation Steps
- Next Steps
- References
## Introduction to RLHF for Diffusion

Reinforcement Learning from Human Feedback (RLHF) has been widely adopted to refine large language models by aligning them with user preferences. In the context of diffusion-based image generation, RLHF follows similar principles:
- We collect human preferences (e.g., which of two images is better aligned with a prompt).
- We fine-tune a diffusion model so that it generates images more consistent with these user preferences.
Traditionally, diffusion models are trained with score matching or forward noise schedules, but RLHF adds a new feedback loop:
- Generate multiple candidate images.
- Ask humans (or a proxy) to label which image(s) they prefer.
- Update the model using those preferences.
## Existing RLHF Techniques

Several RLHF approaches exist in the diffusion space:
- D3PO: A PPO-inspired method for diffusion, comparing the likelihood ratio of the new model vs. a reference model at each diffusion step.
- Direct Preference Optimization (DPO): Directly optimizes the model on preference data (or a reward signal) without training a separate reward model.
- PPO/ILQL-style algorithms adapted to diffusion: Using policy-gradient methods but interpreting the model as a “policy” over latent transitions.
Most of these methods rely on calculating the log-probabilities of the diffusion steps so that one can compare “new” vs. “old” model likelihoods for user-preferred samples.
## Challenges with Flow-Based Models

- **Deterministic Scheduler**: Flow-matching or Euler-style schedulers often use (near-)deterministic updates, making it tricky to define a meaningful log-probability for each step.
- **No Direct Gaussian Parameterization**: Standard samplers (DDIM, DDPM) naturally produce a mean and variance for each step, allowing straightforward log-prob calculation. Flow-based transitions frequently lack an explicit variance term.
- **Need to Introduce Variance**: To compute $\log p_{\theta}(x_{t+1} \mid x_t)$, we may need to artificially add small noise to the update step. This ensures the transition is treated as a probabilistic process instead of a delta function.
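One concrete way to introduce that variance: perturb the deterministic Euler update with a small amount of Gaussian noise so each transition has a well-defined density. This is a sketch, with the velocity prediction $v_\theta$, step size $\Delta t$, and fixed scalar $\sigma$ as notation introduced here:

$$x_{t+1} = x_t + v_\theta(x_t, t)\,\Delta t + \sigma\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \;\;\Longrightarrow\;\; p_\theta(x_{t+1} \mid x_t) = \mathcal{N}\big(x_{t+1};\; x_t + v_\theta(x_t, t)\,\Delta t,\; \sigma^2 I\big)$$

With $\sigma$ small, sampling is essentially unchanged, but every transition now has a Gaussian log-density that can be compared between a new and a reference model.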
## Implementation Steps

**1. Dataset Preparation**

- **Capture Intermediate Latents**: Store $(x_t, x_{t+1})$ and the scheduler timestep at each diffusion step (see the sampling sketch below).
- **Collect Human Preferences**: For each prompt, generate multiple final images; label which image a user prefers.
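A minimal sketch of trajectory capture during sampling. It assumes a diffusers-style scheduler exposing `set_timesteps`/`step` and a model called as `model(latents, timestep, conditioning)`; the function name `sample_with_trajectory` and the latent shape are illustrative, not an existing API.

```python
import torch

@torch.no_grad()
def sample_with_trajectory(model, scheduler, cond, num_steps=28, device="cuda"):
    """Run flow-matching sampling while recording (x_t, x_{t+1}, t) at every step."""
    scheduler.set_timesteps(num_steps)
    x = torch.randn(cond.shape[0], 4, 64, 64, device=device)  # illustrative latent shape
    trajectory = []
    for t in scheduler.timesteps:
        v = model(x, t, cond)                         # predicted velocity / flow
        x_next = scheduler.step(v, t, x).prev_sample  # deterministic Euler update
        trajectory.append({"x_t": x.cpu(), "x_next": x_next.cpu(), "t": float(t)})
        x = x_next
    return x, trajectory  # decode x to images for labeling; keep trajectory for training
```

Preference labels are attached per prompt after decoding the final latents; the stored transitions are what the log-prob calculation below consumes.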
**2. Log-Prob Calculation**

- **Modify Scheduler**: Create a function (e.g. `flowmatch_step_with_logprob`, sketched below) that:
  - Computes the deterministic update (Euler/flow step).
  - Introduces or assumes a small variance $\sigma^2$.
  - Returns the log-prob of the observed $x_{t+1}$.
- **Reference vs. New Model**: Keep a frozen copy of your flow-based model and compare new vs. reference log-probs at each step.
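A hedged sketch of what such a function could look like. The signature, the fixed `sigma`, and summing the log-density over all non-batch dimensions are choices made here for illustration; a real implementation would more likely wrap an existing flow-matching scheduler's step.

```python
import math
import torch

def flowmatch_step_with_logprob(velocity, dt, x_t, x_next=None, sigma=1e-3):
    """Euler/flow step with an artificial Gaussian variance.

    velocity: v_theta(x_t, t) predicted by the model.
    Returns (x_next, log_prob) with log_prob = log N(x_next; x_t + velocity*dt, sigma^2 I).
    """
    mean = x_t + velocity * dt                       # deterministic flow update
    if x_next is None:                               # sampling mode: draw a noisy next latent
        x_next = mean + sigma * torch.randn_like(mean)
    log_prob = (
        -((x_next - mean) ** 2) / (2 * sigma**2)
        - math.log(sigma)
        - 0.5 * math.log(2 * math.pi)
    ).sum(dim=tuple(range(1, x_next.ndim)))          # sum over all but the batch dim
    return x_next, log_prob
```

During training, call this twice per stored transition, once with the trainable model's velocity and once with the frozen reference model's, passing the recorded `x_next` so both log-probs describe the same observed step.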
**3. Training Loop Adjustments**

- **Pairwise Comparisons**: For each batch, compare two final images. If image A is preferred, reward transitions that lead to A being more probable under the new model.
- **Ratio-Based Loss**: Compute

  $$\text{ratio} = \exp\big(\log p_\theta(x_{t+1} \mid x_t) - \log p_{\text{ref}}(x_{t+1} \mid x_t)\big)$$

  and optimize according to user preferences (like a PPO-style objective; see the sketch below).
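One way to turn the per-step log-probs into a loss, loosely following the PPO clipped surrogate. The clip range, the ±1 "advantage" derived from the preference label, and the function name are assumptions for illustration:

```python
import torch

def preference_ppo_loss(logp_new, logp_ref, preferred, clip_range=1e-4):
    """PPO-style surrogate from per-step log-probs (a sketch, not a fixed recipe).

    logp_new, logp_ref: shape (batch,) log-probs for one diffusion step.
    preferred: +1 for transitions from the preferred image, -1 for the rejected one.
    """
    ratio = torch.exp(logp_new - logp_ref)
    advantage = preferred.float()                              # simplest possible reward signal
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    return -torch.min(unclipped, clipped).mean()               # maximize the clipped surrogate
```

The ±1 advantage is the simplest choice; a learned reward model's score or a DPO-style pairwise term could slot into the same place.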
**4. Gradient Updates**

- **Accumulate Gradients Across Steps**: For each diffusion timestep, compute the ratio-based reward and backprop through the flow-based UNet/transformer (a combined sketch follows this list).
- **Clip Gradients & Stabilize Training**: As with PPO or other RL methods, gradient clipping can help.
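Putting the pieces together: a sketch of one update that accumulates gradients over stored timestep transitions and clips them before stepping the optimizer. The batch layout and helper functions follow the earlier sketches and are assumptions, not an existing training API.

```python
import torch

def train_step(model, ref_model, optimizer, transitions, sigma=1e-3, max_grad_norm=1.0):
    """One optimizer update over a list of (x_t, x_next, t, dt, cond, preferred) tuples."""
    optimizer.zero_grad()
    for x_t, x_next, t, dt, cond, preferred in transitions:
        v_new = model(x_t, t, cond)
        with torch.no_grad():
            v_ref = ref_model(x_t, t, cond)            # frozen reference model
        _, logp_new = flowmatch_step_with_logprob(v_new, dt, x_t, x_next, sigma)
        _, logp_ref = flowmatch_step_with_logprob(v_ref, dt, x_t, x_next, sigma)
        loss = preference_ppo_loss(logp_new, logp_ref, preferred)
        (loss / len(transitions)).backward()           # accumulate across timesteps
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # stabilize updates
    optimizer.step()
```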
## Next Steps

- **Hyperparameter Tuning**: The artificial variance, learning rate, and batch sizes all play crucial roles.
- **Scaling Up**: Flow-based models can be large; distributed training and half-precision can be key.
- **User Interface**: Creating user-friendly annotation tools for collecting preferences can greatly improve data quality.
- **Benchmarking**: Compare RLHF-trained flow models to established methods (DDIM or DDPM with RLHF) to gauge performance gains.
## References

- D3PO: yk7333/d3po GitHub repo
- PPO: Schulman et al., "Proximal Policy Optimization Algorithms" (2017)
- Flow Matching background: Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021)
## Questions or Contributions?

Feel free to open an issue or pull request if you have suggestions on improving RLHF for flow-based diffusion.