This project was developed by Soham T. Umbare and me.
This project focuses on re-implementing Stable Diffusion using a custom architecture built from a Variational Autoencoder (VAE), a CLIP Encoder, and a U-Net that performs step-by-step denoising of images conditioned on text prompts. A scheduler guides the denoising process.
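At generation time these pieces are wired into a single denoising loop. The sketch below only illustrates the data flow; the object names and method signatures (`clip`, `unet`, `vae`, `scheduler.step`, the latent shape) are placeholders for illustration, not our actual API.

```python
import torch

@torch.no_grad()
def generate(prompt, clip, unet, vae, scheduler, steps=50):
    """Illustrative text-to-image loop: CLIP conditions the U-Net,
    the scheduler guides denoising, and the VAE decodes the latents."""
    context = clip(prompt)                    # text embedding used as conditioning
    latents = torch.randn(1, 4, 64, 64)       # start from pure Gaussian noise
    for t in scheduler.timesteps(steps):      # e.g. T-1, ..., 0
        noise_pred = unet(latents, t, context)            # predict noise at step t
        latents = scheduler.step(noise_pred, t, latents)  # remove a bit of noise
    return vae.decode(latents)                # decode latents into an image
```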
The results shown in the image above were generated using the pre-trained weights available at this link; they are not the results of our custom-trained models.
Our training pipeline follows the diffusion process described in the original paper. We use two primary losses, which encapsulate the core denoising process; the paper describes additional losses, but they are variations of the following fundamental ones.

1. Forward Trajectory (Adding Noise)
We progressively add noise to the input image:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
Key Terms:

- $q$: distribution of the noisy state.
- $x_t$: image at step $t$ with added noise.
- $x_{t-1}$: image at step $t-1$ with less noise.
- $\mathcal{N}$: Gaussian distribution.
- $\beta_t$: noise-schedule variance at step $t$. A higher $\beta_t$ adds more noise.
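In code, a single forward step is one Gaussian draw, and the closed form lets us jump from $x_0$ to any $x_t$ directly. The snippet below is a minimal PyTorch sketch, not our actual pipeline; the linear $\beta$ schedule and the helper names `forward_step` and `q_sample` are assumptions for illustration.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # assumed linear noise schedule
alphas_bar = torch.cumprod(1 - betas, dim=0)  # cumulative product of (1 - beta_s)

def forward_step(x_prev, t):
    """One step of q(x_t | x_{t-1}): scale the image down, add Gaussian noise."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise

def q_sample(x0, t, noise):
    """Jump directly from the clean image x_0 to x_t via the closed form."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * noise
```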
2. Reverse Trajectory (Removing Noise)
The denoising process reconstructs the clean image step by step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \beta_t I\right)$$
Key Terms:

- $\mu_\theta(x_t, t)$: the model's prediction of the denoised data (the mean of $x_{t-1}$), based on the noisy input $x_t$.
- $\beta_t \cdot I$: variance for randomness, ensuring diversity in reconstructions.
Where:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right), \quad \alpha_t = 1-\beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$$

- $\epsilon_\theta(x_t, t)$: the noise predicted by the model.
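Under these definitions, one reverse step computes $\mu_\theta$ from the predicted noise and adds $\beta_t \cdot I$ variance. Again a hedged sketch reusing `betas` and `alphas_bar` from the forward-trajectory snippet; `model` stands in for a noise-predicting U-Net and is an assumption.

```python
@torch.no_grad()
def reverse_step(model, x_t, t):
    """Sample x_{t-1} ~ p_theta(x_{t-1} | x_t) from the predicted noise."""
    eps = model(x_t, t)                       # epsilon_theta(x_t, t)
    alpha_t = 1 - betas[t]
    mean = (x_t - betas[t] / torch.sqrt(1 - alphas_bar[t]) * eps) / torch.sqrt(alpha_t)
    if t == 0:
        return mean                           # no noise is added at the final step
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
```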
- Components Implemented:
  - Forward and reverse trajectories for noise addition and removal.
  - Loss functions based on diffusion model principles (a sketch of the noise-prediction loss follows this list).
- Trained Models Used:
  - VAE: VAE Re-implementation
  - CLIP Encoder: Clip-Encoder Re-implementation
- Diffusion Model Training: currently in progress, with the aim of achieving competitive results. Next steps:
  - Complete the training of our diffusion model.
  - Experiment with novel loss functions and scheduler improvements.
  - Benchmark our results against the pre-trained model's performance.
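As one concrete instance of the loss functions mentioned in the list above, the simplified objective from the paper trains the model to recover the injected noise with a mean-squared error. This sketch reuses `T` and `q_sample` from the forward-trajectory snippet and again treats `model` as a placeholder U-Net.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0):
    """Simplified denoising objective: MSE between true and predicted noise."""
    t = torch.randint(0, T, (x0.shape[0],))   # one random timestep per sample
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)              # noisy input via the forward trajectory
    return F.mse_loss(model(x_t, t), noise)   # epsilon-prediction loss
```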
For more details, check out our project report.
