This project implements and compares different Generative AI architectures for conditional image synthesis. The goal is to generate realistic human faces based on the CelebA dataset, enabling control over specific semantic attributes (gender, expression, age).
The system is designed with modularity in mind to support the training and inference of three distinct classes of generative models:
- Conditional Variational Autoencoder (C-VAE)
- Conditional Denoising Diffusion Probabilistic Model (C-DDPM)
- Conditional Vision Mamba
- Multi-Model Architecture: Native support for VAE, Diffusion Models, and State Space Models (Mamba) via a Factory pattern (a minimal sketch follows this list).
- Conditional Generation: Models accept attribute vectors to guide generation (e.g., Young, Smiling, Male).
- Advanced Training Management:
  - Automatic checkpointing system to resume training in case of interruption.
  - Progress monitoring with a custom indicator.
  - Periodic generation of visual snapshots to qualitatively evaluate learning during epochs.
- Unified Inference Script: Flexible tool for generating single samples or comparative grids of all attribute combinations.
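The Factory wiring is not detailed in this README; the sketch below illustrates the idea, assuming a simple registry keyed by the `MODEL_TYPE` values (`vae`, `diff`, `mamba`). The class names are placeholders, not the project's actual classes.

```python
# Minimal sketch of the Factory pattern (illustrative only; the real classes
# live in models/ and may have different names and constructors).
import torch.nn as nn

class ConditionalVAE(nn.Module): ...          # placeholder for models/autoencoder
class ConditionalDDPM(nn.Module): ...         # placeholder for models/diffusion
class ConditionalVisionMamba(nn.Module): ...  # placeholder for models/mamba

MODEL_REGISTRY = {
    "vae": ConditionalVAE,
    "diff": ConditionalDDPM,
    "mamba": ConditionalVisionMamba,
}

def build_model(model_type: str, **kwargs) -> nn.Module:
    """Instantiate the architecture selected via MODEL_TYPE."""
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model type: {model_type!r}")
    return MODEL_REGISTRY[model_type](**kwargs)
```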
```
├── config/          # Configuration and hyperparameter management
├── data/            # Dataset loader and preprocessing (CelebA)
├── models/          # Neural architecture implementations
│   ├── autoencoder/ # Classes for Conditional VAE
│   ├── diffusion/   # Classes for Conditional Diffusion (U-Net, Noise Schedule)
│   └── mamba/       # Classes for Vision Mamba (MambaBlock, PScan)
├── train/           # Training logic (Trainer loop)
├── utility/         # Support tools (Checkpoint, visualization, logging)
├── weights/         # Directory for saving model weights
├── generate.py      # Image generation script
├── train.py         # Training entry point
└── unzip.py         # Utility for dataset extraction
```
The project requires Python 3.x and standard scientific and deep learning libraries.
- Clone the repository.
- Ensure dependencies are installed (PyTorch, Torchvision, NumPy, Matplotlib, PIL).
The project uses the CelebA dataset. Images must be placed in the dataset/ folder. A utility script is provided to automatically decompress archives:
```bash
python unzip.py
```
Note: The script searches for .zip files in the ./dataset folder and extracts contents in place.
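For reference, the behavior described in the note (find every `.zip` archive under `./dataset` and extract it in place) corresponds to a small script along these lines; the actual `unzip.py` may differ in details:

```python
# Minimal sketch of the described extraction step: locate .zip archives in
# ./dataset and extract their contents into the same folder.
import zipfile
from pathlib import Path

dataset_dir = Path("./dataset")
for archive in dataset_dir.glob("*.zip"):
    print(f"Extracting {archive.name} ...")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dataset_dir)
```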
Training configuration is managed via Environment Variables or default values defined in config/config.py. Parameters can be modified without changing the code by setting variables before execution.
Main Parameters:
| Variable | Default | Description |
|---|---|---|
| `MODEL_TYPE` | `mamba` | Model type: `vae`, `diff`, or `mamba`. |
| `BATCH_SIZE` | `256` | Training batch size. |
| `EPOCHS` | `1000` | Total number of epochs. |
| `LEARNING_RATE` | `5e-4` | Learning rate for the Adam optimizer. |
| `CHECKPOINT_INTERVAL` | `300` | Interval (seconds) for saving checkpoints. |
| `DEVICE` | `cuda` | Computing device (`cuda` or `cpu`). |
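The exact contents of `config/config.py` are not shown here; a minimal sketch of the described behavior (environment variables overriding the defaults listed above) could look like this:

```python
# Hypothetical sketch of config/config.py: each parameter falls back to its
# default unless overridden by an environment variable of the same name.
import os

MODEL_TYPE = os.getenv("MODEL_TYPE", "mamba")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "256"))
EPOCHS = int(os.getenv("EPOCHS", "1000"))
LEARNING_RATE = float(os.getenv("LEARNING_RATE", "5e-4"))
CHECKPOINT_INTERVAL = int(os.getenv("CHECKPOINT_INTERVAL", "300"))  # seconds
DEVICE = os.getenv("DEVICE", "cuda")
```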
To start training, configure the desired model type and run the train.py script. The system will automatically load the last available checkpoint if one exists.
Example (Linux/Mac):
```bash
export MODEL_TYPE=vae
export BATCH_SIZE=64
export EPOCHS=50
python train.py
```
Example (Windows PowerShell):
```powershell
$env:MODEL_TYPE="diff"
$env:BATCH_SIZE="32"
python train.py
```
During training, weights will be saved in the temp/CHECKPOINT directory, and visual snapshots will be generated to monitor quality.
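The checkpointing logic lives in `utility/`; the resume-and-periodic-save behavior described above could be implemented roughly as in this sketch (file names and state-dict keys are assumptions, not the project's actual code):

```python
# Hypothetical sketch of time-based checkpointing with automatic resume.
import os, time
import torch

CKPT_PATH = "temp/CHECKPOINT/last.pt"  # assumed file name inside temp/CHECKPOINT

def try_resume(model, optimizer):
    """Load the last checkpoint if one exists; return the epoch to resume from."""
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1
    return 0

def maybe_save(model, optimizer, epoch, last_save, interval=300):
    """Save a checkpoint whenever `interval` seconds have elapsed."""
    now = time.time()
    if now - last_save >= interval:
        os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, CKPT_PATH)
        return now
    return last_save
```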
The generate.py script allows image generation using trained models. It supports two main modes:
- Manual Mode: Generates a set of images with specific attributes.

  ```bash
  python generate.py --model vae --male 1 --smiling 1 --young 1 --num_samples 8 --show
  ```

- Grid Mode (All Combos): Generates a grid showing all 8 possible attribute combinations (Male/Female, Smile/NoSmile, Young/Old).

  ```bash
  python generate.py --model diff --all_combos --samples_per_combo 2
  ```
Additional examples:

```bash
# Woman, Smiling, Young - 16 images
python generate.py --model vae --male 0 --smiling 1 --young 1 --num_samples 16 --show

# All combinations (grid) - 4 images per combo
python generate.py --model vae --all_combos --samples_per_combo 4 --show

# All combinations (grid) - fast (500 DDIM steps, eta = 0.0) - 4 images per combo
python generate.py --model diff --all_combos --samples_per_combo 2 --steps 500 --eta 0.8 --show

# All combinations (grid) - high quality (DDPM) - 2 images per combo
python generate.py --model diff --all_combos --samples_per_combo 1 --steps 1000 --eta 1.0

# All combinations (grid) - low temperature (deterministic) - 4 images per combo
python generate.py --model mamba --all_combos --samples_per_combo 4 --temperature 0.0 --show

# All combinations (grid) - high temperature (diversity) - 4 images per combo
python generate.py --model mamba --all_combos --samples_per_combo 4 --temperature 0.01 --show
```
Arguments:

- `--model`: Required. Choice between `vae`, `diff`, `mamba`.
- `--output_dir`: Destination folder (default: `./generated_samples`).
- `--show`: Displays generated images on screen in addition to saving them.
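Internally, the attribute flags presumably end up as a binary conditioning vector passed to the selected model. The sketch below shows one plausible mapping; the attribute ordering (Male, Smiling, Young) and function name are assumptions for illustration.

```python
# Hypothetical sketch: turning CLI attribute flags into a conditioning tensor.
import torch

def make_condition(male: int, smiling: int, young: int,
                   num_samples: int, device: str = "cpu") -> torch.Tensor:
    attrs = torch.tensor([male, smiling, young], dtype=torch.float32, device=device)
    return attrs.unsqueeze(0).repeat(num_samples, 1)  # shape: (num_samples, 3)

# Grid mode would iterate over all 2**3 = 8 attribute combinations,
# generating samples_per_combo images for each one.
```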
A symmetric Encoder-Decoder architecture with convolutional layers and residual connections (Skip Connections).
- Encoder: Compresses the image and attributes into a latent space parameterized by mean ($\mu$) and variance ($\sigma^2$).
- Decoder: Reconstructs the image starting from the sampled latent vector and attributes.
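For reference, the sampling step implied by this description is the standard reparameterization trick; the sketch below shows it together with the attribute concatenation at the decoder input. Layer shapes and names are illustrative, not the project's actual code in `models/autoencoder/`.

```python
# Minimal sketch of conditional VAE latent sampling (reparameterization trick).
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    std = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # noise ~ N(0, I)
    return mu + eps * std           # z = mu + sigma * eps

# Decoder input: sampled latent concatenated with the attribute vector, e.g.
# z_cond = torch.cat([z, attributes], dim=1)
```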
Based on a U-Net with sinusoidal Time Encoding and attribute conditioning via linear projections.
- Uses a pre-calculated noise schedule (Beta/Alpha schedule).
- Supports guided sampling via the `LAMBDA` ($\lambda$) parameter (Classifier-Free Guidance style).
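The README does not spell out how `LAMBDA` enters the sampler; in classifier-free-guidance-style samplers the guidance weight usually blends conditional and unconditional noise predictions, roughly as in this sketch (the model call signature is an assumption):

```python
# Hypothetical sketch of classifier-free-guidance-style noise prediction,
# where `lam` corresponds to the LAMBDA parameter mentioned above.
import torch

def guided_noise(model, x_t, t, attrs, lam: float) -> torch.Tensor:
    eps_cond = model(x_t, t, attrs)                       # conditioned on attributes
    eps_uncond = model(x_t, t, torch.zeros_like(attrs))   # null / dropped condition
    return eps_uncond + lam * (eps_cond - eps_uncond)     # blend the two predictions
```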
Innovative implementation based on State Space Models (SSM).
- Utilizes `ResidualMambaLayer` blocks integrating parallel scan operations (`pscan`) for computational efficiency.
- Treats images as sequences of flattened patches, combined with positional and attribute embeddings.
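As a rough picture of the patch-sequence view described above, the sketch below builds the token sequence that would feed the `ResidualMambaLayer` stack; image size, patch size, and embedding dimensions are assumptions for illustration.

```python
# Hypothetical sketch: image batch -> patch tokens + positional + attribute embeddings.
import torch
import torch.nn as nn

B, C, H, W, P, D = 8, 3, 64, 64, 8, 256        # batch, channels, image size, patch size, embed dim
num_patches = (H // P) * (W // P)               # 64 patches for a 64x64 image with 8x8 patches

images = torch.randn(B, C, H, W)
attrs = torch.randint(0, 2, (B, 3)).float()     # e.g. Male / Smiling / Young

patchify = nn.Conv2d(C, D, kernel_size=P, stride=P)     # flatten patches via strided conv
pos_emb = nn.Parameter(torch.zeros(1, num_patches, D))  # learned positional embedding
attr_proj = nn.Linear(3, D)                             # attribute embedding

tokens = patchify(images).flatten(2).transpose(1, 2)    # (B, num_patches, D)
tokens = tokens + pos_emb + attr_proj(attrs).unsqueeze(1)
# `tokens` would then be processed by the stack of ResidualMambaLayer blocks.
```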
Authors: Generative AI - Project Group 03 - UNISA, Academic Year 2025/26