This is the official repo of the paper SimDA: Simple Diffusion Adapter for Efficient Video Generation.
The project website is here.
sh env.sh
Installing xformers is highly recommended for more efficiency and speed on GPUs.
To enable xformers, set enable_xformers_memory_efficient_attention=True (default).
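Since xformers is optional, it can help to check that it is installed before turning the option on. The following is a minimal sketch, not part of this repo: the helper names `xformers_available` and `maybe_enable_xformers` are hypothetical, and the only pipeline method it assumes is `enable_xformers_memory_efficient_attention()`, which the inference example in this README calls.

```python
import importlib.util


def xformers_available() -> bool:
    """Return True if the xformers package can be found in this environment."""
    return importlib.util.find_spec("xformers") is not None


def maybe_enable_xformers(pipe) -> bool:
    """Enable memory-efficient attention on `pipe` only when xformers is installed.

    `pipe` is assumed to expose enable_xformers_memory_efficient_attention(),
    as the pipeline in the inference example does. Returns True if it was enabled.
    """
    if not xformers_available():
        return False
    pipe.enable_xformers_memory_efficient_attention()
    return True
```

With a helper like this, the inference script would call `maybe_enable_xformers(pipe)` instead of calling the method unconditionally, and fall back to standard attention when xformers is missing.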
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-4, v2-1). You can also use fine-tuned Stable Diffusion models trained on different styles (e.g., Modern Disney, Anything V4.0, Redshift).
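One possible way to fetch the weights is the `huggingface_hub` package. This is a sketch under assumptions rather than the repo's own download step: the function name `download_stable_diffusion` is hypothetical, the repo id `CompVis/stable-diffusion-v1-4` is assumed to be the v1-4 checkpoint you want, and the target directory mirrors the `./checkpoints/stable-diffusion-v1-4` path used in the inference example.

```python
def download_stable_diffusion(
    repo_id: str = "CompVis/stable-diffusion-v1-4",  # assumed Hub repo id
    local_dir: str = "./checkpoints/stable-diffusion-v1-4",
) -> str:
    """Download a Stable Diffusion snapshot from the Hugging Face Hub.

    Returns the local path of the downloaded snapshot.
    """
    # Imported lazily so the function can be defined without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example usage (requires network access and several GB of disk space):
# path = download_stable_diffusion()
```

Passing a different `repo_id` would let the same helper fetch one of the style-specific fine-tuned models instead.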
To fine-tune the text-to-image diffusion models for text-to-video generation, run this command:
sh train.sh
Note: Tuning a 24-frame video usually takes 200~500 steps, about 5~10 minutes using one A100 GPU.
Reduce n_sample_frames if your GPU memory is limited.
Once the training is done, run inference:
from simda.pipelines.pipeline_simda import SimDAPipeline
from simda.models.unet import UNet3DConditionModel
from simda.util import save_videos_grid
import torch
pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
my_model_path = "./outputs/car-turn"
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = SimDAPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_vae_slicing()
prompt = "spider man is skiing"
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)
video = pipe(prompt, latents=ddim_inv_latent, video_length=24, height=512, width=512, num_inference_steps=50, guidance_scale=12.5).videos
save_videos_grid(video, f"./{prompt}.gif")
If you make use of our work, please cite our paper.
@inproceedings{xing2023simda,
title={SimDA: Simple Diffusion Adapter for Efficient Video Generation},
author={Xing, Zhen and Dai, Qi and Hu, Han and Wu, Zuxuan and Jiang, Yu-Gang},
booktitle={CVPR},
year={2024}
}
- This code builds on diffusers and Tune-A-Video. Thanks for open-sourcing!