Stable Diffusion
Revision as of 16:11, 8 March 2024
Notes on the different versions of Stable Diffusion from what I can find online.
Stable Diffusion 1
Stable Diffusion consists of three main components:
- CLIP text encoder
- VAE
- UNet latent diffusion model
The main difference between Stable Diffusion and other diffusion models is that the diffusion operations happen in a low-resolution latent space. For a 512x512 image, the latent may be only 64x64, which is 8 times smaller along each side (64 times fewer spatial positions). This significantly reduces the compute resources required.
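The size reduction can be checked with a little arithmetic. The sketch below is only illustrative: the function names are made up for these notes, and the 4-channel latent is an assumption about the SD1 VAE rather than something stated above.

```python
def latent_shape(height, width, downscale=8, channels=4):
    """Return the (channels, height, width) of the latent for an image.

    downscale=8 matches the 512 -> 64 example in the text;
    channels=4 is an assumption about the SD1 VAE latent.
    """
    if height % downscale or width % downscale:
        raise ValueError("image size must be a multiple of the downscale factor")
    return (channels, height // downscale, width // downscale)


def compression_ratio(height, width, downscale=8, channels=4, image_channels=3):
    """Ratio of values in the RGB image to values in its latent."""
    c, h, w = latent_shape(height, width, downscale, channels)
    return (image_channels * height * width) / (c * h * w)


if __name__ == "__main__":
    print(latent_shape(512, 512))        # (4, 64, 64)
    print(compression_ratio(512, 512))   # 48.0
```

Under these assumptions the UNet works on about 48 times fewer values than it would in pixel space.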
1.x
Stable Diffusion 2
Stable Diffusion 2 replaces the CLIP text encoder with OpenCLIP, an open retraining of CLIP on the publicly available LAION data. The SD2 models themselves are trained on a subset of LAION-5B with NSFW images filtered out. The released text-to-image models generate images at default resolutions of 512x512 and 768x768.
In addition, the SD2 release includes the following models:
- Super-resolution model
- Depth to image model
- Inpainting model
2.1
Stable Diffusion XL
Stable Diffusion XL is a larger model trained on 1024x1024 images.
Stable Diffusion (XL) Turbo
Blog post: https://stability.ai/news/stability-ai-sdxl-turbo, ADD paper: https://arxiv.org/abs/2311.17042
Released in November 2023, SD-Turbo (https://huggingface.co/stabilityai/sd-turbo) and SDXL-Turbo (https://huggingface.co/stabilityai/sdxl-turbo) are fine-tuned versions of SD2 and SDXL trained using adversarial diffusion distillation (ADD).
ADD fine-tunes the model with an adversarial loss (from GANs) and a score distillation loss (from DreamFusion) so that the model produces a complete image at every iteration. This allows SD-Turbo to produce realistic images in a single iteration while preserving the ability to continue refining the images with additional diffusion iterations.
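As a rough illustration of how the two loss terms combine, here is a toy sketch of the student (generator) objective. It is not the training code from the paper: the function names, the weighting parameter `lam`, and the use of plain lists of numbers in place of images, discriminator outputs, and teacher predictions are all assumptions made for this example.

```python
def generator_adversarial_loss(disc_scores):
    """Adversarial term for the student: it is rewarded when the
    discriminator assigns high scores to its generated images."""
    return -sum(disc_scores) / len(disc_scores)


def distillation_loss(student_pred, teacher_pred):
    """Distillation term: mean squared distance between the student's
    image and the frozen teacher's denoised prediction (toy stand-in)."""
    n = len(student_pred)
    return sum((s - t) ** 2 for s, t in zip(student_pred, teacher_pred)) / n


def add_student_loss(disc_scores, student_pred, teacher_pred, lam=1.0):
    """Combined objective: adversarial term plus lam times the
    distillation term. lam is a hypothetical weighting parameter."""
    adv = generator_adversarial_loss(disc_scores)
    dist = distillation_loss(student_pred, teacher_pred)
    return adv + lam * dist


if __name__ == "__main__":
    loss = add_student_loss(
        disc_scores=[0.5, 1.5],
        student_pred=[1.0, 2.0],
        teacher_pred=[1.0, 4.0],
        lam=0.5,
    )
    print(loss)
```

The adversarial term pushes single-step outputs toward realistic images, while the distillation term keeps the student close to what the original multi-step model would produce.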
Stable Cascade
Release blog post
Stable Cascade introduces a latent generator.
Stable Diffusion 3
Stable Diffusion 3 replaces the diffusion UNet with a diffusion transformer (DiT).