Stable Diffusion

Notes on the different versions of Stable Diffusion from what I can find online.

Stable Diffusion 1

Stable Diffusion consists of three main components:

CLIP text encoder
VAE
U-Net latent diffusion model

The main difference between Stable Diffusion and pixel-space diffusion models is that the diffusion operations happen in a low-resolution latent space. For a 512x512 image, the latent is only 64x64, a factor of 8 smaller along each side. This significantly reduces the compute resources required.
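The arithmetic above can be sketched in a few lines. This is only an illustration: the function names are made up for these notes, and the 4 latent channels are an assumption based on the SD 1.x VAE rather than something stated above.

```python
def latent_shape(height, width, downscale=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size.

    downscale=8 is the factor from the notes above; latent_channels=4 is an
    assumption about the SD 1.x VAE.
    """
    if height % downscale or width % downscale:
        raise ValueError("image size must be divisible by the downscale factor")
    return (latent_channels, height // downscale, width // downscale)


def spatial_reduction(height, width, downscale=8):
    """Return how many times fewer spatial positions the latent has than the image."""
    _, latent_h, latent_w = latent_shape(height, width, downscale)
    return (height * width) // (latent_h * latent_w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(spatial_reduction(512, 512))  # 64
```

A factor of 8 per side means the U-Net sees 64 times fewer spatial positions than a model working directly on pixels.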

Architecture

U-Net

See U-Net for Stable Diffusion and Transformer for Stable Diffusion U-Net

At a high level, Stable Diffusion uses a U-Net with 4 down blocks, one mid block, and 4 up blocks. Note that the last down block and the mid block do not change the resolution.
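To make the resolution changes concrete, here is a rough sketch that traces the spatial size through the down path and mid block. It assumes a 64x64 input latent and that every down block except the last halves the resolution; the function name is hypothetical and no real library code is being reproduced.

```python
def trace_down_path(latent_size=64, num_down_blocks=4):
    """Return (down_block_output_sizes, mid_block_output_size).

    Assumption: each down block halves the resolution, except the last one,
    which keeps it. The mid block also keeps the resolution unchanged.
    """
    sizes = []
    size = latent_size
    for index in range(num_down_blocks):
        is_last = index == num_down_blocks - 1
        if not is_last:
            size //= 2  # downsample at the end of this block
        sizes.append(size)
    mid_size = size  # mid block does not change the resolution
    return sizes, mid_size


down_sizes, mid_size = trace_down_path()
print(down_sizes)  # [32, 16, 8, 8]
print(mid_size)    # 8
```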

1.x

Stable Diffusion 2

Release Blog Post

Stable Diffusion 2 replaces the CLIP text encoder with OpenCLIP, an open-source retraining of CLIP on publicly available LAION data. The diffusion model itself is trained on a subset of LAION-5B with NSFW images filtered out. By default, the released models generate 512x512 and 768x768 images.

In addition, the SD2 release includes the following:

Super-resolution model
Depth to image model
Inpainting model

2.1

Stable Diffusion XL

Stable Diffusion XL is a larger model trained on 1024x1024 images.

Stable Diffusion (XL) Turbo

Blog post, ADD paper

Released in November 2023, SD-Turbo and SDXL-Turbo are fine-tuned versions of SD2 and SDXL, respectively, trained using adversarial diffusion distillation (ADD).

ADD fine-tunes the model with an adversarial loss (as in GANs) and a score distillation loss (as in DreamFusion) so that the model produces a complete image at every iteration. This allows SD-Turbo to produce realistic images in a single iteration while preserving the ability to continue refining them with additional diffusion iterations.
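As a loose sketch of how the two terms combine, the student's objective can be thought of as an adversarial term plus a weighted distillation term. Everything below is a simplification for these notes: the hinge-style generator loss, the plain mean squared error against a teacher prediction, the function names, and the default weight are all assumptions, not the paper's exact formulation.

```python
def adversarial_generator_loss(discriminator_scores):
    """Hinge-style generator loss: lower when the discriminator scores
    the student's generated images higher (more 'real')."""
    return -sum(discriminator_scores) / len(discriminator_scores)


def distillation_loss(student_output, teacher_output):
    """Mean squared error between the student's image and the teacher's
    prediction, both given as flat lists of floats."""
    if len(student_output) != len(teacher_output):
        raise ValueError("student and teacher outputs must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_output, teacher_output)]
    return sum(squared) / len(squared)


def add_loss(discriminator_scores, student_output, teacher_output,
             distill_weight=1.0):
    """Total student loss: adversarial term + distill_weight * distillation term.

    distill_weight=1.0 is a placeholder, not a value taken from the paper.
    """
    adversarial = adversarial_generator_loss(discriminator_scores)
    distill = distillation_loss(student_output, teacher_output)
    return adversarial + distill_weight * distill


# Toy numbers only: adversarial = -1.0, distillation = 2.0, total = 0.0
print(add_loss([0.5, 1.5], [1.0, 2.0], [1.0, 4.0], distill_weight=0.5))
```

In a real training loop both terms would be computed on image tensors, with a trained discriminator and a frozen diffusion model acting as the teacher.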

Stable Cascade

Release blog post

Stable Cascade introduces a latent generator

Stable Diffusion 3

Stable Diffusion 3 replaces the diffusion U-Net with a diffusion transformer (DiT).