
Notes on the different versions of Stable Diffusion from what I can find online.

Stable Diffusion 1

Stable Diffusion consists of three main components:

  • CLIP text encoder
  • VAE
  • UNet latent diffusion model
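As a rough sketch of how these three components show up in practice, the Hugging Face `diffusers` library exposes them as attributes of its `StableDiffusionPipeline`. The use of `diffusers`, the model ID, and the helper name `load_components` are my assumptions for illustration and are not part of the original notes.

```python
def load_components(model_id="CompVis/stable-diffusion-v1-4"):
    """Load a Stable Diffusion 1.x pipeline and return its three main parts.

    The model ID is an assumed example of a 1.x checkpoint; any compatible
    checkpoint should work the same way.
    """
    # Imported lazily so this file can be read without diffusers installed.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    return {
        "text_encoder": pipe.text_encoder,  # CLIP text encoder
        "vae": pipe.vae,                    # maps between images and latents
        "unet": pipe.unet,                  # denoiser that operates on latents
    }
```

Calling `load_components()` downloads the model weights on first use, so it is best run once and the result kept around.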

The main difference between Stable Diffusion and pixel-space diffusion models is that the diffusion operations happen in a low-resolution latent space. For a 512x512 image, the latent is only 64x64, a factor of 8 smaller along each side. This significantly reduces the compute resources necessary.
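To make the size reduction concrete, here is a small sketch of the shape arithmetic. The factor of 8 comes from the text above; the 4 latent channels and the helper names `latent_shape` and `element_ratio` are assumptions added for illustration.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def element_ratio(height, width, image_channels=3, **kwargs):
    """Return how many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (image_channels * height * width) / (c * h * w)


print(latent_shape(512, 512))   # (4, 64, 64)
print(element_ratio(512, 512))  # 48.0
```

Under these assumptions the UNet works on about 48 times fewer values than it would in pixel space, even though each side is only 8 times smaller.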

Architecture

See U-Net for Stable Diffusion and Transformer for Stable Diffusion U-Net

1.x

Stable Diffusion 2

Release Blog Post

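Stable Diffusion 2 replaces the CLIP text encoder with OpenCLIP, an open-source retraining of CLIP on publicly available data from the LAION-5B dataset. The diffusion model itself is trained on a subset of LAION-5B with NSFW images filtered out. The release includes text-to-image models with default resolutions of both 512x512 and 768x768.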

In addition, the SD2 release includes the following:

  • Super-resolution model
  • Depth to image model
  • Inpainting model

2.1

Stable Diffusion XL

Stable Diffusion XL is a larger model trained on 1024x1024 images.

Stable Diffusion (XL) Turbo

Blog post ADD Paper

Released Nov 2023, SD-Turbo and SDXL-Turbo are fine-tuned versions of SD2 and SDXL trained using adversarial diffusion distillation (ADD).

ADD fine-tunes the model using an adversarial loss (from GANs) and a score distillation loss (from DreamFusion) so that the model produces a complete image at every iteration. This allows SD-Turbo to produce realistic images in a single iteration while preserving the ability to continue refining the images with additional diffusion iterations.
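The two losses are combined into a single training objective for the student model. A sketch of its form, as I understand it from the ADD paper, is below. The symbols are assumptions on my part: \(\theta\) is the student being fine-tuned, \(\phi\) the discriminator, \(\psi\) the frozen teacher diffusion model, \(x_s\) a noised input at timestep \(s\), and \(\lambda\) a weighting factor.

```latex
\[
\mathcal{L}
  = \mathcal{L}_{\mathrm{adv}}^{G}\big(\hat{x}_\theta(x_s, s), \phi\big)
  + \lambda \, \mathcal{L}_{\mathrm{distill}}\big(\hat{x}_\theta(x_s, s), \psi\big)
\]
```

The adversarial term pushes each single-step output \(\hat{x}_\theta(x_s, s)\) toward looking like a real image, while the distillation term keeps it consistent with what the teacher would produce.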

Stable Cascade

Release blog post

Stable Cascade introduces a latent generator

Stable Diffusion 3

Stable Diffusion 3 replaces the diffusion UNet with a diffusion transformer (DiT).

See Also