Stable Diffusion

From David's Wiki
Latest revision as of 15:12, 15 March 2024

Notes on the different versions of Stable Diffusion from what I can find online.

Stable Diffusion 1

Stable Diffusion consists of three main components:

  • CLIP text encoder
  • VAE
  • UNet latent diffusion model

The main difference between Stable Diffusion and pixel-space diffusion models is that the diffusion process runs in a low-resolution latent space. For a 512x512 image, the latent is only 64x64, a factor of 8 smaller along each spatial dimension. This significantly reduces the compute resources required.
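The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper names are made up for this note, and the latent channel count of 4 is an assumption that is not stated on this page.

```python
# Sketch: latent size for a given image size, assuming the VAE shrinks
# each spatial dimension by a factor of 8 (512 -> 64, as stated above).
DOWNSAMPLE_FACTOR = 8  # from the text
LATENT_CHANNELS = 4    # assumption: not stated in these notes


def latent_shape(height, width, factor=DOWNSAMPLE_FACTOR, channels=LATENT_CHANNELS):
    """Return (channels, latent_height, latent_width) for an image size."""
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the downsample factor")
    return (channels, height // factor, width // factor)


def spatial_reduction(height, width, factor=DOWNSAMPLE_FACTOR):
    """Return how many image pixels map to one latent position."""
    _, latent_h, latent_w = latent_shape(height, width, factor)
    return (height * width) // (latent_h * latent_w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(spatial_reduction(512, 512))  # 64
```

A factor of 8 per side means 64 times fewer spatial positions for the diffusion model to process, which is where most of the compute saving comes from.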

Architecture

U-Net

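See U-Net for Stable Diffusion (https://nn.labml.ai/diffusion/stable_diffusion/model/unet.html) and Transformer for Stable Diffusion U-Net (https://nn.labml.ai/diffusion/stable_diffusion/model/unet_attention.html).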

At a high level, Stable Diffusion uses a U-Net with 4 down blocks, one mid block, and 4 up blocks. Note that the last down block and the mid block do not change the resolution.
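As a rough sketch of what this means for feature-map sizes, the snippet below walks a 64x64 latent through the down path. It assumes that every down block except the last halves the resolution; the function name is hypothetical and channel counts are ignored.

```python
def down_block_resolutions(latent_size=64, num_down_blocks=4):
    """Return the output resolution of each down block (hypothetical helper).

    Assumption: every down block except the last halves the resolution;
    the last one keeps it unchanged, as noted in the text.
    """
    resolutions = []
    size = latent_size
    for index in range(num_down_blocks):
        is_last = index == num_down_blocks - 1
        if not is_last:
            size //= 2  # downsample by 2
        resolutions.append(size)
    return resolutions


down = down_block_resolutions(64)
print(down)      # [32, 16, 8, 8]
print(down[-1])  # 8, the resolution the mid block operates at
```

Under these assumptions the mid block sees an 8x8 feature map, and the up blocks mirror the down path back to 64x64.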

1.x

Stable Diffusion 2

Release Blog Post

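Stable Diffusion 2 replaces the CLIP text encoder with OpenCLIP, an open-source retraining of CLIP. The diffusion models are trained on a subset of the publicly available LAION-5B dataset with NSFW images filtered out. The released text-to-image models generate images at default resolutions of 512x512 and 768x768.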

In addition, the SD2 release includes the following:

  • Super-resolution model
  • Depth to image model
  • Inpainting model

2.1

Stable Diffusion XL

Stable Diffusion XL is a larger model trained on 1024x1024 images.

Stable Diffusion (XL) Turbo

Blog post (https://stability.ai/news/stability-ai-sdxl-turbo), ADD paper (https://arxiv.org/abs/2311.17042)

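Released in November 2023, SD-Turbo (https://huggingface.co/stabilityai/sd-turbo) and SDXL-Turbo (https://huggingface.co/stabilityai/sdxl-turbo) are fine-tuned versions of SD2 and SDXL trained using adversarial diffusion distillation (ADD).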

ADD fine-tunes the model with an adversarial loss (from GANs) and a score distillation loss (from DreamFusion), so that the model produces a complete image at every iteration. This allows SD-Turbo to produce realistic images in a single iteration while preserving the ability to continue refining the images with additional diffusion iterations.
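How the two loss terms might be combined can be sketched as below. This is a toy illustration on plain Python lists, not the training code from the paper: the hinge-style adversarial term, the mean-squared-error distillation term, the weight `lam`, and all function names are assumptions made for this example.

```python
def adversarial_generator_loss(disc_scores):
    """Hinge-style generator term: push discriminator scores on student
    samples up (assumed form, not taken from these notes)."""
    return -sum(disc_scores) / len(disc_scores)


def distillation_loss(student_pred, teacher_pred):
    """Mean squared error between the student output and the teacher's
    denoised estimate (assumed distance measure)."""
    squared = [(s - t) ** 2 for s, t in zip(student_pred, teacher_pred)]
    return sum(squared) / len(squared)


def add_student_loss(disc_scores, student_pred, teacher_pred, lam=1.0):
    """Adversarial term plus a weighted distillation term.
    `lam` is a hypothetical weighting hyperparameter."""
    return adversarial_generator_loss(disc_scores) + lam * distillation_loss(
        student_pred, teacher_pred
    )


# Toy numbers: the adversarial term is -2.0 and the distillation term is 2.0
loss = add_student_loss([1.0, 3.0], [0.0, 2.0], [0.0, 0.0], lam=0.5)
print(loss)  # -1.0
```

The adversarial term rewards the student for images the discriminator rates as real, while the distillation term keeps the student close to what the teacher diffusion model would predict.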

Stable Cascade

Release blog post

Stable Cascade introduces a latent generator

Stable Diffusion 3

Stable Diffusion 3 replaces the diffusion UNet with a diffusion transformer (DiT).

See Also