Diffusion Models

From David's Wiki

Background

Diffusion models were introduced by Sohl-Dickstein et al.[1].

The goal is to define a mapping from a complex distribution \(\displaystyle q(\mathbf{x}^{(0)})\) (e.g. the distribution of realistic images) to a simple distribution \(\displaystyle \pi(\mathbf{y})=p(\mathbf{x}^{(T)})\) (e.g. a multivariate normal).
This is done by defining a forward trajectory \(\displaystyle q(\mathbf{x}^{(0 \ldots T)})\) and optimizing a reverse trajectory \(\displaystyle p(\mathbf{x}^{(0 \ldots T)})\).
The forward trajectory repeatedly applies a Markov diffusion kernel (i.e. a transition kernel whose stationary distribution is \(\displaystyle \pi(\mathbf{y})\)) for T steps.
The reverse trajectory again applies a diffusion kernel at each step, but with a learned mean and variance.
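
The forward trajectory can be illustrated with a small sketch. It assumes a scalar variable and a Gaussian kernel with a constant step size `beta`, which is only one possible choice of kernel; the function names are made up for this example. Chains that start far from the origin end up approximately distributed as the stationary distribution, here a standard normal.

```python
import math
import random

def diffusion_step(x, beta, rng):
    """One application of the Gaussian kernel N(sqrt(1 - beta) * x, beta)."""
    return math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)

def forward_trajectory(x0, T, beta, rng):
    """Return [x^(0), ..., x^(T)] for a scalar starting point x0."""
    xs = [x0]
    for _ in range(T):
        xs.append(diffusion_step(xs[-1], beta, rng))
    return xs

rng = random.Random(0)
# Start every chain at 5.0 and keep only the endpoint x^(T).
finals = [forward_trajectory(5.0, 200, 0.05, rng)[-1] for _ in range(2000)]
mean = sum(finals) / len(finals)
var = sum((x - mean) ** 2 for x in finals) / len(finals)
```

After 200 steps `mean` should be close to 0 and `var` close to 1, regardless of the starting point.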

Image Generation

DDPM

See DDPM paper

Here, the diffusion process is modeled as:

  • Forward: \(\displaystyle q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = N(\mathbf{x}_t; \sqrt{1-\beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I})\)
  • Reverse: \(\displaystyle p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = N(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \beta_t \mathbf{I})\)

The forward diffusion can be sampled in closed form for any \(\displaystyle t\) using:
\(\displaystyle \mathbf{x}_{t} = \sqrt{\bar\alpha_t} \mathbf{x}_0 + \sqrt{1-\bar\alpha_t} \boldsymbol{\epsilon}\) where \(\displaystyle \boldsymbol{\epsilon} \sim N(\mathbf{0}, \mathbf{I})\) and \(\displaystyle \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)\)
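
A minimal sketch of this closed-form sampling in plain Python. The linear schedule from 1e-4 to 0.02 over 1000 steps is the one commonly used with DDPM, but it is an assumption here, as are the function names.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """beta_1, ..., beta_T spaced linearly (index 0 holds beta_1)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t given x_0 in one shot; t is 1-indexed. Returns (x_t, eps)."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

betas = linear_beta_schedule(1000)
abar = alpha_bars(betas)
rng = random.Random(0)
x_mid, eps_mid = q_sample([1.0, -2.0, 0.5], 500, abar, rng)
```

Because \(\displaystyle \bar\alpha_T\) is close to 0 under this schedule, \(\displaystyle \mathbf{x}_T\) is almost pure noise, and no loop over intermediate steps is needed to reach any \(\displaystyle t\).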

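The loss function compares the predicted mean \(\displaystyle \mu_\theta(x_t, t)\) with the mean of the forward-process posterior \(\displaystyle q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\).
If we parameterize \(\displaystyle \mu_\theta(x_t, t)\) as \(\displaystyle \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \boldsymbol{\epsilon}_\theta (\mathbf{x}_t, t) \right)\) with \(\displaystyle \alpha_t = 1-\beta_t\), then the loss function simplifies to:
\(\displaystyle E \left[ \frac{\beta^2_t}{2\sigma^2_t \alpha_t (1-\bar\alpha_t)} \Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta( \sqrt{\bar\alpha_t} \mathbf{x}_0 + \sqrt{1-\bar\alpha_t} \boldsymbol{\epsilon}, t) \Vert^2 \right]\) where \(\displaystyle \sigma^2_t\) is the variance of the reverse step (\(\displaystyle \sigma^2_t = \beta_t\) in the model above).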
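
As a rough illustration, one Monte Carlo sample of this loss can be computed as follows. This is a sketch, not the DDPM reference code: `eps_model` is a hypothetical stand-in for the network \(\displaystyle \boldsymbol{\epsilon}_\theta\), the reverse variance is assumed to be \(\displaystyle \sigma^2_t = \beta_t\), and the linear schedule is an assumption.

```python
import math
import random

def weighted_noise_loss(x0, t, betas, eps_model, rng):
    """One Monte Carlo sample of the weighted loss at step t (1-indexed).

    eps_model(x_t, t) is a hypothetical noise predictor that returns a
    list with the same length as x_t.
    """
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = math.prod(1.0 - b for b in betas[:t])
    sigma2_t = beta_t  # assumption: reverse variance fixed to beta_t

    # Draw the noise and form x_t in closed form.
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e
           for x, e in zip(x0, eps)]

    # Squared error between the true and the predicted noise.
    eps_hat = eps_model(x_t, t)
    sq_err = sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))
    weight = beta_t ** 2 / (2.0 * sigma2_t * alpha_t * (1.0 - abar_t))
    return weight * sq_err

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
zero_model = lambda x_t, t: [0.0] * len(x_t)  # untrained stand-in
loss = weighted_noise_loss([1.0, -2.0, 0.5], 500, betas, zero_model,
                           random.Random(0))
```

In training, \(\displaystyle t\) and \(\displaystyle \mathbf{x}_0\) would be drawn at random and the result averaged over a batch. The DDPM paper reports that dropping the weight in front of the squared error works better in practice.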

Super-resolution and other Image-to-image generation

See SR3 iterative refinement
Here \(\displaystyle \mathbf{y}\) denotes the sequence of progressively noised high-resolution images, and the model is conditioned on an extra input \(\displaystyle \mathbf{x}\), the low-resolution image.
As in DDPM, the neural network \(\displaystyle f_{\theta}(\mathbf{x}, \mathbf{y}, \gamma)\) is trained to predict the added noise, where \(\displaystyle \gamma\) is the noise level.

An unofficial PyTorch implementation of SR3 is available at https://github.com/Janspiry/Image-Super-Resolution-via-Iterative-Refinement.

In addition to SR3, researchers at Google have released Palette, which uses the same ideas for other image-to-image tasks such as colorization, uncropping, and inpainting. A single model can perform all of these tasks.

Text-to-image

OpenAI have unveiled two text-to-image models, GLIDE and DALL-E 2, which rely on diffusion models to generate images.
A small version of GLIDE is available as open-source code for testing.

At a high level, GLIDE is a diffusion model which is conditioned on text embeddings and trained with a technique called classifier-free guidance.
DALL-E 2 adds a prior model which first converts a text embedding to a CLIP image embedding. Then the diffusion decoder generates an image based on the image embedding.

Guided Diffusion

Guidance is a method used to push the diffusion process towards the input condition, e.g. the text input.
There are two types of guidance: classifier guidance and classifier-free guidance.
See https://benanne.github.io/2022/05/26/guidance.html.

Classifier guidance uses the gradient of an image classifier (e.g. CLIP) to push the noisy intermediate images towards the desired class.
Classifier-free guidance[1] runs the diffusion model twice, predicting the noise with and without the class input, and extrapolates away from the prediction made without the class input.
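
The extrapolation step of classifier-free guidance can be sketched as below. The function name and the `guidance_scale` convention are assumptions: here a scale of 1 returns the conditional prediction unchanged, while some papers parameterize the same idea as \(\displaystyle (1+w)\boldsymbol{\epsilon}_{cond} - w\boldsymbol{\epsilon}_{uncond}\).

```python
def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Extrapolate from the unconditional towards the conditional prediction.

    eps_cond and eps_uncond are the noise predictions with and without
    the condition, given as lists of equal length.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Scale 0 ignores the condition, scale 1 is plain conditional sampling,
# and larger scales push further in the direction of the condition.
guided = guided_noise([1.0, 2.0], [0.0, 1.0], 3.0)
```

The guided prediction then replaces the noise prediction in each reverse step.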

Inversion

See Diffusion Models Beat GANs on Image Synthesis.
Inversion of a diffusion model can be done by using DDIM for the reverse process.
DDIM samples with a variance of 0, which makes the reverse process (latent to image) deterministic, so the same update can be run in the opposite direction to map an image back to a latent.
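
A minimal sketch of the deterministic DDIM update and its use for inversion. The helper names, the toy schedule `abar` and the constant `const_model` are all made up for illustration. With a real network the inversion is only approximate, because the noise for the next step has to be estimated from the current sample.

```python
import math

def ddim_move(x, abar_from, abar_to, eps_hat):
    """Deterministic DDIM update between two noise levels (variance 0)."""
    out = []
    for xi, e in zip(x, eps_hat):
        # Predict the clean sample, then re-noise it to the target level.
        x0_pred = (xi - math.sqrt(1.0 - abar_from) * e) / math.sqrt(abar_from)
        out.append(math.sqrt(abar_to) * x0_pred + math.sqrt(1.0 - abar_to) * e)
    return out

def ddim_sample(x_T, abar, eps_model):
    """Latent to image: walk t = T, ..., 1 (abar[t] is alpha_bar_t, abar[0] = 1)."""
    x = list(x_T)
    for t in range(len(abar) - 1, 0, -1):
        x = ddim_move(x, abar[t], abar[t - 1], eps_model(x, t))
    return x

def ddim_invert(x_0, abar, eps_model):
    """Image to latent: the same update run from t to t + 1."""
    x = list(x_0)
    for t in range(len(abar) - 1):
        # The noise at step t + 1 is approximated from the current sample.
        x = ddim_move(x, abar[t], abar[t + 1], eps_model(x, t + 1))
    return x

abar = [1.0, 0.9, 0.6, 0.3, 0.05]        # toy schedule with T = 4
const_model = lambda x, t: [0.3, -0.7]   # stand-in for a trained network
image = [1.0, -2.0]
latent = ddim_invert(image, abar, const_model)
recon = ddim_sample(latent, abar, const_model)
```

With the constant stand-in model the round trip from `image` to `latent` and back reproduces `image` up to floating-point error.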

Resources

References

  1. Ho, J., & Salimans, T. (2022). Classifier-Free Diffusion Guidance. doi:10.48550/ARXIV.2207.12598 https://arxiv.org/abs/2207.12598