Diffusion Models: Difference between revisions

Revision as of 14:12, 7 April 2022

Background

By Sohl-Dickstein et al.[1].

The goal is to define a mapping between a complex distribution \(\displaystyle q(\mathbf{x}^{(0)})\) (e.g. set of realistic images) to a simple distribution \(\displaystyle \pi(\mathbf{y})=p(\mathbf{x}^{(T)})\)(e.g. multivariate normal).
This is done by defining a forward trajectory \(\displaystyle q(\mathbf{x}^{(0...T)})\) and optimizing a reverse trajectory \(\displaystyle p(\mathbf{x}^{(0 ... T)})\).
The forward trajectory is repeatedly applying a Markov diffusion kernel (i.e. a function with a steady distribution \(\displaystyle \pi(\mathbf{y})\)), performing T steps of diffusion.
The reverse trajectory is again applying a diffusion kernel but with an estimated mean and variance.

Image Generation

DDPM

See DDPM paper

Here, the diffusion process is modeled as:

Forward: \(\displaystyle q(\mathbf{x}_t, \mathbf{x}_{t-1}) \sim N(\sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})\)
Reverse: \(\displaystyle p_\theta(\mathbf{x}_{t-1}, \mathbf{t}) \sim N( \mu_\theta (x_t, t), \beta_t \mathbf{I})\)

The forward diffusion can be sampled for any \(\displaystyle t\) using:
\(\displaystyle \mathbf{x}_{t} = \sqrt{\bar\alpha_t} \mathbf{x}_0 - \sqrt{1-\bar\alpha_t} \boldsymbol{\epsilon}\) where \(\displaystyle \bar\alpha_t = \prod_{s=1}^{t}(1-\beta{s})\)

The loss function is based on the mean of the posterior.
If we estimate \(\displaystyle \mu_\theta(x_t, t)\) as \(\displaystyle \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \boldsymbol{\epsilon}_\theta (\mathbf{x}_t, t) \right)\), then the loss function simplifies to:
\(\displaystyle E \left[ \frac{\beta^2_t}{2\sigma^2_t \alpha (1-\bar\alpha_t)} \Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta( \sqrt{\bar\alpha_t} \mathbf{x}_0 - \sqrt{1-\bar\alpha_t} \boldsymbol{\epsilon}, t) \Vert^2 \right]\)

Super-resolution and other Image-to-image generation

See SR3 iterative refinement
Here we use \(\displaystyle \mathbf{y}\) to represent the sequence of priors and we condition on an extra input \(\displaystyle \mathbf{x}\) which is the low-resolution image.
The neural network \(\displaystyle f_{\theta}(\mathbf{x}, \mathbf{y}, \gamma)\) continues to predict the added noise during training the reverse process.

An unofficial PyTorch implementation of SR3 is available at https://github.com/Janspiry/Image-Super-Resolution-via-Iterative-Refinement.

In addition to SR3, the researchers at Google have also unveiled Palette which utilizes the same ideas to perform additional image operations such as colorization, uncropping, and inpainting. These tasks can be performed with a single model.

Text-to-image

OpenAI have unveiled two text-to-image models, GLIDE and DALL-E 2, which rely on diffusion models to generate images.
GLIDE has some open-source code which allows you to test a small version.

Resources

Google AI Blog High Fidelity Image Generation Using Diffusion Models - discusses SR3 and CDM

@@ Line 22: / Line 22: @@
 <math>E \left[ \frac{\beta^2_t}{2\sigma^2_t \alpha (1-\bar\alpha_t)}  \Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta( \sqrt{\bar\alpha_t} \mathbf{x}_0 - \sqrt{1-\bar\alpha_t} \boldsymbol{\epsilon}, t) \Vert^2 \right]</math>
-===Super-resolution===
+===Super-resolution and other Image-to-image generation===
 See [https://iterative-refinement.github.io/ SR3 iterative refinement]<br>
-Here we use <math>\mathbf{y}</math> to represent the sequence of priors and we condition on an extra input <math>\mathbf{x}</math> which is the low-resolution image.
+Here we use <math>\mathbf{y}</math> to represent the sequence of priors and we condition on an extra input <math>\mathbf{x}</math> which is the low-resolution image.<br>
+The neural network <math>f_{\theta}(\mathbf{x}, \mathbf{y}, \gamma)</math> continues to predict the added noise during training the reverse process.
+An unofficial PyTorch implementation of SR3 is available at [https://github.com/Janspiry/Image-Super-Resolution-via-Iterative-Refinement https://github.com/Janspiry/Image-Super-Resolution-via-Iterative-Refinement].
+In addition to SR3, the researchers at Google have also unveiled [https://iterative-refinement.github.io/palette/ Palette] which utilizes the same ideas to perform additional image operations such as colorization, uncropping, and inpainting. These tasks can be performed with a single model.
+===Text-to-image===
+OpenAI have unveiled two text-to-image models, [https://github.com/openai/glide-text2im GLIDE] and [https://openai.com/dall-e-2/ DALL-E 2], which rely on diffusion models to generate images.<br>
+GLIDE has some open-source code which allows you to test a small version.
 ==Resources==
 * [https://ai.googleblog.com/2021/07/high-fidelity-image-generation-using.html Google AI Blog High Fidelity Image Generation Using Diffusion Models] - discusses SR3 and CDM