Diffusion Models
Revision as of 15:42, 7 April 2022
Background
Introduced by Sohl-Dickstein et al.[1].
The goal is to define a mapping from a complex distribution \(\displaystyle q(\mathbf{x}^{(0)})\) (e.g. the set of realistic images) to a simple distribution \(\displaystyle \pi(\mathbf{y})=p(\mathbf{x}^{(T)})\) (e.g. a multivariate normal).
This is done by defining a forward trajectory \(\displaystyle q(\mathbf{x}^{(0...T)})\) and optimizing a reverse trajectory \(\displaystyle p(\mathbf{x}^{(0 ... T)})\).
The forward trajectory repeatedly applies a Markov diffusion kernel (i.e. a transition kernel whose stationary distribution is \(\displaystyle \pi(\mathbf{y})\)) for T steps of diffusion.
The reverse trajectory applies a diffusion kernel of the same form, but with a learned mean and variance.
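As a rough illustration of why the forward trajectory ends at the simple distribution, the sketch below tracks only the mean and variance of a scalar under a Gaussian kernel of the form \(\displaystyle x^{(t)} = \sqrt{1-\beta}\, x^{(t-1)} + \sqrt{\beta}\, \epsilon\). The function name `diffuse_moments`, the constant `beta`, and the starting moments are assumptions made for this example, not part of the original paper.

```python
import math

def diffuse_moments(mean, var, beta, steps):
    """Propagate the mean and variance of x through `steps` applications of
    the kernel x <- sqrt(1 - beta) * x + sqrt(beta) * noise, noise ~ N(0, 1)."""
    for _ in range(steps):
        mean = math.sqrt(1.0 - beta) * mean  # the mean shrinks towards 0
        var = (1.0 - beta) * var + beta      # the variance is pulled towards 1
    return mean, var

# Hypothetical starting distribution: mean 5, variance 0.01.
mean_T, var_T = diffuse_moments(mean=5.0, var=0.01, beta=0.02, steps=1000)
print(mean_T, var_T)
```

Whatever the starting moments, they approach 0 and 1, the moments of the standard normal, which is the stationary distribution of this kernel.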
Image Generation
DDPM
See DDPM paper
Here, the diffusion process is modeled as:
- Forward: \(\displaystyle q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = N(\mathbf{x}_t; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})\)
- Reverse: \(\displaystyle p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = N(\mathbf{x}_{t-1}; \mu_\theta (\mathbf{x}_t, t), \sigma^2_t \mathbf{I})\) with \(\displaystyle \sigma^2_t = \beta_t\)
The forward diffusion can be sampled directly at any step \(\displaystyle t\) using:
\(\displaystyle \mathbf{x}_{t} = \sqrt{\bar\alpha_t} \mathbf{x}_0 + \sqrt{1-\bar\alpha_t} \boldsymbol{\epsilon}\) where \(\displaystyle \boldsymbol{\epsilon} \sim N(\mathbf{0}, \mathbf{I})\), \(\displaystyle \alpha_t = 1-\beta_t\), and \(\displaystyle \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)\)
The loss function compares the predicted mean with the mean of the forward-process posterior \(\displaystyle q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\).
If we parameterize \(\displaystyle \mu_\theta(\mathbf{x}_t, t)\) as \(\displaystyle \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \boldsymbol{\epsilon}_\theta (\mathbf{x}_t, t) \right)\), then the loss function simplifies to:
\(\displaystyle E \left[ \frac{\beta^2_t}{2\sigma^2_t \alpha_t (1-\bar\alpha_t)} \Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta( \sqrt{\bar\alpha_t} \mathbf{x}_0 + \sqrt{1-\bar\alpha_t} \boldsymbol{\epsilon}, t) \Vert^2 \right]\)
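The closed-form forward sample and the weighted noise-prediction loss can be sketched in a few lines. This is a minimal illustration on scalars rather than images, using only the Python standard library. The linear schedule from 1e-4 to 0.02 over 1000 steps, the choice \(\displaystyle \sigma^2_t = \beta_t\), the function names, and the `oracle` predictor (which cheats by knowing \(\displaystyle \mathbf{x}_0\)) are all assumptions made for the example; a real model would replace `oracle` with a trained network.

```python
import math
import random

def cumulative_alphas(betas):
    """Return [abar_1, ..., abar_T] with abar_t = prod_{s<=t} (1 - beta_s)."""
    abars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        abars.append(prod)
    return abars

def q_sample(x0, abar_t, eps):
    """Closed-form forward sample: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

def loss_weight(beta_t, abar_t, sigma2_t):
    """Weight beta_t^2 / (2 sigma_t^2 alpha_t (1 - abar_t)) on the squared error."""
    alpha_t = 1.0 - beta_t
    return beta_t ** 2 / (2.0 * sigma2_t * alpha_t * (1.0 - abar_t))

def weighted_noise_loss(eps_model, x0, t, betas, abars, rng):
    """One-sample estimate of the loss at step t (1-indexed)."""
    eps = rng.gauss(0.0, 1.0)                 # the noise the model must recover
    x_t = q_sample(x0, abars[t - 1], eps)
    weight = loss_weight(betas[t - 1], abars[t - 1], sigma2_t=betas[t - 1])
    return weight * (eps - eps_model(x_t, t)) ** 2

# Assumed linear noise schedule with T = 1000 steps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abars = cumulative_alphas(betas)

x0 = 0.7  # hypothetical scalar standing in for an image

def oracle(x_t, t):
    """Toy predictor that inverts q_sample exactly because it knows x0."""
    a = abars[t - 1]
    return (x_t - math.sqrt(a) * x0) / math.sqrt(1.0 - a)

print(weighted_noise_loss(oracle, x0, 500, betas, abars, random.Random(0)))
```

The oracle recovers the noise up to floating-point error, so the printed loss is essentially zero, while a predictor that always returns 0 gives a strictly positive loss.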
Super-resolution and other Image-to-image generation
See SR3 iterative refinement
Here we use \(\displaystyle \mathbf{y}\) to represent the sequence of intermediate images in the diffusion process and we condition on an extra input \(\displaystyle \mathbf{x}\), which is the low-resolution image.
As in DDPM, the neural network \(\displaystyle f_{\theta}(\mathbf{x}, \mathbf{y}, \gamma)\) is trained to predict the added noise, where \(\displaystyle \gamma\) is the noise level.
An unofficial PyTorch implementation of SR3 is available at https://github.com/Janspiry/Image-Super-Resolution-via-Iterative-Refinement.
In addition to SR3, researchers at Google have also introduced Palette, which applies the same ideas to further image-to-image tasks such as colorization, uncropping, and inpainting. These tasks can be performed with a single model.
Text-to-image
OpenAI have unveiled two text-to-image models, GLIDE and DALL-E 2, which rely on diffusion models to generate images.
GLIDE has some open-source code which allows you to test a small version.
At a high level, GLIDE is a diffusion model which is conditioned on text embeddings and trained with a technique called classifier-free guidance.
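The core of classifier-free guidance can be sketched as follows. During training the caption is occasionally replaced by an empty one, so that one network learns both a conditional and an unconditional noise estimate. At sampling time the two estimates are combined by extrapolating from the unconditional one towards the conditional one. The function names, the list-of-floats representation, and the drop probability are assumptions for illustration and are not taken from the GLIDE code.

```python
import random

def maybe_drop_caption(caption, p_drop, rng):
    """Training side: replace the caption with an empty one with probability p_drop."""
    return "" if rng.random() < p_drop else caption

def guided_noise(eps_cond, eps_uncond, scale):
    """Sampling side: eps_uncond + scale * (eps_cond - eps_uncond), elementwise."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Hypothetical noise estimates for a two-pixel "image".
eps_c = [1.0, -2.0]   # network output given the caption
eps_u = [0.5, 0.0]    # network output given the empty caption
print(guided_noise(eps_c, eps_u, scale=3.0))
```

A scale of 1 reproduces the conditional estimate, a scale of 0 reproduces the unconditional one, and scales above 1 push samples further towards the caption.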
DALL-E 2 adds a prior model which first converts a text embedding to a CLIP image embedding.
Then the diffusion decoder generates an image based on the image embedding.
Inversion
See Diffusion Models Beat GANs on Image Synthesis (https://arxiv.org/abs/2105.05233).
Inversion of a diffusion model can be done by using DDIM for the reverse process.
DDIM samples with a variance of 0, which makes the reverse process (latent to image) deterministic; a deterministic process can also be run in the opposite direction to map an image back to its latent.
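A minimal sketch of the deterministic DDIM update and its use for inversion is given below, again on scalars with the standard library only. The function names, the short 50-step schedule, and the constant noise predictor `toy_eps_model` are assumptions made for this example. With a constant predictor the round trip is exact. With a trained network, inversion is only approximate, because the noise predicted at one step is reused to move to the neighbouring step.

```python
import math

def ddim_step(x, eps, abar_from, abar_to):
    """Deterministic (variance 0) DDIM update between two noise levels."""
    # Estimate the clean sample implied by x and the predicted noise ...
    x0_pred = (x - math.sqrt(1.0 - abar_from) * eps) / math.sqrt(abar_from)
    # ... then re-noise it to the target level with the same noise.
    return math.sqrt(abar_to) * x0_pred + math.sqrt(1.0 - abar_to) * eps

def generate(x_T, eps_model, abars):
    """Latent to image: apply the update for t = T, ..., 1 (abars[0] must be 1)."""
    x = x_T
    for t in range(len(abars) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t), abars[t], abars[t - 1])
    return x

def invert(x_0, eps_model, abars):
    """Image to latent: apply the same update upward for t = 0, ..., T - 1."""
    x = x_0
    for t in range(len(abars) - 1):
        x = ddim_step(x, eps_model(x, t), abars[t], abars[t + 1])
    return x

# Assumed schedule: abars[0] = 1 (no noise), then a linear beta schedule.
T = 50
abars, prod = [1.0], 1.0
for i in range(T):
    prod *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    abars.append(prod)

def toy_eps_model(x, t):
    """Placeholder for a trained noise predictor; returns a constant."""
    return 0.3

image = 1.5  # hypothetical scalar standing in for an image
latent = invert(image, toy_eps_model, abars)
reconstruction = generate(latent, toy_eps_model, abars)
print(latent, reconstruction)
```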
Resources
- Google AI Blog: High Fidelity Image Generation Using Diffusion Models (https://ai.googleblog.com/2021/07/high-fidelity-image-generation-using.html) - discusses SR3 and CDM