SinGAN: Learning a Generative Model from a Single Natural Image


SinGAN
Paper
Author's Website
Supplementary Material Mirror(CVF Host)
ICCV
Github Official PyTorch Implementation
SinGAN: Learning a Generative Model from a Single Natural Image
Authors: Tamar Rott Shaham (Technion), Tali Dekel (Google Research), Tomer Michaeli (Technion)


Basic Idea

Train a sequence of GANs, each filling in details at a different scale of the image.

  • Start by building a GAN to generate low-resolution versions of the original image.
  • Then upscale the generated image and build a GAN that adds details to the upscaled image.
  • Fix the parameters of the previous GAN, upscale its outputs, and repeat.
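
The coarse-to-fine schedule can be sketched as a list of image sizes, one per scale. This is only an illustration: the helper `pyramid_sizes`, the scale factor `r = 4/3`, and the example resolution are assumptions chosen for the example, not values stated on this page.

```python
def pyramid_sizes(height, width, num_scales, r=4 / 3):
    """Return (height, width) per scale, ordered coarsest first.

    Scale index n runs from N (coarsest) down to 0 (full resolution);
    each step toward the fine end multiplies the size by r (assumed value).
    """
    sizes = []
    for n in range(num_scales - 1, -1, -1):
        factor = r ** n  # how much smaller scale n is than scale 0
        sizes.append((max(1, round(height / factor)),
                      max(1, round(width / factor))))
    return sizes


sizes = pyramid_sizes(height=180, width=240, num_scales=8)
for n, (h, w) in zip(range(len(sizes) - 1, -1, -1), sizes):
    print(f"scale {n}: {h} x {w}")
```

One GAN would be trained per entry in this list, starting with the smallest size.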

Architecture

They build a hierarchy of \(\displaystyle N\) PatchGANs, where \(\displaystyle N\) is usually 7-8.
Each GAN \(\displaystyle G_n\) adds details to the upscaled image produced by the GAN \(\displaystyle G_{n+1}\) below it.
The final GAN \(\displaystyle G_0\) adds only fine details.
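
A minimal sketch of how sampling flows through such a hierarchy, using 1-D lists in place of images. Everything here is a stand-in: `toy_generator` is a hypothetical placeholder for a trained \(\displaystyle G_n\), `upsample` is a nearest-neighbour resize, and the scale lengths are arbitrary.

```python
import random


def upsample(signal, new_len):
    """Nearest-neighbour resize of a 1-D list (stand-in for image upscaling)."""
    old_len = len(signal)
    return [signal[min(old_len - 1, i * old_len // new_len)]
            for i in range(new_len)]


def toy_generator(noise, prev):
    """Hypothetical stand-in for G_n: adds a small residual to its input."""
    return [p + 0.1 * z for p, z in zip(prev, noise)]


def sample(lengths, rng):
    """Run the hierarchy from the coarsest scale (first length) to the finest."""
    # The coarsest generator has no image below it, so it starts from zeros.
    x = [0.0] * lengths[0]
    for length in lengths:
        prev = upsample(x, length)  # output of the scale below, upscaled
        noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
        x = toy_generator(noise, prev)  # this scale adds its own detail
    return x


out = sample(lengths=[4, 6, 8, 11, 15], rng=random.Random(0))
print(len(out))
```

Each pass through the loop plays the role of one generator: it receives the upscaled output of the scale below plus fresh noise and returns a slightly more detailed signal.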

Generator

They use N generators, which they call a hierarchy of patch-GANs.
Each generator consists of 5 convolutional blocks:
Conv(\(\displaystyle 3 \times 3\))-BatchNorm-LeakyReLU.
Note: This generator is similar to pix2pix.
They use 32 kernels per block at the coarsest scale and double this number every 4 scales.
This means the inner convolutional layers at the coarsest scales have 32 input channels and 32 output channels.
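
The channel schedule can be written as a one-line formula. The sketch below assumes that "every 4 scales" means the count doubles after each complete group of four scales, counted from the coarsest; the function name and its parameters are illustrative.

```python
def channels_at_scale(steps_from_coarsest, base=32, every=4):
    """Kernels per block after `steps_from_coarsest` scales (assumed indexing)."""
    return base * 2 ** (steps_from_coarsest // every)


schedule = [channels_at_scale(k) for k in range(8)]
print(schedule)
```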

Batch Normalization

Definitions
  • Internal Covariate Shift - "the change in distribution of network activations as network parameters change"
  • Whitening - "linearly transformed to have zero means and unit variances, and decorrelated"

The idea is to apply an element-wise normalization per minibatch to improve training speed.

B <- get minibatch
u_B <- calculate minibatch mean
sigma2_B <- calculate minibatch variance
x_hat <- (x - u_B) / sqrt(sigma2_B + eps) // normalize
y <- gamma*x_hat + beta // scale and shift
  • \(\displaystyle \epsilon\) is added for numerical stability (e.g. if sigma is small)
  • In the algorithm, \(\displaystyle \gamma\) and \(\displaystyle \beta\) are learned parameters.
  • The final scale and shift allow the entire BN process to represent an identity transformation.
    • The BN paper gives the sigmoid activation as an example: it is nearly linear around 0, so normalizing its inputs without the scale and shift would almost eliminate its non-linearity.
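
The steps above can be sketched for a single feature in plain Python. This is a training-time illustration only (no running statistics), and the function name and sample values are made up for the example. The second call shows how a suitable choice of \(\displaystyle \gamma\) and \(\displaystyle \beta\) turns the whole operation back into the identity.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a minibatch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
print(normalized)

# Choosing gamma = sqrt(var + eps) and beta = mean undoes the normalization.
mean = sum(activations) / len(activations)
var = sum((x - mean) ** 2 for x in activations) / len(activations)
restored = batch_norm(activations, gamma=math.sqrt(var + 1e-5), beta=mean)
print(restored)
```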

Leaky Relu

ReLU is \(\displaystyle \begin{cases} x & \text{if }x \gt 0\\ 0 & \text{if }x \leq 0 \end{cases} \).
If the input is \(\displaystyle \leq 0\), the gradient through that neuron is 0.
This leads to dead neurons, which remain dead as long as their input never becomes positive.
That is, you get neurons which always output \(\displaystyle 0\) throughout the training process.
Leaky ReLU, \(\displaystyle \begin{cases} x & \text{if }x \gt 0\\ 0.01x & \text{if }x \leq 0 \end{cases} \), always has a nonzero gradient, so the weights feeding the neuron continue to be updated.
SinGAN uses a negative slope of 0.2 (instead of the default 0.01).
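
A small sketch comparing the two activations and the gradient of the leaky variant, using the slope of 0.2 mentioned above. The function names are chosen for this example only.

```python
def relu(x):
    """Standard ReLU: zero output (and zero gradient) for x <= 0."""
    return x if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.2):
    """Leaky ReLU: a small linear response for x <= 0."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.2):
    """Derivative of leaky ReLU; never exactly zero."""
    return 1.0 if x > 0 else negative_slope


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x), leaky_relu_grad(x))
```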

Discriminator

The architecture is the same as the generator.
The patch size is \(\displaystyle 11 \times 11\). The GAN used is PatchGAN from pix2pix.
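
The \(\displaystyle 11 \times 11\) patch size matches the receptive field of five stacked \(\displaystyle 3 \times 3\) convolutions. The calculation below assumes stride 1 and no dilation in every block, which this page does not state explicitly; the helper name is illustrative.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field (in pixels) of a stack of identical conv layers."""
    field, jump = 1, 1
    for _ in range(num_layers):
        field += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                     # spacing between adjacent outputs
    return field


print(receptive_field(num_layers=5))  # prints 11
```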
PatchGAN's discriminator is referred to as a Markovian discriminator because the receptive field is smaller than the size of the image.
"Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter."

Training and Loss Function

\(\displaystyle \min_{G_n} \max_{D_n} \mathcal{L}_{adv}(G_n, D_n) + \alpha \mathcal{L}_{rec}(G_n)\)
They use a combination of the standard GAN adversarial loss and a reconstruction loss.

Adversarial Loss

They use the WGAN-GP loss (Wasserstein GAN with gradient penalty).
This drops the log from the traditional cross-entropy loss.

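# When training the discriminator (critic)
netD.zero_grad()
output = netD(real).to(opt.device)
#D_real_map = output.detach()
errD_real = -output.mean()  # raise the critic score on real patches
errD_real.backward(retain_graph=True)
#... Make noise and prev ...
fake = netG(noise.detach(),prev)
output = netD(fake.detach())  # detach so that only netD receives gradients
errD_fake = output.mean()  # lower the critic score on generated patches
errD_fake.backward(retain_graph=True)
# (the gradient penalty term of WGAN-GP is not shown in this excerpt)

# When training the generator
output = netD(fake)
errG = -output.mean()  # the generator tries to raise the critic score on fakes
errG.backward(retain_graph=True)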
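
To make the gradient penalty concrete, here is a toy version with a linear critic \(\displaystyle D(x) = w \cdot x\), whose gradient with respect to the input is simply \(\displaystyle w\). This is a sketch for intuition, not the repository's code: the critic, the data, and the penalty weight `lam` are all assumptions.

```python
import math


def critic(x, w, b=0.0):
    """Toy linear critic D(x) = w . x + b (a stand-in for netD)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def wgan_gp_losses(real_batch, fake_batch, w, lam=0.1):
    """Return (critic_loss, generator_loss) for the toy linear critic."""
    d_real = sum(critic(x, w) for x in real_batch) / len(real_batch)
    d_fake = sum(critic(x, w) for x in fake_batch) / len(fake_batch)
    # For a linear critic the input gradient is w everywhere, so the
    # penalty does not depend on where it is evaluated.
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    penalty = lam * (grad_norm - 1.0) ** 2
    critic_loss = d_fake - d_real + penalty  # plain means, no log
    generator_loss = -d_fake
    return critic_loss, generator_loss


real = [[1.0, 1.0], [1.0, 1.0]]
fake = [[0.0, 0.0], [0.0, 2.0]]
print(wgan_gp_losses(real, fake, w=[0.6, 0.8]))
```

With `w = [0.6, 0.8]` the gradient norm is 1, so the penalty vanishes and the critic loss reduces to the difference of the two mean scores.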

Reconstruction Loss

\(\displaystyle \mathcal{L}_{rec} = \Vert G_n(0,(\bar{x}^{rec}_{n+1})\uparrow^r) - x_n \Vert ^2\)
The reconstruction loss ensures that the original image can be built by the GAN.
Rather than inputting noise to the generators, they input \(\displaystyle \{z_N^{rec}, z_{N-1}^{rec}, ..., z_0^{rec}\} = \{z^*, 0, ..., 0\}\) where the initial noise \(\displaystyle z^*\) is drawn once and then fixed during the rest of the training.
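The standard deviation \(\displaystyle \sigma_n\) of the noise \(\displaystyle z_n\) is proportional to the root mean squared error (RMSE) between the upscaled reconstruction from the scale below and the original image \(\displaystyle x_n\) at that scale, which indicates how much detail still has to be added.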
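
A minimal sketch of the two ideas in this section: the fixed reconstruction inputs \(\displaystyle \{z^*, 0, ..., 0\}\) and a noise level derived from the RMSE. The proportionality constant `base_amp` and the helper names are hypothetical, and 1-D lists stand in for images.

```python
import math


def rmse(a, b):
    """Root mean squared error between two equally sized signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def noise_std(upsampled_rec, real, base_amp=0.1):
    """sigma_n proportional to the RMSE; `base_amp` is a hypothetical constant."""
    return base_amp * rmse(upsampled_rec, real)


def reconstruction_noise(num_scales, z_star):
    """Fixed inputs {z*, 0, ..., 0}, ordered coarsest scale first."""
    return [z_star] + [0.0] * (num_scales - 1)


real = [0.0, 1.0, 0.0, 1.0]
blurry = [0.5, 0.5, 0.5, 0.5]  # stand-in for the upscaled coarser reconstruction
print(noise_std(blurry, real))
print(reconstruction_noise(num_scales=3, z_star=0.7))
```

The larger the gap between the upscaled reconstruction and the real image at a scale, the more noise that scale receives.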

import torch.nn as nn

loss = nn.MSELoss()
# Fixed reconstruction noise, scaled by this scale's noise amplitude
Z_opt = opt.noise_amp*z_opt+z_prev
rec_loss = alpha*loss(netG(Z_opt.detach(),z_prev),real)
rec_loss.backward(retain_graph=True)

Evaluation

They evaluate their method with an Amazon Mechanical Turk (AMT) user study and with the Single Image Fréchet Inception Distance (SIFID).

Amazon Mechanical Turk Study

Frechet Inception Distance

Results

Below are images of their results from their paper and website.


Applications

The following are applications they identify.
The basic idea for each of these applications is to start your input at an intermediate GAN rather than the bottom GAN.
While the bottom layer is a purely unconditional GAN, the intermediate generators are more akin to conditional GANs.
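
The injection idea can be sketched with 1-D lists standing in for images. Here `refine` is a hypothetical placeholder for a trained generator run with zero noise (it only smooths its input), and the scale lengths and start index are arbitrary choices.

```python
def resize(signal, new_len):
    """Nearest-neighbour resize of a 1-D list."""
    old_len = len(signal)
    return [signal[min(old_len - 1, i * old_len // new_len)]
            for i in range(new_len)]


def refine(prev):
    """Hypothetical stand-in for a trained G_n applied with zero noise."""
    n = len(prev)
    return [(prev[max(i - 1, 0)] + prev[i] + prev[min(i + 1, n - 1)]) / 3
            for i in range(n)]


def inject(image, lengths, start):
    """Feed `image` into scale index `start` of `lengths` (coarsest first)."""
    x = resize(image, lengths[start])  # bring the input to that scale's size
    for length in lengths[start:]:
        x = refine(resize(x, length))  # each remaining scale refines the result
    return x


painting = [0.0] * 8 + [1.0] * 8  # a crude two-tone "drawing"
out = inject(painting, lengths=[4, 6, 8, 11, 16], start=2)
print([round(v, 2) for v in out])
```

Choosing a smaller `start` lets more scales alter the input, while a larger `start` preserves more of its original structure.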

Super-Resolution

Upscaling

Paint-to-Image

Convert a drawing to an image.

Harmonization

Harmonize a cut-and-pasted image region, i.e. blend its style with that of the surrounding image.

Editing

Single Image Animation

Generate a video from a single image.
In SinGAN, they perform a random walk in the noise passed as inputs to the upper levels of GANs.
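
As an illustration of the idea, the sketch below builds a sequence of noise vectors with a plain Gaussian random walk. This is a simplification and not necessarily the update rule the authors use; the function name, step size, and seed are assumptions.

```python
import random


def noise_random_walk(z_start, num_frames, step=0.1, seed=0):
    """Return a list of noise vectors, one per frame, starting at z_start."""
    rng = random.Random(seed)
    frames = [list(z_start)]
    for _ in range(num_frames - 1):
        prev = frames[-1]
        # Each frame's noise is a small perturbation of the previous frame's.
        frames.append([z + step * rng.gauss(0.0, 1.0) for z in prev])
    return frames


frames = noise_random_walk(z_start=[0.0, 0.0, 0.0], num_frames=5)
for t, z in enumerate(frames):
    print(t, [round(v, 3) for v in z])
```

Feeding consecutive noise vectors to the generators would yield consecutive video frames that change gradually rather than abruptly.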

Repo

The official SinGAN repository is hosted on GitHub.

Citation

Here is the bibtex:

@InProceedings{Shaham_2019_ICCV,
author = {Shaham, Tamar Rott and Dekel, Tali and Michaeli, Tomer},
title = {SinGAN: Learning a Generative Model From a Single Natural Image},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}
}