VQ-VAE (Vector Quantized VAE) performs quantization of the latent space.
The quantization step is non-differentiable, but gradients can be copied straight through it, from the decoder input back to the encoder output.
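A minimal PyTorch sketch of this straight-through trick (the names <code>quantize_straight_through</code> and <code>codebook</code> are illustrative, not the original VQ-VAE code):
<syntaxhighlight lang="python">
import torch

def quantize_straight_through(z_e, codebook):
    """Nearest-neighbor quantization with straight-through gradients.
    z_e: (batch, d) encoder outputs; codebook: (K, d) code vectors."""
    dists = torch.cdist(z_e, codebook)   # (batch, K) pairwise distances
    idx = dists.argmin(dim=1)            # index of nearest code per input
    z_q = codebook[idx]                  # quantized latents, (batch, d)
    # Forward pass uses z_q; backward pass copies gradients to z_e,
    # since the detached term contributes no gradient.
    return z_e + (z_q - z_e).detach(), idx
</syntaxhighlight>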
==Generative Adversarial Networks (GANs)==
Given data <math>\{y_1,...,y_n\}</math>.
The goal of the generator is to take random noise <math>\{x_1,...,x_n\}</math> and generate fake data <math>\{\hat{y}_1,...,\hat{y}_n\}</math>.
Then there is a discriminator which takes in <math>\{y_i\}</math> and <math>\{\hat{y}_i\}</math> and guides the generator by trying to distinguish real samples from fake ones.
In practice, both are deep neural networks.
The optimization is <math>\min_{G} \max_{D} f(G, D)</math>.
GAN training is challenging.
Oftentimes, there are convergence issues.
There can also be mode collapse, where the generator covers only a few modes of the data distribution.
Generalization can be poor and performance evaluation is subjective.
A common approach for training GANs is alternating gradient descent, sketched below.
However, this usually does not converge to the optimal generator <math>G^*</math>.
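A sketch of one round of alternating gradient descent on the min-max objective above, assuming <math>G</math> maps noise to samples and <math>D</math> outputs a single logit; the non-saturating generator loss is used, a common substitute for the raw objective:
<syntaxhighlight lang="python">
import torch
from torch import nn

def gan_step(G, D, y_real, opt_G, opt_D, noise_dim):
    """One discriminator update followed by one generator update."""
    bce = nn.BCEWithLogitsLoss()
    n = y_real.size(0)

    # Discriminator step: ascend f(G, D) with G held fixed.
    opt_D.zero_grad()
    y_fake = G(torch.randn(n, noise_dim)).detach()   # freeze G
    loss_D = bce(D(y_real), torch.ones(n, 1)) + \
             bce(D(y_fake), torch.zeros(n, 1))
    loss_D.backward()
    opt_D.step()

    # Generator step: descend with D held fixed (non-saturating loss).
    opt_G.zero_grad()
    loss_G = bce(D(G(torch.randn(n, noise_dim))), torch.ones(n, 1))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
</syntaxhighlight>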
===Reducing unsupervised to supervised===
;Formulating GANs
Given <math>\{y_i\}</math> and <math>\{x_i\}</math>.
We need to find a generator <math>G</math> s.t. <math>G(X) \stackrel{\text{dist}}{\approx} Y</math>.
Given some data <math>\{y_i\}</math>, generate some randomness <math>\{x_i\}</math>.
Create a ''coupling'' <math>\pi</math>, i.e. a permutation that pairs each <math>x_{\pi(i)}</math> with <math>y_i</math>, to get paired examples <math>\{(x_{\pi(i)}, y_i)\}</math>.
Then we have a supervised problem:
<math>\min_{\pi} \min_{G} \frac{1}{n} \sum_{i=1}^{n} l(\mathbf{y}_i, G(\mathbf{x}_{\pi(i)}))</math>
We can replace the coupling with a joint distribution over <math>(X, Y)</math> with the given marginals:
<math>\min_{\mathbb{P}_{X, Y}} \min_{G} E_{\mathbb{P}_{X,Y}}[ l(Y, G(X))]</math>.
By switching the min and substituting <math>\hat{Y} = G(X)</math>:
<math>\min_{G} \min_{\mathbb{P}_{Y, \hat{Y}}} E_{\mathbb{P}_{Y, \hat{Y}}}[l(Y, \hat{Y})]</math>.
The inner minimization is the optimal transport distance between <math>P_Y</math> and <math>P_{\hat{Y}}</math>.
====Optimal Transport (Earth-Mover)====
Optimal transport gives non-parametric distances between probability measures.
It is well-defined even when the two distributions have disjoint supports.
The cost of ''transporting'' one set of points onto another is:
<math>\min_{\pi} \frac{1}{n} \sum_{i=1}^{n} l(y_i, \hat{y}_{\pi(i)})</math>.
If <math>l</math> is the Euclidean distance, then <math>dist(P_{Y},P_{\hat{Y}}) = W(P_{Y}, P_{\hat{Y}})</math>, the Wasserstein distance.
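For empirical distributions of equal size, the optimal coupling is a permutation and can be computed exactly as an assignment problem. A sketch using SciPy (<code>empirical_ot</code> is an illustrative name):
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_ot(y, y_hat):
    """min over permutations pi of (1/n) * sum_i ||y_i - y_hat_{pi(i)}||."""
    # (n, n) matrix of Euclidean transport costs
    cost = np.linalg.norm(y[:, None, :] - y_hat[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # optimal assignment
    return cost[rows, cols].mean(), cols       # OT cost and permutation

y = np.random.randn(100, 2)          # "real" points
y_hat = np.random.randn(100, 2) + 1  # "generated" points
dist, pi = empirical_ot(y, y_hat)
</syntaxhighlight>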
;Optimization
The primal is <math>dist(P_Y, P_{\hat{Y}}) = \min_{\mathbb{P}_{Y, \hat{Y}}} E[l(Y, \hat{Y})]</math>, where the minimum is over couplings of <math>P_Y</math> and <math>P_{\hat{Y}}</math>.
;WGAN Formulation
The dual of <math>\min_{G} W_1(P_Y, P_{\hat{Y}})</math> is <math>\min_{G} \max_{D \text{ 1-Lipschitz}} \left[ E[D(Y)] - E[D(\hat{Y})] \right]</math>.
The Lipschitz constraint on the discriminator can be enforced by weight clipping, as sketched below.
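A sketch of one WGAN critic update with weight clipping, assuming the same <math>G</math>, <math>D</math> setup as above; the clipping threshold 0.01 matches the value suggested in the original WGAN paper:
<syntaxhighlight lang="python">
import torch

def critic_step(G, D, y_real, opt_D, noise_dim, clip=0.01):
    """Maximize E[D(Y)] - E[D(Y_hat)], then clip weights so that
    D stays (roughly) Lipschitz."""
    opt_D.zero_grad()
    y_fake = G(torch.randn(y_real.size(0), noise_dim)).detach()
    # negate because optimizers minimize
    loss = -(D(y_real).mean() - D(y_fake).mean())
    loss.backward()
    opt_D.step()
    with torch.no_grad():
        for p in D.parameters():
            p.clamp_(-clip, clip)    # crude Lipschitz enforcement
    return -loss.item()
</syntaxhighlight>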
==Misc==