VQ-VAE (vector quantized VAE) performs quantization of the latent space. 
The quantization is non-differentiable, but gradients can be copied from the decoder input back to the encoder output (the straight-through estimator).
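Below is a minimal sketch of the gradient-copying trick in PyTorch; the tensor shapes and function name are illustrative, not from the original notes.

<syntaxhighlight lang="python">
import torch

def vq_straight_through(z_e, codebook):
    """Quantize encoder outputs z_e to their nearest codebook vectors.

    z_e:      (batch, dim) encoder outputs
    codebook: (K, dim) learned embedding vectors
    """
    # Nearest-neighbor lookup (non-differentiable).
    dists = torch.cdist(z_e, codebook)   # (batch, K) pairwise distances
    idx = dists.argmin(dim=1)            # (batch,) codebook indices
    z_q = codebook[idx]                  # (batch, dim) quantized vectors

    # Straight-through: the forward pass uses z_q, but the backward pass
    # treats quantization as the identity, so gradients flow to z_e.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, idx
</syntaxhighlight>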
==Generative Adversarial Networks (GANs)==
Given data <math>\{y_1,...,y_n\}</math>. 
The goal of the generator is to take random noise <math>\{x_1,...,x_n\}</math> and generate fake data <math>\{\hat{y}_1,...,\hat{y}_n\}</math> whose distribution matches that of the real data. 
A discriminator takes in the real samples <math>\{y_i\}</math> and the fake samples <math>\{\hat{y}_i\}</math> and guides the generator by learning to tell them apart. 
In practice, both are implemented as deep neural networks. 
The optimization is the minimax problem <math>\min_{G} \max_{D} f(G, D)</math>.
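For instance, in the original GAN of Goodfellow et al., <math>f</math> is the cross-entropy objective: 
<math>f(G, D) = E_{Y}[\log D(Y)] + E_{X}[\log(1 - D(G(X)))]</math>.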
GAN training is challenging. 
Oftentimes, there are convergence issues. 
There can also be mode collapse, where the generator covers only a few modes of the data distribution. 
Generalization can be poor, and performance evaluation is largely subjective.
A common approach for training GANs is alternating gradient descent: alternate between a gradient ascent step on the discriminator and a gradient descent step on the generator. 
However, this usually does not converge to the optimal generator <math>G^*</math>.
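Below is a minimal sketch of alternating gradient descent in PyTorch; the architectures, noise dimension, and hyperparameters are illustrative placeholders, not from the original notes.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Placeholder networks; any generator/discriminator pair can be used.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(y_real):
    n = y_real.size(0)
    x = torch.randn(n, 16)  # random noise input to the generator

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    y_fake = G(x).detach()  # detach so this step only updates D
    loss_D = bce(D(y_real), torch.ones(n, 1)) + \
             bce(D(y_fake), torch.zeros(n, 1))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator step: update G so that D labels its samples as real.
    loss_G = bce(D(G(x)), torch.ones(n, 1))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
</syntaxhighlight>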
===Reducing unsupervised to supervised===
;Formulating GANs
Given real data <math>\{y_i\}</math> and random noise <math>\{x_i\}</math>, 
we need to find a generator <math>G</math> s.t. <math>G(X) \stackrel{\text{dist}}{\approx} Y</math>.
Given some data <math>\{y_i\}</math>, generate some randomness <math>\{x_i\}</math>. 
Pick a ''coupling'' <math>\pi</math> (a permutation of the indices) to create paired examples <math>\{(x_{\pi(i)}, y_i)\}</math>. 
Then we have the supervised problem:
<math>\min_{\pi} \min_{G} \frac{1}{n} \sum_{i=1}^{n} l(y_i, G(x_{\pi(i)}))</math> 
We can replace the discrete coupling with a joint distribution (the expectation replaces the empirical average):
<math>\min_{\mathbb{P}_{X,Y}} \min_{G} E_{\mathbb{P}_{X,Y}}[ l(Y, G(X))]</math>. 
By switching the order of the minimizations and substituting <math>\hat{Y} = G(X)</math>: 
<math>\min_{G} \min_{\mathbb{P}_{Y,\hat{Y}}} E_{\mathbb{P}_{Y,\hat{Y}}}[l(Y, \hat{Y})]</math>. 
The inner minimization, over joint distributions with fixed marginals <math>P_Y</math> and <math>P_{\hat{Y}}</math>, is the optimal transport distance.
====Optimal Transport (Earth-Mover)====
Optimal transport gives a non-parametric distance between probability measures. 
It is well-defined even when the two distributions do not share support. 
The distance is the cost of ''transporting'' the generated points onto the data points:
<math>\min_{\pi} \frac{1}{n} \sum_{i=1}^{n} l(y_i, \hat{y}_{\pi(i)})</math>. 
If <math>l</math> is the Euclidean (<math>\ell_2</math>) distance, then <math>dist(P_{Y},P_{\hat{Y}}) = W(P_{Y}, P_{\hat{Y}})</math>, the Wasserstein distance.
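For two equal-size empirical samples, the minimization over couplings <math>\pi</math> is an assignment problem, which can be solved exactly; this is a small illustrative sketch with toy data, not part of the original notes.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import linear_sum_assignment

def earth_mover(y, y_hat):
    """Empirical OT cost: min over permutations pi of
    (1/n) * sum_i l(y_i, y_hat_{pi(i)}), with l squared Euclidean."""
    # Pairwise squared-Euclidean cost matrix, shape (n, n).
    cost = ((y[:, None, :] - y_hat[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal permutation
    return cost[rows, cols].mean()

# Toy usage: two small 2-D point clouds.
rng = np.random.default_rng(0)
y = rng.normal(size=(8, 2))
y_hat = rng.normal(size=(8, 2)) + 1.0
print(earth_mover(y, y_hat))
</syntaxhighlight>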
;Optimization
The primal form is <math>dist(P_Y, P_{\hat{Y}}) = \min_{\mathbb{P}_{Y,\hat{Y}}} E[l(Y, \hat{Y})]</math>, where the minimum is over couplings, i.e. joint distributions with marginals <math>P_Y</math> and <math>P_{\hat{Y}}</math>. 
;WGAN Formulation
By Kantorovich-Rubinstein duality, the dual of <math>\min_{G} W_1(P_Y, P_{\hat{Y}})</math> is <math>\min_{G} \max_{D: \|D\|_L \leq 1} \left[ E[D(Y)] - E[D(\hat{Y})] \right]</math>, where the maximum is over 1-Lipschitz discriminators (critics). 
The Lipschitz constraint on the discriminator can be enforced by weight clipping.
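A minimal sketch of one critic update with weight clipping in PyTorch; the function name is illustrative, and the threshold 0.01 is the default from the WGAN paper.

<syntaxhighlight lang="python">
import torch

CLIP = 0.01  # clipping threshold; 0.01 is the WGAN paper's default

def critic_step(D, opt_D, y_real, y_fake):
    # y_fake should already be detached from the generator's graph.
    # Maximize E[D(Y)] - E[D(Y_hat)] by minimizing its negation.
    loss = -(D(y_real).mean() - D(y_fake).mean())
    opt_D.zero_grad()
    loss.backward()
    opt_D.step()

    # Enforce a (crude) Lipschitz bound by clipping every weight.
    with torch.no_grad():
        for p in D.parameters():
            p.clamp_(-CLIP, CLIP)
</syntaxhighlight>

Weight clipping is a blunt way to bound the Lipschitz constant; later work replaces it with a gradient penalty (WGAN-GP).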


==Misc==