* KL is always >= 0
* KL is not symmetric
* Jensen-Shannon Divergence
** <math>JSD(P \Vert Q) = \frac{1}{2}KL(P \Vert M) + \frac{1}{2}KL(Q \Vert M)</math> where <math>M = \frac{1}{2}(P + Q)</math>
** This is symmetric (a short numerical sketch follows this list)
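A minimal numpy sketch of these two quantities for discrete distributions; the vectors <code>p</code> and <code>q</code> below are made-up examples, not values from these notes.
<syntaxhighlight lang="python">
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Only sum where p > 0; KL is infinite if q = 0 somewhere p > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence using the mixture m = (p + q) / 2."""
    m = (np.asarray(p, dtype=float) + np.asarray(q, dtype=float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.3, 0.2])   # made-up distributions
q = np.array([0.2, 0.3, 0.5])
print(kl(p, q), kl(q, p))       # generally different: KL is not symmetric
print(jsd(p, q), jsd(q, p))     # equal: JSD is symmetric
</syntaxhighlight>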
====Model====
Our model for how the data is generated is as follows (a small sampling sketch follows the list):
* Generate latent variables <math>z^{(1)},...,z^{(m)} \in \mathbb{R}^r</math> iid where dimension r is less than n.
** We assume <math>Z^{(i)} \sim N(\mathbf{0},\mathbf{I})</math>
* Generate <math>x^{(i)}</math> where <math>X^{(i)} \vert Z^{(i)} \sim N(g_{\theta}(z^{(i)}), \sigma^2 \mathbf{I})</math>
** For some function <math>g_{\theta}</math> parameterized by <math>\theta</math>
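As a concrete illustration of this generative process, here is a minimal numpy sketch. The decoder <math>g_{\theta}</math> is stood in for by a fixed random linear map, and the dimensions and noise level are arbitrary placeholder choices.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
r, n, m, sigma = 2, 5, 100, 0.1   # latent dim, data dim, sample count, noise level (arbitrary)

# Stand-in for the decoder g_theta: here just a fixed linear map from R^r to R^n.
W = rng.normal(size=(n, r))
def g_theta(z):
    return W @ z

# Follow the model: z ~ N(0, I), then x | z ~ N(g_theta(z), sigma^2 I).
Z = rng.normal(size=(m, r))
X = np.array([g_theta(z) + sigma * rng.normal(size=n) for z in Z])
print(X.shape)  # (100, 5)
</syntaxhighlight>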
====Variational Bound====
The variational bound is:
* <math>\log P(x^{(i)}) \geq E_{Z \sim Q_i}[\log P(X^{(i)} \vert Z)] - KL(Q_i(z) \Vert P(z))</math>
{{hidden | Derivation |
We know from Bayes' rule that <math>P(z|X) = \frac{P(X|z)P(z)}{P(X)}</math>.<br>
Plugging this into the expression for <math>KL(Q_i(z) \Vert P(z|X))</math> yields our inequality.<br>
<math>KL(Q_i(z) \Vert P(z|X)) = E_{Q_i} \left[ \log\left(\frac{Q_i(z)}{P(z|X)}\right) \right]</math><br>
<math>= E_{Q_i} \left[ \log\left(\frac{Q_i(z) P(x^{(i)})}{P(X|z)P(z)}\right) \right]</math><br>
<math>= E_{Q_i} \left[ \log\left(\frac{Q_i(z)}{P(z)}\right) \right] + \log P(x^{(i)}) - E_{Q_i}\left[\log P(X|z)\right]</math><br>
<math>= KL(Q_i(z) \Vert P(z)) + \log P(x^{(i)}) - E_{Q_i}\left[\log P(X|z)\right]</math><br>
Rearranging terms we get:<br>
<math>\log P(x^{(i)}) - KL(Q_i(z) \Vert P(z|X)) = E_{Q_i}\left[\log P(X|z)\right] - KL(Q_i(z) \Vert P(z))</math><br>
Since the KL divergence is always greater than or equal to 0, the variational bound follows.
}}
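A numpy sketch of estimating the right-hand side of this bound for a single example, assuming <math>Q_i</math> is a diagonal Gaussian and <math>P(x|z) = N(g_{\theta}(z), \sigma^2 I)</math> as in the model above. The function name, the linear decoder, and all numbers are placeholders for illustration.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu, log_var, g_theta, sigma=0.1, num_samples=10):
    """Monte Carlo estimate of E_{z~Q}[log P(x|z)] - KL(Q(z) || N(0, I)),
    assuming Q = N(mu, diag(exp(log_var))) and P(x|z) = N(g_theta(z), sigma^2 I)."""
    n, r = x.shape[0], mu.shape[0]
    # Closed-form KL between a diagonal Gaussian and the standard normal prior.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    # Monte Carlo estimate of the reconstruction term.
    recon = 0.0
    for _ in range(num_samples):
        z = mu + np.exp(0.5 * log_var) * rng.normal(size=r)   # sample z ~ Q
        diff = x - g_theta(z)
        recon += -0.5 * np.sum(diff**2) / sigma**2 - 0.5 * n * np.log(2 * np.pi * sigma**2)
    return recon / num_samples - kl

# Hypothetical usage: a linear decoder and made-up encoder outputs for one example.
W = rng.normal(size=(5, 2))
print(elbo_estimate(x=rng.normal(size=5), mu=np.zeros(2), log_var=np.zeros(2),
                    g_theta=lambda z: W @ z))
</syntaxhighlight>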
The main idea is to ensure that the discriminator is Lipschitz continuous and to limit its Lipschitz constant (i.e. bound the magnitude of its derivative).<br>
If the correct answer is 1.0 and the generator produces 1.0001, we don't want the discriminator to give us a very high loss.<br>
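One crude way to enforce such a bound in practice is the weight clipping used by the original WGAN; a minimal numpy sketch, where <code>params</code> and the surrounding training-loop functions are hypothetical placeholders:
<syntaxhighlight lang="python">
import numpy as np

def clip_weights(params, c=0.01):
    """Clamp every discriminator parameter to [-c, c] after each update,
    which crudely bounds the network's Lipschitz constant (as in the original WGAN)."""
    return [np.clip(w, -c, c) for w in params]

# Hypothetical usage inside a training loop:
#   params = update_discriminator(params, real_batch, fake_batch)  # placeholder update step
#   params = clip_weights(params, c=0.01)
</syntaxhighlight>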
====Earth mover's distance====
{{main | wikipedia:earth mover's distance}}
The minimum cost of converting one pile of dirt into another,<br>
where the cost of a move is the amount of dirt moved times the distance it is moved.<br>
Given a set <math>P</math> with m clusters and a set <math>Q</math> with n clusters:<br>
...
<math>EMD(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}d_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}}</math><br>
where <math>f_{i,j}</math> is the optimal flow of mass from cluster i of <math>P</math> to cluster j of <math>Q</math> and <math>d_{i,j}</math> is the distance between those two clusters.<br>
;Notes
* Also known as the Wasserstein metric
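For one-dimensional distributions this distance can be computed directly with <code>scipy.stats.wasserstein_distance</code>; a small sketch with made-up cluster positions and weights:
<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import wasserstein_distance

# Two 1-D "piles of dirt": cluster locations and how much mass sits at each.
p_positions, p_weights = np.array([0.0, 1.0, 2.0]), np.array([0.4, 0.4, 0.2])
q_positions, q_weights = np.array([0.5, 1.5]), np.array([0.5, 0.5])

# Minimum total (amount moved * distance moved) needed to turn P into Q.
print(wasserstein_distance(p_positions, q_positions, p_weights, q_weights))
</syntaxhighlight>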
==Dimension Reduction==
Goal: Reduce the dimension of a dataset.<br>
If each example <math>x \in \mathbb{R}^n</math>, we want to reduce each example to be in <math>\mathbb{R}^k</math> where <math>k < n</math>
===PCA===
Principal Component Analysis<br>
Preprocessing: Subtract the sample mean from each example so that the new sample mean is 0.<br>
Goal: Find a unit vector <math>v_1</math> such that the projection <math>v_1 \cdot x</math> has maximum variance.<br>
Result: The principal components are the eigenvectors of <math>X^TX</math>.<br>
Idea: Maximize the variance of the projections, subject to <math>\Vert v_1 \Vert = 1</math>.<br>
<math>\max_{\Vert v_1 \Vert = 1} \frac{1}{m}\sum_i (v_1 \cdot x^{(i)})^2</math><br>
Note that <math>\sum_i (v_1 \cdot x^{(i)})^2 = \sum_i v_1^T x^{(i)} (x^{(i)})^T v_1 = v_1^T \left(\sum_i x^{(i)} (x^{(i)})^T\right) v_1</math>.<br>
<!--
Thus our Lagrangian is <math>\max_{v_1} v_1^T \left(\sum_i x^{(i)} (x^{(i)})^T\right) v_1 - \alpha(\Vert v_1 \Vert^2 - 1)</math><br>
Taking the gradient and setting it to 0 we get:<br>
<math>2\left(\sum_i x^{(i)} (x^{(i)})^T\right) v_1 - 2\alpha v_1 = 0</math>,<br>
so <math>v_1</math> is an eigenvector of <math>\sum_i x^{(i)} (x^{(i)})^T = X^T X</math>.
-->
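A minimal numpy sketch of this procedure (the data matrix below is made up): mean-center the rows, form <math>X^T X</math>, and project onto its top-<math>k</math> eigenvectors.
<syntaxhighlight lang="python">
import numpy as np

def pca(X, k):
    """Project the rows of X (one example per row) onto the top-k principal components."""
    X = X - X.mean(axis=0)                  # subtract the sample mean
    cov = X.T @ X / X.shape[0]              # proportional to X^T X
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix: eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k eigenvectors = principal components
    return X @ top                          # projections onto the principal components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # made-up data: 100 examples in R^5
print(pca(X, 2).shape)          # (100, 2)
</syntaxhighlight>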
===Kernel PCA===
{{main | Wikipedia: Kernel principal component analysis}}
===Autoencoder===
You have an encoder and a decoder, which are both neural networks; the encoder compresses each example to a low-dimensional code and the decoder is trained to reconstruct the input from that code.
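A minimal PyTorch sketch of such an encoder/decoder pair; the layer sizes (784 to 128 to 32) and the batch of random inputs are arbitrary choices for illustration, not values from these notes.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n=784, k=32):
        super().__init__()
        # Encoder maps R^n -> R^k, decoder maps R^k -> R^n.
        self.encoder = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, k))
        self.decoder = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, n))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 784)                      # made-up batch of 16 examples
loss = nn.functional.mse_loss(model(x), x)    # train to reconstruct the input
loss.backward()
</syntaxhighlight>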