Expectation Maximization<br>
The key idea is to introduce intermediate coefficients to apply Jensen's inequality.<br>
=====Maximum Likelihood=====
Let <math>\theta_j = [\phi_j, \mu_j, \sigma_j]</math> denote our parameters.<br>
The likelihood is <math>L(\theta) = \prod_{i=1}^{m} P(x^{(i)}; \theta)</math> where <math>P</math> is our pdf/pmf.<br>
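As a concrete check of this definition, here is a minimal NumPy sketch that evaluates the log-likelihood <math>\log L(\theta) = \sum_i \log P(x^{(i)}; \theta)</math> of a two-component 1-D Gaussian mixture. The function name and the specific parameter values are illustrative, not from the notes.

```python
import numpy as np

def gmm_log_likelihood(x, phi, mu, sigma):
    # P(x; theta) = sum_j phi_j * N(x; mu_j, sigma_j^2), evaluated per sample
    x = np.asarray(x, dtype=float)[:, None]                 # shape (m, 1)
    comp = phi * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(np.log(comp.sum(axis=1))))          # sum_i log P(x_i; theta)

ll = gmm_log_likelihood([0.0, 1.0],
                        phi=np.array([0.5, 0.5]),
                        mu=np.array([0.0, 1.0]),
                        sigma=np.array([1.0, 1.0]))
```

Working with the log of the likelihood avoids the numerical underflow that multiplying many small densities would cause.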
</math><br>
<math>
\implies \log\left[E_{Q}\left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}(j)}\right)\right] \geq E_{Q} \left[ \log \left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}(j)} \right)\right]
</math><br>
Let <math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
The EM algorithm is an iterative algorithm which alternates between an E-Step and an M-Step.
=====E-Step=====
We will fix <math>\theta</math> and maximize J wrt <math>Q</math>.<br>
Jensen's inequality holds with equality iff either the function is linear or the random variable is degenerate.<br>
We will assume <math>\frac{P(x^{(i)}, z^{(i)}=j; \theta)}{Q^{(i)}(j)}</math> is a constant <math>c</math> that does not depend on <math>j</math>.<br>
This implies <math>Q^{(i)}(j) = c \cdot P(x^{(i)}, z^{(i)} = j ; \theta)</math>.<br>
Since Q is a pmf, we have <math>Q^{(i)}(j) = \frac{1}{P(x^{(i)})} P(x^{(i)}, z^{(i)} = j ; \theta) = P(z^{(i)}=j \mid x^{(i)}; \theta)</math><br>
Q is updated with <math>Q^{(i)}(j) = \frac{P(z^{(i)}=j)P(x^{(i)}|z^{(i)}=j)}{P(x^{(i)})}</math>
{{hidden | Maximization w.r.t Q |
<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
We are assuming that Q is a pmf so <math>\sum_j Q^{(i)}_{(j)} = 1</math><br>
Our Lagrangian is:<br>
<math>
\max_{Q} \min_{\beta} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right) +\beta (\sum_j Q^{(i)}_{(j)} - 1)\right]
</math>
We can maximize each <math>Q^{(i)}</math> independently.<br>
Taking the derivative wrt Q we get:<br>
<math>
\frac{\partial}{\partial Q^{(i)}_{(j)}} \left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right) +\beta (\sum_j Q^{(i)}_{(j)} - 1) \right]
</math><br>
<math>
= \log\left(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)}\right) + Q^{(i)}_{(j)} \left(-\frac{1}{Q^{(i)}_{(j)}}\right) + \beta
</math><br>
<math>
= \log\left(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)}\right) - 1 + \beta = 0
</math><br>
<math>
\implies Q^{(i)}_{(j)} = \frac{1}{\exp(1-\beta)}P(x^{(i)}, z^{(i)}=j;\theta)
</math><br>
Since Q is a pmf, we know it sums to 1, so we get the same result replacing <math>\frac{1}{\exp(1-\beta)}</math> with <math>\frac{1}{P(x^{(i)})}</math>
}}
=====M-Step=====
We will fix <math>Q</math> and maximize J wrt <math>\theta</math>.<br>
<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
Assume <math>\Sigma_j = I</math> for simplicity.<br>
Then<br>
<math>
J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)
</math><br>
<math>
= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)}, z^{(i)}=j;\theta)) + C_1
</math><br>
<math>
= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j) P(z^{(i)}=j)) + C_1
</math><br>
<math>
= \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j)) + Q^{(i)}_{(j)} \log( P(z^{(i)}=j)) \right] + C_1
</math><br>
<math>
= \sum_{i=1}^{m}\left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( (2\pi)^{-n/2}\exp(-\Vert x^{(i)} - \mu_j \Vert^2 / 2)) + Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_1
</math><br>
<math>
= \sum_{i=1}^{m}\left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \left( -\Vert x^{(i)} - \mu_j \Vert^2 / 2 \right) + Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_2
</math><br>
Maximizing wrt <math>\mu</math>, we get <math>\mu_j^* = (\sum_{i} Q^{(i)}_{(j)}x^{(i)}) / (\sum_{i}Q^{(i)}_{(j)})</math>.<br>
Maximizing wrt <math>\phi</math> we get <math>\phi_j^* = \frac{1}{m} \sum_{i=1}^{m}Q^{(i)}_{(j)}</math><br>
{{hidden | Maximization wrt phi |
We want to maximize <math>Q^{(i)}_{(j)} \log( \phi_j)</math> subject to the condition <math>\sum_j \phi_j = 1</math> since <math>\phi_j = P(z^{(i)}=j)</math> is a pmf.<br>
The Lagrangian for this is <math>\max_{\phi_1,...,\phi_k} \min_{\beta} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}Q^{(i)}_{(j)} \log( \phi_j) + \beta(\sum_{j=1}^{k}\phi_j - 1) \right]</math><br>
This can be rewritten as <math>\max_{\phi_1,...,\phi_k} \min_{\beta} \sum_{j=1}^{k}\left[\log( \phi_j) \sum_{i=1}^{m} Q^{(i)}_{(j)} + \beta(\phi_j - 1/k) \right]</math><br>
The dual of this problem is <math> \min_{\beta} \max_{\phi_1,...,\phi_k} \sum_{j=1}^{k} \left[\log( \phi_j) \sum_{i=1}^{m} Q^{(i)}_{(j)} + \beta(\phi_j - 1/k) \right]</math><br>
Taking the gradient w.r.t <math>\phi</math>, we get <math>\frac{1}{\phi_j}\sum_{i}Q^{(i)}_{(j)} + \beta = 0 \implies \phi_j = \frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})</math><br>
Plugging this into our dual problem we get:<br>
<math> \min_{\beta} \sum_{j=1}^{k} \left[\log\left(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})\right) \sum_{i=1}^{m} Q^{(i)}_{(j)} + \beta\left(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)}) - 1/k\right) \right]</math><br>
<math>= \min_{\beta} \sum_{j=1}^{k} \left[\log\left(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})\right) \sum_{i=1}^{m} Q^{(i)}_{(j)} -(\sum_{i}Q^{(i)}_{(j)}) - (\beta/k) \right]</math><br>
Taking the derivative with respect to <math>\beta</math>, we get:<br>
<math>
\sum_{j=1}^{k} \left[\left(\frac{1}{(-1/\beta)(\sum_i Q^{(i)}_{(j)})}\right)\left(\beta^{-2}\sum_i Q^{(i)}_{(j)}\right)\left(\sum_i Q^{(i)}_{(j)}\right) - \frac{1}{k}\right]
</math><br>
<math>
= \sum_{j=1}^{k} \left[\frac{-1}{\beta}\left(\sum_{i=1}^{m} Q^{(i)}_{(j)}\right) - \frac{1}{k}\right]
</math><br>
<math>
= \frac{-1}{\beta}\sum_{i=1}^{m} \sum_{j=1}^{k}P(z^{(i)} = j | x^{(i)}) - \sum_{j=1}^{k}\frac{1}{k}
</math><br>
<math>
= \frac{-1}{\beta}\sum_{i=1}^{m}1 - 1
</math><br>
<math>
= \frac{-m}{\beta} - 1 = 0
</math><br>
<math>
\implies \beta = -m
</math><br>
Plugging in <math>\beta = -m</math> into our equation for <math>\phi_j</math> we get <math>\phi_j = \frac{1}{m}\sum_{i=1}^{m}Q^{(i)}_{(j)}</math>
}}
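The E-step and M-step above can be sketched end-to-end in NumPy. This is an illustrative implementation for a Gaussian mixture with identity covariances (the simplification assumed above); the initialization heuristic and variable names are my own assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_gmm(X, k, iters=50):
    # EM for a Gaussian mixture with Sigma_j = I, matching the simplified
    # M-step derived above.
    m, n = X.shape
    phi = np.full(k, 1.0 / k)
    # Spread the initial means along the first coordinate (a simple heuristic)
    mu = X[np.argsort(X[:, 0])[np.linspace(0, m - 1, k).astype(int)]].copy()
    for _ in range(iters):
        # E-step: Q_i(j) = P(z_i = j | x_i; theta), the posterior responsibilities
        logp = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) + np.log(phi)
        logp -= logp.max(axis=1, keepdims=True)   # subtract the max for stability
        Q = np.exp(logp)
        Q /= Q.sum(axis=1, keepdims=True)         # normalize so sum_j Q_i(j) = 1
        # M-step: the closed-form maximizers derived above
        phi = Q.mean(axis=0)                      # phi_j = (1/m) sum_i Q_i(j)
        mu = (Q.T @ X) / Q.sum(axis=0)[:, None]   # mu_j = Q-weighted mean of x
    return phi, mu

X = np.concatenate([rng.normal(-4.0, 1.0, size=(100, 2)),
                    rng.normal(4.0, 1.0, size=(100, 2))])
phi, mu = em_gmm(X, k=2)
```

On this well-separated two-cluster dataset the recovered means land near (-4, -4) and (4, 4) and the mixing weights near 0.5 each.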
==Generative Models==
Goal: Generate realistic but fake samples.
Applications: Denoising, inpainting
===VAEs===
Variational Auto-Encoders<br>
[https://arxiv.org/abs/1606.05908 Tutorial]<br>
====KL Divergence====
* [[Wikipedia:Kullback–Leibler divergence]]
* <math>KL(P \Vert Q) =E_{P}\left[ \log\left(\frac{P(X)}{Q(X)}\right) \right]</math>
; Notes
* KL is always >= 0
* KL is not symmetric
* Jensen-Shannon Divergence
** <math>JSD(P \Vert Q) = \frac{1}{2}KL(P \Vert M) + \frac{1}{2}KL(Q \Vert M)</math> where <math>M = \frac{1}{2}(P + Q)</math>
** This is symmetric
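These properties are easy to verify numerically for discrete distributions. The helper names below are illustrative; `kl` assumes both pmfs have full support so the log is finite.

```python
import numpy as np

def kl(p, q):
    # KL(P || Q) = E_P[log(P(X)/Q(X))] for discrete pmfs with full support
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    # Jensen-Shannon divergence, using the mixture M = (P + Q)/2
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.9, 0.1], [0.5, 0.5]
```

Evaluating `kl(p, q)` and `kl(q, p)` gives two different positive numbers, while `jsd(p, q)` equals `jsd(q, p)` exactly.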
====Model====
Our model for how the data is generated is as follows:
* Generate latent variables <math>z^{(1)},...,z^{(m)} \in \mathbb{R}^r</math> iid where dimension r is less than n.
** We assume <math>Z^{(i)} \sim N(\mathbf{0},\mathbf{I})</math>
* Generate <math>x^{(i)}</math> where <math>X^{(i)} \vert Z^{(i)} \sim N(g_{\theta}(z), \sigma^2 \mathbf{I})</math>
** For some function <math>g_{\theta}</math> parameterized by <math>\theta</math>
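The two-step generative process can be sketched directly. Here a fixed random linear map stands in for the decoder <math>g_{\theta}</math> (in a real VAE it is a trained neural network); the dimensions and noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
r, n, sigma = 2, 5, 0.1    # latent dim r < data dim n; sizes are illustrative

# Stand-in for the decoder g_theta: a fixed random linear map here,
# whereas in a real VAE this would be a trained neural network.
W = rng.normal(size=(n, r))

def sample_x(m):
    # Step 1: z ~ N(0, I_r); step 2: x | z ~ N(g(z), sigma^2 I_n)
    z = rng.normal(size=(m, r))
    return z @ W.T + sigma * rng.normal(size=(m, n))

X = sample_x(1000)
```

Even though each sample lives in <math>\mathbb{R}^n</math>, the data concentrates near an r-dimensional subspace plus small isotropic noise, which is the structure the encoder later tries to invert.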
====Variational Bound====
The variational bound is:
* <math>\log P(x^{(i)}) \geq E_{Z \sim Q_i}[\log P(X^{(i)} \vert Z)] - KL(Q_i(z) \Vert P(z))</math>
{{hidden | Derivation |
We know from Bayes' rule that <math>P(z|X) = \frac{P(X|z)P(z)}{P(X)}</math>.<br>
Plugging this into the equation for <math>KL(Q_i(z) \Vert P(z|X))</math> yields our inequality.<br>
<math>KL(Q_i(z) \Vert P(z|X)) = E_{Q} \left[ \log\left(\frac{Q_i(z)}{P(z|X)}\right) \right]</math><br>
<math>=E_Q\left[\log\left(\frac{Q_i(z) P(x^{(i)})}{P(X|z)P(z)}\right)\right]</math><br>
<math>=E_Q\left[\log\left(\frac{Q_i(z)}{P(z)}\right)\right] + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]</math><br>
<math>=KL(Q_i(z) \Vert P(z)) + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]</math><br>
Rearranging terms we get:<br>
<math>\log P(x^{(i)}) - KL(Q_i(z) \Vert P(z|X)) = E_Q[\log(P(X|z))] - KL(Q_i(z) \Vert P(z))</math><br>
Since the KL divergence is greater than or equal to 0, our variational bound follows.
}}
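The bound can be checked numerically for a toy model with a discrete latent variable, where every quantity is computable exactly. The specific probability values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
# Toy joint over a discrete latent z in {0,...,k-1} and one observed value x
p_z = np.array([0.1, 0.2, 0.3, 0.4])                  # prior P(z)
p_x_given_z = np.array([0.5, 0.25, 0.125, 0.0625])    # P(x | z) at the observed x
p_x = float(np.sum(p_z * p_x_given_z))                # marginal P(x)

def elbo(q):
    # E_{Z~Q}[log P(x | Z)] - KL(Q || P(z)) for a proposal pmf q over z
    return float(np.sum(q * np.log(p_x_given_z)) - np.sum(q * np.log(q / p_z)))

q_rand = rng.dirichlet(np.ones(k))                    # an arbitrary proposal Q
posterior = p_z * p_x_given_z / p_x                   # P(z | x), the optimal Q
```

For any proposal `q_rand` the bound `elbo(q_rand) <= log(p_x)` holds, and it is tight exactly when Q equals the posterior, which mirrors the rearranged identity in the derivation above.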
==GANs==
{{main | Generative adversarial network}}
===Wasserstein GAN===
[https://arxiv.org/abs/1701.07875 Paper]<br>
The main idea is to ensure that the discriminator is Lipschitz continuous and to limit its Lipschitz constant (i.e. the magnitude of its derivative).<br>
If the correct answer is 1.0 and the generator produces 1.0001, we don't want the discriminator to give us a very high loss.<br>
====Earth mover's distance====
{{main | wikipedia:earth mover's distance}}
The minimum cost of converting one pile of dirt to another, where the cost of moving dirt is (amount * distance).<br>
Given a set <math>P</math> with m clusters and a set <math>Q</math> with n clusters, with optimal flows <math>f_{i,j}</math> from cluster i of P to cluster j of Q and ground distances <math>d_{i,j}</math>:<br>
<math>EMD(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}d_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}}</math><br>
;Notes
* Also known as the Wasserstein metric
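For the special case of two equal-size sets of unit piles on a line, the optimal flow simply matches sorted points in order, so the EMD reduces to a mean absolute difference. This sketch covers only that 1-D equal-weight case, not the general flow problem.

```python
import numpy as np

def emd_1d(p, q):
    # EMD between two equal-size sets of unit "dirt piles" on a line.
    # In 1-D with equal weights the optimal flow matches sorted points in
    # order, so the normalized cost is the mean absolute sorted difference.
    return float(np.mean(np.abs(np.sort(np.asarray(p, dtype=float))
                                - np.sort(np.asarray(q, dtype=float)))))

d = emd_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])   # shift every pile right by 1 -> 1.0
```

Shifting every pile by the same distance costs exactly that distance per unit of dirt, while identical pile sets (in any order) cost nothing.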
==Dimension Reduction==
Goal: Reduce the dimension of a dataset.<br>
If each example <math>x \in \mathbb{R}^n</math>, we want to reduce each example to be in <math>\mathbb{R}^k</math> where <math>k < n</math>
===PCA===
Principal Component Analysis<br>
Preprocessing: Subtract the sample mean from each example so that the new sample mean is 0.<br>
Goal: Find a vector <math>v_1</math> such that the projection <math>v_1 \cdot x</math> has maximum variance.<br>
Result: These principal components are the eigenvectors of <math>X^TX</math>.<br>
Idea: Maximize the variance of the projections subject to <math>\Vert v_1 \Vert = 1</math>:<br>
<math>\max_{\Vert v_1 \Vert = 1} \frac{1}{m}\sum (v_1 \cdot x^{(i)})^2</math><br>
Note that <math>\sum (v_1 \cdot x^{(i)})^2 = \sum v_1^T x^{(i)} (x^{(i)})^T v_1 = v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1</math>.<br>
<!--
Thus our lagrangian is <math>\max_{\alpha} v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1 + \alpha(\Vert v_1 \Vert^2 - 1)</math><br>
Taking the gradient of this we get:<br>
<math>\nabla_{v_1} \left( v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1 + \alpha(\Vert v_1 \Vert^2 - 1) \right)</math><br>
<math>= 2(\sum x^{(i)} (x^{(i)})^T) v_1 + 2\alpha v_1</math>
-->
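The recipe above (center the data, then take the top eigenvector of <math>X^TX</math>) can be sketched in a few lines of NumPy; the synthetic correlated dataset is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data (illustrative): most variance lies along one direction
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)               # preprocessing: subtract the sample mean

# The principal components are the eigenvectors of X^T X
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
v1 = eigvecs[:, np.argmax(eigvals)]  # unit direction of maximum projected variance

X_reduced = X @ v1                   # reduce each example to k = 1 dimension
```

Using `eigh` (for symmetric matrices) on <math>X^TX</math> gives the same top direction, up to sign, as the first right singular vector of X.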
===Kernel PCA===
{{main | Wikipedia: Kernel principal component analysis}}
===Autoencoder===
You have an encoder and a decoder, both of which are neural networks.
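A minimal sketch of the idea, using purely linear encoder/decoder maps trained by gradient descent on the squared reconstruction error (a real autoencoder would use nonlinear neural networks; the data construction and hyperparameters here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 8, 2, 200
# Data lying near a k-dimensional subspace of R^n (illustrative construction)
X = rng.normal(size=(m, k)) @ rng.normal(size=(k, n)) + 0.01 * rng.normal(size=(m, n))

# A minimal linear autoencoder: encoder E maps R^n -> R^k, decoder D maps back
E = 0.1 * rng.normal(size=(n, k))
D = 0.1 * rng.normal(size=(k, n))

def loss(E, D):
    return float(np.mean((X @ E @ D - X) ** 2))   # squared reconstruction error

loss_before = loss(E, D)
lr = 1e-3
for _ in range(500):
    Z = X @ E                           # encode each example to k dimensions
    err = Z @ D - X                     # reconstruction residual
    D -= lr * (Z.T @ err) / m           # gradient step on the decoder
    E -= lr * (X.T @ (err @ D.T)) / m   # gradient step on the encoder
loss_after = loss(E, D)
```

A linear autoencoder trained this way converges toward the same k-dimensional subspace PCA finds, which is one way to see the connection between the two methods.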