Unsupervised Learning: Difference between revisions

Unsupervised Learning (view source)

Revision as of 13:36, 9 December 2019

4,022 bytes added , 9 December 2019

→‎Kernel PCA

David

Bureaucrats, Interface administrators, Administrators

5,321

edits

@@ Line 77: / Line 77: @@
 </math><br>
 <math>
-\implies \log\left[E_{Q}\left(\frac{Pr(x^{(i)}, q^{(i)}; \theta)}{Q^{(i)}_{(j)}}\right)\right] \geq E_{Q} \left[ \log \left(\frac{Pr(x^{(i)}, q^{(i)}; \theta)}{Q^{(i)}(j)} \right)\right]
+\implies \log\left[E_{Q}\left(\frac{P(x^{(i)}, q^{(i)}; \theta)}{Q^{(i)}_{(j)}}\right)\right] \geq E_{Q} \left[ \log \left(\frac{P(x^{(i)}, q^{(i)}; \theta)}{Q^{(i)}(j)} \right)\right]
 </math><br>
-Let <math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
+Let <math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
 The EM algorithm is an iterative algorithm which alternates between an E-Step and an M-Step.
 =====E-Step=====
@@ Line 88: / Line 88: @@
 This implies <math>Q^{(i)}(j) = c * P(x^{(i)}, z^{(i)} = j ; \theta)</math>.<br>
 Since Q is a pmf, we have <math>Q^{(i)}(j) = \frac{1}{P(x^{(i)})} * P(x^{(i)}, z^{(i)} = j ; \theta) = P(z^{(i)} ; x^{(i)}, \theta)</math><br>
-Q is updated with <math>Q^{(i)}(j) = \frac{Pr(z^{(i)}=j)Pr(x^{(i)}|z^{(i)}=j)}{Pr(x^{(i)})}</math>
+Q is updated with <math>Q^{(i)}(j) = \frac{P(z^{(i)}=j)P(x^{(i)}|z^{(i)}=j)}{P(x^{(i)})}</math>
 {{hidden | Maximization w.r.t Q |
-<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
+<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
 We are assuming that Q is a pmf so <math>\sum_j Q = 1 </math><br>
 Our lagrangian is:<br>
 <math>
-\max_{Q} \min_{\beta} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right) +\beta (\sum_j Q^{(i)}_{(j)} - 1)\right]
+\max_{Q} \min_{\beta} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right) +\beta (\sum_j Q^{(i)}_{(j)} - 1)\right]
 </math>
 We can maximize each <math>Q^{(i)}</math> independently.<br>
 Taking the derivative wrt Q we get:<br>
 <math>
-\frac{\partial}{\partial Q^{(i)}_{(j)}} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right) +\beta (\sum_j Q^{(i)}_{(j)} - 1)
+\frac{\partial}{\partial Q^{(i)}_{(j)}} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right) +\beta (\sum_j Q^{(i)}_{(j)} - 1)
 </math><br>
 <math>
-= \log(\frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q}) - Q \frac{Q}{Pr(x^{(i)}, z^{(i)}=j;\theta)} (Pr(x^{(i)}, z^{(i)}=j;\theta))(Q^{-2}) + \beta
+= \log(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q}) - Q \frac{Q}{P(x^{(i)}, z^{(i)}=j;\theta)} (P(x^{(i)}, z^{(i)}=j;\theta))(Q^{-2}) + \beta
 </math>
 </math><br>
 <math>
-= \log(\frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q}) - 1 + \beta = 0
+= \log(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q}) - 1 + \beta = 0
 </math><br>
 <math>
-\implies Q^{(i)}_{(j)} = (\frac{1}{exp(1-\beta)})Pr(x^{(i)}, z^{(i)}=j;\theta)
+\implies Q^{(i)}_{(j)} = (\frac{1}{exp(1-\beta)})P(x^{(i)}, z^{(i)}=j;\theta)
 </math><br>
-Since Q is a pmf, we know it sums to 1 so we get the same result replacing <math>(\frac{1}{exp(1-\beta)})</math> with <math>Pr(x^{(i)}</math>
+Since Q is a pmf, we know it sums to 1 so we get the same result replacing <math>(\frac{1}{exp(1-\beta)})</math> with <math>P(x^{(i)}</math>
 }}
 =====M-Step=====
 We will fix <math>Q</math> and maximize J wrt <math>\theta</math>
-<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
+<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
 Assume <math>\Sigma_j = I</math> for simplicity.<br>
 Then<br>
 <math>
-J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)
+J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)
 </math><br>
 <math>
-= \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( Pr(x^{(i)}, z^{(i)}=j;\theta)) + C_1
+= \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( P(x^{(i)}, z^{(i)}=j;\theta)) + C_1
 </math><br>
 <math>
-= \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( Pr(x^{(i)} \mid z^{(i)}=j) Pr(z^{(i)}=j)) + C_1
+= \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j) P(z^{(i)}=j)) + C_1
 </math><br>
 <math>
-= \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( Pr(x^{(i)} \mid z^{(i)}=j)) -  Q^{(i)}_{(j)} \log( Pr(z^{(i)}=j)) + C_1
+= \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j)) -  Q^{(i)}_{(j)} \log( P(z^{(i)}=j)) + C_1
 </math><br>
 <math>
@@ Line 172: / Line 172: @@
 Plugging in <math>\beta = -m</math> into our equation for <math>\phi_j</math> we get <math>\phi_j = \frac{1}{m}\sum_{i=1}^{m}Q^{(i)}_{(j)}</math>
 }}
+==Generative Models==
+Goal: Generate realistic but fake samples.
+Applications: Denoising, impainting
+===VAEs===
+Variational Auto-Encoders<br>
+[https://arxiv.org/abs/1606.05908 Tutorial]<br>
+====KL Divergence====
+* [[Wikipedia:Kullback–Leibler divergence]]<br>
+* Kullback–Leibler divergence<br>
+* <math>KL(P \Vert Q) =E_{P}\left[ \log(\frac{P(X)}{Q(X)}) \right]</math>
+; Notes
+* KL is always >= 0
+* KL is not symmetric
+* Jensen-Shannon Divergence
+** <math>JSD(P \Vert Q) = \frac{1}{2}KL(P \Vert Q) + \frac{1}{2}KL(Q \Vert P)</math>
+** This is symmetric
+====Model====
+Our model for how the data is generated is as follows:
+* Generate latent variables <math>z^{(1)},...,z^{(m)} \in \mathbb{R}^r</math> iid where dimension r is less than n.
+** We assume <math>Z^{(i)} \sim N(\mathbf{0},\mathbf{I})</math>
+* Generate <math>x^{(i)}</math> where <math>X^{(i)} \vert Z^{(i)} \sim N(g_{\theta}(z), \sigma^2 \mathbf{I})</math>
+** For some function <math>g_{\theta_1}</math> parameterized by <math>\theta_1</math>
+====Variational Bound====
+The variational bound is:
+* <math>\log P(x^{(i)}) \geq E_{Z \sim Q_i}[\log P(X^{(i)} \vert Z)] - KL(Q_i(z) \Vert P(z))</math>
+{{hidden | Derivation |
+We know from Baye's rule that <math>P(z|X) = \frac{P(X|z)P(z)}{P(X)}</math>.<br>
+Plugging this into the equation for <math>KL(Q_i(z) \Vert P(z|X))</math> yields our inequality.<br>
+<math>KL(Q_i(z) \Vert P(z|X)) = E_{Q} \left[ \log(\frac{Q_i(z)}{P(z|X)}) \right]</math><br>
+<math>=E_Q(\log(\frac{Q_i(z) P(X^{(i)})}{P(X_z)P(z)})</math><br>
+<math>=E_Q(\log(\frac{Q_i(z)}{P(z)})) + \log(P(x^{(i)})) - E_Q(\log(P(X|z))</math><br>
+<math>=KL(Q_i(z) \Vert P(z)) + \log(P(x^{(i)}) - E_Q(\log(P(X|z))</math><br>
+Rearranging terms we get:<br>
+<math>\log P(x^{(i)}) - KL(Q_i(z) \Vert P(z|X)) = E_Q(\log(P(X|z)) - KL(Q_i(z) \Vert P(z))</math><br>
+Since the KL divergence is greater than or equal to 0, our variational bound follows.
+}}
+==GANs==
+{{main | Generative adversarial network}}
+===Wasserstein GAN===
+[https://arxiv.org/abs/1701.07875 Paper]<br>
+The main idea is to ensure the that discriminator is lipschitz continuous and to limit the lipschitz constant (i.e. the derivative) of the discriminator.<br>
+If the correct answer is 1.0 and the generator produces 1.0001, we don't want the discriminator to give us a very high loss.<br>
+====Earth mover's distance====
+{{main | wikipedia:earth mover's distance}}
+The minimum cost of converting one pile of dirt to another.<br>
+Where cost is the cost of moving (amount * distance)<br>
+Given a set <math>P</math> with m clusters and a set <math>Q</math> with n clusters:<br>
+...
+<math>EMD(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}d_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}}</math><br>
+;Notes
+* Also known as Wasserstein metric
+==Dimension Reduction==
+Goal: Reduce the dimension of a dataset.<br>
+If each example <math>x \in \mathbb{R}^n</math>, we want to reduce each example to be in <math>\mathbb{R}^k</math> where <math>k < n</math>
+===PCA===
+Principal Component Analysis<br>
+Preprocessing: Subtract the sample mean from each example so that the new sample mean is 0.<br>
+Goal: Find a vector <math>v_1</math> such that the projection <math>v_1 \cdot x</math> has maximum variance.<br>
+Result: These principal components are the eigenvectors of <math>X^TX</math>.<br>
+Idea: Maximize the variance of the projections.<br>
+<math>\max \frac{1}{m}\sum (v_1 \cdot x^{(i)})^2</math><br>
+Note that <math>\sum (v_1 \cdot x^{(i)})^2 = \sum v_1^T x^{(i)} (x^{(i)})^T v_1 = v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1</math>.<br>
+<!--
+Thus our lagrangian is <math>\max_{\alpha} v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1 + \alpha(\Vert v_1 \Vert^2 - 1)</math><br>
+Taking the gradient of this we get:<br>
+<math>\nabla_{v_1} (v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1) + \alpha(\Vert v_1 \Vert^2 - 1)</math><br>
+<math>= 2(\sum x^{(i)} (x^{(i)})^T) v_1 + 2\alpha v_1</math>
+-->
+===Kernel PCA===
+{{main | Wikipedia: Kernel principal component analysis}}
+===Autoencoder===
+You have a encoder and a decoder which are both neural networks.