Unsupervised Learning

 
Here we wish to cluster them to minimize the distance to the mean of their cluster.<br>
In our formulation we have k clusters.<br>
The mean of each cluster <math>\mu_i</math> is called the centroid.<br>
====Optimization====
Let <math>\mathbf{\mu}</math> denote the centroids and let <math>\mathbf{z}</math> denote the cluster labels for our data.<br>
If we fix <math>\mathbf{z}</math> then we need to minimize <math>L(\mu, \mathbf{z})</math> wrt <math>\mu</math>.<br>
Taking the gradient and setting it to 0 we get:<br>
<math display="block">
\begin{aligned}
\nabla_{\mu} L(\mu, \mathbf{z}) &= \nabla_{\mu} \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\
&= \nabla_{\mu} \sum_{j=1}^{k} \sum_{i \mid z^{(i)}=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\
&= \nabla_{\mu} \sum_{j=1}^{k} \sum_{i \mid z^{(i)}=j} \Vert x^{(i)} - \mu_{j} \Vert ^2\\
&= \sum_{j=1}^{k} \sum_{i \mid z^{(i)}=j} \nabla_{\mu} \Vert x^{(i)} - \mu_{j} \Vert ^2\\
&= -\sum_{j=1}^{k} \sum_{i \mid z^{(i)}=j} 2(x^{(i)} - \mu_{j}) = 0\\
\implies \mu_{j} &= \left(\sum_{i \mid z^{(i)}=j} x^{(i)}\right) \Big/ \left(\sum_{i \mid z^{(i)}=j} 1\right) \quad \forall j
\end{aligned}
</math>
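So each centroid is the mean of the points assigned to it, and alternating between the assignment step and this update is exactly Lloyd's algorithm for k-means. A minimal NumPy sketch of that alternation; the function name and the random initialization scheme are illustrative choices, not part of the derivation above.
<syntaxhighlight lang="python">
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid (fix mu, minimize over z) and recomputing each
    centroid as the mean of its cluster (fix z, minimize over mu)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: z^(i) = argmin_j ||x^(i) - mu_j||^2
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Update step: mu_j = mean of the points assigned to cluster j
        new_mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break  # converged: assignments will no longer change
        mu = new_mu
    return mu, z
</syntaxhighlight>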


By introducing an extra variable <math>Q^{(i)}_{(j)}</math> we'll be able to apply Jensen's inequality to the concave log function.<br>
Assume <math>Q^{(i)}_{(j)}</math> is a probability mass function.<br>
<math display="block">
\begin{aligned}
l(\theta) &= \sum_{i=1}^{m} \log \sum_{j=1}^{k} P(x^{(i)}, z^{(i)}=j; \theta)\\
&= \sum_{i=1}^{m} \log \sum_{j=1}^{k} \frac{Q^{(i)}_{(j)}}{Q^{(i)}_{(j)}} P(x^{(i)}, z^{(i)}=j; \theta)\\
&\geq \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log\left(\frac{P(x^{(i)}, z^{(i)}=j; \theta)}{Q^{(i)}_{(j)}}\right)
\end{aligned}
</math>
The last step uses Jensen's inequality for the concave log: <math>\log\left[E_{Q}\left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}_{(z^{(i)})}}\right)\right] \geq E_{Q} \left[ \log \left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}_{(z^{(i)})}} \right)\right]</math>.<br>
Let <math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)</math><br>
The EM algorithm is an iterative algorithm which alternates between an E-Step and an M-Step.
=====E-Step=====
We will fix <math>\theta</math> and maximize J wrt <math>Q</math>.<br>
We can maximize each <math>Q^{(i)}</math> independently.<br>
Taking the derivative wrt Q we get:<br>
<math display="block">
\begin{aligned}
&\frac{\partial}{\partial Q^{(i)}_{(j)}} \left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right) + \beta \left(\sum_j Q^{(i)}_{(j)} - 1\right) \right]\\
&= \log\left(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}}\right) + Q^{(i)}_{(j)} \cdot \frac{Q^{(i)}_{(j)}}{P(x^{(i)}, z^{(i)}=j;\theta)} \cdot \left(-P(x^{(i)}, z^{(i)}=j;\theta)\right)\left(Q^{(i)}_{(j)}\right)^{-2} + \beta\\
&= \log\left(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}}\right) - 1 + \beta = 0\\
\implies Q^{(i)}_{(j)} &= \frac{1}{\exp(1-\beta)}\, P(x^{(i)}, z^{(i)}=j;\theta)
\end{aligned}
</math>
Since <math>Q^{(i)}</math> is a pmf, it sums to 1 over <math>j</math>, so the normalizer must be <math>\frac{1}{P(x^{(i)};\theta)} = \frac{1}{\sum_j P(x^{(i)}, z^{(i)}=j;\theta)}</math>. Replacing <math>\frac{1}{\exp(1-\beta)}</math> with this gives <math>Q^{(i)}_{(j)} = P(z^{(i)}=j \mid x^{(i)};\theta)</math>.
=====M-Step=====
Now we fix <math>Q</math> and maximize J wrt <math>\theta</math>.<br>
<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)</math><br>
Assume <math>\Sigma_j = I</math> for simplicity.<br>
Then:
<math display="block">
\begin{aligned}
J(\theta, Q) &= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)}, z^{(i)}=j;\theta)) + C_1 \\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j) P(z^{(i)}=j)) + C_1\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j)) + Q^{(i)}_{(j)} \log( P(z^{(i)}=j)) \right] + C_1\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ Q^{(i)}_{(j)} \log \left( (2\pi)^{-n/2}\exp(-\Vert x^{(i)} - \mu_j \Vert^2 / 2) \right) + Q^{(i)}_{(j)} \log( \phi_j) \right] + C_1 \\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ -Q^{(i)}_{(j)} \Vert x^{(i)} - \mu_j \Vert^2 / 2 + Q^{(i)}_{(j)} \log( \phi_j) \right] + C_2
\end{aligned}
</math>
Maximizing wrt <math>\mu</math>, we get <math>\mu_j^* = \left(\sum_{i} Q^{(i)}_{(j)}x^{(i)}\right) / \left(\sum_{i}Q^{(i)}_{(j)}\right)</math>.<br>
Maximizing wrt <math>\phi</math> under the constraint <math>\sum_j \phi_j = 1</math> (with Lagrange multiplier <math>\beta</math>), the stationarity condition gives <math>\phi_j = \frac{-1}{\beta}\sum_{i}Q^{(i)}_{(j)}</math>. Substituting this back, it remains to solve:<br>
<math>\min_{\beta} \sum_{j=1}^{k} \left[\log\left(\frac{-1}{\beta}\left(\sum_{i}Q^{(i)}_{(j)}\right)\right) \sum_{i=1}^{m} Q^{(i)}_{(j)}  - \left(\sum_{i}Q^{(i)}_{(j)}\right) - (\beta/k) \right]</math><br>
Taking the derivative with respect to <math>\beta</math>, we get:<br>
<math display="block">
\begin{aligned}
&\sum_{j=1}^{k} \left[\frac{1}{(-1/\beta)(\sum Q)}(-\sum Q)(-\beta^{-2})(\sum Q) - \frac{1}{k}\right]\\
&= \sum_{j=1}^{k} \left[ (\beta)(-\beta^{-2})(\sum Q) - \frac{1}{k}\right] \\
&= \sum_{j=1}^{k} \left[\frac{-1}{\beta}\left(\sum_{i=1}^{m} Q\right) - \frac{1}{k}\right]\\
&= \sum_{i=1}^{m} \frac{-1}{\beta} \sum_{j=1}^{k}P(z^{(i)} = j \mid x^{(i)}) - \sum_{j=1}^{k}\frac{1}{k}\\
&= \frac{-1}{\beta}\sum_{i=1}^{m}1 - 1\\
&= \frac{-m}{\beta} - 1 = 0\\
\implies \beta &= -m
\end{aligned}
</math>
Plugging in <math>\beta = -m</math> into our equation for <math>\phi_j</math> we get <math>\phi_j = \frac{1}{m}\sum_{i=1}^{m}Q^{(i)}_{(j)}</math>
}}
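Putting the two steps together, here is a minimal NumPy sketch of EM for a mixture of k spherical Gaussians (<math>\Sigma_j = I</math>, as assumed above), implementing the derived updates for <math>Q</math>, <math>\mu_j</math>, and <math>\phi_j</math>; the initialization and iteration count are illustrative choices.
<syntaxhighlight lang="python">
import numpy as np

def em_gmm_spherical(X, k, n_iters=50, seed=0):
    """EM for k spherical Gaussians (Sigma_j = I), using the updates above:
      E-step: Q^(i)_(j) = P(z=j | x^(i))           (posterior responsibilities)
      M-step: mu_j  = sum_i Q^(i)_(j) x^(i) / sum_i Q^(i)_(j)
              phi_j = (1/m) sum_i Q^(i)_(j)
    """
    m, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]
    phi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: log P(x, z=j) up to an additive constant, then normalize.
        log_joint = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) \
                    + np.log(phi)[None, :]
        log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
        Q = np.exp(log_joint)
        Q /= Q.sum(axis=1, keepdims=True)   # Q^(i)_(j) = P(z=j | x^(i))
        # M-step: responsibility-weighted means and mixing proportions.
        Nj = Q.sum(axis=0)                  # sum_i Q^(i)_(j)
        mu = (Q.T @ X) / Nj[:, None]        # mu_j*
        phi = Nj / m                        # phi_j
    return mu, phi, Q
</syntaxhighlight>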
===DBSCAN===


==Generative Models==
* KL is always >= 0
* KL is not symmetric
* Jensen-Shannon Divergence
** <math>JSD(P \Vert Q) = \frac{1}{2}KL(P \Vert M) + \frac{1}{2}KL(Q \Vert M)</math> where <math>M = \frac{1}{2}(P + Q)</math>
** This is symmetric
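As a quick sanity check of these properties, a small NumPy sketch for discrete distributions (the function names are ours):
<syntaxhighlight lang="python">
import numpy as np

def kl(p, q):
    """KL(P || Q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p = 0 contribute 0 to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q)/2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.1, 0.4, 0.5]
print(kl(p, q) >= 0, jsd(p, q) >= 0)   # True True: both are non-negative
print(jsd(p, q) == jsd(q, p))          # True: JSD is symmetric, KL is not
</syntaxhighlight>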


====Model====
* Generate latent variables <math>z^{(1)},...,z^{(m)} \in \mathbb{R}^r</math> iid where dimension r is less than n.
** We assume <math>Z^{(i)} \sim N(\mathbf{0},\mathbf{I})</math>
* Generate <math>x^{(i)}</math> where <math>X^{(i)} \vert Z^{(i)} \sim N(g_{\theta}(z), \sigma^2 \mathbf{I})</math>
** For some function <math>g_{\theta_1}</math> parameterized by <math>\theta_1</math> (a sampling sketch follows this list)
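A sketch of ancestral sampling from this generative model: draw <math>z \sim N(0, I)</math>, then draw <math>x \mid z</math> around <math>g_{\theta_1}(z)</math>. The toy linear decoder stands in for <math>g_{\theta_1}</math> and is purely illustrative.
<syntaxhighlight lang="python">
import numpy as np

def sample_from_model(g_theta, m, r, n, sigma=0.1, seed=0):
    """Ancestral sampling: z ~ N(0, I_r), then x | z ~ N(g_theta(z), sigma^2 I_n)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((m, r))                    # z^(1..m) iid N(0, I)
    X = g_theta(Z) + sigma * rng.standard_normal((m, n))
    return Z, X

# Toy linear decoder mapping r=2 latents to n=5 observations (illustrative only).
W = np.random.default_rng(1).standard_normal((2, 5))
Z, X = sample_from_model(lambda Z: Z @ W, m=100, r=2, n=5)
</syntaxhighlight>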


We know from Bayes' rule that <math>P(z|X) = \frac{P(X|z)P(z)}{P(X)}</math>.<br>
Plugging this into the equation for <math>KL(Q_i(z) \Vert P(z|X))</math> yields our inequality.<br>
<math display="block">
\begin{aligned}
KL(Q_i(z) \Vert P(z|X)) &= E_{Q} \left[ \log\left(\frac{Q_i(z)}{P(z|X)}\right) \right]\\
&= E_Q\left[\log\left(\frac{Q_i(z) P(x^{(i)})}{P(X|z)P(z)}\right)\right]\\
&= E_Q\left[\log\left(\frac{Q_i(z)}{P(z)}\right)\right] + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]\\
&= KL(Q_i(z) \Vert P(z)) + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]
\end{aligned}
</math>
Rearranging terms we get:<br>
<math>\log P(x^{(i)}) - KL(Q_i(z) \Vert P(z|X)) = E_Q[\log(P(X|z))] - KL(Q_i(z) \Vert P(z))</math><br>
}}
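The right-hand side of the rearranged identity is the evidence lower bound (ELBO) that variational autoencoders maximize. A sketch of it for one example, assuming a Gaussian encoder <math>Q(z) = N(\mu_q, \operatorname{diag}(\sigma_q^2))</math>, prior <math>P(z) = N(0, I)</math>, and decoder likelihood <math>P(x|z) = N(\hat{x}, \sigma^2 I)</math>, for which the KL term has the standard closed form; the argument names are ours.
<syntaxhighlight lang="python">
import numpy as np

def elbo(x, mu_q, log_var_q, x_recon, sigma=1.0):
    """log P(x) >= E_Q[log P(x|z)] - KL(Q(z) || P(z)), per the identity above.
    With Q = N(mu_q, diag(exp(log_var_q))) and P(z) = N(0, I), the KL term is
    0.5 * sum(sigma_q^2 + mu_q^2 - 1 - log sigma_q^2). With a Gaussian decoder,
    the reconstruction log-likelihood is a scaled squared error (up to a constant)."""
    recon_ll = -np.sum((x - x_recon) ** 2) / (2 * sigma ** 2)
    kl_term = 0.5 * np.sum(np.exp(log_var_q) + mu_q ** 2 - 1.0 - log_var_q)
    return recon_ll - kl_term  # maximize this (equivalently, minimize its negative)
</syntaxhighlight>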


===GANs===
{{main | Generative adversarial network}}


====Wasserstein GAN====
[https://arxiv.org/abs/1701.07875 Paper]<br>
The main idea is to ensure that the discriminator is Lipschitz continuous and to limit its Lipschitz constant (i.e. the magnitude of its derivative).<br>
If the correct answer is 1.0 and the generator produces 1.0001, we don't want the discriminator to give us a very high loss.<br>
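The original paper enforces this by clipping the critic's (discriminator's) weights to a small box after each update, which bounds the network's Lipschitz constant; the paper uses <math>c = 0.01</math>. A PyTorch sketch:
<syntaxhighlight lang="python">
import torch

def clip_critic_weights(critic: torch.nn.Module, c: float = 0.01):
    """After each critic update, clip every weight to [-c, c].
    Bounded weights give a bounded Lipschitz constant (the weight-clipping
    approach from the original WGAN paper)."""
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
</syntaxhighlight>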
;Earth mover's distance
{{main | wikipedia:earth mover's distance}}
The minimum cost of converting one pile of dirt into another,<br>
where the cost of a move is the amount of dirt moved times the distance it is moved.<br>
Given a set <math>P</math> with m clusters and a set <math>Q</math> with n clusters:<br>
...
<math>EMD(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}d_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}}</math><br>
;Notes
* Also known as Wasserstein metric
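For one-dimensional samples, SciPy computes this distance directly; a quick illustrative check (the sample values are ours):
<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D earth mover's distance between two empirical samples.
u = np.array([0.0, 1.0, 3.0])
v = np.array([5.0, 6.0, 8.0])   # same shape as u, shifted right by 5
print(wasserstein_distance(u, v))  # 5.0: every unit of mass moves 5 units
</syntaxhighlight>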
==Dimension Reduction==
Goal: Reduce the dimension of a dataset.<br>
If each example <math>x \in \mathbb{R}^n</math>, we want to reduce each example to be in <math>\mathbb{R}^k</math> where <math>k < n</math>
===PCA===
Principal Component Analysis<br>
Preprocessing: Subtract the sample mean from each example so that the new sample mean is 0.<br>
Goal: Find a vector <math>v_1</math> such that the projection <math>v_1 \cdot x</math> has maximum variance.<br>
Result: These principal components are the eigenvectors of <math>X^TX</math>.<br>
Idea: Maximize the variance of the projections.<br>
<math>\max_{\Vert v_1 \Vert = 1} \frac{1}{m}\sum (v_1 \cdot x^{(i)})^2</math><br>
Note that <math>\sum (v_1 \cdot x^{(i)})^2 = \sum v_1^T x^{(i)} (x^{(i)})^T v_1 = v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1</math>.<br>
<!--
Thus our lagrangian is <math>\max_{\alpha} v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1 + \alpha(\Vert v_1 \Vert^2 - 1)</math><br>
Taking the gradient of this we get:<br>
<math>\nabla_{v_1} (v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1) + \alpha(\Vert v_1 \Vert^2 - 1)</math><br>
<math>= 2(\sum x^{(i)} (x^{(i)})^T) v_1 + 2\alpha v_1</math>
-->
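A minimal NumPy sketch of this procedure: center the data, then project onto the top-<math>k</math> eigenvectors of <math>X^TX</math> (the function name is ours):
<syntaxhighlight lang="python">
import numpy as np

def pca(X, k):
    """Project data (rows are examples) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # preprocessing: zero the sample mean
    cov = Xc.T @ Xc                          # proportional to the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix; ascending eigenvalues
    V = eigvecs[:, ::-1][:, :k]              # top-k eigenvectors = max-variance directions
    return Xc @ V                            # reduced representation in R^k
</syntaxhighlight>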
===Kernel PCA===
{{main | Wikipedia: Kernel principal component analysis}}
===Autoencoder===
You have an encoder and a decoder, which are both neural networks.
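A minimal PyTorch sketch; the layer sizes and depth are arbitrary illustrative choices:
<syntaxhighlight lang="python">
import torch
from torch import nn

class Autoencoder(nn.Module):
    """Encoder compresses x in R^n to a code in R^k; decoder reconstructs it.
    Trained to minimize reconstruction error, e.g. ||x - decoder(encoder(x))||^2."""
    def __init__(self, n=784, k=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, k))
        self.decoder = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, n))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training would minimize nn.MSELoss()(model(x), x) with an optimizer such as Adam.
</syntaxhighlight>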
==Resources==
;Clustering
* [https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 TowardsDataScience: top 5 clustering algos]
* [https://machinelearningmastery.com/clustering-algorithms-with-python/ MachineLearningMastery: Clustering Algorithms]