Unsupervised Learning
Here we wish to cluster them to minimize the distance to the mean of their cluster.<br>
In our formulation we have <math>k</math> clusters.<br>
The mean of each cluster <math>\mu_i</math> is called the centroid.<br>
====Optimization====
Let <math>\mathbf{\mu}</math> denote the centroids and let <math>\mathbf{z}</math> denote the cluster labels for our data.<br>
If we fix <math>\mathbf{z}</math>, then we need to minimize <math>L(\mu, \mathbf{z})</math> with respect to <math>\mu</math>.<br>
Taking the gradient and setting it to 0, we get:<br>
<math display="block">
\begin{aligned}
\nabla_{\mu} L(\mu, \mathbf{z}) &= \nabla_{\mu} \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\
&= \nabla_{\mu} \sum_{j=1}^{k} \sum_{i\mid z^{(i)}=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\
&= \nabla_{\mu} \sum_{j=1}^{k} \sum_{i\mid z^{(i)}=j} \Vert x^{(i)} - \mu_{j} \Vert ^2\\
&= \sum_{j=1}^{k} \sum_{i\mid z^{(i)}=j} \nabla_{\mu} \Vert x^{(i)} - \mu_{j} \Vert ^2\\
&= -\sum_{j=1}^{k} \sum_{i\mid z^{(i)}=j} 2(x^{(i)} - \mu_{j}) = 0\\
\implies \mu_{j} &= \frac{\sum_{i\mid z^{(i)}=j} x^{(i)}}{\sum_{i\mid z^{(i)}=j} 1} \quad \forall j
\end{aligned}
</math>
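These two minimizations (assign each point to its nearest centroid, then move each centroid to the mean of its cluster) are exactly the alternating steps of Lloyd's algorithm. A minimal NumPy sketch of the idea; the function and variable names are illustrative, and a fixed iteration count stands in for a proper convergence test:
<syntaxhighlight lang="python">
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate the two minimizations derived above."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(iters):
        # Fix mu, choose z: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Fix z, choose mu: each centroid becomes the mean of its cluster.
        for j in range(k):
            if np.any(z == j):
                mu[j] = X[z == j].mean(axis=0)
    return mu, z

# Three well-separated 2-D blobs as toy data.
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
mu, z = kmeans(X, k=3)
print(np.round(mu, 2))
</syntaxhighlight>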
By introducing an extra variable <math>Q^{(i)}_{(j)}</math> we'll be able to apply Jensen's inequality to the concave log function.<br>
Assume <math>Q^{(i)}_{(j)}</math> is a probability mass function.<br>
<math display="block">
\begin{align*}
l(\theta) &= \sum_{i=1}^{m} \log \sum_{j=1}^{k}P(x^{(i)}, z^{(i)}=j; \theta)\\
&=\sum_{i=1}^{m} \log \sum_{j=1}^{k} \frac{Q^{(i)}_{(j)}}{Q^{(i)}_{(j)}} P(x^{(i)}, z^{(i)}=j; \theta)\\
&\geq \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log\left(\frac{P(x^{(i)}, z^{(i)}=j; \theta)}{Q^{(i)}_{(j)}}\right)
\end{align*}
</math><br>
where the last step is Jensen's inequality for the concave log, <math>\log\left[E_{Q}\left(\frac{P(x^{(i)}, z^{(i)}=j; \theta)}{Q^{(i)}_{(j)}}\right)\right] \geq E_{Q} \left[ \log \left(\frac{P(x^{(i)}, z^{(i)}=j; \theta)}{Q^{(i)}_{(j)}} \right)\right]</math>, with the expectation taken over <math>j \sim Q^{(i)}</math>.<br>
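As a quick numerical sanity check (not part of the original derivation), Jensen's inequality for the concave log says <math>\log E_Q[X] \geq E_Q[\log X]</math> for any pmf <math>Q</math> and positive values <math>X</math>:
<syntaxhighlight lang="python">
import numpy as np

# Jensen's inequality for the concave log: log(E_Q[X]) >= E_Q[log(X)].
rng = np.random.default_rng(0)
Q = rng.random(5)
Q /= Q.sum()                  # an arbitrary pmf over k = 5 outcomes
X = 10 * rng.random(5)        # arbitrary positive values
print(np.log(np.sum(Q * X)), ">=", np.sum(Q * np.log(X)))
</syntaxhighlight>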
Let <math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)</math><br>
The EM algorithm is an iterative algorithm that alternates between an E-Step and an M-Step.
=====E-Step=====
We will fix <math>\theta</math> and maximize <math>J</math> with respect to <math>Q</math>.<br>
We can maximize each <math>Q^{(i)}</math> independently.<br>
Adding a Lagrange multiplier <math>\beta</math> for the constraint <math>\sum_j Q^{(i)}_{(j)} = 1</math> and taking the derivative with respect to <math>Q^{(i)}_{(j)}</math> (written as <math>Q</math> below for brevity), we get:<br>
<math display="block">
\begin{aligned}
\frac{\partial}{\partial Q^{(i)}_{(j)}} \left[\sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right) +\beta \left(\sum_j Q^{(i)}_{(j)} - 1\right)\right]
&= \log\left(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q}\right) - Q \frac{Q}{P(x^{(i)}, z^{(i)}=j;\theta)} (P(x^{(i)}, z^{(i)}=j;\theta))(Q^{-2}) + \beta\\
&= \log\left(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q}\right) - 1 + \beta \\
&= 0\\
\implies Q^{(i)}_{(j)} &= \frac{1}{\exp(1-\beta)}P(x^{(i)}, z^{(i)}=j;\theta)
\end{aligned}
</math><br>
Since <math>Q^{(i)}</math> is a pmf, we know it sums to 1, so we get the same result replacing <math>\frac{1}{\exp(1-\beta)}</math> with <math>\frac{1}{P(x^{(i)};\theta)}</math>, i.e. <math>Q^{(i)}_{(j)} = P(z^{(i)}=j \mid x^{(i)};\theta)</math>.
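In code, the E-step just computes these posteriors. A hedged NumPy sketch, assuming the unit-covariance Gaussian mixture used in the simplification below (function and variable names are illustrative):
<syntaxhighlight lang="python">
import numpy as np

def e_step(X, mu, phi):
    """Responsibilities Q[i, j] = P(z=j | x_i) for a mixture of unit-variance Gaussians."""
    # Unnormalized log P(x_i, z=j) = log(phi_j) - ||x_i - mu_j||^2 / 2  (+ a constant).
    log_joint = np.log(phi)[None, :] - 0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    Q = np.exp(log_joint)
    return Q / Q.sum(axis=1, keepdims=True)              # normalize over j; constants cancel
</syntaxhighlight>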
<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)</math><br>
Assume <math>\Sigma_j = I</math> for simplicity.<br>
Then:
<math display="block">
\begin{aligned}
J(\theta, Q) &= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)}, z^{(i)}=j;\theta)) + C_1 \\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j) P(z^{(i)}=j)) + C_1\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j)) + Q^{(i)}_{(j)} \log( P(z^{(i)}=j)) \right] + C_1\\
&= \sum_{i=1}^{m}\left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( (2\pi)^{-n/2}\exp(-\Vert x^{(i)} - \mu_j \Vert^2 / 2)\right) + Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_1 \\
&= \sum_{i=1}^{m}\left[ \sum_{j=1}^{k} -Q^{(i)}_{(j)} \Vert x^{(i)} - \mu_j \Vert^2 / 2 + Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_2
\end{aligned}
</math><br>
where <math>C_1</math> and <math>C_2</math> are constants that do not depend on <math>\theta</math>.<br>
Maximizing with respect to <math>\mu</math>, we get <math>\mu_j^* = (\sum_{i} Q^{(i)}_{(j)}x^{(i)}) / (\sum_{i}Q^{(i)}_{(j)})</math>.<br>
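With the responsibilities stored in an <math>m \times k</math> matrix <code>Q</code>, this weighted-mean update is one line of (hypothetical) NumPy:
<syntaxhighlight lang="python">
import numpy as np

def update_mu(X, Q):
    """mu_j^* = (sum_i Q[i, j] * x_i) / (sum_i Q[i, j]), vectorized over all j at once."""
    return (Q.T @ X) / Q.sum(axis=0)[:, None]
</syntaxhighlight>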
<math>= \min_{\beta} \sum_{j=1}^{k} \left[\log(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})) \sum_{i=1}^{m} Q^{(i)}_{(j)} -(\sum_{i}Q^{(i)}_{(j)}) - (\beta/k) \right]</math><br>
Taking the derivative with respect to <math>\beta</math>, we get:<br>
<math display="block">
\begin{aligned}
\sum_{j=1}^{k} \left[\left(\frac{1}{(-1/\beta)(\sum Q)}\right)(-\sum Q)(-\beta^{-2})(\sum Q) - \frac{1}{k}\right]
&= \sum_{j=1}^{k} \left[ (\beta)(-\beta^{-2})\left(\sum Q\right) - \frac{1}{k}\right] \\
&= \sum_{j=1}^{k} \left[\frac{-1}{\beta}\left(\sum_{i=1}^{m} Q\right) - \frac{1}{k}\right]\\
&= \frac{-1}{\beta}\sum_{i=1}^{m} \sum_{j=1}^{k}P(z^{(i)} = j \mid x^{(i)}) - \sum_{j=1}^{k}\frac{1}{k}\\
&= \frac{-1}{\beta}\sum_{i=1}^{m}1 - 1\\
&= \frac{-m}{\beta} - 1 = 0\\
\implies \beta &= -m
\end{aligned}
</math><br>
Plugging <math>\beta = -m</math> into our equation for <math>\phi_j</math>, we get <math>\phi_j = \frac{1}{m}\sum_{i=1}^{m}Q^{(i)}_{(j)}</math>.
}}
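Putting the E-step and M-step updates derived above together, here is a minimal NumPy sketch of EM for this unit-covariance Gaussian mixture (function and variable names are illustrative, not from any particular library):
<syntaxhighlight lang="python">
import numpy as np

def em_spherical_gmm(X, k, iters=50, seed=0):
    """EM for a mixture of unit-covariance Gaussians, using the updates derived above."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial means
    phi = np.full(k, 1.0 / k)                           # initial mixing weights
    for _ in range(iters):
        # E-step: Q[i, j] = P(z = j | x_i; theta).
        log_joint = np.log(phi)[None, :] - 0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        Q = np.exp(log_joint)
        Q /= Q.sum(axis=1, keepdims=True)
        # M-step: weighted means and phi_j = (1/m) * sum_i Q[i, j].
        mu = (Q.T @ X) / Q.sum(axis=0)[:, None]
        phi = Q.mean(axis=0)
    return mu, phi, Q
</syntaxhighlight>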
===Mean-Shift===
===DBSCAN===
==Generative Models==
We know from Bayes' rule that <math>P(z|X) = \frac{P(X|z)P(z)}{P(X)}</math>.<br>
Plugging this into the equation for <math>KL(Q_i(z) \Vert P(z|X))</math> yields our inequality.<br>
<math display="block">
\begin{aligned}
KL(Q_i(z) \Vert P(z|X)) &= E_{Q} \left[ \log\left(\frac{Q_i(z)}{P(z|X)}\right) \right]\\
&=E_Q\left[\log\left(\frac{Q_i(z) P(x^{(i)})}{P(X|z)P(z)}\right)\right]\\
&=E_Q\left[\log\left(\frac{Q_i(z)}{P(z)}\right)\right] + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]\\
&=KL(Q_i(z) \Vert P(z)) + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]
\end{aligned}
</math>
Rearranging terms, we get:<br>
<math>\log P(x^{(i)}) - KL(Q_i(z) \Vert P(z|X)) = E_Q(\log(P(X|z))) - KL(Q_i(z) \Vert P(z))</math><br>
}}
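The rearranged identity above is the evidence lower bound (ELBO) maximized when training a variational autoencoder. A hedged sketch of its two right-hand terms, assuming a diagonal Gaussian <math>Q_i(z)</math>, a standard normal prior <math>P(z)</math>, and a unit-variance Gaussian decoder; the names <code>mu_q</code>, <code>log_var_q</code>, and <code>x_recon</code> are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def elbo_terms(x, mu_q, log_var_q, x_recon):
    """E_Q[log P(x|z)] - KL(Q || P(z)) for a diagonal Gaussian Q and standard normal prior."""
    # Closed-form KL( N(mu_q, diag(exp(log_var_q))) || N(0, I) ).
    kl = 0.5 * np.sum(np.exp(log_var_q) + mu_q ** 2 - 1.0 - log_var_q)
    # Single-sample estimate of E_Q[log P(x | z)] up to an additive constant:
    # for a unit-variance Gaussian decoder this is -||x - x_recon||^2 / 2,
    # with x_recon being the decoder output for a z sampled from Q.
    recon = -0.5 * np.sum((x - x_recon) ** 2)
    return recon - kl          # the evidence lower bound (ELBO)

rng = np.random.default_rng(0)
x = rng.normal(size=10)
print(elbo_terms(x, mu_q=np.zeros(4), log_var_q=np.zeros(4), x_recon=0.9 * x))
</syntaxhighlight>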
===GANs===
{{main | Generative adversarial network}}
====Wasserstein GAN====
[https://arxiv.org/abs/1701.07875 Paper]<br>
The main idea is to ensure that the discriminator is Lipschitz continuous and to limit the Lipschitz constant (i.e. the derivative) of the discriminator.<br>
If the correct answer is 1.0 and the generator produces 1.0001, we don't want the discriminator to give us a very high loss.<br>
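As a toy illustration (not the actual WGAN training procedure): for a linear critic <math>f(x) = w^\top x</math> the Lipschitz constant is <math>\Vert w \Vert_2</math>, so clipping the weights directly caps how sharply the critic can react to a tiny change in its input:
<syntaxhighlight lang="python">
import numpy as np

# Toy linear "critic" f(x) = w @ x. For a linear map, the Lipschitz constant
# (w.r.t. the Euclidean norm) is exactly ||w||_2, so clipping the weights caps
# how much the critic's output can change for a small input perturbation.
rng = np.random.default_rng(0)
w = 10.0 * rng.normal(size=5)          # unclipped weights
w_clipped = np.clip(w, -0.01, 0.01)    # WGAN-style weight clipping

x = rng.normal(size=5)
y = x + 1e-4 * rng.normal(size=5)      # a tiny perturbation of x

for name, weights in [("unclipped", w), ("clipped", w_clipped)]:
    lipschitz = np.linalg.norm(weights)
    print(name, abs(weights @ x - weights @ y), "<=", lipschitz * np.linalg.norm(x - y))
</syntaxhighlight>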
;Earth mover's distance
{{main | wikipedia:earth mover's distance}}
The minimum cost of converting one pile of dirt into another, where the cost is the amount of dirt moved times the distance it is moved.<br>
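For one-dimensional distributions this is the quantity computed by <code>scipy.stats.wasserstein_distance</code>; a small example, assuming SciPy is available:
<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D earth mover's distance between two "piles of dirt".
# Moving one unit of mass from position 0 to position 2 costs 2.
print(wasserstein_distance([0.0], [2.0]))   # 2.0

# Between samples from two Gaussians shifted by 3, the distance is roughly 3.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, size=5000)
b = rng.normal(loc=3.0, size=5000)
print(wasserstein_distance(a, b))
</syntaxhighlight>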
===Autoencoder===
You have an encoder and a decoder, both of which are neural networks. The encoder compresses the input into a lower-dimensional latent code and the decoder reconstructs the input from that code; the pair is trained to minimize reconstruction error.
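A minimal sketch of such an encoder/decoder pair in PyTorch (assuming PyTorch is available; the layer sizes and the random stand-in batch are arbitrary):
<syntaxhighlight lang="python">
import torch
from torch import nn

# Minimal autoencoder: the encoder compresses the input to a low-dimensional
# code, the decoder reconstructs the input, and both are trained jointly to
# minimize reconstruction error.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)               # stand-in batch of data
for _ in range(5):                     # a few gradient steps
    loss = nn.functional.mse_loss(model(x), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(loss.item())
</syntaxhighlight>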
==Resources==
;Clustering
* [https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 TowardsDataScience: top 5 clustering algos]
* [https://machinelearningmastery.com/clustering-algorithms-with-python/ MachineLearningMastery: Clustering Algorithms]