Unsupervised Learning
Basics of Unsupervised Learning
Clustering
Given points \(\displaystyle \{x^{(1)},...,x^{(m)}\}\) we wish to group them into clusters 1,...,k.
Usually we are given the number of clusters, k.
The problem is to assign labels \(\displaystyle \{z^{(1)},...,z^{(m)}\}\) where \(\displaystyle z^{(i)} \in \{1,...,k\}\)
K-means
Here we wish to cluster the points so as to minimize each point's squared distance to the mean of its cluster.
In our formulation we have k clusters.
The mean of each cluster \(\displaystyle \mu_i\) is called the centroid.
Optimization
Let \(\displaystyle \mathbf{\mu}\) denote the centroids and let \(\displaystyle \mathbf{z}\) denote the cluster labels for our data.
Our loss function is \(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).
- Using coordinate-block descent
\(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).
If we fix \(\displaystyle \mu\) then we only need to minimize over the labels \(\displaystyle z^{(i)}\).
Since each term of the sum is independent, we simply choose the closest centroid for each point.
If we fix \(\displaystyle \mathbf{z}\) then we need to minimize \(\displaystyle L(\mu, \mathbf{z})\) wrt \(\displaystyle \mu\).
Taking the gradient and setting it to 0 we get:
\[
\begin{aligned}
\nabla_{\mu} L(\mu, \mathbf{z}) &= \nabla_{\mu} \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\
&= \nabla_{\mu} \sum_{j=1}^{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\
&= \nabla_{\mu} \sum_{j=1}^{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{j} \Vert ^2\\
&= \sum_{j=1}^{k} \sum_{i\mid z(i)=j} \nabla_{\mu} \Vert x^{(i)} - \mu_{j} \Vert ^2\\
&= -\sum_{j=1}^{k} \sum_{i\mid z(i)=j} 2(x^{(i)} - \mu_{j}) = 0\\
\implies \mu_{j} &= (\sum_{i\mid z(i)=j} x^{(i)})/(\sum_{i\mid z(i)=j} 1) \quad \forall j
\end{aligned}
\]
- Notes
This procedure yields a sequence of parameters and losses:
\(\displaystyle \mu^{0}, z^{0}, \mu^1, z^1, ...\)
\(\displaystyle L(0) \geq L(1) \geq L(2) \geq ...\)
- Since the loss is non-increasing and bounded below, it will converge by the monotone convergence theorem.
- However, this does not imply that the parameters \(\displaystyle \mathbf{\mu}\) and \(\displaystyle \mathbf{z}\) will converge.
Algorithm
1. Randomly initialize the labels \(\displaystyle \mathbf{z}\).
2. Calculate the centroids \(\displaystyle \mathbf{\mu}\).
3. Update the labels for each example to the closest centroid.
4. Update the centroids by taking the mean of every point in the cluster.
5. Repeat steps 3 and 4 until the labels stop changing (a sketch follows below).
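Below is a minimal NumPy sketch of this procedure; the function and variable names are illustrative and not from the notes.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate the label update and the centroid update (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    z = rng.integers(0, k, size=m)                 # step 1: random labels
    for _ in range(n_iters):
        # steps 2 and 4: each centroid is the mean of the points assigned to it
        mu = np.stack([X[z == j].mean(axis=0) if np.any(z == j) else X[rng.integers(m)]
                       for j in range(k)])
        # step 3: move each label to its closest centroid
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # shape (m, k)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):               # the loss can no longer decrease
            break
        z = z_new
    return mu, z
```

Because the loss is non-increasing, the loop can safely stop once the labels stop changing.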
Soft K-means
We will develop a model for how our data is generated.
We will then find the probability of each element belonging to each cluster (à la the Bayesian paradigm).
Given \(\displaystyle k\) clusters, the probability of a point being from cluster \(\displaystyle j\) is \(\displaystyle \phi_j = P(z^{(i)} = j)\).
We will assume the points in cluster \(\displaystyle j\) are drawn from a normal distribution \(\displaystyle N(\mu_j, \sigma_j)\).
EM Algorithm
Expectation Maximization
The key idea is to introduce intermediate coefficients to apply Jensen's inequality.
Maximum Likelihood
Let \(\displaystyle \theta_j = [\phi_j, \mu_j, \sigma_j]\) denote our parameters.
The likelihood is \(\displaystyle L(\theta) = \Pi_{i=1}^{m} P(x^{(i)}; \theta)\) where \(\displaystyle P\) is our pdf/pmf.
Then the log-likelihood is \(\displaystyle l(\theta) = \sum_{i=1}^{m} \log P(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{j=1}^{k}P(x^{(i)}, z^{(i)}=j; \theta)\)
By introducing an extra variable \(\displaystyle Q^{(i)}_{(j)}\) we'll be able to apply Jensen's inequality to the concave log function.
Assume \(\displaystyle Q^{(i)}_{(j)}\) is a probability mass function over \(\displaystyle j\).
\[
\begin{align*}
l(\theta) &= \sum_{i=1}^{m} \log \sum_{j=1}^{k}P(x^{(i)}, z^{(i)}=j; \theta)\\
&=\sum_{i=1}^{m} \log \sum_{j=1}^{k} \frac{Q^{(i)}_{(j)}}{Q^{(i)}_{(j)}} P(x^{(i)}, z^{(i)}=j; \theta)\\
&\geq \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log(\frac{P(x^{(i)}, z^{(i)}=j; \theta)
}{Q^{(i)}_{(j)}})\\
\text{where the inequality is Jensen's: } \log\left[E_{Q}\left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}_{(z^{(i)})}}\right)\right] &\geq E_{Q} \left[ \log \left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}_{(z^{(i)})}} \right)\right]
\end{align*}
\]
Let \(\displaystyle J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)\)
The EM algorithm is an iterative algorithm which alternates between an E-Step and an M-Step.
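As a sanity check on the bound (a numeric sketch, not part of the original notes; the joint probabilities below are arbitrary made-up numbers), \(\displaystyle J(\theta, Q) \leq l(\theta)\) for any pmf \(\displaystyle Q\), with equality when \(\displaystyle Q\) is the posterior chosen in the E-step:

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary joint probabilities P(x^(i), z^(i)=j; theta) for m=4 points, k=3 components.
P_joint = rng.uniform(0.01, 0.2, size=(4, 3))

log_lik = np.log(P_joint.sum(axis=1)).sum()                  # l(theta)

def J(Q):
    # sum_i sum_j Q^(i)_(j) log( P(x^(i), z^(i)=j; theta) / Q^(i)_(j) )
    return (Q * np.log(P_joint / Q)).sum()

Q_arbitrary = rng.dirichlet(np.ones(3), size=4)              # any pmf over j
Q_posterior = P_joint / P_joint.sum(axis=1, keepdims=True)   # the E-step choice

print(J(Q_arbitrary) <= log_lik)                             # True: lower bound
print(np.isclose(J(Q_posterior), log_lik))                   # True: the bound is tight
```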
E-Step
We will fix \(\displaystyle \theta\) and maximize J wrt \(\displaystyle Q\).
Jensen's inequality holds with equality iff the function is linear or the random variable is degenerate (i.e. constant).
We will assume \(\displaystyle \frac{P(x^{(i)}, z^{(i)}=j; \theta)}{Q^{(i)}_{(j)}}\) is constant, i.e. it does not depend on \(\displaystyle j\).
This implies \(\displaystyle Q^{(i)}(j) = c * P(x^{(i)}, z^{(i)} = j ; \theta)\).
Since Q is a pmf, we have \(\displaystyle Q^{(i)}(j) = \frac{1}{P(x^{(i)})} P(x^{(i)}, z^{(i)} = j ; \theta) = P(z^{(i)} = j \mid x^{(i)}; \theta)\).
Q is updated with \(\displaystyle Q^{(i)}(j) = \frac{P(z^{(i)}=j)P(x^{(i)}|z^{(i)}=j)}{P(x^{(i)})}\)
\(\displaystyle J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)\)
We are assuming that Q is a pmf so \(\displaystyle \sum_j Q^{(i)}_{(j)} = 1 \)
Our lagrangian is:
\(\displaystyle
\max_{Q} \min_{\beta} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right) +\beta (\sum_j Q^{(i)}_{(j)} - 1)\right]
\)
We can maximize each \(\displaystyle Q^{(i)}\) independently.
Taking the derivative wrt Q we get:
\[
\begin{aligned}
\frac{\partial}{\partial Q^{(i)}_{(j)}} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right) +\beta (\sum_j Q^{(i)}_{(j)} - 1)
&= \log(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q}) - Q \frac{Q}{P(x^{(i)}, z^{(i)}=j;\theta)} (P(x^{(i)}, z^{(i)}=j;\theta))(Q^{-2}) + \beta\\
&= \log(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q}) - 1 + \beta \\
&= 0\\
\implies Q^{(i)}_{(j)} &= (\frac{1}{exp(1-\beta)})P(x^{(i)}, z^{(i)}=j;\theta)
\end{aligned}
\]
Since Q is a pmf it sums to 1 over \(\displaystyle j\), which forces \(\displaystyle \exp(1-\beta) = \sum_j P(x^{(i)}, z^{(i)}=j;\theta) = P(x^{(i)})\), so replacing \(\displaystyle \frac{1}{\exp(1-\beta)}\) with \(\displaystyle \frac{1}{P(x^{(i)})}\) recovers the E-step update above.
M-Step
We will fix \(\displaystyle Q\) and maximize J wrt \(\displaystyle \theta\).
\(\displaystyle J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)\)
Assume \(\displaystyle \Sigma_j = I\) for simplicity.
Then:
\[
\begin{aligned}
J(\theta, Q) &= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)}, z^{(i)}=j;\theta)) + C_1 \\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j) P(z^{(i)}=j)) + C_1\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j)) + Q^{(i)}_{(j)} \log( P(z^{(i)}=j)) \right] + C_1\\
&= \sum_{i=1}^{m}\left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( (2\pi)^{-n/2}\exp(-\Vert x^{(i)} - \mu_j \Vert^2 / 2)\right) + Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_1 \\
&= \sum_{i=1}^{m}\left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \left(-\Vert x^{(i)} - \mu_j \Vert^2 / 2\right) + Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_2
\end{aligned}
\]
Maximizing wrt \(\displaystyle \mu\), we get \(\displaystyle \mu_j^* = (\sum_{i} Q^{(i)}_{(j)}x^{(i)}) / (\sum_{i}Q^{(i)}_{(j)})\).
Maximizing wrt \(\displaystyle \phi\), we get \(\displaystyle \phi_j^* = \frac{1}{m} \sum_{i=1}^{m}Q^{(i)}_{(j)}\).
We want to maximize \(\displaystyle \sum_{i=1}^{m}\sum_{j=1}^{k} Q^{(i)}_{(j)} \log( \phi_j)\) subject to the condition \(\displaystyle \sum_j \phi_j = 1\) since \(\displaystyle \phi_j = P(z^{(i)}=j)\) is a pmf.
The lagrangian for this is \(\displaystyle \max_{\phi_1,...,\phi_k} \min_{\beta} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}Q^{(i)}_{(j)} \log( \phi_j) + \beta(\sum_{j=1}^{k}\phi_j - 1) \right]\)
This can be rewritten as \(\displaystyle \max_{\phi_1,...,\phi_k} \min_{\beta} \sum_{j=1}^{k}\left[\log( \phi_j) \sum_{i=1}^{m} Q^{(i)}_{(j)} + \beta(\phi_j - 1/k) \right]\)
The dual of this problem is \(\displaystyle \min_{\beta} \max_{\phi_1,...,\phi_k} \sum_{j=1}^{k} \left[\log( \phi_j) \sum_{i=1}^{m} Q^{(i)}_{(j)} + \beta(\phi_j - 1/k) \right]\)
Taking the gradient w.r.t \(\displaystyle \phi\), we get \(\displaystyle \frac{1}{\phi_j}\sum_{i}Q^{(i)}_{(j)} + \beta = 0 \implies \phi_j = \frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})\)
Plugging this into our dual problem we get:
\(\displaystyle \min_{\beta} \sum_{j=1}^{k} \left[\log(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})) \sum_{i=1}^{m} Q^{(i)}_{(j)} + \beta(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)}) - 1/k) \right]\)
\(\displaystyle = \min_{\beta} \sum_{j=1}^{k} \left[\log(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})) \sum_{i=1}^{m} Q^{(i)}_{(j)} -(\sum_{i}Q^{(i)}_{(j)}) - (\beta/k) \right]\)
Taking the derivative with respect to \(\displaystyle \beta\), we get:
\[
\begin{aligned}
\sum_{j=1}^{k} [(\frac{1}{(-1/\beta)(\sum Q)})(-\sum Q)(-\beta^{-2})(\sum Q) - \frac{1}{k}]
&= \sum_{j=1}^{k} [ (\beta)(-\beta^{-2})(\sum Q) - \frac{1}{k}] \\
&= \sum_{j=1}^{k} [\frac{-1}{\beta}(\sum_{i=1}^{m} Q) - \frac{1}{k}]\\
&= [\sum_{i=1}^{m} \frac{-1}{\beta} \sum_{j=1}^{k}P(z^{(i)} = j | x^{(i)}) - \sum_{j=1}^{k}\frac{1}{k}]\\
&= [\frac{-1}{\beta}\sum_{i=1}^{m}1 - 1]\\
&= \frac{-m}{\beta} - 1 = 0\\
\implies \beta &= -m
\end{aligned}
\]
Plugging in \(\displaystyle \beta = -m\) into our equation for \(\displaystyle \phi_j\) we get \(\displaystyle \phi_j = \frac{1}{m}\sum_{i=1}^{m}Q^{(i)}_{(j)}\)
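Putting the E-step and M-step together for the \(\displaystyle \Sigma_j = I\) case gives the following soft k-means sketch (a minimal NumPy version; the names and initialization are illustrative, not from the notes).

```python
import numpy as np

def soft_kmeans_em(X, k, n_iters=50, seed=0):
    """EM for a Gaussian mixture with fixed identity covariances."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]   # initial centroids
    phi = np.full(k, 1.0 / k)                      # mixing weights P(z = j)
    for _ in range(n_iters):
        # E-step: Q^(i)(j) = P(z=j) N(x^(i); mu_j, I) / P(x^(i))
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (m, k)
        log_unnorm = np.log(phi)[None, :] - 0.5 * sq_dists               # up to a constant
        log_unnorm -= log_unnorm.max(axis=1, keepdims=True)              # numerical stability
        Q = np.exp(log_unnorm)
        Q /= Q.sum(axis=1, keepdims=True)
        # M-step: mu_j = sum_i Q_ij x^(i) / sum_i Q_ij,  phi_j = (1/m) sum_i Q_ij
        mu = (Q.T @ X) / Q.sum(axis=0)[:, None]
        phi = Q.mean(axis=0)
    return mu, phi, Q
```

Each E-step fills in \(\displaystyle Q\) with the posterior cluster responsibilities; each M-step applies the closed-form updates for \(\displaystyle \mu_j\) and \(\displaystyle \phi_j\) derived above.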
Mean-Shift
DBSCAN
Generative Models
Goal: Generate realistic but fake samples. Applications: denoising, inpainting.
VAEs
Variational Auto-Encoders
Tutorial
KL Divergence
- Wikipedia:Kullback–Leibler divergence
- Kullback–Leibler divergence
- \(\displaystyle KL(P \Vert Q) =E_{P}\left[ \log(\frac{P(X)}{Q(X)}) \right]\)
- Notes
- KL is always >= 0
- KL is not symmetric
- Jensen-Shannon Divergence
- \(\displaystyle JSD(P \Vert Q) = \frac{1}{2}KL(P \Vert M) + \frac{1}{2}KL(Q \Vert M)\) where \(\displaystyle M = \frac{1}{2}(P+Q)\)
- This is symmetric (see the numeric sketch below)
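A quick discrete-distribution check of these properties (a sketch; the example distributions are made up):

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) = E_P[log(P(X)/Q(X))] for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                        # treat 0 * log 0 as 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence via the midpoint distribution M."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(kl(p, q), kl(q, p))               # both >= 0, but not equal (not symmetric)
print(jsd(p, q), jsd(q, p))             # equal (symmetric)
```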
Model
Our model for how the data is generated is as follows:
- Generate latent variables \(\displaystyle z^{(1)},...,z^{(m)} \in \mathbb{R}^r\) iid where dimension r is less than n.
- We assume \(\displaystyle Z^{(i)} \sim N(\mathbf{0},\mathbf{I})\)
- Generate \(\displaystyle x^{(i)}\) where \(\displaystyle X^{(i)} \vert Z^{(i)} \sim N(g_{\theta_1}(z^{(i)}), \sigma^2 \mathbf{I})\)
- For some function \(\displaystyle g_{\theta_1}\) parameterized by \(\displaystyle \theta_1\)
Variational Bound
The variational bound is:
- \(\displaystyle \log P(x^{(i)}) \geq E_{Z \sim Q_i}[\log P(X^{(i)} \vert Z)] - KL(Q_i(z) \Vert P(z))\)
We know from Bayes' rule that \(\displaystyle P(z|X) = \frac{P(X|z)P(z)}{P(X)}\).
Plugging this into the equation for \(\displaystyle KL(Q_i(z) \Vert P(z|X))\) yields our inequality.
\[
\begin{aligned}
KL(Q_i(z) \Vert P(z|X)) &= E_{Q} \left[ \log(\frac{Q_i(z)}{P(z|X)}) \right]\\
&=E_Q\left[\log\left(\frac{Q_i(z) P(x^{(i)})}{P(X|z)P(z)}\right)\right]\\
&=E_Q\left[\log\left(\frac{Q_i(z)}{P(z)}\right)\right] + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]\\
&=KL(Q_i(z) \Vert P(z)) + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]
\end{aligned}
\]
Rearranging terms we get:
\(\displaystyle \log P(x^{(i)}) - KL(Q_i(z) \Vert P(z|X)) = E_Q(\log(P(X|z)) - KL(Q_i(z) \Vert P(z))\)
Since the KL divergence is greater than or equal to 0, our variational bound follows.
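To make the bound concrete, here is a sketch (not from the notes) that estimates the right-hand side by Monte Carlo for a Gaussian \(\displaystyle Q_i = N(\mu_Q, \operatorname{diag}(s^2))\) and an assumed decoder \(\displaystyle g\); it uses the closed-form KL between a diagonal Gaussian and \(\displaystyle N(\mathbf{0},\mathbf{I})\). The decoder, encoder parameters, and toy data below are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, enc_mu, enc_std, g, sigma=1.0, n_samples=64):
    """Monte Carlo estimate of E_Q[log P(x|Z)] - KL(Q || N(0, I)).

    Q = N(enc_mu, diag(enc_std^2)) plays the role of Q_i(z);
    P(x|z) = N(g(z), sigma^2 I) as in the model above.
    """
    r, n = enc_mu.shape[0], x.shape[0]
    z = enc_mu + enc_std * rng.standard_normal((n_samples, r))   # samples from Q
    recon = np.stack([g(zi) for zi in z])                        # g(z), shape (n_samples, n)
    log_px_given_z = (-0.5 * ((x - recon) ** 2).sum(axis=1) / sigma**2
                      - 0.5 * n * np.log(2 * np.pi * sigma**2))
    # KL(N(mu, diag(s^2)) || N(0, I)) has a closed form:
    kl = 0.5 * np.sum(enc_std**2 + enc_mu**2 - 1.0 - 2.0 * np.log(enc_std))
    return log_px_given_z.mean() - kl

# Toy usage with a hypothetical linear decoder g(z) = W z.
W = rng.standard_normal((5, 2))
g = lambda z: W @ z
x = g(np.array([0.3, -0.7])) + 0.1 * rng.standard_normal(5)
print(elbo_estimate(x, enc_mu=np.zeros(2), enc_std=np.ones(2), g=g))
```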
GANs
Wasserstein GAN
Paper
The main idea is to ensure that the discriminator is Lipschitz continuous and to limit its Lipschitz constant (i.e. bound its derivative).
If the correct answer is 1.0 and the generator produces 1.0001, we don't want the discriminator to give us a very high loss.
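The paper enforces this with weight clipping on the critic. Below is a rough PyTorch sketch of the critic update only; the network sizes, learning rate, clip value, and toy data are placeholders, not from the notes.

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
gen = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
clip = 0.01                               # clip weights to keep the critic (roughly) Lipschitz

real = torch.randn(128, 2) + 3.0          # stand-in for a batch of real data
for _ in range(5):                        # several critic updates per generator update
    fake = gen(torch.randn(128, 8)).detach()
    # The critic maximizes E[f(real)] - E[f(fake)], an estimate of the earth mover's distance.
    loss_c = -(critic(real).mean() - critic(fake).mean())
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()
    for p in critic.parameters():
        p.data.clamp_(-clip, clip)        # weight clipping
```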
- Earth mover's distance
The minimum cost of converting one pile of dirt into another, where the cost of a move is the amount moved times the distance moved (a small numeric example follows the notes below).
Given a set \(\displaystyle P\) with m clusters and a set \(\displaystyle Q\) with n clusters:
...
\(\displaystyle EMD(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}d_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}}\)
- Notes
- Also known as Wasserstein metric
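For one-dimensional distributions this can be computed directly with SciPy's `scipy.stats.wasserstein_distance` (the weighted "clusters" below are made-up values for illustration):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two small 1-D "piles of dirt": support points with weights (cluster sizes).
p_support, p_weights = np.array([0.0, 1.0, 3.0]), np.array([0.4, 0.4, 0.2])
q_support, q_weights = np.array([0.5, 2.0]), np.array([0.5, 0.5])

print(wasserstein_distance(p_support, q_support, p_weights, q_weights))
```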
Dimension Reduction
Goal: Reduce the dimension of a dataset.
If each example \(\displaystyle x \in \mathbb{R}^n\), we want to reduce each example to be in \(\displaystyle \mathbb{R}^k\) where \(\displaystyle k \lt n\)
PCA
Principal Component Analysis
Preprocessing: Subtract the sample mean from each example so that the new sample mean is 0.
Goal: Find a vector \(\displaystyle v_1\) such that the projection \(\displaystyle v_1 \cdot x\) has maximum variance.
Result: These principal components are the eigenvectors of \(\displaystyle X^TX\).
Idea: Maximize the variance of the projections.
\(\displaystyle \max_{\Vert v_1 \Vert = 1} \frac{1}{m}\sum (v_1 \cdot x^{(i)})^2\)
Note that \(\displaystyle \sum (v_1 \cdot x^{(i)})^2 = \sum v_1^T x^{(i)} (x^{(i)})^T v_1 = v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1\).
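A NumPy sketch of this procedure (the rows of X are examples; the synthetic data and names are illustrative):

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal components and the projected data."""
    Xc = X - X.mean(axis=0)                 # subtract the sample mean
    C = Xc.T @ Xc / Xc.shape[0]             # proportional to X^T X
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    V = eigvecs[:, order[:k]]               # principal components v_1, ..., v_k
    return V, Xc @ V                        # components and k-dim projections

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
V, Z = pca(X, k=1)
print(V[:, 0], Z.var())                     # v_1 and the variance it captures
```

Equivalently, the right singular vectors of the centered data matrix (e.g. via np.linalg.svd) give the same components.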
Kernel PCA
Autoencoder
You have an encoder and a decoder, both of which are neural networks.
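A minimal PyTorch sketch (the layer sizes, data, and training details are placeholders): the encoder maps each example to a low-dimensional code and the decoder tries to reconstruct the input from that code.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)                 # stand-in for a batch of flattened images
for _ in range(100):
    z = encoder(x)                      # low-dimensional code
    x_hat = decoder(z)                  # reconstruction
    loss = nn.functional.mse_loss(x_hat, x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```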
Resources
- Clustering