Unsupervised Learning

 
Here we wish to cluster them to minimize the distance to the mean of their cluster.<br>
In our formulation we have k clusters.<br>
The mean of each cluster <math>\mu_i</math> is called the centroid.<br>
====Optimization====
Let <math>\mathbf{\mu}</math> denote the centroids and let <math>\mathbf{z}</math> denote the cluster labels for our data.<br>
If we fix <math>\mathbf{z}</math> then we need to minimize <math>L(\mu, \mathbf{z})</math> wrt <math>\mu</math>.<br>
Taking the gradient and setting it to 0 we get:<br>
<math display="block">
\begin{aligned}
\nabla_{\mu} L(\mu, \mathbf{z}) &= \nabla_{\mu} \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\
&= \nabla_{\mu} \sum_{j=1}^{k} \sum_{i \mid z^{(i)}=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\
&= \nabla_{\mu} \sum_{j=1}^{k} \sum_{i \mid z^{(i)}=j} \Vert x^{(i)} - \mu_{j} \Vert ^2\\
&= \sum_{j=1}^{k} \sum_{i \mid z^{(i)}=j} \nabla_{\mu} \Vert x^{(i)} - \mu_{j} \Vert ^2\\
&= -\sum_{j=1}^{k} \sum_{i \mid z^{(i)}=j} 2(x^{(i)} - \mu_{j}) = 0\\
\implies \mu_{j} &= \left(\sum_{i \mid z^{(i)}=j} x^{(i)}\right) \Big/ \left(\sum_{i \mid z^{(i)}=j} 1\right) \quad \forall j
\end{aligned}
</math>
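So each centroid is the mean of the points assigned to it, and alternating between the assignment step and this update is exactly Lloyd's algorithm for k-means. A minimal NumPy sketch of that alternation; the function name and the random initialization scheme are illustrative choices, not part of the derivation above.
<syntaxhighlight lang="python">
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid (fix mu, minimize over z) and recomputing each
    centroid as the mean of its cluster (fix z, minimize over mu)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: z^(i) = argmin_j ||x^(i) - mu_j||^2
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Update step: mu_j = mean of the points assigned to cluster j
        new_mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break  # converged: assignments will no longer change
        mu = new_mu
    return mu, z
</syntaxhighlight>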


By introducing an extra variable <math>Q^{(i)}_{(j)}</math> we'll be able to apply Jensen's inequality to the concave log function.<br>
Assume <math>Q^{(i)}_{(j)}</math> is a probability mass function.<br>
<math display="block">
\begin{aligned}
l(\theta) &= \sum_{i=1}^{m} \log \sum_{j=1}^{k} P(x^{(i)}, z^{(i)}=j; \theta)\\
&= \sum_{i=1}^{m} \log \sum_{j=1}^{k} \frac{Q^{(i)}_{(j)}}{Q^{(i)}_{(j)}} P(x^{(i)}, z^{(i)}=j; \theta)\\
&\geq \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log\left(\frac{P(x^{(i)}, z^{(i)}=j; \theta)}{Q^{(i)}_{(j)}}\right)
\end{aligned}
</math>
The last step uses Jensen's inequality for the concave log: <math>\log\left[E_{Q}\left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}_{(z^{(i)})}}\right)\right] \geq E_{Q} \left[ \log \left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}_{(z^{(i)})}} \right)\right]</math>.<br>
Let <math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)</math><br>
The EM algorithm is an iterative algorithm which alternates between an E-Step and an M-Step.
=====E-Step=====
We will fix <math>\theta</math> and maximize J wrt <math>Q</math>.<br>
We can maximize each <math>Q^{(i)}</math> independently.<br>
Taking the derivative wrt Q we get:<br>
<math display="block">
\begin{aligned}
&\frac{\partial}{\partial Q^{(i)}_{(j)}} \left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right) + \beta \left(\sum_j Q^{(i)}_{(j)} - 1\right) \right]\\
&= \log\left(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}}\right) + Q^{(i)}_{(j)} \cdot \frac{Q^{(i)}_{(j)}}{P(x^{(i)}, z^{(i)}=j;\theta)} \cdot \left(-P(x^{(i)}, z^{(i)}=j;\theta)\right)\left(Q^{(i)}_{(j)}\right)^{-2} + \beta\\
&= \log\left(\frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}}\right) - 1 + \beta = 0\\
\implies Q^{(i)}_{(j)} &= \frac{1}{\exp(1-\beta)}\, P(x^{(i)}, z^{(i)}=j;\theta)
\end{aligned}
</math>
Since <math>Q^{(i)}</math> is a pmf, it sums to 1 over <math>j</math>, so the normalizer must be <math>\frac{1}{P(x^{(i)};\theta)} = \frac{1}{\sum_j P(x^{(i)}, z^{(i)}=j;\theta)}</math>. Replacing <math>\frac{1}{\exp(1-\beta)}</math> with this gives <math>Q^{(i)}_{(j)} = P(z^{(i)}=j \mid x^{(i)};\theta)</math>.
=====M-Step=====
Now we fix <math>Q</math> and maximize J wrt <math>\theta</math>.<br>
<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)</math><br>
Assume <math>\Sigma_j = I</math> for simplicity.<br>
Then:
<math display="block">
\begin{aligned}
J(\theta, Q) &= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{P(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}_{(j)}} \right)\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)}, z^{(i)}=j;\theta)) + C_1 \\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j) P(z^{(i)}=j)) + C_1\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ Q^{(i)}_{(j)} \log ( P(x^{(i)} \mid z^{(i)}=j)) + Q^{(i)}_{(j)} \log( P(z^{(i)}=j)) \right] + C_1\\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ Q^{(i)}_{(j)} \log \left( (2\pi)^{-n/2}\exp(-\Vert x^{(i)} - \mu_j \Vert^2 / 2) \right) + Q^{(i)}_{(j)} \log( \phi_j) \right] + C_1 \\
&= \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ -Q^{(i)}_{(j)} \Vert x^{(i)} - \mu_j \Vert^2 / 2 + Q^{(i)}_{(j)} \log( \phi_j) \right] + C_2
\end{aligned}
</math>
Maximizing wrt <math>\mu</math>, we get <math>\mu_j^* = \left(\sum_{i} Q^{(i)}_{(j)}x^{(i)}\right) / \left(\sum_{i}Q^{(i)}_{(j)}\right)</math>.<br>
Maximizing wrt <math>\phi</math> under the constraint <math>\sum_j \phi_j = 1</math> (with Lagrange multiplier <math>\beta</math>), the stationarity condition gives <math>\phi_j = \frac{-1}{\beta}\sum_{i}Q^{(i)}_{(j)}</math>. Substituting this back, it remains to solve:<br>
<math>\min_{\beta} \sum_{j=1}^{k} \left[\log\left(\frac{-1}{\beta}\left(\sum_{i}Q^{(i)}_{(j)}\right)\right) \sum_{i=1}^{m} Q^{(i)}_{(j)}  - \left(\sum_{i}Q^{(i)}_{(j)}\right) - (\beta/k) \right]</math><br>
Taking the derivative with respect to <math>\beta</math>, we get:<br>
<math display="block">
\begin{aligned}
&\sum_{j=1}^{k} \left[\frac{1}{(-1/\beta)(\sum Q)}(-\sum Q)(-\beta^{-2})(\sum Q) - \frac{1}{k}\right]\\
&= \sum_{j=1}^{k} \left[ (\beta)(-\beta^{-2})(\sum Q) - \frac{1}{k}\right] \\
&= \sum_{j=1}^{k} \left[\frac{-1}{\beta}\left(\sum_{i=1}^{m} Q\right) - \frac{1}{k}\right]\\
&= \sum_{i=1}^{m} \frac{-1}{\beta} \sum_{j=1}^{k}P(z^{(i)} = j \mid x^{(i)}) - \sum_{j=1}^{k}\frac{1}{k}\\
&= \frac{-1}{\beta}\sum_{i=1}^{m}1 - 1\\
&= \frac{-m}{\beta} - 1 = 0\\
\implies \beta &= -m
\end{aligned}
</math>
Plugging in <math>\beta = -m</math> into our equation for <math>\phi_j</math> we get <math>\phi_j = \frac{1}{m}\sum_{i=1}^{m}Q^{(i)}_{(j)}</math>
}}
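Putting the two steps together, here is a minimal NumPy sketch of EM for a mixture of k spherical Gaussians (<math>\Sigma_j = I</math>, as assumed above), implementing the derived updates for <math>Q</math>, <math>\mu_j</math>, and <math>\phi_j</math>; the initialization and iteration count are illustrative choices.
<syntaxhighlight lang="python">
import numpy as np

def em_gmm_spherical(X, k, n_iters=50, seed=0):
    """EM for k spherical Gaussians (Sigma_j = I), using the updates above:
      E-step: Q^(i)_(j) = P(z=j | x^(i))           (posterior responsibilities)
      M-step: mu_j  = sum_i Q^(i)_(j) x^(i) / sum_i Q^(i)_(j)
              phi_j = (1/m) sum_i Q^(i)_(j)
    """
    m, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]
    phi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: log P(x, z=j) up to an additive constant, then normalize.
        log_joint = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) \
                    + np.log(phi)[None, :]
        log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
        Q = np.exp(log_joint)
        Q /= Q.sum(axis=1, keepdims=True)   # Q^(i)_(j) = P(z=j | x^(i))
        # M-step: responsibility-weighted means and mixing proportions.
        Nj = Q.sum(axis=0)                  # sum_i Q^(i)_(j)
        mu = (Q.T @ X) / Nj[:, None]        # mu_j*
        phi = Nj / m                        # phi_j
    return mu, phi, Q
</syntaxhighlight>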
===DBSCAN===


==Generative Models==
* KL is always >= 0
* KL is not symmetric
* Jensen-Shannon Divergence
** <math>JSD(P \Vert Q) = \frac{1}{2}KL(P \Vert M) + \frac{1}{2}KL(Q \Vert M)</math> where <math>M = \frac{1}{2}(P + Q)</math>
** This is symmetric
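As a quick sanity check of these properties, a small NumPy sketch for discrete distributions (the function names are ours):
<syntaxhighlight lang="python">
import numpy as np

def kl(p, q):
    """KL(P || Q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p = 0 contribute 0 to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q)/2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.1, 0.4, 0.5]
print(kl(p, q) >= 0, jsd(p, q) >= 0)   # True True: both are non-negative
print(jsd(p, q) == jsd(q, p))          # True: JSD is symmetric, KL is not
</syntaxhighlight>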


====Model====
* Generate latent variables <math>z^{(1)},...,z^{(m)} \in \mathbb{R}^r</math> iid where dimension r is less than n.
** We assume <math>Z^{(i)} \sim N(\mathbf{0},\mathbf{I})</math>
* Generate <math>x^{(i)}</math> where <math>X^{(i)} \vert Z^{(i)} \sim N(g_{\theta}(z), \sigma^2 \mathbf{I})</math>
** For some function <math>g_{\theta_1}</math> parameterized by <math>\theta_1</math> (a sampling sketch follows this list)
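A sketch of ancestral sampling from this generative model: draw <math>z \sim N(0, I)</math>, then draw <math>x \mid z</math> around <math>g_{\theta_1}(z)</math>. The toy linear decoder stands in for <math>g_{\theta_1}</math> and is purely illustrative.
<syntaxhighlight lang="python">
import numpy as np

def sample_from_model(g_theta, m, r, n, sigma=0.1, seed=0):
    """Ancestral sampling: z ~ N(0, I_r), then x | z ~ N(g_theta(z), sigma^2 I_n)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((m, r))                    # z^(1..m) iid N(0, I)
    X = g_theta(Z) + sigma * rng.standard_normal((m, n))
    return Z, X

# Toy linear decoder mapping r=2 latents to n=5 observations (illustrative only).
W = np.random.default_rng(1).standard_normal((2, 5))
Z, X = sample_from_model(lambda Z: Z @ W, m=100, r=2, n=5)
</syntaxhighlight>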


We know from Bayes' rule that <math>P(z|X) = \frac{P(X|z)P(z)}{P(X)}</math>.<br>
Plugging this into the equation for <math>KL(Q_i(z) \Vert P(z|X))</math> yields our inequality.<br>
<math display="block">
\begin{aligned}
KL(Q_i(z) \Vert P(z|X)) &= E_{Q} \left[ \log\left(\frac{Q_i(z)}{P(z|X)}\right) \right]\\
&= E_Q\left[\log\left(\frac{Q_i(z) P(x^{(i)})}{P(X|z)P(z)}\right)\right]\\
&= E_Q\left[\log\left(\frac{Q_i(z)}{P(z)}\right)\right] + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]\\
&= KL(Q_i(z) \Vert P(z)) + \log(P(x^{(i)})) - E_Q[\log(P(X|z))]
\end{aligned}
</math>
Rearranging terms we get:<br>
<math>\log P(x^{(i)}) - KL(Q_i(z) \Vert P(z|X)) = E_Q[\log(P(X|z))] - KL(Q_i(z) \Vert P(z))</math><br>
}}
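The right-hand side of the rearranged identity is the evidence lower bound (ELBO) that variational autoencoders maximize. A sketch of it for one example, assuming a Gaussian encoder <math>Q(z) = N(\mu_q, \operatorname{diag}(\sigma_q^2))</math>, prior <math>P(z) = N(0, I)</math>, and decoder likelihood <math>P(x|z) = N(\hat{x}, \sigma^2 I)</math>, for which the KL term has the standard closed form; the argument names are ours.
<syntaxhighlight lang="python">
import numpy as np

def elbo(x, mu_q, log_var_q, x_recon, sigma=1.0):
    """log P(x) >= E_Q[log P(x|z)] - KL(Q(z) || P(z)), per the identity above.
    With Q = N(mu_q, diag(exp(log_var_q))) and P(z) = N(0, I), the KL term is
    0.5 * sum(sigma_q^2 + mu_q^2 - 1 - log sigma_q^2). With a Gaussian decoder,
    the reconstruction log-likelihood is a scaled squared error (up to a constant)."""
    recon_ll = -np.sum((x - x_recon) ** 2) / (2 * sigma ** 2)
    kl_term = 0.5 * np.sum(np.exp(log_var_q) + mu_q ** 2 - 1.0 - log_var_q)
    return recon_ll - kl_term  # maximize this (equivalently, minimize its negative)
</syntaxhighlight>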


===GANs===
{{main | Generative adversarial network}}


====Wasserstein GAN====
[https://arxiv.org/abs/1701.07875 Paper]<br>
The main idea is to ensure that the discriminator is Lipschitz continuous and to limit its Lipschitz constant (i.e. the magnitude of its derivative).<br>
If the correct answer is 1.0 and the generator produces 1.0001, we don't want the discriminator to give us a very high loss.<br>
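The original paper enforces this by clipping the critic's (discriminator's) weights to a small box after each update, which bounds the network's Lipschitz constant; the paper uses <math>c = 0.01</math>. A PyTorch sketch:
<syntaxhighlight lang="python">
import torch

def clip_critic_weights(critic: torch.nn.Module, c: float = 0.01):
    """After each critic update, clip every weight to [-c, c].
    Bounded weights give a bounded Lipschitz constant (the weight-clipping
    approach from the original WGAN paper)."""
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
</syntaxhighlight>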
;Earth mover's distance
{{main | wikipedia:earth mover's distance}}
The minimum cost of converting one pile of dirt into another,<br>
where the cost of a move is the amount of dirt moved times the distance it is moved.<br>
Given a set <math>P</math> with m clusters and a set <math>Q</math> with n clusters:<br>
...
<math>EMD(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}d_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n}f_{i,j}}</math><br>
;Notes
* Also known as Wasserstein metric
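For one-dimensional samples, SciPy computes this distance directly; a quick illustrative check (the sample values are ours):
<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D earth mover's distance between two empirical samples.
u = np.array([0.0, 1.0, 3.0])
v = np.array([5.0, 6.0, 8.0])   # same shape as u, shifted right by 5
print(wasserstein_distance(u, v))  # 5.0: every unit of mass moves 5 units
</syntaxhighlight>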
==Dimension Reduction==
Goal: Reduce the dimension of a dataset.<br>
If each example <math>x \in \mathbb{R}^n</math>, we want to reduce each example to be in <math>\mathbb{R}^k</math> where <math>k < n</math>
===PCA===
Principal Component Analysis<br>
Preprocessing: Subtract the sample mean from each example so that the new sample mean is 0.<br>
Goal: Find a vector <math>v_1</math> such that the projection <math>v_1 \cdot x</math> has maximum variance.<br>
Result: These principal components are the eigenvectors of <math>X^TX</math>.<br>
Idea: Maximize the variance of the projections.<br>
<math>\max_{\Vert v_1 \Vert = 1} \frac{1}{m}\sum (v_1 \cdot x^{(i)})^2</math><br>
Note that <math>\sum (v_1 \cdot x^{(i)})^2 = \sum v_1^T x^{(i)} (x^{(i)})^T v_1 = v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1</math>.<br>
<!--
Thus our lagrangian is <math>\max_{\alpha} v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1 + \alpha(\Vert v_1 \Vert^2 - 1)</math><br>
Taking the gradient of this we get:<br>
<math>\nabla_{v_1} (v_1^T (\sum x^{(i)} (x^{(i)})^T) v_1) + \alpha(\Vert v_1 \Vert^2 - 1)</math><br>
<math>= 2(\sum x^{(i)} (x^{(i)})^T) v_1 + 2\alpha v_1</math>
-->
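A minimal NumPy sketch of this procedure: center the data, then project onto the top-<math>k</math> eigenvectors of <math>X^TX</math> (the function name is ours):
<syntaxhighlight lang="python">
import numpy as np

def pca(X, k):
    """Project data (rows are examples) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # preprocessing: zero the sample mean
    cov = Xc.T @ Xc                          # proportional to the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix; ascending eigenvalues
    V = eigvecs[:, ::-1][:, :k]              # top-k eigenvectors = max-variance directions
    return Xc @ V                            # reduced representation in R^k
</syntaxhighlight>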
===Kernel PCA===
{{main | Wikipedia: Kernel principal component analysis}}
===Autoencoder===
You have an encoder and a decoder, which are both neural networks.
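A minimal PyTorch sketch; the layer sizes and depth are arbitrary illustrative choices:
<syntaxhighlight lang="python">
import torch
from torch import nn

class Autoencoder(nn.Module):
    """Encoder compresses x in R^n to a code in R^k; decoder reconstructs it.
    Trained to minimize reconstruction error, e.g. ||x - decoder(encoder(x))||^2."""
    def __init__(self, n=784, k=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, k))
        self.decoder = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, n))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training would minimize nn.MSELoss()(model(x), x) with an optimizer such as Adam.
</syntaxhighlight>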
==Resources==
;Clustering
* [https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 TowardsDataScience: top 5 clustering algos]
* [https://machinelearningmastery.com/clustering-algorithms-with-python/ MachineLearningMastery: Clustering Algorithms]