# Unsupervised Learning

Basics of Unsupervised Learning

## Clustering

Given points ${\displaystyle \{x^{(1)},...,x^{(m)}\}}$ we wish to group them into clusters 1,...,k.
Usually we are given the number of clusters, k.
The problem is to assign labels ${\displaystyle \{z^{(1)},...,z^{(m)}\}}$ where ${\displaystyle z^{(i)}\in \{1,...,k\}}$.
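
As a toy illustration of the setup (made-up points and labels, using 0-indexed clusters), in numpy:

```python
import numpy as np

# Toy example: m = 6 points in R^2, to be grouped into k = 2 clusters.
X = np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],   # points near the origin
              [5.0, 5.1], [4.8, 5.2], [5.1, 4.9]])    # points near (5, 5)
k = 2
# One possible labeling; here z^(i) takes values in {0, ..., k-1} instead of {1, ..., k}.
z = np.array([0, 0, 0, 1, 1, 1])
```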

### K-means

Here we wish to cluster the points so as to minimize the squared distance from each point to the mean of its cluster.
In our formulation we have k clusters.
The mean ${\displaystyle \mu _{j}}$ of each cluster is called its centroid.

#### Optimization

Let ${\displaystyle \mathbf {\mu } }$ denote the centroids and let ${\displaystyle \mathbf {z} }$ denote the cluster labels for our data.
Our loss function is ${\displaystyle L(\mu ,\mathbf {z} )=\sum _{i}\Vert x^{(i)}-\mu _{z^{(i)}}\Vert ^{2}}$.
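
A minimal numpy sketch of this loss, assuming X is an (m, n) array of points, mu stacks the k centroids as rows, and z holds 0-indexed labels:

```python
import numpy as np

def kmeans_loss(X, mu, z):
    """L(mu, z) = sum_i ||x^(i) - mu_{z^(i)}||^2."""
    diffs = X - mu[z]          # mu[z] selects each point's assigned centroid
    return np.sum(diffs ** 2)
```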

We minimize this loss using block coordinate descent, alternating between the two blocks of variables ${\displaystyle \mu }$ and ${\displaystyle \mathbf {z} }$.

If we fix ${\displaystyle \mu }$ then we only need to minimize over ${\displaystyle \mathbf {z} }$.
Since each term of the sum is independent, we simply choose the closest centroid for each point.
If we fix ${\displaystyle \mathbf {z} }$ then we need to minimize ${\displaystyle L(\mu ,\mathbf {z} )}$ with respect to ${\displaystyle \mu }$.
Taking the gradient and setting it to 0 we get:
${\displaystyle \nabla _{\mu }L(\mu ,\mathbf {z} )=\nabla _{\mu }\sum _{i}\Vert x^{(i)}-\mu _{z^{(i)}}\Vert ^{2}}$
${\displaystyle =\nabla _{\mu }\sum _{j=1}^{k}\sum _{i\mid z^{(i)}=j}\Vert x^{(i)}-\mu _{z^{(i)}}\Vert ^{2}}$
${\displaystyle =\nabla _{\mu }\sum _{j=1}^{k}\sum _{i\mid z^{(i)}=j}\Vert x^{(i)}-\mu _{j}\Vert ^{2}}$
${\displaystyle =\sum _{j=1}^{k}\sum _{i\mid z^{(i)}=j}\nabla _{\mu }\Vert x^{(i)}-\mu _{j}\Vert ^{2}}$
${\displaystyle =-\sum _{j=1}^{k}\sum _{i\mid z^{(i)}=j}2(x^{(i)}-\mu _{j})=0}$
${\displaystyle \implies \mu _{j}={\frac {\sum _{i\mid z^{(i)}=j}x^{(i)}}{\sum _{i\mid z^{(i)}=j}1}}\quad \forall j}$
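
In code, each block update is a short numpy step (a sketch using the same array conventions as above; it assumes every cluster keeps at least one assigned point):

```python
import numpy as np

def assign_labels(X, mu):
    # Fix mu, minimize over z: choose the closest centroid for every point.
    dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (m, k)
    return np.argmin(dists, axis=1)

def update_centroids(X, z, k):
    # Fix z, minimize over mu: mu_j is the mean of the points assigned to cluster j.
    return np.array([X[z == j].mean(axis=0) for j in range(k)])
```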

Notes

This procedure yields a sequence of parameters and corresponding losses:
${\displaystyle \mu ^{0},z^{0},\mu ^{1},z^{1},...}$
${\displaystyle L(0)\geq L(1)\geq L(2)\geq ...}$

• Since the loss is monotone decreasing and bounded below, it will converge by the monotone convergence theorem.
• However, this does not imply that the parameters ${\displaystyle \mathbf {\mu } }$ and ${\displaystyle \mathbf {z} }$ will converge.

#### Algorithm

1. Randomly initialize labels ${\displaystyle \mathbf {z} }$.
2. Then calculate the centroids ${\displaystyle \mathbf {\mu } }$.
3. Then update the labels for each example to the closest centroid.
4. Update the centroids by taking the mean of every point in the cluster.
5. Repeat steps 3 and 4 until the cluster assignments no longer change.
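
Putting steps 1-5 together, reusing the assign_labels and update_centroids helpers sketched earlier (a minimal loop; it stops once the labels stop changing and assumes no cluster becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.integers(0, k, size=X.shape[0])   # 1. randomly initialize labels
    mu = update_centroids(X, z, k)            # 2. compute the initial centroids
    for _ in range(n_iters):
        z_new = assign_labels(X, mu)          # 3. closest centroid for each point
        mu = update_centroids(X, z_new, k)    # 4. mean of every point in the cluster
        if np.array_equal(z_new, z):          # converged: labels no longer change
            break
        z = z_new                             # 5. repeat steps 3 and 4
    return mu, z
```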

### Soft K-means

We will develop a model for how our data is generated.
We will then find, for each point, the probability that it came from each cluster (a posterior over cluster assignments, in the Bayesian spirit).
Given ${\displaystyle k}$ clusters, the prior probability that a point comes from cluster ${\displaystyle j}$ is ${\displaystyle \phi _{j}=P(z^{(i)}=j)}$.
We will assume the points from cluster ${\displaystyle j}$ are drawn from a normal distribution ${\displaystyle N(\mu _{j},\Sigma _{j})}$.
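
A sketch of this generative story (made-up parameter values, and spherical covariances ${\displaystyle \Sigma _{j}=\sigma _{j}^{2}I}$ for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.3, 0.7])                # phi_j = P(z = j); sums to 1
mu = np.array([[0.0, 0.0], [5.0, 5.0]])   # cluster means mu_j
sigma = np.array([1.0, 0.5])              # per-cluster spherical standard deviations

def sample_point():
    j = rng.choice(len(phi), p=phi)              # first draw the cluster label z
    x = rng.normal(loc=mu[j], scale=sigma[j])    # then x | z = j ~ N(mu_j, sigma_j^2 I)
    return x, j
```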

#### EM Algorithm

Expectation Maximization
The key idea is to introduce intermediate coefficients to apply Jensen's inequality.

##### Maximum Likelihood

Let ${\displaystyle \theta _{j}=[\phi _{j},\mu _{j},\sigma _{j}]}$ denote our parameters.
The likelihood is ${\displaystyle L(\theta )=\Pi _{i=1}^{m}P(x^{(i)};\theta )}$ where ${\displaystyle P}$ is our pdf/pmf.
Then the log-likelihood is ${\displaystyle l(\theta )=\sum _{i=1}^{m}\log P(x^{(i)};\theta )=\sum _{i=1}^{m}\log \sum _{j=1}^{k}P(x^{(i)},z^{(i)}=j;\theta )}$
By introducing an extra variable ${\displaystyle Q_{(j)}^{(i)}}$ we'll be able to apply Jensen's inequality to the concave log function.
Assume ${\displaystyle Q_{(j)}^{(i)}}$ (also written ${\displaystyle Q^{(i)}(j)}$) is, for each ${\displaystyle i}$, a probability mass function over ${\displaystyle j\in \{1,...,k\}}$.
${\displaystyle l(\theta )=\sum _{i=1}^{m}\log \sum _{j=1}^{k}P(x^{(i)},z^{(i)}=j;\theta )}$
${\displaystyle =\sum _{i=1}^{m}\log \sum _{j=1}^{k}{\frac {Q_{(j)}^{(i)}}{Q_{(j)}^{(i)}}}P(x^{(i)},z^{(i)}=j;\theta )}$
${\displaystyle \geq \sum _{i=1}^{m}\sum _{j=1}^{k}Q_{(j)}^{(i)}\log({\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q_{(j)}^{(i)}}})}$
The last step is Jensen's inequality for the concave ${\displaystyle \log }$ applied to the expectation over ${\displaystyle Q^{(i)}}$: ${\displaystyle \log \left[E_{Q}\left({\frac {P(x^{(i)},z^{(i)};\theta )}{Q^{(i)}(z^{(i)})}}\right)\right]\geq E_{Q}\left[\log \left({\frac {P(x^{(i)},z^{(i)};\theta )}{Q^{(i)}(z^{(i)})}}\right)\right]}$
Let ${\displaystyle J(\theta ,Q)=\sum _{i=1}^{m}\sum _{j=1}^{k}Q_{(j)}^{(i)}\log \left({\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q^{(i)}(j)}}\right)}$
The EM algorithm is an iterative algorithm which alternates between an E-Step and an M-Step.

##### E-Step

We will fix ${\displaystyle \theta }$ and maximize J wrt ${\displaystyle Q}$.
Jensen's inequality holds with equality iff the function is linear or the random variable is degenerate (constant).
To make the bound tight, we therefore require ${\displaystyle {\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q_{(j)}^{(i)}}}}$ to be constant in ${\displaystyle j}$.
This implies ${\displaystyle Q^{(i)}(j)=c\cdot P(x^{(i)},z^{(i)}=j;\theta )}$ for some constant ${\displaystyle c}$.
Since Q is a pmf, we have ${\displaystyle Q^{(i)}(j)={\frac {1}{P(x^{(i)};\theta )}}P(x^{(i)},z^{(i)}=j;\theta )=P(z^{(i)}=j\mid x^{(i)};\theta )}$
Q is updated with ${\displaystyle Q^{(i)}(j)={\frac {P(z^{(i)}=j)P(x^{(i)}|z^{(i)}=j)}{P(x^{(i)})}}}$
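
A sketch of this E-step for the spherical-Gaussian mixture above (function name and array shapes are my own; Q[i, j] holds ${\displaystyle Q^{(i)}(j)}$):

```python
import numpy as np

def e_step(X, phi, mu, sigma):
    m, n = X.shape
    k = len(phi)
    Q = np.zeros((m, k))
    for j in range(k):
        # Joint P(x^(i), z^(i) = j) = phi_j * N(x^(i); mu_j, sigma_j^2 I)
        sq_dist = np.sum((X - mu[j]) ** 2, axis=1)
        gauss = (2 * np.pi * sigma[j] ** 2) ** (-n / 2) * np.exp(-sq_dist / (2 * sigma[j] ** 2))
        Q[:, j] = phi[j] * gauss
    # Dividing by P(x^(i)) = sum_j P(x^(i), z^(i) = j) gives the posterior P(z^(i) = j | x^(i)).
    return Q / Q.sum(axis=1, keepdims=True)
```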

Maximization w.r.t Q

${\displaystyle J(\theta ,Q)=\sum _{i=1}^{m}\sum _{j=1}^{k}Q_{(j)}^{(i)}\log \left({\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q^{(i)}(j)}}\right)}$
We are assuming that Q is a pmf so ${\displaystyle \sum _{j}Q=1}$
Since each ${\displaystyle Q^{(i)}}$ can be maximized independently, our Lagrangian for a single ${\displaystyle i}$ is:
${\displaystyle \max _{Q}\min _{\beta }\left[\sum _{j=1}^{k}Q_{(j)}^{(i)}\log \left({\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q^{(i)}(j)}}\right)+\beta {\Big (}\sum _{j}Q_{(j)}^{(i)}-1{\Big )}\right]}$
Taking the derivative wrt ${\displaystyle Q_{(j)}^{(i)}}$ we get:
${\displaystyle {\frac {\partial }{\partial Q_{(j)}^{(i)}}}\left[\sum _{j=1}^{k}Q_{(j)}^{(i)}\log \left({\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q^{(i)}(j)}}\right)+\beta {\Big (}\sum _{j}Q_{(j)}^{(i)}-1{\Big )}\right]}$
${\displaystyle =\log \left({\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q_{(j)}^{(i)}}}\right)+Q_{(j)}^{(i)}\cdot {\frac {Q_{(j)}^{(i)}}{P(x^{(i)},z^{(i)}=j;\theta )}}\cdot \left(-{\frac {P(x^{(i)},z^{(i)}=j;\theta )}{(Q_{(j)}^{(i)})^{2}}}\right)+\beta }$
${\displaystyle =\log \left({\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q_{(j)}^{(i)}}}\right)-1+\beta =0}$
${\displaystyle \implies Q_{(j)}^{(i)}={\frac {1}{\exp(1-\beta )}}P(x^{(i)},z^{(i)}=j;\theta )}$
Since ${\displaystyle Q^{(i)}}$ is a pmf it must sum to 1 over ${\displaystyle j}$, which forces ${\displaystyle {\frac {1}{\exp(1-\beta )}}={\frac {1}{P(x^{(i)};\theta )}}}$, recovering the same update as before.

##### M-Step

We will fix ${\displaystyle Q}$ and maximize J wrt ${\displaystyle \theta }$.
${\displaystyle J(\theta ,Q)=\sum _{i=1}^{m}\sum _{j=1}^{k}Q_{(j)}^{(i)}\log \left({\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q^{(i)}(j)}}\right)}$
Assume ${\displaystyle \Sigma _{j}=I}$ for simplicity.
Then
${\displaystyle J(\theta ,Q)=\sum _{i=1}^{m}\sum _{j=1}^{k}Q_{(j)}^{(i)}\log \left({\frac {P(x^{(i)},z^{(i)}=j;\theta )}{Q^{(i)}(j)}}\right)}$
${\displaystyle =\sum _{i=1}^{m}\sum _{j=1}^{k}Q_{(j)}^{(i)}\log(P(x^{(i)},z^{(i)}=j;\theta ))+C_{1}}$
${\displaystyle =\sum _{i=1}^{m}\sum _{j=1}^{k}Q_{(j)}^{(i)}\log(P(x^{(i)}\mid z^{(i)}=j)P(z^{(i)}=j))+C_{1}}$
${\displaystyle =\sum _{i=1}^{m}\sum _{j=1}^{k}\left[Q_{(j)}^{(i)}\log(P(x^{(i)}\mid z^{(i)}=j))+Q_{(j)}^{(i)}\log(P(z^{(i)}=j))\right]+C_{1}}$
${\displaystyle =\sum _{i=1}^{m}\sum _{j=1}^{k}\left[Q_{(j)}^{(i)}\log((2\pi )^{-n/2}\exp(-\Vert x^{(i)}-\mu _{j}\Vert ^{2}/2))+Q_{(j)}^{(i)}\log(\phi _{j})\right]+C_{1}}$
${\displaystyle =\sum _{i=1}^{m}\sum _{j=1}^{k}\left[-Q_{(j)}^{(i)}\Vert x^{(i)}-\mu _{j}\Vert ^{2}/2+Q_{(j)}^{(i)}\log(\phi _{j})\right]+C_{2}}$
Maximizing wrt ${\displaystyle \mu }$, we get ${\displaystyle \mu _{j}^{*}=(\sum _{i}Q_{(j)}^{(i)}x^{(i)})/(\sum _{i}Q_{(j)}^{(i)})}$.
Maximizing wrt ${\displaystyle \phi }$ we get ${\displaystyle \phi _{j}^{*}={\frac {1}{m}}\sum _{i=1}^{m}Q_{(j)}^{(i)}}$
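
Those two closed-form updates in numpy (a sketch; Q is the (m, k) responsibility matrix from the E-step, and ${\displaystyle \Sigma _{j}=I}$ is held fixed as assumed above):

```python
import numpy as np

def m_step(X, Q):
    Nj = Q.sum(axis=0)              # sum_i Q^(i)(j), the effective size of cluster j
    mu = (Q.T @ X) / Nj[:, None]    # mu_j* = sum_i Q^(i)(j) x^(i) / sum_i Q^(i)(j)
    phi = Nj / X.shape[0]           # phi_j* = (1/m) sum_i Q^(i)(j)
    return mu, phi
```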

Maximization wrt phi

We want to maximize ${\displaystyle \sum _{i=1}^{m}\sum _{j=1}^{k}Q_{(j)}^{(i)}\log(\phi _{j})}$ subject to the constraint ${\displaystyle \sum _{j}\phi _{j}=1}$, since ${\displaystyle \phi _{j}=P(z^{(i)}=j)}$ is a pmf.
The Lagrangian for this is ${\displaystyle \max _{\phi _{1},...,\phi _{k}}\min _{\beta }\left[\sum _{i=1}^{m}\sum _{j=1}^{k}Q_{(j)}^{(i)}\log(\phi _{j})+\beta (\sum _{j=1}^{k}\phi _{j}-1)\right]}$
This can be rewritten as ${\displaystyle \max _{\phi _{1},...,\phi _{k}}\min _{\beta }\sum _{j=1}^{k}\left[\log(\phi _{j})\sum _{i=1}^{m}Q_{(j)}^{(i)}+\beta (\phi _{j}-1/k)\right]}$
The dual of this problem is ${\displaystyle \min _{\beta }\max _{\phi _{1},...,\phi _{k}}\sum _{j=1}^{k}\left[\log(\phi _{j})\sum _{i=1}^{m}Q_{(j)}^{(i)}+\beta (\phi _{j}-1/k)\right]}$
Taking the gradient w.r.t ${\displaystyle \phi }$, we get ${\displaystyle {\frac {1}{\phi _{j}}}\sum _{i}Q_{(j)}^{(i)}+\beta =0\implies \phi _{j}={\frac {-1}{\beta }}(\sum _{i}Q_{(j)}^{(i)})}$
Plugging this into our dual problem we get:
${\displaystyle \min _{\beta }\sum _{j=1}^{k}\left[\log({\frac {-1}{\beta }}(\sum _{i}Q_{(j)}^{(i)}))\sum _{i=1}^{m}Q_{(j)}^{(i)}+\beta ({\frac {-1}{\beta }}(\sum _{i}Q_{(j)}^{(i)})-1/k)\right]}$
${\displaystyle =\min _{\beta }\sum _{j=1}^{k}\left[\log({\frac {-1}{\beta }}(\sum _{i}Q_{(j)}^{(i)}))\sum _{i=1}^{m}Q_{(j)}^{(i)}-(\sum _{i}Q_{(j)}^{(i)})-(\beta /k)\right]}$
Taking the derivative with respect to ${\displaystyle \beta }$, we get:
${\displaystyle \sum _{j=1}^{k}[({\frac {1}{(-1/\beta )(\sum Q)}})(-\sum Q)(-\beta ^{-2})(\sum Q)-{\frac {1}{k}}]}$
${\displaystyle =\sum _{j=1}^{k}[(\beta )(-\beta ^{-2})(\sum Q)-{\frac {1}{k}}]}$
${\displaystyle =\sum _{j=1}^{k}[{\frac {-1}{\beta }}(\sum _{i=1}^{m}Q)-{\frac {1}{k}}]}$
${\displaystyle =[\sum _{i=1}^{m}{\frac {-1}{\beta }}\sum _{j=1}^{k}P(z^{(i)}=j|x^{(i)})-\sum _{j=1}^{k}{\frac {1}{k}}]}$
${\displaystyle =[{\frac {-1}{\beta }}\sum _{i=1}^{m}1-1]}$
${\displaystyle ={\frac {-m}{\beta }}-1=0}$
${\displaystyle \implies \beta =-m}$
Plugging in ${\displaystyle \beta =-m}$ into our equation for ${\displaystyle \phi _{j}}$ we get ${\displaystyle \phi _{j}={\frac {1}{m}}\sum _{i=1}^{m}Q_{(j)}^{(i)}}$
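
Tying everything together, a minimal soft K-means / EM loop under the same assumptions (unit spherical covariance held fixed, reusing the e_step and m_step sketches above):

```python
import numpy as np

def soft_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    phi = np.full(k, 1.0 / k)                        # start with uniform mixing weights
    mu = X[rng.choice(m, size=k, replace=False)]     # initialize means at k random points
    sigma = np.ones(k)                               # Sigma_j = I, held fixed
    for _ in range(n_iters):
        Q = e_step(X, phi, mu, sigma)                # E-step: Q^(i)(j) = P(z^(i) = j | x^(i))
        mu, phi = m_step(X, Q)                       # M-step: closed-form updates for mu and phi
    return phi, mu, Q
```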