Unsupervised Learning

Basics of Unsupervised Learning

Clustering

Given points \(\displaystyle \{x^{(1)},...,x^{(m)}\}\) we wish to group them into clusters 1,...,k.
Usually we are given the number of clusters, k.
The problem is to assign labels \(\displaystyle \{z^{(1)},...,z^{(m)}\}\) where \(\displaystyle z^{(i)} \in \{1,...,k\}\)

K-means

Here we wish to cluster them to minimize the distance to the mean of their cluster.
In our formulation we have k clusters.
The mean of each cluster \(\displaystyle \mu_i\) is called the centroid.</math>

Optimization

Let \(\displaystyle \mathbf{\mu}\) denote the centroids and let \(\displaystyle \mathbf{z}\) denote the cluster labels for our data.
Our loss function is \(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).

Using coordinate-block descent

\(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).
If we fix \(\displaystyle \mu\) then we have only need to minimize \(\displaystyle z^{(i)}\).
Since each term of the sum is independent, we simply choose the closest centroid for each point
If we fix \(\displaystyle \mathbf{z}\) then we need to minimize \(\displaystyle L(\mu, \mathbf{z})\) wrt \(\displaystyle \mu\).
Taking the gradient and setting it to 0 we get:
\(\displaystyle kmo \begin{align} \nabla L(\mu, \mathbf{z}) &= \nabla \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\ &= \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\ &= \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{j} \Vert ^2\\ &= \sum_{j=1}{k} \sum_{i\mid z(i)=j} \nabla \Vert x^{(i)} - \mu_{j} \Vert ^2\\ &= \sum_{j=1}{k} \sum_{i\mid z(i)=j} \nabla \Vert x^{(i)} - \mu_{j} \Vert ^2\\ &= \sum_{j=1}{k} \sum_{i\mid z(i)=j} 2(x^{(i)} - \mu_{j})\\ \implies \mu_{j} &= (\sum_{i\mid z(i)=j} x^{(i)})/(\sum_{i\mid z(i)=j} 1) \quad \forall j \end{align} \)

Notes

This procedure will yield us a sequence of parameters and losses.
\(\displaystyle \mu^{0}, z^{0}, \mu^1, z^1, ...\)
\(\displaystyle L(0) \geq L(1) \geq L(2) \geq ...\)

Since the loss is monotone decreasing and bounded below, it will converge by the monotone convergence theorem.
However, this does not imply that the parameters \(\displaystyle \mathbf{\mu}\) and \(\displaystyle \mathbf{z}\) will converge.

Algorithm

Soft K-means

We will develop a model for how our data is generated:
Given \(\displaystyle k\) clusters, the probability of a point being from cluster k is \(\displaystyle \phi_k = P(z^{(i)} = k)\)