Unsupervised Learning: Difference between revisions

From David's Wiki
Line 19: Line 19:
Taking the gradient and setting it to 0 we get:<br>
Taking the gradient and setting it to 0 we get:<br>
<math>
<math>
\nabla L(\mu, \mathbf{z}) &= \nabla \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2
\nabla L(\mu, \mathbf{z}) = \nabla \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2
</math><br>
</math><br>
<math>
<math>
&= \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2
= \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2
</math><br>
</math><br>
<math>
<math>
&= \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{j} \Vert ^2
= \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{j} \Vert ^2
</math><br>
</math><br>
<math>
<math>
&= \sum_{j=1}{k} \sum_{i\mid z(i)=j}  \nabla  \Vert x^{(i)} - \mu_{j} \Vert ^2
= \sum_{j=1}{k} \sum_{i\mid z(i)=j}  \nabla  \Vert x^{(i)} - \mu_{j} \Vert ^2
</math><br>
</math><br>
<math>
<math>
&=  \sum_{j=1}{k} \sum_{i\mid z(i)=j} \nabla \Vert x^{(i)} - \mu_{j} \Vert ^2
=  \sum_{j=1}{k} \sum_{i\mid z(i)=j} \nabla \Vert x^{(i)} - \mu_{j} \Vert ^2
</math><br>
</math><br>
<math>
<math>
&=  \sum_{j=1}{k} \sum_{i\mid z(i)=j} 2(x^{(i)} - \mu_{j})
=  \sum_{j=1}{k} \sum_{i\mid z(i)=j} 2(x^{(i)} - \mu_{j})
</math><br>
</math><br>
<math>
<math>
\implies \mu_{j} &= (\sum_{i\mid z(i)=j} x^{(i)})/(\sum_{i\mid z(i)=j} 1) \quad \forall j
\implies \mu_{j} = (\sum_{i\mid z(i)=j} x^{(i)})/(\sum_{i\mid z(i)=j} 1) \quad \forall j
</math>
</math>



Revision as of 17:56, 12 November 2019

Basics of Unsupervised Learning

Clustering

Given points \(\displaystyle \{x^{(1)},...,x^{(m)}\}\) we wish to group them into clusters 1,...,k.
Usually we are given the number of clusters, k.
The problem is to assign labels \(\displaystyle \{z^{(1)},...,z^{(m)}\}\) where \(\displaystyle z^{(i)} \in \{1,...,k\}\)

K-means

Here we wish to cluster them to minimize the distance to the mean of their cluster.
In our formulation we have k clusters.
The mean of each cluster \(\displaystyle \mu_i\) is called the centroid.</math>

Optimization

Let \(\displaystyle \mathbf{\mu}\) denote the centroids and let \(\displaystyle \mathbf{z}\) denote the cluster labels for our data.
Our loss function is \(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).

Using coordinate-block descent

\(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).
If we fix \(\displaystyle \mu\) then we have only need to minimize \(\displaystyle z^{(i)}\).
Since each term of the sum is independent, we simply choose the closest centroid for each point
If we fix \(\displaystyle \mathbf{z}\) then we need to minimize \(\displaystyle L(\mu, \mathbf{z})\) wrt \(\displaystyle \mu\).
Taking the gradient and setting it to 0 we get:
\(\displaystyle \nabla L(\mu, \mathbf{z}) = \nabla \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2 \)
\(\displaystyle = \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2 \)
\(\displaystyle = \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{j} \Vert ^2 \)
\(\displaystyle = \sum_{j=1}{k} \sum_{i\mid z(i)=j} \nabla \Vert x^{(i)} - \mu_{j} \Vert ^2 \)
\(\displaystyle = \sum_{j=1}{k} \sum_{i\mid z(i)=j} \nabla \Vert x^{(i)} - \mu_{j} \Vert ^2 \)
\(\displaystyle = \sum_{j=1}{k} \sum_{i\mid z(i)=j} 2(x^{(i)} - \mu_{j}) \)
\(\displaystyle \implies \mu_{j} = (\sum_{i\mid z(i)=j} x^{(i)})/(\sum_{i\mid z(i)=j} 1) \quad \forall j \)

Notes

This procedure will yield us a sequence of parameters and losses.
\(\displaystyle \mu^{0}, z^{0}, \mu^1, z^1, ...\)
\(\displaystyle L(0) \geq L(1) \geq L(2) \geq ...\)

  • Since the loss is monotone decreasing and bounded below, it will converge by the monotone convergence theorem.
  • However, this does not imply that the parameters \(\displaystyle \mathbf{\mu}\) and \(\displaystyle \mathbf{z}\) will converge.

Algorithm

  1. Randomly initialize labels \(\displaystyle \mathbf{z}\).
  2. Then calculate the centroids \(\displaystyle \mathbf{\mu}\).
  3. Then update the labels for each example to the closest centroid.
  4. Update the centroids by taking the mean of every point in the cluster.
  5. Repeat steps 3 and 4

Soft K-means

We will develop a model for how our data is generated:
Given \(\displaystyle k\) clusters, the probability of a point being from cluster k is \(\displaystyle \phi_k = P(z^{(i)} = k)\)