Unsupervised Learning

Revision as of 17:46, 12 November 2019 by David (talk | contribs) (→‎Optimization)
\( \newcommand{\P}[]{\unicode{xB6}} \newcommand{\AA}[]{\unicode{x212B}} \newcommand{\empty}[]{\emptyset} \newcommand{\O}[]{\emptyset} \newcommand{\Alpha}[]{Α} \newcommand{\Beta}[]{Β} \newcommand{\Epsilon}[]{Ε} \newcommand{\Iota}[]{Ι} \newcommand{\Kappa}[]{Κ} \newcommand{\Rho}[]{Ρ} \newcommand{\Tau}[]{Τ} \newcommand{\Zeta}[]{Ζ} \newcommand{\Mu}[]{\unicode{x039C}} \newcommand{\Chi}[]{Χ} \newcommand{\Eta}[]{\unicode{x0397}} \newcommand{\Nu}[]{\unicode{x039D}} \newcommand{\Omicron}[]{\unicode{x039F}} \DeclareMathOperator{\sgn}{sgn} \def\oiint{\mathop{\vcenter{\mathchoice{\huge\unicode{x222F}\,}{\unicode{x222F}}{\unicode{x222F}}{\unicode{x222F}}}\,}\nolimits} \def\oiiint{\mathop{\vcenter{\mathchoice{\huge\unicode{x2230}\,}{\unicode{x2230}}{\unicode{x2230}}{\unicode{x2230}}}\,}\nolimits} \)

Basics of Unsupervised Learning

Clustering

Given points \(\displaystyle \{x^{(1)},...,x^{(m)}\}\) we wish to group them into clusters 1,...,k.
Usually we are given the number of clusters, k.
The problem is to assign labels \(\displaystyle \{z^{(1)},...,z^{(m)}\}\) where \(\displaystyle z^{(i)} \in \{1,...,k\}\)

K-means

Here we wish to cluster them to minimize the distance to the mean of their cluster.
In our formulation we have k clusters.
The mean of each cluster \(\displaystyle \mu_i\) is called the centroid.</math>

Optimization

Let \(\displaystyle \mathbf{\mu}\) denote the centroids and let \(\displaystyle \mathbf{z}\) denote the cluster labels for our data.
Our loss function is \(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).

Using coordinate-block descent

\(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).
If we fix \(\displaystyle \mu\) then we have only need to minimize \(\displaystyle z^{(i)}\).
Since each term of the sum is independent, we simply choose the closest centroid for each point
If we fix \(\displaystyle \mathbf{z}\) then we need to minimize \(\displaystyle L(\mu, \mathbf{z})\) wrt \(\displaystyle \mu\).
Taking the gradient and setting it to 0 we get:
\(\displaystyle \begin{align} \nabla L(\mu, \mathbf{z}) &= \nabla \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\ &= \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\\ &= \nabla \sum_{j=1}{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{j} \Vert ^2\\ &= \sum_{j=1}{k} \sum_{i\mid z(i)=j} \nabla \Vert x^{(i)} - \mu_{j} \Vert ^2\\ &= \sum_{j=1}{k} \sum_{i\mid z(i)=j} \nabla \Vert x^{(i)} - \mu_{j} \Vert ^2\\ &= \sum_{j=1}{k} \sum_{i\mid z(i)=j} 2(x^{(i)} - \mu_{j})\\ \implies \mu_{j} &= (\sum_{i\mid z(i)=j} x^(i))/(\sum_{i\mid z(i)=j} 1) \quad \forall j \end{align} \)

Algorithm

Soft K-means

We will develop a model for how our data is generated:
Given \(\displaystyle k\) clusters, the probability of a point being from cluster k is \(\displaystyle \phi_k = P(z^{(i)} = k)\)