Unsupervised Learning

Revision as of 19:41, 14 November 2019 by David (talk | contribs) (→‎EM Algorithm)
\( \newcommand{\P}[]{\unicode{xB6}} \newcommand{\AA}[]{\unicode{x212B}} \newcommand{\empty}[]{\emptyset} \newcommand{\O}[]{\emptyset} \newcommand{\Alpha}[]{Α} \newcommand{\Beta}[]{Β} \newcommand{\Epsilon}[]{Ε} \newcommand{\Iota}[]{Ι} \newcommand{\Kappa}[]{Κ} \newcommand{\Rho}[]{Ρ} \newcommand{\Tau}[]{Τ} \newcommand{\Zeta}[]{Ζ} \newcommand{\Mu}[]{\unicode{x039C}} \newcommand{\Chi}[]{Χ} \newcommand{\Eta}[]{\unicode{x0397}} \newcommand{\Nu}[]{\unicode{x039D}} \newcommand{\Omicron}[]{\unicode{x039F}} \DeclareMathOperator{\sgn}{sgn} \def\oiint{\mathop{\vcenter{\mathchoice{\huge\unicode{x222F}\,}{\unicode{x222F}}{\unicode{x222F}}{\unicode{x222F}}}\,}\nolimits} \def\oiiint{\mathop{\vcenter{\mathchoice{\huge\unicode{x2230}\,}{\unicode{x2230}}{\unicode{x2230}}{\unicode{x2230}}}\,}\nolimits} \)

Basics of Unsupervised Learning

Clustering

Given points \(\displaystyle \{x^{(1)},...,x^{(m)}\}\) we wish to group them into clusters 1,...,k.
Usually we are given the number of clusters, k.
The problem is to assign labels \(\displaystyle \{z^{(1)},...,z^{(m)}\}\) where \(\displaystyle z^{(i)} \in \{1,...,k\}\)

K-means

Here we wish to cluster them to minimize the distance to the mean of their cluster.
In our formulation we have k clusters.
The mean of each cluster \(\displaystyle \mu_i\) is called the centroid.</math>

Optimization

Let \(\displaystyle \mathbf{\mu}\) denote the centroids and let \(\displaystyle \mathbf{z}\) denote the cluster labels for our data.
Our loss function is \(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).

Using coordinate-block descent

\(\displaystyle L(\mu, \mathbf{z}) = \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2\).
If we fix \(\displaystyle \mu\) then we have only need to minimize \(\displaystyle z^{(i)}\).
Since each term of the sum is independent, we simply choose the closest centroid for each point
If we fix \(\displaystyle \mathbf{z}\) then we need to minimize \(\displaystyle L(\mu, \mathbf{z})\) wrt \(\displaystyle \mu\).
Taking the gradient and setting it to 0 we get:
\(\displaystyle \nabla_{\mu} L(\mu, \mathbf{z}) = \nabla_{\mu} \sum_{i} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2 \)
\(\displaystyle = \nabla_{\mu} \sum_{j=1}^{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{z^{(i)}} \Vert ^2 \)
\(\displaystyle = \nabla_{\mu} \sum_{j=1}^{k} \sum_{i\mid z(i)=j} \Vert x^{(i)} - \mu_{j} \Vert ^2 \)
\(\displaystyle = \sum_{j=1}^{k} \sum_{i\mid z(i)=j} \nabla_{\mu} \Vert x^{(i)} - \mu_{j} \Vert ^2 \)
\(\displaystyle = -\sum_{j=1}^{k} \sum_{i\mid z(i)=j} 2(x^{(i)} - \mu_{j}) = 0 \)
\(\displaystyle \implies \mu_{j} = (\sum_{i\mid z(i)=j} x^{(i)})/(\sum_{i\mid z(i)=j} 1) \quad \forall j \)

Notes

This procedure will yield us a sequence of parameters and losses.
\(\displaystyle \mu^{0}, z^{0}, \mu^1, z^1, ...\)
\(\displaystyle L(0) \geq L(1) \geq L(2) \geq ...\)

  • Since the loss is monotone decreasing and bounded below, it will converge by the monotone convergence theorem.
  • However, this does not imply that the parameters \(\displaystyle \mathbf{\mu}\) and \(\displaystyle \mathbf{z}\) will converge.

Algorithm

  1. Randomly initialize labels \(\displaystyle \mathbf{z}\).
  2. Then calculate the centroids \(\displaystyle \mathbf{\mu}\).
  3. Then update the labels for each example to the closest centroid.
  4. Update the centroids by taking the mean of every point in the cluster.
  5. Repeat steps 3 and 4

Soft K-means

We will develop a model for how our data is generated.
We will then find probabilities for each element being from a cluster (ala Bayesian paradigm).
Given \(\displaystyle k\) clusters, the probability of a point being from cluster k is \(\displaystyle \phi_k = P(z^{(i)} = k)\)
We will assume each cluster is from a normal distribution \(\displaystyle N(\mu_j, \sigma_j)\)

EM Algorithm

Expectation Maximization
The key idea is to introduce intermediate coefficients to apply Jensen's inequality.

Maximum Likelihood

Let \(\displaystyle \theta_j = [\phi_j, \mu_j, \sigma_j]\) denote our parameters.
The likelihood is \(\displaystyle L(\theta) = \Pi_{i=1}^{m} P(x^{(i)}; \theta)\) where \(\displaystyle P\) is our pdf/pmf.
Then the log-likelihood is \(\displaystyle l(\theta) = \sum_{i=1}^{m} \log P(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{j=1}^{k}P(x^{(i)}, z^{(i)}=j; \theta)\)
By introducing an extra variable \(\displaystyle Q^{(i)}_{(j)}\) we'll be able to apply Jensen's inequality to the concave log function.
Assume \(\displaystyle Q^{(i)}{(j)}\) is a probability mass function.
\(\displaystyle l(\theta) = \sum_{i=1}^{m} \log \sum_{j=1}^{k}P(x^{(i)}, z^{(i)}=j; \theta) \)
\(\displaystyle =\sum_{i=1}^{m} \log \sum_{j=1}^{k} \frac{Q^{(i)}_{(j)}}{Q^{(i)}_{(j)}} P(x^{(i)}, z^{(i)}=j; \theta) \)
\(\displaystyle \geq \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log(\frac{P(x^{(i)}, z^{(i)}=j; \theta) }{Q^{(i)}_{(j)}}) \)
\(\displaystyle \implies \log\left[E_{Q}\left(\frac{Pr(x^{(i)}, q^{(i)}; \theta)}{Q^{(i)}_{(j)}}\right)\right] \geq E_{Q} \left[ \log \left(\frac{Pr(x^{(i)}, q^{(i)}; \theta)}{Q^{(i)}(j)} \right)\right] \)
Let \(\displaystyle J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)\)
The EM algorithm is an iterative algorithm which alternates between an E-Step and an M-Step.

E-Step

We will fix \(\displaystyle \theta\) and maximize J wrt \(\displaystyle Q\).
Jensen's inequality holds with equality iff either the function is linear or if the random variable is degenerate.
We will assume \(\displaystyle \frac{P(x^{(i)}, z^{(i)}=j; \theta) }{Q^{(i)}_{(j)}}\) is a constant.
This implies \(\displaystyle Q^{(i)}(j) = c * P(x^{(i)}, z^{(i)} = j ; \theta)\).
Since Q is a pmf, we have \(\displaystyle Q^{(i)}(j) = \frac{1}{P(x^{(i)})} * P(x^{(i)}, z^{(i)} = j ; \theta) = P(z^{(i)} ; x^{(i)}, \theta)\)
Q is updated with \(\displaystyle Q^{(i)}(j) = \frac{Pr(z^{(i)}=j)Pr(x^{(i)}|z^{(i)}=j)}{Pr(x^{(i)})}\)

M-Step

We will fix \(\displaystyle Q\) and maximize J wrt \(\displaystyle \theta\) \(\displaystyle J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)\)
Assume \(\displaystyle \Sigma_j = I\) for simplicity.
Then
\(\displaystyle J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right) \)
\(\displaystyle = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( Pr(x^{(i)}, z^{(i)}=j;\theta)) + C_1 \)
\(\displaystyle = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( Pr(x^{(i)} \mid z^{(i)}=j) Pr(z^{(i)}=j)) + C_1 \)
\(\displaystyle = \sum_{i=1}^{m} \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( Pr(x^{(i)} \mid z^{(i)}=j)) - Q^{(i)}_{(j)} \log( Pr(z^{(i)}=j)) + C_1 \)
\(\displaystyle = \sum_{i=1}^{m}\left[ \sum_{j=1}^{m} Q^{(i)}_{(j)} \log ( (2\pi)^{-n/2}exp(-\Vert x^{(i)} + \mu_j \Vert^2 / 2)) - Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_1 \)
\(\displaystyle = \sum_{i=1}^{m}\left[ \sum_{j=1}^{m} Q^{(i)}_{(j)} -\Vert x^{(i)} - \mu_j \Vert^2 / 2) + Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_2 \)
Maximizing wrt \(\displaystyle \mu\), we get \(\displaystyle \mu_j^* = (\sum_{i} Q^{(i)}_{(i)}x^{(i)}) / (\sum_{i}Q^{(i)}_{(j)})\).
Maximizing wrt \(\displaystyle \phi\) we get \(\displaystyle \phi_j^* = \frac{1}{m} \sum_{i=1}^{m}Q^{(i)}_{(j)}\)

Maximization wrt phi

We want to maximize \(\displaystyle Q^{(i)}_{(j)} \log( \phi_j)\) subject to the condition \(\displaystyle \sum \phi_j = 1\) since \(\displaystyle \phi_j = P(z^{(i)}=j)\) is a pmf.
The lagrangian for this is \(\displaystyle \max_{\phi_1,...,\phi_k} \min_{\beta} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}Q^{(i)}_{(j)} \log( \phi_j) + \beta(\sum_{j=1}^{k}\phi_j - 1) \right]\)
This can be rewritten as \(\displaystyle \max_{\phi_1,...,\phi_k} \min_{\beta} \sum_{j=1}^{k}\left[\log( \phi_j) \sum_{i=1}^{m} Q^{(i)}_{(j)} + \beta(\phi_j - 1/k) \right]\)
The dual of this problem is \(\displaystyle \min_{\beta} \max_{\phi_1,...,\phi_k} \sum_{j=1}^{k} \left[\log( \phi_j) \sum_{i=1}^{m} Q^{(i)}_{(j)} + \beta(\phi_j - 1/k) \right]\)
Taking the gradient w.r.t \(\displaystyle \phi\), we get \(\displaystyle \frac{1}{\phi_j}\sum_{i}Q^{(i)}_{(j)} + \beta = 0 \implies \phi_j = \frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})\)
Plugging this into our dual problem we get:
\(\displaystyle \min_{\beta} \sum_{j=1}^{k} \left[\log(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})) \sum_{i=1}^{m} Q^{(i)}_{(j)} + \beta(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)}) - 1/k) \right]\)
\(\displaystyle = \min_{\beta} \sum_{j=1}^{k} \left[\log(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})) \sum_{i=1}^{m} Q^{(i)}_{(j)} -(\sum_{i}Q^{(i)}_{(j)}) - (\beta/k) \right]\)
Taking the derivative with respect to \(\displaystyle \beta\), we get:
\(\displaystyle (\frac{1}{(-1/\beta)(\sum Q)})(-\sum Q)(-\beta^{-2}) - \frac{1}{k} = (\beta)(-\beta^{-2}) - \frac{1}{k} = \frac{-1}{\beta} - \frac{1}{k} = 0 \)
\(\displaystyle \implies \beta = -k \)
Plugging in \(\displaystyle \beta = -k\) into our equation for \(\displaystyle \phi_j\) we get \(\displaystyle \phi_j = \frac{1}{k}\sum_{i=1}^{m}Q^{(i)}_{(j)}\)