Unsupervised Learning

</math><br>
<math>
\implies \log\left[E_{Q}\left(\frac{Pr(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}(z^{(i)})}\right)\right] \geq E_{Q} \left[ \log \left(\frac{Pr(x^{(i)}, z^{(i)}; \theta)}{Q^{(i)}(z^{(i)})} \right)\right]
</math><br>
Let <math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math>, where <math>k</math> is the number of mixture components.<br>
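As a quick numerical check (a minimal NumPy sketch; the joint probabilities below are made-up numbers rather than a fitted model), <math>J(\theta, Q)</math> never exceeds <math>\sum_i \log Pr(x^{(i)};\theta)</math>, and choosing <math>Q</math> as the posterior over <math>z</math> attains equality:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
joint = rng.uniform(0.05, 0.5, size=(3, 2))    # toy Pr(x^(i), z^(i)=j) for m=3 points, k=2 components
Q = rng.dirichlet(np.ones(2), size=3)          # an arbitrary pmf over j for each i

J = np.sum(Q * np.log(joint / Q))              # lower bound J(theta, Q)
loglik = np.sum(np.log(joint.sum(axis=1)))     # sum_i log Pr(x^(i); theta)
assert J <= loglik                             # Jensen's inequality

Q_post = joint / joint.sum(axis=1, keepdims=True)   # E-step choice: posterior Pr(z | x)
assert np.isclose(np.sum(Q_post * np.log(joint / Q_post)), loglik)
</syntaxhighlight>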
The EM algorithm is an iterative algorithm which alternates between an E-Step and an M-Step.
; E-Step
We will fix <math>\theta</math> and maximize <math>J</math> wrt <math>Q</math>.<br>
Jensen's inequality holds with equality iff either the function is linear or if the random variable is degenerate.<br>
We will assume <math>\frac{P(x^{(i)}, z^{(i)}=j; \theta)}{Q^{(i)}(j)}</math> is constant in <math>j</math>, so the random variable inside the expectation is degenerate and equality holds.<br>
This implies <math>Q^{(i)}(j) = c * P(x^{(i)}, z^{(i)} = j ; \theta)</math>.<br>
Since Q is a pmf, we have <math>Q^{(i)}(j) = \frac{1}{P(x^{(i)})} P(x^{(i)}, z^{(i)} = j ; \theta) = P(z^{(i)}=j \mid x^{(i)}; \theta)</math><br>
Q is updated with <math>Q^{(i)}(j) = \frac{Pr(z^{(i)}=j)Pr(x^{(i)}|z^{(i)}=j)}{Pr(x^{(i)})}</math>
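
A minimal sketch of this E-step for a Gaussian mixture with identity covariances (<math>\Sigma_j = I</math>), in NumPy; the names <code>X</code>, <code>mu</code>, <code>phi</code> and the array shapes are illustrative assumptions rather than notation from these notes:
<syntaxhighlight lang="python">
import numpy as np

def e_step(X, mu, phi):
    """Return Q with Q[i, j] = Pr(z^(i)=j | x^(i); theta) for identity-covariance Gaussians.

    X:   (m, n) data matrix
    mu:  (k, n) component means
    phi: (k,)   mixing weights, phi[j] = Pr(z = j)
    """
    # log[ Pr(z=j) Pr(x_i | z=j) ] up to the shared (2*pi)^(-n/2) factor, which cancels
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (m, k)
    log_joint = np.log(phi)[None, :] - 0.5 * sq_dist
    log_joint -= log_joint.max(axis=1, keepdims=True)               # numerical stability
    Q = np.exp(log_joint)
    return Q / Q.sum(axis=1, keepdims=True)                         # divide by Pr(x_i)
</syntaxhighlight>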
; M-Step
We will fix <math>Q</math> and maximize <math>J</math> wrt <math>\theta</math>.<br>
<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)</math><br>
Assume <math>\Sigma_j = I</math> for simplicity.<br>
Then<br>
<math>
J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log \left( \frac{Pr(x^{(i)}, z^{(i)}=j;\theta)}{Q^{(i)}(j)} \right)
</math><br>
<math>
= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( Pr(x^{(i)}, z^{(i)}=j;\theta)) + C_1
</math><br>
<math>
= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( Pr(x^{(i)} \mid z^{(i)}=j) Pr(z^{(i)}=j)) + C_1
</math><br>
<math>
= \sum_{i=1}^{m} \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( Pr(x^{(i)} \mid z^{(i)}=j)) + Q^{(i)}_{(j)} \log( Pr(z^{(i)}=j)) + C_1
</math><br>
<math>
= \sum_{i=1}^{m}\left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \log ( (2\pi)^{-n/2}\exp(-\Vert x^{(i)} - \mu_j \Vert^2 / 2)) + Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_1
</math><br>
<math>
= \sum_{i=1}^{m}\left[ \sum_{j=1}^{k} Q^{(i)}_{(j)} \left(-\Vert x^{(i)} - \mu_j \Vert^2 / 2\right) +  Q^{(i)}_{(j)} \log( \phi_j) \right]+ C_2
</math><br>
Here <math>C_1</math> and <math>C_2</math> collect terms which do not depend on <math>\theta</math>.<br>
Maximizing wrt <math>\mu</math>, we get <math>\mu_j^* = (\sum_{i} Q^{(i)}_{(j)}x^{(i)}) / (\sum_{i}Q^{(i)}_{(j)})</math>.<br>
Maximizing wrt <math>\phi</math>, we get <math>\phi_j^* = \frac{1}{m} \sum_{i=1}^{m}Q^{(i)}_{(j)}</math><br>
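A sketch of the corresponding M-step updates under the same assumptions, where <code>Q</code> is the (m, k) responsibility matrix produced by the E-step sketch above:
<syntaxhighlight lang="python">
import numpy as np

def m_step(X, Q):
    """Update means and mixing weights from responsibilities Q of shape (m, k)."""
    Nj = Q.sum(axis=0)                  # sum_i Q[i, j] for each component j
    mu = (Q.T @ X) / Nj[:, None]        # mu_j^* = sum_i Q[i,j] x^(i) / sum_i Q[i,j]
    phi = Nj / X.shape[0]               # phi_j^* = (1/m) sum_i Q[i,j]
    return mu, phi
</syntaxhighlight>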
{{hidden | Maximization wrt phi|
We want to maximize <math>\sum_{i=1}^{m}\sum_{j=1}^{k} Q^{(i)}_{(j)} \log( \phi_j)</math> subject to the condition <math>\sum_{j=1}^{k} \phi_j = 1</math> since <math>\phi_j = P(z^{(i)}=j)</math> is a pmf.<br>
The lagrangian for this is <math>\max_{\phi_1,...,\phi_k} \min_{\beta} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}Q^{(i)}_{(j)} \log( \phi_j) + \beta(\sum_{j=1}^{k}\phi_j - 1) \right]</math><br>
This can be rewritten as <math>\max_{\phi_1,...,\phi_k} \min_{\beta} \sum_{j=1}^{k}\left[\log( \phi_j) \sum_{i=1}^{m} Q^{(i)}_{(j)}  + \beta(\phi_j - 1/k) \right]</math><br>
The dual of this problem is <math> \min_{\beta} \sum_{j=1}^{k} \max_{\phi_j} \left[\log( \phi_j) \sum_{i=1}^{m} Q^{(i)}_{(j)}  + \beta(\phi_j - 1/k) \right]</math><br>
Taking the gradient w.r.t <math>\phi</math>, we get <math>\frac{1}{\phi_j}\sum_{i}Q^{(i)}_{(j)} + \beta = 0 \implies \phi_j = \frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})</math><br>
Plugging this into our dual problem we get:<br>
<math> \min_{\beta} \sum_{j=1}^{k} \left[\log(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})) \sum_{i=1}^{m} Q^{(i)}_{(j)}  + \beta(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)}) - 1/k) \right]</math><br>
<math>= \min_{\beta} \sum_{j=1}^{k} \left[\log(\frac{-1}{\beta}(\sum_{i}Q^{(i)}_{(j)})) \sum_{i=1}^{m} Q^{(i)}_{(j)}  -(\sum_{i}Q^{(i)}_{(j)}) - (\beta/k) \right]</math><br>
Taking the derivative with respect to <math>\beta</math>, we get:<br>
<math>
\sum_{j=1}^{k}\left[\left(\sum_{i=1}^{m} Q^{(i)}_{(j)}\right)\left(\frac{-1}{\beta}\right) - \frac{1}{k}\right] = \frac{-1}{\beta}\sum_{j=1}^{k}\sum_{i=1}^{m} Q^{(i)}_{(j)} - 1 = \frac{-m}{\beta} - 1 = 0
</math><br>
where the last step uses <math>\sum_{j=1}^{k} Q^{(i)}_{(j)} = 1</math> for each <math>i</math>, so the double sum equals <math>m</math>.<br>
<math>
\implies \beta = -m
</math><br>
Plugging <math>\beta = -m</math> into our equation for <math>\phi_j</math>, we get <math>\phi_j = \frac{1}{m}\sum_{i=1}^{m}Q^{(i)}_{(j)}</math>, which agrees with the expression for <math>\phi_j^*</math> above.
}}
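
Putting the two steps together, a minimal EM loop over the sketches above (the random initialization and fixed iteration count are arbitrary illustrative choices, not prescribed by these notes):
<syntaxhighlight lang="python">
import numpy as np

def fit_gmm(X, k, n_iter=100, seed=0):
    """Run EM for an identity-covariance GMM using the e_step/m_step sketches above."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(X.shape[0], size=k, replace=False)]   # init means at random data points
    phi = np.full(k, 1.0 / k)                                # uniform mixing weights
    for _ in range(n_iter):
        Q = e_step(X, mu, phi)       # E-step: fix theta, maximize J wrt Q
        mu, phi = m_step(X, Q)       # M-step: fix Q, maximize J wrt theta
    return mu, phi, Q
</syntaxhighlight>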