;Probabilistic Model:
Suppose our dataset is <math>\{x_i\}_{i=1}^{n}</math> with <math>x_i \in \mathbb{R}^d</math>.
# Generate latent variables <math>z_1,...,z_n \in \mathbb{R}^r</math> where <math>r \ll d</math>.
# Assume <math>X=x_i \mid Z = z_i \sim N \left( g_{\theta}(z_i), \sigma^2 I \right)</math>.
#* Here <math>g_\theta</math> is called the ''generator'' or ''decoder'' function (a minimal sketch is given below).
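A minimal sketch of the generator in PyTorch. The two-layer MLP, the hidden width, the fixed <math>\sigma</math>, and the standard normal prior on <math>z</math> used for sampling are assumptions for illustration, not choices made in the notes: the network maps a latent <math>z \in \mathbb{R}^r</math> to the mean <math>g_{\theta}(z) \in \mathbb{R}^d</math>, and sampling <math>x \mid z</math> adds Gaussian noise with covariance <math>\sigma^2 I</math>.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Generator g_theta: maps a latent z in R^r to the mean of P(x|z) in R^d."""
    def __init__(self, r, d, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(r, hidden), nn.ReLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, z):
        return self.net(z)  # g_theta(z), the mean of P(x|z)

# Sampling from the model: z ~ N(0, I) (assumed prior), then x | z ~ N(g_theta(z), sigma^2 I).
r, d, sigma = 2, 784, 0.1                     # toy sizes, not from the notes
g_theta = Decoder(r, d)
z = torch.randn(16, r)                        # latent samples
x = g_theta(z) + sigma * torch.randn(16, d)   # samples of x | z
</syntaxhighlight>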
 
Q: How can we pick good model parameters <math>\theta</math>? 
Using maximum likelihood:
 
<math>
\begin{align*}
\max_{\theta} P(\{x_i\}; \theta) &= \max_{\theta} \prod_{i=1}^{n} P(x_i; \theta)\\
&\equiv \max_{\theta} \sum_{i=1}^{n} \log P_{\theta}(x_i) \quad \text{(log is monotone)}\\
&= \max_{\theta} \sum_{i=1}^{n} \log \left( \int_{z} P(z) P_{\theta}(x_i|z) dz \right)
\end{align*}
</math>
 
The integral inside the log is hard to compute. 
Instead, we derive a lower bound <math>J(\theta, \phi)</math> on the log-likelihood <math>\ell(\theta) = \sum_{i=1}^{n} \log P_{\theta}(x_i)</math> and maximize that lower bound:
<math> \max_{\theta} \ell(\theta) \geq \max_{\theta, \phi} J(\theta, \phi)</math>
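To see concretely why the direct objective is hard, here is a naive Monte Carlo sketch (the standard normal prior <math>P(z)=N(0,I)</math>, the sample count, and the helper name are assumptions for illustration): it estimates <math>\log P(x) = \log \int P(z) P_{\theta}(x|z) dz</math> by averaging <math>P_{\theta}(x|z_j)</math> over prior samples <math>z_j</math>.

<syntaxhighlight lang="python">
import math
import torch

def log_marginal_mc(x, g_theta, r, sigma, m=10_000):
    """Naive Monte Carlo estimate of log P(x) = log ∫ P(z) P(x|z) dz.

    Draws z_j ~ P(z) = N(0, I) and averages the Gaussian likelihoods
    P(x | z_j) = N(x; g_theta(z_j), sigma^2 I), computed in log space for stability.
    """
    d = x.shape[-1]
    z = torch.randn(m, r)                             # z_j ~ P(z)
    mean = g_theta(z)                                 # g_theta(z_j), shape (m, d)
    log_px_given_z = (-((x - mean) ** 2).sum(dim=-1) / (2 * sigma ** 2)
                      - 0.5 * d * math.log(2 * math.pi * sigma ** 2))
    # log (1/m) * sum_j P(x | z_j), via logsumexp
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(m)
</syntaxhighlight>

With the Decoder sketch above, <code>log_marginal_mc(x[0], g_theta, r, sigma)</code> gives one such estimate; the estimate is very noisy unless <math>m</math> is huge, especially when <math>d</math> is large, which is part of why we work with the variational lower bound instead.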
 
;ELBO / Variational lower bound:
<math>
\begin{align*}
&P(x_i | z) = \frac{P(z | x_i) P(x_i)}{P(z)}\\
\implies& \log P(x_i | z) + \log P(z) = \log P(z | x_i) + \log P(x_i)\\
\implies& E_{z \sim q_i}[\log P(x_i)] = E_{z \sim q_i}[ \log P(x_i | z) + \log P(z) - \log P(z | x_i)] \\
\implies& \log P(x_i) = E_{z \sim q_i}[\log P_{\theta}(x_i | z)] + E_q[\log P(z)] - E_q[\log P(z|x_i)] + (E_q[\log q_i(z)] - E_q[\log q_i(z)])\\
\implies& \log P(x_i) = E_{z \sim q_i}[\log P_{\theta}(x_i | z)] + (E_q[\log q_i(z)] - E_q[\log P(z|x_i)]) - (E_q[\log q_i(z)] - E_q[\log P(z)])\\
\implies& \log P(x_i) = E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] + KL \left(q_i \Vert P(z|x_i) \right) - KL \left(q_i \Vert P(z) \right)
\end{align*}
</math>
 
The second term, <math>KL \left(q_i \Vert P(z|x_i) \right)</math>, is hard to compute because it involves the intractable posterior <math>P(z|x_i)</math>, but KL divergence is always nonnegative, so dropping it gives a lower bound:
<math>\log P(x_i) \geq E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] - KL \left(q_i \Vert P(z) \right)</math>
 
;Optimization:
<math>\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] - KL \left(q_i \Vert P(z) \right)</math> 
where <math>q_i(z) = q_{\phi}(z|x_i) = N\left( f_{\phi}(x_i), \sigma^2 I \right)</math>. 
Here, <math>f_{\phi}(x)</math> is called the ''encoder''.
 
The claim is that <math>KL \left(q_i \Vert P(z) \right)</math> is easier to compute. Writing out the Gaussian log-densities (up to additive constants): 
<math>
\begin{align*}
&\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] - KL \left(q_i \Vert P(z) \right)\\
=&\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q_i} \left[ \log \exp(-\Vert x_i - g_{\theta}(z) \Vert^2 /(2\sigma^2)) - \log \exp(-\Vert z - f_{\phi}(x_i) \Vert^2 /(2\sigma^2)) \right]\\
=&\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q_i} \left[ -\Vert x_i - g_{\theta}(z) \Vert^2 /(2\sigma^2) + \Vert z - f_{\phi}(x_i) \Vert^2 /(2\sigma^2) \right]
\end{align*}
</math> 
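For reference (this is the standard closed form for Gaussians, not derived in the notes, and it assumes a standard normal prior <math>P(z) = N(0, I)</math>): with <math>q_i = N\left(f_{\phi}(x_i), \sigma^2 I\right)</math> over <math>\mathbb{R}^r</math>, the KL term can also be evaluated exactly, with no sampling: 
<math>
KL \left(q_i \Vert P(z) \right) = \frac{1}{2} \left( r\sigma^2 + \Vert f_{\phi}(x_i) \Vert^2 - r - r \log \sigma^2 \right)
</math>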
We use SGD to optimize <math>\theta, \phi</math>. 
To backpropagate through the sampling of <math>z</math>, we use the reparameterization trick: <math>z = \mu + \Sigma^{1/2}\epsilon</math> for <math>\epsilon \sim N(0, I)</math>, where here <math>\mu = f_{\phi}(x_i)</math> and <math>\Sigma = \sigma^2 I</math>.
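Putting the pieces together, a minimal PyTorch training-step sketch under the same assumptions (MLP encoder/decoder, fixed <math>\sigma</math>, standard normal prior; layer sizes and learning rate are illustrative, not from the notes). It uses the closed-form KL expression from above rather than the Monte Carlo form in the derivation.

<syntaxhighlight lang="python">
import math
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Encoder f_phi and decoder g_theta with a fixed noise level sigma."""
    def __init__(self, d, r, hidden=128, sigma=0.1):
        super().__init__()
        self.sigma = sigma
        # Encoder f_phi: x -> mean of q(z|x) = N(f_phi(x), sigma^2 I)
        self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, r))
        # Decoder g_theta: z -> mean of P(x|z) = N(g_theta(z), sigma^2 I)
        self.decoder = nn.Sequential(nn.Linear(r, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, x):
        mu = self.encoder(x)              # f_phi(x)
        eps = torch.randn_like(mu)        # eps ~ N(0, I)
        z = mu + self.sigma * eps         # reparameterization: z = mu + Sigma^{1/2} eps
        return self.decoder(z), mu        # g_theta(z) and f_phi(x)

def neg_elbo(x, x_hat, mu, sigma, r):
    # Reconstruction term: -E_q[log P_theta(x|z)], dropping additive constants
    recon = ((x - x_hat) ** 2).sum(dim=-1) / (2 * sigma ** 2)
    # KL(q_i || P(z)) in closed form for q_i = N(mu, sigma^2 I), P(z) = N(0, I)
    kl = 0.5 * (r * sigma ** 2 + (mu ** 2).sum(dim=-1) - r - r * math.log(sigma ** 2))
    return (recon + kl).mean()

# One SGD step on a placeholder batch (32 points in R^d)
d, r = 784, 2
model = VAE(d, r)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, d)                    # stand-in for a real data batch
x_hat, mu = model(x)
loss = neg_elbo(x, x_hat, mu, model.sigma, r)
opt.zero_grad()
loss.backward()
opt.step()
</syntaxhighlight>

Because <math>z</math> is written as a deterministic function of <math>\mu</math> and <math>\epsilon</math>, gradients flow through the sampling step back to <math>\phi</math>, which is the point of the reparameterization trick.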


==Misc==