;Probabilistic Model:
Suppose our dataset is <math>\{x_i\}_{i=1}^{n}</math> with <math>x_i \in \mathbb{R}^d</math>.
# Generate latent variables <math>z_1,...,z_n \in \mathbb{R}^r</math> where <math>r \ll d</math>.
# Assume <math>X=x_i \mid Z = z_i \sim N \left( g_{\theta}(z_i), \sigma^2 I \right)</math>.
#* Here <math>g_\theta</math> is called the ''generator'' or ''decoder'' function (a minimal sketch is given below).
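A minimal sketch of the generator in PyTorch. The two-layer MLP, the hidden width, the fixed <math>\sigma</math>, and the standard normal prior on <math>z</math> used for sampling are assumptions for illustration, not choices made in the notes: the network maps a latent <math>z \in \mathbb{R}^r</math> to the mean <math>g_{\theta}(z) \in \mathbb{R}^d</math>, and sampling <math>x \mid z</math> adds Gaussian noise with covariance <math>\sigma^2 I</math>.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Generator g_theta: maps a latent z in R^r to the mean of P(x|z) in R^d."""
    def __init__(self, r, d, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(r, hidden), nn.ReLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, z):
        return self.net(z)  # g_theta(z), the mean of P(x|z)

# Sampling from the model: z ~ N(0, I) (assumed prior), then x | z ~ N(g_theta(z), sigma^2 I).
r, d, sigma = 2, 784, 0.1                     # toy sizes, not from the notes
g_theta = Decoder(r, d)
z = torch.randn(16, r)                        # latent samples
x = g_theta(z) + sigma * torch.randn(16, d)   # samples of x | z
</syntaxhighlight>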
 
Q: How can we pick good model parameters <math>\theta</math>? 
Using maximum likelihood:
 
<math>
\begin{align*}
\max_{\theta} P(\{x_i\}; \theta) &= \max_{\theta} \prod_{i=1}^{n} P(x_i; \theta)\\
&\equiv \max_{\theta} \sum_{i=1}^{n} \log P_{\theta}(x_i) \quad \text{(log is monotone)}\\
&= \max_{\theta} \sum_{i=1}^{n} \log \left( \int_{z} P(z) P_{\theta}(x_i|z) dz \right)
\end{align*}
</math>
 
The integral inside the log is hard to compute. 
Instead, we derive a lower bound <math>J(\theta, \phi)</math> on the log-likelihood <math>\ell(\theta) = \sum_{i=1}^{n} \log P_{\theta}(x_i)</math> and maximize that lower bound:
<math> \max_{\theta} \ell(\theta) \geq \max_{\theta, \phi} J(\theta, \phi)</math>
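To see concretely why the direct objective is hard, here is a naive Monte Carlo sketch (the standard normal prior <math>P(z)=N(0,I)</math>, the sample count, and the helper name are assumptions for illustration): it estimates <math>\log P(x) = \log \int P(z) P_{\theta}(x|z) dz</math> by averaging <math>P_{\theta}(x|z_j)</math> over prior samples <math>z_j</math>.

<syntaxhighlight lang="python">
import math
import torch

def log_marginal_mc(x, g_theta, r, sigma, m=10_000):
    """Naive Monte Carlo estimate of log P(x) = log ∫ P(z) P(x|z) dz.

    Draws z_j ~ P(z) = N(0, I) and averages the Gaussian likelihoods
    P(x | z_j) = N(x; g_theta(z_j), sigma^2 I), computed in log space for stability.
    """
    d = x.shape[-1]
    z = torch.randn(m, r)                             # z_j ~ P(z)
    mean = g_theta(z)                                 # g_theta(z_j), shape (m, d)
    log_px_given_z = (-((x - mean) ** 2).sum(dim=-1) / (2 * sigma ** 2)
                      - 0.5 * d * math.log(2 * math.pi * sigma ** 2))
    # log (1/m) * sum_j P(x | z_j), via logsumexp
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(m)
</syntaxhighlight>

With the Decoder sketch above, <code>log_marginal_mc(x[0], g_theta, r, sigma)</code> gives one such estimate; the estimate is very noisy unless <math>m</math> is huge, especially when <math>d</math> is large, which is part of why we work with the variational lower bound instead.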
 
;ELBO / Variational lower bound:
<math>
\begin{align*}
&P(x_i | z) = \frac{P(z | x_i) P(x_i)}{P(z)}\\
\implies& \log P(x_i | z) + \log P(z) = \log P(z | x_i) + \log P(x_i)\\
\implies& E_{z \sim q_i}[\log P(x_i)] = E_{z \sim q_i}[ \log P(x_i | z) + \log P(z) - \log P(z | x_i)] \\
\implies& \log P(x_i) = E_{z \sim q_i}[\log P_{\theta}(x_i | z)] + E_q[\log P(z)] - E_q[\log P(z|x_i)] + (E_q[\log q_i(z)] - E_q[\log q_i(z)])\\
\implies& \log P(x_i) = E_{z \sim q_i}[\log P_{\theta}(x_i | z)] + (E_q[\log q_i(z)] - E_q[\log P(z|x_i)]) - (E_q[\log q_i(z)] - E_q[\log P(z)])\\
\implies& \log P(x_i) = E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] + KL \left(q_i \Vert P(z|x_i) \right) - KL \left(q_i \Vert P(z) \right)
\end{align*}
</math>
 
The second term, <math>KL \left(q_i \Vert P(z|x_i) \right)</math>, is hard to compute because it involves the intractable posterior <math>P(z|x_i)</math>, but KL divergence is always nonnegative, so dropping it gives a lower bound:
<math>\log P(x_i) \geq E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] - KL \left(q_i \Vert P(z) \right)</math>
 
;Optimization:
<math>\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] - KL \left(q_i \Vert P(z) \right)</math> 
where <math>q_i(z) = q_{\phi}(z|x_i) = N\left( f_{\phi}(x_i), \sigma^2 I \right)</math>. 
Here, <math>f_{\phi}(x)</math> is called the ''encoder''.
 
The claim is that <math>KL \left(q_i \Vert P(z) \right)</math> is easier to compute. Writing out the Gaussian log-densities (up to additive constants): 
<math>
\begin{align*}
&\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] - KL \left(q_i \Vert P(z) \right)\\
=&\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q_i} \left[ \log \exp(-\Vert x_i - g_{\theta}(z) \Vert^2 /(2\sigma^2)) - \log \exp(-\Vert z - f_{\phi}(x_i) \Vert^2 /(2\sigma^2)) \right]\\
=&\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q_i} \left[ -\Vert x_i - g_{\theta}(z) \Vert^2 /(2\sigma^2) + \Vert z - f_{\phi}(x_i) \Vert^2 /(2\sigma^2) \right]
\end{align*}
</math> 
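For reference (this is the standard closed form for Gaussians, not derived in the notes, and it assumes a standard normal prior <math>P(z) = N(0, I)</math>): with <math>q_i = N\left(f_{\phi}(x_i), \sigma^2 I\right)</math> over <math>\mathbb{R}^r</math>, the KL term can also be evaluated exactly, with no sampling: 
<math>
KL \left(q_i \Vert P(z) \right) = \frac{1}{2} \left( r\sigma^2 + \Vert f_{\phi}(x_i) \Vert^2 - r - r \log \sigma^2 \right)
</math>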
We use SGD to optimize <math>\theta, \phi</math>. 
To backpropagate through the sampling of <math>z</math>, we use the reparameterization trick: <math>z = \mu + \Sigma^{1/2}\epsilon</math> for <math>\epsilon \sim N(0, I)</math>, where here <math>\mu = f_{\phi}(x_i)</math> and <math>\Sigma = \sigma^2 I</math>.
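Putting the pieces together, a minimal PyTorch training-step sketch under the same assumptions (MLP encoder/decoder, fixed <math>\sigma</math>, standard normal prior; layer sizes and learning rate are illustrative, not from the notes). It uses the closed-form KL expression from above rather than the Monte Carlo form in the derivation.

<syntaxhighlight lang="python">
import math
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Encoder f_phi and decoder g_theta with a fixed noise level sigma."""
    def __init__(self, d, r, hidden=128, sigma=0.1):
        super().__init__()
        self.sigma = sigma
        # Encoder f_phi: x -> mean of q(z|x) = N(f_phi(x), sigma^2 I)
        self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, r))
        # Decoder g_theta: z -> mean of P(x|z) = N(g_theta(z), sigma^2 I)
        self.decoder = nn.Sequential(nn.Linear(r, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, x):
        mu = self.encoder(x)              # f_phi(x)
        eps = torch.randn_like(mu)        # eps ~ N(0, I)
        z = mu + self.sigma * eps         # reparameterization: z = mu + Sigma^{1/2} eps
        return self.decoder(z), mu        # g_theta(z) and f_phi(x)

def neg_elbo(x, x_hat, mu, sigma, r):
    # Reconstruction term: -E_q[log P_theta(x|z)], dropping additive constants
    recon = ((x - x_hat) ** 2).sum(dim=-1) / (2 * sigma ** 2)
    # KL(q_i || P(z)) in closed form for q_i = N(mu, sigma^2 I), P(z) = N(0, I)
    kl = 0.5 * (r * sigma ** 2 + (mu ** 2).sum(dim=-1) - r - r * math.log(sigma ** 2))
    return (recon + kl).mean()

# One SGD step on a placeholder batch (32 points in R^d)
d, r = 784, 2
model = VAE(d, r)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, d)                    # stand-in for a real data batch
x_hat, mu = model(x)
loss = neg_elbo(x, x_hat, mu, model.sigma, r)
opt.zero_grad()
loss.backward()
opt.step()
</syntaxhighlight>

Because <math>z</math> is written as a deterministic function of <math>\mu</math> and <math>\epsilon</math>, gradients flow through the sampling step back to <math>\phi</math>, which is the point of the reparameterization trick.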


==Misc==