Unsupervised Learning

From David's Wiki
Revision as of 10:13, 2 December 2019

Basics of Unsupervised Learning

Clustering

Given points <math>x^{(1)},\dots,x^{(m)} \in \mathbb{R}^n</math> we wish to group them into clusters 1,...,k.
Usually we are given the number of clusters, k.
The problem is to assign labels <math>c^{(1)},\dots,c^{(m)} \in \{1,\dots,k\}</math> where points with the same label belong to the same cluster.

K-means

Here we wish to cluster points to minimize the distance to the mean of their cluster.
In our formulation we have k clusters.
The mean of each cluster is called the centroid.

Optimization

Let <math>\mu_1,\dots,\mu_k</math> denote the centroids and let <math>c^{(1)},\dots,c^{(m)}</math> denote the cluster labels for our data.
Our loss function is <math>J(c, \mu) = \sum_{i=1}^{m} \Vert x^{(i)} - \mu_{c^{(i)}} \Vert^2</math>.

Using coordinate-block descent:

  • If we fix <math>\mu</math> then we only need to minimize over the labels <math>c</math>. Since each term of the sum is independent, we simply choose the closest centroid for each point: <math>c^{(i)} = \operatorname{argmin}_j \Vert x^{(i)} - \mu_j \Vert^2</math>.
  • If we fix <math>c</math> then we need to minimize wrt <math>\mu</math>. Taking the gradient and setting it to 0 we get <math>\mu_j = \frac{1}{|\{i : c^{(i)} = j\}|} \sum_{i : c^{(i)} = j} x^{(i)}</math>, the mean of each cluster.

Notes

This procedure will yield a sequence of parameters and losses <math>(\mu^{(t)}, c^{(t)}, J^{(t)})</math>.

  • Since the loss is monotone decreasing and bounded below, it will converge by the monotone convergence theorem.
  • However, this does not imply that the parameters <math>\mu</math> and <math>c</math> will converge.

Algorithm

  1. Randomly initialize the labels <math>c^{(1)},\dots,c^{(m)}</math>.
  2. Then calculate the centroids <math>\mu_1,\dots,\mu_k</math>.
  3. Then update the labels for each example to the closest centroid.
  4. Update the centroids by taking the mean of every point in the cluster.
  5. Repeat steps 3 and 4 until convergence.
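The steps above can be sketched in Python with NumPy. This is my illustration, not code from the wiki; the `init_labels` parameter is an addition for deterministic initialization, and the sketch assumes no cluster becomes empty.

```python
import numpy as np

# Sketch of the k-means algorithm above (an illustration, not the wiki's code).
# Assumes every cluster stays non-empty during the iterations.
def kmeans(x, k, init_labels=None, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m = x.shape[0]
    # Step 1: randomly initialize labels (or use the ones provided).
    labels = rng.integers(0, k, size=m) if init_labels is None else init_labels.copy()
    centroids = None
    for _ in range(iters):
        # Steps 2/4: each centroid is the mean of the points in its cluster.
        centroids = np.stack([x[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: relabel each point with its closest centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # step 5: stop at convergence
            break
        labels = new_labels
    return labels, centroids
```

On two well-separated blobs this converges in a couple of iterations, but as noted above, convergence of the loss does not guarantee a unique set of parameters.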

Soft K-means

We will develop a model for how our data is generated.
We will then find probabilities for each element being from a cluster (à la the Bayesian paradigm).
Given <math>k</math> clusters, the probability of a point being from cluster <math>j</math> is <math>\phi_j</math>.
We will assume each cluster is drawn from a normal distribution <math>N(\mu_j, \sigma_j^2)</math>.

EM Algorithm

Expectation Maximization
The key idea is to introduce intermediate coefficients to apply Jensen's inequality.

Maximum Likelihood

Let <math>\theta</math> denote our parameters.
The likelihood is <math>L(\theta) = \prod_{i=1}^{m} P(x^{(i)}; \theta)</math> where <math>P</math> is our pdf/pmf.
Then the log-likelihood is <math>\ell(\theta) = \sum_{i=1}^{m} \log P(x^{(i)}; \theta)</math>.
By introducing an extra variable <math>z</math> we'll be able to apply Jensen's inequality to the concave log function.
Assume <math>Q_i(z)</math> is a probability mass function.
Then
<math>\ell(\theta) = \sum_{i=1}^{m} \log \sum_{z} Q_i(z) \frac{P(x^{(i)}, z; \theta)}{Q_i(z)} \geq \sum_{i=1}^{m} \sum_{z} Q_i(z) \log \frac{P(x^{(i)}, z; \theta)}{Q_i(z)}</math>
Let <math>J(\theta, Q)</math> denote the lower bound on the right.
The EM algorithm is an iterative algorithm which alternates between an E-Step and an M-Step.
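Jensen's inequality for the concave log (the log of an average is at least the average of the logs) can be checked numerically; a small sketch of my own:

```python
import numpy as np

# Jensen's inequality for the concave log: log(E[X]) >= E[log(X)].
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=1000)  # positive samples so log is defined
lhs = np.log(x.mean())                 # log of the average
rhs = np.log(x).mean()                 # average of the logs
assert lhs >= rhs                      # equality only if X is degenerate
```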

E-Step

We will fix <math>\theta</math> and maximize J wrt <math>Q</math>.
Jensen's inequality holds with equality iff either the function is linear or if the random variable is degenerate.
We will assume <math>\frac{P(x^{(i)}, z; \theta)}{Q_i(z)}</math> is a constant.
This implies <math>Q_i(z) = c \cdot P(x^{(i)}, z; \theta)</math>.
Since Q is a pmf, we have <math>c = \frac{1}{\sum_{z'} P(x^{(i)}, z'; \theta)} = \frac{1}{P(x^{(i)}; \theta)}</math>.
Q is updated with <math>Q_i(z) = P(z \vert x^{(i)}; \theta)</math>.

Maximization w.r.t Q


We are assuming that Q is a pmf so <math>\sum_z Q_i(z) = 1</math>.
Our lagrangian is:
<math>\max_{Q_i} \min_{\beta} \left[ \sum_z Q_i(z) \log \frac{P(x^{(i)}, z; \theta)}{Q_i(z)} + \beta \left( \sum_z Q_i(z) - 1 \right) \right]</math>
We can maximize each <math>Q_i</math> independently.
Taking the derivative wrt <math>Q_i(z)</math> we get:
<math>\log \frac{P(x^{(i)}, z; \theta)}{Q_i(z)} - 1 + \beta = 0 \implies Q_i(z) = e^{\beta - 1} P(x^{(i)}, z; \theta)</math>
Since Q is a pmf, we know it sums to 1, so we get the same result replacing <math>e^{\beta - 1}</math> with <math>\frac{1}{\sum_{z'} P(x^{(i)}, z'; \theta)}</math>, i.e. <math>Q_i(z) = P(z \vert x^{(i)}; \theta)</math>.

M-Step

We will fix <math>Q</math> and maximize J wrt <math>\theta = (\phi, \mu, \sigma)</math>.
Assume <math>\Sigma_j = \sigma^2 I</math> for simplicity.
Then
<math>J(\theta, Q) = \sum_{i=1}^{m} \sum_{j=1}^{k} Q_i(j) \log \frac{\phi_j N(x^{(i)}; \mu_j, \sigma^2 I)}{Q_i(j)}</math>
Maximizing wrt <math>\mu_j</math>, we get <math>\mu_j = \frac{\sum_{i} Q_i(j) x^{(i)}}{\sum_{i} Q_i(j)}</math>.
Maximizing wrt <math>\sigma^2</math>, we get <math>\sigma^2 = \frac{1}{mn} \sum_{i} \sum_{j} Q_i(j) \Vert x^{(i)} - \mu_j \Vert^2</math>.

Maximization wrt phi

We want to maximize <math>\sum_{i=1}^{m} \sum_{j=1}^{k} Q_i(j) \log \phi_j</math> subject to the condition <math>\sum_{j} \phi_j = 1</math> since <math>\phi</math> is a pmf.
The lagrangian for this is
<math>\max_{\phi} \min_{\beta} \left[ \sum_{i} \sum_{j} Q_i(j) \log \phi_j + \beta \left( \sum_{j} \phi_j - 1 \right) \right]</math>
The dual of this problem is
<math>\min_{\beta} \max_{\phi} \left[ \sum_{i} \sum_{j} Q_i(j) \log \phi_j + \beta \left( \sum_{j} \phi_j - 1 \right) \right]</math>
Taking the gradient w.r.t <math>\phi_j</math>, we get
<math>\frac{1}{\phi_j} \sum_{i} Q_i(j) + \beta = 0 \implies \phi_j = -\frac{1}{\beta} \sum_{i} Q_i(j)</math>
Plugging this into our dual problem and using <math>\sum_{i} \sum_{j} Q_i(j) = m</math>, we get:
<math>\min_{\beta} \left[ \sum_{i} \sum_{j} Q_i(j) \log \left( -\frac{1}{\beta} \sum_{i'} Q_{i'}(j) \right) - m - \beta \right]</math>
Taking the derivative with respect to <math>\beta</math>, we get:
<math>-\frac{m}{\beta} - 1 = 0 \implies \beta = -m</math>
Plugging <math>\beta = -m</math> into our equation for <math>\phi_j</math>, we get <math>\phi_j = \frac{1}{m} \sum_{i} Q_i(j)</math>.
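Putting the E-step and M-step updates together for a mixture of spherical Gaussians gives the following sketch (my illustration, not the wiki's code; <math>\sigma</math> is fixed at 1 for simplicity, and `init_mu` is a parameter I added for deterministic initialization):

```python
import numpy as np

# EM for a mixture of spherical Gaussians with fixed sigma = 1 (a sketch).
# q[i, j] = Q_i(j) = P(z = j | x^(i)); phi are the mixture weights.
def em_gmm(x, k, init_mu=None, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = x.shape
    # Initialize means at k random data points unless init_mu is given.
    mu = x[rng.choice(m, size=k, replace=False)] if init_mu is None else init_mu.copy()
    phi = np.full(k, 1.0 / k)
    q = None
    for _ in range(iters):
        # E-step: Q_i(j) proportional to phi_j * N(x^(i); mu_j, I);
        # the Gaussian normalizer cancels when we normalize each row.
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        q = phi * np.exp(-0.5 * d2)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: mu_j = Q-weighted mean, phi_j = (1/m) * sum_i Q_i(j).
        mu = (q[:, :, None] * x[:, None, :]).sum(axis=0) / q.sum(axis=0)[:, None]
        phi = q.mean(axis=0)
    return phi, mu, q
```

Unlike hard k-means, each point contributes to every cluster mean, weighted by its posterior responsibility <math>Q_i(j)</math>.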

Generative Models

Goal: Generate realistic but fake samples. Applications: denoising, inpainting

VAEs

Variational Auto-Encoders
Tutorial

KL Divergence

Notes
  • KL is always >= 0
  • KL is not symmetric
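Both notes can be checked numerically for discrete distributions; a small sketch of my own, with arbitrary example distributions:

```python
import numpy as np

# KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions.
def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
assert kl(p, q) >= 0 and kl(q, p) >= 0   # KL is always >= 0
assert abs(kl(p, q) - kl(q, p)) > 1e-6   # KL is not symmetric
```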

Model

Our model for how the data is generated is as follows:

  • Generate latent variables <math>z^{(1)},\dots,z^{(m)} \in \mathbb{R}^r</math> iid where dimension r is less than n.
    • We assume <math>Z^{(i)} \sim N(\mathbf{0}, \mathbf{I})</math>
  • Generate <math>x^{(i)}</math> where <math>X^{(i)} \vert Z^{(i)} \sim N(g_{\theta}(z), \sigma^2 \mathbf{I})</math>
    • For some function <math>g</math> parameterized by <math>\theta</math>

Variational Bound

The variational bound is:

  • <math>\log P(x^{(i)}) \geq E_{Z \sim Q_i}[\log P(X^{(i)} \vert Z)] - KL(Q_i(z) \Vert P(z))</math>

Derivation

We know from Bayes' rule that <math>P(z \vert X) = \frac{P(X \vert z) P(z)}{P(X)}</math>.
Plugging this into the equation for <math>KL(Q_i(z) \Vert P(z \vert X))</math> yields our inequality.
<math>KL(Q_i(z) \Vert P(z \vert X)) = E_{Q} \left[ \log \frac{Q_i(z)}{P(z \vert X)} \right]</math>
<math>= E_Q \left[ \log \frac{Q_i(z) P(x^{(i)})}{P(X \vert z) P(z)} \right]</math>
<math>= E_Q \left[ \log \frac{Q_i(z)}{P(z)} \right] + \log P(x^{(i)}) - E_Q[\log P(X \vert z)]</math>
<math>= KL(Q_i(z) \Vert P(z)) + \log P(x^{(i)}) - E_Q[\log P(X \vert z)]</math>
Rearranging terms we get:
<math>\log P(x^{(i)}) - KL(Q_i(z) \Vert P(z \vert X)) = E_Q[\log P(X \vert z)] - KL(Q_i(z) \Vert P(z))</math>
Since the KL divergence is greater than or equal to 0, our variational bound follows.
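In practice <math>Q_i</math> is chosen to be Gaussian, so the <math>KL(Q_i(z) \Vert P(z))</math> term against the <math>N(\mathbf{0}, \mathbf{I})</math> prior has the closed form <math>\frac{1}{2} \sum_d (\sigma_d^2 + \mu_d^2 - 1 - \log \sigma_d^2)</math>. A Monte Carlo sanity check of that formula (my sketch, with arbitrary example values for <math>\mu</math> and <math>\sigma^2</math>):

```python
import numpy as np

# KL( N(mu, diag(s2)) || N(0, I) ) = 0.5 * sum(s2 + mu^2 - 1 - log(s2)),
# checked against a Monte Carlo estimate of E_q[log q(z) - log p(z)].
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
s2 = np.array([0.8, 1.5])

closed = 0.5 * np.sum(s2 + mu**2 - 1.0 - np.log(s2))

# Sample z ~ q and average the log-density ratio.
z = mu + np.sqrt(s2) * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum((z - mu) ** 2 / s2 + np.log(2 * np.pi * s2), axis=1)
log_p = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=1)
mc = np.mean(log_q - log_p)

assert abs(closed - mc) < 0.05
```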

GANs

Main article: Generative adversarial network

Wasserstein GAN

Paper: https://arxiv.org/abs/1701.07875
The main idea is to ensure that the discriminator is Lipschitz continuous and to limit the Lipschitz constant (i.e., the magnitude of its gradient) of the discriminator.
If the correct answer is 1.0 and the generator produces 1.0001, we don't want the discriminator to give us a very high loss.
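The paper's crude mechanism for this is weight clipping: after each update, every critic (discriminator) weight is clipped into [-c, c], with c = 0.01 in the paper. A minimal sketch of my own:

```python
import numpy as np

# WGAN enforces the Lipschitz constraint by clipping every critic weight
# to [-c, c] after each gradient update; the paper uses c = 0.01.
def clip_weights(params, c=0.01):
    return [np.clip(w, -c, c) for w in params]

# Example: one weight matrix with entries outside [-0.01, 0.01].
weights = [np.array([[0.5, -0.02], [0.003, -2.0]])]
(clipped,) = clip_weights(weights)
```

Later work (WGAN-GP) replaces clipping with a gradient penalty, but the goal is the same: keep the critic's slope bounded so small generator errors get small losses.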