We need pairs of similar images and dissimilar images.


;SimCLR [Chen ''et al.'' 2020]
# Create two correlated views of an image <math>x</math>: <math>\tilde{x}_i</math> and <math>\tilde{x}_j</math>.
#* Random cropping + resize


The training objective is <math>\min_{f,g} L = \frac{1}{N} \sum_{k=1}^{N} \frac{l(2k-1,2k) + l(2k, 2k-1)}{2}</math>, as sketched below.
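A minimal PyTorch sketch of this objective (the NT-Xent loss); the batch layout (rows <math>2k</math> and <math>2k+1</math> are the two augmented views of image <math>k</math>), the temperature value, and the function name are illustrative assumptions.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    # z: (2N, d) projections g(f(x~)); rows 2k and 2k+1 are the two views of image k (assumed layout).
    z = F.normalize(z, dim=1)                     # unit norm, so dot products are cosine similarities
    sim = z @ z.t() / temperature                 # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float('-inf'))             # drop self-similarity from the softmax denominator
    pos = torch.arange(z.shape[0], device=z.device) ^ 1  # partner index: (0,1), (2,3), ...
    # Cross-entropy with the partner as target averages l(i, j) over all 2N ordered pairs,
    # which equals (1/N) * sum_k [l(2k-1, 2k) + l(2k, 2k-1)] / 2.
    return F.cross_entropy(sim, pos)
</syntaxhighlight>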
Practical observations:
* The composition of data augmentations is important.
* Larger batch sizes and longer training help self-supervised learning!
* Optimizing through the MLP projection head <math>g</math> helps!
** <math>f</math> is then able to keep some information useful for downstream classification (e.g. color or orientation).
Empirical results:
* On ImageNet, top-1 accuracy increases as the number of parameters increases.
* After learning the embedding <math>f</math>, you don't need much labeled data for supervised fine-tuning.
===Theory of self-supervised learning===
;[Arora ''et al.'' 2019]
Modeling of similar pairs:
* <math>x \sim D_{C1}(x)</math>
* <math>x^+ \sim D_{C1}(x)</math> is a semantically similar (positive) sample from the same class
* <math>x^- \sim D_{C2}(x)</math> is a negative sample from a different class
Downstream task:
* pairwise classification
** nature picks two classes C1, C2
** generate samples from C1 & C2 and evaluate the classification loss
* Assume <math>m \to \infty</math> so just look at population loss
Notations:
* <math>L_{sup}(f)</math> is the supervised learning loss with the optimal last layer
* <math>l = E_{(x, x^+) \sim D_{sim},\, x^- \sim D_{neg}}[\log(1+\exp(f(x)^T f(x^-) - f(x)^T f(x^+)))]</math> is the logistic contrastive loss, which defines the unsupervised loss <math>L_{un}(f)</math> used below (a sampled estimate is sketched after this list)
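A rough Monte-Carlo estimate of this logistic loss; the batched triple interface <code>(f_x, f_pos, f_neg)</code> is an assumption for illustration.
<syntaxhighlight lang="python">
import torch.nn.functional as F

def contrastive_logistic_loss(f_x, f_pos, f_neg):
    # f_x, f_pos, f_neg: (B, d) batches of representations f(x), f(x^+), f(x^-).
    # Batch average of log(1 + exp(f(x)^T f(x^-) - f(x)^T f(x^+))).
    margin = (f_x * f_neg).sum(dim=1) - (f_x * f_pos).sum(dim=1)
    return F.softplus(margin).mean()  # softplus(t) = log(1 + exp(t))
</syntaxhighlight>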
Result 1:
* <math>L_{sup}^{mean}(f) \leq \frac{1}{1-\tau}(L_{un}(f)-\tau)</math> for all <math>f \in F</math>
** <math>\tau</math> is the collision probability for a pair of random classes.
Ideal result:
* <math>L_{sup}(f^*) \leq \alpha L_{sup}(f)</math> for all <math>f</math>.
* In general it is impossible to get this.
;[Tosh ''et al.'' 2020]
This work connects self-supervised learning with multi-view representation theory. 
Start with <math>(x, z, y)</math> where <math>x, z</math> are two views of the data and <math>y</math> is the label.
* <math>x</math> and <math>z</math> should share redundant information w.r.t. <math>y</math>, i.e. predicting <math>y</math> from <math>x</math> or <math>z</math> individually should be about as good as predicting <math>y</math> from both.
* <math>e_x = E(E[Y|X] - E[Y|X,Z])^2</math> and <math>e_z = E(E[Y|Z] - E[Y|X,Z])^2</math> should be small
* Formulate contrastive learning as an artificial ''classification'' problem:
** Classify between <math>(x, z, 1)</math> and <math>(x, \tilde{z}, 0)</math>.
** <math>g^*(x,z) = \operatorname{argmin}(\text{classification loss}) = \frac{P_{X,Z}(x,z)}{P_X(x) P_Z(z)}</math>.
* Since <math>x, z</math> share redundant information about <math>y</math>, if we first predict <math>z</math> from <math>x</math> and then use the result to predict <math>y</math>, we should do well:
<math>
\begin{aligned}
\mu(x) &= E[E[Y|Z]|X=x] \\
&= \int E[Y|Z=z] P_{Z|X}(z|x) dz\\
&= \int E[Y|Z=z] g^*(x,z) P_Z(z) dz
\end{aligned}
</math>
Lemma: <math>E[(\mu(x) - E[Y | x,z])^2] \leq e_x + e_z + 2\sqrt{e_x e_z}</math>
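Returning to the artificial classification problem above, here is a rough training sketch, assuming a hypothetical bivariate scorer <math>g(x,z)</math> that returns one logit per pair; shuffling <math>z</math> within a batch stands in for the independent draws <math>\tilde{z}</math>.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def contrastive_classification_step(g, x, z, optimizer):
    # Aligned pairs (x_i, z_i) get label 1; shuffling z within the batch breaks the pairing
    # and approximates independent draws (x_i, z~_i) with label 0.
    z_shuffled = z[torch.randperm(z.shape[0], device=z.device)]
    logits = torch.cat([g(x, z), g(x, z_shuffled)])  # assumed: g returns a (B,) score vector
    labels = torch.cat([torch.ones(x.shape[0]), torch.zeros(x.shape[0])]).to(logits.device)
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</syntaxhighlight>
With balanced labels the Bayes-optimal logit is <math>\log \frac{P_{X,Z}(x,z)}{P_X(x) P_Z(z)}</math>, so exponentiating the learned score gives an estimate of <math>g^*(x,z)</math>.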
;Landmark embedding
<math>g^*</math> is computed via contrastive learning. How do we embed <math>x</math>? Use <math>\phi(x) = (g^*(x, z_1), \ldots, g^*(x, z_m))</math>, where the landmarks <math>z_1, \ldots, z_m</math> are randomly sampled from <math>P_Z</math> and each <math>g^*</math> output is a single number.
* <math>\phi(\cdot)</math> is a good embedding if a linear function of <math>\phi</math> approximates <math>\mu(x)</math> (a sketch follows this list).
** I.e. <math>\exists w : w^T \phi(x) \approx \mu(x)</math>
** Choosing <math>w_i = \frac{1}{m} E[Y|Z=z_i]</math> gives <math>w^T \phi(x) = \frac{1}{m} \sum_{i=1}^{m} E[Y|Z=z_i]\, g^*(x, z_i) \stackrel{m \to \infty}{\rightarrow} \int E[Y|Z=z]\, g^*(x,z)\, P_{Z}(z)\, dz = \mu(x)</math>
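A short sketch of the landmark embedding, under the same assumed interface for the learned scorer <math>g</math>.
<syntaxhighlight lang="python">
import torch

def landmark_embedding(g, x, landmarks):
    # phi(x) = (g(x, z_1), ..., g(x, z_m)) for m fixed landmarks z_i drawn once from P_Z.
    return torch.stack([g(x, z) for z in landmarks], dim=-1)  # one scalar score per landmark
</syntaxhighlight>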
;Direct embedding
Instead of learning a bivariate <math>g(x,z)</math>, learn the factorized form <math>\eta(x)^T \psi(z)</math> (a sketch follows the cases below).
Lemma: For every <math>\eta: \mathcal{X} \to \mathbb{R}^m</math> and <math>\psi: \mathcal{Z} \to \mathbb{R}^m</math>, there exists <math>w \in \mathbb{R}^m</math> such that <math>E[(w^T \eta(x) - \mu(x))^2] \leq E[Y^2]\, \epsilon_{direct}(\eta, \psi)</math>, where <math>\epsilon_{direct}(\eta, \psi) = E[(\eta(x)^T \psi(z) - g^*(x,z))^2]</math>.
Can we write <math>g^*(x,z)</math> as <math>\eta(x)^T \psi(z)</math>? 
* Yes, if there is a hidden variable <math>H</math> such that <math>X</math> and <math>Z</math> are conditionally independent given <math>H</math>:
# Case 1: <math>H</math> is a discrete variable; then <math>g^*(x, z) = \eta^*(x)^T \psi^*(z)</math> exactly for some <math>\eta^*, \psi^*</math>.
# Case 2: Otherwise, there exist <math>\eta, \psi</math> such that <math>E[(\eta(x)^T \psi(z) - g^*(x,z))^2] \leq o(\tfrac{1}{m})</math>.
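A minimal sketch of such a factorized scorer; the encoder architectures and dimensions are placeholder assumptions, and it can be trained with the same binary classification objective as the bivariate <math>g</math> above.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class DirectScorer(nn.Module):
    # Factorized scorer: score(x, z) = eta(x)^T psi(z); widths and dims are illustrative.
    def __init__(self, x_dim, z_dim, m):
        super().__init__()
        self.eta = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, m))
        self.psi = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, m))

    def forward(self, x, z):
        return (self.eta(x) * self.psi(z)).sum(dim=-1)  # inner product eta(x)^T psi(z)
</syntaxhighlight>
Here <math>\eta(x)</math> itself serves as the <math>m</math>-dimensional embedding fed to the downstream linear predictor <math>w</math>.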


==Misc==