We need pairs of similar images and dissimilar images.
;SimCLR [Chen ''et al.'' 2020]
# Create two correlated views of an image <math>x</math>: <math>\tilde{x}_i</math> and <math>\tilde{x}_j</math>.
#* Random cropping + resize
Training is <math>\min_{f,g} L = \frac{1}{N} \sum_{k=1}^{N} \frac{l(2k-1,2k) + l(2k, 2k-1)}{2}</math>.
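The loss above (the NT-Xent objective) can be sketched in NumPy as follows. This is a minimal illustration, not SimCLR's reference implementation; the batch layout (rows <math>2k</math> and <math>2k+1</math> are the two views of image <math>k</math>), the temperature value, and the function name are assumptions for the example.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss over a batch of 2N projected views.

    z: array of shape (2N, d); rows 2k and 2k+1 are the two
    augmented views of image k (0-indexed).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau                               # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude k = i from the softmax
    # row-wise log-softmax: log of exp(sim[i,j]) / sum_{k != i} exp(sim[i,k])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n2 = z.shape[0]
    partner = np.arange(n2) ^ 1                       # partner index: 0<->1, 2<->3, ...
    return -log_prob[np.arange(n2), partner].mean()
```

Averaging <math>l(i,j)</math> over all <math>2N</math> rows is the same as the symmetrized sum in the training objective above.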
Practical observations:
* The composition of data augmentations is important.
* Larger batch sizes and longer training help self-supervised learning!
* Optimizing through the MLP projection head <math>g</math> helps!
** <math>f</math> is able to keep some information useful for classification (e.g. color or orientation).
Empirical results:
* On ImageNet, top-1 accuracy increases as the number of parameters increases.
* After learning the embedding <math>f</math>, you don't need much labeled data for supervised fine-tuning.
===Theory of self-supervised learning===
;[Arora ''et al.'' 2019]
Modeling of similar pairs:
* <math>x \sim D_{C_1}(x)</math>
* <math>x^+ \sim D_{C_1}(x)</math> is semantically similar
* <math>x^- \sim D_{C_2}(x)</math> are negative samples
Downstream task:
* pairwise classification
** nature picks two classes <math>C_1, C_2</math>
** generate samples from <math>C_1</math> & <math>C_2</math> and evaluate the classification loss
* Assume <math>m \to \infty</math>, so just look at the population loss.
Notation:
* <math>L_{sup}(f)</math> is the supervised learning loss with the optimal last layer
* <math>L_{un}(f) = E_{(x, x^+) \sim D_{sim}, x^- \sim D_{neg}}[\log(1+\exp(f(x)^T f(x^-) - f(x)^T f(x^+)))]</math> is the logistic (contrastive) loss
Result 1:
* <math>L_{sup}^{mean}(f) \leq \frac{1}{1-\tau}(L_{un}(f)-\tau)</math> for all <math>f \in F</math>
** <math>\tau</math> is the collision probability for a pair of random classes.
Ideal result:
* <math>L_{sup}(f^*) \leq \alpha L_{sup}(f)</math> for all <math>f</math>.
* In general it is impossible to get this.
;[Tosh ''et al.'' 2020]
This work connects self-supervised learning with multi-view representation theory.
Start with <math>(x, z, y)</math> where <math>x, z</math> are two views of the data and <math>y</math> is the label.
* <math>x</math> and <math>z</math> should share redundant information w.r.t. <math>y</math>, i.e. predicting <math>y</math> from <math>x</math> or <math>z</math> individually should be about as good as predicting <math>y</math> from both.
* <math>e_x = E[(E[Y|X] - E[Y|X,Z])^2]</math> and <math>e_z = E[(E[Y|Z] - E[Y|X,Z])^2]</math> should be small
* Formulate contrastive learning as an artificial ''classification'' problem:
** Classify between <math>(x, z, 1)</math> and <math>(x, \tilde{z}, 0)</math>, where <math>\tilde{z}</math> is drawn independently of <math>x</math>.
** <math>g^*(x,z) = \operatorname{argmin}(\text{classification loss}) = \frac{P_{X,Z}(x,z)}{P_X(x) P_Z(z)}</math>.
* Since <math>x, z</math> carry redundant information about <math>Y</math>, if we first predict <math>z</math> from <math>x</math> and then use the result to predict <math>Y</math>, we should do well:
<math>
\begin{aligned}
\mu(x) &= E[E[Y|Z]|X=x] \\
&= \int E[Y|Z=z] P_{Z|X}(z|x) dz\\
&= \int E[Y|Z=z] g^*(x,z) P_Z(z) dz
\end{aligned}
</math>
Lemma: <math>E[(\mu(X) - E[Y | X,Z])^2] \leq e_x + e_z + 2\sqrt{e_x e_z}</math>
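The identity <math>\int E[Y|Z=z] P_{Z|X}(z|x) dz = \int E[Y|Z=z] g^*(x,z) P_Z(z) dz</math> used in the derivation above can be checked numerically on a toy discrete distribution. All numbers below are made up for illustration; only the identity itself comes from the text.

```python
import numpy as np

# Toy joint distribution P_{X,Z} over two binary views (illustrative numbers).
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])          # P[x, z]
P_X = P.sum(axis=1)                   # marginal of X
P_Z = P.sum(axis=0)                   # marginal of Z
EY_given_Z = np.array([0.2, 0.9])     # assumed regression function E[Y|Z=z]

g_star = P / np.outer(P_X, P_Z)       # density ratio g*(x,z) = P(x,z)/(P_X(x)P_Z(z))

# Route 1: mu(x) = sum_z E[Y|Z=z] P_{Z|X}(z|x)
P_Z_given_X = P / P_X[:, None]
mu_direct = P_Z_given_X @ EY_given_Z

# Route 2: mu(x) = sum_z E[Y|Z=z] g*(x,z) P_Z(z)
mu_ratio = (g_star * P_Z[None, :]) @ EY_given_Z
```

Both routes agree because <math>g^*(x,z) P_Z(z) = P_{Z|X}(z|x)</math>.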
Landmark embedding: <math>g^*</math> is computed via contrastive learning.
How to embed x? <math>\phi(x) = (g^*(x, z_1), \ldots, g^*(x, z_m))</math>
The landmarks <math>z_1, \ldots, z_m</math> are randomly sampled from <math>P_Z</math>.
Each <math>g^*</math> output is a single number.
* <math>\phi(\cdot)</math> is a good embedding if a linear function of <math>\phi</math> approximates <math>\mu(x)</math>.
** I.e. <math>\exists w : w^T \phi(x) \approx \mu(x)</math>
** Taking <math>w_i = E[Y|Z=z_i]/m</math>: <math>w^T \phi(x) = \frac{1}{m} \sum_{i=1}^{m} E[Y|Z=z_i] g^*(x, z_i) \stackrel{m \to \infty}{\rightarrow} \int E[Y|Z=z] g^*(x,z) P_Z(z) dz = \mu(x)</math>
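A Monte Carlo sketch of this landmark construction on a toy discrete distribution (the distribution and <math>E[Y|Z]</math> values are illustrative assumptions): sampling landmarks <math>z_i \sim P_Z</math> and setting <math>w_i = E[Y|Z=z_i]/m</math> makes <math>w^T \phi(x)</math> converge to <math>\mu(x)</math>.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint P_{X,Z} and regression function E[Y|Z] (illustrative numbers).
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
P_X, P_Z = P.sum(axis=1), P.sum(axis=0)
EY_given_Z = np.array([0.2, 0.9])
g_star = P / np.outer(P_X, P_Z)          # density ratio g*(x,z)
mu = (P / P_X[:, None]) @ EY_given_Z     # exact mu(x) = E[E[Y|Z] | X=x]

m = 200_000
landmarks = rng.choice(len(P_Z), size=m, p=P_Z)  # z_1, ..., z_m ~ P_Z
phi = g_star[:, landmarks]                       # phi(x) for every x, shape (2, m)
w = EY_given_Z[landmarks] / m                    # w_i = E[Y|Z=z_i] / m
estimate = phi @ w                               # w^T phi(x), one value per x
```

With <math>m</math> this large the Monte Carlo estimate matches <math>\mu(x)</math> to a couple of decimal places.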
Direct embedding
Instead of learning a bivariate <math>g(x,z)</math>, learn <math>\eta(x)^T \psi(z)</math>.
Lemma: For every <math>\eta: x \to \mathbb{R}^m</math> and <math>\psi: z \to \mathbb{R}^m</math>, there exists <math>w \in \mathbb{R}^m</math> such that <math>E[(w^T \eta(x) - \mu(x))^2] \leq E[Y^2]\, \epsilon_{direct}(\eta, \psi)</math> where <math>\epsilon_{direct}(\eta, \psi) = E[(\eta(x)^T \psi(z) - g^*(x,z))^2]</math>.
Can we write <math>g^*(x,z)</math> as <math>\eta(x)^T \psi(z)</math>?
* Yes, if there is a hidden variable <math>H</math> s.t. <math>X</math> and <math>Z</math> are conditionally independent given <math>H</math>:
# Case 1: <math>H</math> is a discrete variable; then <math>g^*(x, z) = \eta^*(x)^T \psi^*(z)</math> exactly.
# Case 2: <math>H</math> is continuous; then there exist <math>\eta, \psi</math> such that <math>E[(\eta(x)^T \psi(z) - g^*(x,z))^2] \leq o(\frac{1}{m})</math>.
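Case 1 can be verified numerically: with a discrete hidden variable <math>H</math>, taking <math>\eta(x)_h = P(h|x)</math> and <math>\psi(z)_h = P(z|h)/P(z)</math> factorizes the density ratio exactly, since <math>g^*(x,z) = \sum_h P(h|x)\, P(z|h)/P(z)</math>. The toy conditional distributions below are illustrative assumptions.

```python
import numpy as np

# Discrete hidden variable H; X and Z conditionally independent given H.
P_H = np.array([0.4, 0.6])
P_X_given_H = np.array([[0.8, 0.2],    # rows: h, cols: x
                        [0.3, 0.7]])
P_Z_given_H = np.array([[0.6, 0.4],    # rows: h, cols: z
                        [0.1, 0.9]])

# Joint P(x,z) = sum_h P(h) P(x|h) P(z|h), then marginals and density ratio.
P_XZ = np.einsum('h,hx,hz->xz', P_H, P_X_given_H, P_Z_given_H)
P_X, P_Z = P_XZ.sum(axis=1), P_XZ.sum(axis=0)
g_star = P_XZ / np.outer(P_X, P_Z)

# Candidate factorization: eta(x)_h = P(h|x), psi(z)_h = P(z|h)/P(z).
P_H_given_X = (P_H[:, None] * P_X_given_H) / P_X[None, :]   # shape (h, x)
eta = P_H_given_X.T                                         # shape (x, h)
psi = (P_Z_given_H / P_Z[None, :]).T                        # shape (z, h)
```

Here <math>m</math> equals the number of hidden states, so a two-state <math>H</math> needs only a two-dimensional embedding.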
==Misc==