We need pairs of similar images and dissimilar images.
;SimCLR [Chen ''et al.'' 2020]
# Create two correlated views of an image <math>x</math>: <math>\tilde{x}_i</math> and <math>\tilde{x}_j</math>.
#* Random cropping + resize
Training is <math>\min_{f,g} L = \frac{1}{N} \sum_{k=1}^{N} \frac{l(2k-1,2k) + l(2k, 2k-1)}{2}</math>.
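The loss above (the NT-Xent objective) can be sketched in NumPy as follows. This is a minimal illustration, not SimCLR's reference implementation; the batch layout (rows <math>2k</math> and <math>2k+1</math> are the two views of image <math>k</math>), the temperature value, and the function name are assumptions for the example.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss over a batch of 2N projected views.

    z: array of shape (2N, d); rows 2k and 2k+1 are the two
    augmented views of image k (0-indexed).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau                               # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude k = i from the softmax
    # row-wise log-softmax: log of exp(sim[i,j]) / sum_{k != i} exp(sim[i,k])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n2 = z.shape[0]
    partner = np.arange(n2) ^ 1                       # partner index: 0<->1, 2<->3, ...
    return -log_prob[np.arange(n2), partner].mean()
```

Averaging <math>l(i,j)</math> over all <math>2N</math> rows is the same as the symmetrized sum in the training objective above.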
Practical observations:
* The composition of data augmentations is important.
* Larger batch sizes and longer training help self-supervised learning!
* Optimizing through the MLP projection head <math>g</math> helps!
** <math>f</math> is able to keep some information useful for classification (e.g. color or orientation).
Empirical results:
* On ImageNet, top-1 accuracy increases as the number of parameters increases.
* After learning the embedding <math>f</math>, you don't need much labeled data for supervised fine-tuning.
===Theory of self-supervised learning===
;[Arora ''et al.'' 2019]
Modeling of similar pairs:
* <math>x \sim D_{C_1}(x)</math>
* <math>x^+ \sim D_{C_1}(x)</math> is semantically similar
* <math>x^- \sim D_{C_2}(x)</math> are negative samples
Downstream task:
* pairwise classification
** nature picks two classes <math>C_1, C_2</math>
** generate samples from <math>C_1</math> & <math>C_2</math> and evaluate the classification loss
* Assume <math>m \to \infty</math>, so just look at the population loss.
Notation:
* <math>L_{sup}(f)</math> is the supervised learning loss with the optimal last layer
* <math>L_{un}(f) = E_{(x, x^+) \sim D_{sim}, x^- \sim D_{neg}}[\log(1+\exp(f(x)^T f(x^-) - f(x)^T f(x^+)))]</math> is the logistic (contrastive) loss
Result 1:
* <math>L_{sup}^{mean}(f) \leq \frac{1}{1-\tau}(L_{un}(f)-\tau)</math> for all <math>f \in F</math>
** <math>\tau</math> is the collision probability for a pair of random classes.
Ideal result:
* <math>L_{sup}(f^*) \leq \alpha L_{sup}(f)</math> for all <math>f</math>.
* In general it is impossible to get this.
;[Tosh ''et al.'' 2020]
This work connects self-supervised learning with multi-view representation theory.
Start with <math>(x, z, y)</math> where <math>x, z</math> are two views of the data and <math>y</math> is the label.
* <math>x</math> and <math>z</math> should share redundant information w.r.t. <math>y</math>, i.e. predicting <math>y</math> from <math>x</math> or <math>z</math> individually should be about as good as predicting <math>y</math> from both.
* <math>e_x = E[(E[Y|X] - E[Y|X,Z])^2]</math> and <math>e_z = E[(E[Y|Z] - E[Y|X,Z])^2]</math> should be small
* Formulate contrastive learning as an artificial ''classification'' problem:
** Classify between <math>(x, z, 1)</math> and <math>(x, \tilde{z}, 0)</math>, where <math>\tilde{z}</math> is drawn independently of <math>x</math>.
** <math>g^*(x,z) = \operatorname{argmin}(\text{classification loss}) = \frac{P_{X,Z}(x,z)}{P_X(x) P_Z(z)}</math>.
* Since <math>x, z</math> carry redundant information about <math>Y</math>, if we first predict <math>z</math> from <math>x</math> and then use the result to predict <math>Y</math>, we should do well:
<math>
\begin{aligned}
\mu(x) &= E[E[Y|Z]|X=x] \\
&= \int E[Y|Z=z] P_{Z|X}(z|x) dz\\
&= \int E[Y|Z=z] g^*(x,z) P_Z(z) dz
\end{aligned}
</math>
Lemma: <math>E[(\mu(X) - E[Y | X,Z])^2] \leq e_x + e_z + 2\sqrt{e_x e_z}</math>
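The identity <math>\int E[Y|Z=z] P_{Z|X}(z|x) dz = \int E[Y|Z=z] g^*(x,z) P_Z(z) dz</math> used in the derivation above can be checked numerically on a toy discrete distribution. All numbers below are made up for illustration; only the identity itself comes from the text.

```python
import numpy as np

# Toy joint distribution P_{X,Z} over two binary views (illustrative numbers).
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])          # P[x, z]
P_X = P.sum(axis=1)                   # marginal of X
P_Z = P.sum(axis=0)                   # marginal of Z
EY_given_Z = np.array([0.2, 0.9])     # assumed regression function E[Y|Z=z]

g_star = P / np.outer(P_X, P_Z)       # density ratio g*(x,z) = P(x,z)/(P_X(x)P_Z(z))

# Route 1: mu(x) = sum_z E[Y|Z=z] P_{Z|X}(z|x)
P_Z_given_X = P / P_X[:, None]
mu_direct = P_Z_given_X @ EY_given_Z

# Route 2: mu(x) = sum_z E[Y|Z=z] g*(x,z) P_Z(z)
mu_ratio = (g_star * P_Z[None, :]) @ EY_given_Z
```

Both routes agree because <math>g^*(x,z) P_Z(z) = P_{Z|X}(z|x)</math>.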
Landmark embedding: <math>g^*</math> is computed via contrastive learning.
How to embed x? <math>\phi(x) = (g^*(x, z_1), \ldots, g^*(x, z_m))</math>
The landmarks <math>z_1, \ldots, z_m</math> are randomly sampled from <math>P_Z</math>.
Each <math>g^*</math> output is a single number.
* <math>\phi(\cdot)</math> is a good embedding if a linear function of <math>\phi</math> approximates <math>\mu(x)</math>.
** I.e. <math>\exists w : w^T \phi(x) \approx \mu(x)</math>
** Taking <math>w_i = E[Y|Z=z_i]/m</math>: <math>w^T \phi(x) = \frac{1}{m} \sum_{i=1}^{m} E[Y|Z=z_i] g^*(x, z_i) \stackrel{m \to \infty}{\rightarrow} \int E[Y|Z=z] g^*(x,z) P_Z(z) dz = \mu(x)</math>
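A Monte Carlo sketch of this landmark construction on a toy discrete distribution (the distribution and <math>E[Y|Z]</math> values are illustrative assumptions): sampling landmarks <math>z_i \sim P_Z</math> and setting <math>w_i = E[Y|Z=z_i]/m</math> makes <math>w^T \phi(x)</math> converge to <math>\mu(x)</math>.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint P_{X,Z} and regression function E[Y|Z] (illustrative numbers).
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
P_X, P_Z = P.sum(axis=1), P.sum(axis=0)
EY_given_Z = np.array([0.2, 0.9])
g_star = P / np.outer(P_X, P_Z)          # density ratio g*(x,z)
mu = (P / P_X[:, None]) @ EY_given_Z     # exact mu(x) = E[E[Y|Z] | X=x]

m = 200_000
landmarks = rng.choice(len(P_Z), size=m, p=P_Z)  # z_1, ..., z_m ~ P_Z
phi = g_star[:, landmarks]                       # phi(x) for every x, shape (2, m)
w = EY_given_Z[landmarks] / m                    # w_i = E[Y|Z=z_i] / m
estimate = phi @ w                               # w^T phi(x), one value per x
```

With <math>m</math> this large the Monte Carlo estimate matches <math>\mu(x)</math> to a couple of decimal places.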
Direct embedding
Instead of learning a bivariate <math>g(x,z)</math>, learn <math>\eta(x)^T \psi(z)</math>.
Lemma: For every <math>\eta: x \to \mathbb{R}^m</math> and <math>\psi: z \to \mathbb{R}^m</math>, there exists <math>w \in \mathbb{R}^m</math> such that <math>E[(w^T \eta(x) - \mu(x))^2] \leq E[Y^2]\, \epsilon_{direct}(\eta, \psi)</math> where <math>\epsilon_{direct}(\eta, \psi) = E[(\eta(x)^T \psi(z) - g^*(x,z))^2]</math>.
Can we write <math>g^*(x,z)</math> as <math>\eta(x)^T \psi(z)</math>?
* Yes, if there is a hidden variable <math>H</math> s.t. <math>X</math> and <math>Z</math> are conditionally independent given <math>H</math>:
# Case 1: <math>H</math> is a discrete variable; then <math>g^*(x, z) = \eta^*(x)^T \psi^*(z)</math> exactly.
# Case 2: <math>H</math> is continuous; then there exist <math>\eta, \psi</math> such that <math>E[(\eta(x)^T \psi(z) - g^*(x,z))^2] \leq o(\frac{1}{m})</math>.
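Case 1 can be verified numerically: with a discrete hidden variable <math>H</math>, taking <math>\eta(x)_h = P(h|x)</math> and <math>\psi(z)_h = P(z|h)/P(z)</math> factorizes the density ratio exactly, since <math>g^*(x,z) = \sum_h P(h|x)\, P(z|h)/P(z)</math>. The toy conditional distributions below are illustrative assumptions.

```python
import numpy as np

# Discrete hidden variable H; X and Z conditionally independent given H.
P_H = np.array([0.4, 0.6])
P_X_given_H = np.array([[0.8, 0.2],    # rows: h, cols: x
                        [0.3, 0.7]])
P_Z_given_H = np.array([[0.6, 0.4],    # rows: h, cols: z
                        [0.1, 0.9]])

# Joint P(x,z) = sum_h P(h) P(x|h) P(z|h), then marginals and density ratio.
P_XZ = np.einsum('h,hx,hz->xz', P_H, P_X_given_H, P_Z_given_H)
P_X, P_Z = P_XZ.sum(axis=1), P_XZ.sum(axis=0)
g_star = P_XZ / np.outer(P_X, P_Z)

# Candidate factorization: eta(x)_h = P(h|x), psi(z)_h = P(z|h)/P(z).
P_H_given_X = (P_H[:, None] * P_X_given_H) / P_X[None, :]   # shape (h, x)
eta = P_H_given_X.T                                         # shape (x, h)
psi = (P_Z_given_H / P_Z[None, :]).T                        # shape (z, h)
```

Here <math>m</math> equals the number of hidden states, so a two-state <math>H</math> needs only a two-dimensional embedding.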
==Misc==