==Self-supervised Learning==
Lecture 21 (November 10, 2020)
Given data <math>x</math>, we want to find a good representation <math>f(x)</math>.
We can use <math>f(x)</math> to solve the classification problem more efficiently (e.g. using linear classifiers).
Task 1: Learn a good <math>f(x)</math> from ''unlabeled'' samples.
Task 2: Use <math>f(x)</math> + a few labels to solve the classification problem using linear models.
Note that in semi-supervised learning, you have unlabeled examples and a few labeled examples, and you know what the task is.
In self-supervised learning, we use ''structure'' in unlabeled data to create artificial supervised learning problems solved via deep models.
In this process, the learning method ''hopefully'' will create internal representations of the data <math>f(x)</math> that are useful for downstream tasks.
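As a rough illustration of Task 2, a linear probe can be fit on frozen features. This sketch assumes some pretrained encoder <code>f</code> from Task 1 and uses scikit-learn's logistic regression; all names are illustrative:
<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(f, x_labeled, y_labeled, x_test):
    """Freeze the pretrained encoder f and fit a linear classifier on its features."""
    z_train = np.stack([f(x) for x in x_labeled])   # frozen representations f(x)
    z_test = np.stack([f(x) for x in x_test])
    clf = LogisticRegression(max_iter=1000).fit(z_train, y_labeled)
    return clf.predict(z_test)
</syntaxhighlight>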
===Image embedding===
Surprising observation for image embedding:
[Gidaris ''et al.'' ICLR 2018] + [Zhang ''et al.'' 2019]
# Rotate images and use the angle of rotation as labels (e.g. <math>\theta = 0, 90, 180, 270</math>).
# Train a CNN to predict the rotation angle from images (sketched below).
# Use <math>f(x)</math> with linear classification models for the true labels.
;Why should <math>f(x)</math> be a good representation for images?
This is an open question.
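A rough sketch of the rotation pretext task above; the backbone, data, and loss step are illustrative stand-ins, not the exact setup of Gidaris ''et al.'':
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """images: (B, C, H, W). Returns 4B rotated images and rotation-class labels."""
    rotated, labels = [], []
    for k in range(4):                                   # k quarter-turns: 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Any CNN backbone works here; its output features serve as f(x) downstream.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(32, 4)                                  # 4 rotation classes
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)                       # stand-in for an unlabeled batch
x_rot, y_rot = make_rotation_batch(images)
loss = criterion(head(backbone(x_rot)), y_rot)
loss.backward()
</syntaxhighlight>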
===Contrastive Learning===
[Logeswaran & Lee ICLR 2018] use a text corpus (Wikipedia) to train a deep representation <math>f(x)</math>:
<math>x, x^{+}</math> are adjacent sentences.
<math>x, x^{-}</math> are random sentences.
Optimization:
<math>\min_{f} E[\log(1 + \exp(f(x)^T f(x^-) - f(x)^T f(x^+)))] \approx E[f(x)^T f(x^-) - f(x)^T f(x^+)]</math>.
This is known as contrastive learning.
Sentence embeddings capture human notions of similarity:
E.g. "the tiger rules this jungle" is similar to "a lion hunts in a forest".
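A minimal sketch of the pairwise objective above, assuming the encoder has already mapped sentences to fixed-size vectors (the batching and names are illustrative):
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def pair_contrastive_loss(fx, fx_pos, fx_neg):
    """fx, fx_pos, fx_neg: (B, d) embeddings of x, x+ (adjacent) and x- (random)."""
    pos = (fx * fx_pos).sum(dim=1)          # f(x)^T f(x+)
    neg = (fx * fx_neg).sum(dim=1)          # f(x)^T f(x-)
    # log(1 + exp(neg - pos)) written stably as softplus(neg - pos)
    return F.softplus(neg - pos).mean()
</syntaxhighlight>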
;Can we use contrastive learning to obtain powerful data representations for images?
We need pairs of similar images and dissimilar images.
SimCLR [Chen ''et al.'' 2020]
# Create two correlated views of an image <math>x</math>: <math>\tilde{x}_i</math> and <math>\tilde{x}_j</math>.
#* Random cropping + resize
#* Random color distortion
#* Random Gaussian blur
# Use a base encoder (ResNet) to map <math>\tilde{x}_i,\tilde{x}_j</math> to embeddings <math>h_i, h_j</math>.
# Train a projection head <math>g(\cdot)</math> (a one-hidden-layer MLP) to map the h's to z's which maximize the agreement between the z's (see the sketch after this list).
# Agreement is measured by the cosine similarity <math>\mathrm{sim}(z_i, z_j) = \frac{z_i^T z_j}{\Vert z_i \Vert \Vert z_j \Vert}</math>, which enters the loss below.
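A rough sketch of steps 1–3 (augmentations, base encoder, and projection head); the augmentation strengths and the ResNet-18 backbone are illustrative choices, not the exact SimCLR recipe:
<syntaxhighlight lang="python">
import torch.nn as nn
from torchvision import transforms
from torchvision.models import resnet18

# Step 1: two random augmentations of the same image give the correlated views.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                                    # random crop + resize
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),                                    # color distortion
    transforms.GaussianBlur(kernel_size=23),                              # random Gaussian blur
    transforms.ToTensor(),
])

# Step 2: base encoder maps each view to an embedding h.
base_encoder = resnet18()
base_encoder.fc = nn.Identity()            # h is a 512-d vector

# Step 3: projection head g() maps h to z, where agreement is maximized.
projection_head = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

def two_views(pil_image):
    """Return the projections (z_i, z_j) of two correlated views of one image."""
    xi, xj = augment(pil_image), augment(pil_image)
    hi, hj = base_encoder(xi.unsqueeze(0)), base_encoder(xj.unsqueeze(0))
    return projection_head(hi), projection_head(hj)
</syntaxhighlight>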
Randomly select <math>N</math> samples and add their augmentations to get <math>2N</math> samples.
Compute the similarity matrix <math>S \in \mathbb{R}^{2N \times 2N}</math>.
<math>S_{ij}=\exp(\mathrm{sim}(z_i, z_j)) =
\begin{cases}
e & \text{if } i=j\\
\text{high} & \text{if } (i,j) \text{ is a positive pair (two views of the same image)}\\
\text{low} & \text{otherwise}
\end{cases}
</math>
Training is <math>\min_{f,g} L = \frac{1}{N} \sum_{k=1}^{N} \frac{\ell(2k-1,2k) + \ell(2k, 2k-1)}{2}</math>, where <math>\ell(i,j)</math> is the contrastive (NT-Xent) loss for the positive pair <math>(i,j)</math>.
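Combining the similarity matrix and the per-pair losses, a sketch of the batch loss; the temperature parameter is part of SimCLR's NT-Xent loss but was omitted above, and the convention that the two views of sample <math>k</math> sit at adjacent rows is assumed:
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """z: (2N, d) projections, with rows 2k and 2k+1 (0-indexed) the two views of sample k."""
    z = F.normalize(z, dim=1)                 # so z_i^T z_j is the cosine similarity sim(z_i, z_j)
    sim = z @ z.t() / temperature             # (2N, 2N) similarity matrix S (before exp)
    sim.fill_diagonal_(float('-inf'))         # drop the trivial i = j entries
    targets = torch.arange(z.size(0)) ^ 1     # the positive of row 2k is row 2k+1, and vice versa
    # l(i, j) = -log( exp(S_ij) / sum_{k != i} exp(S_ik) ); cross_entropy averages over the 2N rows
    return F.cross_entropy(sim, targets)
</syntaxhighlight>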
==Misc==