Visual Learning and Recognition
* 4 Texture Related
===Deformable Part Models (DPM)===
Lecture (Oct 6-8, 2020)
Felzenszwalb ''et al.'' (2009) <ref name="felzenszwalb2009dpm"></ref>
'''Important: Read this paper'''
* <math>w_{ij}^T \in \mathbb{R}^5</math> is the deformation parameter between parts <math>i</math> and <math>j</math>
The total number of configurations is <math>10^{4N}</math>: for a <math>100 \times 100</math> image, each part location <math>p_i</math> can take <math>100 \times 100 = 10^4</math> values, and <math>N</math> is the number of parts.
The trick is to use dynamic programming and a tree-based model, as sketched below.
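A minimal sketch of the idea (the star-shaped model, the 2-D score arrays, and all names below are our own simplification, not the paper's code). Each part sends a "message" to the root, so the cost is polynomial in the number of locations instead of exponential in the number of parts:
<syntaxhighlight lang="python">
import numpy as np

def best_configuration(root_score, part_scores, deform_costs):
    """Dynamic programming for a star-shaped (tree) part model.

    root_score:   (H, W) appearance score of the root filter at every location
    part_scores:  list of (H, W) appearance scores, one per part
    deform_costs: list of vectorized functions d_i(dx, dy) -> deformation penalty

    Returns an (H, W) array with the best total score for each root placement.
    """
    H, W = root_score.shape
    ys, xs = np.mgrid[0:H, 0:W]
    total = root_score.astype(float).copy()
    for score_i, d_i in zip(part_scores, deform_costs):
        msg = np.empty((H, W))
        for ry in range(H):
            for rx in range(W):
                # best placement of part i given the root at (ry, rx):
                # appearance score minus deformation cost relative to the root
                msg[ry, rx] = np.max(score_i - d_i(xs - rx, ys - ry))
        total += msg
    return total

# Example deformation cost: quadratic penalty on the offset from the anchor.
quadratic = lambda dx, dy: 0.05 * dx ** 2 + 0.05 * dy ** 2
</syntaxhighlight>
With quadratic deformation costs, the inner maximization can be computed for all root locations at once using a generalized distance transform, which is what makes DPM inference fast in practice.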
===TextonBoost===
Shotton ''et al.'' <ref name="shotton2009texton"></ref>
Incorporates texture-layout, color, location, and edge features in a conditional random field.
Jointly considers appearance, shape, and context.
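Schematically, the CRF combines these cues as unary and pairwise potentials over the pixel labeling <math>c</math> (the notation below is ours, not necessarily the paper's):
<math>E(c \mid x) = \sum_i \Big[ \psi_i^{\text{texture-layout}}(c_i, x) + \psi_i^{\text{color}}(c_i, x_i) + \psi_i^{\text{location}}(c_i, i) \Big] + \sum_{(i,j)} \phi^{\text{edge}}(c_i, c_j, x)</math>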
==Semantic Segmentation==
==Region-based Object Detection Systems==
Lecture (Oct 15, 2020)
===R-CNN===
;R-CNN at test time
===Approaches for Pose Estimation===
Lecture (Oct 27, 2020)
* Top-down approaches
** Do person detection, then do pose estimation.
Primitives
* Depth - not normalized (making it hard to use directly), has discontinuities, and does not represent objects
* Surface normals - the gradient of the depth map (see the sketch below)
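A minimal sketch of recovering surface normals from depth gradients (assuming a dense depth map <code>Z</code> as a 2-D array; camera intrinsics are ignored for simplicity, and the function name is ours):
<syntaxhighlight lang="python">
import numpy as np

def normals_from_depth(Z):
    """Surface normals from a depth map via its spatial gradients.

    For a surface z = Z(x, y), an (unnormalized) normal is (-dZ/dx, -dZ/dy, 1).
    Returns an (H, W, 3) array of unit normals.
    """
    dZ_dy, dZ_dx = np.gradient(Z)                      # gradients along rows (y) and columns (x)
    normals = np.dstack((-dZ_dx, -dZ_dy, np.ones_like(Z)))
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    return normals / norm
</syntaxhighlight>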
===Scene Intrinsics===
==GANs and VAEs==
===Pixel-RNN/CNN===
* Fully-visible belief network
* Explicit density model:
** Each pixel depends on all previous pixels
** <math>P_{\theta}(x) = \prod_{i=1}^{n} P_{\theta}(x_i | x_1, ..., x_{i-1})</math>
** You need to define what ''previous pixels'' means (e.g. all pixels above and to the left), as in the masked convolution sketch below
* Then maximize the likelihood of the training data
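As a concrete illustration, here is a PyTorch sketch of the standard PixelCNN masking scheme (the class name and layer sizes are our own, not from the lecture):
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so each output pixel only sees
    'previous' pixels: everything above it, and to its left in the same row.
    Mask type 'A' also hides the centre pixel (first layer); 'B' keeps it."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        # centre row: block pixels right of the centre (and the centre itself for type 'A')
        mask[kH // 2, kW // 2 + (mask_type == 'B'):] = 0
        # block every row below the centre
        mask[kH // 2 + 1:, :] = 0
        self.register_buffer('mask', mask[None, None])   # broadcast over (out_ch, in_ch)

    def forward(self, x):
        self.weight.data *= self.mask   # zero out 'future' connections before convolving
        return super().forward(x)

# First layer of a PixelCNN on grayscale images:
first = MaskedConv2d('A', in_channels=1, out_channels=64, kernel_size=7, padding=3)
</syntaxhighlight>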
;Pros:
* Can explicitly compute P(x)
* Explicit P(x) gives a good evaluation metric
;Cons:
* Sequence generation is slow
* Optimizing P(x) is hard.
Types of ''previous pixels'' connections:
* PixelCNN uses masked convolutions, giving a bounded receptive field (fastest)
* Row LSTM has a triangular receptive field (slow)
* Diagonal LSTM
* Diagonal BiLSTM has a full dependency field (slowest)
;Multi-scale PixelRNN
* Takes subsampled pixels as additional input pixels
* Captures global information better
* Slightly better results
===Generative Adversarial Networks (GANs)===
* Generator generates images
* Discriminator classifies real or fake
* Loss: <math>\min_{G} \max_{D} E_x[\log D(x)] + E_z[\log(1-D(G(z)))]</math> (a training-step sketch follows below)
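A minimal training-step sketch of this objective (PyTorch; the function and variable names are ours, and the generator update uses the common non-saturating variant rather than literally minimizing <math>\log(1-D(G(z)))</math>):
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=128):
    z = torch.randn(real.size(0), z_dim)
    fake = G(z)

    # Discriminator: maximize E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    real_logits = D(real)
    fake_logits = D(fake.detach())            # detach: no generator gradient here
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator: non-saturating loss, maximize E_z[log D(G(z))]
    fake_logits = D(fake)
    g_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
</syntaxhighlight>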
;Image-to-image Conditional GANs
* Add an image encoder which outputs z
;pix2pix
* Add an L1 loss to the loss function (see the objective below)
* U-Net generator
* PatchGAN discriminator
** PatchGAN outputs an N×N grid of real/fake predictions, one per patch (i.e. a limited receptive field)
* Requires paired samples
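The combined pix2pix objective has the form below, with <math>x</math> the input image, <math>y</math> the paired target, and <math>\lambda</math> weighting the L1 term:
<math>G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, E_{x,y,z}\big[\Vert y - G(x, z) \Vert_1\big]</math>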
;CycleGAN
* Unpaired image-to-image translation
* Cycle-consistency loss (see below)
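The cycle-consistency term, for generators <math>G: X \to Y</math> and <math>F: Y \to X</math>, penalizes failing to map back to the original image:
<math>\mathcal{L}_{cyc}(G, F) = E_{x}\big[\Vert F(G(x)) - x \Vert_1\big] + E_{y}\big[\Vert G(F(y)) - y \Vert_1\big]</math>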
;BicycleGAN
* First learn to reconstruct images with a latent code representation in between (cVAE-GAN)
* The main difference is that we have a many-to-many mapping (multi-modal image-to-image translation) between the two domains.
;MUNIT
* Multimodal UNsupervised Image-to-image Translation
* Maps images from each domain into a shared content space and a domain-specific style space.
===Training Problems with GANs===
* Instability
* Difficult to keep the generator and discriminator in sync.
** The discriminator cannot be too good or too bad. The same goes for the generator.
** Tricks: LR scheduling, keep the discriminator small, update the generator more frequently.
* Mode collapse
Mode collapse happens when the generator cannot model different parts of the distribution and keeps producing samples from only a few modes.
;DCGAN architecture guidelines
* Use strided convolutions instead of pooling in the discriminator (and fractionally-strided convolutions in the generator), as in the sketch below.
* Use batchnorm in the generator and discriminator.
* Remove FC hidden layers.
* Use ReLU for hidden layers and tanh for the output layer of the generator.
* Use LeakyReLU in the discriminator.
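A sketch of a generator following these guidelines (PyTorch; the 64×64 output resolution, channel widths, and function name are our assumptions):
<syntaxhighlight lang="python">
import torch.nn as nn

def dcgan_generator(z_dim=100, ngf=64, out_channels=3):
    """All-convolutional generator: no FC hidden layers, fractionally-strided
    (transposed) convs, batchnorm + ReLU in hidden layers, tanh at the output.
    Input z has shape (N, z_dim, 1, 1)."""
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),    # 4x4
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),  # 8x8
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),  # 16x16
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf), nn.ReLU(True),          # 32x32
        nn.ConvTranspose2d(ngf, out_channels, 4, 2, 1, bias=False), nn.Tanh(),                              # 64x64
    )
</syntaxhighlight>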
LSGAN and WGAN modify the loss to mitigate mode collapse.
===Evaluation of GANs===
* Turing test (user study)
* Inception score (see below)
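For reference, the inception score uses a pretrained classifier <math>p(y|x)</math> and rewards samples that are individually confident but diverse overall (this is the standard definition, not taken from the slides):
<math>\text{IS} = \exp\Big( E_{x \sim P_G}\big[ D_{KL}\big(p(y|x) \,\Vert\, p(y)\big) \big] \Big)</math>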
===Variational Auto-encoders (VAEs)===
;Training a VAE
* Data likelihood: <math>P(x) = \int P(x|z) P(z) dz</math>
* Approximate with samples of z during training: <math>P(x) \approx \frac{1}{n} \sum_{i=1}^{n} P(x | z_i)</math>
* This is impractical: for most <math>z</math>, <math>P(x|z) \approx 0</math>, so far too many samples would be needed.
Assume we can learn a distribution <math>Q(z|x)</math> such that samples <math>z \sim Q(z|x)</math> are likely to have generated <math>x</math>, i.e. <math>P(x|z) > 0</math>.
How are <math>P(x)</math> and <math>E_{z \sim Q(z|x)}</math> related? Start from the KL divergence between <math>Q(z|x)</math> and the true posterior <math>P(z|x)</math>:
<math>
\begin{aligned}
D_{KL}[Q(z|x) \Vert P(z|x)] &= E_{z \sim Q}[\log Q(z|x) - \log P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)\\
&= D_{KL}[Q(z|x) \Vert P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)
\end{aligned}
</math>
Rearranging, we get:
<math>
\begin{aligned}
&\log P(x) - D_{KL}[Q(z|x) \Vert P(z|x)] = E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\\
\implies &\log P(x) \geq E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]
\end{aligned}
</math>
The right-hand side is known as the variational lower bound, or ''ELBO'' (the inequality holds because the KL divergence is non-negative).
* We first have the encoder output a mean <math>\mu_{z|x}</math> and the diagonal of a covariance matrix <math>\Sigma_{z|x}</math>.
* For the ELBO we want to maximize <math>E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]</math> (see the training sketch below).
* Our first loss is <math>D_{KL}(N(\mu_{z|x}, \Sigma_{z|x}) \Vert N(0, I))</math>.
* We sample z from <math>N(\mu_{z|x}, \Sigma_{z|x})</math> and pass it to the decoder, which outputs <math>\mu_{x|z}, \Sigma_{x|z}</math>.
* Sample <math>\hat{x}</math> from that distribution and use the reconstruction loss <math>\Vert x - \hat{x} \Vert^2</math>.
* Most blog posts forget to sample from <math>P(x|z)</math>.
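A simplified training sketch of these steps (PyTorch; it assumes a fixed decoder variance, so the reconstruction term reduces to an L2 loss on the decoder mean, as in the ''Modeling P(x|z)'' note below; function names are ours):
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x):
    """Negative ELBO for one batch.

    encoder(x) -> (mu, log_var): parameters of the Gaussian Q(z|x)
    decoder(z) -> reconstruction (the mean of P(x|z))
    """
    mu, log_var = encoder(x)

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps

    # Reconstruction term: -E[log P(x|z)] reduces to an L2 loss (up to constants)
    x_hat = decoder(z)
    recon = F.mse_loss(x_hat, x, reduction='sum')

    # KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return recon + kl   # minimizing this maximizes the ELBO
</syntaxhighlight>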
;Modeling P(x|z)
Let f(z) be the network output.
* Assume <math>P(x|z)</math> is i.i.d. Gaussian.
* <math>\hat{x} = f(z) + \eta</math> where <math>\eta \sim N(0,1)</math>
* This simplifies to an L2 loss <math>\Vert x - f(z) \Vert^2</math>
The importance-weighted VAE uses N samples of z in the loss, which gives a tighter bound.
;Reparameterization trick
To sample from the latent space, compute <math>z = \mu + \sigma \varepsilon</math> where <math>\varepsilon \sim N(0,1)</math>.
This way, you can backprop through the sampling step.
;Conditional VAE
Just input the condition into both the encoder and the decoder.
;Pros
* Principled approach to generative models
* Allows inference of <math>q(z|x)</math>, which can be used as a feature representation
;Cons
* Only maximizes a lower bound (the ELBO) on the likelihood
* Samples are blurrier than those of GANs
Why are samples blurry?
* The "samples" shown are usually not true samples but the mean/expected value <math>\mu_{x|z}</math>; actual samples would look noisy rather than blurry.
* The L2 (Gaussian) reconstruction loss encourages averaging over plausible outputs.
===Flow-based Models===
Flow-based models are trained by minimizing the exact negative log-likelihood, as shown below.
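Concretely (the standard change-of-variables formula, not derived in lecture), with an invertible mapping <math>z = f(x)</math> and a simple prior <math>P(z)</math>, the exact log-likelihood being maximized is:
<math>\log P(x) = \log P\big(f(x)\big) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|</math>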
==Attribute-based Representation==
;Motivation
Typically in recognition, we only predict the class of the image.
From the category we can guess the attributes, but the category provides only limited information.
The network cannot perform prediction on unseen new classes; handling this gracefully is referred to as ''graceful degradation''.
;Goal
Learn intermediate structure along with object categories.
;Should we care about attributes in DL?
;Why are attributes not simply supervised recognition?
;Benefits
* Dealing with inevitable failure.
* We can infer things about unseen categories.
* We can make comparisons between objects or categories.
;Datasets
* a-Pascal
* a-Yahoo
* CORE
* COCO Attributes
Deep networks should be able to learn attributes implicitly.
However, you don't know if they have actually learned them.
==Extra Topics==
===Fine-grained Recognition===
===Few-shot Recognition===
* Metric learning methods
* Meta-learning methods
* Data augmentation methods
* Semantics
===Zero-shot Recognition===
The goal is to train a classifier without having seen a single labeled example of the target classes.
The class information comes from auxiliary sources, e.g. a knowledge graph or word embeddings.
===Beyond Labelled Datasets===
* Semi-supervised: We have both labelled and unlabelled training samples.
* Weakly-supervised: The labels are weak, noisy, and not necessarily for the task we want.
* Learning from the Web: Download data from the internet.
==Will be on the exam==
* Softmax, sigmoid, cross entropy
* RCNN vs Fast-RCNN vs Faster-RCNN
* DPM
* Selective search vs RPN
* ELBO
Final exam:
* Friday, Dec 4, 2020
* 4pm-6pm on Gradescope
* Will have multiple choice, fill-in-the-blank, question answering, and open-ended questions
** Generally simple questions; either you know it or you don't
* No practice exams
* Only need to know major names (RCNN, Fast(er)-RCNN)
* Only covers lecture material.
* One letter-size page of open notes, both sides allowed (honor system)
Homework 2
* Released Nov 30, 2020
* Take SLIC superpixels, extract deep features, and classify them.
* Will have 2 bonus credits; must pick one:
** Use features from multiple layers (multi-scale)
** Use multiple levels of SLIC as input (SLIC feature pyramid)
Final Project
* Presentations Dec 10
* Recorded videos with presentations
* Final reports due Dec 18
[https://docs.google.com/document/d/1BKmpBWBWuEEywDyBw9CsHgOPB6DH7I0oKEs8zXS7XQw/edit?usp=sharing My Exam Cheat Sheet]
==Project Notes==
* Challenges
* What methods worked and didn't work.
==References==
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for | <ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for | ||
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref> | non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref> | ||
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal | <ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref> | ||
Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref> | |||
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref> | <ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref> | ||
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref> | <ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref> | ||