Visual Learning and Recognition

* 4 Texture Related


===Deformable Part Models (DPM)===
Lecture (Oct 6-8, 2020)

Felzenszwalb et al (2009) <ref name="felzenszwalb2009dpm"></ref>
'''Important: Read this paper'''
* <math>w_{ij}^T \in \mathbb{R}^5</math> is the deformation parameter between parts i and j


The total number of configurations is <math>10^{4N}</math>: in a <math>100 \times 100</math> image each part location <math>p_i</math> can take <math>100 \times 100 = 10^4</math> values, so <math>N</math> parts give <math>(10^4)^N</math> configurations.


The trick is to use dynamic programming and a tree-based model: with a tree (e.g. star) structure, the best placement of each part can be computed independently given the root location (see the sketch below).
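A minimal NumPy sketch of the star-model version of this dynamic program: for every root location, each part's best placement (filter response minus quadratic deformation cost) is maximized independently. All score maps, anchors, and deformation weights below are made up for illustration; the real DPM replaces the brute-force inner maximization with a generalized distance transform.

<syntaxhighlight lang="python">
import numpy as np

def best_part_placement(part_score, anchor, w_def):
    """For every candidate root location, return the best score of one part:
    max over part locations of (filter response - quadratic deformation cost).
    Brute force O((HW)^2); DPM uses a generalized distance transform instead."""
    H, W = part_score.shape
    ys, xs = np.mgrid[0:H, 0:W]                      # all candidate part locations
    best = np.full((H, W), -np.inf)
    for ry in range(H):
        for rx in range(W):
            dy = ys - (ry + anchor[0])               # displacement from the ideal anchor
            dx = xs - (rx + anchor[1])
            cost = w_def[0] * dy ** 2 + w_def[1] * dx ** 2
            best[ry, rx] = np.max(part_score - cost)
    return best

# Toy example: a root filter plus two parts on a small score grid.
rng = np.random.default_rng(0)
H, W = 20, 20
root_score = rng.normal(size=(H, W))
parts = [(rng.normal(size=(H, W)), (2, -3), (0.1, 0.1)),   # (score map, anchor, deformation weights)
         (rng.normal(size=(H, W)), (-4, 1), (0.1, 0.1))]

total = root_score.copy()
for score, anchor, w_def in parts:
    total += best_part_placement(score, anchor, w_def)    # parts are independent given the root
print("best root location:", np.unravel_index(np.argmax(total), total.shape))
</syntaxhighlight>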


===TextonBoost===
Shotton ''et al.'' <ref name="shotton2009texton"></ref>
Incorporates texture-layout, color, location, and edge features in a conditional random field, jointly considering appearance, shape, and context.
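Schematically, the per-pixel and pairwise terms combine into a CRF energy of roughly this form (my paraphrase of the general structure, not the exact notation from the paper):

<math>
E(\mathbf{c} \mid \mathbf{x}) = \sum_i \big[ \psi_i(c_i, \mathbf{x}) + \pi_i(c_i, \mathbf{x}) + \lambda_i(c_i) \big] + \sum_{(i,j)} \phi_{ij}(c_i, c_j, \mathbf{x})
</math>

where <math>\psi</math> is the boosted texture-layout potential, <math>\pi</math> the color potential, <math>\lambda</math> the location prior, and <math>\phi</math> the contrast-sensitive edge (pairwise) potential.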


==Semantic Segmentation==


==Region-based Object Detection Systems==
 
Lecture (Oct 15, 2020)
===R-CNN===
;R-CNN at test time


===Approaches for Pose Estimation===
Lecture (Oct 27, 2020)
* Top-down approaches
** Do person detection then do pose estimation.


Primitives
* Depth - not normalized (which makes it hard to use), has discontinuities, and does not represent objects
* Surface normals - can be computed from the gradient of depth


===Scene Intrinsics===
Shape carving:
* Assume everything is made of cuboids and remove cubes.
Without segmentation masks:
* Train an autoencoder on 3D voxel shapes.
* Train an encoder for rendered chairs.
* At test time, images go through the rendered-image encoder and out the 3D voxel decoder (see the sketch below).
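A schematic PyTorch sketch of that setup (all architectures and sizes are illustrative placeholders, not the networks from the lecture): a voxel autoencoder plus an image encoder trained into the same latent space, so that at test time an image can be decoded into voxels.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

latent = 128
voxel_enc = nn.Sequential(nn.Flatten(), nn.Linear(32 ** 3, 512), nn.ReLU(), nn.Linear(512, latent))
voxel_dec = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(), nn.Linear(512, 32 ** 3), nn.Sigmoid())
image_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512), nn.ReLU(), nn.Linear(512, latent))

voxels = torch.rand(4, 1, 32, 32, 32)   # toy occupancy grids
images = torch.rand(4, 3, 64, 64)       # toy rendered views of the same objects

# Training signals: reconstruct voxels, and pull the image embedding toward the voxel embedding.
recon_loss = ((voxel_dec(voxel_enc(voxels)) - voxels.flatten(1)) ** 2).mean()
embed_loss = ((image_enc(images) - voxel_enc(voxels).detach()) ** 2).mean()

# Test-time path: image -> image encoder -> voxel decoder.
pred_voxels = voxel_dec(image_enc(images)).reshape(-1, 32, 32, 32)
print(pred_voxels.shape)  # torch.Size([4, 32, 32, 32])
</syntaxhighlight>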
==GANs and VAEs==
===Pixel-RNN/CNN===
* Fully-visible belief network
* Explicit density model:
** Each pixel depends on all previous pixels
** <math>P_{\theta}(x) = \prod_{i=1}^{n} P_{\theta}(x_i | x_1, ..., x_{i-1})</math>
** You need to define what counts as ''previous pixels'' (e.g. all pixels above and to the left); see the masked-convolution sketch below.
* Then maximize likelihood of training data
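A minimal PyTorch sketch of how the ''previous pixels'' constraint can be enforced with masked convolutions, as in PixelCNN (the layer sizes here are arbitrary):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Convolution whose receptive field only covers pixels above and to the left.
    Mask type 'A' also hides the center pixel (used for the first layer);
    type 'B' keeps it (used for later layers)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kH, kW = self.weight.shape
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == 'B'):] = 0   # center row: right of center (and center for 'A')
        mask[kH // 2 + 1:, :] = 0                          # all rows below the center
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

conv = MaskedConv2d('A', in_channels=1, out_channels=16, kernel_size=7, padding=3)
print(conv(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 16, 28, 28])
</syntaxhighlight>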
;Pros:
* Can explicitly compute P(x)
* Explicit P(x) gives good evaluation metric
;Cons:
* Sequence generation is slow
* Optimizing P(x) is hard.
Types of ''previous pixels'' connections:
* PixelCNN uses masked convolutions and has a bounded receptive field over previous pixels (fastest)
* Row LSTM has a triangular receptive field (slow)
* Diagonal LSTM
* Diagonal BiLSTM has a full dependency field (slowest)
;Multi-scale PixelRNN
* Takes subsampled pixels as additional input pixels
* Can capture better global information
* Slightly better results
===Generative Adversarial Networks (GANs)===
* Generator generates images
* Discriminator classifies real or fake
* Loss: <math>\min_{G} \max_{D} E_x[\log D(x)] + E_z[\log(1-D(G(z)))]</math>
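A minimal PyTorch sketch of the alternating optimization of this objective; the MLP generator/discriminator and hyperparameters are placeholders, and the generator update uses the common non-saturating form (maximize <math>\log D(G(z))</math>) rather than the min-max form literally:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_real):
    b = x_real.size(0)
    # Discriminator step: push D(x_real) -> 1 and D(G(z)) -> 0.
    x_fake = G(torch.randn(b, z_dim)).detach()
    loss_d = bce(D(x_real), torch.ones(b, 1)) + bce(D(x_fake), torch.zeros(b, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: push D(G(z)) -> 1.
    loss_g = bce(D(G(torch.randn(b, z_dim))), torch.ones(b, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

print(train_step(torch.randn(8, x_dim)))  # toy batch standing in for real images
</syntaxhighlight>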
;Image-to-image Conditional GANS
* Add an image encoder which outputs z
;pix2pix
* Add L1 loss to the loss function
* UNet generator
* PatchGAN discriminator
** PatchGAN outputs N×N real/fake predictions, one per patch (i.e. a limited receptive field)
* Requires paired samples
;CycleGAN
* Unpaired image-to-image translation
* Cycle-consistency loss
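The cycle-consistency term has the standard form from the CycleGAN paper, with generators <math>G: X \to Y</math> and <math>F: Y \to X</math>, added on top of the adversarial losses:

<math>
\mathcal{L}_{cyc}(G, F) = E_{x}[\Vert F(G(x)) - x \Vert_1] + E_{y}[\Vert G(F(y)) - y \Vert_1]
</math>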
;BicycleGAN
* First learn to reconstruct images with a nice latent code representation in between (cVAE-GAN)
* The main difference is that we have a many-to-many mapping (multi-modal image-to-image) between the two domains.
;MUNIT
* Multimodal UNsupervised Image-to-image Translation
* Maps images from each domain into a shared content space and domain-specific style space.
===Training Problems with GANs===
* Instability
* Difficult to keep generator and discriminator in sync.
** Discriminator cannot be too good or too bad. Same with generator.
** Tricks: LR scheduling, keep discriminator small, update generator more frequently.
* Mode collapse
Mode collapse happens when the generator only covers some modes of the data distribution, so many latent codes map to very similar outputs.
;DCGAN architecture guidelines
* Use strided conv instead of pooling for discriminator.
* Use batchnorm in generator and discriminator.
* Remove FC hidden layers.
* Use ReLU for hidden layers and tanh for the output layer of the generator.
* Use LeakyReLU in the discriminator.
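An illustrative generator following these guidelines (strided transposed convolutions, batchnorm, no fully-connected hidden layers, ReLU inside, tanh output); the channel counts and 64×64 output size are assumptions roughly mirroring the DCGAN paper:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

G = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0, bias=False), nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False), nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False), nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),  nn.BatchNorm2d(64),  nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),    nn.Tanh(),           # image in [-1, 1]
)
z = torch.randn(16, 100, 1, 1)   # latent codes treated as 1x1 feature maps
print(G(z).shape)                # torch.Size([16, 3, 64, 64])
</syntaxhighlight>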
LSGAN, WGAN have tricks to mitigate mode collapse.
===Evaluation of GANs===
* Turing test (User study)
* Inception score
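For reference, the Inception score evaluates samples with a pretrained classifier's predictive distribution <math>p(y|x)</math>:

<math>
IS = \exp\Big( E_{x \sim p_G}\big[ D_{KL}( p(y|x) \Vert p(y) ) \big] \Big)
</math>

A high score requires confident per-sample predictions (low-entropy <math>p(y|x)</math>) and diverse classes overall (high-entropy marginal <math>p(y)</math>).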
===Variational Auto-encoders (VAEs)===
;Training a VAE
* Data likelihood: <math>P(x) = \int P(x|z) P(z) dz</math>
* Approximate with samples of z during training: <math>P(x) \approx \frac{1}{n} \sum_{i=1}^{n} P(x | z_i)</math>
* This is impractical because for most randomly drawn z, <math>P(x|z)</math> is nearly zero, so the estimate needs far too many samples.
Instead, assume we can learn a distribution <math>Q(z|x)</math> such that sampling <math>z \sim Q(z|x)</math> tends to give <math>P(x|z) > 0</math>. 
How do we relate <math>P(x)</math> to <math>E_{z \sim Q(z|x)}[\log P(x|z)]</math>? 
<math>
\begin{aligned}
D_{KL}[Q(z|x) \Vert P(z|x)] &= E_{z \sim Q}[\log Q(z|x) - \log P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)\\
&= D_{KL}[Q(z|x) \Vert P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)
\end{aligned}
</math> 
Rearranging we get: 
<math>
\begin{aligned}
&\log P(x) - D_{KL}[Q(z|x) \Vert P(z|x)] = E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\\
\implies &\log P(x) \geq E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]
\end{aligned}
</math> 
This is known as the variational lower bound or ''ELBO''.
* We first have the encoder output a mean <math>\mu_{z|x}</math> and a diagonal covariance matrix <math>\Sigma_{z|x}</math>. 
* For ELBO we want to optimize <math>E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]</math>. 
* Our first loss is <math>D_{KL}(N(\mu_{z|x}, \Sigma_{z|x}) \Vert N(0, I))</math>, which has a closed form (given after this list). 
* We sample z from <math>N(\mu_{z|x}, \Sigma_{z|x})</math> and pass it to the decoder which outputs <math>\mu_{x|z}, \Sigma_{x|z}</math>. 
* Sample <math>\hat{x}</math> from the distribution and have reconstruction loss <math>\Vert x - \hat{x} \Vert^2</math>. 
* Most blog posts will forget to sample from <math>P(x|z)</math>.
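The KL loss above has a standard closed form when <math>Q(z|x) = N(\mu_{z|x}, \operatorname{diag}(\sigma^2))</math> and the prior is <math>N(0, I)</math>:

<math>
D_{KL}\big(N(\mu, \operatorname{diag}(\sigma^2)) \Vert N(0, I)\big) = \frac{1}{2} \sum_j \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
</math>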
;Modeling P(x|z)
Let f(z) be the network output.
* Assume <math>P(x|z)</math> is iid Gaussian.
* <math>\hat{x} = f(z) + \eta</math> where <math>\eta \sim N(0,1)</math>
* Simplifies to an L2 loss <math>\Vert x - f(z) \Vert^2</math>
Importance weighted VAE uses N samples for the loss.
;Reparameterization trick
To sample from the latent space, you do <math>z = \mu + \sigma \varepsilon</math> where <math>\varepsilon \sim N(0,1)</math>. 
This way, you can backprop through the sampling step.
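Putting the pieces together, a minimal PyTorch sketch of the negative ELBO with the reparameterization trick; the MLP encoder/decoder are placeholders, and the decoder is a unit-variance Gaussian so the reconstruction term reduces to an L2 loss:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

x_dim, z_dim = 784, 20
enc = nn.Sequential(nn.Linear(x_dim, 400), nn.ReLU(), nn.Linear(400, 2 * z_dim))  # -> [mu, log sigma^2]
dec = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(), nn.Linear(400, x_dim))

def neg_elbo(x):
    mu, logvar = enc(x).chunk(2, dim=1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)        # reparameterization: z = mu + sigma * eps
    x_hat = dec(z)                                                  # mean of P(x|z)
    recon = ((x - x_hat) ** 2).sum(dim=1)                           # -log P(x|z) up to constants (unit-variance Gaussian)
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)     # closed-form KL to N(0, I)
    return (recon + kl).mean()

print(neg_elbo(torch.randn(8, x_dim)))  # toy batch
</syntaxhighlight>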
;Conditional VAE
Just input the condition into the encoder and decoder.
;Pros
* Principled approach to generative models
* Allows inference of <math>q(z|x)</math> which can be used as a feature representation
;Cons
* Only maximizes a lower bound on the likelihood (the ELBO), not the likelihood itself.
* Samples are blurrier than GANs
Why are samples blurry?
* Often the outputs shown are not blurry but noisy: people display the mean/expected value <math>\mu_{x|z}</math> instead of an actual sample from <math>P(x|z)</math>.
* The L2 reconstruction loss encourages averaging over plausible outputs.
===Flow-based Models===
Flow-based models learn an invertible mapping between data <math>x</math> and latent <math>z</math> and minimize the exact negative log-likelihood.
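The exact likelihood comes from the change-of-variables formula (standard result, added here for reference), where <math>f</math> is the invertible map from <math>x</math> to <math>z</math>:

<math>
\log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
</math>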
==Attribute-based Representation==
;Motivation
Typically in recognition, we only predict the class of the image. 
From the category we can guess some attributes, but the category alone provides only limited information. 
The network also cannot make predictions for unseen new classes. 
Handling such failures gracefully is referred to as ''graceful degradation''.
;Goal
Learn intermediate structure with object categories.
;Should we care about attributes in DL?
;Why are attributes not simply supervised recognition?
;Benefits
* Dealing with inevitable failure.
* We can infer things about unseen categories.
* We can make comparison between objects or categories.
;Datasets
* a-Pascal
* a-Yahoo
* CORE
* COCO Attributes
Deep networks should be able to learn attributes implicitly. 
However, you don't know whether they have actually been learned.
==Extra Topics==
===Fine-grained Recognition===
===Few-shot Recognition===
* Metric learning methods
* Meta-learning methods
* Data Augmentation Methods
* Semantics
===Zero-shot Recognition===
The goal is to train a classifier without having seen a single labeled example of the target classes. 
The information comes from external knowledge, e.g. a knowledge graph or word embeddings.
===Beyond Labelled Datasets===
* Semi-supervised: We have both labelled and unlabeled training samples.
* Weakly-supervised: The labels are weak, noisy, and not necessarily for the task we want.
* Learning from the Web: Download data from the internet


==Will be on the exam==
* Softmax, sigmoid, cross entropy
* RCNN vs Fast-RCNN vs Faster-RCNN
* DPM
* Selective search vs RPN
* ELBO
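As a quick reference for the first topic (standard definitions, not taken from the lecture slides):

<math>
\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad CE(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i
</math>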
Final exam:
* Friday Dec 4, 2020
* 4pm-6pm on Gradescope
* Will have multiple choice, fill-in-the-blank, question answering, open-ended
** Generally simple questions; either you know or you don't
* No practice exams
* Only need to know major names (RCNN, Fast(er)-RCNN)
* Only covers lecture material.
* 1 letter page of open notes, both sides allowed (honor system)
Homework 2
* Released Nov 30, 2020
* Take SLIC superpixels, extract deep features, and classify them (see the sketch after this list).
* There will be 2 bonus-credit options; you must pick one:
** Use features from multiple layers (multi-scale)
** Use multiple-levels of SLIC in input (SLIC feature pyramid)
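A hypothetical outline of that pipeline using scikit-image's SLIC; here mean RGB per superpixel stands in for the deep features the assignment actually asks for:

<syntaxhighlight lang="python">
import numpy as np
from skimage import data
from skimage.segmentation import slic

image = data.astronaut()                          # placeholder image
segments = slic(image, n_segments=200, compactness=10)

# Pool one feature vector per superpixel (mean color as a stand-in for deep features).
features = np.stack([image[segments == s].mean(axis=0) for s in np.unique(segments)])
print(features.shape)                             # (num_superpixels, 3)

# Each row of `features` would then be fed to a per-superpixel classifier.
</syntaxhighlight>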
Final Project
* Presentations Dec 10
* Recorded videos with presentations
* Final reports Dec 18
[https://docs.google.com/document/d/1BKmpBWBWuEEywDyBw9CsHgOPB6DH7I0oKEs8zXS7XQw/edit?usp=sharing My Exam Cheat Sheet]


==Project Notes==
* Challenges
* What methods worked and didn't work.


==References==
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref>
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref>
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref>
Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref>
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref>
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref>
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref>
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref>