Visual Learning and Recognition: Difference between revisions
| (18 intermediate revisions by the same user not shown) | |||
| Line 377: | Line 377: | ||
* 4 Texture Related | * 4 Texture Related | ||
=== | ===Deformable Part Models (DPM)=== | ||
Lecture (Oct 6-8, 2020) | |||
Deformable Part Models (DPM) | |||
Felzenszwalb et al (2009) <ref name="felzenszwalb2009dpm"></ref> | Felzenszwalb et al (2009) <ref name="felzenszwalb2009dpm"></ref> | ||
'''Important: Read this paper''' | '''Important: Read this paper''' | ||
| Line 390: | Line 393: | ||
* <math>w_{ij}^T \in \mathbb{R}^5</math> is the deformation parameter between parts i and j | * <math>w_{ij}^T \in \mathbb{R}^5</math> is the deformation parameter between parts i and j | ||
The total number of configurations is <math>10^(4*N)</math> since for each <math>100\ | The total number of configurations is <math>10^{(4*N)}</math> since for each <math>100 \times 100</math> image, each of <math>p_i</math> can take 100*100 values. <math>N</math> is the number of parts. | ||
The trick is to use dynamic programming and a tree-based model. | The trick is to use dynamic programming and a tree-based model. | ||
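To make the saving concrete, here is a minimal sketch of dynamic programming on the simplest tree, a star with one root and <code>n</code> parts. It is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, locations are flattened to <code>K</code> indices, and the deformation cost is a generic lookup table rather than the 5-dimensional <math>w_{ij}</math> parameterization. Brute force enumerates <math>K^{n+1}</math> configurations, while the star-shaped DP needs about <math>n K^2</math> operations because each part can be maximized independently once the root is fixed.

```python
import itertools
import numpy as np

def star_model_best_score(root_unary, part_unary, pair_cost):
    """Best total score of a star-shaped part model via dynamic programming.

    root_unary: (K,)      appearance score of the root at each of K locations
    part_unary: (N, K)    appearance score of each of N parts at each location
    pair_cost:  (N, K, K) deformation cost pair_cost[i, r, p] for part i at
                          location p when the root is at location r
    """
    # For every part i and root location r, pick the best part location p.
    best_part = (part_unary[:, None, :] - pair_cost).max(axis=2)  # (N, K)
    # Parts are independent given the root, so their best scores just add up.
    total = root_unary + best_part.sum(axis=0)  # (K,)
    return float(total.max())

def brute_force_best_score(root_unary, part_unary, pair_cost):
    """Same quantity by enumerating all K**(N+1) configurations (tiny inputs only)."""
    n_parts, k = part_unary.shape
    best = -np.inf
    for r in range(k):
        for locs in itertools.product(range(k), repeat=n_parts):
            score = root_unary[r]
            for i, p in enumerate(locs):
                score += part_unary[i, p] - pair_cost[i, r, p]
            best = max(best, score)
    return float(best)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_parts, k = 3, 5
    root = rng.normal(size=k)
    parts = rng.normal(size=(n_parts, k))
    cost = rng.uniform(size=(n_parts, k, k))
    print(star_model_best_score(root, parts, cost))
    print(brute_force_best_score(root, parts, cost))
```

On small random inputs the two functions should agree, while only the DP version stays feasible as <code>K</code> approaches the <math>10^4</math> locations of a <math>100 \times 100</math> image.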
| Line 457: | Line 460: | ||
===TextonBoost=== | ===TextonBoost=== | ||
Shotton ''et al.'' <ref name="shotton2009texton"></ref> | Shotton ''et al.'' <ref name="shotton2009texton"></ref> | ||
Incorporates texture-layout, color, location, and edge in a conditional random field. | |||
Jointly considers appearance, shape, context. | |||
==Semantic Segmentation== | ==Semantic Segmentation== | ||
| Line 516: | Line 521: | ||
==Region-based Object Detection Systems== | ==Region-based Object Detection Systems== | ||
Lecture (Oct 15, 2020) | |||
===R-CNN=== | ===R-CNN=== | ||
;R-CNN at test time | ;R-CNN at test time | ||
| Line 746: | Line 751: | ||
===Approaches for Pose Estimation=== | ===Approaches for Pose Estimation=== | ||
Lecture Oct 27 | |||
* Top-down approaches | * Top-down approaches | ||
** Do person detection then do pose estimation. | ** Do person detection then do pose estimation. | ||
| Line 993: | Line 1,000: | ||
Primitives | Primitives | ||
* Depth | * Depth - not normalized, which makes it hard to use; has discontinuities; does not represent objects | ||
* Surface normals | * Surface normals - derived from the gradient of depth | ||
===Scene Intrinsics=== | ===Scene Intrinsics=== | ||
| Line 1,151: | Line 1,158: | ||
\begin{aligned} | \begin{aligned} | ||
&\log P(x) - D_{KL}[Q(z|x) \Vert P(z|x)] = E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\\ | &\log P(x) - D_{KL}[Q(z|x) \Vert P(z|x)] = E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\\ | ||
\implies &\log P(x) \geq E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)] | \implies &\log P(x) \geq E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)] | ||
\end{aligned} | \end{aligned} | ||
</math> | </math> | ||
This is known as the variational lower bound, or evidence lower bound (''ELBO''). | |||
* The encoder first outputs a mean <math>\mu_{z|x}</math> and the diagonal of a covariance matrix <math>\Sigma_{z|x}</math>. | |||
* For ELBO we want to optimize <math>E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]</math>. | |||
* Our first loss is <math>D_{KL}(N(\mu_{z|x}, \Sigma_{z|x}) \Vert N(0, I))</math>. | |||
* We sample z from <math>N(\mu_{z|x}, \Sigma_{z|x})</math> and pass it to the decoder which outputs <math>\mu_{x|z}, \Sigma_{x|z}</math>. | |||
* Sample <math>\hat{x}</math> from this distribution and use the reconstruction loss <math>\Vert x - \hat{x} \Vert^2</math>. | |||
* Most blog posts will forget to sample from <math>P(x|z)</math>. | |||
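The steps above can be sketched in NumPy as follows. This is a rough illustration under assumptions, not a reference implementation: the function names and the <code>decoder</code> callable are hypothetical, the encoder outputs are passed in directly as <code>mu_z</code> and <code>var_z</code> (the diagonal of <math>\Sigma_{z|x}</math>), and the reconstruction term uses the decoder mean with unit variance instead of drawing <math>\hat{x}</math>. The KL term against <math>N(0, I)</math> has the standard closed form <math>\tfrac{1}{2} \sum_k (\sigma_k^2 + \mu_k^2 - 1 - \log \sigma_k^2)</math>.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) )."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    return 0.5 * float(np.sum(var + mu**2 - 1.0 - np.log(var)))

def vae_loss_single_sample(x, mu_z, var_z, decoder, rng):
    """One-sample estimate of the negative ELBO, up to an additive constant.

    decoder is a hypothetical callable mapping a latent z to the mean of P(x|z);
    P(x|z) is assumed Gaussian with identity covariance.
    """
    x = np.asarray(x, dtype=float)
    mu_z = np.asarray(mu_z, dtype=float)
    var_z = np.asarray(var_z, dtype=float)
    kl = kl_diag_gaussian_to_standard_normal(mu_z, var_z)
    # Draw z ~ N(mu_z, diag(var_z)) from the approximate posterior Q(z|x).
    z = mu_z + np.sqrt(var_z) * rng.standard_normal(mu_z.shape)
    x_mean = np.asarray(decoder(z), dtype=float)
    recon = 0.5 * float(np.sum((x - x_mean) ** 2))
    return recon + kl

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_decoder = lambda z: 2.0 * z  # stand-in for a neural network
    print(vae_loss_single_sample([1.0, -1.0], [0.5, -0.5], [0.1, 0.1], toy_decoder, rng))
```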
;Modeling P(x|z) | |||
Let f(z) be the network output. | |||
* Assume <math>P(x|z)</math> is iid Gaussian. | |||
* <math>\hat{x} = f(z) + \eta</math> where <math>\eta \sim N(0,1)</math> | |||
* Simplifies to an L2 loss <math>\Vert x - f(z) \Vert^2</math> | |||
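As a worked step, the simplification can be written out explicitly. The symbols <math>d</math> (the dimension of <math>x</math>) and <math>\sigma^2</math> (a fixed, shared noise variance) are introduced here for illustration; the notes above use <math>\sigma^2 = 1</math>.
<math>
\begin{aligned}
P(x|z) &= N(x; f(z), \sigma^2 I)\\
-\log P(x|z) &= \frac{1}{2\sigma^2} \Vert x - f(z) \Vert^2 + \frac{d}{2} \log(2 \pi \sigma^2)
\end{aligned}
</math>
The second term does not depend on the network, so maximizing <math>E_{z \sim Q(z|x)}[\log P(x|z)]</math> amounts to minimizing <math>\Vert x - f(z) \Vert^2</math> up to a constant scale and offset.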
Importance weighted VAE uses N samples for the loss. | |||
;Reparameterization trick | |||
To sample from the latent space, you do <math>z = \mu + \sigma \varepsilon</math> where <math>\varepsilon \sim N(0,1)</math>. | |||
This way, you can backprop through the sampling step. | |||
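A small numerical check of why this works, as a sketch with hypothetical function names and a toy objective <math>E[z^2]</math> chosen only because its gradients are known in closed form (<math>2\mu</math> and <math>2\sigma</math>). Once <math>\varepsilon</math> is drawn, <math>z</math> is a deterministic function of <math>\mu</math> and <math>\sigma</math>, so the chain rule applies with <math>\partial z / \partial \mu = 1</math> and <math>\partial z / \partial \sigma = \varepsilon</math>.

```python
import numpy as np

def reparameterized_sample(mu, sigma, eps):
    """z = mu + sigma * eps, with eps ~ N(0, 1) drawn independently of mu and sigma."""
    return mu + sigma * eps

def pathwise_gradients(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2] for z ~ N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    z = reparameterized_sample(mu, sigma, eps)
    dloss_dz = 2.0 * z                    # derivative of the toy loss z^2
    grad_mu = np.mean(dloss_dz * 1.0)     # chain rule with dz/dmu = 1
    grad_sigma = np.mean(dloss_dz * eps)  # chain rule with dz/dsigma = eps
    return float(grad_mu), float(grad_sigma)

if __name__ == "__main__":
    g_mu, g_sigma = pathwise_gradients(mu=1.5, sigma=0.5)
    print(g_mu, g_sigma)  # should be close to 2*mu and 2*sigma
```

In a real VAE an autodiff framework performs the same chain rule automatically; the sketch only makes the two partial derivatives visible.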
;Conditional VAE | |||
Just input the condition into the encoder and decoder. | |||
;Pros | |||
* Principled approach to generative models | |||
* Allows inference of <math>q(z|x)</math> which can be used as a feature representation | |||
;Cons | |||
* Maximizes the ELBO, which is only a lower bound on <math>\log P(x)</math>, rather than the likelihood itself | |||
* Samples are blurrier than GANs | |||
Why are samples blurry? | |||
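* True samples from <math>P(x|z)</math> are noisy rather than blurry. | |||
** The blurry images usually shown are the decoder mean (expected value), not samples. | |||
* The L2 loss is minimized by the mean, which averages over plausible outputs. | |||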
===Flow-based Models=== | |||
Flow-based models minimize the negative log-likelihood. | |||
==Attribute-based Representation== | |||
;Motivation | |||
Typically in recognition, we only predict the class of the image. | |||
From the category, we can guess the attributes but the category provides only limited information. | |||
The network cannot perform prediction on unseen new classes. | |||
This problem used to be called ''graceful degradation''. | |||
;Goal | |||
Learn intermediate structure with object categories. | |||
;Should we care about attributes in DL? | |||
;Why are attributes not simply supervised recognition? | |||
;Benefits | |||
* Dealing with inevitable failure. | |||
* We can infer things about unseen categories. | |||
* We can make comparison between objects or categories. | |||
;Datasets | |||
* a-Pascal | |||
* a-Yahoo | |||
* CORE | |||
* COCO Attributes | |||
Deep networks should be able to learn attributes implicitly. | |||
However, you don't know if it has actually learned them. | |||
==Extra Topics== | |||
===Fine-grained Recognition=== | |||
===Few-shot Recognition=== | |||
* Metric learning methods | |||
* Meta-learning methods | |||
* Data Augmentation Methods | |||
* Semantics | |||
===Zero-shot Recognition=== | |||
The goal is to train a classifier for a class without having seen a single labeled example of it. | |||
The information comes from a knowledge graph e.g. from word embeddings. | |||
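One common way to use such embeddings, sketched here under assumptions: image features are assumed to have already been mapped into the same space as the class word embeddings (for example by a projection learned on seen classes), and an unseen class is predicted by cosine similarity. The function name and the toy vectors are hypothetical.

```python
import numpy as np

def zero_shot_predict(image_embedding, class_embeddings, class_names):
    """Return the class whose embedding is most cosine-similar to the image embedding."""
    img = np.asarray(image_embedding, dtype=float)
    cls = np.asarray(class_embeddings, dtype=float)
    img = img / np.linalg.norm(img)
    cls = cls / np.linalg.norm(cls, axis=1, keepdims=True)
    scores = cls @ img  # cosine similarity to each class embedding
    return class_names[int(np.argmax(scores))], scores

if __name__ == "__main__":
    names = ["zebra", "horse"]            # classes with no labeled images
    word_vecs = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for word embeddings
    label, scores = zero_shot_predict([0.9, 0.1], word_vecs, names)
    print(label, scores)
```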
===Beyond Labelled Datasets=== | |||
* Semi-supervised: We have both labelled and unlabeled training samples. | |||
* Weakly-supervised: The labels are weak, noisy, and not necessarily for the task we want. | |||
* Learning from the Web: Download data from the internet | |||
==Will be on the exam== | ==Will be on the exam== | ||
| Line 1,160: | Line 1,250: | ||
* RCNN vs Fast-RCNN vs Faster-RCNN | * RCNN vs Fast-RCNN vs Faster-RCNN | ||
* DPM | * DPM | ||
* Selective search vs RPN (region proposal network) | |||
* ELBO | |||
Final exam: | Final exam: | ||
| Line 1,182: | Line 1,274: | ||
* Recorded videos with presentations | * Recorded videos with presentations | ||
* Final reports Dec 18 | * Final reports Dec 18 | ||
[https://docs.google.com/document/d/1BKmpBWBWuEEywDyBw9CsHgOPB6DH7I0oKEs8zXS7XQw/edit?usp=sharing My Exam Cheat Sheet] | |||
==Project Notes== | ==Project Notes== | ||
| Line 1,187: | Line 1,281: | ||
* Challenges | * Challenges | ||
* What methods worked and didn't work. | * What methods worked and didn't work. | ||
==References== | ==References== | ||
| Line 1,195: | Line 1,286: | ||
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for | <ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for | ||
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref> | non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref> | ||
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal | <ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref> | ||
Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref> | |||
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref> | <ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref> | ||
<ref name="torralba2011unbiased">Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref> | <ref name="torralba2011unbiased">Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref> | ||