Visual Learning and Recognition
Notes for CMSC828I Visual Learning and Recognition (Fall 2020) taught by [http://abhinavsh.info/ Abhinav Shrivastava]

[https://www.cs.umd.edu/class/fall2020/cmsc828i/ Course Website]

This class covers:
* 4 Texture Related
===Deformable Part Models (DPM)===
Lecture (Oct 6-8, 2020)

Deformable Part Models (DPM), Felzenszwalb ''et al.'' (2009) <ref name="felzenszwalb2009dpm"></ref>

'''Important: Read this paper'''
* <math>w_{ij}^T \in \mathbb{R}^5</math> is the deformation parameter between parts i and j

The total number of configurations is <math>10^{4N}</math>: in a <math>100 \times 100</math> image, each part location <math>p_i</math> can take <math>100 \times 100 = 10^4</math> values, and <math>N</math> is the number of parts.

The trick is to use dynamic programming and a tree-based model.
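To make the dynamic-programming idea concrete, here is a minimal (naive) sketch for a star-shaped model (root plus independent parts) with a toy isotropic quadratic deformation cost; the actual DPM learns a 4-D deformation weight per part and uses generalized distance transforms to make the inner maximization linear-time, so this is illustrative only.

<syntaxhighlight lang="python">
import numpy as np

def best_star_configuration(root_score, part_scores, deform_weight=0.01):
    """Naive DP for a star-structured part model (root + N parts).

    root_score:  (H, W) unary score for placing the root at each location.
    part_scores: list of (H, W) unary scores, one per part.
    The inner loop is O((H*W)^2) per part; DPM's generalized distance
    transform reduces it to O(H*W)."""
    H, W = root_score.shape
    ys, xs = np.mgrid[0:H, 0:W]
    total = root_score.astype(float).copy()
    for score in part_scores:
        msg = np.empty((H, W))
        for ry in range(H):
            for rx in range(W):
                deform = deform_weight * ((ys - ry) ** 2 + (xs - rx) ** 2)
                msg[ry, rx] = np.max(score - deform)   # best placement of this part given the root
        total += msg                                   # message passed up to the root
    return total.max()                                 # best total score over root locations
</syntaxhighlight>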
===TextonBoost===
Shotton ''et al.'' <ref name="shotton2009texton"></ref>

Incorporates texture-layout, color, location, and edge cues in a conditional random field.
Jointly considers appearance, shape, and context.
==Semantic Segmentation==
==Region-based Object Detection Systems==
Lecture (Oct 15, 2020)

===R-CNN===
;R-CNN at test time
===Approaches for Pose Estimation===
Lecture (Oct 27, 2020)

* Top-down approaches
** Do person detection, then do pose estimation.
===Models for Video Recognition===
;Basic Video Pipeline
# Extract features
# Learn space-time bag of words
# Train/test BoW classifier
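A minimal sketch of this pipeline, assuming local spatio-temporal descriptors have already been extracted per clip; the clustering and classifier choices here (scikit-learn KMeans and a linear SVM) are illustrative, not what any specific referenced paper used.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bow_histograms(descriptor_sets, kmeans):
    """Quantize each clip's local descriptors into a normalized visual-word histogram."""
    k = kmeans.n_clusters
    hists = []
    for desc in descriptor_sets:                       # desc: (num_features, dim)
        words = kmeans.predict(desc)
        h = np.bincount(words, minlength=k).astype(float)
        hists.append(h / max(h.sum(), 1.0))
    return np.vstack(hists)

def train_and_eval(train_descs, train_labels, test_descs, test_labels, k=400):
    """train_descs/test_descs: lists of (num_features, dim) descriptor arrays, one per clip."""
    kmeans = KMeans(n_clusters=k, n_init=4).fit(np.vstack(train_descs))
    clf = LinearSVC().fit(bow_histograms(train_descs, kmeans), train_labels)
    return clf.score(bow_histograms(test_descs, kmeans), test_labels)
</syntaxhighlight>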
;Spatio-temporal Feature Detectors
* Harris3D
* Cuboid
* Hessian
* Dense
;Spatio-temporal Feature Descriptors
* HOG/HOF
* Cuboid
* HOG3D
* ExtendedSURF
;Add trajectories
# Track a keypoint's movement over time.
# Make a ''feature tube'' around the trajectory. Existing methods used a cube instead of a tube.
# Then do whatever pooling you want (e.g. HOG) to get a trajectory descriptor, as sketched below.
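A rough sketch of pooling along a trajectory instead of a fixed cube, assuming per-frame dense descriptor maps and a tracked point per frame; the helper and its inputs are hypothetical, not from a specific paper.

<syntaxhighlight lang="python">
import numpy as np

def pool_along_trajectory(frame_features, trajectory, patch=16):
    """Average per-frame descriptors sampled around a tracked point.

    frame_features: list of (H, W, D) dense descriptor maps, one per frame.
    trajectory:     list of (x, y) point locations, one per frame (from tracking).
    The pooled support follows the motion instead of staying in a fixed cube."""
    pooled = []
    r = patch // 2
    for feat, (x, y) in zip(frame_features, trajectory):
        H, W, _ = feat.shape
        y0, y1 = max(0, int(y) - r), min(H, int(y) + r)
        x0, x1 = max(0, int(x) - r), min(W, int(x) + r)
        pooled.append(feat[y0:y1, x0:x1].mean(axis=(0, 1)))
    return np.concatenate(pooled)   # one descriptor for the whole tube
</syntaxhighlight>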
* Two-stream ConvNets use a spatial stream (RGB) and a temporal stream (optical flow).
* Add/Stack Trajectories: flow should be added on top of the original points.
* Pool along trajectories instead of cubes.
* I3D stacks 8 frames, passes them to a 3D ConvNet, and gets an output.
* Late fusion extracts features per frame and then combines them later.
===What is an action?===
ActionVLAD
* BoW for actions
* Actions are made up of subactions
** E.g. basketball shoot = dribbling + jump + throw + running + ball
Gaussian Temporal Awareness Networks
* Key idea: not all actions have the same temporal support.
** Depending on frame rate and action speed, actions can take a variable number of frames.

Compressed Video Action Recognition
* The idea is to feed P-frames, which essentially encode motion (similar to optical flow), directly to the CNN.
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
* Inspired by the Faster R-CNN architecture

SST: Single-Stream Temporal Action Proposals
* Single-shot proposal network

Action Tubelet Detector for Spatio-Temporal Action Localization
* For every frame, regress how to move the tubelet up or down

Tube Convolutional Neural Network (T-CNN)
===Complementary Approaches===
Pose-based Action Recognition
* Convert a video into pose maps and do classification on the poses.

PoTion: Pose MoTion Representation
* Do pose estimation to get joint heatmaps.
* Represent pose position & movement as an image with red=start and green=end.
* Stack this across time.

PA3D: Pose-Action 3D Machine
* Focus on pose to do action recognition.

VideoGraph: Recognizing Minutes-Long Activities
==Recognition==
===Context===
Context is a "1-2% idea": traditionally, it has only provided a 1-2% improvement in performance.

;When is context helpful?
* Typical answer: to ''guess'' small/blurry objects based on a prior.
* Deeper answer: to make sense of the visual world.
** Knowing when to use context and when not to use it.
** 80% of context is automatically handled by neural networks, but 20% of the work still remains.

;Why is context important?
* To resolve ambiguity.
** Even high-res objects can be ambiguous.
** There are 30,000+ types of objects, but only a few can occur in a given image.
* To notice ''unusual'' things.
* To infer the function of an unknown object.
===Pixel Context===
Look at nearby pixels by inputting a slightly bigger region.

===Semantic Context===
Use the other objects present to infer what is in the target pixels.

===Geometric Context===
[Hoiem ''et al.'' 2005]
Use segmentation to interpret geometry:
* Sky
* Ground
* Buildings, with their normal directions

===Photometric Context===
If you know where the camera is, you can estimate the size of people and cars.
If you know where the sun is, you can estimate how the scene is lit (e.g., where shadows fall).
===Geographic Context===

===Autocontext===
Do a prediction, then feed the output back into the model so it can refine the prediction.
This is similar to pose machines, hourglass networks, and iterative bounding box regression.
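A minimal sketch of the refinement loop, assuming a hypothetical model(image, prior) that takes the image together with the previous prediction:

<syntaxhighlight lang="python">
import numpy as np

def autocontext_predict(model, image, num_iters=3, num_classes=21):
    """Iteratively refine a prediction by feeding it back as extra input.

    model(image, prior) -> per-pixel class probabilities (H, W, C)  [hypothetical interface]
    The first pass uses a uniform prior; later passes condition on the previous output."""
    H, W = image.shape[:2]
    pred = np.full((H, W, num_classes), 1.0 / num_classes)  # uninformative prior
    for _ in range(num_iters):
        pred = model(image, pred)
    return pred
</syntaxhighlight>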
===Classemes===
The descriptor is formed by concatenating the outputs of weakly trained classifiers.
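A tiny sketch of the idea, assuming a list of pre-trained weak classifiers with a scikit-learn-style decision_function; the names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def classeme_descriptor(feature, weak_classifiers):
    """Concatenate the (soft) outputs of many weakly trained category classifiers.
    feature: low-level feature vector for one image; weak_classifiers: list of
    scikit-learn-style classifiers (e.g. linear SVMs trained on noisy web data)."""
    scores = [clf.decision_function(feature[None, :])[0] for clf in weak_classifiers]
    return np.asarray(scores)   # the classeme descriptor
</syntaxhighlight>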
==3D Scene Understanding==
What can you get from knowing pairwise pixel distances (i.e., given two pairs of pixels, which pair is closer together in 3D space)?
You can get horizons.

Single Image Reconstruction
By finding vanishing points and vanishing lines, you can do 3D reconstruction.

;Taxonomy
How: from bottom-up classifiers to explicit constraints and reasoning.
What: from qualitative to explicit/quantitative.

From qualitative to quantitative:
* Surface labels
* Boundaries + objects
* Stronger geometric constraints
* Reasoning on aspects & poses
* 3D point clouds

Using depth ordering, surface labels, and occlusion cues can give us a planar reconstruction.
Benefits of volumes:
* Finite volumes
* Spatial exclusion (no intersections)
* Mechanical relationships and physical stability (one volume atop another)

Room layout estimation:
* Estimate walls and floor from vanishing points.
* Three principal directions.
* Every room is a box.
* The minimum number of visible walls is 1 and the maximum is 6, but most views see 5 walls when the camera faces one wall.
* Use geometric context, optimizing to get a room layout.
* Given segmentation masks, you can estimate clutter vs. free space.

Functional constraints:
* People sit at laptops, people can open drawers, ...
Primitives:
* Depth: not normalized (making it hard to use), has discontinuities, and does not represent objects.
* Surface normals: the gradient of depth.
===Scene Intrinsics===
Recovering Intrinsic Scene Characteristics:
Given the following:
* original scene
* distance (depth)
* reflectance
* orientation (normal)
* illumination
You can reconstruct the scene perfectly.

Learning ordinal relationships:
* Which point is closer?
** This gets you depth for 3D.
* Which point is darker?
** This gets you reflectance for shading.

Depth vs. surface normals:
* Surface normals are the gradient of depth.
* Depth is hard to use due to large discontinuities and unbounded values.
===Reasoning===
Qualitative Parse Graph
* Understanding of 3D support and support surfaces (physics)
** E.g. a lamp is supported by a nightstand
* Dataset: NYU v2
* Given an image, identify surfaces, then classify edges as concave (pop in) or convex (pop out).
** From this, you can create a pop-up scene.
==Objects + 3D==
* Rasterized 3D representations:
** multi-view images
** depth maps
** volumetric (voxels)
* Geometric 3D representations:
** mesh
** point cloud
** CAD models
** primitive-based CAD models

;Datasets
* Pascal 3D
* ObjectNet3D
* ShapeNet
* Matterport3D

Rough 3D reconstruction:
* Do classification & segmentation using a NN.
* Fit an existing CAD model to the image.

Shape carving:
* Assume everything is made of cuboids and remove cubes.

Without segmentation masks:
* Train an autoencoder for 3D voxel shapes.
* Have an encoder for rendered chairs.
* At test time, real images go into the rendered-image encoder and out the 3D-voxel decoder.
==GANs and VAEs==
===Pixel-RNN/CNN===
* Fully-visible belief network
* Explicit density model:
** Each pixel depends on all previous pixels.
** <math>P_{\theta}(x) = \prod_{i=1}^{n} P_{\theta}(x_i | x_1, ..., x_{i-1})</math>
** You need to define what ''previous pixels'' means (e.g. all pixels above and to the left).
* Then maximize the likelihood of the training data.

;Pros:
* Can explicitly compute P(x).
* An explicit P(x) gives a good evaluation metric.
;Cons:
* Sequence generation is slow.
* Optimizing P(x) is hard.
Types of ''previous pixel'' connections:
* PixelCNN uses masked convolutions, so its dependency field is bounded (fastest).
* Row LSTM has a triangular receptive field (slow).
* Diagonal LSTM
* Diagonal BiLSTM has a full dependency field (slowest).
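The "pixels above and to the left" dependency is typically enforced with masked convolutions. A minimal PyTorch sketch of a type-'A' mask as used in the first PixelCNN layer (the surrounding architecture is omitted):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv whose kernel is zeroed at and after the center pixel (mask type 'A'),
    so each output depends only on pixels above and to the left."""
    def __init__(self, *args, mask_type="A", **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0  # center row: zero from (incl.) center
        mask[kH // 2 + 1:, :] = 0                         # all rows below the center
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask                     # re-apply the mask before convolving
        return super().forward(x)

# Example first layer: conv = MaskedConv2d(3, 64, kernel_size=7, padding=3, mask_type="A")
</syntaxhighlight>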
;Multi-scale PixelRNN
* Takes subsampled pixels as additional input.
* Can capture better global information.
* Slightly better results.
===Generative Adversarial Networks (GANs)===
* Generator generates images
* Discriminator classifies real or fake
* Loss: <math>\min_{G} \max_{D} E_x[\log D(x)] + E_z[\log(1-D(G(z)))]</math>
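A minimal PyTorch sketch of one alternating update for this minimax loss, assuming the discriminator outputs a single logit per image; the generator update uses the common non-saturating variant (maximize log D(G(z))) rather than the literal minimax form.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_G, opt_D, z_dim=128):
    """One alternating update for the minimax GAN objective.
    Assumes D returns one logit per image and G maps noise vectors to images."""
    ones = torch.ones(real.size(0), 1, device=real.device)
    zeros = torch.zeros_like(ones)
    z = torch.randn(real.size(0), z_dim, device=real.device)

    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    d_loss = F.binary_cross_entropy_with_logits(D(real), ones) \
           + F.binary_cross_entropy_with_logits(D(G(z).detach()), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator: non-saturating variant, maximize log D(G(z))
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
</syntaxhighlight>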
;Image-to-image Conditional GANs
* Add an image encoder which outputs z.
;pix2pix
* Adds an L1 loss to the GAN loss (see the sketch below).
* U-Net generator.
* PatchGAN discriminator.
** PatchGAN outputs an N×N grid of real/fake predictions, one per patch (i.e. a limited receptive field).
* Requires paired samples.
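A sketch of the pix2pix generator objective (conditional GAN term plus a weighted L1 term), assuming a PatchGAN discriminator D(input, output) that returns a grid of patch logits; the weight lam=100 is the commonly cited default, used here as an assumption.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(G, D, x, y, lam=100.0):
    """Conditional GAN term plus lambda * L1 reconstruction term for the generator.
    x: input image, y: paired target. D(x, fake) is assumed to return PatchGAN
    logits, so the adversarial loss is averaged over patches."""
    fake = G(x)
    patch_logits = D(x, fake)                       # e.g. shape (B, 1, N, N)
    adv = F.binary_cross_entropy_with_logits(
        patch_logits, torch.ones_like(patch_logits))
    return adv + lam * F.l1_loss(fake, y)
</syntaxhighlight>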
;CycleGAN
* Unpaired image-to-image translation.
* Cycle-consistency loss.
;BicycleGAN
* First learns to reconstruct images with a nice latent code representation in between (cVAE-GAN).
* The main difference is a many-to-many mapping (multi-modal image-to-image translation) between the two domains.
;MUNIT
* Multimodal UNsupervised Image-to-image Translation.
* Maps images from each domain into a shared content space and a domain-specific style space.
===Training Problems with GANs===
* Instability.
* Difficult to keep the generator and discriminator in sync.
** The discriminator cannot be too good or too bad. The same holds for the generator.
** Tricks: LR scheduling, keep the discriminator small, update the generator more frequently.
* Mode collapse.

Mode collapse happens when the generator cannot model different parts (modes) of the data distribution.
;DCGAN architecture guidelines
* Use strided convolutions instead of pooling in the discriminator.
* Use batchnorm in the generator and discriminator.
* Remove FC hidden layers.
* Use ReLU for hidden layers and Tanh for the output layer of the generator.
* Use LeakyReLU in the discriminator.
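A small PyTorch generator that follows these guidelines (strided transposed convolutions, batchnorm, ReLU hidden layers, Tanh output, no FC hidden layers); the layer widths and 64x64 output size are illustrative choices, not prescribed by the lecture.

<syntaxhighlight lang="python">
import torch.nn as nn

def dcgan_generator(z_dim=100, base=64, out_channels=3):
    """Generator per the DCGAN guidelines; input noise has shape (B, z_dim, 1, 1)."""
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, base * 8, 4, 1, 0, bias=False),     # 1x1  -> 4x4
        nn.BatchNorm2d(base * 8), nn.ReLU(True),
        nn.ConvTranspose2d(base * 8, base * 4, 4, 2, 1, bias=False),  # 4x4  -> 8x8
        nn.BatchNorm2d(base * 4), nn.ReLU(True),
        nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1, bias=False),  # 8x8  -> 16x16
        nn.BatchNorm2d(base * 2), nn.ReLU(True),
        nn.ConvTranspose2d(base * 2, base, 4, 2, 1, bias=False),      # 16x16 -> 32x32
        nn.BatchNorm2d(base), nn.ReLU(True),
        nn.ConvTranspose2d(base, out_channels, 4, 2, 1, bias=False),  # 32x32 -> 64x64
        nn.Tanh(),
    )
</syntaxhighlight>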
LSGAN, WGAN have tricks to mitigate mode collapse.
===Evaluation of GANs===
* Turing test (user study)
* Inception score
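The inception score is <math>\exp(E_x[D_{KL}(p(y|x) \Vert p(y))])</math>, computed from a pretrained classifier's class posteriors on generated samples. A minimal sketch, assuming those posteriors have already been computed (e.g. with an Inception network):

<syntaxhighlight lang="python">
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (num_samples, num_classes) class posteriors p(y|x) from a pretrained
    classifier evaluated on generated images. Higher = confident and diverse."""
    marginal = probs.mean(axis=0, keepdims=True)                          # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
</syntaxhighlight>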
===Variational Auto-encoders (VAEs)===
;Training a VAE
* Data likelihood: <math>P(x) = \int P(x|z) P(z) dz</math>
* Approximate with samples of z during training: <math>P(x) \approx \frac{1}{n} \sum_{i=1}^{n} P(x | z_i)</math>
* This is impractical.

Assume we can learn a distribution <math>Q(z)</math> such that <math>z \sim Q(z)</math> yields samples with <math>P(x|z) > 0</math>.

How do we relate <math>P(x)</math> and <math>E_{z \sim Q(z|x)}[\log P(x|z)]</math>?
<math>
\begin{aligned}
D_{KL}[Q(z|x) \Vert P(z|x)] &= E_{z \sim Q}[\log Q(z|x) - \log P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)\\
&= D_{KL}[Q(z|x) \Vert P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)
\end{aligned}
</math>
Rearranging we get:
<math>
\begin{aligned}
&\log P(x) - D_{KL}[Q(z|x) \Vert P(z|x)] = E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\\
\implies &\log P(x) \geq E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]
\end{aligned}
</math>
This is known as the variational lower bound or ''ELBO''.
* We first have the encoder output a mean <math>\mu_{z|x}</math> and the diagonal of a covariance matrix <math>\Sigma_{z|x}</math>.
* For the ELBO, we want to optimize <math>E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]</math>.
* Our first loss is <math>D_{KL}(N(\mu_{z|x}, \Sigma_{z|x}) \Vert N(0, I))</math>.
* We sample z from <math>N(\mu_{z|x}, \Sigma_{z|x})</math> and pass it to the decoder, which outputs <math>\mu_{x|z}, \Sigma_{x|z}</math>.
* Sample <math>\hat{x}</math> from that distribution and use the reconstruction loss <math>\Vert x - \hat{x} \Vert^2</math>.
* Most blog posts forget to sample from <math>P(x|z)</math>.
;Modeling P(x|z)
Let f(z) be the network output.
* Assume <math>P(x|z)</math> is iid Gaussian.
* <math>\hat{x} = f(z) + \eta</math> where <math>\eta \sim N(0,1)</math>
* This simplifies to an L2 loss <math>\Vert x - f(z) \Vert^2</math>.

The importance-weighted VAE uses N samples for the loss.
;Reparameterization trick
To sample from the latent space, you compute <math>z = \mu + \sigma \varepsilon</math> where <math>\varepsilon \sim N(0,1)</math>.
This way, you can backprop through the sampling step.
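Putting the pieces above together, a minimal sketch of the negative ELBO with a Gaussian encoder, the reparameterization trick, the closed-form KL to N(0, I), and the L2 reconstruction term; the encoder/decoder are assumed to exist and x is assumed flattened.

<syntaxhighlight lang="python">
import torch

def vae_loss(encoder, decoder, x):
    """Negative ELBO for a Gaussian encoder/decoder VAE (x assumed flattened to (B, D)).
    encoder(x) -> (mu, logvar) of Q(z|x); decoder(z) -> f(z), the reconstruction mean."""
    mu, logvar = encoder(x)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps                      # reparameterization trick
    recon = ((x - decoder(z)) ** 2).sum(dim=1).mean()           # -E[log P(x|z)] up to a constant
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1).mean()
    return recon + kl                                           # minimizing this maximizes the ELBO
</syntaxhighlight>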
;Conditional VAE
Just input the condition into the encoder and decoder.
;Pros
* Principled approach to generative models.
* Allows inference of <math>q(z|x)</math>, which can be used as a feature representation.
;Cons
* Only maximizes a lower bound (the ELBO), not the exact likelihood.
* Samples are blurrier than those from GANs.

Why are samples blurry?
* Samples are not blurry but noisy.
** Sample vs. mean/expected value.
* L2 loss.
===Flow-based Models===
Flow-based models minimize the negative log-likelihood.
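As a reminder of where this objective comes from (standard background, not spelled out in the lecture notes): with an invertible flow <math>f</math> mapping data <math>x</math> to latent <math>z</math> and a simple base density <math>p_Z</math>, the change-of-variables formula gives

<math>\log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|</math>

so minimizing the negative log-likelihood trades off fitting the base density against the log-determinant of the Jacobian; flow architectures are built from transforms whose Jacobian determinants are cheap to compute.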
==Attribute-based Representation==
;Motivation
Typically in recognition, we only predict the class of the image.
From the category we can guess the attributes, but the category provides only limited information.
The network cannot perform prediction on unseen new classes.
Handling this used to be called ''graceful degradation''.
;Goal
Learn intermediate structure along with object categories.
;Should we care about attributes in DL?
;Why are attributes not simply supervised recognition?
;Benefits
* Dealing with inevitable failure.
* We can infer things about unseen categories.
* We can make comparisons between objects or categories.
;Datasets
* a-Pascal
* a-Yahoo
* CORE
* COCO Attributes

Deep networks should be able to learn attributes implicitly.
However, you don't know if they have actually learned them.
==Extra Topics==
===Fine-grained Recognition===

===Few-shot Recognition===
* Metric learning methods
* Meta-learning methods
* Data augmentation methods
* Semantics

===Zero-shot Recognition===
The goal is to train a classifier without having seen a single labeled example of the target class.
The information comes from side information such as a knowledge graph or word embeddings.
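A minimal sketch of one common recipe (scoring unseen classes by similarity between a projected image embedding and class-name word embeddings); the function and its inputs are illustrative, not a specific published method.

<syntaxhighlight lang="python">
import numpy as np

def zero_shot_classify(image_embedding, class_word_embeddings):
    """Pick the unseen class whose word embedding is most similar (cosine similarity)
    to the image embedding, assuming both live in the same semantic space.
    class_word_embeddings: dict {class_name: (d,) word vector}, e.g. from word2vec."""
    names = list(class_word_embeddings)
    W = np.stack([class_word_embeddings[n] for n in names])
    sims = W @ image_embedding / (np.linalg.norm(W, axis=1)
                                  * np.linalg.norm(image_embedding) + 1e-12)
    return names[int(np.argmax(sims))]
</syntaxhighlight>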
===Beyond Labelled Datasets===
* Semi-supervised: we have both labelled and unlabeled training samples.
* Weakly-supervised: the labels are weak, noisy, and not necessarily for the task we want.
* Learning from the Web: download data from the internet.
==Will be on the exam==
* Softmax, sigmoid, cross entropy
* RCNN vs Fast-RCNN vs Faster-RCNN
* DPM
* Selective search vs RPN
* ELBO

Final exam:
* Friday Dec 4, 2020
* 4pm-6pm on Gradescope
* Will have multiple choice, fill-in-the-blank, question answering, and open-ended questions
** Generally simple questions; either you know it or you don't
* No practice exams
* Only need to know major names (RCNN, Fast(er)-RCNN)
* Only covers lecture material
* 1 letter page of open notes, both sides allowed (honor system)

Homework 2
* Released Nov 30, 2020
* Take SLIC superpixels, extract deep features, and classify them.
* Will have 2 bonus credits; must pick one:
** Use features from multiple layers (multi-scale)
** Use multiple levels of SLIC in the input (SLIC feature pyramid)

Final Project
* Presentations Dec 10
* Recorded videos with presentations
* Final reports due Dec 18

[https://docs.google.com/document/d/1BKmpBWBWuEEywDyBw9CsHgOPB6DH7I0oKEs8zXS7XQw/edit?usp=sharing My Exam Cheat Sheet]
==Project Notes==
* Challenges
* What methods worked and didn't work.
==References==
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for | <ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for | ||
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref> | non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref> | ||
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal | <ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref> | ||
Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref> | |||
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref> | <ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref> | ||
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref> | <ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref> | ||
<ref name="redmon2016yolo">Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection [https://pjreddie.com/media/files/papers/yolo.pdf Link]</ref> | <ref name="redmon2016yolo">Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection [https://pjreddie.com/media/files/papers/yolo.pdf Link]</ref> | ||
<ref name="shotton2009texton">Jamie Shotton John Winn Carsten Rother Antonio Criminisi (2009) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. [https://www.microsoft.com/en-us/research/publication/textonboost-for-image-understanding-multi-class-object-recognition-and-segmentation-by-jointly-modeling-texture-layout-and-context/ Link]</ref> | <ref name="shotton2009texton">Jamie Shotton John Winn Carsten Rother Antonio Criminisi (2009) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. [https://www.microsoft.com/en-us/research/publication/textonboost-for-image-understanding-multi-class-object-recognition-and-segmentation-by-jointly-modeling-texture-layout-and-context/ Link]</ref> | ||
<ref name="shrivastava2016ohem">Abhinav Shrivastava, Abhinav Gupta, Ross Girshick (2016) Training Region-Based Object Detectors With Online Hard Example Mining. (CVPR 2016)[https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Shrivastava_Training_Region-Based_Object_CVPR_2016_paper.html Link]</ref> | <ref name="shrivastava2016ohem">Abhinav Shrivastava, Abhinav Gupta, Ross Girshick (2016) Training Region-Based Object Detectors With Online Hard Example Mining. (CVPR 2016) [https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Shrivastava_Training_Region-Based_Object_CVPR_2016_paper.html Link]</ref> | ||
<ref name="bell2016ion">Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick (2016). Inside-Outside Net: Detecting Objects in Context With Skip Pooling and Recurrent Neural Networks (CVPR 2016)</ref> | <ref name="bell2016ion">Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick (2016). Inside-Outside Net: Detecting Objects in Context With Skip Pooling and Recurrent Neural Networks (CVPR 2016) [https://openaccess.thecvf.com/content_cvpr_2016/papers/Bell_Inside-Outside_Net_Detecting_CVPR_2016_paper.pdf CVF Mirror]</ref> | ||
}}