Visual Learning and Recognition

 
Notes for CMSC828I Visual Learning and Recognition (Fall 2020) taught by [http://abhinavsh.info/ Abhinav Shrivastava]
 
[https://www.cs.umd.edu/class/fall2020/cmsc828i/ Course Website]


This class covers:

* 4 Texture Related


===Deformable Part Models (DPM)===
Lecture (Oct 6-8, 2020)

Deformable Part Models (DPM)
Felzenszwalb et al. (2009) <ref name="felzenszwalb2009dpm"></ref>
'''Important: Read this paper'''

* <math>w_{ij}^T \in \mathbb{R}^5</math> is the deformation parameter between parts i and j


The total number of configurations is <math>10^{4N}</math>: for a <math>100 \times 100</math> image, each part location <math>p_i</math> can take <math>100 \times 100 = 10^4</math> values, and there are <math>N</math> parts.

The trick is to use dynamic programming and a tree-based model.
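Concretely, because the total score decomposes over the tree, each part can be maximized out independently given the root location. Below is a minimal brute-force sketch for a star-shaped model (function and variable names are my own, and the deformation weight is a single scalar; the actual DPM implementation learns per-part deformation coefficients and replaces the inner loop with generalized distance transforms to make it linear in the number of locations):

<syntaxhighlight lang="python">
import numpy as np

def best_star_score(root_score, part_scores, anchors, def_w):
    """Max-sum inference for a star-shaped part model (illustrative sketch).

    root_score:  (H, W) appearance score of the root filter at each location.
    part_scores: list of (H, W) appearance scores, one per part.
    anchors:     list of (dy, dx) anchor offsets of each part relative to the root.
    def_w:       scalar quadratic deformation weight.
    Returns the best total score over all root placements.
    """
    H, W = root_score.shape
    ys, xs = np.mgrid[0:H, 0:W]          # candidate root locations
    total = root_score.astype(float)
    for score, (dy, dx) in zip(part_scores, anchors):
        best = np.full((H, W), -np.inf)
        for py in range(H):              # candidate part locations
            for px in range(W):
                d2 = (ys + dy - py) ** 2 + (xs + dx - px) ** 2
                best = np.maximum(best, score[py, px] - def_w * d2)
        total += best                    # parts are independent given the root
    return total.max()
</syntaxhighlight>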


===TextonBoost===
Shotton ''et al.'' <ref name="shotton2009texton"></ref>
Incorporates texture-layout, color, location, and edge potentials in a conditional random field, jointly modeling appearance, shape, and context.


==Semantic Segmentation==


==Region-based Object Detection Systems==

Lecture (Oct 15, 2020)

===R-CNN===
;R-CNN at test time


===Approaches for Pose Estimation===
Lecture (Oct 27, 2020)
* Top-down approaches
** Do person detection then do pose estimation.


===Models for Video Recognition===
;Basic Video Pipeline
# Extract features
# Learn space-time bag of words
# Train/test BoW classifier
;Spatio-temporal Feature Detectors
* Harris3D
* Cuboid
* Hessian
* Dense
;Spatio-temporal Feature Descriptors
* HOG/HOF
* Cuboid
* HOG3D
* ExtendedSURF
;Add trajectories
# Track a keypoint's movement over time
# Make a ''feature tube'' around the trajectory. Existing methods used a cube instead of a tube.
# Then do whatever pooling you want (e.g. HOG) to get a trajectory description.
* Two-stream ConvNet uses a spatial stream (RGB) and temporal stream (optical flow)
* Add/Stack Trajectories: Flow should be added on top of original points.
* Pool Along Trajectories instead of cubes
* I3D stacks 8 frames, passes to 3D convnet and gets an output
* Late fusion extracts features per-frame and then combines later.
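A minimal sketch of the late-fusion step for the two-stream setup described above (the network and argument names here are placeholders; both streams are assumed to output per-class logits):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def two_stream_predict(spatial_net, temporal_net, rgb, flow_stack, alpha=0.5):
    """Late-fusion prediction for a two-stream model (illustrative sketch).

    spatial_net:  CNN over RGB frames, returns (B, num_classes) logits.
    temporal_net: CNN over a stack of optical-flow fields, same output shape.
    alpha:        weight on the spatial stream (0.5 = simple averaging).
    """
    p_spatial = F.softmax(spatial_net(rgb), dim=1)
    p_temporal = F.softmax(temporal_net(flow_stack), dim=1)
    probs = alpha * p_spatial + (1 - alpha) * p_temporal   # fuse after softmax
    return probs.argmax(dim=1)
</syntaxhighlight>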
===What is an action?===
ActionVLAD
* BoW for actions
* Actions are made up of subactions
** E.g. basketball shoot = dribbling + jump + throw + running + ball
Gaussian Temporal Awareness Networks
* Key idea: Not all actions have the same temporal support.
** Depending on frame-rate & action speed, actions can take a variable number of frames.
Compressed Video Action Recognition
* The idea is to feed the P-frames, which are essentially optical flow, directly to the CNN.
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
* Inspired by Faster R-CNN Architecture
SST: Single-Stream Temporal Action Proposals
* Single-shot proposal network
Action Tubelet Detector for Spatio-Temporal Action Localization
* For every frame, regress how to move the tubelet up or down
Tube Convolutional Neural Network (T-CNN)
===Complementary Approaches===
Pose-based Action Recognition
* Convert a video into pose maps and do classifications on poses
PoTion: Pose MoTion Representation
* Do pose estimation to get joint heatmaps.
* Represent pose position & movement as an image with red=start and green=end.
* Stack this across time.
PA3D: Pose-Action 3D Machine
* Focus on pose to do action recognition
VideoGraph: Recognizing Minutes-Long Activities
==Recognition==
===Context===
Context is a 1-2% idea: traditionally, it has only provided a 1-2% improvement in performance.
;When is context helpful? 
* Typical answer: to ''guess'' small/blurry objects based on a prior.
* Deeper answer: to make sense of the visual world.
** When to use context and when not to use context.
** 80% of context is automatically handled by neural networks, but 20% of work still remains.
;Why is context important?
* To resolve ambiguity.
** Even high-res objects can be ambiguous.
** There are 30,000+ types of objects but only a few can occur in an image.
* To notice ''unusual'' things.
* To infer the function of an unknown object.
===Pixel Context===
Look at nearby pixels by inputting a slightly bigger region.
===Semantic Context===
Use other objects present to answer what is present in the target pixels.
===Geometric Context===
[Hoiem ''et al.'' 2005] 
Use segmentation to interpret geometry:
* Sky
* Ground
* Building (vertical surface) with its normal direction
===Photometric Context===
If you know where the camera is, you can estimate the size of people and cars. 
If you know where the sun is, you can reason about the scene's shading and shadows.
===Geographic Context===
===Autocontext===
Do a prediction and then feed in the output to the model again for the model to refine the prediction. 
This is similar to pose machines, hourglass networks, and iterative bounding box regression.
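A minimal sketch of the idea (the model interface and shapes here are assumptions; autocontext can use any predictor that accepts its own previous output as an extra input):

<syntaxhighlight lang="python">
import torch

def autocontext_inference(model, image, num_classes, iters=3):
    """Iteratively refine predictions by feeding them back in (illustrative).

    `model` is assumed to map (image channels + num_classes prediction maps)
    to per-class logits of shape (B, num_classes, H, W).
    """
    b, _, h, w = image.shape
    pred = torch.full((b, num_classes, h, w), 1.0 / num_classes,
                      device=image.device)                 # uniform start
    for _ in range(iters):
        pred = model(torch.cat([image, pred], dim=1)).softmax(dim=1)
    return pred
</syntaxhighlight>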
===Classemes===
Descriptor is formed by concatenating outputs of weakly trained classifiers.
==3D Scene Understanding==
What can you get from knowing pairwise pixel distances? (i.e. given two sets of pixels, which pair is closer in 3D space) 
You can get horizons.
Single Image Reconstruction
By finding vanishing points and lines, you can do 3D reconstruction.
;Taxonomy
How: from bottom-up classifiers to explicit constraints and reasoning.
What: from qualitative to explicit/quantitative.
From qualitative to quantitative:
* Surface labels
* Boundaries + objects
* Stronger geometric constraints
* Reasoning on aspects & poses
* 3D point clouds
Using depth ordering, surface labels, and occlusion cues can give us a planar reconstruction.
Benefits of volumes:
* Finite volumes
* Spatial exclusion (no intersections)
* Mechanical relationships and physical stability (one volume atop another)
Room layout estimation:
* Estimate walls and floor from vanishing points.
* Three principal directions
* Every room is a box
* A room box has 6 surfaces; a camera inside sees at least 1 and at most 5 of them, typically 5 when the camera faces one wall.
* Use geometric context, optimizing to get a room context.
* Given segmentation masks, you can estimate clutter vs free space.
Functional constraints:
* People sit at laptops, people can open drawers, ...
Primitives
* Depth: not normalized (making it hard to use), has discontinuities, and does not represent objects well.
* Surface normals: the gradient of depth.
===Scene Intrinsics===
Recovering Intrinsic Scene Characteristics: 
Given the following: 
* original scene
* distance (depth)
* reflectance
* orientation (normal)
* illumination
You can extract the scene perfectly.
Learning ordinal relationships:
* Which point is closer?
** This gets you depth for 3D
* Which point is darker?
** This gets you reflectance for shading
Depth vs surface normals:
* Surface normals are gradient of depth
* Depth is hard to use due to large discontinuities and unbounded values.
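Since normals are essentially the gradient of depth, a rough conversion can be sketched as follows (this ignores camera intrinsics and treats the depth map as a height field, which is only an approximation):

<syntaxhighlight lang="python">
import numpy as np

def normals_from_depth(depth):
    """Approximate surface normals from a depth map via finite differences."""
    dz_dy, dz_dx = np.gradient(depth)                    # gradients along rows, cols
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)       # normalize to unit length
    return n                                             # (H, W, 3)
</syntaxhighlight>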
===Reasoning===
Qualitative Parse Graph
* Understanding of 3D support, support surfaces (physics)
** E.g. lamp is supported by nightstand
* Dataset: NYU v2
* Given an image, identify surfaces, then classify edges as concave (pop in) or convex (pop out).
** From this, you can create a popup scene.
==Objects + 3D==
* Rasterized 3D representations:
** multi-view images
** depth maps
** volumetric (voxels)
* Geometric 3D representations:
** mesh
** point cloud
** CAD models
** primitive-based CAD models
;Datasets
* Pascal 3D
* ObjectNet3D
* ShapeNet
* Matterport3D
Rough 3D reconstruction:
* Do classification & segmentation using a NN.
* Fit an existing CAD model to the image.
Shape carving:
* Assume everything is made of cuboids and remove cubes.
Without segmentation masks:
* Train an autoencoder on 3D voxel shapes.
* Train an encoder for rendered images (e.g. chairs) that maps into the same latent space.
* At test time, images go through the rendered-image encoder and out the 3D voxel decoder.
==GANs and VAEs==
===Pixel-RNN/CNN===
* Fully-visible belief network
* Explicit density model:
** Each pixel depends on all previous pixels
** <math>P_{\theta}(x) = \prod_{i=1}^{n} P_{\theta}(x_i | x_1, ..., x_{i-1})</math>
** You need to define what counts as ''previous pixels'' (e.g. all pixels above and to the left)
* Then maximize likelihood of training data
;Pros:
* Can explicitly compute P(x)
* Explicit P(x) gives good evaluation metric
;Cons:
* Sequence generation is slow
* Optimizing P(x) is hard.
Types of ''previous pixels'' connections:
* PixelCNN uses a bounded receptive field over previous pixels (fastest)
* Row LSTM has a triangular receptive field (slow)
* Diagonal LSTM
* Diagonal BiLSTM has a full dependency field (slowest)
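The ''previous pixels'' constraint is typically enforced with masked convolutions. A minimal sketch (mask 'A' also hides the current pixel and is used in the first layer; mask 'B' is used in later layers; sizes in the example are arbitrary):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """PixelCNN-style masked convolution (illustrative sketch)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # hide pixels to the right
        mask[kh // 2 + 1:, :] = 0                         # hide rows below
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask                     # zero out "future" weights
        return super().forward(x)

# First layer of a PixelCNN over grayscale images.
first = MaskedConv2d('A', in_channels=1, out_channels=64, kernel_size=7, padding=3)
</syntaxhighlight>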
;Multi-scale PixelRNN
* Takes subsampled pixels as additional input pixels
* Can capture better global information
* Slightly better results
===Generative Adversarial Networks (GANs)===
* Generator generates images
* Discriminator classifies real or fake
* Loss: <math>\min_{G} \max_{D} E_x[\log D(x)] + E_z[\log(1-D(G(z)))]</math>
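A minimal alternating-update sketch of this objective (the generator <code>G</code>, discriminator <code>D</code>, and optimizers are assumed to be user-defined, with <code>D</code> outputting one logit per image; the generator update uses the common non-saturating variant rather than the literal minimax loss):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    """One discriminator update and one generator update (illustrative)."""
    b = real.size(0)
    ones = torch.ones(b, 1, device=real.device)
    zeros = torch.zeros(b, 1, device=real.device)
    fake = G(torch.randn(b, z_dim, device=real.device))

    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
              + F.binary_cross_entropy_with_logits(D(fake.detach()), zeros))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator: maximize log D(G(z)) (non-saturating trick).
    g_loss = F.binary_cross_entropy_with_logits(D(fake), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
</syntaxhighlight>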
;Image-to-image Conditional GANS
* Add an image encoder which outputs z
;pix2pix
* Add L1 loss to the loss function
* UNet generator
* PatchGAN discriminator
** PatchGAN outputs N×N real/fake values, one per patch (i.e. a limited receptive field)
* Requires paired samples
;CycleGAN
* Unpaired image-to-image translation
* Cycle-consistency loss
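For generators <math>G: X \to Y</math> and <math>F: Y \to X</math>, the cycle-consistency term asks that translating an image and translating it back recovers the original; it is added to the adversarial losses for both directions:

<math>
\mathcal{L}_{cyc}(G, F) = E_{x}\big[\Vert F(G(x)) - x \Vert_1\big] + E_{y}\big[\Vert G(F(y)) - y \Vert_1\big]
</math>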
;BicycleGAN
* First learn to reconstruct images with a nice latent code representation in between (cVAE-GAN)
* The main difference is that we have a many-to-many mapping (multi-modal image-to-image) between the two domains.
;MUNIT
* Multimodal UNsupervised Image-to-image Translation
* Maps images from each domain into a shared content space and domain-specific style space.
===Training Problems with GANs===
* Instability
* Difficult to keep generator and discriminator in sync.
** Discriminator cannot be too good or too bad. Same with generator.
** Tricks: LR scheduling, keep discriminator small, update generator more frequently.
* Mode collapse
Mode collapse happens when the generator cannot model different parts of the distribution.
;DCGAN architecture guidelines
* Use strided convolutions instead of pooling in the discriminator.
* Use batchnorm in both the generator and discriminator.
* Remove FC hidden layers.
* Use ReLU for hidden layers and tanh for the output layer of the generator.
* Use LeakyReLU in the discriminator.
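A minimal generator following these guidelines (a sketch for 64×64 outputs; channel widths are illustrative, not the exact paper configuration):

<syntaxhighlight lang="python">
import torch.nn as nn

def dcgan_generator(z_dim=100, ngf=64, out_channels=3):
    """Strided transposed convs + batchnorm + ReLU, tanh output, no FC layers.
    Expects input noise of shape (B, z_dim, 1, 1)."""
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False),    # 4x4
        nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 8x8
        nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 16x16
        nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),      # 32x32
        nn.BatchNorm2d(ngf), nn.ReLU(True),
        nn.ConvTranspose2d(ngf, out_channels, 4, 2, 1, bias=False), # 64x64
        nn.Tanh(),
    )
</syntaxhighlight>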
LSGAN and WGAN use additional tricks to mitigate mode collapse.
===Evaluation of GANs===
* Turing test (User study)
* Inception score
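The Inception score uses a pretrained Inception classifier: each generated image should receive a confident class prediction <math>p(y|x)</math>, while the marginal <math>p(y)</math> over generated samples stays diverse:

<math>
IS = \exp\Big( E_{x \sim p_g}\big[ D_{KL}\big( p(y|x) \,\Vert\, p(y) \big) \big] \Big)
</math>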
===Variational Auto-encoders (VAEs)===
;Training a VAE
* Data likelihood: <math>P(x) = \int P(x|z) P(z) dz</math>
* Approximate with samples of z during training: <math>P(x) \approx \frac{1}{n} \sum_{i=1}^{n} P(x | z_i)</math>
* This is impractical.
Assume we can learn a distribution <math>Q(z)</math> such that sampling <math>z \sim Q(z)</math> yields values for which <math>P(x|z) > 0</math>. 
How are <math>P(x)</math> and <math>E_{z \sim Q(z|x)}[\log P(x|z)]</math> related? 
<math>
\begin{aligned}
D_{KL}[Q(z|x) \Vert P(z|x)] &= E_{z \sim Q}[\log Q(z|x) - \log P(z|x)]\\
&= E_{z \sim Q}[\log Q(z|x) - \log P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)\\
&= D_{KL}[Q(z|x) \Vert P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)
\end{aligned}
</math> 
Here the second step uses Bayes' rule, <math>P(z|x) = P(x|z)P(z)/P(x)</math>. Rearranging, we get: 
<math>
\begin{aligned}
&\log P(x) - D_{KL}[Q(z|x) \Vert P(z|x)] = E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\\
\implies &\log P(x) \geq E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]
\end{aligned}
</math> 
This is known as the variational lower bound, or ''ELBO''.
* We first have the encoder output a mean <math>\mu_{z|x}</math> and covariance matrix diagonal <math>\Sigma_{z|x}</math>. 
* For ELBO we want to optimize <math>E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]</math>. 
* Our first loss is <math>D_{KL}(N(\mu_{z|x}, \Sigma_{z|x}) \Vert N(0, I))</math>. 
* We sample z from <math>N(\mu_{z|x}, \Sigma_{z|x})</math> and pass it to the decoder which outputs <math>\mu_{x|z}, \Sigma_{x|z}</math>. 
* Sample <math>\hat{x}</math> from the distribution and have reconstruction loss <math>\Vert x - \hat{x} \Vert^2</math>. 
* Most blog posts will forget to sample from <math>P(x|z)</math>.
;Modeling P(x|z)
Let f(z) be the network output.
* Assume <math>P(x|z)</math> is iid Gaussian.
* <math>\hat{x} = f(z) + \eta</math> where <math>\eta \sim N(0,1)</math>
* Simplifies to an L2 loss <math>\Vert x - f(z) \Vert^2</math>
Importance weighted VAE uses N samples for the loss.
;Reparameterization trick
To sample from the latent space, you do <math>z = \mu + \sigma \varepsilon</math> where <math>\varepsilon \sim N(0,1)</math>. 
This way, you can backprop through the sampling step.
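A minimal sketch of the resulting training loss, combining the reparameterization trick, the closed-form KL term for a Gaussian <math>Q(z|x)</math>, and the L2 reconstruction term (the encoder/decoder interfaces here are assumptions):

<syntaxhighlight lang="python">
import torch

def vae_loss(encoder, decoder, x):
    """Negative ELBO up to constants (illustrative sketch).

    `encoder(x)` is assumed to return the mean and log-variance of Q(z|x);
    `decoder(z)` is assumed to return the mean of P(x|z).
    """
    mu, logvar = encoder(x)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps                 # z = mu + sigma * eps
    x_hat = decoder(z)

    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    recon = torch.sum((x - x_hat) ** 2)                    # Gaussian P(x|z) -> L2
    return recon + kl
</syntaxhighlight>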
;Conditional VAE
Just input the condition into the encoder and decoder.
;Pros
* Principled approach to generative models
* Allows inference of <math>q(z|x)</math> which can be used as a feature representation
;Cons
* Maximizes a lower bound (ELBO) rather than the exact likelihood
* Samples are blurrier than GANs
Why are samples blurry?
* Samples are not blurry but noisy
** Sample vs Mean/Expected Value
* L2 loss
===Flow-based Models===
Flow-based models minimize the negative log-likelihood.
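With an invertible mapping <math>f_\theta</math> from data <math>x</math> to latent <math>z</math>, the change-of-variables formula gives an exact log-likelihood to maximize (equivalently, a negative log-likelihood to minimize):

<math>
\log p_\theta(x) = \log p_Z\big(f_\theta(x)\big) + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
</math>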
==Attribute-based Representation==
;Motivation
Typically in recognition, we only predict the class of the image. 
From the category, we can guess the attributes but the category provides only limited information. 
The network cannot make predictions for unseen new classes; handling such cases gracefully is referred to as ''graceful degradation''.
;Goal
Learn intermediate structure with object categories.
;Should we care about attributes in DL?
;Why are attributes not simply supervised recognition?
;Benefits
* Dealing with inevitable failure.
* We can infer things about unseen categories.
* We can make comparison between objects or categories.
;Datasets
* a-Pascal
* a-Yahoo
* CORE
* COCO Attributes
Deep networks should be able to learn attributes implicitly. 
However, you don't know if it has actually learned them.
==Extra Topics==
===Fine-grained Recognition===
===Few-shot Recognition===
* Metric learning methods
* Meta-learning methods
* Data Augmentation Methods
* Semantics
===Zero-shot Recognition===
The goal is to train a classifier without having seen a single labeled example. 
The information comes from auxiliary sources such as a knowledge graph or word embeddings.
===Beyond Labelled Datasets===
* Semi-supervised: We have both labelled and unlabeled training samples.
* Weakly-supervised: The labels are weak, noisy, and not necessarily for the task we want.
* Learning from the Web: Download data from the internet


==Will be on the exam==
* Softmax, sigmoid, cross entropy
* RCNN vs Fast-RCNN vs Faster-RCNN
* DPM
* Selective search vs RPN
* ELBO
Final exam:
* Friday Dec 4, 2020
* 4pm-6pm on gradescope
* Will have multiple choice, fill-in-the-blank, question answering, open-ended
** Generally simple questions; either you know or you don't
* No practice exams
* Only need to know major names (RCNN, Fast(er)-RCNN)
* Only covers lecture material.
* 1 letter page of open notes, both sides allowed (honor system)
Homework 2
* Released Nov 30, 2020
* Take SLIC superpixels, extract deep features, and classify them.
* There are 2 bonus credit options; you must pick one:
** Use features from multiple layers (multi-scale)
** Use multiple levels of SLIC in the input (SLIC feature pyramid)
Final Project
* Presentations Dec 10
* Recorded videos with presentations
* Final reports Dec 18
[https://docs.google.com/document/d/1BKmpBWBWuEEywDyBw9CsHgOPB6DH7I0oKEs8zXS7XQw/edit?usp=sharing My Exam Cheat Sheet]


==Project Notes==
* Challenges
* What methods worked and didn't work.


==References==
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref>
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref>
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref>
Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref>
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref>
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref>
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref>
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref>
<ref name="redmon2016yolo">Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection [https://pjreddie.com/media/files/papers/yolo.pdf Link]</ref>
<ref name="redmon2016yolo">Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection [https://pjreddie.com/media/files/papers/yolo.pdf Link]</ref>
<ref name="shotton2009texton">Jamie Shotton John Winn Carsten Rother Antonio Criminisi (2009) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. [https://www.microsoft.com/en-us/research/publication/textonboost-for-image-understanding-multi-class-object-recognition-and-segmentation-by-jointly-modeling-texture-layout-and-context/ Link]</ref>
<ref name="shotton2009texton">Jamie Shotton John Winn Carsten Rother Antonio Criminisi (2009) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. [https://www.microsoft.com/en-us/research/publication/textonboost-for-image-understanding-multi-class-object-recognition-and-segmentation-by-jointly-modeling-texture-layout-and-context/ Link]</ref>
<ref name="shrivastava2016ohem">Abhinav Shrivastava, Abhinav Gupta, Ross Girshick (2016) Training Region-Based Object Detectors With Online Hard Example Mining. (CVPR 2016)[https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Shrivastava_Training_Region-Based_Object_CVPR_2016_paper.html Link]</ref>
<ref name="shrivastava2016ohem">Abhinav Shrivastava, Abhinav Gupta, Ross Girshick (2016) Training Region-Based Object Detectors With Online Hard Example Mining. (CVPR 2016) [https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Shrivastava_Training_Region-Based_Object_CVPR_2016_paper.html Link]</ref>
<ref name="bell2016ion">Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick (2016). Inside-Outside Net: Detecting Objects in Context With Skip Pooling and Recurrent Neural Networks (CVPR 2016)</ref>
<ref name="bell2016ion">Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick (2016). Inside-Outside Net: Detecting Objects in Context With Skip Pooling and Recurrent Neural Networks (CVPR 2016) [https://openaccess.thecvf.com/content_cvpr_2016/papers/Bell_Inside-Outside_Net_Detecting_CVPR_2016_paper.pdf CVF Mirror]</ref>
}}