Visual Learning and Recognition

 
Notes for CMSC828I Visual Learning and Recognition (Fall 2020) taught by [http://abhinavsh.info/ Abhinav Shrivastava]
 
[https://www.cs.umd.edu/class/fall2020/cmsc828i/ Course Website]


This class covers:

* 4 Texture Related


===Deformable Part Models (DPM)===
Lecture (Oct 6-8, 2020)

Deformable Part Models (DPM)
Felzenszwalb et al. (2009) <ref name="felzenszwalb2009dpm"></ref>
'''Important: Read this paper'''

* <math>w_{ij}^T \in \mathbb{R}^5</math> is the deformation parameter between parts i and j


The total number of configurations is <math>10^{4N}</math>: for a <math>100 \times 100</math> image, each part location <math>p_i</math> can take <math>100 \times 100 = 10^4</math> values, and there are <math>N</math> parts.

The trick is to use dynamic programming and a tree-based model.
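Concretely, because the total score decomposes over the tree, each part can be maximized out independently given the root location. Below is a minimal brute-force sketch for a star-shaped model (function and variable names are my own, and the deformation weight is a single scalar; the actual DPM implementation learns per-part deformation coefficients and replaces the inner loop with generalized distance transforms to make it linear in the number of locations):

<syntaxhighlight lang="python">
import numpy as np

def best_star_score(root_score, part_scores, anchors, def_w):
    """Max-sum inference for a star-shaped part model (illustrative sketch).

    root_score:  (H, W) appearance score of the root filter at each location.
    part_scores: list of (H, W) appearance scores, one per part.
    anchors:     list of (dy, dx) anchor offsets of each part relative to the root.
    def_w:       scalar quadratic deformation weight.
    Returns the best total score over all root placements.
    """
    H, W = root_score.shape
    ys, xs = np.mgrid[0:H, 0:W]          # candidate root locations
    total = root_score.astype(float)
    for score, (dy, dx) in zip(part_scores, anchors):
        best = np.full((H, W), -np.inf)
        for py in range(H):              # candidate part locations
            for px in range(W):
                d2 = (ys + dy - py) ** 2 + (xs + dx - px) ** 2
                best = np.maximum(best, score[py, px] - def_w * d2)
        total += best                    # parts are independent given the root
    return total.max()
</syntaxhighlight>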


===TextonBoost===
Shotton ''et al.'' <ref name="shotton2009texton"></ref>
Incorporates texture-layout, color, location, and edge potentials in a conditional random field, jointly modeling appearance, shape, and context.


==Semantic Segmentation==


==Region-based Object Detection Systems==

Lecture (Oct 15, 2020)

===R-CNN===
;R-CNN at test time


===Approaches for Pose Estimation===
Lecture (Oct 27, 2020)
* Top-down approaches
** Do person detection then do pose estimation.


===Models for Video Recognition===
;Basic Video Pipeline
# Extract features
# Learn space-time bag of words
# Train/test BoW classifier
;Spatio-temporal Feature Detectors
* Harris3D
* Cuboid
* Hessian
* Dense
;Spatio-temporal Feature Descriptors
* HOG/HOF
* Cuboid
* HOG3D
* ExtendedSURF
;Add trajectories
# Track a keypoint's movement over time
# Make a ''feature tube'' around the trajectory. Existing methods used a cube instead of a tube.
# Then do whatever pooling you want (e.g. HOG) to get a trajectory description.
* Two-stream ConvNet uses a spatial stream (RGB) and temporal stream (optical flow)
* Add/Stack Trajectories: Flow should be added on top of original points.
* Pool Along Trajectories instead of cubes
* I3D stacks 8 frames, passes to 3D convnet and gets an output
* Late fusion extracts features per-frame and then combines later.
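A minimal sketch of the late-fusion step for the two-stream setup described above (the network and argument names here are placeholders; both streams are assumed to output per-class logits):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def two_stream_predict(spatial_net, temporal_net, rgb, flow_stack, alpha=0.5):
    """Late-fusion prediction for a two-stream model (illustrative sketch).

    spatial_net:  CNN over RGB frames, returns (B, num_classes) logits.
    temporal_net: CNN over a stack of optical-flow fields, same output shape.
    alpha:        weight on the spatial stream (0.5 = simple averaging).
    """
    p_spatial = F.softmax(spatial_net(rgb), dim=1)
    p_temporal = F.softmax(temporal_net(flow_stack), dim=1)
    probs = alpha * p_spatial + (1 - alpha) * p_temporal   # fuse after softmax
    return probs.argmax(dim=1)
</syntaxhighlight>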
===What is an action?===
ActionVLAD
* BoW for actions
* Actions are made up of subactions
** E.g. basketball shoot = dribbling + jump + throw + running + ball
Gaussian Temporal Awareness Networks
* Key idea: Not all actions have the same temporal support.
** Depending on frame-rate & action speed, actions can take a variable number of frames.
Compressed Video Action Recognition
* The idea is to feed the P-frames, which are essentially optical flow, directly to the CNN.
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
* Inspired by Faster R-CNN Architecture
SST: Single-Stream Temporal Action Proposals
* Single-shot proposal network
Action Tubelet Detector for Spatio-Temporal Action Localization
* For every frame, regress how to move the tubelet up or down
Tube Convolutional Neural Network (T-CNN)
===Complementary Approaches===
Pose-based Action Recognition
* Convert a video into pose maps and do classifications on poses
PoTion: Pose MoTion Representation
* Do pose estimation to get joint heatmaps.
* Represent pose position & movement as an image with red=start and green=end.
* Stack this across time.
PA3D: Pose-Action 3D Machine
* Focus on pose to do action recognition
VideoGraph: Recognizing Minutes-Long Activities
==Recognition==
===Context===
Context is a 1-2% idea: traditionally, it has only provided a 1-2% improvement in performance.
;When is context helpful? 
* Typical answer: to ''guess'' small/blurry objects based on a prior.
* Deeper answer: to make sense of the visual world.
** When to use context and when not to use context.
** 80% of context is automatically handled by neural networks, but 20% of work still remains.
;Why is context important?
* To resolve ambiguity.
** Even high-res objects can be ambiguous.
** There are 30,000+ types of objects but only a few can occur in an image.
* To notice ''unusual'' things.
* To infer the function of an unknown object.
===Pixel Context===
Look at nearby pixels by inputting a slightly bigger region.
===Semantic Context===
Use other objects present to answer what is present in the target pixels.
===Geometric Context===
[Hoiem ''et al.'' 2005] 
Use segmentation to interpret geometry:
* Sky
* Ground
* Building (vertical surface) with its normal direction
===Photometric Context===
If you know where the camera is, you can estimate the size of people and cars. 
If you know where the sun is, you can reason about the scene's shading and shadows.
===Geographic Context===
===Autocontext===
Do a prediction and then feed in the output to the model again for the model to refine the prediction. 
This is similar to pose machines, hourglass networks, and iterative bounding box regression.
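A minimal sketch of the idea (the model interface and shapes here are assumptions; autocontext can use any predictor that accepts its own previous output as an extra input):

<syntaxhighlight lang="python">
import torch

def autocontext_inference(model, image, num_classes, iters=3):
    """Iteratively refine predictions by feeding them back in (illustrative).

    `model` is assumed to map (image channels + num_classes prediction maps)
    to per-class logits of shape (B, num_classes, H, W).
    """
    b, _, h, w = image.shape
    pred = torch.full((b, num_classes, h, w), 1.0 / num_classes,
                      device=image.device)                 # uniform start
    for _ in range(iters):
        pred = model(torch.cat([image, pred], dim=1)).softmax(dim=1)
    return pred
</syntaxhighlight>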
===Classemes===
Descriptor is formed by concatenating outputs of weakly trained classifiers.
==3D Scene Understanding==
What can you get from knowing pairwise pixel distances? (i.e. given two sets of pixels, which pair is closer in 3D space) 
You can get horizons.
Single Image Reconstruction
By finding vanishing points and lines, you can do 3D reconstruction.
;Taxonomy
How: from bottom-up classifiers to explicit constraints and reasoning.
What: from qualitative to explicit/quantitative.
From qualitative to quantitative:
* Surface labels
* Boundaries + objects
* Stronger geometric constraints
* Reasoning on aspects & poses
* 3D point clouds
Using depth ordering, surface labels, and occlusion cues can give us a planar reconstruction.
Benefits of volumes:
* Finite volumes
* Spatial exclusion (no intersections)
* Mechanical relationships and physical stability (one volume atop another)
Room layout estimation:
* Estimate walls and floor from vanishing points.
* Three principal directions
* Every room is a box
* A room box has 6 surfaces; a camera inside sees at least 1 and at most 5 of them, typically 5 when the camera faces one wall.
* Use geometric context, optimizing to get a room context.
* Given segmentation masks, you can estimate clutter vs free space.
Functional constraints:
* People sit at laptops, people can open drawers, ...
Primitives
* Depth: not normalized (making it hard to use), has discontinuities, and does not represent objects well.
* Surface normals: the gradient of depth.
===Scene Intrinsics===
Recovering Intrinsic Scene Characteristics: 
Given the following: 
* original scene
* distance (depth)
* reflectance
* orientation (normal)
* illumination
You can extract the scene perfectly.
Learning ordinal relationships:
* Which point is closer?
** This gets you depth for 3D
* Which point is darker?
** This gets you reflectance for shading
Depth vs surface normals:
* Surface normals are gradient of depth
* Depth is hard to use due to large discontinuities and unbounded values.
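Since normals are essentially the gradient of depth, a rough conversion can be sketched as follows (this ignores camera intrinsics and treats the depth map as a height field, which is only an approximation):

<syntaxhighlight lang="python">
import numpy as np

def normals_from_depth(depth):
    """Approximate surface normals from a depth map via finite differences."""
    dz_dy, dz_dx = np.gradient(depth)                    # gradients along rows, cols
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)       # normalize to unit length
    return n                                             # (H, W, 3)
</syntaxhighlight>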
===Reasoning===
Qualitative Parse Graph
* Understanding of 3D support, support surfaces (physics)
** E.g. lamp is supported by nightstand
* Dataset: NYU v2
* Given an image, identify surfaces, then classify edges as concave (pop in) or convex (pop out).
** From this, you can create a popup scene.
==Objects + 3D==
* Rasterized 3D representations:
** multi-view images
** depth maps
** volumetric (voxels)
* Geometric 3D representations:
** mesh
** point cloud
** CAD models
** primitive-based CAD models
;Datasets
* Pascal 3D
* ObjectNet3D
* ShapeNet
* Matterport3D
Rough 3D reconstruction:
* Do classification & segmentation using a NN.
* Fit an existing CAD model to the image.
Shape carving:
* Assume everything is made of cuboids and remove cubes.
Without segmentation masks:
* Train an autoencoder on 3D voxel shapes.
* Train an encoder for rendered images (e.g. chairs) that maps into the same latent space.
* At test time, images go through the rendered-image encoder and out the 3D voxel decoder.
==GANs and VAEs==
===Pixel-RNN/CNN===
* Fully-visible belief network
* Explicit density model:
** Each pixel depends on all previous pixels
** <math>P_{\theta}(x) = \prod_{i=1}^{n} P_{\theta}(x_i | x_1, ..., x_{i-1})</math>
** You need to define what counts as ''previous pixels'' (e.g. all pixels above and to the left)
* Then maximize likelihood of training data
;Pros:
* Can explicitly compute P(x)
* Explicit P(x) gives good evaluation metric
;Cons:
* Sequence generation is slow
* Optimizing P(x) is hard.
Types of ''previous pixels'' connections:
* PixelCNN uses a bounded receptive field over previous pixels (fastest)
* Row LSTM has a triangular receptive field (slow)
* Diagonal LSTM
* Diagonal BiLSTM has a full dependency field (slowest)
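The ''previous pixels'' constraint is typically enforced with masked convolutions. A minimal sketch (mask 'A' also hides the current pixel and is used in the first layer; mask 'B' is used in later layers; sizes in the example are arbitrary):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """PixelCNN-style masked convolution (illustrative sketch)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # hide pixels to the right
        mask[kh // 2 + 1:, :] = 0                         # hide rows below
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask                     # zero out "future" weights
        return super().forward(x)

# First layer of a PixelCNN over grayscale images.
first = MaskedConv2d('A', in_channels=1, out_channels=64, kernel_size=7, padding=3)
</syntaxhighlight>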
;Multi-scale PixelRNN
* Takes subsampled pixels as additional input pixels
* Can capture better global information
* Slightly better results
===Generative Adversarial Networks (GANs)===
* Generator generates images
* Discriminator classifies real or fake
* Loss: <math>\min_{G} \max_{D} E_x[\log D(x)] + E_z[\log(1-D(G(z)))]</math>
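A minimal alternating-update sketch of this objective (the generator <code>G</code>, discriminator <code>D</code>, and optimizers are assumed to be user-defined, with <code>D</code> outputting one logit per image; the generator update uses the common non-saturating variant rather than the literal minimax loss):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    """One discriminator update and one generator update (illustrative)."""
    b = real.size(0)
    ones = torch.ones(b, 1, device=real.device)
    zeros = torch.zeros(b, 1, device=real.device)
    fake = G(torch.randn(b, z_dim, device=real.device))

    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
              + F.binary_cross_entropy_with_logits(D(fake.detach()), zeros))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator: maximize log D(G(z)) (non-saturating trick).
    g_loss = F.binary_cross_entropy_with_logits(D(fake), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
</syntaxhighlight>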
;Image-to-image Conditional GANS
* Add an image encoder which outputs z
;pix2pix
* Add L1 loss to the loss function
* UNet generator
* PatchGAN discriminator
** PatchGAN outputs N×N real/fake values, one per patch (i.e. a limited receptive field)
* Requires paired samples
;CycleGAN
* Unpaired image-to-image translation
* Cycle-consistency loss
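For generators <math>G: X \to Y</math> and <math>F: Y \to X</math>, the cycle-consistency term asks that translating an image and translating it back recovers the original; it is added to the adversarial losses for both directions:

<math>
\mathcal{L}_{cyc}(G, F) = E_{x}\big[\Vert F(G(x)) - x \Vert_1\big] + E_{y}\big[\Vert G(F(y)) - y \Vert_1\big]
</math>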
;BicycleGAN
* First learn to reconstruct images with a nice latent code representation in between (cVAE-GAN)
* The main difference is that we have a many-to-many mapping (multi-modal image-to-image) between the two domains.
;MUNIT
* Multimodal UNsupervised Image-to-image Translation
* Maps images from each domain into a shared content space and domain-specific style space.
===Training Problems with GANs===
* Instability
* Difficult to keep generator and discriminator in sync.
** Discriminator cannot be too good or too bad. Same with generator.
** Tricks: LR scheduling, keep discriminator small, update generator more frequently.
* Mode collapse
Mode collapse happens when the generator cannot model different parts of the distribution.
;DCGAN architecture guidelines
* Use strided convolutions instead of pooling in the discriminator.
* Use batchnorm in both the generator and discriminator.
* Remove FC hidden layers.
* Use ReLU for hidden layers and tanh for the output layer of the generator.
* Use LeakyReLU in the discriminator.
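A minimal generator following these guidelines (a sketch for 64×64 outputs; channel widths are illustrative, not the exact paper configuration):

<syntaxhighlight lang="python">
import torch.nn as nn

def dcgan_generator(z_dim=100, ngf=64, out_channels=3):
    """Strided transposed convs + batchnorm + ReLU, tanh output, no FC layers.
    Expects input noise of shape (B, z_dim, 1, 1)."""
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False),    # 4x4
        nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 8x8
        nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 16x16
        nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),      # 32x32
        nn.BatchNorm2d(ngf), nn.ReLU(True),
        nn.ConvTranspose2d(ngf, out_channels, 4, 2, 1, bias=False), # 64x64
        nn.Tanh(),
    )
</syntaxhighlight>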
LSGAN and WGAN use additional tricks to mitigate mode collapse.
===Evaluation of GANs===
* Turing test (User study)
* Inception score
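The Inception score uses a pretrained Inception classifier: each generated image should receive a confident class prediction <math>p(y|x)</math>, while the marginal <math>p(y)</math> over generated samples stays diverse:

<math>
IS = \exp\Big( E_{x \sim p_g}\big[ D_{KL}\big( p(y|x) \,\Vert\, p(y) \big) \big] \Big)
</math>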
===Variational Auto-encoders (VAEs)===
;Training a VAE
* Data likelihood: <math>P(x) = \int P(x|z) P(z) dz</math>
* Approximate with samples of z during training: <math>P(x) \approx \frac{1}{n} \sum_{i=1}^{n} P(x | z_i)</math>
* This is impractical.
Assume we can learn a distribution <math>Q(z)</math> such that sampling <math>z \sim Q(z)</math> yields values for which <math>P(x|z) > 0</math>. 
How are <math>P(x)</math> and <math>E_{z \sim Q(z|x)}[\log P(x|z)]</math> related? 
<math>
\begin{aligned}
D_{KL}[Q(z|x) \Vert P(z|x)] &= E_{z \sim Q}[\log Q(z|x) - \log P(z|x)]\\
&= E_{z \sim Q}[\log Q(z|x) - \log P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)\\
&= D_{KL}[Q(z|x) \Vert P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)
\end{aligned}
</math> 
Here the second step uses Bayes' rule, <math>P(z|x) = P(x|z)P(z)/P(x)</math>. Rearranging, we get: 
<math>
\begin{aligned}
&\log P(x) - D_{KL}[Q(z|x) \Vert P(z|x)] = E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\\
\implies &\log P(x) \geq E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]
\end{aligned}
</math> 
This is known as the variational lower bound, or ''ELBO''.
* We first have the encoder output a mean <math>\mu_{z|x}</math> and covariance matrix diagonal <math>\Sigma_{z|x}</math>. 
* For ELBO we want to optimize <math>E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]</math>. 
* Our first loss is <math>D_{KL}(N(\mu_{z|x}, \Sigma_{z|x}) \Vert N(0, I))</math>. 
* We sample z from <math>N(\mu_{z|x}, \Sigma_{z|x})</math> and pass it to the decoder which outputs <math>\mu_{x|z}, \Sigma_{x|z}</math>. 
* Sample <math>\hat{x}</math> from the distribution and have reconstruction loss <math>\Vert x - \hat{x} \Vert^2</math>. 
* Most blog posts will forget to sample from <math>P(x|z)</math>.
;Modeling P(x|z)
Let f(z) be the network output.
* Assume <math>P(x|z)</math> is iid Gaussian.
* <math>\hat{x} = f(z) + \eta</math> where <math>\eta \sim N(0,1)</math>
* Simplifies to an L2 loss <math>\Vert x - f(z) \Vert^2</math>
Importance weighted VAE uses N samples for the loss.
;Reparameterization trick
To sample from the latent space, you do <math>z = \mu + \sigma \varepsilon</math> where <math>\varepsilon \sim N(0,1)</math>. 
This way, you can backprop through the sampling step.
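A minimal sketch of the resulting training loss, combining the reparameterization trick, the closed-form KL term for a Gaussian <math>Q(z|x)</math>, and the L2 reconstruction term (the encoder/decoder interfaces here are assumptions):

<syntaxhighlight lang="python">
import torch

def vae_loss(encoder, decoder, x):
    """Negative ELBO up to constants (illustrative sketch).

    `encoder(x)` is assumed to return the mean and log-variance of Q(z|x);
    `decoder(z)` is assumed to return the mean of P(x|z).
    """
    mu, logvar = encoder(x)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps                 # z = mu + sigma * eps
    x_hat = decoder(z)

    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    recon = torch.sum((x - x_hat) ** 2)                    # Gaussian P(x|z) -> L2
    return recon + kl
</syntaxhighlight>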
;Conditional VAE
Just input the condition into the encoder and decoder.
;Pros
* Principled approach to generative models
* Allows inference of <math>q(z|x)</math> which can be used as a feature representation
;Cons
* Maximizes a lower bound (ELBO) rather than the exact likelihood
* Samples are blurrier than GANs
Why are samples blurry?
* Samples are not blurry but noisy
** Sample vs Mean/Expected Value
* L2 loss
===Flow-based Models===
Flow-based models minimize the negative log-likelihood.
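With an invertible mapping <math>f_\theta</math> from data <math>x</math> to latent <math>z</math>, the change-of-variables formula gives an exact log-likelihood to maximize (equivalently, a negative log-likelihood to minimize):

<math>
\log p_\theta(x) = \log p_Z\big(f_\theta(x)\big) + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
</math>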
==Attribute-based Representation==
;Motivation
Typically in recognition, we only predict the class of the image. 
From the category, we can guess the attributes but the category provides only limited information. 
The network cannot make predictions for unseen new classes; handling such cases gracefully is referred to as ''graceful degradation''.
;Goal
Learn intermediate structure with object categories.
;Should we care about attributes in DL?
;Why are attributes not simply supervised recognition?
;Benefits
* Dealing with inevitable failure.
* We can infer things about unseen categories.
* We can make comparison between objects or categories.
;Datasets
* a-Pascal
* a-Yahoo
* CORE
* COCO Attributes
Deep networks should be able to learn attributes implicitly. 
However, you don't know if it has actually learned them.
==Extra Topics==
===Fine-grained Recognition===
===Few-shot Recognition===
* Metric learning methods
* Meta-learning methods
* Data Augmentation Methods
* Semantics
===Zero-shot Recognition===
The goal is to train a classifier without having seen a single labeled example. 
The information comes from auxiliary sources such as a knowledge graph or word embeddings.
===Beyond Labelled Datasets===
* Semi-supervised: We have both labelled and unlabeled training samples.
* Weakly-supervised: The labels are weak, noisy, and not necessarily for the task we want.
* Learning from the Web: Download data from the internet


==Will be on the exam==
* Softmax, sigmoid, cross entropy
* RCNN vs Fast-RCNN vs Faster-RCNN
* DPM
* Selective search vs RPN
* ELBO
Final exam:
* Friday Dec 4, 2020
* 4pm-6pm on gradescope
* Will have multiple choice, fill-in-the-blank, question answering, open-ended
** Generally simple questions; either you know or you don't
* No practice exams
* Only need to know major names (RCNN, Fast(er)-RCNN)
* Only covers lecture material.
* 1 letter page of open notes, both sides allowed (honor system)
Homework 2
* Released Nov 30, 2020
* Take SLIC superpixels, extract deep features, and classify them.
* There are 2 bonus credit options; you must pick one:
** Use features from multiple layers (multi-scale)
** Use multiple levels of SLIC in the input (SLIC feature pyramid)
Final Project
* Presentations Dec 10
* Recorded videos with presentations
* Final reports Dec 18
[https://docs.google.com/document/d/1BKmpBWBWuEEywDyBw9CsHgOPB6DH7I0oKEs8zXS7XQw/edit?usp=sharing My Exam Cheat Sheet]


==Project Notes==
* Challenges
* What methods worked and didn't work.


==References==
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for
<ref name="torralba2008tinyimages">Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref>
non-parametric object and scene recognition (PAMI 2008) [https://people.csail.mit.edu/torralba/publications/80millionImages.pdf Link]</ref>
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal
<ref name="standing1973learning">Lionel Standing (1973). Learning 10000 pictures. ''Journal Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref>
Quarterly Journal of Experimental Psychology'' [https://www.tandfonline.com/doi/abs/10.1080/14640747308400340 Link]</ref>
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref>
<ref name="brady2008visual">Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. [http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf Link].</ref>
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref>
<ref name="torralba2011unbiased>Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) [https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf Link]</ref>
<ref name="redmon2016yolo">Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection [https://pjreddie.com/media/files/papers/yolo.pdf Link]</ref>
<ref name="redmon2016yolo">Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection [https://pjreddie.com/media/files/papers/yolo.pdf Link]</ref>
<ref name="shotton2009texton">Jamie Shotton John Winn Carsten Rother Antonio Criminisi (2009) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. [https://www.microsoft.com/en-us/research/publication/textonboost-for-image-understanding-multi-class-object-recognition-and-segmentation-by-jointly-modeling-texture-layout-and-context/ Link]</ref>
<ref name="shotton2009texton">Jamie Shotton John Winn Carsten Rother Antonio Criminisi (2009) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. [https://www.microsoft.com/en-us/research/publication/textonboost-for-image-understanding-multi-class-object-recognition-and-segmentation-by-jointly-modeling-texture-layout-and-context/ Link]</ref>
<ref name="shrivastava2016ohem">Abhinav Shrivastava, Abhinav Gupta, Ross Girshick (2016) Training Region-Based Object Detectors With Online Hard Example Mining. (CVPR 2016)[https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Shrivastava_Training_Region-Based_Object_CVPR_2016_paper.html Link]</ref>
<ref name="shrivastava2016ohem">Abhinav Shrivastava, Abhinav Gupta, Ross Girshick (2016) Training Region-Based Object Detectors With Online Hard Example Mining. (CVPR 2016) [https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Shrivastava_Training_Region-Based_Object_CVPR_2016_paper.html Link]</ref>
<ref name="bell2016ion">Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick (2016). Inside-Outside Net: Detecting Objects in Context With Skip Pooling and Recurrent Neural Networks (CVPR 2016)</ref>
<ref name="bell2016ion">Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick (2016). Inside-Outside Net: Detecting Objects in Context With Skip Pooling and Recurrent Neural Networks (CVPR 2016) [https://openaccess.thecvf.com/content_cvpr_2016/papers/Bell_Inside-Outside_Net_Detecting_CVPR_2016_paper.pdf CVF Mirror]</ref>
}}