Visual Learning and Recognition


Notes for CMSC828I Visual Learning and Recognition (Fall 2020) taught by Abhinav Shrivastava

Course Website

This class covers:

  • How a sub-topic evolved
  • State of the art

Introduction to Data

Lecture 3 September 8, 2020

The extremes of data. If we have very few images, we are working on an extrapolation problem.
As we approach an infinite number of training samples, learning becomes an interpolation problem.
Traditional datasets are in the order of \(\displaystyle 10^2-10^4\) training samples.
Current datasets are in the order of \(\displaystyle 10^5-10^7\) training samples.

In Tiny Images [1], Torralba et al. use 80 million tiny images.

What is the capacity of visual long term memory?

In Standing (1973)[2], people shown 10,000 images could later recognize which ones they had seen with 83% accuracy.

What we don't know is what people are remembering for each item.

In Brady et al.[3], they tested recall for novel (new object), exemplar (same type of object), and state (same object & state). They got 92% for novel, 88% for exemplar, and 87% for state, which suggests that humans remember the exact state of objects they have seen.

Rule of thumb

(Simple algorithms + big data) is better than (complicated algorithms + small data)

Lecture 4 September 10, 2020

This lecture is on the bias of data. It follows Torralba et al.[4]

Will big data solve all our problems?

E.g. Can (big company) just dump millions of dollars to collect data and solve any problem?
No. E.g. COVID.
There will always be new tasks or problems.

We will never have enough data

Long tails - Zipf's law

Data is biased

Types of visual bias:

  • Observer Bias (human vs bird)
  • Capture Bias (photographer vs robot)
  • Selection Bias (Flickr vs Google Street View)
  • Category/Label Bias
  • Negative Set Bias

In general, all datasets will have all of these biases mixed in.

  • Social Bias

Graduation photos always have a certain structure.

Measuring Dataset Bias

Evaluate cross-dataset performance
Train on one dataset, test on another

To evaluate negative set bias, pool negatives from other datasets (e.g. not car or not person).
They found that models trained on a dataset do ~8% worse at detecting negatives from other datasets.

Overcoming Dataset Bias

Mixing datasets

Selection bias

In general, automatically gathered images do better.
You can also collect data from multiple sources (multiple search engines across multiple countries) or collect unannotated images and label them via crowd-sourcing.

Capture bias

To overcome the bias of professional photographs:
Apply data augmentations: flipping images, jittering (small affine transformations), random crops.

Negative set bias

Add negatives from other datasets.
Mine hard negatives from other datasets using standard algorithms.

Data-driven Methods in Vision

Beginning of Lecture 5 (September 15)

Dale et al.[5] perform semantic color correction using a large dataset.
Hays and Efros[6] perform scene completion.
Hays and Efros[7] perform image localization.
Kaneva et al.[8] perform scene matching with camera view transformations.

Dealing with Sparse Data

  • Better Similarity

Better Alignment

    • E.g. reduce resolution, sifting, warping

Take SIFT features for all regions.
Then learn a mapping from SIFT vectors to RGB colors. The resulting RGB images are called SIFT flow features.
Similar RGB regions will have similar SIFT feature vectors.
Then we can learn some transformation \(\displaystyle T\) to match the SIFT flows (i.e. \(\displaystyle T(F_1) \approx F_2\)).

Non-parametric Scene Parsing (CVPR 2009)

If you have a good scene alignment algorithm, you can just use a segmentation map.

Use sub-images (primitives) to match

Allows matching from multiple images

Mid-level primitives

Bag of visual words:

  1. Take some features (e.g. SIFT) from every image in your dataset.
  2. Apply clustering to your dataset to get k clusters. These k clusters are your visual words.
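
The two steps above can be sketched in a few lines. This is only a toy illustration: the 2-D descriptors stand in for real SIFT vectors, the plain k-means routine and the helper names (`kmeans`, `bow_histogram`) are assumptions made for the example, and a real pipeline would use a library implementation with far more descriptors. Each image is then summarized as a histogram over the k visual words.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, point):
    """Index of the visual word (cluster center) closest to a descriptor."""
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i], point))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; the returned centers are the visual words."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def bow_histogram(descriptors, centers):
    """Count how many descriptors of one image fall on each visual word."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest(centers, d)] += 1
    return hist


# Toy 2-D "descriptors" for two images (stand-ins for SIFT vectors).
image_a = [[0.0, 0.1], [0.2, 0.0], [10.0, 10.1]]
image_b = [[9.9, 10.0], [10.2, 9.8], [0.1, 0.1]]

vocabulary = kmeans(image_a + image_b, k=2)
hist_a = bow_histogram(image_a, vocabulary)
hist_b = bow_histogram(image_b, vocabulary)
```

Images can then be compared through their histograms instead of through raw pixels.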

The challenge with matching patches is how to find patches to match.
Ideally, we want patches which are both representative and discriminative.
Representative means that the patch is found in the target image set; i.e. coverage of the target concept.
Discriminative means that the patch is not found in non-target image sets (distinct from other concepts).

Understanding simple stuff first

E.g. from a video, find one frame which is easy to detect pose and then apply optical-flow methods to transfer the flow to adjacent frames.

Looking beyond the k-NN method

Use data to make connections.

Visual Memex Knowledge Graph

(Malisiewicz and Efros 2009)
Build a visual knowledge graph of entities. Edges can be context edges or similarity edges.
Embed an image into the graph and copy information from the graph.

Manifolds in Vision

These days, we can assume deep learning features are reasonable manifolds.

ConvNets and Architectures

See Convolutional neural network for basics.

Paper Summaries

Krizhevsky et al.[9] develop AlexNet for image classification. AlexNet is a CNN architecture with two branches. Their architecture and training procedure include many tricks, some of which are now commonplace. These include multi-GPU training, 8 layers, ReLU activations, Local Response Normalization (LRN), (overlapping) max pooling, data augmentation, and dropout. They won ImageNet 2012 by a large margin.

Huang et al.[10] develop DenseNet for image classification. The main contribution is the dense block, where each layer within the block is connected to all subsequent layers (i.e. the outputs are accumulated by concatenation). Each layer consists of (BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3)). Following each dense block, they use transition layers (1x1 conv + 2x2 avg pool) to shrink the size. They evaluate on CIFAR, SVHN, and ImageNet.

Xie et al.[11] develop S3D-G for video classification. The main idea is that video classification can be done with Conv2d layers at lower layers and Conv3d layers at higher layers. In addition, the time and spatial dimensions can be separated into two different 3D convolutions (with \(\displaystyle 1 \times k \times k\) and \(\displaystyle k_t \times 1 \times 1\) kernels). These two changes improve the accuracy and efficiency of video classification compared to just Conv3d.


ConvNet pipeline:

  • Input
  • Conv/ReLU/Pool
  • FC/ReLU
  • FC/Normalization/Loss


ILSVRC 2014 2nd place

This is a sequence of deeper networks trained progressively.
They replace large receptive fields with successive 3x3 conv + ReLU layers.
A single 7x7 conv layer with C-dim input and C-dim output would need \(\displaystyle 49 \times C^2\) weights.
Three \(\displaystyle 3\times 3\) conv layers only need \(\displaystyle 27 \times C^2\) weights.
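
The weight counts above can be checked with a small calculation. This is a sketch that ignores biases and assumes stride-1 convolutions; the channel count `C = 256` is an arbitrary value chosen for the example. It also shows why the swap is fair: three stacked 3x3 layers see the same 7x7 receptive field as a single 7x7 layer.

```python
def conv_weights(kernel_size, channels_in, channels_out):
    """Number of weights in one square conv layer, ignoring biases."""
    return kernel_size * kernel_size * channels_in * channels_out


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


C = 256  # arbitrary channel count for the example

single_7x7 = conv_weights(7, C, C)      # 49 * C^2
three_3x3 = 3 * conv_weights(3, C, C)   # 27 * C^2

print(single_7x7, three_3x3, stacked_receptive_field(3, 3))
```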

Network in network

Use a small perceptron as your convolution kernel, i.e. each block (patch) of the input goes into the perceptron. The perceptron's output is used instead of calculating the cross-correlation with a standard kernel.


Hebbian Principle: Neurons that fire together are typically wired together.
Implemented using an Inception Module.

The key idea is to use a heterogeneous set of convolutions.
Naive idea: Do a 1x1 convolution, 3x3 convolution, and 5x5 convolution and then concatenate the output together. The intuition is that each captures a different receptive field. In practice, they need to add 1x1 convolutions before the 3x3 and 5x5 convolutions to make it work. These are used for dimension reduction by controlling the channel.

Another idea is to add auxiliary classifiers across the network.

Inception v2, v3

V2 adds batch normalization to reduce dependence on auxiliary classifiers. V3 adds factored convolutions (i.e. nx1 and 1xn convolutions).


The main idea is to introduce skip or shortcut connections.
This existed in the literature before. It means returning \(\displaystyle F(x)+x\).
This allows smoother gradient flow since intermediate layers cannot block the gradient.

They also replace 3x3 convolutions on 256 channels with 1x1 to 64 channels, 3x3 on the 64 channels, then 1x1 back to 256 channels.
This reduces parameters from approx 600k to approx 70k.
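
The approximate counts can be reproduced directly. This is a sketch that counts weights only (no biases or batch-norm parameters), using the channel sizes stated above.

```python
def conv_weights(kernel_size, channels_in, channels_out):
    """Number of weights in one square conv layer, ignoring biases."""
    return kernel_size * kernel_size * channels_in * channels_out


# A plain 3x3 convolution on 256 channels.
plain = conv_weights(3, 256, 256)

# Bottleneck: 1x1 down to 64, 3x3 on 64, 1x1 back up to 256.
bottleneck = (
    conv_weights(1, 256, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

print(plain, bottleneck)
```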

Accuracy vs efficiency

First we had AlexNet. Then we had VGG which had way more parameters and better accuracy.
Then we had GoogLeNet which is much smaller than both AlexNet and VGG with roughly the same accuracy.
Next, ResNet and Inception increased the parameter count slightly and attained better performance.

Beyond Resnet

Fractal Net

This is a take on ResNet which removes the skip connection across the whole network.
The point is to show that the performance is from connections of different lengths.

Wide ResNet

Reduce the number of residual blocks but increase feature maps in each block.
Shows that it's not just about depth but also the width of each layer.
Computationally, a wide network is more parallelizable.

We thought that more layers make networks exponentially more powerful.
However, this contradicts that hypothesis.


Propose cardinality as a feature of network design.
First split the 256x256 input across the channel dimension. Each path has far fewer features (4 instead of 64), but there are now 32 separate paths.
The question is: within each layer, how many things can we do independently of each other?


CVPR 2017 best paper award.
Forget about resnets, just connect everything to all following layers.


Images are RGB.
Videos are RGB+T.

Combine per-frame models

  • Single Frame
  • Late Fusion (combine features for frames apart in time)
  • Early Fusion (combine features from adjacent frames)
  • Slow Fusion (combine features from adjacent frames, then combine the resulting adjacent features, ...)

2-stream networks

Have a spatial stream and a temporal stream.
The spatial stream works on a single RGB frame.
The temporal stream works on optical flow.

3D ConvNets

Slide the kernel in the time domain.


Inflated 3D ConvNet
Types of 3D networks:

  • LSTM
  • 3D-ConvNet
  • Two stream
  • 3D-Fused Two Stream
  • Two-stream 3D-ConvNet

Take the inception module and add a time dimension.

Design Principles

  • Make networks parameter-efficient
    • Reduce filter sizes, factorize filters
    • Use 1x1 convolutions to reduce number of feature maps
    • Minimize reliance on FC layers
  • Reduce spatial resolution gradually so we can repeat the same block
  • Use skip connections or multiple redundant paths.
  • Play around with depth vs width vs cardinality

Miscellaneous Things

  • Training tricks and details
  • Training data augmentation
  • Ensembles of networks

Object Detection

Beginning of Lecture 10 (Oct 1)

Edge Templates + Nearest Neighbor

Gavrila & Philomin (1999)

  1. From a raw image, do feature extraction and calculate distance transform.
  2. Do nearest neighbor search.


  • Templates are hand-made.

Haar Wavelets + SVM

A Trainable System for Object Detection. (Papageorgiou & Poggio, 2000)

  1. Extract Overcomplete Representation
    • Called Haar Wavelets. Similar to CNN filters.
    • Wavelet features can be calculated by averaging all faces. Similar to CNN features.
  2. Do SVM Classifier
+ Parts (2001)

Trained an SVM for face, legs, left arm, right arm.
When detecting a person, make sure all parts are in the correct location and shape with some constraints.

YYY + adaBoost

Basically, do the same as before (Haar Wavelets) but replace SVM with adaBoost.

Rectangular differential features (Viola & Jones 2001)

Use fast features to throw out parts of the image.
Then do processing on the remainder.
Became the standard object detection system in OpenCV.

Learnt wavelets + adaBoost

Works on more than just faces.
Ensemble face detection.  

Dynamic Programming

Efficient matching of pictorial structures (Felzenszwalb & Huttenlocher, 2000) Basically have a cartoon model and match the position & orientation of each part.

Probabilistic Methods for Finding People (Ioffe & Forsyth, 1999)

More Techniques

How to detect objects at different scale?
One trick is to detect the horizon line and scale based on the horizon line.

Sliding Window:
Create multiple scales of the image and detect at each scale.
This is done by building a feature pyramid using an image pyramid.

Histogram of Oriented Gradients (HOG)

How many octaves? As many octaves as it takes to reduce the image size to the template size, plus one for 2x2 upscaling.
How many levels? Generally people try 10 levels.

Precision and Recall

Precision is (# correct) / (# predictions).
Recall is (# correct) / (# ground truth).

Consider the following table

Box  Score  IOU G1  IOU G2  IOU G3  IOU G4  IOU G5
b1   0.9    0.6     0.1     0.1     0       0
b2   0.8    0       0       0.1     0       0
b3   0.7    0       0       0       0       0.7
b4   0.6    0       0       0       0       0

Starting with b1 we have a precision of 1 and a recall of 1/5 since we detect only G1.
From b2, precision becomes 1/2, recall remains the same since we detect nothing.
From b3, precision becomes 2/3, recall becomes 2/5.
From b4, precision becomes 2/4, recall is still 2/5.

The area under the Precision vs Recall curve is called the average precision (AP).
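
The walk-through above can be reproduced with a short script. This is a sketch under two assumptions that the notes do not state: a detection counts as correct when its IOU with a still-unmatched ground truth is at least 0.5, and AP is taken as a plain rectangle sum under the curve (benchmarks such as PASCAL VOC use interpolated variants). The function names are made up for the example.

```python
def precision_recall(iou_rows, num_ground_truth, iou_threshold=0.5):
    """iou_rows[d][g] is the IOU of detection d with ground truth g.

    Detections must already be sorted by descending score.
    Returns a list of (precision, recall) pairs, one per detection.
    """
    matched = set()
    correct = 0
    curve = []
    for rank, ious in enumerate(iou_rows, start=1):
        # Best ground truth that has not been claimed by an earlier box.
        best = max(
            (g for g in range(num_ground_truth) if g not in matched),
            key=lambda g: ious[g],
            default=None,
        )
        if best is not None and ious[best] >= iou_threshold:
            matched.add(best)
            correct += 1
        curve.append((correct / rank, correct / num_ground_truth))
    return curve


def average_precision(curve):
    """Un-interpolated area under the precision-recall curve."""
    ap, previous_recall = 0.0, 0.0
    for precision, recall in curve:
        ap += precision * (recall - previous_recall)
        previous_recall = recall
    return ap


# Rows of the table: boxes sorted by score, columns G1..G5.
table = [
    [0.6, 0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
curve = precision_recall(table, num_ground_truth=5)
ap = average_precision(curve)
```

The resulting curve has precision 1, 1/2, 2/3, 2/4 and recall 1/5, 1/5, 2/5, 2/5, matching the hand computation.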

Non-max Suppression

The NMS heuristic here is used to reduce the number of bounding boxes per object to 1.

Initially, you have a set of overlapping bounding boxes \(\displaystyle B\).
Create a final set \(\displaystyle D\).

  • While B is not empty
    • Remove the highest confidence/score box \(\displaystyle b_i\) from \(\displaystyle B\). Add it to \(\displaystyle D\)
    • For every other box \(\displaystyle b_j\),
      • If \(\displaystyle IOU(b_i, b_j) \gt \lambda\) (i.e. they bound the same object), discard \(\displaystyle b_j\)
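
A minimal sketch of the loop above, assuming boxes are given as (x1, y1, x2, y2) corner tuples and that the threshold \(\displaystyle \lambda\) is 0.5. Both are choices made for the example, not something the notes fix.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, threshold=0.5):
    """Return the indices of the boxes kept in D, highest score first."""
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-scoring box still in B
        kept.append(best)
        # Discard every box that overlaps the kept box too much.
        remaining = [j for j in remaining if iou(boxes[best], boxes[j]) <= threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first with an IOU of about 0.68, so it is discarded, while the third box is kept because it does not overlap either.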

Hard mining

During training, classify on all images.
Figure out which instances the classifier classifies incorrectly. Then train only on those negative instances.

Current HOG

Current HOG uses 31 dimensions

  • 9 Contrast insensitive gradients
  • 18 Contrast sensitive gradients
  • 4 Texture Related

Deformable Part Models (DPM)

Lecture (Oct 6-8, 2020)
Deformable Part Models (DPM)

Felzenszwalb et al (2009) [12] Important: Read this paper

  1. Train a part detector (e.g. Head, Leg, Arm)
  2. Enforce constraints between parts.
Part Configurations

\(\displaystyle \mathbf{p} = (p_0, p_1, p_2,...)\).

Scoring a configuration

\(\displaystyle score(p) = \sum_{i=0}^{N}w_i^T \phi(p_i) + \sum_{ij}w_{ij}^T \psi(p_i, p_j)\)

  • \(\displaystyle w_{ij}^T \in \mathbb{R}^5\) is the deformation parameter between parts i and j

The total number of configurations is \(\displaystyle 10^{(4*N)}\) since for each \(\displaystyle 100 \times 100\) image, each of \(\displaystyle p_i\) can take 100*100 values. \(\displaystyle N\) is the number of parts.

The trick is to use dynamic programming and a tree-based model.
I.e. if p1 is the body and p2 is the head then the deformation of p2 is only with respect to p1 and p3 is only with respect to p1. There is no deformation calculation between p2 and p3. The deformation is \(\displaystyle w_{12}^T \psi(p_1, p_2)\).
Then we can compute the max for p2 with respect to p1, p3 wrt p1, and then p1.
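
The following toy sketch illustrates why the tree structure helps. It makes several simplifying assumptions: positions are 1-D, the deformation cost is a fixed quadratic penalty around a hypothetical anchor offset rather than a learned \(\displaystyle w_{ij}^T \psi(p_i, p_j)\), and all scores are invented. Because every part depends only on the root, each part is maximized independently for each root position, so the cost grows linearly in the number of parts instead of exponentially.

```python
def best_star_configuration(root_scores, part_scores, anchors, deformation_weight=1.0):
    """Exact maximization for a star (root plus parts) model on 1-D positions.

    root_scores[x]    : appearance score of the root at position x
    part_scores[k][x] : appearance score of part k at position x
    anchors[k]        : ideal offset of part k relative to the root
    """
    best = None
    for root_pos, root_score in enumerate(root_scores):
        total = root_score
        placement = [root_pos]
        for scores, anchor in zip(part_scores, anchors):
            # Appearance score minus a quadratic deformation penalty.
            def part_value(x, scores=scores, anchor=anchor):
                return scores[x] - deformation_weight * (x - root_pos - anchor) ** 2

            part_pos = max(range(len(scores)), key=part_value)
            total += part_value(part_pos)
            placement.append(part_pos)
        if best is None or total > best[0]:
            best = (total, placement)
    return best


# Invented scores: a body (root) with a head anchored to the left (-1)
# and legs anchored to the right (+1).
root_scores = [0, 1, 3, 1, 0]
part_scores = [[2, 0, 0, 0, 0], [0, 0, 0, 0, 2]]
anchors = [-1, 1]

best_score, placement = best_star_configuration(root_scores, part_scores, anchors)
```

With L positions and N parts this loop does on the order of N·L² work, compared with L^(N+1) for enumerating every configuration.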

Mixture Models

Is One Model Enough?

In general, no, because objects have multiple views.
The solution is to use mixture models.
This gives us multiple part based models so we can capture different views of a single object.

\(\displaystyle score(\mathbf{p}) = \beta \Psi(\mathbf{p})\) where:
\(\displaystyle \beta = [w_0,..., b]\) and \(\displaystyle \Psi(\mathbf{p}) = [\phi(p_0),...,\phi(p_N), \psi(p_0, p_1),...]\).
This can be trained using a linear SVM or using block gradient descent.

Analyzing Mixture Models

\(\displaystyle L(\beta) = \frac{1}{2} \Vert \beta \Vert^2 + C\sum_{i=1}^{n} \max(0, 1-y_i * score(\mathbf{z}))\)

Region-based Approaches

1 Stage:

  • Overfeat
  • SSD
  • YOLO

2 Stage:

  • RCNN
  • Fast RCNN
  • Mask RCNN

Instance based:

  • SDS
  • RFCN


Winner of ILSVRC 2014 localization challenge.
The architecture first passes the image through some convolution & pooling layers. Then a sequence of FC layers produces an output.

Sliding Window

If the network takes a 3x221x221 input and you have a 3x257x257 image, run the image through the network with a sliding window.
Then greedily merge the boxes.

Efficient sliding window

Use a fully convolutional network.

Single Stage Multibox Detector (SSD)

Liu et al (2016)[13] propose SSD: Single Shot MultiBox Detector.
The idea is that they train a CNN to do object detection over the entire image.
The CNN outputs multiple feature maps for each of the categories, each with different aspect ratios and scales.
Pixels of the feature maps are scores for default boxes; each pixel is associated with a default bounding box.
The candidate results from the feature maps are filtered using non-maximum suppression.
Different scales are achieved by extracting feature maps from intermediate layers of the network.
The aspect ratio of each default box does not actually correspond to the receptive field associated with the feature pixel.

During training, all default boxes with jaccard overlap >0.5 with the ground truth are matched.
They also apply hard negative mining and data augmentation.


Redmon et al.[14] develop You Only Look Once: Unified, Real-Time Object Detection.
This is similar to the SSD paper. Each image is divided into an \(\displaystyle S \times S\) grid. The difference is that rather than each grid cell corresponding to a default box, each grid cell must predict the bounding boxes of objects centered in that cell. Each cell predicts \(\displaystyle B\) boxes as (x, y, w, h, confidence), where (x,y) is the center of the bounding box relative to the grid cell, as well as \(\displaystyle C\) class probabilities. The output of the network is an \(\displaystyle S \times S \times (B*5+C)\) tensor.
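
As a quick check of the output size, here is a sketch using S=7, B=2, C=20, the values I recall from the paper's PASCAL VOC setup. The `decode_box` helper is hypothetical: it assumes (x, y) are offsets within the cell in [0, 1] and that (w, h) are fractions of the whole image, which is how I understand the paper's parameterization but is not stated in these notes.

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: B boxes of 5 numbers plus C class scores per cell."""
    return (S, S, B * 5 + C)


def decode_box(row, col, x, y, w, h, S, image_w, image_h):
    """Convert one cell-relative prediction to an absolute (cx, cy, w, h) box.

    Assumes (x, y) are offsets inside the cell and (w, h) are fractions
    of the full image size.
    """
    cx = (col + x) / S * image_w
    cy = (row + y) / S * image_h
    return cx, cy, w * image_w, h * image_h


shape = output_shape(S=7, B=2, C=20)

# A box predicted by the middle cell of a 448x448 image.
box = decode_box(row=3, col=3, x=0.5, y=0.5, w=0.2, h=0.4, S=7, image_w=448, image_h=448)
```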

Some training tricks: They use leaky ReLU. They predict the square root of height and width. They weigh bounding boxes containing objects 10x those which are empty in the loss function.


Shotton et al. [15]
Incorporates texture-layout, color, location, and edge in a conditional random field.
Jointly considers appearance, shape, context.

Semantic Segmentation

Given an image, label every pixel with a class.
Note object segmentation is semantic segmentation with just one class.
Segmentation does not give us instances unlike object detection.
Instance segmentation is object segmentation + detection.

Energy function
  • Labels a pixel
  • Penalty if label is unlikely


Markov random field


Conditional random field Read TextonBoost ECCV 2006

\(\displaystyle \log P(\mathbf{c}|\mathbf{x}, \boldsymbol{\theta}) = \sum_i [\psi() + \pi() + \lambda()] + \sum_{(i,j)} \phi(c_i, c_j, g_{ij}(x); \theta) - \log Z(\theta, x)\)

  • \(\displaystyle \psi()\) is texture-layout
  • \(\displaystyle \pi()\) is color
  • \(\displaystyle \lambda()\) is location
  • \(\displaystyle \phi()\) is edge


The idea is that small regions of pixels all with the same or similar color probably represent the same thing.
Thus you can divide up an image into these regions of similar pixels and do reasoning on the regions rather than individual pixels.
These regions of similar pixels are called superpixels.

SLIC Superpixels

This is a very popular algorithm for superpixels.
You can control the quantity of superpixels.
Large superpixels are combinations of smaller superpixels.

  • Assignment 1 will be to extract superpixels.
  • Assignment 2 will be to classify superpixels.
  • Assignment 3 will be to replace RGB with deep features.

End-to-end Pixel-to-pixel Network

Develop a fully convolutional network.
Upsample at the end to compute a pixelwise output and loss.

Evaluation of Segmentation

  • Pixel Accuracy: (% correct)
  • Class Accuracy: (% correct, averaged per class)
  • Trimap: Only evaluate at the boundaries.
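
The first two metrics can disagree sharply when classes are imbalanced. A small sketch, assuming the label maps are flattened into lists of class ids (the function names are made up for the example):

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label is correct."""
    correct = sum(p == t for p, t in zip(pred, truth))
    return correct / len(truth)


def class_accuracy(pred, truth):
    """Per-class accuracy, averaged over the classes present in the ground truth."""
    per_class = []
    for c in sorted(set(truth)):
        indices = [i for i, t in enumerate(truth) if t == c]
        per_class.append(sum(pred[i] == c for i in indices) / len(indices))
    return sum(per_class) / len(per_class)


# 8 background pixels (class 0) and 2 object pixels (class 1);
# the model predicts background everywhere.
truth = [0] * 8 + [1] * 2
pred = [0] * 10

pix_acc = pixel_accuracy(pred, truth)
cls_acc = class_accuracy(pred, truth)
```

Predicting background everywhere scores 80% pixel accuracy but only 50% class accuracy, since the object class is missed entirely.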

Other segmentation tasks

  • Foreground vs Background
  • Amodal segmentation
    • Identify the extent of object including occluded regions.

Region Proposals

Convert superpixel regions to boxes and classify boxes.

Read: pdollar A seismic shift in object detection

Region-based Object Detection Systems

Lecture (Oct 15, 2020)


R-CNN at test time
  1. From an input image, they extract ~2k region proposals.
    • All of the region proposals likely contain an object.
  2. For each bounding box:
    • Dilate the proposal on each side by \(\displaystyle p=16\) pixels.
    • Crop it out and scale to \(\displaystyle 227 \times 227\).
    • Pass it through a CNN (5 conv + 2 FC) to get \(\displaystyle 4096\) dim features.
    • Do classification using an SVM.
  3. Do object proposal refinement to predict object bounding box.
Training R-CNN
  1. First train a CNN for 1000-way ImageNet image classification.
  2. Fine-tune the CNN for detection from PASCAL VOC.
  3. Train detection SVMs.

Both training and inference are super-slow.
Extracting RoI takes a lot of time.
Then you need to do a forward pass for each of the 2k regions to get features.
Inference on 1 image takes almost 1 minute.


Makes R-CNN fast using a spatial pyramid pooling (SPP) layer.

  1. Run a frozen CNN over the whole image to get a feature map.
  2. Map boxes from region proposals generated by selective search to the feature map.
  3. For each region, resize to \(\displaystyle 7 \times 7 \times 256\), do SPP and pass to an FC network to get bbox and class.

For each of the 2000 boxes, you have IOU_foreground and IOU_background.

Fast R-CNN

Makes the whole network trainable.

  • Pass the whole image through a CNN
  • For each RoI (suppose size \(\displaystyle h \times w\)), do RoI pooling to get an \(\displaystyle H \times W\) feature map.
    • This is a max-pooling over subwindows of size \(\displaystyle h/H \times w/W\).
  • Pass the feature map into a FC + softmax classifier.
  • Pass the feature map into a bbox regressor.
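
RoI pooling itself is simple to sketch. The version below assumes a single-channel feature map stored as a 2-D list and an RoI given as (top, left, h, w) in feature-map cells. The floor/ceil bin boundaries are one common choice, not necessarily the exact rounding used in the paper.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w region of a 2-D feature map to a fixed out_h x out_w grid."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Rows covered by output bin i (roughly h / out_h rows each).
        r0 = top + (i * h) // out_h
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            # Columns covered by output bin j (roughly w / out_w columns each).
            c0 = left + (j * w) // out_w
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 4x6 feature map whose values increase left-to-right, top-to-bottom.
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]
pooled = roi_max_pool(feature_map, roi=(0, 0, 4, 6), out_h=2, out_w=2)
```

Whatever the RoI size, the output is always out_h x out_w, which is what lets the FC layers that follow take a fixed-size input.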

The entire network is trained together rather than in stages.
The final loss function combines both tasks.

Exam Question


  • Removes SVM part to do end-to-end training.
  • Each region is cropped from a feature map of the image rather than the raw image.
    • One con is the feature map is lower-res so small objects may become \(\displaystyle 1\times1\) features.
    • RCNN is better for smaller objects

Faster R-CNN

Focuses on the region proposals by replacing selective search with a region proposal network.
Computes region proposals on-the-fly.

Contains the following

  1. Feature extractor
  2. RoI Proposal Network
  3. RoI Classification & Regression Network
How region proposal works
  1. Given an image, pass through conv filters to get feature maps.
  2. Map each pixel to \(\displaystyle k\) anchor boxes.
  3. Then a layer outputs foreground-background classification and another outputs bounding box regression for each pixel.
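
A sketch of step 2, generating the \(\displaystyle k\) anchor boxes for one feature-map location. The three scales and three aspect ratios (so k = 9) are the defaults I recall from the Faster R-CNN paper. The function name and the width/height convention for the aspect ratio are assumptions for the example.

```python
import math


def anchors_at(cx, cy, scales, aspect_ratios):
    """Return (x1, y1, x2, y2) anchors centered at (cx, cy).

    Each anchor has area scale**2 and width/height equal to the aspect ratio.
    """
    boxes = []
    for s in scales:
        for r in aspect_ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


# k = 3 scales x 3 aspect ratios = 9 anchors per location.
anchors = anchors_at(0, 0, scales=[128, 256, 512], aspect_ratios=[0.5, 1.0, 2.0])
```

The network then predicts, for each anchor, a foreground/background score and four regression offsets relative to that anchor.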

Generally two-stage models perform better. Everything is trained end-to-end.


Read these 4 papers and also read the professor's online hard mining paper (OHEM).

Instance Segmentation Systems


Object detection via region-based fully convolutional networks.


Instance-aware Semantic Segmentation via Multi-task Network Cascades
Multi stage:

  • Extract boxes
  • Box to mask
  • Mask to Class
  • Class to Mask
  • Mask to class

Region Driven

Go from boundaries to classes.

  • SDS
  • HyperCol
  • CFM
  • MNC

FCN driven

Go from class to boundaries.

  • PFN
  • InstanceCut
  • Watershed
  • FCIS
  • DIN

Mask R-CNN

  • Given an image, calculate fully-conv features.
  • Extract feature map for a region using RoIAlign.
  • Do fully-conv over the region only to get segmentation + class.

RoI Pooling variants

  • RoI Pool (SPPNet, Fast/Faster R-CNN)
  • RoI Warp (MNC)
  • RoI Align (Mask R-CNN)


Network Variants

  • ResNet

Skip-connection Variants

  • ION Inside-Outside Network
  • TDM: Top-down Modulation Network
  • FPN: Feature Pyramid Network

ION: Inside-Outside Network

Bell et al. [16]
The key idea is that we want a feature vector which includes multi-scale and contextual information.

Potential Exam Question

We want to use features from multiple levels. The RoI is fixed.
The resolution, number of channels, and magnitude of features can be different.


  1. There are 5 conv blocks, followed by two 4-dir IRNN blocks which extract context features.
  2. The whole image passes through this entire network.


  • This is a set of 4 RNNs which move across the image. Up, down, left, right.
  • The outputs of each RNN are concatenated, yielding an image with the same shape.

For each RoI identified using object proposals:

  • Do L2 normalization of the features at different layers (Conv3, conv4, conv5, and context features)
  • Concatenate features to a single feature image.
  • Rescale them and do 1x1 convolution to get a \(\displaystyle 512 \times 7 \times 7\) feature descriptors.
  • Pass through two FC layers.
  • Finally, one FC extracts the class via softmax and another the bounding box.

Analysis and Diagnosis

  • DPMs are CNNs
  • Speed/Accuracy Trade-offs
  • How to diagnose Object Detection Results?

Diagnosing Object Detection Results

Average Precision (AP) is a good summary for quick comparison but not a good driver of research.

  • Top false positives: Airplane (DPM), Animals, Vehicles
  • Analysis of object characteristics: occlusion level, parts visible, sides visible on DPM
    • High occlusion hurts performance
    • Extra small and extra large objects are really bad
    • Tall aspect ratios have bad performance
    • Certain sides and parts visible can affect performance.
    • Deep learning methods do not make as many mistakes on similar objects.
  • TIDE: Analyze different error types


Read OHEM paper[17].

Summary of online hard example mining

Previously hard mining involved the following two steps

  1. Freeze the model and run it on 10-100s to find hard examples.
  2. Train on hard examples.

They instead propose finding hard-examples per mini-batch.
This is possible because there are thousands of RoIs within each image.

  1. Run a mini-batch through the CNN feature extractor.
  2. Do forward-pass on all RoIs.
  3. Sort RoIs by loss and take the top \(\displaystyle B/N\) examples.
    • Filter duplicates using NMS.
  4. Backwards pass only on the top \(\displaystyle B/N\) examples.
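
Step 3 can be sketched as a greedy selection. Assumptions for this example: RoIs are (x1, y1, x2, y2) tuples, per-RoI losses are already computed, `num_to_keep` plays the role of \(\displaystyle B/N\), and the NMS threshold of 0.7 is the value I recall from the paper rather than something stated in these notes.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_hard_rois(boxes, losses, num_to_keep, nms_threshold=0.7):
    """Pick the highest-loss RoIs, skipping near-duplicates of ones already chosen."""
    order = sorted(range(len(boxes)), key=lambda i: losses[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == num_to_keep:
            break
        # NMS with the loss used as the score.
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in selected):
            selected.append(i)
    return selected


rois = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30), (40, 40, 50, 50)]
losses = [2.0, 1.9, 0.5, 1.0]
hard = select_hard_rois(rois, losses, num_to_keep=2)
```

The second RoI has the second-highest loss but nearly coincides with the first, so it is skipped. Without this filter, overlapping RoIs would be counted several times in the backward pass.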

Bells & Whistles

Most used:

  • Multi-scale training and testing
  • Iterative bounding-box prediction + weighted NMS
    • Do classification, regression then repeat.
  • Ensemble
  • More data


  • Loss variants
  • Network variants
  • NMS variants
How to do ensembling for object detection?

This is an open research question.
You can concatenate the regions and pass them through each network.
You can do weighted NMS.


Also known as hard negative mining.

  1. Mine hard negatives from model to fix training set.
  2. Train model on new fixed training set.


Homework 1

SLIC Superpixels based on (r,g,b,x,y)
Releases Oct 23 (Fri). Due Oct 30 (Fri).
Can be done individually or groups of two.

Homework 2

Deep features on superpixels.
Train SVM for classification.
Releases Nov 9 (Mon), Due Nov 17 (Tues).

Final Exam

Officially Monday December 21 but may be moved earlier.

Final Presentations

18 groups, total 3.5 hours.

Pose Estimation


  • PASCAL 2010
  • H3D Berkeley 2009
  • FLIC Dataset 2013
  • Leeds Sports Pose Dataset
  • MPII
  • COCO Keypoints Challenge Dataset
    • 59k images, 156k people, 1.7m keypoints
How to evaluate this?

Object Keypoint Similarity (OKS) is a metric based on Euclidean distance with a Gaussian error margin per keypoint based on the scale of each object.
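
A sketch of the metric, assuming the COCO-style form as I understand it: each labeled keypoint contributes exp(-d² / (2 s² k²)), where d is the distance to the ground truth, s is the object scale, and k is a per-keypoint constant, and the contributions are averaged. The argument names and the toy values are invented for the example.

```python
import math


def object_keypoint_similarity(pred, truth, labeled, scale, k):
    """Average Gaussian similarity over the labeled keypoints.

    pred, truth : lists of (x, y) keypoints
    labeled     : list of bools, True where the ground-truth keypoint is annotated
    scale       : object scale s
    k           : per-keypoint constants controlling the error margin
    """
    total, count = 0.0, 0
    for (px, py), (tx, ty), is_labeled, k_i in zip(pred, truth, labeled, k):
        if not is_labeled:
            continue
        d2 = (px - tx) ** 2 + (py - ty) ** 2
        total += math.exp(-d2 / (2 * scale ** 2 * k_i ** 2))
        count += 1
    return total / count if count else 0.0


truth = [(0.0, 0.0), (0.0, 0.0)]
perfect = object_keypoint_similarity(truth, truth, [True, True], scale=10, k=[0.1, 0.1])
one_off = object_keypoint_similarity(
    [(0.0, 0.0), (1.0, 1.0)], truth, [True, True], scale=10, k=[0.1, 0.1]
)
```

A perfect prediction scores 1, and the same pixel error costs less on a larger object because the margin grows with the scale.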

Approaches for Pose Estimation

Lecture Oct 27

  • Top-down approaches
    • Do person detection then do pose estimation.
    • Pros: Already have a person detector, smaller image to process in second stage.
    • Cons: Runtime proportional to number of people.
  • Bottom-up approaches
    • Detect limbs, then group them into people.
    • Pros: No dependency on person detector. Maybe single stage. Time independent of # of people.
    • Cons: Optimization for part association is NP-hard.


Pose Machines + CPM + OpenPose
2016 COCO Winner

  • Pose Machines ECCV 2014
  • Convolutional Pose Machines CVPR 2016
  • Realtime Multi-Person 2D

G-RMI 2016 COCO Runner-up

Pose Machines

ECCV 2014
Bottom-up approach

Address challenges:

  • Local evidence is weak
  • Part context is a strong cue
  • Larger composite parts can be easier to detect.

Convolutional Pose Machines


For 30 people with 17 keypoints each, we have 1.3e5 pairwise connections.
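
The count follows from choosing unordered pairs among all detected keypoints:

```python
def pairwise_connections(num_people, keypoints_per_person):
    """Number of unordered pairs among all keypoints in the image."""
    n = num_people * keypoints_per_person
    return n * (n - 1) // 2


connections = pairwise_connections(30, 17)  # 510 keypoints in total
print(connections)
```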

G-RMI (Google)

  1. Compute Heatmaps G
  2. Offset Output
  3. Fuse heatmaps and offsets via Hough Voting

Stacked Hourglass Networks

2017 COCO Runner-up


  • Video as space-time volume
  • Object correspondence via tracking
  • Motion for parallax, occlusions
Why are videos challenging?
  • Boundaries between tasks are poorly defined.
  • Huge computational cost.
Spectrum of Problems
  • Primitive actions
  • Actions
  • Events
Capturing long-range context
  • Long-range
  • Spatio-temporal
  • Camera motion
  • Cycles and speed

Tasks and Datasets

  • Action Classification
  • Temporal Action Localization
  • Spatio-Temporal Action Detection
Actions datasets
  • KTH Human actions dataset
  • UCF Sport actions dataset
    • Biases: Hard to have negative actions (e.g. drum in scene without drumming)
  • Sports-1M
    • Has audio bias.
  • Kinetics-v2
    • ImageNet of videos
    • Collected from YouTube
    • 600 actions, 500k clips
  • Moments in time
    • 800k 3 second clips
    • 339 classes.
  • SLAC: Sparsely Labeled Actions Dataset
    • 520K untrimmed videos
  • Something Something
  • Charades
  • AVA
  • EPIC Kitchen
  • HAA500
  • FineGYM
Self driving cars datasets
  • KITTI++
  • ArgoVerse
  • Open Waymo Dataset
  • Lyft Level 5 Open Data
  • Berkeley DeepDrive

Models for Video Recognition

Basic Video Pipeline
  1. Extract features
  2. Learn space-time bag of words
  3. Train/test BoW classifier
Spatio-temporal Feature Detectors
  • Harris3D
  • Cuboid
  • Hessian
  • Dense
Spatio-temporal Feature Descriptors
  • Cuboid
  • HOG3D
  • ExtendedSURF
Add trajectories
  1. Track a keypoint's movement over time
  2. Make a feature tube around the trajectory. Existing methods used a cube instead of a tube.
  3. Then do whatever pooling you want (e.g. HOG) to get a trajectory description.

  • Two-stream ConvNet uses a spatial stream (RGB) and temporal stream (optical flow)
  • Add/Stack Trajectories: Flow should be added on top of original points.
  • Pool Along Trajectories instead of cubes
  • I3D stacks 8 frames, passes to 3D convnet and gets an output
  • Late fusion extracts features per-frame and then combines later.

What is an action?


  • BoW for actions
  • Actions are made up of subactions
    • E.g. basketball shoot = dribbling + jump + throw + running + ball

Gaussian Temporal Awareness Networks

  • Key idea: Not all actions have the same temporal support.
    • Depending on frame-rate & action speed, actions can take a variable number of frames.

Compressed Video Action Recognition

  • The idea is to feed P-frames, which are essentially optical flow, directly to the CNN.

R-C3D: Region Convolutional 3D Network for Temporal Activity Detection

  • Inspired by Faster R-CNN Architecture

SST: Single-Stream Temporal Action Proposals

  • Single-shot proposal network

Action Tubelet Detector for Spatio-Temporal Action Localization

  • For every frame, regress how to move the tubelet up or down

Tube Convolutional Neural Network (T-CNN)

Complementary Approaches

Pose-based Action Recognition

  • Convert a video into pose maps and do classifications on poses

PoTion: Pose MoTion Representation

  • Do pose estimation to get joint heatmaps.
  • Represent pose position & movement as an image with red=start and green=end.
  • Stack this across time.

PA3D: Pose-Action 3D Machine

  • Focus on pose to do action recognition

VideoGraph: Recognizing Minutes-Long Activities



Context is a 1-2% idea. Traditionally, it has only provided 1-2% of improved performance.

When is context helpful?
  • Typical answer: to guess small/blurry objects based on a prior.
  • Deeper answer: to make sense of the visual world.
    • When to use context and when not to use context.
    • 80% of context is automatically handled by neural networks, but 20% of work still remains.
Why is context important?
  • To resolve ambiguity.
    • Even high-res objects can be ambiguous.
    • There are 30,000+ types of objects but only a few can occur in an image.
  • To notice unusual things.
  • To infer the function of an unknown object.

Pixel Context

Look at nearby pixels by inputting a slightly bigger region.

Semantic Context

Use other objects present to answer what is present in the target pixels.

Geometric Context

[Hoiem et al. 2005]
Use segmentation to interpret geometry:

  • Sky
  • Ground
  • Building, with the direction of its surface normal

Photometric Context

If you know where the camera is, you can estimate the size of people and cars.
If you know where the sun is, you can estimate where the scene

Geographic Context


Do a prediction and then feed in the output to the model again for the model to refine the prediction.
This is similar to pose machines, hourglass networks, iterative bounding box regression.


Descriptor is formed by concatenating outputs of weakly trained classifiers.

3D Scene Understanding

What can you get from knowing pairwise pixel distances? (i.e. given two sets of pixels, which pair is closer in 3D space)
You can get horizons.

Single Image Reconstruction: By finding vanishing points and lines, you can do 3D reconstruction.


How: Bottom up classifiers to explicit constraints and reasoning. What: Qualitative to explicit/quantitative.

From qualitative to quantitative:

  • Surface labels
  • Boundaries + objects
  • Stronger geometric constraints
  • Reasoning on aspects & poses
  • 3D point clouds

Using depth ordering, surface labels, and occlusion cues can give us a planar reconstruction.

Benefits of volumes:

  • Finite volumes
  • Spatial exclusion (no intersections)
  • Mechanical relationships and physical stability (one volume atop another)

Room layout estimation:

  • Estimate walls and floor from vanishing points.
  • Three principal directions
  • Every room is a box
  • The minimum number of visible walls is 1 and the maximum is 6, but most images show 5 walls when the camera is facing one wall.
  • Use geometric context, optimizing to get a room context.
  • Given segmentation masks, you can estimate clutter vs free space.

Functional constraints:

  • People sit on laptops, people can open drawer, ...


  • Depth maps - not normalized (making them hard to use), have discontinuities, and do not represent objects
  • Surface normals - derived from the gradient of depth

Scene Intrinsics

Recovering Intrinsic Scene Characteristics:
Given the following:

  • original scene
  • distance (depth)
  • reflectance
  • orientation (normal)
  • illumination

You can extract the scene perfectly.

Learning ordinal relationships:

  • Which point is closer?
    • This gets you depth for 3D
  • Which point is darker?
    • This gets you reflectance for shading

Depth vs surface normals:

  • Surface normals are derived from the gradient of depth
  • Depth is hard to use due to large discontinuities and unbounded values.
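
The relationship between depth and normals can be sketched with finite differences: for a depth map z(x, y), the normal is proportional to (-dz/dx, -dz/dy, 1). The snippet assumes an orthographic camera with unit pixel spacing (a simplification; a real pipeline would account for camera intrinsics), and the ramp depth map is a hypothetical toy input.

```python
import math

def normals_from_depth(depth):
    """Estimate unit surface normals at interior pixels via central differences."""
    height, width = len(depth), len(depth[0])
    normals = {}
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            dzdx = (depth[y][x + 1] - depth[y][x - 1]) / 2.0
            dzdy = (depth[y + 1][x] - depth[y - 1][x]) / 2.0
            n = (-dzdx, -dzdy, 1.0)
            length = math.sqrt(sum(c * c for c in n))
            normals[(y, x)] = tuple(c / length for c in n)
    return normals

# A planar ramp whose depth grows by 1 per pixel in x.
ramp = [[float(x) for x in range(5)] for _ in range(4)]
normals = normals_from_depth(ramp)
```

On the ramp the depth values keep growing, but every normal is the same unit vector, which illustrates why normals are bounded and easier to regress than raw depth.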


Qualitative Parse Graph

  • Understanding of 3D support, support surfaces (physics)
    • E.g. lamp is supported by nightstand
  • Dataset: NYU v2
  • Given an image, identify surfaces, then classify edges as concave (pop in) or convex (pop out).
    • From this, you can create a popup scene.

Objects + 3D

  • Rasterized 3D representations:
    • multi-view images
    • depth maps
    • volumetric (voxels)
  • Geometric 3D representations:
    • mesh
    • point cloud
    • CAD models
    • primitive-based CAD models
  • Pascal 3D
  • ObjectNet3D
  • ShapeNet
  • Matterport3D

Rough 3D reconstruction:

  • Do classification & segmentation using a NN.
  • Fit an existing CAD model to the image.

Shape carving:

  • Assume everything is made of cuboids and remove cubes.

Without segmentation masks:

  • Train an autoencoder on 3D voxel shapes.
  • Train a second encoder on rendered images (e.g. rendered chairs).
  • At test time, images go through the rendered-image encoder and out through the 3D voxel decoder.

GANs and VAEs


  • Fully-visible belief network
  • Explicit density model:
    • Each pixel depends on all previous pixels
    • \(\displaystyle P_{\theta}(x) = \prod_{i=1}^{n} P_{\theta}(x_i | x_1, ..., x_{i-1})\)
    • You need to define which pixels count as previous (e.g. all pixels above and to the left)
  • Then maximize likelihood of training data
  • Can explicitly compute P(x)
  • Explicit P(x) gives good evaluation metric
  • Sequence generation is slow
  • Optimizing P(x) is hard.
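
The chain-rule factorization above can be illustrated on a toy binary "image" in raster order. The conditional `p_on` below is a hypothetical stand-in for the network (a smoothed frequency of earlier pixels), chosen only so that the example runs without any learned weights.

```python
import itertools
import math

def p_on(previous):
    """Hypothetical conditional P(x_i = 1 | x_1..x_{i-1}): smoothed frequency of on-pixels."""
    return (sum(previous) + 1) / (len(previous) + 2)

def log_likelihood(pixels):
    """Chain rule: log P(x) = sum_i log P(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i, value in enumerate(pixels):
        p = p_on(pixels[:i])
        total += math.log(p if value == 1 else 1.0 - p)
    return total

image = (1, 1, 0, 1)
image_probability = math.exp(log_likelihood(image))  # 1/2 * 2/3 * 1/4 * 3/5 = 0.05

# Because each conditional is normalized, P(x) sums to 1 over all 2^4 images.
total_mass = sum(
    math.exp(log_likelihood(x)) for x in itertools.product((0, 1), repeat=4)
)
```

The loop also shows why generation is slow: each pixel needs the values of all earlier pixels, so sampling is inherently sequential.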

Types of previous pixels connections:

  • PixelCNN uses masked convolutions over a bounded neighborhood of previous pixels (fastest)
  • Row LSTM has a triangular receptive field (slow)
  • Diagonal LSTM
  • Diagonal BiLSTM has a full dependency field (slowest)
Multi-scale PixelRNN
  • Takes subsampled pixels as additional input pixels
  • Can capture better global information
  • Slightly better results

Generative Adversarial Networks (GANs)

  • Generator generates images
  • Discriminator classifies real or fake
  • Loss: \(\displaystyle \min_{G} \max_{D} E_x[\log D(x)] + E_z[\log(1-D(G(z)))]\)
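
As a small numerical sketch of the objective above, the function below estimates the value from discriminator outputs on a batch. The discriminator outputs are hypothetical numbers; no networks are trained here.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well gets a value close to 0.
confident = gan_value([0.9, 0.95], [0.1, 0.05])
# A discriminator that cannot tell them apart outputs 0.5 everywhere: value = -log 4.
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to push this value up toward 0, while the generator tries to push it down; the value of -log 4 corresponds to the discriminator being reduced to guessing.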
Image-to-image Conditional GANs
  • Add an image encoder which outputs z
  • Add L1 loss to the loss function
  • UNet generator
  • PatchGAN discriminator
    • PatchGAN outputs an N×N grid of real/fake predictions, one per patch (i.e. each output has a limited receptive field)
  • Requires paired samples
  • Unpaired image-to-image translation
  • Cycle-consistency loss
  • First learn to reconstruct images with a nice latent code representation in between (cVAE-GAN)
  • The main difference is that we have a many-to-many mapping (multi-modal image-to-image) between the two domains.
  • Multimodal UNsupervised Image-to-image Translation
  • Maps images from each domain into a shared content space and domain-specific style space.

Training Problems with GANs

  • Instability
  • Difficult to keep generator and discriminator in sync.
    • Discriminator cannot be too good or too bad. Same with generator.
    • Tricks: LR scheduling, keep discriminator small, update generator more frequently.
  • Mode collapse

Mode collapse happens when the generator cannot model different parts of the distribution.

DCGAN architecture guidelines
  • Use strided conv instead of pooling for discriminator.
  • Use batchnorm in generator and discriminator.
  • Remove FC hidden layers.
  • Use ReLU for hidden layers and tanh for the output layer of the generator.
  • Use LeakyReLU for the discriminator.

LSGAN, WGAN have tricks to mitigate mode collapse.

Evaluation of GANs

  • Turing test (User study)
  • Inception score
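
For reference, the Inception score is exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's prediction for a generated image and p(y) is the average prediction over all generated images. The sketch below computes it from hypothetical probability vectors; in practice these would come from a pretrained Inception network.

```python
import math

def inception_score(class_probs):
    """class_probs: list of p(y|x) vectors, one per generated image."""
    n = len(class_probs)
    k = len(class_probs[0])
    # Marginal class distribution p(y) over all generated images.
    marginal = [sum(p[c] for p in class_probs) / n for c in range(k)]
    mean_kl = 0.0
    for p in class_probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl = sum(pc * math.log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0)
        mean_kl += kl / n
    return math.exp(mean_kl)

# Confident and diverse predictions give the maximum score (number of classes).
sharp_and_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
# Uninformative predictions give the minimum score of 1.
uninformative = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

The score rewards both confident per-image predictions and a spread of classes across images, so a generator that only ever produces one class scores low.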

Variational Auto-encoders (VAEs)

Training a VAE
  • Data likelihood: \(\displaystyle P(x) = \int P(x|z) P(z) dz\)
  • Approximate with samples of z during training: \(\displaystyle P(x) \approx \frac{1}{n} \sum_{i=1}^{n} P(x | z_i)\)
  • This is impractical.
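
The sample-based approximation can be tried on a 1-D toy model where the exact answer is known. The model is an assumption made for illustration: prior z ~ N(0, 1) and decoder x | z ~ N(z, 1), so the true marginal is N(0, 2).

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def estimate_marginal(x, num_samples, rng):
    """P(x) ~= (1/n) * sum_i P(x | z_i) with z_i drawn from the prior N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += gaussian_pdf(x, mean=z, std=1.0)  # toy decoder: x | z ~ N(z, 1)
    return total / num_samples

rng = random.Random(0)
estimate = estimate_marginal(1.0, 100_000, rng)
exact = gaussian_pdf(1.0, mean=0.0, std=math.sqrt(2.0))  # marginal is N(0, 2)
```

In one dimension the estimate converges quickly. For images, almost every z drawn from the prior gives P(x|z) close to zero for a particular x, so the number of samples needed becomes prohibitive.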

Assume we can learn a distribution \(\displaystyle Q(z)\) such that \(\displaystyle z \sim Q(z)\) generates \(\displaystyle P(x|z) \gt 0\).
Relating \(\displaystyle P(x)\) and \(\displaystyle E_{z \sim Q(z|x)}[\log P(x|z)]\):
\(\displaystyle \begin{aligned} D_{KL}[Q(z|x) \Vert P(z|x)] &= E_{z \sim Q}[\log Q(z|x) - \log P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x)\\ &= D_{KL}[Q(z|x) \Vert P(z)] - E_{z \sim Q}[\log P(x|z)] + \log P(x) \end{aligned} \)
Rearranging we get:
\(\displaystyle \begin{aligned} &\log P(x) - D_{KL}[Q(z|x) \Vert P(z|x)] = E_{z \sim Q}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\\ \implies &\log P(x) \geq E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)] \end{aligned} \)
This is known as variational lower bound or ELBO.

  • We first have the encoder output a mean \(\displaystyle \mu_{z|x}\) and covariance matrix diagonal \(\displaystyle \Sigma_{z|x}\).
  • For ELBO we want to maximize \(\displaystyle E_{z \sim Q(z|x)}[\log P(x|z)] - D_{KL}[Q(z|x) \Vert P(z)]\).
  • Our first loss is \(\displaystyle D_{KL}(N(\mu_{z|x}, \Sigma_{z|x}) \Vert N(0, I))\).
  • We sample z from \(\displaystyle N(\mu_{z|x}, \Sigma_{z|x})\) and pass it to the decoder which outputs \(\displaystyle \mu_{x|z}, \Sigma_{x|z}\).
  • Sample \(\displaystyle \hat{x}\) from the distribution and have reconstruction loss \(\displaystyle \Vert x - \hat{x} \Vert^2\).
  • Most blog posts will forget to sample from \(\displaystyle P(x|z)\).
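
The KL term in the steps above has a closed form when the encoder outputs a diagonal Gaussian and the prior is N(0, I): 0.5 * sum(var + mu^2 - 1 - log var). A minimal sketch, where `mu` and `var` are hypothetical encoder outputs (the mean and the diagonal of the covariance):

```python
import math

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) = 0.5 * sum(var + mu^2 - 1 - log var)."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))

# The posterior equals the prior, so the penalty is zero.
no_penalty = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# Shifting the mean by 1 in one dimension costs 0.5 nats.
shifted = kl_to_standard_normal([1.0], [1.0])
```

Many implementations have the encoder output the log-variance instead of the variance for numerical stability; that is a common design choice rather than something stated in these notes.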
Modeling P(x|z)

Let f(z) be the network output.

  • Assume \(\displaystyle P(x|z)\) is iid Gaussian.
  • \(\displaystyle \hat{x} = f(z) + \eta\) where \(\displaystyle \eta \sim N(0,1)\)
  • Simplifies to an L2 loss \(\displaystyle \Vert x - f(z) \Vert^2\)
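
To see why the Gaussian assumption gives an L2 loss, write out the negative log-likelihood for \(\displaystyle P(x|z) = N(x; f(z), I)\), where \(\displaystyle d\) (a symbol introduced here, not in the lecture) is the dimension of \(\displaystyle x\):

\(\displaystyle \begin{aligned} -\log P(x|z) &= -\log \left[ (2\pi)^{-d/2} \exp\left(-\tfrac{1}{2} \Vert x - f(z) \Vert^2\right) \right]\\ &= \tfrac{1}{2} \Vert x - f(z) \Vert^2 + \tfrac{d}{2} \log(2\pi) \end{aligned} \)

The second term does not depend on the network, so maximizing \(\displaystyle \log P(x|z)\) is equivalent to minimizing \(\displaystyle \Vert x - f(z) \Vert^2\).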

Importance weighted VAE uses N samples for the loss.

Reparameterization trick

To sample from the latent space, you do \(\displaystyle z = \mu + \sigma \varepsilon\) where \(\displaystyle \varepsilon \sim N(0,1)\).
This way, you can backprop through the sampling step.
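
A minimal scalar sketch of the trick, using only the standard library. The derivatives are written out by hand to show that, once the noise is fixed, z is an ordinary differentiable function of the mean and standard deviation; an autodiff framework would compute these automatically.

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1); the randomness lives only in eps."""
    eps = rng.gauss(0.0, 1.0)
    z = mu + sigma * eps
    # With eps fixed, z is deterministic in (mu, sigma):
    dz_dmu, dz_dsigma = 1.0, eps
    return z, dz_dmu, dz_dsigma

rng = random.Random(0)
samples = [reparameterize(2.0, 0.5, rng)[0] for _ in range(50_000)]
mean = sum(samples) / len(samples)
std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
```

The samples still follow N(2, 0.5^2), so the forward pass is unchanged; only the way the randomness enters the computation graph is different.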

Conditional VAE

Just input the condition into the encoder and decoder.

  • Principled approach to generative models
  • Allows inference of \(\displaystyle q(z|x)\) which can be used as a feature representation
  • Maximizes ELBO
  • Samples are blurrier than GANs

Why are samples blurry?

  • Samples are not blurry but noisy
    • Sample vs Mean/Expected Value
  • L2 loss

Flow-based Models

Flow-based models minimize the negative log-likelihood.

Attribute-based Representation


Typically in recognition, we only predict the class of the image.
From the category, we can guess the attributes but the category provides only limited information.
The network cannot perform prediction on unseen new classes.
This problem used to be called graceful degradation.


Learn intermediate structure with object categories.

Should we care about attributes in DL?
Why are attributes not simply supervised recognition?
  • Dealing with inevitable failure.
  • We can infer things about unseen categories.
  • We can make comparison between objects or categories.
  • a-Pascal
  • a-Yahoo
  • CORE
  • COCO Attributes

Deep networks should be able to learn attributes implicitly.
However, you don't know if it has actually learned them.

Extra Topics

Fine-grained Recognition

Few-shot Recognition

  • Metric learning methods
  • Meta-learning methods
  • Data Augmentation Methods
  • Semantics

Zero-shot Recognition

The goal is to train a classifier without having seen a single labeled example.
The information comes from a knowledge graph e.g. from word embeddings.
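
One simple way to picture this is nearest-neighbor classification in a semantic embedding space. The sketch below is an illustration under strong assumptions: the 3-D class embeddings are made-up stand-ins for word embeddings, and the image embedding is assumed to come from some model that maps images into the same space.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, class_embeddings):
    """Pick the class whose semantic embedding is most similar to the image embedding."""
    return max(
        class_embeddings,
        key=lambda name: cosine(image_embedding, class_embeddings[name]),
    )

# Hypothetical embeddings for classes that have no labeled training images.
class_embeddings = {
    "zebra": (0.9, 0.1, 0.8),
    "horse": (0.9, 0.1, 0.1),
    "car": (0.0, 1.0, 0.0),
}
image_embedding = (0.8, 0.0, 0.7)  # assumed output of an image-to-embedding model
prediction = zero_shot_classify(image_embedding, class_embeddings)
```

Adding a new class only requires its embedding, not any labeled images, which is the point of the zero-shot setting.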

Beyond Labelled Datasets

  • Semi-supervised: We have both labelled and unlabeled training samples.
  • Weakly-supervised: The labels are weak, noisy, and not necessarily for the task we want.
  • Learning from the Web: Download data from the internet

Will be on the exam

  • Back-prop and SGD
  • Softmax, sigmoid, cross entropy
  • RCNN vs Fast-RCNN vs Faster-RCNN
  • DPM
  • Selective search vs RPN
  • ELBO

Final exam:

  • Friday Dec 4, 2020
  • 4pm-6pm on gradescope
  • Will have multiple choice, fill-in-the-blank, question answering, open-ended
    • Generally simple questions; either you know or you don't
  • No practice exams
  • Only need to know major names (RCNN, Fast(er)-RCNN)
  • Only covers lecture material.
  • 1 letter page of open notes, both sides allowed (honor system)

Homework 2

  • Released Nov 30, 2020
  • Take SLIC super pixels, extract deep features and classify them.
  • Will have 2 bonus credits, must pick one
    • Use features from multiple layers (multi-scale)
    • Use multiple-levels of SLIC in input (SLIC feature pyramid)

Final Project

  • Presentations Dec 10
  • Recorded videos with presentations
  • Final reports Dec 18

My Exam Cheat Sheet

Project Notes

We will need to include:

  • Challenges
  • What methods worked and didn't work.


  1. Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for non-parametric object and scene recognition (PAMI 2008) Link
  2. Lionel Standing (1973). Learning 10000 pictures. Quarterly Journal of Experimental Psychology Link
  3. Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. Link.
  4. Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) Link
  5. Kevin Dale, Micah K. Johnson, Kalyan Sunkavalli, Wojciech Matusik, Hanspeter Pfister (2009) Image Restoration using Online Photo Collections (ICCV 2009) Link
  6. James Hays, Alexei A. Efros (2007). Scene Completion Using Millions of Photographs (SIGGRAPH 2007) Link
  7. James Hays, Alexei A. Efros (2008). IM2GPS: estimating geographic information from a single image. (CVPR 2008) Link
  8. Biliana Kaneva, Josef Sivic, Antonio Torralba, Shai Avidan, William T. Freeman (2008). Matching and Predicting Street Level Images (ECCV Workshops 2008) Link
  9. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks (NIPS 2012) Link
  10. Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger (2017). Densely Connected Convolutional Networks (CVPR 2017) Link
  11. Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy (2018). Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification (ECCV 2018) Link
  12. Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan (2009) Object Detection with Discriminatively Trained Part Based Models Link
  13. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg (2016) SSD: Single Shot MultiBox Detector Link
  14. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection Link
  15. Jamie Shotton, John Winn, Carsten Rother, Antonio Criminisi (2009) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. Link
  16. Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick (2016). Inside-Outside Net: Detecting Objects in Context With Skip Pooling and Recurrent Neural Networks (CVPR 2016) CVF Mirror
  17. Abhinav Shrivastava, Abhinav Gupta, Ross Girshick (2016) Training Region-Based Object Detectors With Online Hard Example Mining. (CVPR 2016) Link