Visual Learning and Recognition
Notes for CMSC828I Visual Learning and Recognition (Fall 2020) taught by Abhinav Shrivastava
This class covers:
- How a sub-topic evolved
- State of the art
Introduction to Data
Lecture 3 September 8, 2020
The extremes of data.
If we have very few images, we are working on an extrapolation problem.
As we approach an infinite number of training samples, learning becomes an interpolation problem.
Traditional datasets are in the order of \(\displaystyle 10^2-10^4\) training samples.
Current datasets are in the order of \(\displaystyle 10^5-10^7\) training samples.
In tiny images [1], Torralba et al. use 80 million tiny images.
- What is the capacity of visual long term memory?
In Standing (1973)[2], people could recognize whether they had seen each of 10,000 previously viewed images with 83% accuracy.
- What we don't know is what people are remembering for each item.
In Brady et al.[3], they tested recognition against novel foils (a completely new object), exemplar foils (a different object of the same type), and state foils (the same object in a different state). They got 92% for novel, 88% for exemplar, and 87% for state, so humans remember the exact state of objects they've seen.
- Rule of thumb
(Simple algorithms + big data) is better than (complicated algorithms + small data)
Lecture 4 September 10, 2020
This lecture is on the bias of data. It follows Torralba and Efros[4].
- Will big data solve all our problems?
E.g. Can (big company) just dump millions of dollars to collect data and solve any problem?
No. E.g. COVID.
There will always be new tasks or problems.
We will never have enough data
Long tails - Zipf's law
Data is biased
Types of visual bias:
- Observer Bias (human vs bird)
- Capture Bias (photographer vs robot)
- Selection Bias (Flickr vs Google Street View)
- Category/Label Bias
- Negative Set Bias
In general, all datasets will have all of these biases mixed in.
- Social Bias
Graduation photos always have a certain structure.
Measuring Dataset Bias
Evaluate cross-dataset performance
Train on one dataset, test on another
To evaluate negative set bias, pool negatives from other datasets (e.g. not car or not person).
They found that models trained on a dataset do ~8% worse at detecting negatives from other datasets.
Overcoming Dataset Bias
Mixing datasets
- Selection bias
In general, automatically gathered images do better.
You can also collect data from multiple sources (multiple search engines across multiple countries)
or collect unannotated images and label them via crowd-sourcing.
- Capture bias
To overcome the bias of professional photographs:
Apply data augmentations: flipping images, jittering (small affine transformations), random crops (see the sketch after this list).
- Negative set bias
Add negatives from other datasets.
Mine hard negatives from other datasets using standard algorithms.
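To make the capture-bias augmentations above concrete, here is a minimal sketch using torchvision (assuming torchvision is installed; the transform parameters are illustrative, not values given in lecture):

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline for reducing capture bias:
# random flips, small affine jitter, and random crops.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.95, 1.05)),
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    T.ToTensor(),
])

# augmented = augment(pil_image)  # apply to a PIL image during training
```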
Data-driven Methods in Vision
Beginning of Lecture 5 (September 15)
Dale et al.[5] perform semantic color correction using a large dataset.
Hays and Efros[6] perform scene completion.
Hays and Efros[7] perform image localization.
Kaneva et al.[8] perform scene matching with camera view transformations.
Dealing with Sparse Data
- Better Similarity
Better Alignment
- E.g. reduce resolution, sifting, warping
- SIFT-Flow
Take SIFT features for all regions.
Then learn a mapping from SIFT vectors to RGB colors.
The resulting RGB images are called SIFT flow features.
Similar RGB regions will have similar SIFT feature vectors.
Then we can learn some transformation \(\displaystyle T\) to match the SIFT flows (i.e. \(\displaystyle T(F_1) \approx F_2\)).
- Non-parametric Scene Parsing (CVPR 2009)
If you have a good scene alignment algorithm, you can just use a segmentation map.
Use sub-images (primitives) to match
Allows matching from multiple images
- Mid-level primitives
Bag of visual words:
- Take some features (e.g. SIFT) from every image in your dataset.
- Apply clustering to your dataset to get k clusters. These k clusters are your visual words.
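A minimal sketch of building the visual vocabulary by clustering local descriptors (the random array stands in for real SIFT descriptors, and the vocabulary size k is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

# descriptors: (num_descriptors, 128) local features pooled from the whole dataset.
descriptors = np.random.rand(10000, 128)  # placeholder for real SIFT descriptors

k = 500  # vocabulary size (number of visual words)
kmeans = KMeans(n_clusters=k, n_init=10).fit(descriptors)

# Each image is then represented as a normalized histogram over the k visual words.
def bow_histogram(image_descriptors):
    words = kmeans.predict(image_descriptors)
    hist, _ = np.histogram(words, bins=np.arange(k + 1))
    return hist / max(hist.sum(), 1)
```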
The challenge with matching patches is how to find patches to match?
Ideally, we want patches which are both representative and discriminative.
Representative is that the patch is found in the target image set; i.e. coverage of the target concept.
Discriminative is that the patch is not found in non-target image sets (distinct from other concepts).
Understanding simple stuff first
E.g. from a video, find one frame which is easy to detect pose and then apply optical-flow methods to transfer the flow to adjacent frames.
Looking beyond the k-NN method
Use data to make connections.
- Visual Memex Knowledge Graph
(Malisiewicz and Efros 2009)
Build a visual knowledge graph of entities. Edges can be context edges or similarity edges.
Embed an image into the graph and copy information from the graph.
- Manifolds in Vision
These days, we can assume deep learning features are reasonable manifolds.
ConvNets and Architectures
See Convolutional neural network for basics.
Paper Summaries
Krizhevsky et al.[9] develop AlexNet for image classification. AlexNet is a CNN architecture with two branches. Their architecture and training procedure include many tricks, some of which are commonplace today: multi-GPU training, 8 layers, ReLU activations, Local Response Normalization (LRN), (overlapping) max pooling, data augmentation, and dropout. They won ImageNet 2012 by a large margin.
Huang et al.[10] develop DenseNet for image classification. The main contribution is dense blocks, where each layer within a block is connected to all subsequent layers (i.e. the outputs are accumulated by concatenation). Each layer consists of (BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3)). Following each dense block, they use transition layers (1x1 conv + 2x2 avg pool) to shrink the size. They evaluate on CIFAR, SVHN, and ImageNet.
Xie et al.[11] develop S3D-G for video classification. The main idea is that video classification can be done with Conv2d layers at lower layers and Conv3d layers at higher layers. In addition, the time and spatial dimensions can be separated into two different 3D convolutions (with \(\displaystyle 1 \times k \times k\) and \(\displaystyle k_t \times 1 \times 1\) kernels). These two changes improve the accuracy and efficiency of video classification compared to just Conv3d.
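A minimal sketch of the separable spatiotemporal convolution idea from S3D-G, assuming feature maps of shape (batch, channels, time, height, width); channel counts and kernel size here are placeholders:

```python
import torch.nn as nn

def sep_conv3d(in_ch, out_ch, k=3):
    # Factor a k x k x k convolution into a spatial (1 x k x k) conv
    # followed by a temporal (k x 1 x 1) conv.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k), padding=(0, k // 2, k // 2)),
        nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0)),
    )
```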
Overview
ConvNet pipeline:
- Input
- Conv/ReLU/Pool
- FC/ReLU
- FC/Normalization/Loss
VGGNet
ILSVRC 2014 2nd place
This is a sequence of deeper networks trained progressively.
They replace large receptive fields with successive 3x3 conv + ReLU layers.
A single 7x7 conv layer with C-dim input and C-dim output would need \(\displaystyle 49 \times C^2\) weights.
Three \(\displaystyle 3\times 3\) conv layers only need \(\displaystyle 27 \times C^2\) weights.
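A quick sanity check of the weight counts above, plus a sketch of the equivalent stack of 3x3 conv + ReLU layers (C = 256 is just an example):

```python
import torch.nn as nn

C = 256
params_7x7 = 7 * 7 * C * C              # 49 * C^2 = 3,211,264
params_3x3_stack = 3 * (3 * 3 * C * C)  # 27 * C^2 = 1,769,472
print(params_7x7, params_3x3_stack)

# Three 3x3 conv + ReLU layers cover the same 7x7 receptive field with fewer weights.
block = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
```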
Network in network
Use a small perceptron as your convolution kernel, i.e. each local block is fed into the perceptron, and its output is used instead of the cross-correlation with a standard kernel.
GoogLeNet
Hebbian Principle: Neurons that fire together are typically wired together.
Implemented using an Inception Module.
The key idea is to use a heterogeneous set of convolutions.
Naive idea: Do a 1x1 convolution, 3x3 convolution, and 5x5 convolution and then concatenate the output together.
The intuition is that each captures a different receptive field.
In practice, they need to add 1x1 convolutions before the 3x3 and 5x5 convolutions to make it work. These are used for dimension reduction by controlling the channel.
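A minimal sketch of an Inception-style module with 1x1 reductions before the 3x3 and 5x5 branches; the channel counts are illustrative rather than the exact GoogLeNet configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(              # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(              # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # Concatenate the heterogeneous branches along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```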
Another idea is to add auxiliary classifiers across the network.
Inception v2, v3: V2 adds batch normalization to reduce dependence on the auxiliary classifiers. V3 adds factorized convolutions (i.e. nx1 and 1xn convolutions).
ResNet
The main idea is to introduce skip or shortcut connections.
Skip connections existed in the literature before.
This means returning \(\displaystyle F(x)+x\).
This allows smoother gradient flow since intermediate layers cannot block the gradient.
They also replace 3x3 convolutions on 256 channels with 1x1 to 64 channels, 3x3 on the 64 channels, then 1x1 back to 256 channels.
This reduces parameters from approx 600k to approx 70k.
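A sketch of the bottleneck residual block described above (256 -> 64 -> 64 -> 256), assuming the shortcut dimensions already match; batch norm is omitted for brevity:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # F(x) + x: the shortcut lets gradients bypass the block.
        return self.relu(self.block(x) + x)

# Weight count (ignoring biases): 1*1*256*64 + 3*3*64*64 + 1*1*64*256 ≈ 70k,
# versus 3*3*256*256 ≈ 590k for a plain 3x3 conv on 256 channels.
```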
Accuracy vs efficiency
First we had AlexNet. Then we had VGG which had way more parameters and better accuracy.
Then we had GoogLeNet which is much smaller than both AlexNet and VGG with roughly the same accuracy.
Next, ResNet and Inception increased the parameters slightly and attained better performance.
Beyond Resnet
- Fractal Net
This is a take on ResNet which removes the skip connections that span the whole network.
The point is to show that the performance comes from having paths of different lengths.
- Wide ResNet
Reduce the number of residual blocks but increase feature maps in each block.
Shows that it's not just about depth but also the width of each layer.
Computationally, a wide network is more parallelizable.
We thought that more layers make networks exponentially more powerful.
However, this contradicts that hypothesis.
- ResNeXt
Propose cardinality as a feature of network design.
First split the 256-channel input across the channel dimension.
Each path has far fewer features (4 instead of 64), but there are now 32 separate paths.
Within each layer, how many things can we do independently of each other?
DenseNets
CVPR 2017 best paper award.
Forget about ResNets; just connect every layer to all following layers.
Video
Images are RGB.
Videos are RGB+T.
Combine per-frame models
- Single Frame
- Late Fusion (combine features for frames apart in time)
- Early Fusion (combine features from adjacent frames)
- Slow Fusion (combine features from adjacent frames, then combine the resulting features from adjacent groups, ...)
2-stream networks
Have a spatial stream and a temporal stream.
The spatial stream works on a single RGB frame.
The temporal stream works on optical flow.
3D ConvNets
Slide the kernel in the time domain.
I3D
Inflated 3D ConvNet
Types of 3D networks:
- LSTM
- 3D-ConvNet
- Two-Stream
- 3D-Fused Two Stream
- Two-stream 3D-ConvNet
Take the inception module and add a time dimension.
Design Principles
- Make networks parameter-efficient
- Reduce filter sizes, factorize filters
- Use 1x1 convolutions to reduce number of feature maps
- Minimize reliance on FC layers
- Reduce spatial resolution gradually so we can repeat the same block
- Use skip connections or multiple redundant paths.
- Play around with depth vs width vs cardinality
Miscellaneous Things
- Training tricks and details
- Training data augmentation
- Ensembles of networks
Object Detection
Beginning of Lecture 10 (Oct 1)
Edge Templates + Nearest Neighbor
Gavrila & Philomen (1999)
- From a raw image, do feature extraction and calculate distance transform.
- Do nearest neighbor search.
Cons:
- Templates are hand-made.
Haar Wavelets + SVM
A Trainable System for Object Detection. (Papageorgiou & Poggio, 2000)
- Extract Overcomplete Representation
- Called Haar Wavelets. Similar to CNN filters.
- Wavelet features can be calculated by averaging all faces. Similar to CNN features.
- Train an SVM classifier
- + Parts (2001)
Trained an SVM for face, legs, left arm, right arm.
When detecting a person, make sure all parts are in the correct location and shape with some constraints.
Haar Wavelets + AdaBoost
Basically, do the same as before (Haar wavelets) but replace the SVM with AdaBoost.
- Rectangular differential features (Viola & Jones 2001)
Use fast features to throw out parts of the image.
Then do processing on the remainder.
Became the standard object detection system in OpenCV.
- Learnt wavelets + AdaBoost
Works on more than just faces.
Ensemble face detection.
Dynamic Programming
Efficient matching of pictorial structures (Felzenszwalb & Huttenlocher, 2000) Basically have a cartoon model and match the position & orientation of each part.
Probabilistic Methods for Finding People (Ioffe & Forsyth, 1999)
More Techniques
How to detect objects at different scale?
One trick is to detect the horizon line and scale based on the horizon line.
Sliding Window:
Create multiple scales of the image and detect at each scale.
This is done by building a feature pyramid using an image pyramid.
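A minimal image-pyramid sketch (using OpenCV's pyrDown, which blurs and downsamples by 2x; the number of octaves is a free parameter):

```python
import cv2

def image_pyramid(img, num_octaves=4):
    # Successively halve the image so a fixed-size template/detector
    # can find objects at multiple scales.
    pyramid = [img]
    for _ in range(num_octaves - 1):
        img = cv2.pyrDown(img)  # Gaussian blur + 2x downsample
        pyramid.append(img)
    return pyramid
```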
Histogram of Oriented Gradients (HOG)
How many octaves? As many octaves as needed to reduce the image size down to the template size, plus 1 octave of 2x upscaling.
How many levels? Generally people try 10 levels.
Precision and Recall
Precision is (# correct) / (# predictions).
Recall is (# correct) / (# ground truth).
Consider the following table
Box | Score | IOU G1 | IOU G2 | IOU G3 | IOU G4 | IOU G5
b1 | 0.9 | 0.6 | 0.1 | 0.1 | 0 | 0
b2 | 0.8 | 0 | 0 | 0.1 | 0 | 0
b3 | 0.7 | 0 | 0 | 0 | 0 | 0.7
b4 | 0.6 | 0 | 0 | 0 | 0 | 0
Starting with b1 we have a precision of 1 and a recall of 1/5 since we detect only G1.
From b2, precision becomes 1/2 and recall stays at 1/5 since b2 matches no ground-truth box.
From b3, precision becomes 2/3 and recall becomes 2/5 since b3 matches G5.
From b4, precision becomes 2/4, recall is still 2/5.
The area under the Precision vs Recall curve is called the average precision (AP).
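A small sketch that reproduces the precision/recall walk-through above, assuming detections are already sorted by score and matching has been decided:

```python
import numpy as np

def precision_recall(matches, num_gt):
    """matches[i] = 1 if the i-th highest-scoring box matches a new ground-truth box, else 0."""
    tp = np.cumsum(matches)
    precision = tp / np.arange(1, len(matches) + 1)
    recall = tp / num_gt
    return precision, recall

# The table above: b1 matches G1, b2 nothing, b3 matches G5, b4 nothing.
p, r = precision_recall([1, 0, 1, 0], num_gt=5)
print(p)  # approximately [1.0, 0.5, 0.667, 0.5]
print(r)  # [0.2, 0.2, 0.4, 0.4]
# AP is the area under this precision-recall curve.
```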
Non-max Suppression
The NMS heuristic here is used to reduce the number of bounding boxes per object to 1.
Initially, you have a set of overlapping bounding boxes \(\displaystyle B\).
Create a final set \(\displaystyle D\).
- While B is not empty
- Remove the highest confidence/score box \(\displaystyle b_i\) from \(\displaystyle B\). Add it to \(\displaystyle D\)
- For every other box \(\displaystyle b_j\),
- If \(\displaystyle IOU(b_i, b_j) \gt \lambda\) (i.e. they bound the same object), discard \(\displaystyle b_j\)
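A minimal NumPy sketch of this greedy NMS loop, with boxes as [x1, y1, x2, y2] and the overlap threshold \(\displaystyle \lambda\) as a free parameter:

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    order = np.argsort(scores)[::-1].tolist()  # highest score first
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # Discard remaining boxes that overlap b_i too much (same object).
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep
```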
Hard mining
During training, classify on all images.
Figure out which instances the classifier classifies incorrectly.
Then train only on those negative instances.
Current HOG
Current HOG uses 31 dimensions
- 9 Contrast insensitive gradients
- 18 Contrast sensitive gradients
- 4 Texture Related
Discriminatively Trained Part Based Models (DPM)
Felzenszwalb et al (2009) [12] Important: Read this paper
- Train a part detector (e.g. Head, Leg, Arm)
- Enforce constraints between parts.
- Part Configurations
\(\displaystyle \mathbf{p} = (p_0, p_1, p_2,...)\).
- Scoring a configuration
\(\displaystyle score(p) = \sum_{i=0}^{N}w_i^T \phi(p_i) + \sum_{ij}w_{ij}^T \psi(p_i, p_j)\)
- \(\displaystyle w_{ij} \in \mathbb{R}^5\) is the deformation parameter between parts i and j
The total number of configurations is \(\displaystyle 10^{4N}\) since for a \(\displaystyle 100\times100\) image, each \(\displaystyle p_i\) can take \(\displaystyle 100 \times 100\) values. \(\displaystyle N\) is the number of parts.
The trick is to use dynamic programming and a tree-based model.
I.e. if p1 is the body and p2 is the head then the deformation of p2 is only with respect to p1 and p3 is only with respect to p1.
There is no deformation calculation between p2 and p3.
The deformation is \(\displaystyle w_{12}^T \psi(p_1, p_2)\).
Then we can compute the max for p2 with respect to p1, p3 wrt p1, and then p1.
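A toy sketch of this dynamic program for a star model (body p1 as root, the other parts connected only to it); the appearance scores and quadratic deformation penalty are made up, not learned DPM filters:

```python
import numpy as np

# Toy 1D example: each of 3 parts can be placed at one of L locations.
L = 100
rng = np.random.default_rng(0)
appearance = rng.standard_normal((3, L))   # appearance[i, x]: score of part i at location x

def deformation(x_root, x_part):
    # Quadratic penalty for a part drifting away from the root's location.
    return -0.01 * (x_part - x_root) ** 2

best = -np.inf
for x1 in range(L):                         # root (body) placement
    score = appearance[0, x1]
    for part in (1, 2):                     # head, leg: depend only on the root
        score += max(appearance[part, x] + deformation(x1, x) for x in range(L))
    best = max(best, score)
print(best)   # best total score in O(N * L^2) instead of enumerating all L^N configurations
```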
Mixture Models
- Is One Model Enough?
In general, no, because objects have multiple views.
The solution is to use mixture models.
This gives us multiple part based models so we can capture different views of a single object.
\(\displaystyle score(\mathbf{p}) = \beta^T \Psi(\mathbf{p})\) where:
\(\displaystyle \beta = [w_0,..., b]\) and
\(\displaystyle \Psi(\mathbf{p}) = [\phi(p_0),...,\phi(p_N), \psi(p_0, p_1),...]\).
This can be trained using a linear SVM or using block gradient descent.
- Analyzing Mixture Models
\(\displaystyle L(\beta) = \frac{1}{2} \Vert \beta \Vert^2 + C\sum_{i=1}^{n} \max(0, 1-y_i \, score(\mathbf{z}_i))\)
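A small sketch of this objective for a fixed set of feature vectors \(\displaystyle \Psi(\mathbf{z}_i)\) (in the actual latent formulation the configurations are re-inferred between updates):

```python
import numpy as np

def objective(beta, Psi, y, C=1.0):
    # Psi: (n, d) feature vectors Psi(z_i); y: (n,) labels in {-1, +1}.
    scores = Psi @ beta
    hinge = np.maximum(0, 1 - y * scores)   # per-example hinge loss
    return 0.5 * np.dot(beta, beta) + C * hinge.sum()
```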
Region-based Approaches
1 Stage:
- Overfeat
- SSD
- YOLO
2 Stage:
- RCNN
- Fast RCNN
- Mask RCNN
Instance based:
- SDS
- RFCN
- MASK RCNN
Overfeat
Winner of the ILSVRC 2013 localization challenge.
The architecture first passes the image through some convolution & pooling layers.
Then a sequence of FC layers produces the output.
- Sliding Window
If the network takes 3x221x221 inputs and you have a 3x257x257 image:
Run the network over the image with a sliding window, then greedily merge the boxes.
- Efficient sliding window
Use a fully convolutional network.
Single Shot MultiBox Detector (SSD)
Liu et al (2016) propose SSD: Single Shot MultiBox Detector.
The idea is that they train a CNN to do object detection over the entire image.
The CNN outputs multiple feature maps for each of the categories, each with different aspect ratios and scales.
Pixels of the feature maps are scores for default boxes; each pixel is associated with a default bounding box.
The candidate results from the feature maps are filtered using non-maximum suppression.
Different scales are achieved by extracting feature maps from intermediate layers of the network.
The aspect ratio of each default box does not actually correspond to the receptive field associated with the feature pixel.
During training, all default boxes with Jaccard overlap > 0.5 with the ground truth are matched.
They also apply hard negative mining and data augmentation.
YOLO
Semantic Segmentation
Given an image, label every pixel with a class.
Note object segmentation is semantic segmentation with just one class.
Segmentation does not give us instances unlike object detection.
Instance segmentation is object segmentation + detection.
- Energy function
- Labels a pixel
- Penalty if label is unlikely
MRF
Markov random field
CRF
Conditional random field. Read TextonBoost (ECCV 2006).
Superpixels
- TODO
- Read SSD, YOLO, TextonBoost
Will be on the exam
- Back-prop and SGD,
- Softmax, sigmoid, cross entropy
Project Notes
We will need to include:
- Challenges
- What methods worked and didn't work.
Misc
References
- ↑ Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for non-parametric object and scene recognition (PAMI 2008) https://people.csail.mit.edu/torralba/publications/80millionImages.pdf
- ↑ Lionel Standing (1973). Learning 10000 pictures. Quarterly Journal of Experimental Psychology https://www.tandfonline.com/doi/abs/10.1080/14640747308400340
- ↑ Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf.
- ↑ Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf
- ↑ Kevin Dale, Micah K. Johnson, Kalyan Sunkavalli, Wojciech Matusik, Hanspeter Pfister (2009) Image Restoration using Online Photo Collections (ICCV 2009) https://faculty.idc.ac.il/arik/seminar2010/papers/ImageRestoration/restoration_iccv09.pdf
- ↑ James Hays, Alexei A. Efros (2007). Scene Completion Using Millions of Photographs (SIGGRAPH 2007) http://graphics.cs.cmu.edu/projects/scene-completion/scene-completion.pdf
- ↑ James Hays, Alexei A. Efros (2008). IM2GPS: estimating geographic information from a single image. (CVPR 2008) http://graphics.cs.cmu.edu/projects/im2gps/im2gps.pdf
- ↑ Biliana Kaneva, Josef Sivic, Antonio Torralba, Shai Avidan, William T. Freeman (2008). Matching and Predicting Street Level Images (ECCV Workshops 2008) https://people.csail.mit.edu/biliana/papers/eccv2010/eccv_workshop_2010.pdf
- ↑ Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks (NIPS 2012) https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
- ↑ Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger (2017). Densely Connected Convolutional Networks (CVPR 2017) https://arxiv.org/pdf/1608.06993.pdf
- ↑ Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy (2018). Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification (ECCV 2018) https://arxiv.org/pdf/1712.04851.pdf
- ↑ Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan (2009) Object Detection with Discriminatively Trained Part Based Models http://cs.brown.edu/people/pfelzens/papers/lsvm-pami.pdf