Visual Learning and Recognition

Notes for CMSC828I Visual Learning and Recognition (Fall 2020) taught by Abhinav Shrivastava

This class covers:

  • How a sub-topic evolved
  • State of the art

Introduction to Data

Lecture 3 September 8, 2020

The extremes of data: if we have very few images, we are working on an extrapolation problem.
As we approach an infinite number of training samples, learning becomes an interpolation problem.
Traditional datasets are on the order of \(\displaystyle 10^2-10^4\) training samples.
Current datasets are on the order of \(\displaystyle 10^5-10^7\) training samples.

In tiny images[1], Torralba et al. use 80 million tiny images.

What is the capacity of visual long term memory?

In Standing (1973)[2], people could recall whether they had seen 10,000 images with 83% recognition accuracy.

What we don't know is what people remember for each item.

In Brady et al.[3], they tested recall for novel (new object), exemplar (same type of object), and state (same object & state) conditions. They got 92% for novel, 88% for exemplar, and 87% for state, suggesting that humans remember the exact state of objects they've seen.

Rule of thumb

(Simple algorithms + big data) is better than (complicated algorithms + small data)

Lecture 4 September 10, 2020

This lecture is on the bias of data. It follows Torralba et al.[4]

Will big data solve all our problems?

E.g., can (big company) just spend millions of dollars to collect data and solve any problem?
No; consider COVID. There will always be new tasks or problems.

We will never have enough data

Long tails - Zipf's law
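
For reference, Zipf's law says the frequency of the \(\displaystyle k\)-th most common category falls off as \(\displaystyle f(k) \propto 1/k^s\) with \(\displaystyle s \approx 1\), so most categories have very few examples.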

Data is biased

Types of visual bias:

  • Observer Bias (human vs bird)
  • Capture Bias (photographer vs robot)
  • Selection Bias (Flickr vs Google Street View)
  • Category/Label Bias
  • Negative Set Bias
  • Social Bias (e.g. graduation photos always have a certain structure)

In general, all datasets will have all of these biases mixed in.

Measuring Dataset Bias

Evaluate cross-dataset performance: train on one dataset, test on another.
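
As a concrete sketch of this protocol (the feature matrices, labels, and logistic-regression classifier below are illustrative assumptions, not from the lecture):

  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import accuracy_score

  # X_a_*, y_a_*: features/labels from dataset A; X_b, y_b: from dataset B (assumed given)
  def cross_dataset_drop(X_a_train, y_a_train, X_a_test, y_a_test, X_b, y_b):
      model = LogisticRegression(max_iter=1000).fit(X_a_train, y_a_train)
      in_domain = accuracy_score(y_a_test, model.predict(X_a_test))  # held-out split of A
      cross = accuracy_score(y_b, model.predict(X_b))                # same task, dataset B
      return in_domain - cross  # a large drop indicates strong dataset bias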

To evaluate negative set bias, pool negatives from other datasets (e.g. not car or not person).
They found that models trained on a dataset do ~8% worse at detecting negatives from other datasets.

Overcoming Dataset Bias

Mixing datasets

Selection bias

In general, automatically gathered images do better (hand-picked images introduce more selection bias).
You can also collect data from multiple sources (multiple search engines across multiple countries) or collect unannotated images and label them via crowd-sourcing.

Capture bias

To overcome the bias of professional photographs:
Apply data augmentations: flipping images, jittering (small affine transformations), random crops.
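
For instance, a minimal augmentation pipeline along these lines could be written with torchvision (the specific parameter values here are illustrative assumptions):

  import torchvision.transforms as T

  # Flip, jitter (small affine transformation), and randomly crop each training image
  augment = T.Compose([
      T.RandomHorizontalFlip(p=0.5),
      T.RandomAffine(degrees=5, translate=(0.05, 0.05)),
      T.RandomCrop(224, padding=8),
      T.ToTensor(),
  ])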

Negative set bias

Add negatives from other datasets.
Mine hard negatives from other datasets using standard algorithms.
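
A minimal sketch of mining hard negatives, assuming an sklearn-style scoring model (the interface and top_k value are illustrative assumptions):

  import numpy as np

  def mine_hard_negatives(model, external_negatives, top_k=1000):
      # Score negatives pooled from other datasets; the ones the current model
      # is most confident are positive are the hard negatives worth training on.
      scores = model.decision_function(external_negatives)
      hardest = np.argsort(scores)[::-1][:top_k]
      return external_negatives[hardest]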

Data-driven Methods in Vision

Beginning of Lecture 5 (September 15)

Dale et al.[5] perform semantic color correction using a large dataset.
Hays and Efros[6] perform scene completion.
Hays and Efros[7] perform image localization.
Kaneva et al.[8] perform scene matching with camera view transformations.


Dealing with Sparse Data

  • Better Similarity
  • Better Alignment
      • E.g. reduce resolution, sifting, warping

SIFT-Flow

Compute dense SIFT features for all regions.
Then learn a mapping from SIFT vectors to RGB colors; the resulting RGB images are called SIFT flow features.
Regions with similar SIFT feature vectors will have similar RGB colors.
Then we can learn some transformation \(\displaystyle T\) to align the SIFT flows (i.e. \(\displaystyle T(F_1) \approx F_2\)).
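
To make this concrete, here is a minimal sketch (an illustration, not the SIFT-Flow reference code) that computes dense SIFT on a grid with OpenCV and projects the 128-dimensional descriptors to RGB via PCA:

  import cv2
  import numpy as np

  def sift_image(gray, step=8):
      # Dense SIFT: descriptors computed at every grid point
      sift = cv2.SIFT_create()
      h, w = gray.shape
      ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
      kps = [cv2.KeyPoint(float(x), float(y), float(step))
             for y, x in zip(ys.ravel(), xs.ravel())]
      _, desc = sift.compute(gray, kps)  # (N, 128) descriptors
      # PCA to 3 dimensions so similar descriptors map to similar colors
      desc = desc - desc.mean(axis=0)
      _, _, vt = np.linalg.svd(desc, full_matrices=False)
      rgb = desc @ vt[:3].T  # (N, 3)
      rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)
      return (rgb.reshape(ys.shape + (3,)) * 255).astype(np.uint8)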

Non-parametric Scene Parsing (CVPR 2009)

If you have a good scene alignment algorithm, you can simply transfer the segmentation map from matched images.

Use sub-images (primitives) to match

Allows matching from multiple images

Mid-level primitives

Bag of visual words:

  1. Take some features (e.g. SIFT) from every image in your dataset.
  2. Apply clustering to your dataset to get k clusters. These k clusters are your visual words.
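
A minimal sketch of these two steps with k-means from scikit-learn (the vocabulary size and helper names are illustrative assumptions):

  import numpy as np
  from sklearn.cluster import KMeans

  def build_vocabulary(descriptors_per_image, k=256):
      # Stack features (e.g. SIFT) from every image, then cluster into k visual words
      all_desc = np.vstack(descriptors_per_image)
      return KMeans(n_clusters=k, n_init=10).fit(all_desc)

  def bow_histogram(vocab, descriptors):
      # Represent one image as a normalized histogram over the visual words
      words = vocab.predict(descriptors)
      hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
      return hist / hist.sum()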

The challenge with matching patches is finding which patches to match.
Ideally, we want patches which are both representative and discriminative.
Representative means the patch occurs frequently in the target image set, i.e. it covers the target concept.
Discriminative means the patch is not found in non-target image sets, i.e. it is distinct from other concepts.

Understanding simple stuff first

E.g. from a video, find one frame where pose is easy to detect, then use optical-flow methods to propagate the pose to adjacent frames.

Looking beyond the k-NN method

Use data to make connections.

Visual Memex Knowledge Graph

(Malisiewicz and Efros 2009)
Build a visual knowledge graph of entities. Edges can be context edges or similarity edges.
Embed an image into the graph and copy information from the graph.

Manifolds in Vision

These days, we can assume deep learning features lie on reasonable manifolds.

ConvNets and Architectures

See Convolutional neural network for basics.

Paper Summaries

Krizhevsky et al.[9] develop AlexNet for image classification. AlexNet is a CNN architecture with two branches (one per GPU). Their architecture and training procedure include many tricks, some of which are now commonplace: multi-GPU training, 8 layers, ReLU activations, Local Response Normalization (LRN), (overlapping) max pooling, data augmentation, and dropout. They won ImageNet 2012 by a large margin.

Huang et al.[10] develop DenseNet for image classification. The main contribution is the dense block, in which each layer is connected to all subsequent layers within the block (i.e. the outputs are accumulated by concatenation). Each layer consists of BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3). Following each dense block, they use transition layers (1x1 conv + 2x2 avg pool) to shrink the feature-map size. They evaluate on CIFAR, SVHN, and ImageNet.
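
A minimal PyTorch sketch of one such dense layer, following the BN-ReLU-Conv pattern above (the bottleneck factor and names are illustrative, not the reference implementation):

  import torch
  import torch.nn as nn

  class DenseLayer(nn.Module):
      # One DenseNet layer: BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3)
      def __init__(self, in_ch, growth_rate, bottleneck=4):
          super().__init__()
          mid = bottleneck * growth_rate
          self.net = nn.Sequential(
              nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
              nn.Conv2d(in_ch, mid, kernel_size=1, bias=False),
              nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
              nn.Conv2d(mid, growth_rate, kernel_size=3, padding=1, bias=False),
          )

      def forward(self, x):
          # Concatenate the new features onto everything computed so far in the block
          return torch.cat([x, self.net(x)], dim=1)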

Xie et al.[11] develop S3D-G for video classification. The main idea is that video classification can be done with Conv2d layers at lower layers and Conv3d layers at higher layers. In addition, the temporal and spatial dimensions can be separated into two different 3D convolutions (with \(\displaystyle 1 \times k \times k\) and \(\displaystyle k_t \times 1 \times 1\) kernels). These two changes improve both the accuracy and the efficiency of video classification compared to using Conv3d everywhere.
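
A minimal PyTorch sketch of the separable 3D convolution described above (layer names and defaults are illustrative assumptions):

  import torch.nn as nn

  class SepConv3d(nn.Module):
      # Factor a k x k x k 3D conv into a spatial (1 x k x k) conv
      # followed by a temporal (k_t x 1 x 1) conv
      def __init__(self, in_ch, out_ch, k=3, k_t=3):
          super().__init__()
          self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k),
                                   padding=(0, k // 2, k // 2))
          self.temporal = nn.Conv3d(out_ch, out_ch, (k_t, 1, 1),
                                    padding=(k_t // 2, 0, 0))

      def forward(self, x):  # x has shape (N, C, T, H, W)
          return self.temporal(self.spatial(x))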

Will be on the exam

  • Back-prop and SGD
  • Softmax, sigmoid, cross entropy
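
For reference, the standard definitions: softmax is \(\displaystyle \operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\), sigmoid is \(\displaystyle \sigma(z) = \frac{1}{1 + e^{-z}}\), cross entropy is \(\displaystyle H(p, q) = -\sum_i p_i \log q_i\), and SGD updates parameters via \(\displaystyle \theta \leftarrow \theta - \eta \nabla_\theta L(\theta)\).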

Project Notes

We will need to include:

  • Challenges
  • What methods worked and didn't work.

References

  1. Antonio Torralba, Rob Fergus and William T. Freeman (2008). 80 million tiny images: a large dataset for non-parametric object and scene recognition (PAMI 2008) https://people.csail.mit.edu/torralba/publications/80millionImages.pdf
  2. Lionel Standing (1973). Learning 10,000 pictures. Quarterly Journal of Experimental Psychology. https://www.tandfonline.com/doi/abs/10.1080/14640747308400340
  3. Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva (2008). Visual long-term memory has a massive storage capacity for object details. (PNAS 2008) http://olivalab.mit.edu/MM/pdfs/BradyKonkleAlvarezOliva2008.pdf
  4. Antonio Torralba, Alexei A. Efros (2011). Unbiased Look at Dataset Bias (CVPR 2011) https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf
  5. Kevin Dale, Micah K. Johnson, Kalyan Sunkavalli, Wojciech Matusik, Hanspeter Pfister (2009) Image Restoration using Online Photo Collections (ICCV 2009) https://faculty.idc.ac.il/arik/seminar2010/papers/ImageRestoration/restoration_iccv09.pdf
  6. James Hays, Alexei A. Efros (2007). Scene Completion Using Millions of Photographs (SIGGRAPH 2007) http://graphics.cs.cmu.edu/projects/scene-completion/scene-completion.pdf
  7. James Hays, Alexei A. Efros (2008). IM2GPS: estimating geographic information from a single image. (CVPR 2008) http://graphics.cs.cmu.edu/projects/im2gps/im2gps.pdf
  8. Biliana Kaneva, Josef Sivic, Antonio Torralba, Shai Avidan, William T. Freeman (2010). Matching and Predicting Street Level Images (ECCV Workshops 2010) https://people.csail.mit.edu/biliana/papers/eccv2010/eccv_workshop_2010.pdf
  9. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks (NIPS 2012) https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
  10. Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger (2017). Densely Connected Convolutional Networks (CVPR 2017) https://arxiv.org/pdf/1608.06993.pdf
  11. Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy (2018). Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification (ECCV 2018) https://arxiv.org/pdf/1712.04851.pdf