Visual Learning and Recognition: Difference between revisions

Revision as of 14:22, 17 September 2020

1,783 bytes added , 17 September 2020

no edit summary

David

@@ Line 85: / Line 85: @@
 ==ConvNets and Architectures==
-===Papers===
+===Paper Summaries===
-Huang ''et al.''<ref name="huang2018densenet"></ref> develop DenseNet.
+Krizhevsky ''et al.''<ref name="krizhevsky2012alexnet"></ref> develop AlexNet for image classification. AlexNet is a CNN architecture with two branches, one per GPU. Their architecture and training procedure include many tricks, some of which are now commonplace. These include multi-GPU training, 8 layers, ReLU activations, Local Response Normalization (LRN), (overlapping) max pooling, data augmentation, and dropout. They won on ImageNet 2012 by a large margin.
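Overlapping max pooling simply means the stride is smaller than the pooling window, so neighbouring windows share inputs. The following is a minimal 1D sketch of that idea in plain Python; the function name <code>max_pool_1d</code> and the toy input are illustrative assumptions, not code from the paper.

```python
def max_pool_1d(values, window, stride):
    """Max-pool a list with the given window and stride (no padding).

    Windows overlap whenever stride < window.
    """
    pooled = []
    start = 0
    # Slide the window until it no longer fits inside the input.
    while start + window <= len(values):
        pooled.append(max(values[start:start + window]))
        start += stride
    return pooled


def pooled_length(n, window, stride):
    """Number of outputs produced by max_pool_1d for an input of length n."""
    return (n - window) // stride + 1


signal = [1, 5, 2, 4, 3, 9, 0]  # hypothetical toy input

# Overlapping: window 3, stride 2 (windows share one element).
overlapping = max_pool_1d(signal, window=3, stride=2)

# Non-overlapping: window 2, stride 2.
non_overlapping = max_pool_1d(signal, window=2, stride=2)
```

With window 3 and stride 2, every input except the first and last positions can fall into two windows, which is the "overlap" referred to in the summary.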
+Huang ''et al.''<ref name="huang2018densenet"></ref> develop DenseNet for image classification. The main contribution is the dense block, in which each layer is connected to all subsequent layers in the block (i.e. the outputs are accumulated by concatenation). Each layer consists of BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3). Following each dense block, they use transition layers (1x1 conv + 2x2 avg pool) to shrink the feature maps. They evaluate on CIFAR, SVHN, and ImageNet.
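Because outputs are accumulated by concatenation, the number of input channels grows with every layer in a dense block. The sketch below only tracks tensor shapes, not the actual computation. It assumes each layer adds a fixed number of channels (called <code>growth_rate</code> here) and that the 2x2 average pool uses stride 2; the function names and the example numbers are hypothetical.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Return (input channels seen by each layer, channels leaving the block).

    Assumes every layer emits `growth_rate` new channels that are
    concatenated onto everything produced before it.
    """
    per_layer_inputs = []
    channels = in_channels
    for _ in range(num_layers):
        per_layer_inputs.append(channels)
        channels += growth_rate  # concatenation widens the feature map
    return per_layer_inputs, channels


def transition_shape(height, width, out_channels):
    """Shape after a transition layer (1x1 conv, then 2x2 avg pool).

    The 1x1 conv sets the channel count; the pool (assumed stride 2)
    halves each spatial dimension.
    """
    return out_channels, height // 2, width // 2


# Hypothetical example: 64 input channels, 4 layers, 32 new channels each.
layer_inputs, block_out = dense_block_channels(64, 32, 4)
shape_after_transition = transition_shape(56, 56, out_channels=block_out // 2)
```

Here the fourth layer already sees 160 input channels, which gives one plausible reading of why a 1x1 convolution precedes each 3x3 convolution: it can reduce the width before the more expensive 3x3 step.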
 Xie ''et al.''<ref name="xie2018rethinking"></ref> develop S3D-G for video classification. The main idea is that video classification can be done with Conv2d layers at lower layers and Conv3d layers at higher layers. In addition, the time and spatial dimensions can be separated into two different 3D convolutions (with <math>1 \times k \times k</math> and <math>k_t \times 1 \times 1</math> kernels). These two changes improve the accuracy and efficiency of video classification compared to just Conv3d.
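To see why separating the spatial and temporal dimensions can help efficiency, it is enough to count weights. The sketch below compares one full <math>k_t \times k \times k</math> convolution with a <math>1 \times k \times k</math> convolution followed by a <math>k_t \times 1 \times 1</math> convolution. It ignores biases and assumes the intermediate channel count equals the output channel count; the function names and channel sizes are illustrative, not taken from the paper.

```python
def full_conv3d_weights(c_in, c_out, k_t, k):
    """Weight count of one k_t x k x k 3D convolution (biases ignored)."""
    return c_in * c_out * k_t * k * k


def separable_conv3d_weights(c_in, c_out, k_t, k):
    """Weight count of a 1 x k x k conv followed by a k_t x 1 x 1 conv.

    Assumes the spatial conv maps c_in -> c_out and the temporal conv
    maps c_out -> c_out (biases ignored).
    """
    spatial = c_in * c_out * k * k   # 1 x k x k kernel
    temporal = c_out * c_out * k_t   # k_t x 1 x 1 kernel
    return spatial + temporal


# Hypothetical layer: 64 -> 64 channels with 3 x 3 x 3 kernels.
full = full_conv3d_weights(64, 64, k_t=3, k=3)
separable = separable_conv3d_weights(64, 64, k_t=3, k=3)
```

Under these assumptions the separable form needs fewer than half the weights of the full 3D kernel (49,152 versus 110,592).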
@@ Line 109: / Line 111: @@
 <ref name="heys2008gps">James Hays, Alexei A. Efros (2008). IM2GPS: estimating geographic information from a single image. (CVPR 2008) [http://graphics.cs.cmu.edu/projects/im2gps/im2gps.pdf http://graphics.cs.cmu.edu/projects/im2gps/im2gps.pdf]</ref>
 <ref name="kaneva2008matching">Biliana Kaneva, Josef Sivic, Antonio Torralba, Shai Avidan, William T. Freeman (2008). Matching and Predicting Street Level Images (ECCV Workshops 2008) [https://people.csail.mit.edu/biliana/papers/eccv2010/eccv_workshop_2010.pdf https://people.csail.mit.edu/biliana/papers/eccv2010/eccv_workshop_2010.pdf]</ref>
+<ref name="xie2018rethinking">Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy (2018). Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification (ECCV 2018) [https://arxiv.org/pdf/1712.04851.pdf https://arxiv.org/pdf/1712.04851.pdf]</ref>
+<ref name="huang2018densenet">Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger (2017). Densely Connected Convolutional Networks (CVPR 2017) [https://arxiv.org/pdf/1608.06993.pdf https://arxiv.org/pdf/1608.06993.pdf]</ref>
+<ref name="krizhevsky2012alexnet">Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012). ImageNet Classification with Deep Convolutional Neural Networks (NIPS 2012) [https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf]</ref>
 }}