Visual Learning and Recognition

A single 7x7 conv layer with C-dim input and C-dim output would need <math>49 \times C^2</math> weights.
Three stacked <math>3\times 3</math> conv layers have the same <math>7\times 7</math> receptive field but need only <math>27 \times C^2</math> weights.
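The weight counts above can be checked with simple arithmetic (the channel count <math>C=256</math> below is an arbitrary illustrative choice):

```python
# Parameter count: one 7x7 conv vs. a stack of three 3x3 convs,
# both mapping C input channels to C output channels (biases ignored).
def conv_params(k, c_in, c_out):
    """Weights in a single k x k convolution layer."""
    return k * k * c_in * c_out

C = 256  # illustrative channel count
single_7x7 = conv_params(7, C, C)      # 49 * C^2
three_3x3 = 3 * conv_params(3, C, C)   # 27 * C^2
print(single_7x7, three_3x3)  # 3211264 1769472
```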
===Network in network===
Use a small multi-layer perceptron as the convolution kernel: each local block of the input is fed into the perceptron, and its output replaces the usual cross-correlation with a standard linear kernel.
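Since the sliding MLP is applied at every spatial location, it is equivalent to stacking 1x1 convolutions. A toy numpy sketch (assumed shapes and layer sizes, not the paper's code):

```python
import numpy as np

# A 1x1 convolution is a per-pixel fully connected layer, so stacking
# two of them gives the "micro-network" (MLP) that NiN slides over the
# feature map.
def conv1x1(x, w):
    """x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out)."""
    return x @ w

def nin_mlp(x, w1, w2):
    """Two 1x1 conv layers with ReLU: the sliding MLP of Network-in-Network."""
    h = np.maximum(conv1x1(x, w1), 0)  # ReLU
    return np.maximum(conv1x1(h, w2), 0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))  # toy feature map
w1 = rng.standard_normal((16, 32))
w2 = rng.standard_normal((32, 16))
out = nin_mlp(x, w1, w2)
print(out.shape)  # (8, 8, 16)
```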
===GoogLeNet===
Hebbian Principle: Neurons that fire together are typically wired together. 
Implemented using an ''Inception Module''. 
The key idea is to use a heterogeneous set of convolutions. 
Naive idea: apply a 1x1 convolution, a 3x3 convolution, and a 5x5 convolution in parallel and concatenate the outputs.
The intuition is that each captures a different receptive field.
In practice, they need to add 1x1 convolutions before the 3x3 and 5x5 convolutions to make it work. These are used for dimension reduction: they shrink the number of channels, keeping the parameter count and compute manageable.
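A back-of-the-envelope comparison shows why the 1x1 "reduce" layer matters (the channel sizes below are illustrative, not GoogLeNet's exact ones):

```python
# Weight count for the 5x5 branch of an Inception module,
# with and without a 1x1 dimension-reduction layer (biases ignored).
def conv_w(k, c_in, c_out):
    return k * k * c_in * c_out

c_in, c_mid, c_out = 256, 32, 64
direct = conv_w(5, c_in, c_out)                         # 5x5 straight on 256 ch
reduced = conv_w(1, c_in, c_mid) + conv_w(5, c_mid, c_out)  # 1x1 then 5x5
print(direct, reduced)  # 409600 59392
```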
Another idea is to add auxiliary classifiers across the network.
====Inception v2, v3====
V2 adds batch-normalization to reduce dependence on auxiliary classifiers.
V3 adds factored convolutions (i.e. <math>n\times 1</math> and <math>1\times n</math> convolutions).
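Factoring an <math>n\times n</math> kernel into an <math>n\times 1</math> followed by a <math>1\times n</math> drops the weight count from <math>n^2 C^2</math> to <math>2nC^2</math>. A quick check (channel count is an arbitrary example):

```python
# Weight count of a full n x n conv vs. its n x 1 + 1 x n factorization,
# with C input and C output channels (biases ignored).
def factored_savings(n, C):
    full = n * n * C * C
    factored = 2 * n * C * C  # n x 1 layer plus 1 x n layer
    return full, factored

print(factored_savings(7, 128))  # (802816, 229376)
```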
===ResNet===
The main idea is to introduce skip or shortcut connections. 
This idea existed in the literature before.
It means the block returns <math>F(x)+x</math> instead of <math>F(x)</math>.
This allows smoother gradient flow, since intermediate layers cannot block the gradient: the identity path always passes it through.
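The skip connection can be sketched in a few lines (a toy numpy sketch with assumed shapes, not ResNet's actual layers); because the output is <math>F(x)+x</math>, the block's Jacobian is <math>I + \partial F/\partial x</math>, so gradients flow even when <math>\partial F/\partial x</math> is near zero:

```python
import numpy as np

# Minimal residual block: F(x) is any small sub-network (a tanh layer
# here), and the skip connection adds the input back.
def residual_block(x, w):
    fx = np.tanh(x @ w)  # F(x)
    return fx + x        # skip connection

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
w = np.zeros((8, 8))  # extreme case: F(x) = 0 everywhere
out = residual_block(x, w)
print(np.allclose(out, x))  # True: the block defaults to the identity
```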
They also replace 3x3 convolutions on 256 channels with a bottleneck: a 1x1 convolution down to 64 channels, a 3x3 convolution on the 64 channels, then a 1x1 convolution back up to 256 channels.
This reduces the weight count from approximately 600k to approximately 70k.
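The bottleneck arithmetic checks out:

```python
# A plain 3x3 conv on 256 channels vs. the 1x1 -> 3x3 -> 1x1
# bottleneck through 64 channels (biases ignored).
def conv_w(k, c_in, c_out):
    return k * k * c_in * c_out

plain = conv_w(3, 256, 256)
bottleneck = conv_w(1, 256, 64) + conv_w(3, 64, 64) + conv_w(1, 64, 256)
print(plain, bottleneck)  # 589824 69632  (~600k vs ~70k)
```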
===Accuracy vs efficiency===
First we had AlexNet. Then came VGG, which had far more parameters and better accuracy. 
Then came GoogLeNet, which is much smaller than both AlexNet and VGG with roughly the same accuracy. 
Next, ResNet and later Inception variants increased the parameter count slightly and attained better performance.


==Will be on the exam==