Visual Learning and Recognition: Difference between revisions



Read: pdollar A seismic shift in object detection
==Object Detection (Part 2)==
===R-CNN===
;R-CNN at test time
# From an input image, extract ~2k region proposals.
#* Each region proposal is a candidate box that is likely to contain an object.
# For each bounding box:
#* Dilate the proposal.
#* Crop it out and scale to <math>227 \times 227</math>.
#* Pass the crop through the CNN to get a <math>4096</math>-dim feature and classify it using an SVM.
# Refine each proposal with bounding-box regression to predict the final object bounding box.
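The dilate, crop, and scale steps above can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the <code>pad</code> value, and the nearest-neighbour resampling are assumptions made for this example.

```python
import numpy as np

def dilate_box(box, pad, img_w, img_h):
    """Grow a box (x1, y1, x2, y2) by `pad` pixels per side, clipped to the image.

    `pad` is a hypothetical parameter chosen for illustration.
    """
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(img_w, x2 + pad), min(img_h, y2 + pad))

def crop_and_warp(image, box, size=227):
    """Crop the box from an H x W x C image and warp it to size x size.

    Uses nearest-neighbour sampling and ignores aspect ratio (assumption:
    a real pipeline would likely use a smoother interpolation).
    """
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return crop[rows[:, None], cols]

image = np.zeros((100, 200, 3), dtype=np.uint8)  # toy image, H=100, W=200
box = dilate_box((10, 20, 60, 80), pad=16, img_w=200, img_h=100)
patch = crop_and_warp(image, box)
```

Each of the ~2k proposals would go through this step before its own CNN forward pass.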
;Training R-CNN
# First train a CNN for 1000-way ImageNet image classification.
# Fine-tune the CNN for detection from PASCAL VOC.
# Train detection SVMs.
Both training and inference are very slow. 
Extracting RoIs takes a lot of time. 
You then need a separate forward pass for each of the 2k regions to get features. 
Inference on a single image takes almost a minute.
===SPP-net===
Makes R-CNN fast using a spatial pyramid pooling (SPP) layer.
# Run a frozen CNN over the whole image to get a feature map.
# Map boxes from region proposals generated by selective search to the feature map.
# For each region, resize to <math>7 \times 7 \times 256</math>, do SPP and pass to an FC network to get bbox and class.
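Steps 2 and 3 above can be sketched as follows: a box in image coordinates is projected onto the feature map, and the variable-sized region is max-pooled into a fixed grid. This is a minimal sketch under stated assumptions: the stride of 16, the function names, and the single <math>7 \times 7</math> pooling level are choices made for this example (a full SPP layer would concatenate several grid sizes).

```python
import numpy as np

def project_box(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cells.

    `stride` is the assumed total downsampling factor of the CNN.
    """
    x1, y1, x2, y2 = box
    fx1, fy1 = x1 // stride, y1 // stride
    fx2 = max(fx1 + 1, -(-x2 // stride))  # ceiling division, at least one cell wide
    fy2 = max(fy1 + 1, -(-y2 // stride))
    return fx1, fy1, fx2, fy2

def pool_region(feature_map, fbox, out_size=7):
    """Max-pool an H x W x C feature-map region into out_size x out_size x C."""
    fx1, fy1, fx2, fy2 = fbox
    region = feature_map[fy1:fy2, fx1:fx2]
    h, w, c = region.shape
    ys = np.linspace(0, h, out_size + 1)  # bin edges along the height
    xs = np.linspace(0, w, out_size + 1)  # bin edges along the width
    out = np.empty((out_size, out_size, c), dtype=feature_map.dtype)
    for i in range(out_size):
        y0 = int(np.floor(ys[i]))
        y1 = max(y0 + 1, int(np.ceil(ys[i + 1])))  # every bin holds >= 1 cell
        for j in range(out_size):
            x0 = int(np.floor(xs[j]))
            x1 = max(x0 + 1, int(np.ceil(xs[j + 1])))
            out[i, j] = region[y0:y1, x0:x1].max(axis=(0, 1))
    return out

fmap = np.random.rand(38, 50, 256)          # toy feature map for one image
fbox = project_box((64, 32, 320, 240))      # box in image pixels
pooled = pool_region(fmap, fbox)            # fixed-size input for the FC layers
```

The CNN runs once per image, and only this cheap pooling step is repeated per region.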
Hard-mining: 
For each of the 2000 boxes, you have IOU_foreground and IOU_background.
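The IoU (intersection over union) between a proposal and a reference box can be computed as in the sketch below. The function name and the <code>(x1, y1, x2, y2)</code> box format are assumptions for this example; how the resulting scores are thresholded for hard mining is not specified here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, zero if the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

score = iou((0, 0, 10, 10), (5, 5, 15, 15))  # overlap 25, union 175
```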
===Fast R-CNN===
Makes the whole network trainable.
'''Exam Question'''


==Will be on the exam==