Visual Learning and Recognition: Difference between revisions

Visual Learning and Recognition (view source)

1,311 bytes added , 15 October 2020

@@ Line 512: / Line 512: @@
 Read: pdollar A seismic shift in object detection
+==Object Detection (Part 2)==
+===R-CNN===
+;R-CNN at test time
+# From an input image, they extract ~2k region proposals.
+#* Each region proposal is likely to contain an object.
+# For each bounding box:
+#* Dilate the proposal.
+#* Crop it out and scale to <math>227 \times 227</math>.
+#* Convert to <math>4096</math>-dim feature and do classification using an SVM.
+# Do object proposal refinement to predict object bounding box.
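The per-box preprocessing in the test-time steps above (dilate, crop, scale) can be sketched roughly as follows. This is a minimal illustration, not the R-CNN implementation: the helper names <code>dilate_box</code> and <code>crop_and_scale</code>, the padding of 16 pixels, the image size, and the nearest-neighbour resize are all assumptions made for the example. Only the <math>227 \times 227</math> output size comes from the notes.

```python
def dilate_box(box, pad, img_w, img_h):
    """Grow an (x1, y1, x2, y2) box by `pad` pixels per side, clipped to the image.

    `pad` is a hypothetical parameter; the notes only say "dilate the proposal".
    """
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(img_w, x2 + pad), min(img_h, y2 + pad))


def crop_and_scale(image, box, size=227):
    """Crop `box` from `image` (a list of pixel rows) and resize it to
    size x size using nearest-neighbour sampling (an assumed, simple resize)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    out = []
    for r in range(size):
        # Map the output row back to a source row inside the crop.
        src_row = image[y1 + min(h - 1, r * h // size)]
        out.append([src_row[x1 + min(w - 1, c * w // size)] for c in range(size)])
    return out


# Toy 640x480 "image" whose pixels record their own (row, col) position.
image = [[(r, c) for c in range(640)] for r in range(480)]
box = dilate_box((100, 50, 300, 200), pad=16, img_w=640, img_h=480)
patch = crop_and_scale(image, box)
```

In the actual pipeline the resulting patch would be passed through the CNN to get the <math>4096</math>-dim feature; that step is omitted here.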
+;Training R-CNN
+# First train a CNN for 1000-way ImageNet image classification.
+# Fine-tune the CNN for detection from PASCAL VOC.
+# Train detection SVMs.
+Both training and inference are very slow.
+Extracting the RoIs (region proposals) takes a lot of time.
+You then need a separate CNN forward pass for each of the ~2k regions to get its features.
+Inference on a single image takes almost 1 minute.
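As a rough back-of-the-envelope check on these numbers, assume the per-image time splits into a proposal stage and <math>N</math> independent forward passes. The symbols <math>T_{\text{prop}}</math> and <math>t_{\text{fwd}}</math> are introduced here for illustration only:

<math>T_{\text{image}} \approx T_{\text{prop}} + N \cdot t_{\text{fwd}}</math>

With <math>N \approx 2000</math> and <math>T_{\text{image}} \approx 60\,\text{s}</math>, even ignoring <math>T_{\text{prop}}</math> this gives <math>t_{\text{fwd}} \lesssim 60 / 2000 = 0.03\,\text{s}</math> per region. So the total is dominated by repeating the forward pass thousands of times, not by any single pass being slow.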
+===SPP-net===
+Makes R-CNN fast using a spatial pyramid pooling (SPP) layer.
+# Run a frozen CNN over the whole image to get a feature map.
+# Map boxes from region proposals generated by selective search to the feature map.
+# For each region, resize to <math>7 \times 7 \times 256</math>, do SPP and pass to an FC network to get bbox and class.
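A minimal single-channel sketch of the last two steps (mapping a box onto the feature map, then pooling it to a fixed-length vector) might look like this. The stride of 16, the pyramid levels <code>(1, 2, 4)</code>, and the function names are assumptions for illustration, not values taken from the notes. A real feature map would have 256 channels, each pooled the same way.

```python
import math


def project_box(box, stride=16):
    """Map an image-space (x1, y1, x2, y2) box to feature-map cells.

    `stride` is the assumed total downsampling factor of the CNN.
    The box is assumed to lie inside the image.
    """
    x1, y1, x2, y2 = box
    fx1, fy1 = x1 // stride, y1 // stride
    fx2 = max(fx1 + 1, math.ceil(x2 / stride))  # keep at least one cell
    fy2 = max(fy1 + 1, math.ceil(y2 / stride))
    return fx1, fy1, fx2, fy2


def max_pool_grid(fmap, box, n):
    """Max-pool the region `box` of a 2D feature map into an n x n grid."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    pooled = []
    for i in range(n):
        r0 = (i * h) // n
        r1 = max(r0 + 1, math.ceil((i + 1) * h / n))  # each bin is non-empty
        for j in range(n):
            c0 = (j * w) // n
            c1 = max(c0 + 1, math.ceil((j + 1) * w / n))
            pooled.append(max(fmap[y1 + r][x1 + c]
                              for r in range(r0, r1) for c in range(c0, c1)))
    return pooled


def spp(fmap, box, levels=(1, 2, 4)):
    """Concatenate max-pooled grids of several sizes into one fixed-length vector."""
    out = []
    for n in levels:
        out.extend(max_pool_grid(fmap, box, n))
    return out


# Toy 8x8 feature map whose value encodes its position.
fmap = [[r * 8 + c for c in range(8)] for r in range(8)]
fbox = project_box((32, 16, 96, 80))
feats = spp(fmap, fbox)
```

Whatever the size of the proposal, <code>feats</code> always has <code>1 + 4 + 16 = 21</code> entries per channel, which is what lets a fixed-size FC network follow.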
+Hard mining:
+For each of the ~2k boxes, you have <math>\mathrm{IoU}_{\text{foreground}}</math> and <math>\mathrm{IoU}_{\text{background}}</math>.
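For reference, the IoU (intersection over union) between a proposal and another box can be computed as below. This is a generic sketch with boxes given as <code>(x1, y1, x2, y2)</code> corners. How the foreground and background IoU values are then thresholded for mining is not specified in these notes, so it is left out.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped to zero when the boxes are disjoint.
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175
```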
+===Fast R-CNN===
+Makes the whole network trainable, including the CNN that SPP-net keeps frozen.
+'''Exam Question'''
 ==Will be on the exam==