Visual Learning and Recognition: Difference between revisions

Visual Learning and Recognition (view source)

25 bytes added , 8 October 2020

5,337

edits

@@ Line 440: / Line 440: @@
 The idea is that they train a CNN to do object detection over the entire image.
 The CNN outputs multiple feature maps for each of the categories, each with different aspect ratios and scales.
-Pixels of the feature maps are ''default boxes'', representing a default bounding box.
+Pixels of the feature maps are scores for ''default boxes''; each pixel is associated with a default bounding box.
-Each feature map gives candidate results which are filtered using non-maximum suppression.
+The candidate results from the feature maps are filtered using non-maximum suppression.
 Different scales are achieved by extracting feature maps from intermediate layers of the network.
 The aspect ratio of each default box does not actually correspond to the receptive field associated with the feature pixel.