2015 CVPR
1. Single-Scale Architecture
2. Extraction of High-Level Features
(1) We consider a small sub-volume of the feature-map stack produced at each layer. The sub-volume is centered on the patch center in order to assess the presence of a contour in a small area around the candidate point.
(2) We perform max, average, and center pooling on this sub-volume, where center pooling is defined as selecting the center value from each of the feature maps (a code sketch follows this list).
(3) Because the candidate point is located at the center of the input patch, center pooling extracts the activation value at the location corresponding to the candidate point.
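A minimal NumPy sketch of the three pooling operations over a centered sub-volume. The function name `extract_pooled_features` and the assumption of an odd sub-volume side `k` are illustrative, not taken from the paper's released code:

```python
import numpy as np

def extract_pooled_features(feature_maps: np.ndarray, k: int) -> np.ndarray:
    """Pool a k x k sub-volume centered on the feature maps.

    feature_maps: (C, H, W) stack produced by one conv layer for a patch
                  whose candidate point sits at the patch center.
    k:            spatial side of the sub-volume (e.g. 7, 5, or 3), assumed odd.
    Returns a vector of length 3*C: [max, average, center] per channel.
    """
    C, H, W = feature_maps.shape
    cy, cx = H // 2, W // 2                      # candidate point location
    half = k // 2
    sub = feature_maps[:, cy - half:cy + half + 1, cx - half:cx + half + 1]

    max_pool = sub.max(axis=(1, 2))              # strongest response near the point
    avg_pool = sub.mean(axis=(1, 2))             # average response near the point
    center_pool = feature_maps[:, cy, cx]        # activation at the point itself

    return np.concatenate([max_pool, avg_pool, center_pool])
```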
3. Bifurcated Sub-Network
(1) We connect the pooled feature maps from the five convolutional layers to two separately trained network branches, each consisting of two fully-connected layers.
(2) The first branch is trained with binary labels to perform contour classification. Because it sees only binary labels, it makes coarser, less selective predictions: it simply classifies whether a given point is a contour or not.
(3) The second branch is optimized as a regressor to predict the fraction of human labelers agreeing on the contour's presence at a particular point, so it learns the structural differences between contours marked by different fractions of labelers.
(4) At test time, the scalar outputs of these two sub-networks are averaged into a final score indicating the probability that the candidate point is a contour (a sketch of the two branches follows this list).
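A hedged PyTorch sketch of this bifurcated head, assuming the pooled features are concatenated into one vector. The hidden width and the sigmoid on both outputs are my assumptions, and the separate training of the two branches is not shown here, only the structure and the test-time fusion:

```python
import torch
import torch.nn as nn

class BifurcatedHead(nn.Module):
    """Two independent two-layer FC branches over the pooled feature vector."""

    def __init__(self, in_dim: int, hidden_dim: int = 1024):  # hidden_dim is a placeholder
        super().__init__()
        # Branch 1: binary contour classification.
        self.cls_branch = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # Branch 2: regression of the fraction of labelers marking a contour.
        self.reg_branch = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # Sigmoids keep both scalars in [0, 1] so averaging is meaningful
        # (the exact output nonlinearity is an assumption).
        p_cls = torch.sigmoid(self.cls_branch(pooled))
        p_reg = torch.sigmoid(self.reg_branch(pooled))
        # Test-time fusion: average the two outputs into one contour score.
        return 0.5 * (p_cls + p_reg)
```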
4. Training Labels
Binary labels: we first sample 40,000 positive examples that were marked as contours by at least one of the labelers.
Negative examples: points selected as candidate contour points by the Canny edge detector but not marked as contours by any of the human labelers.
Regression labels: the fraction of human labelers who marked the point as a contour (a label-construction sketch follows).
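An illustrative sketch of how these labels could be assembled, assuming one binary contour mask per labeler. The Canny thresholds and all names here are assumptions, and the actual sampling of the 40,000 positives is omitted:

```python
import numpy as np
import cv2

def build_labels(image_gray: np.ndarray, human_masks: np.ndarray):
    """image_gray: (H, W) uint8 image; human_masks: (L, H, W) binary masks, one per labeler."""
    # Regression label: fraction of the L labelers who marked each pixel.
    agreement = human_masks.astype(np.float32).mean(axis=0)

    # Candidate contour points from the Canny edge detector
    # (thresholds 100/200 are placeholder values).
    candidates = cv2.Canny(image_gray, 100, 200) > 0

    # Positives: marked as a contour by at least one labeler.
    positives = np.argwhere(agreement > 0)
    # Negatives: Canny candidates that no human labeler marked.
    negatives = np.argwhere(candidates & (agreement == 0))

    binary_label = (agreement > 0)  # classification target
    return positives, negatives, binary_label, agreement
```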
5. Multi-Scale Architecture
(1) We extract patches of several sizes around the candidate point so that they cover different spatial extents of the image. We then resize the patches to fit the KNet input and feed them in parallel through the five convolutional layers.
(2) The patch sizes are 64×64, 128×128, 196×196, and the full image; all patches are then resized to the KNet input dimensions of 227×227 (see the extraction sketch after this list).
(3) We use sub-volumes of convolutional feature maps with spatial sizes 7×7, 5×5, 3×3, 3×3, and 3×3 for convolutional layers 1 through 5, respectively. These sizes are chosen so that roughly the same spatial extent of the original image is considered at each layer.
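A minimal sketch of the multi-scale patch extraction, assuming reflection padding for patches that overrun the image border (a border-handling detail these notes do not specify):

```python
import numpy as np
import cv2

SCALES = [64, 128, 196]  # plus the full image, per the notes above

def multiscale_patches(image: np.ndarray, y: int, x: int) -> np.ndarray:
    """image: (H, W, 3) uint8 array; (y, x): candidate contour point."""
    patches = []
    for s in SCALES:
        half = s // 2
        # Pad so that patches near the border stay centered on (y, x).
        padded = np.pad(image, ((half, half), (half, half), (0, 0)),
                        mode="reflect")
        crop = padded[y:y + s, x:x + s]            # s x s patch centered on (y, x)
        patches.append(cv2.resize(crop, (227, 227)))
    patches.append(cv2.resize(image, (227, 227)))  # full-image scale
    return np.stack(patches)                       # (4, 227, 227, 3)
```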