论文笔记 | OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

Author

Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun

Abstract

We show how a multi-scale and sliding window approach can be implemented in a ConvNet. We also introduce a novel approach to localization by learning to predict boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence.

3 Classification

Our classification architecture is similar to Krizhevsky's network; however, we improve on the network design and the inference step.

3.1 Model Design and training

Each image is downsampled so that its smallest dimension is 256 pixels. We then extract 5 random 221x221 crops, together with their horizontal flips.
The architecture sizes are detailed in the following tables:
[Table images: layer-by-layer architecture sizes]
Layers 1-5 are similar to Krizhevsky's, using ReLU and max pooling, but with the following differences:
1. no contrast normalization is used
2. pooling regions are non-overlapping
3. the 1st and 2nd layer feature maps are larger, thanks to a smaller stride (2 instead of 4)
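To make difference (2) concrete, here is a minimal NumPy sketch (not from the paper) contrasting non-overlapping pooling, where the stride equals the window size, with AlexNet-style overlapping pooling:

```python
import numpy as np

def max_pool2d(x, size, stride):
    """Max-pool a 2D map with a square window of the given size and stride."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)

# OverFeat-style: non-overlapping pooling (stride == window size)
non_overlap = max_pool2d(x, size=2, stride=2)   # 3x3 output, windows disjoint

# AlexNet-style: overlapping pooling (stride < window size)
overlap = max_pool2d(x, size=3, stride=2)       # 2x2 output, windows overlap
```

The window and input sizes here are toy values chosen for illustration, not the actual layer dimensions from the paper's tables.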

3.3 Multi-scale classification

Multi-view voting has drawbacks: 1) it can ignore many regions of the image; 2) it is computationally redundant when views overlap; 3) it operates at a single scale only. Hence: multi-scale classification.
[Figure images: the multi-scale classification and pooling scheme]
1. For a single image, at a given scale, we start with the unpooled layer-5 feature maps.
2. Each unpooled map undergoes a 3x3 non-overlapping max-pooling operation, repeated 3x3 times for pixel offsets (Δx, Δy) ∈ {0, 1, 2} in x and y (Fig. 3).
3. This produces a set of pooled feature maps, one per offset.
4. The classifier (filter) has a fixed 5x5 input size and is applied in sliding-window fashion to each pooled map, yielding C-dimensional output maps.
5. The output maps for the different offset combinations are interleaved into a single 3D output map.
The procedure above is repeated for the horizontally flipped version of each image.
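Steps 1-3 above can be sketched in NumPy as follows; this is a single-channel toy illustration (the feature-map size and channel count are made up, not taken from the paper's tables):

```python
import numpy as np

def max_pool_offsets(fmap, p=3):
    """3x3 non-overlapping max pooling, repeated at every offset (dy, dx) in {0..p-1}^2."""
    pooled = {}
    for dy in range(p):
        for dx in range(p):
            sub = fmap[dy:, dx:]                       # shift the map by the offset
            h = (sub.shape[0] // p) * p                # crop to a multiple of p
            w = (sub.shape[1] // p) * p
            sub = sub[:h, :w]
            pooled[(dy, dx)] = sub.reshape(h // p, p, w // p, p).max(axis=(1, 3))
    return pooled

# toy unpooled layer-5 map (single channel for clarity)
fmap = np.random.rand(17, 17)
pooled = max_pool_offsets(fmap)
# each of the 9 pooled maps then feeds the sliding 5x5 classifier (step 4);
# interleaving the resulting outputs (step 5) recovers single-pixel resolution
```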
We then produce the final classification by:
1. taking the spatial max for each class, at each scale and flip
2. averaging the resulting C-dimensional vectors from the different scales and flips
3. taking the top-1 or top-5 elements of the mean class vector.
The exhaustive pooling scheme (single-pixel shifts Δx, Δy) ensures that we can obtain fine alignment between the classifier window and the object.
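The three voting steps can be sketched as below; the per-view output shapes and the class count are hypothetical placeholders:

```python
import numpy as np

C = 10  # hypothetical number of classes

# hypothetical classifier output maps, one (C, H, W) map per (scale, flip) view
outputs = [np.random.rand(C, h, w) for (h, w) in [(3, 4), (5, 7), (2, 2)]]

# 1. spatial max per class, for each scale/flip
per_view = [o.max(axis=(1, 2)) for o in outputs]   # list of C-dim vectors

# 2. average the C-dim vectors across scales and flips
mean_scores = np.mean(per_view, axis=0)            # shape (C,)

# 3. take the top-1 (or top-5) classes from the mean class vector
top5 = np.argsort(mean_scores)[::-1][:5]
```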

3.5 Efficiency


Note that the last layers of our architecture are fully connected linear layers. At test time, these layers are effectively replaced by convolution operations with kernels of 1x1 spatial extent. The entire ConvNet is then simply a sequence of convolution, max-pooling and thresholding operations exclusively.
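A small NumPy sketch of the underlying equivalence: a fully connected layer over a fixed-size feature map is the same computation as a convolution whose kernel covers that map, and the convolutional form slides over larger test-time inputs for free. The layer sizes (5x5x8 features, 100 units) are made-up toy values:

```python
import numpy as np

def conv_valid(x, w):
    """Valid cross-correlation: x is (H, W, Cin), w is (kh, kw, Cin, Cout)."""
    kh, kw, cin, cout = w.shape
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((H, W, cout))
    for i in range(H):
        for j in range(W):
            patch = x[i:i+kh, j:j+kw, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
fc_w = rng.standard_normal((5 * 5 * 8, 100))   # FC layer: 5x5x8 features -> 100 units

# On the training-size input, the FC layer and the equivalent 5x5 conv agree:
x = rng.standard_normal((5, 5, 8))
fc_out = x.reshape(-1) @ fc_w                  # flatten, then matrix multiply
conv_w = fc_w.reshape(5, 5, 8, 100)            # same weights, viewed as a kernel
conv_out = conv_valid(x, conv_w)               # (1, 1, 100)
assert np.allclose(fc_out, conv_out[0, 0])

# On a larger test-time input, the same kernel slides, yielding a spatial map:
x_big = rng.standard_normal((7, 9, 8))
big_out = conv_valid(x_big, conv_w)            # (3, 5, 100) map of class scores
```

Subsequent fully connected layers then act on the 100-channel map as 1x1 convolutions.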

4 Localization

Replace the classifier layers with a regression network and train it to predict object bounding boxes at each spatial location and scale. Then we combine the regression predictions together.

4.2 Regressor training

[Figure images: regressor training]

4.3 Combining predictions

  1. Cs ← the set of top-k classes found for each scale s
  2. Bs ← the set of bounding boxes predicted by the regressor for each class in Cs, at scale s
  3. B ← ∪s Bs
  4. (b1*, b2*) = argmin_{b1 ≠ b2 ∈ B} match_score(b1, b2)
  5. if match_score(b1*, b2*) < t:
    B ← (B \ {b1*, b2*}) ∪ {box_merge(b1*, b2*)}, then repeat from step 4; otherwise stop.
    match_score = the sum of the distance between the centers of the two bounding boxes and the intersection area of the boxes
    box_merge = the average of the bounding boxes' coordinates
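A sketch of the greedy merging loop in plain Python. The exact way the center distance and intersection area are combined is not fully specified; here, as one illustrative choice, the intersection area is subtracted from the distance so that heavily overlapping boxes get a low (good) score:

```python
def intersection_area(b1, b2):
    """Boxes are (x1, y1, x2, y2) tuples."""
    w = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    h = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    return w * h

def center_dist(b1, b2):
    c1 = ((b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2)
    c2 = ((b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2)
    return ((c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2) ** 0.5

def match_score(b1, b2):
    # assumption: distance minus intersection, so similar boxes score low
    return center_dist(b1, b2) - intersection_area(b1, b2)

def box_merge(b1, b2):
    # average of the two boxes' coordinates
    return tuple((a + b) / 2 for a, b in zip(b1, b2))

def merge_boxes(boxes, t):
    """Greedily merge the best-matching pair until no pair scores below t."""
    boxes = list(boxes)
    while len(boxes) > 1:
        pairs = [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))]
        i, j = min(pairs, key=lambda p: match_score(boxes[p[0]], boxes[p[1]]))
        if match_score(boxes[i], boxes[j]) >= t:
            break
        merged = box_merge(boxes[i], boxes[j])
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]
    return boxes
```

For example, two nearly coincident boxes collapse into their coordinate average, while a distant box survives untouched.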

6 Discussion

Our approach might still be improved in several ways:
1. For localization, we are not currently back-propagating through the whole network.
2. An IoU criterion could take the place of the l2 loss in our approach.
3. Alternate parameterizations of the bounding box may help to decorrelate the outputs, which would aid network training.

7 Conclusion
