Multi-Scale Analysis in SSD

Choosing scales and aspect ratios for default boxes
To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction, we can mimic the same effect while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality, because the lower layers capture more fine details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results. Motivated by these methods, we use both lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8×8 and 4×4) that are used in the framework. In practice, we can use many more with small computational overhead.

[Figure 1: two exemplar feature maps (8×8 and 4×4) used for prediction in the SSD framework]
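As a concrete illustration of predicting from several layers at once, here is a minimal PyTorch sketch (hypothetical class and parameter names, not the authors' implementation) that attaches a 3×3 convolutional head to each feature map and concatenates the flattened per-location predictions:

```python
import torch
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    """One small prediction head per feature map, outputs concatenated.

    Each head predicts boxes_per_loc * (num_classes + 4) values per
    spatial location: class scores plus 4 box offsets per default box.
    Channel counts and box counts here are illustrative assumptions.
    """
    def __init__(self, channels=(512, 1024), boxes_per_loc=(4, 6), num_classes=21):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Conv2d(c, b * (num_classes + 4), kernel_size=3, padding=1)
            for c, b in zip(channels, boxes_per_loc)
        )

    def forward(self, feature_maps):
        outs = []
        for fmap, head in zip(feature_maps, self.heads):
            p = head(fmap)                        # (N, b*(C+4), H, W)
            p = p.permute(0, 2, 3, 1).flatten(1)  # one row of predictions per image
            outs.append(p)
        return torch.cat(outs, dim=1)             # predictions from all scales

heads = MultiScaleHeads()
f8 = torch.randn(1, 512, 8, 8)     # e.g. the 8x8 map from Figure 1
f4 = torch.randn(1, 1024, 4, 4)    # e.g. the 4x4 map
preds = heads([f8, f4])            # shape (1, 8*8*4*25 + 4*4*6*25) = (1, 8800)
```

Because every head is just a convolution over an existing feature map, adding more output layers costs little extra computation, which is the point the paragraph above makes.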
Multiple output layers at different resolutions are better. A major contribution of SSD is the use of default boxes of different scales on different output layers. To measure the advantage this brings, we progressively remove layers and compare results. For a fair comparison, every time we remove a layer, we adjust the default box tiling to keep the total number of boxes similar to the original (8732). This is done by stacking more scales of boxes on the remaining layers and adjusting box scales where needed. We do not exhaustively optimize the tiling for each setting. Table 3 shows accuracy dropping monotonically from 74.3 to 62.4 mAP as layers are removed.
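As a sanity check on that box budget, the following sketch (assuming the standard SSD300 feature-map sizes and per-location box counts, which come from the default configuration rather than from this ablation) reproduces the 8732 count and the paper's linear scale rule with s_min = 0.2 and s_max = 0.9:

```python
# Standard SSD300 tiling (assumed defaults): feature-map side lengths and
# the number of default boxes placed at each location.
feature_map_sizes = [38, 19, 10, 5, 3, 1]   # conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2
boxes_per_location = [4, 6, 6, 6, 4, 4]

total = sum(f * f * b for f, b in zip(feature_map_sizes, boxes_per_location))
print(total)  # 8732 -- the box budget the ablation keeps roughly constant

# Linear scale rule from the paper: s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
s_min, s_max, m = 0.2, 0.9, 6
scales = [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.9][:... -> 0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Removing a layer and restacking its scales onto the survivors keeps `total` roughly fixed, so any accuracy change can be attributed to where the boxes live rather than how many there are.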
When we stack boxes of multiple scales on a layer, many lie on the image boundary and need to be handled carefully. We tried the strategy used in Faster R-CNN [2], ignoring boxes that lie on the boundary (a sketch of this pruning follows the paragraph), and observed some interesting trends. For example, performance drops by a large margin if we use only very coarse feature maps (e.g. conv11_2 (1×1) or conv10_2 (3×3)). The reason might be that after the pruning we do not have enough large boxes left to cover large objects. When we primarily use finer-resolution maps, performance starts increasing again because a sufficient number of large boxes survives the pruning. If we use only conv7 for prediction, performance is the worst, reinforcing the message that it is critical to spread boxes of different scales over different layers. Moreover, since our predictions do not rely on ROI pooling as in [6], we do not have the collapsing-bins problem on low-resolution feature maps [23]. The SSD architecture combines predictions from feature maps of various resolutions to achieve accuracy comparable to Faster R-CNN while using lower-resolution input images.
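The boundary pruning mentioned above can be sketched as follows; this is a minimal version assuming axis-aligned boxes in pixel coordinates, and the helper name is hypothetical, not taken from [2]:

```python
import numpy as np

def keep_inside(boxes, img_w, img_h):
    """Drop boxes that cross the image boundary, Faster R-CNN style.

    boxes: (N, 4) array of (xmin, ymin, xmax, ymax) in pixels.
    Returns the surviving boxes and the boolean keep mask.
    """
    keep = (
        (boxes[:, 0] >= 0) & (boxes[:, 1] >= 0) &
        (boxes[:, 2] <= img_w) & (boxes[:, 3] <= img_h)
    )
    return boxes[keep], keep

# Large boxes tiled on a coarse map (e.g. 3x3 or 1x1) often extend past the
# image edges, so this pruning removes most of them -- consistent with the
# accuracy drop the text reports for coarse-map-only settings.
```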
