Some thoughts on two-stage detection algorithms

Dlyldxwl

Article link: https://blog.csdn.net/Dlyldxwl/article/details/78825939

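All two-stage detection algorithms consist of roughly two parts: an RPN that generates proposals, followed by classification (cls) and box regression (reg) on those proposals. I used Faster R-CNN for my undergraduate thesis, so I knew this family of algorithms a little, but many points still confused me. I recently read Light-Head R-CNN carefully, compared the two-stage algorithms, and resolved some of those earlier doubts. They are listed one by one below:
1. The relationship between anchors and proposals in the RPN. Take Faster R-CNN as an example and assume the feature map fed to the RPN is 13*13. With 3 scales and 3 aspect ratios defined, k=9 boxes are considered at every point of the 13*13 feature map; the anchors are simply the set of rectangles produced by rpn/generate_anchors.py. The cls layer of the RPN outputs (2*9)*13*13 classification scores. To run softmax over background/foreground for all of these boxes, the (2*9)*13*13 blob is reshaped to 2*(9*13)*13 and reshaped back after the softmax. At the same time the reg layer outputs (4*9)*13*13 position offsets. Note that it does not output box positions directly; the positions are computed from the offsets with the formulas in the paper.
Next, the proposal layer performs the following steps: apply bbox regression to all boxes using [dx(A), dy(A), dw(A), dh(A)]; keep the pre_nms_topN (e.g. 1000) boxes with the highest foreground scores, i.e. the top 1000 foreground boxes after position refinement; clip boxes that extend beyond the image to the image boundary (so that proposals do not fall outside the image in the later RoI pooling); discard very small boxes; then run NMS and output the top 300 boxes. These output boxes are the proposals. All of this can be regarded as the predict pass.
During training there is no need to backpropagate through all 9*13*13 anchor boxes. What is done instead? A mini-batch of 256 anchors is sampled at random per image, with positive and negative anchors at a ratio of up to 1:1. Positive and negative are defined as follows: an anchor box whose IoU with a GT box is > 0.7 is foreground; an anchor box whose IoU is < 0.3 is background; anchor boxes with IoU between 0.3 and 0.7 do not take part in training. (The exact numbers may not be accurate; the point is the idea. More proposals are passed on during training than during prediction, and IoU against GT is computed only during training.)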
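The proposal-layer steps described above can be sketched in plain Python. This is a minimal illustration, not the py-faster-rcnn implementation: the function names (`decode`, `iou`, `propose`), the corner-based box convention without a +1 offset, and the default thresholds (`nms_thresh=0.7`, `min_size=16`) are assumptions made for the example, and the step order follows the description in this post.

```python
import math

def decode(anchor, delta):
    """Apply (dx, dy, dw, dh) to an anchor given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = delta
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the centre in units of the anchor size, scale the size exponentially.
    pcx, pcy = cx + dx * w, cy + dy * h
    pw, ph = w * math.exp(dw), h * math.exp(dh)
    return (pcx - 0.5 * pw, pcy - 0.5 * ph, pcx + 0.5 * pw, pcy + 0.5 * ph)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def propose(anchors, deltas, fg_scores, im_w, im_h,
            pre_nms_top_n=1000, post_nms_top_n=300,
            nms_thresh=0.7, min_size=16):
    """Return a list of (score, box) proposals, highest score first."""
    # 1. bbox regression on every anchor
    boxes = [decode(a, d) for a, d in zip(anchors, deltas)]
    # 2. keep the pre_nms_top_n boxes with the highest foreground score
    order = sorted(range(len(boxes)), key=lambda i: fg_scores[i], reverse=True)
    candidates = []
    for i in order[:pre_nms_top_n]:
        x1, y1, x2, y2 = boxes[i]
        # 3. clip to the image boundary
        box = (min(max(x1, 0.0), im_w - 1.0), min(max(y1, 0.0), im_h - 1.0),
               min(max(x2, 0.0), im_w - 1.0), min(max(y2, 0.0), im_h - 1.0))
        # 4. drop very small boxes
        if box[2] - box[0] >= min_size and box[3] - box[1] >= min_size:
            candidates.append((fg_scores[i], box))
    # 5. greedy NMS over the score-sorted candidates, then keep post_nms_top_n
    keep = []
    for score, box in candidates:
        if all(iou(box, kept) <= nms_thresh for _, kept in keep):
            keep.append((score, box))
        if len(keep) >= post_nms_top_n:
            break
    return keep
```

Calling `propose` with zero deltas simply returns the surviving anchors themselves, which is a convenient way to check the clipping, size filter, and NMS in isolation.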
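2. Why RoI pooling needs two inputs, the proposals and the feature map. In the third step of the proposal layer (boundary clipping), boxes are mapped back to the original image to check whether they cross the boundary, so the proposals it outputs are at the scale of the MxN input image. To run the subsequent cls and reg, each proposal must therefore be scaled down to the feature map, its corresponding region located there, and pooling applied to that region. That is why two inputs are needed; the root cause is that proposals are expressed in original-image coordinates.
Here is the article that explains Faster R-CNN most clearly: link.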
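A small sketch of why both inputs are needed: the RoI arrives in image coordinates, is scaled onto the feature map, and is then split into a fixed grid of bins that are max-pooled. The function name `roi_max_pool`, the single-channel list-of-lists feature map, and the floor/ceil bin rounding are assumptions for illustration, not the Caffe layer itself.

```python
import math

def roi_max_pool(feature, roi, pooled=7, spatial_scale=1.0 / 16):
    """Max-pool one image-space RoI (x1, y1, x2, y2) from a 2-D feature map."""
    height, width = len(feature), len(feature[0])
    # Project the RoI from image coordinates onto the feature map.
    x1, y1, x2, y2 = [v * spatial_scale for v in roi]
    bin_w = max(x2 - x1, 1.0) / pooled
    bin_h = max(y2 - y1, 1.0) / pooled
    out = []
    for i in range(pooled):
        row = []
        for j in range(pooled):
            # Feature-map cells covered by bin (i, j), clamped to the map.
            hs = min(max(int(math.floor(y1 + i * bin_h)), 0), height)
            he = min(max(int(math.ceil(y1 + (i + 1) * bin_h)), 0), height)
            ws = min(max(int(math.floor(x1 + j * bin_w)), 0), width)
            we = min(max(int(math.ceil(x1 + (j + 1) * bin_w)), 0), width)
            cells = [feature[y][x] for y in range(hs, he) for x in range(ws, we)]
            row.append(max(cells) if cells else 0.0)
        out.append(row)
    return out

# An 8x8 feature map corresponds to a 128x128 image at stride 16.
feature = [[r * 8 + c for c in range(8)] for r in range(8)]
pooled = roi_max_pool(feature, (0, 0, 64, 64), pooled=2)
```

Whatever the size of the RoI, the output is always `pooled` x `pooled`, which is what lets the following layers have a fixed input size.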
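3. R-FCN introduces PSROIPooling, which brings in position-sensitive score maps. Put simply: if the output of the RoI pooling layer is fed straight into fully connected layers for cls and reg, the detection network becomes insensitive to position; but if every single proposal is passed through several convolutional layers, the computation is too heavy and takes too long. R-FCN gets the best of both. On top of the last feature map of the backbone it adds one more convolutional layer. What is it for? It handles the position-sensitivity problem.
Each RoI is divided into k*k bins, so this convolutional layer has k*k*(c+1) output channels. How should these channels be understood? Consider the class person: for detecting a person, top-center is the head and bottom-center is the feet. The detection result for the RoI is obtained by detecting each bin and combining the results. If the top-center, bottom-center, and other bins of an RoI are detected as head, feet, and the other parts of a person, the RoI is judged to be a person. So one feature map is generated with the head as its target (this feature map detects the top-center of a person), another with the feet as its target, and likewise one for each of the other bins. That gives k*k feature maps for the person class, and k*k*(c+1) feature maps in total for the c+1 classes: the score maps. PSROIPooling is then used to obtain the detection result for each RoI. Imagine the RoI is shifted slightly: every bin moves, so the combined result of the per-bin detections changes considerably, and the position sensitivity is self-evident.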
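The bin-by-bin voting can be illustrated with a toy PSROIPooling in plain Python. This is a sketch under assumptions: the name `psroi_pool`, the class-major channel layout `(c * k + i) * k + j`, and the hand-built score maps (`quadrant_map`) are made up for the example. Each bin (i, j) of class c reads only its own channel, averages it over the bin, and the k*k bin values are averaged into the class score.

```python
import math

def psroi_pool(score_maps, roi, num_classes, k, spatial_scale=1.0 / 16):
    """Position-sensitive pooling plus average voting; returns one score per class."""
    height, width = len(score_maps[0]), len(score_maps[0][0])
    x1, y1, x2, y2 = [v * spatial_scale for v in roi]
    bin_w = max(x2 - x1, 0.1) / k
    bin_h = max(y2 - y1, 0.1) / k
    scores = []
    for c in range(num_classes):
        votes = []
        for i in range(k):
            for j in range(k):
                hs = min(max(int(math.floor(y1 + i * bin_h)), 0), height)
                he = min(max(int(math.ceil(y1 + (i + 1) * bin_h)), 0), height)
                ws = min(max(int(math.floor(x1 + j * bin_w)), 0), width)
                we = min(max(int(math.ceil(x1 + (j + 1) * bin_w)), 0), width)
                # Bin (i, j) of class c looks only at its dedicated score map.
                channel = score_maps[(c * k + i) * k + j]
                cells = [channel[y][x] for y in range(hs, he) for x in range(ws, we)]
                votes.append(sum(cells) / len(cells) if cells else 0.0)
        scores.append(sum(votes) / len(votes))
    return scores

def quadrant_map(i, j, size=4):
    """Toy score map that fires only in quadrant (i, j) of a size x size grid."""
    half = size // 2
    return [[1.0 if (y // half, x // half) == (i, j) else 0.0
             for x in range(size)] for y in range(size)]

# Class 0 (background) maps are all zero; class 1 has one "part detector" per bin.
zero = [[0.0] * 4 for _ in range(4)]
maps = [zero] * 4 + [quadrant_map(i, j) for i in range(2) for j in range(2)]
aligned = psroi_pool(maps, (0, 0, 64, 64), num_classes=2, k=2)
shifted = psroi_pool(maps, (32, 0, 96, 64), num_classes=2, k=2)
```

In this toy setup the aligned RoI gets a full score for class 1, while the RoI shifted by half its width gets none, because every bin now lands on a region where its own part detector does not fire.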
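4. For a network, if you want to replace RoI pooling with PSROIPooling, do you have to generate a score map? The answer is no. When using a PSROIPooling layer you only need the number of input channels to equal output_dim*group_size^2 of the PSROIPooling layer; whether output_dim matches the number of classes is not important, because an fc layer can follow to change the dimension. This is one of the innovations of Light-Head R-CNN, the thinner feature map (link). Here output_dim is the number of channels after pooling and group_size is the spatial size after pooling.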
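The channel constraint is simple arithmetic, sketched below. The helper names are hypothetical; 1029 and 392 are the `num_output` values of `rfcn_cls` and `rfcn_bbox` in the prototxt later in this post, and the 10*7*7 = 490 thin-feature-map setting is my recollection of the Light-Head R-CNN configuration, so treat it as an assumption.

```python
def psroi_input_channels(output_dim, group_size):
    """Channels the layer feeding PSROIPooling must produce."""
    return output_dim * group_size ** 2

def check_psroi(conv_num_output, output_dim, group_size):
    """Validate the channel constraint and return the pooled shape per RoI."""
    expected = psroi_input_channels(output_dim, group_size)
    if conv_num_output != expected:
        raise ValueError(
            "PSROIPooling needs %d input channels, got %d" % (expected, conv_num_output))
    return (output_dim, group_size, group_size)

def flattened_size(shape):
    """Number of values an fc layer would see after flattening the pooled blob."""
    size = 1
    for dim in shape:
        size *= dim
    return size

rfcn_cls_shape = check_psroi(1029, output_dim=21, group_size=7)   # class score maps
rfcn_bbox_shape = check_psroi(392, output_dim=8, group_size=7)    # box score maps
thin_shape = check_psroi(490, output_dim=10, group_size=7)        # class-independent
fc_inputs = flattened_size(thin_shape)
```

In the thin-feature-map case output_dim has nothing to do with the number of classes; the fc layer that follows maps the flattened values to whatever dimension cls and reg need.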

A fancy divider

2018.04.22

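After a while without reading it I forget things, so this time through R-FCN I am writing down some details.
The backbone is ResNet-101 and the input to the RPN is conv4. First, a detail that is easy to overlook: each of the first four conv stages of ResNet-101 has exactly one stride = 2, which means the conv4 output is 1/16 of input_size. Only on that basis can proposals be mapped between the original image and the position-sensitive score maps. Here is the non-backbone part of the prototxt.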

#========= RPN ============

layer {
  name: "rpn_conv/3x3"
  type: "Convolution"
  bottom: "res4b22"
  top: "rpn/output"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 512
    kernel_size: 3 pad: 1 stride: 1
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "rpn_relu/3x3"
  type: "ReLU"
  bottom: "rpn/output"
  top: "rpn/output"
}

layer {
  name: "rpn_cls_score"
  type: "Convolution"
  bottom: "rpn/output"
  top: "rpn_cls_score"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 18   # 2(bg/fg) * 9(anchors)
    kernel_size: 1 pad: 0 stride: 1
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}

layer {
  name: "rpn_bbox_pred"
  type: "Convolution"
  bottom: "rpn/output"
  top: "rpn_bbox_pred"
  param { lr_mult: 1.0 }
  param { lr_mult: 2.0 }
  convolution_param {
    num_output: 36   # 4 * 9(anchors)
    kernel_size: 1 pad: 0 stride: 1
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}

layer {
   bottom: "rpn_cls_score"
   top: "rpn_cls_score_reshape"
   name: "rpn_cls_score_reshape"
   type: "Reshape"
   reshape_param { shape { dim: 0 dim: 2 dim: -1 dim: 0 } }
}

layer {
  name: 'rpn-data'
  type: 'Python'
  bottom: 'rpn_cls_score'
  bottom: 'gt_boxes'
  bottom: 'im_info'
  bottom: 'data'
  top: 'rpn_labels'
  top: 'rpn_bbox_targets'
  top: 'rpn_bbox_inside_weights'
  top: 'rpn_bbox_outside_weights'
  python_param {
    module: 'rpn.anchor_target_layer'
    layer: 'AnchorTargetLayer'
    param_str: "'feat_stride': 16"
  }
}

layer {
  name: "rpn_loss_cls"
  type: "SoftmaxWithLoss"
  bottom: "rpn_cls_score_reshape"
  bottom: "rpn_labels"
  propagate_down: 1
  propagate_down: 0
  top: "rpn_cls_loss"
  loss_weight: 1
  loss_param {
    ignore_label: -1
    normalize: true
  }
}

layer {
  name: "rpn_loss_bbox"
  type: "SmoothL1Loss"
  bottom: "rpn_bbox_pred"
  bottom: "rpn_bbox_targets"
  bottom: 'rpn_bbox_inside_weights'
  bottom: 'rpn_bbox_outside_weights'
  top: "rpn_loss_bbox"
  loss_weight: 1
  smooth_l1_loss_param { sigma: 3.0 }
}

#========= RoI Proposal ============

layer {
  name: "rpn_cls_prob"
  type: "Softmax"
  bottom: "rpn_cls_score_reshape"
  top: "rpn_cls_prob"
}

layer {
  name: 'rpn_cls_prob_reshape'
  type: 'Reshape'
  bottom: 'rpn_cls_prob'
  top: 'rpn_cls_prob_reshape'
  reshape_param { shape { dim: 0 dim: 18 dim: -1 dim: 0 } }
}

layer {
  name: 'proposal'
  type: 'Python'
  bottom: 'rpn_cls_prob_reshape'
  bottom: 'rpn_bbox_pred'
  bottom: 'im_info'
  top: 'rpn_rois'
  python_param {
    module: 'rpn.proposal_layer'
    layer: 'ProposalLayer'
    param_str: "'feat_stride': 16"
  }
}

layer {
  name: 'roi-data'
  type: 'Python'
  bottom: 'rpn_rois'
  bottom: 'gt_boxes'
  top: 'rois'
  top: 'labels'
  top: 'bbox_targets'
  top: 'bbox_inside_weights'
  top: 'bbox_outside_weights'
  python_param {
    module: 'rpn.proposal_target_layer'
    layer: 'ProposalTargetLayer'
    param_str: "'num_classes': 2"
  }
}

#----------------------new conv layer------------------
layer {
    bottom: "res5c"
    top: "conv_new_1"
    name: "conv_new_1"
    type: "Convolution"
    convolution_param {
        num_output: 1024
        kernel_size: 1
        pad: 0
        weight_filler {
            type: "gaussian"
            std: 0.01
        }
        bias_filler {
            type: "constant"
            value: 0
        }
    }
    param {
        lr_mult: 1.0
    }
    param {
        lr_mult: 2.0
    }
}

layer {
    bottom: "conv_new_1"
    top: "conv_new_1"
    name: "conv_new_1_relu"
    type: "ReLU"
}

layer {
    bottom: "conv_new_1"
    top: "rfcn_cls"
    name: "rfcn_cls"
    type: "Convolution"
    convolution_param {
        num_output: 1029 #21*(7^2) cls_num*(score_maps_size^2)
        kernel_size: 1
        pad: 0
        weight_filler {
            type: "gaussian"
            std: 0.01
        }
        bias_filler {
            type: "constant"
            value: 0
        }
    }
    param {
        lr_mult: 1.0
    }
    param {
        lr_mult: 2.0
    }
}
layer {
    bottom: "conv_new_1"
    top: "rfcn_bbox"
    name: "rfcn_bbox"
    type: "Convolution"
    convolution_param {
        num_output: 392 #2*4*(7^2) (bg/fg)*(dx, dy, dw, dh)*(score_maps_size^2)
        kernel_size: 1
        pad: 0
        weight_filler {
            type: "gaussian"
            std: 0.01
        }
        bias_filler {
            type: "constant"
            value: 0
        }
    }
    param {
        lr_mult: 1.0
    }
    param {
        lr_mult: 2.0
    }
}

#--------------position sensitive RoI pooling--------------
layer {
    bottom: "rfcn_cls"
    bottom: "rois"
    top: "psroipooled_cls_rois"
    name: "psroipooled_cls_rois"
    type: "PSROIPooling"
    psroi_pooling_param {
        spatial_scale: 0.0625
        output_dim: 21
        group_size: 7
    }
}

layer {
    bottom: "psroipooled_cls_rois"
    top: "cls_score"
    name: "ave_cls_score_rois"
    type: "Pooling"
    pooling_param {
        pool: AVE
        kernel_size: 7
        stride: 7
    }
}


layer {
    bottom: "rfcn_bbox"
    bottom: "rois"
    top: "psroipooled_loc_rois"
    name: "psroipooled_loc_rois"
    type: "PSROIPooling"
    psroi_pooling_param {
        spatial_scale: 0.0625
        output_dim: 8
        group_size: 7
    }
}

layer {
    bottom: "psroipooled_loc_rois"
    top: "bbox_pred"
    name: "ave_bbox_pred_rois"
    type: "Pooling"
    pooling_param {
        pool: AVE
        kernel_size: 7
        stride: 7
    }
}


#-----------------------output------------------------
layer {
   name: "loss"
   type: "SoftmaxWithLoss"
   bottom: "cls_score"
   bottom: "labels"
   top: "loss_cls"
   loss_weight: 1
   propagate_down: true
   propagate_down: false
}

layer {
   name: "accuarcy"
   type: "Accuracy"
   bottom: "cls_score"
   bottom: "labels"
   top: "accuarcy"
   #include: { phase: TEST }
   propagate_down: false
   propagate_down: false
}

layer {
   name: "loss_bbox"
   type: "SmoothL1LossOHEM"
   bottom: "bbox_pred"
   bottom: "bbox_targets"
   bottom: 'bbox_inside_weights'
   top: "loss_bbox"
   loss_weight: 1
   loss_param {
        normalization: PRE_FIXED
        pre_fixed_normalizer: 128
   }
   propagate_down: true
   propagate_down: false
   propagate_down: false
}

layer {
    name: "silence"
    type: "Silence"
    bottom: "bbox_outside_weights"
}

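As for the role of rpn-data: put simply, my understanding is that it generates the anchors and, as its top blobs in the prototxt show, produces their labels and bbox targets. For a walkthrough of the source code, see the link.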
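A rough sketch of the labelling that an anchor-target step performs is given below. It is not the `AnchorTargetLayer` source: the function names are hypothetical, the 0.7/0.3 thresholds and the 256-anchor mini-batch with at most half positives follow the Faster R-CNN paper, and the paper's extra rule that the best-matching anchor of each GT box is also positive is left out for brevity. The label -1 matches `ignore_label: -1` in `rpn_loss_cls` above.

```python
import random

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """1 = foreground, 0 = background, -1 = ignored during training."""
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best > pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    return labels

def subsample(labels, batch_size=256, fg_fraction=0.5, seed=0):
    """Randomly disable surplus anchors so at most batch_size take part in the loss."""
    rng = random.Random(seed)
    labels = list(labels)
    pos = [i for i, label in enumerate(labels) if label == 1]
    max_pos = int(batch_size * fg_fraction)
    if len(pos) > max_pos:
        for i in rng.sample(pos, len(pos) - max_pos):
            labels[i] = -1
    # Negatives fill whatever room the positives leave in the mini-batch.
    max_neg = batch_size - min(len(pos), max_pos)
    neg = [i for i, label in enumerate(labels) if label == 0]
    if len(neg) > max_neg:
        for i in rng.sample(neg, len(neg) - max_neg):
            labels[i] = -1
    return labels
```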

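The prototxt uses 16 and 0.0625 (= 1/16) repeatedly; they convert between original-image coordinates (proposals and anchors) and feature map coordinates.
As you can see, R-FCN uses the same proposal concept.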
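Where the two numbers come from, and how `feat_stride` places anchors in the image, can be sketched as follows. This is a simplified stand-in for rpn/generate_anchors.py: it uses the same base size, ratios, and scales (16, {0.5, 1, 2}, {8, 16, 32}) but centres each anchor on the middle of its feature-map cell and skips the rounding the original code performs, so the coordinates will not match it exactly. The function names are assumptions.

```python
import math

def total_stride(stage_strides):
    """Overall downsampling factor of a stack of stages."""
    stride = 1
    for s in stage_strides:
        stride *= s
    return stride

def base_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """k = len(ratios) * len(scales) boxes centred on the origin; ratio = h / w."""
    anchors = []
    for r in ratios:
        for s in scales:
            area = float(base_size * s) ** 2
            w = math.sqrt(area / r)
            h = w * r
            anchors.append((-0.5 * w, -0.5 * h, 0.5 * w, 0.5 * h))
    return anchors

def grid_anchors(feat_h, feat_w, feat_stride):
    """Place the base anchors at every feature-map cell, in image coordinates."""
    base = base_anchors(base_size=feat_stride)
    out = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Feature-map cell (x, y) -> centre of the matching image patch.
            cx = x * feat_stride + feat_stride / 2.0
            cy = y * feat_stride + feat_stride / 2.0
            for x1, y1, x2, y2 in base:
                out.append((cx + x1, cy + y1, cx + x2, cy + y2))
    return out

feat_stride = total_stride([2, 2, 2, 2])   # four stride-2 stages up to conv4
spatial_scale = 1.0 / feat_stride          # image -> feature map
anchors = grid_anchors(13, 13, feat_stride)
```

Multiplying by `feat_stride` goes from the feature map to the image (anchors), and multiplying by `spatial_scale` goes the other way (RoIs in PSROIPooling).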

Anchors and proposals both live in original-image coordinates, and bbox pred is trained to output dx, dy, dw, dh. Proposals are the top-N foreground anchors after position refinement.
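The dx, dy, dw, dh targets are the standard parameterisation from the R-CNN family of papers: centre offsets normalised by the anchor size, and log ratios of width and height. The sketch below shows the encoding used to build training targets and the decoding that turns predictions back into boxes; the names `encode` and `decode` and the corner convention without a +1 offset are assumptions for the example.

```python
import math

def to_center(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h

def encode(anchor, gt):
    """Regression target (dx, dy, dw, dh) that moves `anchor` onto `gt`."""
    acx, acy, aw, ah = to_center(anchor)
    gcx, gcy, gw, gh = to_center(gt)
    return ((gcx - acx) / aw, (gcy - acy) / ah,
            math.log(gw / aw), math.log(gh / ah))

def decode(anchor, delta):
    """Inverse of encode: apply (dx, dy, dw, dh) to an anchor."""
    acx, acy, aw, ah = to_center(anchor)
    dx, dy, dw, dh = delta
    cx, cy = acx + dx * aw, acy + dy * ah
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

anchor = (0.0, 0.0, 100.0, 100.0)
gt = (10.0, 20.0, 110.0, 220.0)
target = encode(anchor, gt)        # what bbox pred should learn for this anchor
restored = decode(anchor, target)  # decoding the target recovers the GT box
```

Because the offsets are relative to the anchor's own size, the same network output means a small shift for a small anchor and a large shift for a large one.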

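For the motivation of the paper, see this passage:
Note that part 1 consists of base classification networks such as VGG, GoogleNet, or ResNet. Their computation is shared by all RoIs, so testing an image needs only one forward pass. The RoI-wise subnetwork of part 2, however, is not shared across RoIs, because its job is precisely to classify and regress each RoI, so its computation naturally cannot be shared. This is where the problem lies. The network in part 1 is position-insensitive. If all convolutional layers of a classification network such as ResNet are placed in part 1 for feature extraction and only fully connected layers remain in part 2, the resulting detection network is position-insensitive (translation-invariance), so its detection accuracy is lower and the strong classification ability of the network is wasted (does not match the network's superior classification accuracy). To address this position insensitivity, the ResNet paper makes a compromise: the RoI Pooling Layer is placed not after the last convolutional layer of ResNet-101 but between convolutional layers. There are then convolutional layers both before and after the RoI Pooling Layer, and those after it are not shared; they extract features for each RoI separately. With this design the layers after the RoI Pooling Layer are position-sensitive (translation-variance), but test speed is sacrificed, because every RoI has to go through several convolutional layers, which makes testing slow.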

So what should we do? Accuracy or speed? R-FCN's answer: both!!!

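Quoted from Zhihu.
R-FCN achieves both position sensitivity (position-sensitive maps, its breakthrough over Faster R-CNN) and shared computation (its improvement over ResNet-101 Faster R-CNN), so it really does get the best of both worlds.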