



Choosing scales and aspect ratios for default boxes. To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality because the lower layers capture more fine details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results.

Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8 × 8 and 4 × 4) which are used in the framework. In practice, we can use many more with small computational overhead.
Feature maps from different levels within a network are known to have different (empirical) receptive field sizes [13]. Fortunately, within the SSD framework, the default boxes do not necessarily need to correspond to the actual receptive fields of each layer. We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use m feature maps for prediction. The scale of the default boxes for each feature map is computed as:

s_{k}=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1), k\in [1,m]

where s_{min} is 0.2 and s_{max} is 0.9, meaning the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9, and all layers in between are regularly spaced. We impose different aspect ratios for the default boxes, and denote them as a_{r}\in\left \{ 1,2,3,\frac{1}{2},\frac{1}{3} \right \}. We can compute the width (w_{k}^{a}=s_{k}\sqrt{a_{r}}) and height (h_{k}^{a}=s_{k}/\sqrt{a_{r}}) for each default box. For the aspect ratio of 1, we also add a default box whose scale is s_{k}^{'}=\sqrt{s_{k}s_{k+1}}, resulting in 6 default boxes per feature map location. We set the center of each default box to (\frac{i+0.5}{\left | f_{k} \right |},\frac{j+0.5}{\left | f_{k} \right |}), where \left | f_{k} \right | is the size of the k-th square feature map, i,j\in[0,|f_{k}|). In practice, one can also design a distribution of default boxes to best fit a specific dataset. How to design the optimal tiling is an open question as well.
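
The scale formula and the width/height rules above can be checked with a few lines of C++. This is only a sketch under stated assumptions: the names BoxShape, DefaultBoxScale and DefaultBoxShapes are hypothetical, and for the last feature map (k = m) the value s_{k+1} is obtained by extrapolating the same formula one step further, which the paper does not specify.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Relative width and height of one default box (fractions of the image side).
struct BoxShape {
  double width;
  double height;
};

// s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), for k in [1, m].
// Assumption: k = m + 1 is also accepted and simply extrapolates the formula.
double DefaultBoxScale(int k, int m, double s_min = 0.2, double s_max = 0.9) {
  assert(m > 1 && k >= 1 && k <= m + 1);
  return s_min + (s_max - s_min) / (m - 1) * (k - 1);
}

// The six default boxes at one location of the k-th feature map: five aspect
// ratios at scale s_k, plus one extra square box of scale sqrt(s_k * s_{k+1}).
std::vector<BoxShape> DefaultBoxShapes(int k, int m) {
  const double s_k = DefaultBoxScale(k, m);
  const double s_next = DefaultBoxScale(k + 1, m);
  const double ratios[] = {1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0};
  std::vector<BoxShape> shapes;
  for (double a_r : ratios) {
    // w = s_k * sqrt(a_r), h = s_k / sqrt(a_r)
    shapes.push_back({s_k * std::sqrt(a_r), s_k / std::sqrt(a_r)});
  }
  const double s_extra = std::sqrt(s_k * s_next);
  shapes.push_back({s_extra, s_extra});
  return shapes;
}
```

With m = 6, DefaultBoxScale(1, 6) gives 0.2 and DefaultBoxScale(6, 6) gives 0.9, and every box with aspect ratio a_r keeps the area s_k^2 while its width-to-height ratio equals a_r.
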
By combining predictions for all default boxes with different scales and aspect ratios from all locations of many feature maps, we have a diverse set of predictions, covering various input object sizes and shapes. For example, in Fig. 1, the dog is matched to a default box in the 4 × 4 feature map, but not to any default boxes in the 8 × 8 feature map. This is because those boxes have different scales and do not match the dog box, and therefore are considered as negatives during training.


Notes on the passage above: "sharing parameters across all object scales" means that one and the same network handles objects at every scale, and "small computational overhead" means that using more feature maps for prediction adds little extra computation. In the formula for the box centers, \left | f_{k} \right | is the side length of the k-th square feature map, that is, its width (equivalently, its height).



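Figure 1. SSD framework. (a) SSD only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, we evaluate a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. 8 × 8 and 4 × 4 in (b) and (c)). For each default box, we predict both the shape offsets and the confidences for all object categories ((c_{1},c_{2},...,c_{p})). At training time, we first match these default boxes to the ground truth boxes. For example, we have matched two default boxes with the cat and one with the dog, which are treated as positives and the rest as negatives (that is, default boxes matched to a ground truth box are positives and all others are negatives). The model loss is a weighted sum between localization loss (e.g. smooth L1 [6]) and confidence loss (e.g. Softmax).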





// Message that stores parameters used by PriorBoxLayer
message PriorBoxParameter {
  // Encode/decode type.
  enum CodeType {
    CORNER = 1;
    CENTER_SIZE = 2;
    CORNER_SIZE = 3;
  }
  // Minimum box size (in pixels). Required!
  // Note: corresponds to s_k in Eq. (4) of Sec. 2.2 of the paper, multiplied
  // by the size of the network input image (the input of the data layer).
  repeated float min_size = 1;
  // Maximum box size (in pixels). Required!
  // Note: equals the min_size of the next feature map used to generate
  // default boxes, i.e. s_{k+1} in Eq. (4) multiplied by the input image size.
  repeated float max_size = 2;
  // Various of aspect ratios. Duplicate ratios will be ignored.
  // If none is provided, we use default ratio 1.
  repeated float aspect_ratio = 3;  // aspect ratios (width / height)
  // If true, will flip each aspect ratio.
  // For example, if there is aspect ratio "r",
  // we will generate aspect ratio "1.0/r" as well.
  optional bool flip = 4 [default = true];  // whether to flip the aspect ratios
  // If true, will clip the prior so that it is within [0, 1]
  // Note: i.e. whether each default box is forced to lie entirely inside the
  // network input image.
  optional bool clip = 5 [default = false];
  // Variance for adjusting the prior bboxes.
  repeated float variance = 6;  // purpose not yet clear to me
  // By default, we calculate img_height, img_width, step_x, step_y based on
  // bottom[0] (feat) and bottom[1] (img). Unless these values are explicitly
  // provided.
  // Explicitly provide the img_size.
  optional uint32 img_size = 7;
  // Either img_size or img_h/img_w should be specified; not both.
  optional uint32 img_h = 8;  // height of the network input image (or a user-set height)
  optional uint32 img_w = 9;  // width of the network input image (or a user-set width)

  // Explicitly provide the step size.
  optional float step = 10;
  // Either step or step_h/step_w should be specified; not both.
  // Note: step_h (step_w) is the distance, measured on the network input
  // image, between two vertically (horizontally) adjacent feature-map cells.
  optional float step_h = 11;
  optional float step_w = 12;

  // Offset to the top left corner of each cell.
  optional float offset = 13 [default = 0.5];  // relative offset of the default box center inside a cell
}







layer {
  name: "conv4_3_norm_mbox_priorbox"
  type: "PriorBox"
  bottom: "conv4_3_norm"
  bottom: "data"
  top: "conv4_3_norm_mbox_priorbox"
  prior_box_param {
    min_size: 30.0
    max_size: 60.0
    aspect_ratio: 2
    flip: true
    clip: false
    variance: 0.1
    variance: 0.1
    variance: 0.2
    variance: 0.2
    step: 8
    offset: 0.5
  }
}

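To make the configuration above concrete, the following sketch lists the pixel sizes of the priors that one feature-map cell would get. It mirrors the order used in the layer's forward pass (square box of min_size, square box of sqrt(min_size * max_size), then the remaining aspect ratios), but it is an illustration only: PriorSize and PriorSizesInPixels are hypothetical names, and the caller is assumed to pass the already flipped ratios (2 and 1/2 for this layer).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Width and height of one prior box, in pixels of the network input image.
struct PriorSize {
  double width;
  double height;
};

// Sizes of all priors at a single cell for one (min_size, max_size) pair.
// extra_ratios holds the aspect ratios other than 1, flips included.
std::vector<PriorSize> PriorSizesInPixels(
    double min_size, double max_size, const std::vector<double>& extra_ratios) {
  assert(min_size > 0 && max_size > min_size);
  std::vector<PriorSize> sizes;
  // First prior: aspect ratio 1, side = min_size.
  sizes.push_back({min_size, min_size});
  // Second prior: aspect ratio 1, side = sqrt(min_size * max_size).
  const double mid = std::sqrt(min_size * max_size);
  sizes.push_back({mid, mid});
  // Remaining priors: width = min_size * sqrt(ar), height = min_size / sqrt(ar).
  for (double ar : extra_ratios) {
    sizes.push_back({min_size * std::sqrt(ar), min_size / std::sqrt(ar)});
  }
  return sizes;
}
```

For min_size 30, max_size 60 and ratios {2, 0.5} this yields four priors per cell: 30 × 30, about 42.4 × 42.4, about 42.4 × 21.2 and about 21.2 × 42.4 pixels.
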



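In SSD300 the min_size and max_size used here do not strictly follow the formula from Sec. 2.2, s_{k}=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1), k\in [1,m], but they still stand for s_{k} and s_{k+1} (each multiplied by the input image size). The reason is given later in the paper (conv4_3 uses s_{k}=0.1). The per-layer values are listed below: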


[Table: layer name, feature map size (\left | f_{k} \right | \times \left | f_{k} \right |), min_size, max_size, step]





Note: step\approx \frac{I}{|f_{k}|}, where I is the size of the network input image.
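
The two relations used here, min_size = s_{k} × I and step ≈ I / |f_{k}|, are plain arithmetic and can be sketched as below. SizeInPixels and DerivedStep are hypothetical helper names. The input size I = 300 follows from SSD300, while the conv4_3 feature map side of 38 is an assumption that does not appear in the text above.

```cpp
#include <cassert>

// Box size in pixels for a relative scale s_k and a square input of side I.
double SizeInPixels(double scale, int input_size) {
  return scale * input_size;
}

// Step the layer derives when none is configured: image side / feature-map side.
double DerivedStep(int input_size, int feature_map_size) {
  assert(feature_map_size > 0);
  return static_cast<double>(input_size) / feature_map_size;
}
```

SizeInPixels(0.1, 300) is 30 and SizeInPixels(0.2, 300) is 60, which matches min_size: 30.0 and max_size: 60.0 in the conv4_3 layer definition. DerivedStep(300, 38) is roughly 7.9, consistent with the configured step: 8 under the 38 × 38 assumption.
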




#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

namespace caffe {

/**
 * @brief Generate the prior boxes of designated sizes and aspect ratios across
 *        all dimensions @f$ (H \times W) @f$.
 *
 * Intended for use with MultiBox detection method to generate prior (template).
 *
 * NOTE: does not implement Backwards operation.
 */
template <typename Dtype>
class PriorBoxLayer : public Layer<Dtype> {
 public:
  /**
   * @param param provides PriorBoxParameter prior_box_param,
   *     with PriorBoxLayer options:
   *   - min_size (\b minimum box size in pixels. can be multiple. required!).
   *     Note: s_k in Eq. (4) of Sec. 2.2 of the paper, multiplied by the size
   *     of the network input image (the input of the data layer).
   *   - max_size (\b maximum box size in pixels. can be ignored or same as the
   *   # of min_size.).
   *     Note: s_{k+1} in Eq. (4), multiplied by the input image size.
   *   - aspect_ratio (\b optional aspect ratios of the boxes. can be multiple).
   *   - flip (\b optional bool, default true).
   *     if set, flip the aspect ratio.
   */
  explicit PriorBoxLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "PriorBox"; }
  // Exactly 2 bottom blobs: usually the feature map first, then the image
  // coming from the data layer.
  virtual inline int ExactBottomBlobs() const { return 2; }
  // Exactly 1 top blob.
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  /**
   * @brief Generates prior boxes for a layer with specified parameters.
   * @param bottom input Blob vector (at least 2)
   *   -# @f$ (N \times C \times H_i \times W_i) @f$
   *      the input layer @f$ x_i @f$
   *   -# @f$ (N \times C \times H_0 \times W_0) @f$
   *      the data layer @f$ x_0 @f$
   * @param top output Blob vector (length 1)
   *   -# @f$ (N \times 2 \times K*4) @f$ where @f$ K @f$ is the prior numbers
   *   By default, a box of aspect ratio 1 and min_size and a box of aspect
   *   ratio 1 and sqrt(min_size * max_size) are created.
   */
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  /// @brief Not implemented (no backward pass is needed)
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
    return;
  }

  vector<float> min_sizes_;      // configured min_size values (s_k times the input image size)
  vector<float> max_sizes_;      // min_size of the next prior-generating feature map (s_{k+1} times the input image size)
  vector<float> aspect_ratios_;  // configured aspect ratios, including the default ratio 1
  bool flip_;                    // whether to flip the aspect ratios
  int num_priors_;               // number of default boxes per feature-map cell
  bool clip_;                    // whether to clip default boxes to the input image
  vector<float> variance_;       // variances (purpose not yet clear to me)

  int img_w_;     // width of the network input image (or a user-set width)
  int img_h_;     // height of the network input image (or a user-set height)
  float step_w_;  // distance on the input image between horizontally adjacent feature-map cells
  float step_h_;  // distance on the input image between vertically adjacent feature-map cells

  float offset_;  // relative offset of the default box center inside a cell
};

}  // namespace caffe




#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

#include "caffe/layers/prior_box_layer.hpp"

namespace caffe {

template <typename Dtype>
void PriorBoxLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const PriorBoxParameter& prior_box_param =
      this->layer_param_.prior_box_param();  // fetch the layer parameters
  // min_size is mandatory.
  CHECK_GT(prior_box_param.min_size_size(), 0) << "must provide min_size.";
  for (int i = 0; i < prior_box_param.min_size_size(); ++i) {
    min_sizes_.push_back(prior_box_param.min_size(i));
    // min_size must be positive (CHECK_GT means "greater than").
    CHECK_GT(min_sizes_.back(), 0) << "min_size must be positive.";
  }
  aspect_ratios_.push_back(1.);  // aspect ratio 1 is always present by default
  // flip = true also adds the reciprocal of each ratio, e.g. 2 gives 1/2.
  flip_ = prior_box_param.flip();
  for (int i = 0; i < prior_box_param.aspect_ratio_size(); ++i) {
    float ar = prior_box_param.aspect_ratio(i);
    bool already_exist = false;
    for (int j = 0; j < aspect_ratios_.size(); ++j) {
      if (fabs(ar - aspect_ratios_[j]) < 1e-6) {
        already_exist = true;
        break;  // leave the inner loop
      }
    }
    if (!already_exist) {
      aspect_ratios_.push_back(ar);  // store each distinct aspect ratio
      if (flip_) {
        aspect_ratios_.push_back(1./ar);  // store the flipped ratio as well
      }
    }
  }
  // Number of default boxes (the paper's "default box") per cell.
  num_priors_ = aspect_ratios_.size() * min_sizes_.size();
  if (prior_box_param.max_size_size() > 0) {
    // The numbers of min_size and max_size entries must be equal.
    CHECK_EQ(prior_box_param.min_size_size(), prior_box_param.max_size_size());
    for (int i = 0; i < prior_box_param.max_size_size(); ++i) {
      max_sizes_.push_back(prior_box_param.max_size(i));
      CHECK_GT(max_sizes_[i], min_sizes_[i])
          << "max_size must be greater than min_size.";
      num_priors_ += 1;  // one extra default box per max_size
    }
  }
  clip_ = prior_box_param.clip();  // clipping flag
  if (prior_box_param.variance_size() > 1) {
    // Must and only provide 4 variance.
    CHECK_EQ(prior_box_param.variance_size(), 4);
    for (int i = 0; i < prior_box_param.variance_size(); ++i) {
      CHECK_GT(prior_box_param.variance(i), 0);
      variance_.push_back(prior_box_param.variance(i));
    }
  } else if (prior_box_param.variance_size() == 1) {  // a single variance was given
    CHECK_GT(prior_box_param.variance(0), 0);
    variance_.push_back(prior_box_param.variance(0));
  } else {
    // Set default to 0.1.
    variance_.push_back(0.1);
  }

  if (prior_box_param.has_img_h() || prior_box_param.has_img_w()) {
    // Only one of the two forms may be used.
    CHECK(!prior_box_param.has_img_size())
        << "Either img_size or img_h/img_w should be specified; not both.";
    img_h_ = prior_box_param.img_h();
    CHECK_GT(img_h_, 0) << "img_h should be larger than 0.";
    img_w_ = prior_box_param.img_w();
    CHECK_GT(img_w_, 0) << "img_w should be larger than 0.";
  } else if (prior_box_param.has_img_size()) {
    const int img_size = prior_box_param.img_size();
    CHECK_GT(img_size, 0) << "img_size should be larger than 0.";
    img_h_ = img_size;
    img_w_ = img_size;
  } else {
    img_h_ = 0;  // neither form was set: use 0 for now
    img_w_ = 0;
  }
  if (prior_box_param.has_step_h() || prior_box_param.has_step_w()) {
    CHECK(!prior_box_param.has_step())
        << "Either step or step_h/step_w should be specified; not both.";
    step_h_ = prior_box_param.step_h();
    CHECK_GT(step_h_, 0.) << "step_h should be larger than 0.";
    step_w_ = prior_box_param.step_w();
    CHECK_GT(step_w_, 0.) << "step_w should be larger than 0.";
  } else if (prior_box_param.has_step()) {
    const float step = prior_box_param.step();
    CHECK_GT(step, 0) << "step should be larger than 0.";
    step_h_ = step;
    step_w_ = step;
  } else {
    step_h_ = 0;
    step_w_ = 0;
  }

  offset_ = prior_box_param.offset();  // offset from the top-left corner of a cell (default 0.5)
}

template <typename Dtype>
void PriorBoxLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const int layer_width = bottom[0]->width();
  const int layer_height = bottom[0]->height();
  vector<int> top_shape(3, 1);
  // Since all images in a batch has same height and width, we only need to
  // generate one set of priors which can be shared across all images.
  top_shape[0] = 1;
  // 2 channels. First channel stores the mean of each prior coordinate
  // (the normalized top-left and bottom-right corners of each default box).
  // Second channel stores the variance of each prior coordinate.
  top_shape[1] = 2;
  // Every feature-map cell produces num_priors_ default boxes, and each box
  // has 4 normalized coordinates (and 4 variances).
  top_shape[2] = layer_width * layer_height * num_priors_ * 4;
  CHECK_GT(top_shape[2], 0);
  top[0]->Reshape(top_shape);
}

template <typename Dtype>
void PriorBoxLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  // bottom[0] is usually the feature map; bottom[1] is usually the network
  // input, i.e. the data layer.
  const int layer_width = bottom[0]->width();
  const int layer_height = bottom[0]->height();
  int img_width, img_height;
  if (img_h_ == 0 || img_w_ == 0) {
    img_width = bottom[1]->width();
    img_height = bottom[1]->height();
  } else {
    img_width = img_w_;
    img_height = img_h_;
  }
  float step_w, step_h;
  if (step_w_ == 0 || step_h_ == 0) {
    step_w = static_cast<float>(img_width) / layer_width;
    step_h = static_cast<float>(img_height) / layer_height;
  } else {
    step_w = step_w_;
    step_h = step_h_;
  }
  Dtype* top_data = top[0]->mutable_cpu_data();
  int dim = layer_height * layer_width * num_priors_ * 4;
  int idx = 0;
  // Nested loops that fill in the default boxes (see "Choosing scales and
  // aspect ratios for default boxes" in Sec. 2.2 of the paper).
  for (int h = 0; h < layer_height; ++h) {
    for (int w = 0; w < layer_width; ++w) {
      // Center of the default box on the network input image (data layer).
      float center_x = (w + offset_) * step_w;
      float center_y = (h + offset_) * step_h;
      float box_width, box_height;
      for (int s = 0; s < min_sizes_.size(); ++s) {
        int min_size_ = min_sizes_[s];
        // first prior: aspect_ratio = 1, size = min_size
        box_width = box_height = min_size_;
        // xmin: normalized x of the top-left corner (in [0, 1] over the image)
        top_data[idx++] = (center_x - box_width / 2.) / img_width;
        // ymin: normalized y of the top-left corner
        top_data[idx++] = (center_y - box_height / 2.) / img_height;
        // xmax: normalized x of the bottom-right corner
        top_data[idx++] = (center_x + box_width / 2.) / img_width;
        // ymax: normalized y of the bottom-right corner
        top_data[idx++] = (center_y + box_height / 2.) / img_height;

        if (max_sizes_.size() > 0) {
          CHECK_EQ(min_sizes_.size(), max_sizes_.size());
          int max_size_ = max_sizes_[s];
          // second prior: aspect_ratio = 1, size = sqrt(min_size * max_size)
          // (the extra aspect-ratio-1 default box described in the paper)
          box_width = box_height = sqrt(min_size_ * max_size_);
          // xmin
          top_data[idx++] = (center_x - box_width / 2.) / img_width;
          // ymin
          top_data[idx++] = (center_y - box_height / 2.) / img_height;
          // xmax
          top_data[idx++] = (center_x + box_width / 2.) / img_width;
          // ymax
          top_data[idx++] = (center_y + box_height / 2.) / img_height;
        }

        // rest of priors: corners for the remaining aspect ratios
        for (int r = 0; r < aspect_ratios_.size(); ++r) {
          float ar = aspect_ratios_[r];
          if (fabs(ar - 1.) < 1e-6) {  // ratio 1 was already handled above
            continue;
          }
          box_width = min_size_ * sqrt(ar);
          box_height = min_size_ / sqrt(ar);
          // xmin
          top_data[idx++] = (center_x - box_width / 2.) / img_width;
          // ymin
          top_data[idx++] = (center_y - box_height / 2.) / img_height;
          // xmax
          top_data[idx++] = (center_x + box_width / 2.) / img_width;
          // ymax
          top_data[idx++] = (center_y + box_height / 2.) / img_height;
        }
      }
    }
  }
  // clip the prior's coordinates such that they are within [0, 1]
  if (clip_) {
    for (int d = 0; d < dim; ++d) {
      top_data[d] = std::min<Dtype>(std::max<Dtype>(top_data[d], 0.), 1.);
    }
  }
  // set the variance (its purpose is not yet clear to me)
  top_data += top[0]->offset(0, 1);
  if (variance_.size() == 1) {
    caffe_set<Dtype>(dim, Dtype(variance_[0]), top_data);
  } else {
    int count = 0;
    for (int h = 0; h < layer_height; ++h) {
      for (int w = 0; w < layer_width; ++w) {
        for (int i = 0; i < num_priors_; ++i) {
          for (int j = 0; j < 4; ++j) {
            top_data[count] = variance_[j];
            ++count;
          }
        }
      }
    }
  }
}

INSTANTIATE_CLASS(PriorBoxLayer);
REGISTER_LAYER_CLASS(PriorBox);

}  // namespace caffe
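
The aspect-ratio handling in LayerSetUp (always start with ratio 1, skip duplicates, optionally add the flipped ratio) and the resulting prior count can be isolated in a small stand-alone sketch. ExpandAspectRatios and NumPriors are hypothetical names introduced here for illustration and are not part of Caffe.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds the list of aspect ratios the same way the setup code does:
// ratio 1 first, then each new ratio, followed by its reciprocal when flip is set.
std::vector<float> ExpandAspectRatios(const std::vector<float>& requested,
                                      bool flip) {
  std::vector<float> ratios;
  ratios.push_back(1.f);
  for (float ar : requested) {
    assert(ar > 0.f);
    bool already_exists = false;
    for (float existing : ratios) {
      if (std::fabs(ar - existing) < 1e-6) {
        already_exists = true;
        break;
      }
    }
    if (already_exists) continue;  // duplicate ratios are ignored
    ratios.push_back(ar);
    if (flip) ratios.push_back(1.f / ar);
  }
  return ratios;
}

// Priors per cell: one per (ratio, min_size) pair plus one per max_size.
int NumPriors(std::size_t num_ratios, std::size_t num_min_sizes,
              std::size_t num_max_sizes) {
  return static_cast<int>(num_ratios * num_min_sizes + num_max_sizes);
}
```

For the conv4_3 configuration (aspect_ratio: 2, flip: true, one min_size, one max_size) this gives the ratios {1, 2, 0.5} and 4 priors per cell. With ratios 2 and 3 it gives 6, matching the six default boxes per location described in the paper excerpt.
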

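The shape computed in Reshape can also be reproduced on its own. The helper below is a hypothetical illustration (PriorBoxTopShape is not a Caffe function). The example numbers assume a 38 × 38 feature map with 4 priors per cell, and the 38 × 38 size is an assumption not stated in the listing.

```cpp
#include <cassert>
#include <vector>

// Output shape of the prior box blob: 1 x 2 x (H * W * num_priors * 4).
// Channel 0 holds the box coordinates, channel 1 the variances.
std::vector<int> PriorBoxTopShape(int layer_height, int layer_width,
                                  int num_priors) {
  assert(layer_height > 0 && layer_width > 0 && num_priors > 0);
  std::vector<int> shape(3, 1);
  shape[0] = 1;  // one set of priors shared by the whole batch
  shape[1] = 2;  // coordinates and variances
  shape[2] = layer_height * layer_width * num_priors * 4;
  return shape;
}
```

PriorBoxTopShape(38, 38, 4) returns {1, 2, 23104}, that is 38 × 38 × 4 = 5776 default boxes with 4 coordinates each.
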

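The listing above leaves the purpose of variance open. To the best of my knowledge, SSD Caffe uses these four values outside this layer, when box offsets are encoded and decoded in center-size form: the predicted center offsets are multiplied by variance[0] and variance[1], and the predicted log-size offsets by variance[2] and variance[3]. The sketch below illustrates that reading. It is an assumption rather than the library's code, and Box and DecodeCenterSize are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Axis-aligned box with normalized corner coordinates.
struct Box {
  float xmin, ymin, xmax, ymax;
};

// Decodes predicted offsets (dx, dy, dw, dh) relative to a prior box,
// scaling each offset by the matching variance entry (assumed behavior).
Box DecodeCenterSize(const Box& prior, const float variance[4],
                     const float offsets[4]) {
  const float prior_w = prior.xmax - prior.xmin;
  const float prior_h = prior.ymax - prior.ymin;
  assert(prior_w > 0.f && prior_h > 0.f);
  const float prior_cx = (prior.xmin + prior.xmax) / 2.f;
  const float prior_cy = (prior.ymin + prior.ymax) / 2.f;
  // Center offsets are relative to the prior size.
  const float cx = variance[0] * offsets[0] * prior_w + prior_cx;
  const float cy = variance[1] * offsets[1] * prior_h + prior_cy;
  // Size offsets are log-scale factors.
  const float w = std::exp(variance[2] * offsets[2]) * prior_w;
  const float h = std::exp(variance[3] * offsets[3]) * prior_h;
  return {cx - w / 2.f, cy - h / 2.f, cx + w / 2.f, cy + h / 2.f};
}
```

With all offsets equal to zero the decoded box is the prior itself. With variance {0.1, 0.1, 0.2, 0.2}, as in the conv4_3 configuration, an x offset of 1 shifts a prior of width 0.2 by 0.1 × 0.2 = 0.02.
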
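A more traditional approach is shown in the figure below (for background, see week 3 of the fourth course in Andrew Ng's deeplearning.ai specialization):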








