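SSD is short for Single Shot MultiBox Detector.
This article follows the SSD implementation in the object detection part of tensorflow models, with MobileNetV1 as the feature extractor. See the configuration example in the appendix for details.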
Object detection training
Input image: 300 x 300
Labels:
- gt_box : [batch_size, num_boxes, 4]
- gt_class : [batch_size, num_boxes, num_classes]
Preprocessing
- Horizontal flip
- Random crop
- Resize the input image to 300 x 300 with the BILINEAR method
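The horizontal flip has to mirror the groundtruth boxes together with the pixels. The following is a minimal sketch of that idea in plain Python, not the library's preprocessor. It assumes boxes are stored as normalized [ymin, xmin, ymax, xmax], and the function names flip_image_horizontally and flip_boxes_horizontally are hypothetical.

```python
def flip_image_horizontally(image):
    """Reverse every row of an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def flip_boxes_horizontally(boxes):
    """Mirror normalized [ymin, xmin, ymax, xmax] boxes left to right.

    Assumption: coordinates are normalized to [0, 1].
    """
    # After mirroring, the old right edge becomes the new left edge.
    return [[ymin, 1.0 - xmax, ymax, 1.0 - xmin]
            for ymin, xmin, ymax, xmax in boxes]


image = [[1, 2, 3], [4, 5, 6]]          # toy 2 x 3 single-channel image
boxes = [[0.1, 0.2, 0.5, 0.6]]          # one groundtruth box
flipped_image = flip_image_horizontally(image)
flipped_boxes = flip_boxes_horizontally(boxes)
print(flipped_image)
print(flipped_boxes)
```

The y coordinates are untouched, and the box width is preserved because xmin and xmax swap roles.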
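Feature extraction
- preprocessed_inputs is passed through MobileNetV1 up to Conv2d_13_pointwise to obtain the feature maps image_feature
- Six feature maps are then produced, in order, from Conv2d_11_pointwise, Conv2d_13_pointwise, and four extra stages appended after Conv2d_13_pointwise (each a 1x1 convolution followed by a 3x3 stride-2 convolution):
  Conv2d_11_pointwise,
  Conv2d_13_pointwise,
  Conv2d_13_pointwise_1_Conv2d_2_1x1_256, Conv2d_13_pointwise_2_Conv2d_2_3x3_s2_512,
  Conv2d_13_pointwise_1_Conv2d_3_1x1_128, Conv2d_13_pointwise_2_Conv2d_3_3x3_s2_256,
  Conv2d_13_pointwise_1_Conv2d_4_1x1_128, Conv2d_13_pointwise_2_Conv2d_4_3x3_s2_256,
  Conv2d_13_pointwise_1_Conv2d_5_1x1_64, Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128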
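To get a feel for the six feature maps, their spatial sizes for a 300 x 300 input can be worked out by hand. The sketch below assumes SAME padding everywhere, a total stride of 16 at Conv2d_11_pointwise and 32 at Conv2d_13_pointwise, and stride 2 for each extra 3x3 layer. The helper name same_conv_output_size is hypothetical.

```python
import math


def same_conv_output_size(size, stride):
    """Spatial size after a strided convolution with SAME padding."""
    return math.ceil(size / stride)


size = 300
backbone_sizes = []
# MobileNetV1 halves the resolution five times (total stride 32).
for _ in range(5):
    size = same_conv_output_size(size, 2)
    backbone_sizes.append(size)

# Conv2d_11_pointwise sits at stride 16, Conv2d_13_pointwise at stride 32.
feature_map_sizes = [backbone_sizes[3], backbone_sizes[4]]

# Each of the four extra stages ends with a 3x3 stride-2 convolution.
for _ in range(4):
    size = same_conv_output_size(size, 2)
    feature_map_sizes.append(size)

print(feature_map_sizes)  # [19, 10, 5, 3, 2, 1]
```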
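IoU filtering
- Compute the IoU matrix match_quality_matrix between gt_box and anchors. Its shape is [num_boxes, num_anchor]: the row index is the element index in groundtruth_boxeslist and the column index is the element index in anchors.
- For each column of match_quality_matrix, record the row index of the maximum to get match1. match1 is indexed by column and stores row indices, so match1 locates the corresponding IoU value in match_quality_matrix.
- If the IoU that match1[i] points to is at least 0.5, match1[i] is left unchanged; if it is below 0.5, match1[i] is set to -1.
- For each row of match_quality_matrix, record the column index of the maximum to get match2. match2 is indexed by row and stores column indices, so match2 also locates the corresponding IoU value in match_quality_matrix.
- matches is an array of num_anchor elements. For every i in [0, num_anchor), if i appears in match2, matches[i] is the index of i within match2 (that is, the groundtruth row whose best anchor is i); otherwise matches[i] = match1[i]. (Note: this is the most convoluted point of the whole implementation and is worth reading carefully.) After this step, elements of matches greater than -1 are matched and negative elements are not. A negative element can only be -1 (unmatched) or -2 (ignored); since matched_threshold and unmatched_threshold are both 0.5 in the appendix configuration, -2 does not occur here.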
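The matching steps above can be sketched in plain Python as follows. This is a minimal illustration, not the library's ArgMaxMatcher. The helper names iou and match_anchors are hypothetical, boxes are assumed to be [ymin, xmin, ymax, xmax], and the rule that the lower groundtruth index wins when two groundtruth boxes pick the same anchor is an assumption of this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two [ymin, xmin, ymax, xmax] boxes."""
    inter_h = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_w = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_h * inter_w
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_anchors(gt_boxes, anchors, threshold=0.5):
    """Return one entry per anchor: a groundtruth row index, or -1."""
    # Rows are groundtruth boxes, columns are anchors.
    quality = [[iou(g, a) for a in anchors] for g in gt_boxes]

    # match1: best row per column, thresholded.
    matches = []
    for col in range(len(anchors)):
        column = [quality[row][col] for row in range(len(gt_boxes))]
        best_row = max(range(len(column)), key=column.__getitem__)
        matches.append(best_row if column[best_row] >= threshold else -1)

    # match2: best column per row; force that anchor onto the row.
    # Iterating in reverse lets the lower row index win on conflicts.
    for row in reversed(range(len(gt_boxes))):
        best_col = max(range(len(anchors)), key=quality[row].__getitem__)
        matches[best_col] = row
    return matches


anchors = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0], [0.0, 0.5, 0.5, 1.0]]
gt_boxes = [[0.0, 0.0, 0.5, 0.6], [0.6, 0.6, 0.9, 0.9]]
matches = match_anchors(gt_boxes, anchors)
print(matches)  # [0, 1, -1]
```

The second groundtruth box only reaches an IoU of about 0.36 with its best anchor, so thresholding alone would leave it without any anchor; the forced match assigns anchor 1 to it anyway.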
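Encoding
Input: anchors and gt_box
anchors are represented as [ycenter_a, xcenter_a, ha, wa]
gt_box is represented as [ycenter, xcenter, h, w]
ty = (ycenter - ycenter_a) / ha
tx = (xcenter - xcenter_a) / wa
th = tf.log(h / ha)
tw = tf.log(w / wa)
Output: [ty, tx, th, tw]
With the faster_rcnn_box_coder scale factors from the configuration example (y_scale and x_scale 10.0, height_scale and width_scale 5.0), ty and tx are then multiplied by 10.0, and th and tw by 5.0.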
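A minimal sketch of the encoding in plain Python, assuming both boxes are already in center-size form. The function name encode_box is hypothetical; the default scale factors are the y_scale, x_scale, height_scale and width_scale values from the configuration example.

```python
import math


def encode_box(gt_box, anchor, scale_factors=(10.0, 10.0, 5.0, 5.0)):
    """Encode gt_box relative to anchor.

    Both inputs are [ycenter, xcenter, h, w]; returns [ty, tx, th, tw].
    """
    ycenter, xcenter, h, w = gt_box
    ycenter_a, xcenter_a, ha, wa = anchor
    y_scale, x_scale, height_scale, width_scale = scale_factors

    # Center offsets are measured in units of the anchor size.
    ty = (ycenter - ycenter_a) / ha * y_scale
    tx = (xcenter - xcenter_a) / wa * x_scale
    # Sizes are encoded as log ratios.
    th = math.log(h / ha) * height_scale
    tw = math.log(w / wa) * width_scale
    return [ty, tx, th, tw]


anchor = [0.5, 0.5, 0.2, 0.2]
gt_box = [0.6, 0.5, 0.2, 0.4]
encoded = encode_box(gt_box, anchor)
print(encoded)
```

A groundtruth box identical to its anchor encodes to all zeros, which is why the network only has to learn small corrections for well-matched anchors.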
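Decoding
Input: anchors and box_encoding
anchors are represented as [ycenter_a, xcenter_a, ha, wa]
box_encoding is represented as [ty, tx, th, tw]
If scale factors are configured, ty, tx, th and tw are first divided by y_scale, x_scale, height_scale and width_scale respectively.
w = tf.exp(tw) * wa
h = tf.exp(th) * ha
ycenter = ty * ha + ycenter_a
xcenter = tx * wa + xcenter_a
ymin = ycenter - h / 2.
xmin = xcenter - w / 2.
ymax = ycenter + h / 2.
xmax = xcenter + w / 2.
Output: [ymin, xmin, ymax, xmax]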
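The decoding formulas can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the function name decode_box is hypothetical, and the default scale factors are taken from the faster_rcnn_box_coder block of the configuration example.

```python
import math


def decode_box(box_encoding, anchor, scale_factors=(10.0, 10.0, 5.0, 5.0)):
    """Decode [ty, tx, th, tw] against an anchor [ycenter_a, xcenter_a, ha, wa].

    Returns corner coordinates [ymin, xmin, ymax, xmax].
    """
    ty, tx, th, tw = box_encoding
    ycenter_a, xcenter_a, ha, wa = anchor
    y_scale, x_scale, height_scale, width_scale = scale_factors

    # Undo the scale factors first, then invert the encoding.
    ty, tx = ty / y_scale, tx / x_scale
    th, tw = th / height_scale, tw / width_scale

    w = math.exp(tw) * wa
    h = math.exp(th) * ha
    ycenter = ty * ha + ycenter_a
    xcenter = tx * wa + xcenter_a
    return [ycenter - h / 2.0, xcenter - w / 2.0,
            ycenter + h / 2.0, xcenter + w / 2.0]


anchor = [0.5, 0.5, 0.2, 0.4]
# An all-zero encoding decodes to the anchor itself, in corner form.
decoded = decode_box([0.0, 0.0, 0.0, 0.0], anchor)
print(decoded)
# A positive ty shifts the box down by ty / y_scale anchor heights.
shifted = decode_box([5.0, 0.0, 0.0, 0.0], anchor)
print(shifted)
```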
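Loss function
Classification loss: tf.nn.sigmoid_cross_entropy_with_logits
Localization loss: smoothL1
The classification loss and the localization loss are each multiplied by their weight (classification_weight, localization_weight) and normalized by the number of matched anchors (normalize_loss_by_num_matches: true).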
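The two per-element losses and the final weighting can be written out in plain Python. This is a sketch, not the library code: the function names are hypothetical, the smooth L1 transition point is assumed to be 1.0, and the normalizer is assumed to be max(num_matches, 1). Hard example mining, which the configuration example also enables, is left out.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Numerically stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def smooth_l1(diff):
    """Quadratic near zero, linear for |diff| >= 1."""
    abs_diff = abs(diff)
    return 0.5 * diff * diff if abs_diff < 1.0 else abs_diff - 0.5


def total_loss(cls_losses, loc_losses, num_matches,
               classification_weight=1.0, localization_weight=1.0):
    """Weighted sum of both losses, normalized by the number of matches."""
    normalizer = max(float(num_matches), 1.0)
    cls = classification_weight * sum(cls_losses) / normalizer
    loc = localization_weight * sum(loc_losses) / normalizer
    return cls + loc


cls_example = sigmoid_cross_entropy(0.0, 1.0)   # log(2): an undecided logit
loc_small = smooth_l1(0.5)                      # quadratic branch
loc_large = smooth_l1(2.0)                      # linear branch
loss = total_loss([1.0, 2.0], [0.5], num_matches=2)
print(cls_example, loc_small, loc_large, loss)
```

Sigmoid cross entropy treats every class as an independent binary decision, which matches the score_converter: SIGMOID setting used at post-processing time.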
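Evaluation
Preprocessing
- Resize the input image to 300 x 300 with the BILINEAR method
- Horizontal flip and random crop are training-time augmentations (data_augmentation_options in train_config) and are not applied during evaluation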
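Feature extraction
- preprocessed_inputs is passed through MobileNetV1 up to Conv2d_13_pointwise to obtain the feature maps image_feature
- Six feature maps are then produced, in order, from Conv2d_11_pointwise, Conv2d_13_pointwise, and four extra stages appended after Conv2d_13_pointwise (each a 1x1 convolution followed by a 3x3 stride-2 convolution):
  Conv2d_11_pointwise,
  Conv2d_13_pointwise,
  Conv2d_13_pointwise_1_Conv2d_2_1x1_256, Conv2d_13_pointwise_2_Conv2d_2_3x3_s2_512,
  Conv2d_13_pointwise_1_Conv2d_3_1x1_128, Conv2d_13_pointwise_2_Conv2d_3_3x3_s2_256,
  Conv2d_13_pointwise_1_Conv2d_4_1x1_128, Conv2d_13_pointwise_2_Conv2d_4_3x3_s2_256,
  Conv2d_13_pointwise_1_Conv2d_5_1x1_64, Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128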
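Appendix
Anchor generation
The 6 output feature maps generate 3, 6, 6, 6, 6, 6 anchors per grid cell, in that order. For feature_map[i], the image is divided into feature_map[i] grid cells per side; with the ssd_anchor_generator defaults, base_anchor_size is 1.0 in normalized coordinates (not 256), anchor_stride is base_anchor_size / feature_map[i], and anchor_offset is the center of each grid cell.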
smin = 0.2
smax = 0.95
scales = [0.2, 0.35, 0.5, 0.65, 0.80, 0.95, 1.0]
# Each element below is (scale, aspect_ratio)
[
  [(0.1, 1.0), (0.2, 2.0), (0.2, 0.5)],
  [(0.35, 1.0), (0.35, 2.0), (0.35, 3.0), (0.35, 1.0/2), (0.35, 1.0/3), (sqrt(0.35 * 0.50), 1.0)],
  [(0.50, 1.0), (0.50, 2.0), (0.50, 3.0), (0.50, 1.0/2), (0.50, 1.0/3), (sqrt(0.50 * 0.65), 1.0)],
  [(0.65, 1.0), (0.65, 2.0), (0.65, 3.0), (0.65, 1.0/2), (0.65, 1.0/3), (sqrt(0.65 * 0.80), 1.0)],
  [(0.80, 1.0), (0.80, 2.0), (0.80, 3.0), (0.80, 1.0/2), (0.80, 1.0/3), (sqrt(0.80 * 0.95), 1.0)],
  [(0.95, 1.0), (0.95, 2.0), (0.95, 3.0), (0.95, 1.0/2), (0.95, 1.0/3), (sqrt(0.95 * 1.00), 1.0)]
]
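The table above can be reproduced from smin, smax and the aspect ratios. The sketch below is an illustration rather than the library's anchor generator: the function name ssd_anchor_specs is hypothetical, the aspect ratios are listed in the order used in the table, and the special three-anchor first layer is hard-coded as shown there.

```python
import math


def ssd_anchor_specs(num_layers=6, min_scale=0.2, max_scale=0.95,
                     aspect_ratios=(1.0, 2.0, 3.0, 1.0 / 2, 1.0 / 3)):
    """Return, per layer, a list of (scale, aspect_ratio) pairs."""
    # Scales are spaced linearly from min_scale to max_scale, plus 1.0.
    scales = [min_scale + (max_scale - min_scale) * i / (num_layers - 1)
              for i in range(num_layers)] + [1.0]

    specs = []
    for layer in range(num_layers):
        scale, next_scale = scales[layer], scales[layer + 1]
        if layer == 0:
            # The lowest layer uses a reduced set of three anchors.
            specs.append([(0.1, 1.0), (scale, 2.0), (scale, 0.5)])
        else:
            layer_specs = [(scale, ratio) for ratio in aspect_ratios]
            # Extra square anchor at the geometric mean of adjacent scales.
            layer_specs.append((math.sqrt(scale * next_scale), 1.0))
            specs.append(layer_specs)
    return specs


specs = ssd_anchor_specs()
anchors_per_cell = [len(layer) for layer in specs]
print(anchors_per_cell)  # [3, 6, 6, 6, 6, 6]
for layer in specs:
    print([(round(s, 3), round(r, 3)) for s, r in layer])
```

The extra square anchors come out at roughly 0.418, 0.570, 0.721, 0.872 and 0.975, which are the square roots of the products of neighbouring scales.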
Configuration example
samples/configs/ssd_mobilenet_v1_coco.config
model {
  ssd {
    num_classes: 90
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v1'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}
train_config: {
  batch_size: 24
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/mscoco_train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
}
eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/mscoco_val.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}