训练流程
1.输入图像经过CNN的backbone获得32倍下采样的深度特征;
2.将图片给拉直形成token,并添加位置编码送入encoder中;
3.将encoder的输出以及Object Query作为decoder的输入得到解码特征;
4.将解码后的特征传入FFN得到预测特征;
5.根据预测特征计算cost matrix,并由匈牙利算法匹配GT,获得正负样本;
6.根据正负样本计算分类与回归loss。
前两章,我们分析了训练步骤的1,2,3。了解DETR是通过CNN的backbone获取的32倍下采样特征,并构建Object Query可学习编码,利用encoder-decoder结构输出解码特征。接下来我们需要理解DETR如何进行正负样本分配(label assignment),以及loss是如何根据正负样本构建的。文章代码很多,希望大家能够耐心看完。
代码分析
继续书接上回,我们将x输入self.transformer(outs_dec, _ = self.transformer)并获得输出outs_dec,outs_dec的维度为[6,2,100,256],其中第一维度6是因为保存了self.decoder中间层的输出,2是指batch-size。
self.fc_cls = Linear(in_features=256, out_features=8, bias=True),将outs_dec送入分类头self.fc_cls获得all_cls_scores(维度为[6,2,100,92]),coco数据集中存在91类别,加上背景,一共92个类别。self.reg_ffn是FFN模块,如下所示,存在两个全连接层与激活成以及dropout层。self.fc_reg = Linear(in_features=256, out_features=4, bias=True),将outs_dec送入回归头self.fc_reg获得all_bbox_preds(维度为[6,2,100,4]),其中保存了4个坐标特征。
FFN(
(activate): ReLU(inplace=True)
(layers): Sequential(
(0): Sequential(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): ReLU(inplace=True)
(2): Dropout(p=0.0, inplace=False)
)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Dropout(p=0.0, inplace=False)
)
(dropout_layer): Identity()
)
def forward_single(self, x, img_metas):
""""Forward function for a single feature level.
Args:
x (Tensor): Input feature from backbone's single stage, shape
[bs, c, h, w].
img_metas (list[dict]): List of image information.
Returns:
all_cls_scores (Tensor): Outputs from the classification head,
shape [nb_dec, bs, num_query, cls_out_channels]. Note
cls_out_channels should includes background.
all_bbox_preds (Tensor): Sigmoid outputs from the regression
head with normalized coordinate format (cx, cy, w, h).
Shape [nb_dec, bs, num_query, 4].
"""
# construct binary masks which used for the transformer.
# NOTE following the official DETR repo, non-zero values representing
# ignored positions, while zero values means valid positions.
batch_size = x.size(0)
input_img_h, input_img_w = img_metas[0]['batch_input_shape']
masks = x.new_ones((batch_size, input_img_h, input_img_w))
for img_id in range(batch_size):
img_h, img_w, _ = img_metas[img_id]['img_shape']
masks[img_id, :img_h, :img_w] = 0
x = self.input_proj(x)
# interpolate masks to have the same spatial shape with x
masks = F.interpolate(
masks.unsqueeze(1), size=x.shape[-2:]).to(torch.bool).squeeze(1)
# position encoding
pos_embed = self.positional_encoding(masks) # [bs, embed_dim, h, w]
# outs_dec: [nb_dec, bs, num_query, embed_dim]
outs_dec, _ = self.transformer(x, masks, self.query_embedding.weight,
pos_embed)
all_cls_scores = self.fc_cls(outs_dec)
all_bbox_preds = self.fc_reg(self.activate(
self.reg_ffn(outs_dec))).sigmoid()
return all_cls_scores, all_bbox_preds
程序从forward_single到loss_single过程简单,不赘述了。loss_single传入的参数cls_scores的维度是[2,100,92], bbox_preds的维度为[2,100,4],gt_bboxes_list存放了batchsize=2个图片的坐标框,gt_labels_list存放了图片类别。
def loss_single(self,
cls_scores,
bbox_preds,
gt_bboxes_list,
gt_labels_list,
img_metas,
gt_bboxes_ignore_list=None):
""""Loss function for outputs from a single decoder layer of a single
feature level.
Args:
cls_scores (Tensor): Box score logits from a single decoder layer
for all images. Shape [bs, num_query, cls_out_channels].
bbox_preds (Tensor): Sigmoid outputs from a single decoder layer
for all images, with normalized coordinate (cx, cy, w, h) and
shape [bs, num_query, 4].
gt_bboxes_list (list[Tensor]): Ground truth bboxes for each image
with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
gt_labels_list (list[Tensor]): Ground truth class indices for each
image with shape (num_gts, ).
img_metas (list[dict]): List of image meta information.
gt_bboxes_ignore_list (list[Tensor], optional): Bounding
boxes which can be ignored for each image. Default None.
Returns:
dict[str, Tensor]: A dictionary of loss components for outputs from
a single decoder layer.
"""
num_imgs = cls_scores.size(0)
cls_scores_list = [cls_scores[i] for i in range(num_imgs)]
bbox_preds_list = [bbox_preds[i] for i in range(num_imgs)]
cls_reg_targets = self.get_targets(cls_scores_list, bbox_preds_list,
gt_bboxes_list, gt_labels_list,
img_metas, gt_bboxes_ignore_list)
(labels_list, label_weights_list, bbox_targets_list, bbox_weights_list,
num_total_pos, num_total_neg) = cls_reg_targets
labels = torch.cat(labels_list, 0)
label_weights = torch.cat(label_weights_list, 0)
bbox_targets = torch.cat(bbox_targets_list, 0)
bbox_weights = torch.cat(bbox_weights_list, 0)
# classification loss
cls_scores = cls_scores.reshape(-1, self.cls_out_channels)
# construct weighted avg_factor to match with the official DETR repo
cls_avg_factor = num_total_pos * 1.0 + \
num_total_neg * self.bg_cls_weight
if self.sync_cls_avg_factor:
cls_avg_factor = reduce_mean(
cls_scores.new_tensor([cls_avg_factor]))
cls_avg_factor = max(cls_avg_factor, 1)
loss_cls = self.loss_cls(
cls_scores, labels, label_weights, avg_factor=cls_avg_factor)
# Compute the average number of gt boxes across all gpus, for
# normalization purposes
num_total_pos = loss_cls.new_tensor([num_total_pos])
num_total_pos = torch.clamp(reduce_mean(num_total_pos), min=1).item()
# construct factors used for rescale bboxes
factors = []
for img_meta, bbox_pred in zip(img_metas, bbox_preds):
img_h, img_w, _ = img_meta['img_shape']
factor = bbox_pred.new_tensor([img_w, img_h, img_w,
img_h]).unsqueeze(0).repeat(
bbox_pred.size(0), 1)
factors.append(factor)
factors = torch.cat(factors, 0)
# DETR regress the relative position of boxes (cxcywh) in the image,
# thus the learning target is normalized by the image size. So here
# we need to re-scale them for calculating IoU loss
bbox_preds = bbox_preds.reshape(-1, 4)
bboxes = bbox_cxcywh_to_xyxy(bbox_preds) * factors
bboxes_gt = bbox_cxcywh_to_xyxy(bbox_targets) * factors
# regression IoU loss, defaultly GIoU loss
loss_iou = self.loss_iou(
bboxes, bboxes_gt, bbox_weights, avg_factor=num_total_pos)
# regression L1 loss
loss_bbox = self.loss_bbox(
bbox_preds, bbox_targets, bbox_weights, avg_factor=num_total_pos)
return loss_cls, loss_bbox, loss_iou
进一步来到_get_target_single,这里对每一张图片匹配正负样本点,其中输入的参数没有batch的维度,即cls_score的维度为[100,8]。self.assigner在detr_head初始化时已经激活,它是来自HungarianAssigner类的一个对象。
def _get_target_single(self,
cls_score,
bbox_pred,
gt_bboxes,
gt_labels,
img_meta,
gt_bboxes_ignore=None):
""""Compute regression and classification targets for one image.
Outputs from a single decoder layer of a single feature level are used.
Args:
cls_score (Tensor): Box score logits from a single decoder layer
for one image. Shape [num_query, cls_out_channels].
bbox_pred (Tensor): Sigmoid outputs from a single decoder layer
for one image, with normalized coordinate (cx, cy, w, h) and
shape [num_query, 4].
gt_bboxes (Tensor): Ground truth bboxes for one image with
shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
gt_labels (Tensor): Ground truth class indices for one image
with shape (num_gts, ).
img_meta (dict): Meta information for one image.
gt_bboxes_ignore (Tensor, optional): Bounding boxes
which can be ignored. Default None.
Returns:
tuple[Tensor]: a tuple containing the following for one image.
- labels (Tensor): Labels of each image.
- label_weights (Tensor]): Label weights of each image.
- bbox_targets (Tensor): BBox targets of each image.
- bbox_weights (Tensor): BBox weights of each image.
- pos_inds (Tensor): Sampled positive indices for each image.
- neg_inds (Tensor): Sampled negative indices for each image.
"""
num_bboxes = bbox_pred.size(0)
# assigner and sampler
assign_result = self.assigner.assign(bbox_pred, cls_score, gt_bboxes,
gt_labels, img_meta,
gt_bboxes_ignore)
sampling_result = self.sampler.sample(assign_result, bbox_pred,
gt_bboxes)
pos_inds = sampling_result.pos_inds
neg_inds = sampling_result.neg_inds
# label targets
labels = gt_bboxes.new_full((num_bboxes, ),
self.num_classes,
dtype=torch.long)
labels[pos_inds] = gt_labels[sampling_result.pos_assigned_gt_inds]
label_weights = gt_bboxes.new_ones(num_bboxes)
# bbox targets
bbox_targets = torch.zeros_like(bbox_pred)
bbox_weights = torch.zeros_like(bbox_pred)
bbox_weights[pos_inds] = 1.0
img_h, img_w, _ = img_meta['img_shape']
# DETR regress the relative position of boxes (cxcywh) in the image.
# Thus the learning target should be normalized by the image size, also
# the box format should be converted from defaultly x1y1x2y2 to cxcywh.
factor = bbox_pred.new_tensor([img_w, img_h, img_w,
img_h]).unsqueeze(0)
pos_gt_bboxes_normalized = sampling_result.pos_gt_bboxes / factor
pos_gt_bboxes_targets = bbox_xyxy_to_cxcywh(pos_gt_bboxes_normalized)
bbox_targets[pos_inds] = pos_gt_bboxes_targets
return (labels, label_weights, bbox_targets, bbox_weights, pos_inds,
neg_inds)
继续来到核心函数assign中,num_gts表示GT个数,num_bboxes=100,img_meta[‘img_shape’]表示现在图片的尺寸(经过前处理即增强后的)。DETR的label assignment的主要思想是将目标检测问题转变为集合问题,通过pred的cls以及iou构建cost matrix,并利用匈牙利算法获得最优匹配。由于正样本与GT是一一对应的,所以没有anchor以及nms的过程
cost_matrix一部分由ClassificationCost组成,如下所示,首先对cls_pred(维度[100,92])最后一维进行softmax激活,获得类别得分。gt_labels表示该图GT的类别,cls_score[:, gt_labels]将gt对应的类选出并赋值给cls_cost (维度[100,num_gts])。
cls_score = cls_pred.softmax(-1)
cls_cost = -cls_score[:, gt_labels]
return cls_cost * self.weight
cost_matrix一部分是BBoxL1Cost,因为bbox_pred回归经过sigmoid归一化,所以gt_bboxes也需要进行归一化。如下所示,先将xyxy格式的gt改成xywh格式与bboxo_pred一致,然后计算每个pred对应GT坐标的距离bbox_cost (维度[100,num_gts])。
gt_bboxes = bbox_xyxy_to_cxcywh(gt_bboxes)
bbox_cost = torch.cdist(bbox_pred, gt_bboxes, p=1)
return bbox_cost * self.weight
cost_matrix另一部分是由IoUCost构成,先将bbox_pred格式改成xyxy再rescale成图片尺寸大小。如下所示,计算pred与GT的iou。
注意,这里iou_cost 是-iou,所以值越小代表pred与GT越接近,cls_cost 也是负值,同样值越小表示该类别置信度越高,bbox_cost 表示pred与GT坐标的距离,值越小表示与GT越接近。因此,将三者相加,(cost = cls_cost + reg_cost + iou_cost),那么cost的值越小表示该pred与该GT越应该匹配。
matched_row_inds, matched_col_inds = linear_sum_assignment(cost),cost即二分图匹配的cost matrix,输入linear_sum_assignment能够获得Object query与GT的最优匹配(一个query对应一个GT)。最后,将assignment的结果存放在AssignResult类中。
loss过程与常规目标检测类似,比较简单,这里不赘述了。介绍一下detr使用的三种loss,回归使用Giou loss,分类使用交叉熵loss,不同的是,DETR使用了L1 loss帮助坐标回归。应该是当年还没有DIOU的loss,GIOU没有强调GT与pred中心点的距离,所以添加了L1 loss来补充。
overlaps = bbox_overlaps(
bboxes, gt_bboxes, mode=self.iou_mode, is_aligned=False)
# The 1 is a constant that doesn’t change the matching, so omitted.
iou_cost = -overlaps
return iou_cost * self.weight
def assign(self,
bbox_pred,
cls_pred,
gt_bboxes,
gt_labels,
img_meta,
gt_bboxes_ignore=None,
eps=1e-7):
"""Computes one-to-one matching based on the weighted costs.
This method assign each query prediction to a ground truth or
background. The `assigned_gt_inds` with -1 means don't care,
0 means negative sample, and positive number is the index (1-based)
of assigned gt.
The assignment is done in the following steps, the order matters.
1. assign every prediction to -1
2. compute the weighted costs
3. do Hungarian matching on CPU based on the costs
4. assign all to 0 (background) first, then for each matched pair
between predictions and gts, treat this prediction as foreground
and assign the corresponding gt index (plus 1) to it.
Args:
bbox_pred (Tensor): Predicted boxes with normalized coordinates
(cx, cy, w, h), which are all in range [0, 1]. Shape
[num_query, 4].
cls_pred (Tensor): Predicted classification logits, shape
[num_query, num_class].
gt_bboxes (Tensor): Ground truth boxes with unnormalized
coordinates (x1, y1, x2, y2). Shape [num_gt, 4].
gt_labels (Tensor): Label of `gt_bboxes`, shape (num_gt,).
img_meta (dict): Meta information for current image.
gt_bboxes_ignore (Tensor, optional): Ground truth bboxes that are
labelled as `ignored`. Default None.
eps (int | float, optional): A value added to the denominator for
numerical stability. Default 1e-7.
Returns:
:obj:`AssignResult`: The assigned result.
"""
assert gt_bboxes_ignore is None, \
'Only case when gt_bboxes_ignore is None is supported.'
num_gts, num_bboxes = gt_bboxes.size(0), bbox_pred.size(0)
# 1. assign -1 by default
assigned_gt_inds = bbox_pred.new_full((num_bboxes, ),
-1,
dtype=torch.long)
assigned_labels = bbox_pred.new_full((num_bboxes, ),
-1,
dtype=torch.long)
if num_gts == 0 or num_bboxes == 0:
# No ground truth or boxes, return empty assignment
if num_gts == 0:
# No ground truth, assign all to background
assigned_gt_inds[:] = 0
return AssignResult(
num_gts, assigned_gt_inds, None, labels=assigned_labels)
img_h, img_w, _ = img_meta['img_shape']
factor = gt_bboxes.new_tensor([img_w, img_h, img_w,
img_h]).unsqueeze(0)
# 2. compute the weighted costs
# classification and bboxcost.
cls_cost = self.cls_cost(cls_pred, gt_labels)
# regression L1 cost
normalize_gt_bboxes = gt_bboxes / factor
reg_cost = self.reg_cost(bbox_pred, normalize_gt_bboxes)
# regression iou cost, defaultly giou is used in official DETR.
bboxes = bbox_cxcywh_to_xyxy(bbox_pred) * factor
iou_cost = self.iou_cost(bboxes, gt_bboxes)
# weighted sum of above three costs
cost = cls_cost + reg_cost + iou_cost
# 3. do Hungarian matching on CPU using linear_sum_assignment
cost = cost.detach().cpu()
if linear_sum_assignment is None:
raise ImportError('Please run "pip install scipy" '
'to install scipy first.')
matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
matched_row_inds = torch.from_numpy(matched_row_inds).to(
bbox_pred.device)
matched_col_inds = torch.from_numpy(matched_col_inds).to(
bbox_pred.device)
# 4. assign backgrounds and foregrounds
# assign all indices to backgrounds first
assigned_gt_inds[:] = 0
# assign foregrounds based on matching results
assigned_gt_inds[matched_row_inds] = matched_col_inds + 1
assigned_labels[matched_row_inds] = gt_labels[matched_col_inds]
return AssignResult(
num_gts, assigned_gt_inds, None, labels=assigned_labels)
结论
可以看出DETR的效果还是很不错的,基于ResNet50的DETR取得了和经过各种finetune的Faster-RCNN相媲美的效果。同时DETR在大目标检测上性能是最好的,但是小目标上稍差。这个比较容易理解,因为DETR将32倍下采样的特征输入encoder-decoder,相对于CNN,自注意力具有全局视野,自由度更高,因此对于大尺寸目标,其检测效果更佳,但也由于缺少细节信息(只有32倍下采样特征,同时缺少类似FPN融合机制),DETR无法很好的处理小目标。
而且基于match的loss导致学习很难收敛。在传统检测任务中,loss可用借助人工设计的anchor,直接根据anchor位置(每个GT对应的anchor是固定的)对相应卷积进行优化。而在DETR中,没有anchor的概念,DETR设计了Object query让模型自己学习对应GT的位置信息(与训练集的数据分布强相关),导致DETR收敛困难,所需的数据量大,但也正因为DETR模型自由度大,可视化自注意力特征,该特征能够完全分离实例,这是传统检测做不到的。
对于上面的两个问题,Deformable DETR较好的解决了这两个问题。下面我们将带着大家一起剖析Deformable DETR。