[Dive into Deep Learning v2] Generating Anchor Boxes

Author: 岁余十二

First published 2023-07-18 22:34:13; last modified 2023-07-20 14:38:34

Column: 动手学深度学习v2 (Dive into Deep Learning v2). Tags: deep learning, artificial intelligence

Article link: https://blog.csdn.net/Tracy_yi/article/details/131796840

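This article walks through how anchor boxes are generated for object detection in deep learning, including computing anchor boxes for different aspect ratios and scales and matching them to ground-truth bounding boxes by IoU. It also covers non-maximum suppression for predicted bounding boxes, which reduces overlapping detections.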

Generating Multiple Anchor Boxes

First, the example code:

import torch

def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes of different shapes centered on each pixel."""
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = torch.tensor(sizes, device=device)  # tensor of scales
    ratio_tensor = torch.tensor(ratios, device=device)  # tensor of aspect ratios
    # Offsets move each anchor point to the center of its pixel.
    # A pixel has height 1 and width 1, so the center is offset by 0.5.
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # scaled step along the y axis
    steps_w = 1.0 / in_width  # scaled step along the x axis

    # Generate the center points of all anchor boxes
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    # torch.meshgrid builds the grid; after flattening,
    # (shift_x[i], shift_y[i]) is one center point
    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)
    # Generate "boxes_per_pixel" widths and heights, later used to build
    # the corner coordinates (xmin, ymin, xmax, ymax) of the anchor boxes
    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))
    # w and h are the widths and heights of the anchor boxes;
    # divide by 2 to get half widths and half heights
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2

    # Every center point gets "boxes_per_pixel" anchor boxes, so the grid
    # of centers is repeated "boxes_per_pixel" times
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)

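Start with the first few lines:

in_height, in_width = data.shape[-2:]
device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
boxes_per_pixel = (num_sizes + num_ratios - 1)
size_tensor = torch.tensor(sizes, device=device)  # tensor of scales
ratio_tensor = torch.tensor(ratios, device=device)  # tensor of aspect ratios

These lines read the height and width of the input image (the last two dimensions of data) and record the device, the number of scales, the number of aspect ratios, and the number of anchor boxes per pixel. The count is num_sizes + num_ratios - 1 because only combinations that contain sizes[0] or ratios[0] are kept, and the pair (sizes[0], ratios[0]) is counted once.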

offset_h, offset_w = 0.5, 0.5
steps_h = 1.0 / in_height  # scaled step along the y axis
steps_w = 1.0 / in_width  # scaled step along the x axis

center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h  
center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')  
shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

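center_h and center_w hold the vertical (y) and horizontal (x) coordinates of the center points, respectively. Both are expressed as fractions of the image height and width, so every value lies between 0 and 1.
Next comes torch.meshgrid, which builds a grid that can be used as coordinates. It takes two one-dimensional tensors of the same dtype. With indexing='ij', each of the two output tensors has as many rows as the first input has elements and as many columns as the second input has elements. Inputs with different dtypes, or with more than one dimension, raise an error.
For example: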

t1 = torch.tensor([1,2,3])
t2 = torch.tensor([2,3,4])
torch.meshgrid(t1,t2, indexing='ij')

Output:

(tensor([[1, 1, 1],
         [2, 2, 2],
         [3, 3, 3]]),
 tensor([[2, 3, 4],
         [2, 3, 4],
         [2, 3, 4]]))

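After torch.meshgrid builds the grid, shift_y and shift_x are both flattened to one dimension, so that for every i, (shift_x[i], shift_y[i]) is the center point of the anchor boxes at one pixel.

Next comes the code that puzzled me for a long time: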
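As a rough illustration of the center computation, here is a minimal sketch in plain Python (no PyTorch) that reproduces the same normalized centers in the same row-major order. The helper name anchor_centers is made up for this example and is not part of d2l.

```python
def anchor_centers(in_height, in_width):
    """Return normalized (x, y) pixel centers in row-major order."""
    centers = []
    for row in range(in_height):        # y index, varies slowest
        for col in range(in_width):     # x index, varies fastest
            x = (col + 0.5) / in_width  # same as (arange + offset_w) * steps_w
            y = (row + 0.5) / in_height # same as (arange + offset_h) * steps_h
            centers.append((x, y))
    return centers

# A hypothetical 2 x 4 feature map: 8 centers, all strictly inside (0, 1)
centers = anchor_centers(2, 4)
print(centers[0])  # first pixel of the first row
print(centers[4])  # first pixel of the second row
```

The order matches flattening a meshgrid built with indexing='ij': x changes fastest, y changes once per row.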

w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # handle rectangular inputs
h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))

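The difficulty is that the formula given in the book and the code do not look the same. The explanation found online is that $r$ is not the aspect ratio of the anchor box itself, but the ratio between the anchor box's aspect ratio and the image's aspect ratio:

Let $w, h$ be the image width and height and $w', h'$ the anchor box width and height. $r$ is the anchor box's aspect ratio divided by the image's aspect ratio, i.e. $\frac{w'}{h'} = \frac{w}{h} r$, and $s$ is the scale factor relative to the image size, i.e. $w'h' = whs^2$. Solving the two equations together gives the anchor box width and height from the book: $w' = ws\sqrt{r}$, $h' = \frac{hs}{\sqrt{r}}$. Dividing by $w$ and $h$ to normalize gives $s\sqrt{r}$ and $\frac{s}{\sqrt{r}}$, which are the expressions in the code. The extra factor in_height / in_width applied to w changes the result for non-square inputs: the pixel width becomes $hs\sqrt{r}$ and the pixel height $\frac{hs}{\sqrt{r}}$, so for the code as written the anchor box's own pixel aspect ratio is $r$.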
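To make the width and height arithmetic concrete, here is a small sketch in plain Python that mirrors the two torch.cat expressions. The function name anchor_shapes and the 100 x 200 image size are assumptions chosen for illustration.

```python
import math

def anchor_shapes(sizes, ratios, in_height, in_width):
    """Return normalized (w, h) pairs in the same order as the torch code."""
    # All sizes paired with ratios[0], then sizes[0] paired with the other ratios
    pairs = [(s, ratios[0]) for s in sizes] + [(sizes[0], r) for r in ratios[1:]]
    shapes = []
    for s, r in pairs:
        w = s * math.sqrt(r) * in_height / in_width  # rectangular-input factor
        h = s / math.sqrt(r)
        shapes.append((w, h))
    return shapes

in_height, in_width = 100, 200  # hypothetical non-square image
shapes = anchor_shapes([0.75, 0.5, 0.25], [1, 2, 0.5], in_height, in_width)
for w, h in shapes:
    # Convert back to pixels to inspect the real aspect ratio
    print(round(w * in_width, 3), round(h * in_height, 3))
```

With three sizes and three ratios the list has 3 + 3 - 1 = 5 entries, and converting back to pixels shows each box with a width-to-height ratio equal to the ratio it was built from.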

anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2
out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
output = out_grid + anchor_manipulations

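In the code that follows, anchor_manipulations is built with torch.stack. When no dimension is given, dim defaults to 0, which here amounts to stacking the row vectors vertically. The .T then transposes the result so that each row is (-w, -h, w, h) for one anchor shape.
For example: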

t1 = torch.tensor([1,2,3])
t2 = torch.tensor([2,3,4])
torch.stack((t1,t2))

Output:

tensor([[1, 2, 3],
        [2, 3, 4]])

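Here repeat tiles the whole tensor along dim 0, so copies of it are stacked one below another (3 copies along dim 0 and 1 along dim 1 in the example below).
For example: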

t1 = torch.tensor([1,2,3])
t2 = torch.tensor([2,3,4])
t = torch.stack((t1,t2))
t.repeat(3,1)

Output:

tensor([[1, 2, 3],
        [2, 3, 4],
        [1, 2, 3],
        [2, 3, 4],
        [1, 2, 3],
        [2, 3, 4]])

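In the end anchor_manipulations has shape (number of pixels in the image * number of anchor boxes per pixel, 4), and each row is (-half width, -half height, half width, half height).
Each row of out_grid is (center x, center y, center x, center y), and repeat_interleave repeats every such row boxes_per_pixel times in succession. Adding out_grid and anchor_manipulations therefore gives the upper-left and lower-right corner coordinates of every anchor box.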
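The following plain-Python sketch shows the same combination step on a tiny input. The names to_corners and all_anchors are hypothetical; the nested loop order is meant to mirror how repeat_interleave (centers) and repeat (shapes) line up row by row.

```python
def to_corners(center, shape):
    """Turn a center (cx, cy) and a shape (w, h) into (xmin, ymin, xmax, ymax)."""
    cx, cy = center
    w, h = shape
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def all_anchors(centers, shapes):
    # Outer loop over centers: each center is used len(shapes) times in a row,
    # like repeat_interleave. Inner loop over shapes: the shape list is tiled
    # once per center, like repeat.
    return [to_corners(c, s) for c in centers for s in shapes]

centers = [(0.25, 0.5), (0.75, 0.5)]   # two made-up center points
shapes = [(0.5, 0.5), (0.25, 1.0)]     # two made-up (w, h) pairs
anchors = all_anchors(centers, shapes)
for box in anchors:
    print(box)
```

Swapping the two loops would pair centers and shapes in a different order, which is why the torch code uses repeat_interleave for one operand and repeat for the other.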

Computing the Intersection over Union (IoU)

$J(\mathcal{A},\mathcal{B}) = \frac{\left|\mathcal{A} \cap \mathcal{B}\right|}{\left| \mathcal{A} \cup \mathcal{B}\right|}.$
Example code:

#@save
def box_iou(boxes1, boxes2):
    """Compute pairwise IoU across two lists of anchor or bounding boxes."""
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    # Shapes of boxes1, boxes2, areas1, areas2:
    # boxes1: (number of boxes1, 4),
    # boxes2: (number of boxes2, 4),
    # areas1: (number of boxes1,),
    # areas2: (number of boxes2,)
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    # Shape of inter_upperlefts, inter_lowerrights, inters:
    # (number of boxes1, number of boxes2, 2)
    inter_upperlefts = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clamp(min=0)
    # Shape of inter_areas and union_areas: (number of boxes1, number of boxes2)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas

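First a lambda function, box_area, is defined to compute the area of every box in a list of boxes. Then, for every pair of boxes, the upper-left and lower-right corners of their intersection are computed; the None index adds an axis so that broadcasting pairs every box in boxes1 with every box in boxes2. For example, if the upper-left corner of the first box is (1, -1) and that of the second box is (0, 1), the corresponding entry of inter_upperlefts is (1, 1), the element-wise maximum. inter_upperlefts[i][j] is the upper-left corner of the intersection of boxes1[i] and boxes2[j], and inter_lowerrights works the same way with the element-wise minimum.
Finally the intersection area is computed. Not every pair of boxes overlaps: for a disjoint pair the computed width or height of the intersection is negative, and clamp(min=0) sets it to 0, so the intersection area of that pair is 0.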
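For a single pair of boxes the same computation can be sketched in plain Python, which makes the role of the clamp easier to see. The function name iou is an assumption for this example; boxes are (xmin, ymin, xmax, ymax) tuples.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection corners: max of upper-lefts, min of lower-rights.
    # max(0.0, ...) plays the role of clamp(min=0) for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

overlapping = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))     # no overlap
print(overlapping, disjoint)
```

Without the clamp, the disjoint pair would produce a negative width times a negative height, i.e. a spurious positive "intersection".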

Assigning Ground-Truth Bounding Boxes to Anchor Boxes

Example code:

#@save
def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign the closest ground-truth bounding box to each anchor box."""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # Element x_ij in row i, column j is the IoU of anchor box i and
    # ground-truth bounding box j
    jaccard = box_iou(anchors, ground_truth)
    # Tensor holding the ground-truth box assigned to each anchor box (-1: none)
    anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                                  device=device)
    # Decide by threshold whether to assign a ground-truth bounding box
    max_ious, indices = torch.max(jaccard, dim=1)
    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = torch.full((num_anchors,), -1)
    row_discard = torch.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = torch.argmax(jaccard)  # flat index of the largest IoU
        box_idx = (max_idx % num_gt_boxes).long()  # column: ground-truth box
        anc_idx = (max_idx / num_gt_boxes).long()  # row: anchor box
        anchors_bbox_map[anc_idx] = box_idx
        # Exclude this column and row from later matches
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map

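jaccard[i][j] is the IoU of anchor box i and ground-truth box j.
anchors_bbox_map[i] is the index of the ground-truth box matched to anchor box i, or -1 if there is none.
max_ious and indices are two one-dimensional tensors: the largest element in each row of the jaccard matrix, and the index of the ground-truth box it belongs to.
anc_i is also a one-dimensional tensor, holding the positions in max_ious whose value is at least iou_threshold.
box_j holds the ground-truth box indices for those same positions.

With these, anchors_bbox_map records the matches. The order here differs from the lecture. The lecture first finds the largest value in the whole matrix, makes the match, discards that row and column, and repeats until only $n_a - n_b$ anchor boxes remain ($n_a$ and $n_b$ are the numbers of anchor boxes and ground-truth boxes); only then are the remaining anchor boxes filtered by IoU. The code instead filters by IoU first and runs the matching loop afterwards. The two are equivalent: every anchor box first receives its threshold-based value, and the loop then overwrites some of those values.
Then comes the loop, which runs once per ground-truth box. Each iteration finds the largest element of the jaccard matrix, recovers its row and column from the flat index, records the match, and sets that row and column to -1 so that they cannot take part in later matches.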
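The two phases can be traced on a small hand-made IoU matrix. This is a sketch in plain Python under the assumption that the matrix is a list of rows (anchors) by columns (ground-truth boxes); assign_anchors and the numbers are invented for illustration.

```python
def assign_anchors(iou_matrix, iou_threshold=0.5):
    """Return, per anchor, the index of its ground-truth box or -1."""
    num_anchors, num_gt = len(iou_matrix), len(iou_matrix[0])
    m = [row[:] for row in iou_matrix]  # working copy; the loop overwrites it
    assignment = [-1] * num_anchors
    # Phase 1: each anchor takes its best ground-truth box if IoU >= threshold
    for i in range(num_anchors):
        j = max(range(num_gt), key=lambda k: m[i][k])
        if m[i][j] >= iou_threshold:
            assignment[i] = j
    # Phase 2: once per ground-truth box, take the global maximum and
    # blank out its row and column with -1
    for _ in range(num_gt):
        i, j = max(((a, b) for a in range(num_anchors) for b in range(num_gt)),
                   key=lambda ab: m[ab[0]][ab[1]])
        assignment[i] = j
        for a in range(num_anchors):
            m[a][j] = -1
        for b in range(num_gt):
            m[i][b] = -1
    return assignment

iou_matrix = [[0.1, 0.0],
              [0.6, 0.2],
              [0.7, 0.3],
              [0.0, 0.4]]
assignment = assign_anchors(iou_matrix)
print(assignment)
```

In this example anchor 3 ends up matched to ground-truth box 1 even though its IoU of 0.4 is below the threshold, because the loop guarantees that every ground-truth box gets at least one anchor.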

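Labeling Anchor Boxes

Given boxes $A$ and $B$ with center coordinates $(x_a, y_a)$ and $(x_b, y_b)$, widths $w_a$ and $w_b$, and heights $h_a$ and $h_b$, the offset of $A$ can be labeled as:

$\left( \frac{ \frac{x_b - x_a}{w_a} - \mu_x }{\sigma_x}, \frac{ \frac{y_b - y_a}{h_a} - \mu_y }{\sigma_y}, \frac{ \log \frac{w_b}{w_a} - \mu_w }{\sigma_w}, \frac{ \log \frac{h_b}{h_a} - \mu_h }{\sigma_h}\right),$
The example code below computes these offsets for two lists of boxes. It uses $\mu_x = \mu_y = \mu_w = \mu_h = 0$, $\sigma_x = \sigma_y = 0.1$ and $\sigma_w = \sigma_h = 0.2$, which is where the factors 10 and 5 come from: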

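from d2l import torch as d2l

#@save
def offset_boxes(anchors, assigned_bb, eps=1e-6):
    """Transform for anchor box offsets."""
    c_anc = d2l.box_corner_to_center(anchors)
    c_assigned_bb = d2l.box_corner_to_center(assigned_bb)
    # Center offsets relative to anchor size, scaled by 1 / sigma = 10
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    # Log size ratios, scaled by 1 / sigma = 5; eps avoids log(0)
    offset_wh = 5 * torch.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])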
    offset = torch.cat([offset_xy, offset_wh], axis=1)
    return offset

#@save
def multibox_target(anchors, labels):
    """Label anchor boxes using ground-truth bounding boxes."""
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.device, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(
            1, 4)
        # Initialize class labels and assigned bounding box coordinates to zero
        class_labels = torch.zeros(num_anchors, dtype=torch.long,
                                   device=device)
        assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32,
                                  device=device)
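        # Label the classes of anchor boxes using their assigned ground-truth
        # bounding boxes. An unassigned anchor box is labeled background (value 0)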
        indices_true = torch.nonzero(anchors_bbox_map >= 0)
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].long() + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # Offset transformation
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    class_labels = torch.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)

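Note that labels and anchors have different shapes: the first element of each row of labels is the class. Both also need a leading batch dimension (for example via unsqueeze(dim=0)) before being passed to multibox_target. For example: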

labels = torch.tensor([[0, 0.1, 0.08, 0.52, 0.92],
                         [1, 0.55, 0.2, 0.9, 0.88]])
anchors = torch.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                    [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                    [0.57, 0.3, 0.92, 0.9]])

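For anchor boxes whose best IoU is below iou_threshold and that were not picked in the matching loop, anchors_bbox_map[i] is -1 and the corresponding entries of bbox_mask are 0. indices_true collects the indices of the anchor boxes that were assigned a ground-truth box; class_labels and assigned_bb store their labels (shifted by 1, since 0 means background) and the coordinates of their assigned ground-truth boxes.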
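The offset formula and its inverse can be checked on one box pair with plain Python. This is only a sketch: encode and decode are hypothetical names, and decode is written here as the algebraic inverse of the formula with the constants 10 and 5, not copied from any library.

```python
import math

def corner_to_center(box):
    """(xmin, ymin, xmax, ymax) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def encode(anchor, gt, eps=1e-6):
    """Offsets of ground-truth box gt relative to anchor."""
    xa, ya, wa, ha = corner_to_center(anchor)
    xb, yb, wb, hb = corner_to_center(gt)
    return (10 * (xb - xa) / wa, 10 * (yb - ya) / ha,
            5 * math.log(eps + wb / wa), 5 * math.log(eps + hb / ha))

def decode(anchor, offset):
    """Apply offsets to an anchor and return corner coordinates."""
    xa, ya, wa, ha = corner_to_center(anchor)
    tx, ty, tw, th = offset
    xb, yb = tx * wa / 10 + xa, ty * ha / 10 + ya
    wb, hb = math.exp(tw / 5) * wa, math.exp(th / 5) * ha
    return (xb - wb / 2, yb - hb / 2, xb + wb / 2, yb + hb / 2)

anchor = (0.0, 0.0, 0.4, 0.4)  # made-up anchor box
gt = (0.1, 0.1, 0.5, 0.7)      # made-up ground-truth box
offsets = encode(anchor, gt)
restored = decode(anchor, offsets)
print(offsets)
print(restored)
```

Decoding the encoded offsets recovers the ground-truth box up to a tiny error introduced by eps, which is the same idea prediction relies on when it turns predicted offsets back into boxes.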

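Non-Maximum Suppression

For a predicted bounding box $B$, the object detection model computes a predicted probability for every class.
Let the largest predicted probability be $p$; the class corresponding to that probability is the predicted class of $B$.
We call $p$ the confidence of the predicted bounding box $B$.
Within one image, all predicted non-background bounding boxes are sorted by confidence in descending order to produce a list $L$. The sorted list $L$ is then processed in the following steps.

1. Select the predicted bounding box $B_1$ with the highest confidence in $L$ as a basis, and remove from $L$ every non-basis predicted bounding box whose IoU with $B_1$ exceeds a predefined threshold $\epsilon$. At this point $L$ keeps the predicted bounding box with the highest confidence and drops the others that are too similar to it. In short, the boxes with non-maximum confidence are suppressed.
2. Select the predicted bounding box $B_2$ with the second highest confidence in $L$ as another basis, and remove from $L$ every non-basis predicted bounding box whose IoU with $B_2$ exceeds $\epsilon$.
3. Repeat the process until every predicted bounding box in $L$ has been used as a basis. At that point the IoU of any pair of predicted bounding boxes in $L$ does not exceed the threshold $\epsilon$, so no pair is too similar.
4. Output all predicted bounding boxes in $L$.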

#@save
def nms(boxes, scores, iou_threshold):
    """Sort the confidence scores of predicted bounding boxes."""
    B = torch.argsort(scores, dim=-1, descending=True)
    keep = []  # indices of the predicted bounding boxes that are kept
    while B.numel() > 0:
        i = B[0]
        keep.append(i)
        if B.numel() == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = torch.nonzero(iou <= iou_threshold).reshape(-1)
        B = B[inds + 1]
    return torch.tensor(keep, device=boxes.device)

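The nms function takes the boxes and, for each box, its confidence (the largest predicted class probability). It returns the indices of the boxes that are kept, in descending order of confidence.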
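The same procedure can be sketched without tensors, which may help when stepping through it by hand. This is a plain-Python approximation of the idea, with an iou helper and example boxes that are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # current basis: highest remaining confidence
        keep.append(best)
        # Drop every remaining box that overlaps the basis too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0.0, 0.0, 1.0, 1.0),   # high confidence
         (0.1, 0.0, 1.1, 1.0),   # near-duplicate of the first box
         (2.0, 2.0, 3.0, 3.0)]   # separate object
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, 0.5)
print(kept)
```

The near-duplicate is suppressed because its IoU with the first box is about 0.82, while the distant box survives with an IoU of 0.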

#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """Predict bounding boxes using non-maximum suppression."""
    device, batch_size = cls_probs.device, cls_probs.shape[0]
    anchors = anchors.squeeze(0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = torch.max(cls_prob[1:], 0)
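        # offset_inverse (defined in d2l, not shown in this post) applies the
        # predicted offsets to the anchors to recover the predicted boxes
        predicted_bb = offset_inverse(anchors, offset_pred)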
        keep = nms(predicted_bb, conf, nms_threshold)

        # Find all non_keep indices and set their class to background
        all_idx = torch.arange(num_anchors, dtype=torch.long, device=device)
        combined = torch.cat((keep, all_idx))
        uniques, counts = combined.unique(return_counts=True)
        non_keep = uniques[counts == 1]
        all_id_sorted = torch.cat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted]
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # pos_threshold is a threshold for positive (non-background) predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = torch.cat((class_id.unsqueeze(1),
                               conf.unsqueeze(1),
                               predicted_bb), dim=1)
        out.append(pred_info)
    return torch.stack(out)

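First, the inputs:

cls_probs is a tensor of shape (batch size, number of classes including background, number of boxes). It gives the probability that each box belongs to each class; index 0 on the class axis is the background.
offset_preds is a tensor of shape (batch size, number of boxes $*$ 4), giving the predicted offsets of each box relative to its anchor box.
anchors is a tensor of shape (1, number of boxes, 4).

batch_size is cls_probs.shape[0], the number of images in the batch, so each iteration of the loop handles one image. All non-background classes are handled together through torch.max(cls_prob[1:], 0), which yields each box's confidence and class.
Within an iteration, the keep vector is computed first; it holds the indices of the boxes to retain. non_keep holds the indices of the boxes that are not retained, and their class_id is set to -1. Boxes whose confidence (the largest class probability) is below pos_threshold also have their class set to -1, and their confidence is replaced by 1 minus the current confidence.
The returned result has shape (batch size, number of anchor boxes, 6). The six elements in the innermost dimension describe one predicted bounding box. The first element is the predicted class index, where -1 means background or removed by non-maximum suppression. The second element is the confidence of the predicted bounding box. The remaining four elements are the $(x, y)$ coordinates of the upper-left and lower-right corners of the predicted bounding box (values between 0 and 1).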
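As an illustration of how this output might be consumed, the sketch below filters one image's rows in plain Python. The function foreground_detections, its min_conf parameter, and the sample rows are all hypothetical and only follow the six-element layout described above.

```python
def foreground_detections(rows, min_conf=0.5):
    """Keep rows [class_id, conf, x1, y1, x2, y2] that are real detections."""
    # class_id == -1 marks background or boxes removed by NMS
    return [row for row in rows if row[0] != -1 and row[1] >= min_conf]

# Made-up output rows for one image
rows = [
    [0, 0.90, 0.10, 0.08, 0.52, 0.92],
    [-1, 0.80, 0.08, 0.20, 0.56, 0.95],  # suppressed duplicate
    [1, 0.90, 0.55, 0.20, 0.90, 0.88],
    [1, 0.30, 0.50, 0.25, 0.85, 0.80],   # low confidence
]
detections = foreground_detections(rows)
for class_id, conf, *corners in detections:
    print(class_id, conf, corners)
```

The extra confidence cutoff is a design choice for display purposes; the rows with class -1 are the ones the function above has already marked as not to be shown.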