(8) Computer Vision -- 4 Anchor Boxes

Fiona-Dong

Category column: Dive into Deep Learning - TF2.0 (reading notes)

Original link: https://trickygo.github.io/Dive-into-DL-TensorFlow2.0/#/chapter09_computer-vision/9.4_anchor

4. Anchor Boxes

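Object detection algorithms typically sample a large number of regions in the input image, determine whether these regions contain objects of interest, and adjust the region boundaries to predict the ground-truth bounding box of each object more accurately.

Different models may use different region sampling methods. This section introduces one of them:
it generates multiple bounding boxes with different sizes and aspect ratios centered on each pixel. These bounding boxes are called anchor boxes.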

4.1 Generating Multiple Anchor Boxes

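Suppose the input image has height h and width w, and anchor boxes of different shapes are generated centered on each pixel of the image.
Given a size s∈(0,1] and an aspect ratio r>0 , the width and height of the anchor box are $ws\sqrt r$ and $hs/\sqrt r$ respectively.
Once the center position is given, an anchor box with known width and height is fully determined.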

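Set a group of sizes $s_1,…,s_n$ and a group of aspect ratios $r_1,…,r_m$ .
If every combination of size and aspect ratio is used at every pixel, the input image yields $w h n m$ anchor boxes in total.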

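Although these anchor boxes may cover all ground-truth bounding boxes, the computational cost easily becomes too high.
Therefore, usually only the combinations of size and aspect ratio that contain $s_1$ or $r_1$ are of interest, namely:

$(s_1,r_1),(s_1,r_2),…,(s_1,r_m),(s_2,r_1),(s_3,r_1),…,(s_n,r_1)$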

In other words, the number of anchor boxes centered on the same pixel is $n + m - 1$ .
For the whole input image, a total of $w h (n + m - 1)$ anchor boxes are generated.
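As a quick check of this count, the combinations can be enumerated directly. The following is a minimal sketch in plain Python; the helper name `anchor_shapes` is hypothetical, and the sizes, ratios, and the 561 x 728 image size are the values used in the example later in this section.

```python
import math

def anchor_shapes(sizes, ratios):
    """Return the (s, r) pairs that contain sizes[0] or ratios[0] (n + m - 1 pairs)."""
    pairs = [(sizes[0], r) for r in ratios]
    pairs += [(s, ratios[0]) for s in sizes[1:]]
    return pairs

sizes, ratios = [0.75, 0.5, 0.25], [1, 2, 0.5]
pairs = anchor_shapes(sizes, ratios)
h, w = 561, 728  # image height and width from the example in this section

for s, r in pairs:
    # Width and height expressed as fractions of the image width and height
    print(f"s={s}, r={r}: width={s * math.sqrt(r):.3f}, height={s / math.sqrt(r):.3f}")

print(len(pairs))          # anchors per pixel: n + m - 1 = 5
print(h * w * len(pairs))  # anchors for the whole image: 2042040
```

The total agrees with the number of anchors reported by MultiBoxPrior below.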

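The anchor generation method above is implemented in the MultiBoxPrior function.
Given an input, a group of sizes, and a group of aspect ratios, this function returns all anchor boxes for the input: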

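import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# img is assumed to be an image array that has already been loaded, shape (height, width, channels)
h, w = img.shape[0], img.shape[1]
print(h, w)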

561 728

def MultiBoxPrior(feature_map, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5]):
    """
    Generate anchor boxes, represented as (xmin, ymin, xmax, ymax).
    https://zh.d2l.ai/chapter_computer-vision/anchor.html
    Args:
        feature_map: tf tensor, Shape: [N, C, H, W].
        sizes: List of sizes (0~1) of generated MultiBoxPriores. 
        ratios: List of aspect ratios (positive) of generated MultiBoxPriores. 
    Returns:
        anchors of shape (1, num_anchors, 4). The anchors are the same for every example in the batch, so the first dimension is 1
    """
    pairs = [] # pair of (size, sqrt(ratio))
    for r in ratios:
        pairs.append([sizes[0], np.sqrt(r)])
    for s in sizes[1:]:
        pairs.append([s, np.sqrt(ratios[0])])

    pairs = np.array(pairs)

    ss1 = pairs[:, 0] * pairs[:, 1] # size * sqrt(ratio)
    ss2 = pairs[:, 0] / pairs[:, 1] # size / sqrt(ratio)

    base_anchors = tf.stack([-ss1, -ss2, ss1, ss2], axis=1) / 2

    h, w = feature_map.shape[-2:]
    shifts_x = tf.divide(tf.range(0, w), w)
    shifts_y = tf.divide(tf.range(0, h), h)
    shift_x, shift_y = tf.meshgrid(shifts_x, shifts_y)
    shift_x = tf.reshape(shift_x, (-1,))
    shift_y = tf.reshape(shift_y, (-1,))
    shifts = tf.stack((shift_x, shift_y, shift_x, shift_y), axis=1)

    anchors = tf.add(tf.reshape(shifts, (-1,1,4)), tf.reshape(base_anchors, (1,-1,4)))
    return tf.cast(tf.reshape(anchors, (1,-1,4)), tf.float32)

x = tf.zeros((1,3,h,w))
y = MultiBoxPrior(x)
y.shape

TensorShape([1, 2042040, 4])

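As shown, the returned anchor variable y has shape (1, number of anchor boxes, 4).
After reshaping y to (image height, image width, number of anchor boxes centered on the same pixel, 4), all anchor boxes centered on a given pixel can be obtained by specifying the pixel position.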

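For example, access the first anchor box centered on (250, 250).
It has 4 elements: the $x$ and $y$ coordinates of the upper-left corner and the $x$ and $y$ coordinates of the lower-right corner.
The $x$ and $y$ coordinates have been divided by the image width and height respectively, so they normally lie between 0 and 1; an anchor box that extends past the image border can fall slightly outside this range, as the negative first value in the output below shows.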

boxes = tf.reshape(y, (h,w,5,4))
boxes[250,250,0,:]

<tf.Tensor: id=67, shape=(4,), dtype=float32, numpy=array([-0.03159341,  0.0706328 ,  0.7184066 ,  0.8206328 ], dtype=float32)>
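These values can be reproduced by hand. The following is a minimal sketch in plain Python, assuming (as in MultiBoxPrior above) that the center of the pixel at row 250, column 250 is taken as (250 / w, 250 / h) and that the first anchor box uses s = 0.75 and r = 1:

```python
import math

h, w = 561, 728        # image height and width printed above
row, col = 250, 250    # pixel used in the example
s, r = 0.75, 1.0       # size and aspect ratio of the first anchor box

cx, cy = col / w, row / h        # normalized center
half_w = s * math.sqrt(r) / 2    # half of the normalized width
half_h = s / math.sqrt(r) / 2    # half of the normalized height

box = [cx - half_w, cy - half_h, cx + half_w, cy + half_h]
print([round(v, 4) for v in box])  # [-0.0316, 0.0706, 0.7184, 0.8206]
```

The negative xmin arises because an anchor box that is 0.75 of the image width, centered at roughly 0.34 of the width, extends past the left border.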

To draw all anchor boxes centered on a given pixel, define the show_bboxes function, which draws multiple bounding boxes on an image.

def bbox_to_rect(bbox, color):
    # Convert a bounding box from (upper-left x, upper-left y, lower-right x, lower-right y)
    # to the matplotlib format: ((upper-left x, upper-left y), width, height)
    return plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)


def show_bboxes(axes, bboxes, labels=None, colors=None):
    def _make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = _make_list(labels)
    colors = _make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = bbox_to_rect(bbox.numpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                va='center', ha='center', fontsize=6,
                color=text_color, bbox=dict(facecolor=color, lw=0))

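Because the $x$ and $y$ coordinates in the variable boxes have been divided by the image width and height, the original coordinates must be restored before plotting; the variable bbox_scale is defined for this purpose.

With that, all anchor boxes centered on (250, 250) can be drawn on the image: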

from IPython import display

def use_svg_display():
    """Use svg format to display plot in jupyter"""
    display.set_matplotlib_formats('svg')

use_svg_display()

# Set the figure size
plt.rcParams['figure.figsize'] = (3.5, 2.5)

fig = plt.imshow(img)

bbox_scale = tf.constant([[w,h,w,h]], dtype=tf.float32)
show_bboxes(fig.axes, 
            tf.multiply(boxes[250,250,:,:], bbox_scale), 
            labels=['s=0.75, r=1', 's=0.75, r=2', 's=0.75, r=0.5', 's=0.5, r=1', 's=0.25, r=1'])

As can be seen, anchor boxes of different sizes and aspect ratios cover the dog in the image to different degrees.

4.2 Intersection over Union

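As mentioned above, different anchor boxes cover the dog in the image to different degrees.
If the ground-truth bounding box of the object is known, how can this degree of coverage be quantified?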

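An intuitive approach is to measure the similarity between the anchor box and the ground-truth bounding box.
The Jaccard index measures the similarity between two sets.
Given sets A and B, their Jaccard index is the size of their intersection divided by the size of their union:

$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$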

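In fact, the pixel region inside a bounding box can be regarded as a set of pixels.
The similarity of two bounding boxes can therefore be measured by the Jaccard index of their pixel sets.

When measuring the similarity of two bounding boxes, the Jaccard index is usually called the Intersection over Union (IoU), that is, the ratio of the intersection area to the union area of the two bounding boxes.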

As shown in the figure below:

The value of IoU ranges from 0 to 1:
0 means the two bounding boxes share no pixels, and 1 means the two bounding boxes are identical.

The implementation is as follows:

def compute_intersection(set_1, set_2):
    """
    Compute the intersection areas between two sets of anchors
    Args:
        set_1: a tensor of dimensions (n1, 4), anchors represented as (xmin, ymin, xmax, ymax)
        set_2: a tensor of dimensions (n2, 4), anchors represented as (xmin, ymin, xmax, ymax)
    Returns:
        intersection of each of the boxes in set 1 with respect to each of the boxes in set 2, shape: (n1, n2)
    """
    # tensorflow auto-broadcasts singleton dimensions
    lower_bounds = tf.maximum(tf.expand_dims(set_1[:,:2], axis=1), tf.expand_dims(set_2[:,:2], axis=0)) # (n1, n2, 2)
    upper_bounds = tf.minimum(tf.expand_dims(set_1[:,2:], axis=1), tf.expand_dims(set_2[:,2:], axis=0)) # (n1, n2, 2)
    # Clamp negative extents (boxes that do not overlap) to 0
    intersection_dims = tf.maximum(upper_bounds - lower_bounds, 0) # (n1, n2, 2)
    return tf.multiply(intersection_dims[:, :, 0], intersection_dims[:, :, 1]) # (n1, n2)


def compute_jaccard(set_1, set_2):
    """
    Compute the Jaccard index (IoU) between two sets of anchors
    Args:
        set_1: a tensor of dimensions (n1, 4), anchors represented as (xmin, ymin, xmax, ymax)
        set_2: a tensor of dimensions (n2, 4), anchors represented as (xmin, ymin, xmax, ymax)
    Returns:
        Jaccard Overlap of each of the boxes in set 1 with respect to each of the boxes in set 2, shape: (n1, n2)
    """
    # Find intersections
    intersection = compute_intersection(set_1, set_2)

    # Find areas of each box in both sets
    areas_set_1 = tf.multiply(tf.subtract(set_1[:, 2], set_1[:, 0]), tf.subtract(set_1[:, 3], set_1[:, 1]))  # (n1)
    areas_set_2 = tf.multiply(tf.subtract(set_2[:, 2], set_2[:, 0]), tf.subtract(set_2[:, 3], set_2[:, 1]))  # (n2)

    # Find the union
    union = tf.add(tf.expand_dims(areas_set_1, axis=1), tf.expand_dims(areas_set_2, axis=0))  # (n1, n2)
    union = tf.subtract(union, intersection)  # (n1, n2)

    return tf.divide(intersection, union) #(n1, n2)
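
For a single pair of boxes the same computation can be followed by hand. The sketch below uses plain Python only; the function name `iou` and the three test boxes are illustrative and not part of the original code.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Width and height of the overlap, clamped at 0 when the boxes are disjoint
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1, union area 7
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

The three calls cover the partial-overlap case (1/7), the upper bound 1, and the lower bound 0.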

4.3 Labeling Anchor Boxes in the Training Set

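In the training set, each anchor box is treated as a training example.

To train an object detection model, two kinds of labels are needed for each anchor box:
the first is the class of the object contained in the anchor box, called the class for short;
the second is the offset of the ground-truth bounding box relative to the anchor box, called the offset for short.

During object detection, multiple anchor boxes are generated first, then a class and an offset are predicted for each anchor box, then the anchor box positions are adjusted according to the predicted offsets to obtain predicted bounding boxes, and finally the predicted bounding boxes to output are selected.

In an object detection training set, each image is annotated with the positions of the ground-truth bounding boxes and the classes of the objects they contain.
After the anchor boxes are generated, each anchor box is labeled mainly according to the position and class of the ground-truth bounding box that is similar to it.
So how should a similar ground-truth bounding box be assigned to an anchor box?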

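Suppose the anchor boxes in the image are $A_1,A_2,…,A_{n_a}$ and the ground-truth bounding boxes are $B_1,B_2,…,B_{n_b}$ , with $n_a≥n_b$ .

Define the matrix $X∈R^{n_a×n_b}$ , where the element $x_{ij}$ in row i and column j is the IoU of anchor box $A_i$ and ground-truth bounding box $B_j$ .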

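First, find the largest element of the matrix $X$ and denote its row and column indices by $i_1$ and $j_1$ . Assign the ground-truth bounding box $B_{j_1}$ to the anchor box $A_{i_1}$ .
Clearly, anchor box $A_{i_1}$ and ground-truth bounding box $B_{j_1}$ have the highest similarity among all pairings of an anchor box with a ground-truth bounding box.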

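Next, discard all elements in row $i_1$ and column $j_1$ of the matrix $X$ . Find the largest remaining element of $X$ and denote its row and column indices by $i_2$ and $j_2$ .
Assign the ground-truth bounding box $B_{j_2}$ to the anchor box $A_{i_2}$ , then discard all elements in row $i_2$ and column $j_2$ of the matrix $X$ .
At this point, the elements in 2 rows and 2 columns of $X$ have been discarded.
Continue in the same way until the elements in all $n_b$ columns of $X$ have been discarded.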

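At this point, $n_b$ anchor boxes have each been assigned a ground-truth bounding box.
Next, only the remaining $n_a−n_b$ anchor boxes are traversed:
for an anchor box $A_i$ among them, find the ground-truth bounding box $B_j$ with the largest IoU with $A_i$ from row i of the matrix $X$ , and assign $B_j$ to $A_i$ only if this IoU is greater than a predefined threshold.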

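As shown in the figure above (left), suppose the largest value in the matrix $X$ is $x_{23}$ ; the ground-truth bounding box $B_3$ is then assigned to anchor box $A_2$ . Next, all elements in row 2 and column 3 of the matrix are discarded, the largest element $x_{71}$ in the remaining shaded area is found, and the ground-truth bounding box $B_1$ is assigned to anchor box $A_7$ .

Then, as shown in the figure above (middle), all elements in row 7 and column 1 of the matrix are discarded, the largest element $x_{54}$ in the remaining shaded area is found, and the ground-truth bounding box $B_4$ is assigned to anchor box $A_5$ .

Finally, as shown in the figure above (right), all elements in row 5 and column 4 of the matrix are discarded, the largest element $x_{92}$ in the remaining shaded area is found, and the ground-truth bounding box $B_2$ is assigned to anchor box $A_9$ .

After that, it only remains to traverse the anchor boxes other than $A_2,A_5,A_7,A_9$ and decide, based on the threshold, whether to assign a ground-truth bounding box to each of them.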
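
The two-stage procedure described above can be sketched in plain Python on a precomputed IoU matrix. This is a minimal sketch, not the implementation used later in this post; the function name `assign_anchors` and the matrix values are hypothetical, and the sketch assumes $n_a≥n_b$ as in the text.

```python
def assign_anchors(iou, threshold=0.5):
    """Assign a ground-truth index to each anchor, or -1 for background.

    iou[i][j] is the IoU of anchor i and ground-truth box j (n_a >= n_b).
    """
    n_a, n_b = len(iou), len(iou[0])
    assigned = [-1] * n_a
    free_rows, free_cols = set(range(n_a)), set(range(n_b))

    # Stage 1: repeatedly take the largest remaining entry, then drop its row and column.
    while free_cols:
        i, j = max(((i, j) for i in free_rows for j in free_cols),
                   key=lambda ij: iou[ij[0]][ij[1]])
        assigned[i] = j
        free_rows.discard(i)
        free_cols.discard(j)

    # Stage 2: each remaining anchor takes its best box only if the IoU exceeds the threshold.
    for i in free_rows:
        j = max(range(n_b), key=lambda col: iou[i][col])
        if iou[i][j] > threshold:
            assigned[i] = j
    return assigned

# Hypothetical IoU matrix: 5 anchors (rows) x 2 ground-truth boxes (columns)
iou = [
    [0.05, 0.00],
    [0.14, 0.00],
    [0.00, 0.57],
    [0.00, 0.21],
    [0.00, 0.75],
]
print(assign_anchors(iou))  # [-1, 0, 1, -1, 1]
```

Anchor 1 receives box 0 in stage 1 even though its IoU is only 0.14, because stage 1 guarantees one anchor per ground-truth box regardless of the threshold.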

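The classes and offsets of the anchor boxes can now be labeled.

If an anchor box A is assigned a ground-truth bounding box B, the class of A is set to the class of B, and the offset of A is labeled according to the relative position of the center coordinates of B and A and the relative sizes of the two boxes.

Because the positions and sizes of the boxes in a dataset vary, these relative positions and sizes usually need a special transformation so that the offsets are distributed more uniformly and are therefore easier to fit.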

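Let the center coordinates of anchor box A and its assigned ground-truth bounding box B be $(x_a,y_a)$ and $(x_b,y_b)$ , let the widths of A and B be $w_a$ and $w_b$ , and let their heights be $h_a$ and $h_b$ . A common technique is to label the offset of A as:

$\left( \frac{ \frac{x_b - x_a}{w_a} - μ_x }{σ_x}, \frac{ \frac{y_b - y_a}{h_a} - μ_y }{σ_y}, \frac{ \log \frac{w_b}{w_a} - μ_w }{σ_w}, \frac{ \log \frac{h_b}{h_a} - μ_h }{σ_h} \right)$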

where the default values of the constants are:
$μ_x=μ_y=μ_w=μ_h=0, \,σ_x=σ_y=0.1, \,σ_w=σ_h=0.2$ .
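
With these defaults the offset reduces to 10 times the normalized center shift and 5 times the log size ratio. The following plain-Python sketch illustrates this; the function name `encode_offsets` is hypothetical, all means are assumed to be 0, and the two boxes are the anchor box $A_4$ and the cat box from the example later in this section.

```python
import math

def encode_offsets(anchor, gt, sigma_xy=0.1, sigma_wh=0.2):
    """Offsets of a ground-truth box relative to an anchor (all means taken as 0).

    Both boxes are given as (xmin, ymin, xmax, ymax).
    """
    xa, ya = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    xb, yb = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    wb, hb = gt[2] - gt[0], gt[3] - gt[1]
    return [
        (xb - xa) / wa / sigma_xy,      # center shift in x, relative to anchor width
        (yb - ya) / ha / sigma_xy,      # center shift in y, relative to anchor height
        math.log(wb / wa) / sigma_wh,   # log ratio of widths
        math.log(hb / ha) / sigma_wh,   # log ratio of heights
    ]

anchor = (0.57, 0.3, 0.92, 0.9)   # normalized anchor box
gt = (0.55, 0.2, 0.9, 0.88)       # normalized ground-truth box
offsets = encode_offsets(anchor, gt)
print([f"{v:.4f}" for v in offsets])
```

Up to the small eps term used in the TensorFlow code, the result agrees with the last four offsets printed by MultiBoxTarget below (about -0.571, -1.000, 0, 0.626).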

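If an anchor box is not assigned a ground-truth bounding box, its class is simply set to background.
Anchor boxes whose class is background are usually called negative anchor boxes, and the rest are called positive anchor boxes.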

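For example, define ground-truth bounding boxes for the cat and the dog in the loaded image.

In each ground-truth row, the first element is the class (0 for dog, 1 for cat), and the remaining 4 elements are the x and y coordinates of the upper-left corner and of the lower-right corner (values between 0 and 1).

Five anchor boxes to be labeled are then constructed from their upper-left and lower-right coordinates, denoted $A_0,…,A_4$ (indices start from 0 in the program).

First, plot the positions of these anchor boxes and the ground-truth bounding boxes in the image: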

bbox_scale = tf.constant([[w,h,w,h]], dtype=tf.float32)

ground_truth = tf.constant([[0, 0.1, 0.08, 0.52, 0.92],
                         [1, 0.55, 0.2, 0.9, 0.88]])
anchors = tf.constant([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                    [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                    [0.57, 0.3, 0.92, 0.9]])

fig = plt.imshow(img)
show_bboxes(fig.axes, tf.multiply(ground_truth[:, 1:], bbox_scale),['dog', 'cat'], 'k')
show_bboxes(fig.axes, tf.multiply(anchors, bbox_scale),['0', '1', '2', '3', '4'])

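Next, implement the MultiBoxTarget function to label the anchor boxes with classes and offsets.

This function sets the background class to 0 and increments the zero-based integer index of each object class by 1 (1 for dog, 2 for cat).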

def assign_anchor(bb, anchor, jaccard_threshold=0.5):
    """
    Assign a ground-truth bb to each anchor. Anchors are represented as normalized (xmin, ymin, xmax, ymax).
    https://zh.d2l.ai/chapter_computer-vision/anchor.html
    Args:
        bb: ground-truth bounding boxes, shape: (nb, 4)
        anchor: anchors to be assigned, shape: (na, 4)
        jaccard_threshold: predefined threshold
    Returns:
        assigned_idx: shape: (na, ), index of the ground-truth bb assigned to each anchor, or -1 if no bb is assigned
    """
    na = anchor.shape[0]
    nb = bb.shape[0]
    jaccard = compute_jaccard(anchor, bb).numpy()   # shape: (na, nb)
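    assigned_idx = np.ones(na) * -1 # initialized to -1

    # First assign one anchor to each bb (jaccard_threshold is not required here).
    # Note: bbs are handled column by column, a simplification of the
    # global-maximum procedure described in the text.
    jaccard_cp = jaccard.copy()
    for j in range(nb):
        i = np.argmax(jaccard_cp[:, j])
        assigned_idx[i] = j
        jaccard_cp[i, :] = float("-inf")    # set to negative infinity, which effectively removes this row

    # Handle the anchors not yet assigned; these must satisfy jaccard_threshold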
    for i in range(na):
        if assigned_idx[i] == -1:
            j = np.argmax(jaccard[i, :])
            if jaccard[i, j] >= jaccard_threshold:
                assigned_idx[i] = j
    return tf.cast(assigned_idx, tf.int32)



def xy_to_cxcy(xy):
    """
    Convert anchors from (x_min, y_min, x_max, y_max) form to (center_x, center_y, w, h) form.
    https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py
    Args:
        xy: bounding boxes in boundary coordinates, a tensor of size (n_boxes, 4)
    Returns: 
        bounding boxes in center-size coordinates, a tensor of size (n_boxes, 4)
    """
    return tf.concat(((xy[:, 2:] + xy[:, :2]) / 2,  #c_x, c_y
              xy[:, 2:] - xy[:, :2]), axis=1)



def MultiBoxTarget(anchor, label):
    """
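    Label anchors with classes and offsets. Anchors are represented as normalized (xmin, ymin, xmax, ymax).
    https://zh.d2l.ai/chapter_computer-vision/anchor.html
    Args:
        anchor: tf tensor, input anchors, usually generated by MultiBoxPrior, shape: (1, total number of anchors, 4)
        label: ground-truth labels, shape: (bn, maximum number of ground-truth boxes per image, 5)
               In the second dimension, an image with fewer boxes can be padded with -1; the elements of the last dimension are [class label, four coordinates]
    Returns:
        list, [bbox_offset, bbox_mask, cls_labels]
        bbox_offset: labeled offsets of each anchor, shape: (bn, total number of anchors*4)
        bbox_mask: same shape as bbox_offset, the mask of each anchor, matching the offsets above one-to-one; the mask is 0 for negative (background) anchors and 1 for positive anchors
        cls_labels: labeled class of each anchor, where 0 denotes background, shape: (bn, total number of anchors)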
    """
    assert len(anchor.shape) == 3 and len(label.shape) == 3
    bn = label.shape[0]

    def MultiBoxTarget_one(anchor, label, eps=1e-6):
        """
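        Helper for MultiBoxTarget that processes one example of the batch
        Args:
            anchor: shape of (total number of anchors, 4)
            label: shape of (number of ground-truth boxes, 5), where 5 stands for [class label, four coordinates]
            eps: a very small value that prevents log(0)
        Returns:
            offset: (total number of anchors*4, )
            bbox_mask: (total number of anchors*4, ), 0 for background, 1 for non-background
            cls_labels: (total number of anchors, ), 0 for background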
        """
        an = anchor.shape[0]
        assigned_idx = assign_anchor(label[:, 1:], anchor) # (total number of anchors, )
        # Mask that marks each anchor as kept (1) or discarded (0)
        bbox_mask = tf.repeat(tf.expand_dims(tf.cast((assigned_idx >= 0), dtype=tf.double), axis=-1), repeats=4, axis=1)

        cls_labels = np.zeros(an, dtype=int) # 0 denotes background
        assigned_bb = np.zeros((an, 4), dtype=float) # bb coordinates assigned to each anchor
        for i in range(an):
            bb_idx = assigned_idx[i]
            if bb_idx >= 0: # i.e. not background
                cls_labels[i] = label.numpy()[bb_idx, 0] + 1 # note the +1
                assigned_bb[i, :] = label.numpy()[bb_idx, 1:]

        center_anchor = tf.cast(xy_to_cxcy(anchor), dtype=tf.double)  # (center_x, center_y, w, h)
        center_assigned_bb = tf.cast(xy_to_cxcy(assigned_bb), dtype=tf.double) # (center_x, center_y, w, h)

        offset_xy = 10.0 * (center_assigned_bb[:,:2] - center_anchor[:,:2]) / center_anchor[:,2:]
        offset_wh = 5.0 * tf.math.log(eps + center_assigned_bb[:, 2:] / center_anchor[:, 2:])
        offset = tf.multiply(tf.concat((offset_xy, offset_wh), axis=1), bbox_mask)    # (total number of anchors, 4)

        return tf.reshape(offset, (-1,)), tf.reshape(bbox_mask, (-1,)), cls_labels

    batch_offset = []
    batch_mask = []
    batch_cls_labels = []
    for b in range(bn):
        offset, bbox_mask, cls_labels = MultiBoxTarget_one(anchor[0, :, :], label[b,:,:])

        batch_offset.append(offset)
        batch_mask.append(bbox_mask)
        batch_cls_labels.append(cls_labels)

    batch_offset = tf.convert_to_tensor(batch_offset)
    batch_mask = tf.convert_to_tensor(batch_mask)
    batch_cls_labels = tf.convert_to_tensor(batch_cls_labels)

    return [batch_offset, batch_mask, batch_cls_labels]

Add an example (batch) dimension to the anchor boxes and the ground-truth bounding boxes with the tf.expand_dims function:

labels = MultiBoxTarget(tf.expand_dims(anchors, axis=0), tf.expand_dims(ground_truth, axis=0))
print(type(labels))
print(len(labels))
print()
print(labels[0])
print()
print(labels[1])
print()
print(labels[2])

<class 'list'>
3

tf.Tensor(
[[-0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00
   1.39999941e+00  9.99999963e+00  2.59397170e+00  7.17542385e+00
  -1.20000005e+00  2.68817346e-01  1.68236424e+00 -1.56545220e+00
  -0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00
  -5.71428839e-01 -1.00000047e+00  4.14850341e-06  6.25820368e-01]], shape=(1, 20), dtype=float64)

tf.Tensor([[0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1.]], shape=(1, 20), dtype=float64)

tf.Tensor([[0 1 2 0 2]], shape=(1, 5), dtype=int32)

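The labeled classes can be analyzed from the positions of the anchor boxes and the ground-truth bounding boxes in the image.

First, among all pairings of an anchor box with a ground-truth bounding box, anchor box $A_4$ and the ground-truth bounding box of the cat have the largest IoU (about 0.75), so the class of anchor box $A_4$ is labeled as cat.

Leaving aside anchor box $A_4$ and the ground-truth bounding box of the cat, the pairing with the largest IoU among the remaining ones is anchor box $A_1$ and the ground-truth bounding box of the dog, so the class of anchor box $A_1$ is labeled as dog.

Next, traverse the 3 remaining unlabeled anchor boxes:
the ground-truth bounding box with the largest IoU with anchor box $A_0$ is that of the dog, but the IoU (about 0.05) is below the threshold (0.5 by default), so the class is labeled as background;
the ground-truth bounding box with the largest IoU with anchor box $A_2$ is that of the cat, and the IoU (about 0.57) is above the threshold, so the class is labeled as cat;
the ground-truth bounding box with the largest IoU with anchor box $A_3$ is that of the cat, but the IoU (about 0.21) is below the threshold, so the class is labeled as background.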

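The first returned item contains the four offsets labeled for each anchor box, where the offsets of negative anchor boxes are labeled 0.

The second returned item is a mask variable with shape (batch size, four times the number of anchor boxes).
The elements of the mask variable correspond one-to-one to the 4 offsets of each anchor box.
Since background detection is of no interest, the offsets of negative anchor boxes should not affect the objective function. Through elementwise multiplication, the zeros in the mask variable filter out the negative-class offsets before the objective function is computed.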
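
How the mask enters the objective can be illustrated with a small plain-Python sketch. The choice of an L1 (absolute error) loss, the function name `masked_l1_loss`, and the numbers are all assumptions made for illustration; the text above does not specify the loss.

```python
def masked_l1_loss(pred, target, mask):
    """Mean absolute error over the offsets whose mask entry is 1."""
    # Multiplying by the mask zeroes the contribution of background anchors
    total = sum(m * abs(p - t) for p, t, m in zip(pred, target, mask))
    count = sum(mask)
    return total / count if count else 0.0

# Two anchors: the first is background (mask 0), the second is positive (mask 1)
target = [0.0, 0.0, 0.0, 0.0, 1.4, 10.0, 2.6, 7.2]
mask = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [3.0, -2.0, 1.0, 5.0, 1.0, 9.0, 2.0, 7.0]  # hypothetical predictions

loss = masked_l1_loss(pred, target, mask)
print(round(loss, 4))  # 0.55
```

The large errors on the first four (background) offsets do not change the result, which is the filtering effect described above.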

4.4 Outputting Predicted Bounding Boxes

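In the prediction stage, multiple anchor boxes are first generated for the image, and a class and an offset are predicted for each of them.

Then, predicted bounding boxes are obtained from the anchor boxes and their predicted offsets.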
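
One way to obtain a predicted bounding box is to invert the offset formula from section 4.3. The sketch below does this in plain Python under the default constants (means 0, $σ_x=σ_y=0.1$, $σ_w=σ_h=0.2$); the function name `decode_offsets` is hypothetical. Note that the MultiBoxDetection code later in this section instead adds the offsets directly to the corner coordinates, which gives the same result only when the offsets are zero.

```python
import math

def decode_offsets(anchor, offsets, sigma_xy=0.1, sigma_wh=0.2):
    """Apply offsets to an anchor; boxes are given as (xmin, ymin, xmax, ymax)."""
    xa, ya = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    cx = offsets[0] * sigma_xy * wa + xa        # predicted center x
    cy = offsets[1] * sigma_xy * ha + ya        # predicted center y
    bw = math.exp(offsets[2] * sigma_wh) * wa   # predicted width
    bh = math.exp(offsets[3] * sigma_wh) * ha   # predicted height
    return [cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2]

anchor = (0.57, 0.3, 0.92, 0.9)
unchanged = decode_offsets(anchor, [0.0, 0.0, 0.0, 0.0])
shifted = decode_offsets(anchor, [-0.5714, -1.0, 0.0, 0.6258])
print([round(v, 3) for v in unchanged])  # zero offsets leave the anchor unchanged
print([round(v, 3) for v in shifted])    # close to the cat box (0.55, 0.2, 0.9, 0.88)
```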

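When there are many anchor boxes, many similar predicted bounding boxes may be output for the same object.
To keep the result concise, similar predicted bounding boxes can be removed. A common method is non-maximum suppression (NMS).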

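Non-maximum suppression works as follows:

For a predicted bounding box B, the model computes the predicted probability of each class.

Let the largest predicted probability be p; the class corresponding to this probability is the predicted class of B, and p is called the confidence of the predicted bounding box B.

On the same image, the predicted bounding boxes whose predicted class is not background are sorted by confidence from high to low to obtain a list L.

The predicted bounding box $B_1$ with the highest confidence is selected from L as the baseline, and all non-baseline predicted bounding boxes whose IoU with $B_1$ exceeds a threshold (a predefined hyperparameter) are removed from L.
At this point, L keeps the predicted bounding box with the highest confidence and no longer contains the other predicted bounding boxes similar to it.

Next, the predicted bounding box $B_2$ with the second-highest confidence is selected from L as the baseline, and all non-baseline predicted bounding boxes whose IoU with $B_2$ exceeds the threshold are removed from L.

This process is repeated until every predicted bounding box in L has served as a baseline.
At this point, the IoU of any pair of predicted bounding boxes in L does not exceed the threshold.

Finally, all predicted bounding boxes in the list L are output.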
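
The steps above can be sketched in plain Python when the pairwise IoUs are already known. This is a minimal illustration rather than the TensorFlow implementation given below; the function name `nms`, the confidences, and the IoU values are hypothetical.

```python
def nms(confidences, iou, threshold=0.5):
    """Return the indices of the boxes kept by non-maximum suppression.

    confidences[i] is the confidence of box i; iou[i][j] is the IoU of boxes i and j.
    """
    # The list L: box indices sorted by confidence from high to low
    remaining = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    kept = []
    while remaining:
        base = remaining.pop(0)  # most confident box that has not yet served as baseline
        kept.append(base)
        # Remove every other box whose IoU with the baseline exceeds the threshold
        remaining = [i for i in remaining if iou[base][i] <= threshold]
    return kept

# Hypothetical confidences and pairwise IoUs for four predicted boxes
confidences = [0.9, 0.8, 0.7, 0.9]
iou = [
    [1.00, 0.74, 0.55, 0.00],
    [0.74, 1.00, 0.51, 0.01],
    [0.55, 0.51, 1.00, 0.08],
    [0.00, 0.01, 0.08, 1.00],
]
print(nms(confidences, iou))  # [0, 3]
```

Boxes 1 and 2 are removed because they overlap box 0 too much, while box 3 survives because it overlaps none of the kept boxes.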

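For example, first construct 4 anchor boxes. For simplicity, assume all predicted offsets are 0, so the predicted bounding boxes are the anchor boxes themselves.
Finally, construct the predicted probability of each class.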

anchors = tf.convert_to_tensor([[0.1, 0.08, 0.52, 0.92],
                [0.08, 0.2, 0.56, 0.95],
                [0.15, 0.3, 0.62, 0.91],
                [0.55, 0.2, 0.9, 0.88]])

offset_preds = tf.convert_to_tensor([0.0] * (4 * len(anchors)))

cls_probs = tf.convert_to_tensor([[0., 0., 0., 0.], # predicted probability of background
                [0.9, 0.8, 0.7, 0.1],    # predicted probability of dog
                [0.1, 0.2, 0.3, 0.9]])   # predicted probability of cat

print(anchors)
print()
print(offset_preds)
print()
print(cls_probs)

tf.Tensor(
[[0.1  0.08 0.52 0.92]
 [0.08 0.2  0.56 0.95]
 [0.15 0.3  0.62 0.91]
 [0.55 0.2  0.9  0.88]], shape=(4, 4), dtype=float32)

tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(16,), dtype=float32)

tf.Tensor(
[[0.  0.  0.  0. ]
 [0.9 0.8 0.7 0.1]
 [0.1 0.2 0.3 0.9]], shape=(3, 4), dtype=float32)

Draw the predicted bounding boxes and their confidence levels on the image:

fig = plt.imshow(img)
show_bboxes(fig.axes, anchors * bbox_scale, labels=['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])

Implement the MultiBoxDetection function to perform non-maximum suppression:

from collections import namedtuple

# Returns a new subclass of tuple with named fields.
Pred_BB_Info = namedtuple("Pred_BB_Info", ["index", "class_id", "confidence", "xyxy"])
Pred_BB_Info.__doc__

'Pred_BB_Info(index, class_id, confidence, xyxy)'

def non_max_suppression(bb_info_list, nms_threshold=0.5):
    """
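    Apply non-maximum suppression to the predicted bounding boxes
    Args:
        bb_info_list: list of Pred_BB_Info, containing the predicted class, confidence, and other information
        nms_threshold: threshold
    Returns:
        output: list of Pred_BB_Info, keeping only the bounding boxes that survive filtering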
    """
    output = []
    # First sort by confidence from high to low
    sorted_bb_info_list = sorted(bb_info_list,
                    key = lambda x: x.confidence, 
                    reverse=True)
    while len(sorted_bb_info_list) != 0:
        best = sorted_bb_info_list.pop(0)
        output.append(best)

        if len(sorted_bb_info_list) == 0:
            break
        bb_xyxy = []
        for bb in sorted_bb_info_list:
            bb_xyxy.append(bb.xyxy)

        iou = compute_jaccard(tf.convert_to_tensor(best.xyxy),
                    tf.squeeze(tf.convert_to_tensor(bb_xyxy), axis=1))[0] # shape: (len(sorted_bb_info_list), )
        n = len(sorted_bb_info_list)
        sorted_bb_info_list = [
                    sorted_bb_info_list[i] for i in 
                    range(n) if iou[i] <= nms_threshold]
    return output



def MultiBoxDetection(cls_prob, loc_pred, anchor, nms_threshold=0.5):
    """
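    Non-maximum suppression. Anchors are represented as normalized (xmin, ymin, xmax, ymax).
    https://zh.d2l.ai/chapter_computer-vision/anchor.html
    Args:
        cls_prob: predicted probabilities of each anchor after softmax, shape: (bn, number of predicted classes+1, number of anchors)
        loc_pred: predicted offsets of each anchor, shape: (bn, number of anchors*4)
        anchor: default anchors output by MultiBoxPrior, shape: (1, number of anchors, 4)
        nms_threshold: threshold used in non-maximum suppression
    Returns:
        information for all anchors, shape: (bn, number of anchors, 6)
        each anchor is described by [class_id, confidence, xmin, ymin, xmax, ymax]
        class_id=-1 means background or removed by non-maximum suppression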
    """
    assert len(cls_prob.shape) == 3 and len(loc_pred.shape) == 2 and len(anchor.shape) == 3
    bn = cls_prob.shape[0]

    def MultiBoxDetection_one(c_p, l_p, anc, nms_threshold=0.5):
        """
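        Helper for MultiBoxDetection that processes one example of the batch
        Args:
            c_p: (number of predicted classes+1, number of anchors)
            l_p: (number of anchors*4, )
            anc: (number of anchors, 4)
            nms_threshold: threshold used in non-maximum suppression
        Return:
            output: (number of anchors, 6)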
        """
        pred_bb_num = c_p.shape[1]
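        # Add the predicted offsets directly to the corner coordinates
        # (a simplification of offset decoding; exact only when the offsets are 0)
        anc = tf.add(anc, tf.reshape(l_p, (pred_bb_num, 4))).numpy()

        # Largest predicted probability
        confidence = tf.reduce_max(c_p, axis=0)
        # Class id with the largest probability
        class_id = tf.argmax(c_p, axis=0)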
        confidence = confidence.numpy()
        class_id = class_id.numpy()

        pred_bb_info = [Pred_BB_Info(index=i,
                    class_id=class_id[i]-1,
                    confidence=confidence[i],
                    xyxy=[anc[i]]) # xyxy is a list
                for i in range(pred_bb_num)]
        # Indices of the boxes kept by non-maximum suppression
        obj_bb_idx = [bb.index for bb 
                in non_max_suppression(pred_bb_info,
                            nms_threshold)]
        output = []
        for bb in pred_bb_info:
            output.append(np.append([
                (bb.class_id if bb.index in obj_bb_idx 
                        else -1.0),
                bb.confidence],
                bb.xyxy))

        return tf.convert_to_tensor(output) # shape: (number of anchors, 6)

    batch_output = []
    for b in range(bn):
        batch_output.append(MultiBoxDetection_one(cls_prob[b],
                        loc_pred[b], anchor[0],
                        nms_threshold))

    return tf.convert_to_tensor(batch_output)

Run the MultiBoxDetection function with the threshold set to 0.5; an example (batch) dimension is added to every input:

output = MultiBoxDetection(
    tf.expand_dims(cls_probs, 0),
    tf.expand_dims(offset_preds, 0),
    tf.expand_dims(anchors, 0),
    nms_threshold=0.5)

output

<tf.Tensor: id=3621, shape=(1, 4, 6), dtype=float64, numpy=
array([[[ 0.        ,  0.89999998,  0.1       ,  0.08      ,
          0.51999998,  0.92000002],
        [-1.        ,  0.80000001,  0.08      ,  0.2       ,
          0.56      ,  0.94999999],
        [-1.        ,  0.69999999,  0.15000001,  0.30000001,
          0.62      ,  0.91000003],
        [ 1.        ,  0.89999998,  0.55000001,  0.2       ,
          0.89999998,  0.88      ]]])>

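As shown, the returned result has shape (batch size, number of anchor boxes, 6).

The 6 elements of each row describe the output for one predicted bounding box:
the first element is the predicted class index, counted from 0 (0 for dog, 1 for cat), where -1 means background or removed by non-maximum suppression;
the second element is the confidence of the predicted bounding box;
the remaining 4 elements are the $x$ and $y$ coordinates of the upper-left corner and of the lower-right corner of the predicted bounding box (values between 0 and 1).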

Remove the predicted bounding boxes of class -1 and visualize the results kept by non-maximum suppression:

fig = plt.imshow(img)

for i in output[0].numpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, tf.multiply(i[2:], bbox_scale), label)

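In practice, predicted bounding boxes with low confidence can be removed before non-maximum suppression to reduce its computational cost.

In addition, the output of non-maximum suppression can be filtered, for example by keeping only the results with high confidence as the final output.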
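
Such a confidence filter can be sketched in a few lines of plain Python. The function name `filter_by_confidence`, the cutoff of 0.5, and the sample rows are hypothetical; the row layout follows the [class_id, confidence, xmin, ymin, xmax, ymax] format described above.

```python
def filter_by_confidence(predictions, min_confidence):
    """Keep non-background predictions whose confidence is at least min_confidence.

    Each prediction is (class_id, confidence, xmin, ymin, xmax, ymax);
    class_id -1 marks background or boxes removed by non-maximum suppression.
    """
    return [p for p in predictions if p[0] >= 0 and p[1] >= min_confidence]

predictions = [
    (0, 0.90, 0.10, 0.08, 0.52, 0.92),
    (-1, 0.80, 0.08, 0.20, 0.56, 0.95),
    (1, 0.35, 0.55, 0.20, 0.90, 0.88),
]
kept = filter_by_confidence(predictions, min_confidence=0.5)
print(kept)
```

The same filter applies before non-maximum suppression (to shrink the candidate list) or after it (to select the final output); only the cutoff would typically differ.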

References

Dive into Deep Learning (TF2.0 version)
