CornerNet: Principles and Code Walkthrough

00000cj

Article link: https://blog.csdn.net/ooooocj/article/details/115385869

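Input: (batch_size, 3, 511, 511)

Backbone: hourglass

The input first passes through a stem module, which consists of a conv(7x7-c128-s2-p3) convolution module followed by a residual module whose convolution layer is 3x3-c256-s2-p1. The output shape becomes (batch_size, 256, 128, 128).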

self.stem = nn.Sequential(
    ConvModule(3, 128, 7, padding=3, stride=2, norm_cfg=norm_cfg),  # torch.Size([2, 128, 256, 256])
    ResLayer(BasicBlock, 128, 256, 1, stride=2, norm_cfg=norm_cfg))
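
As a quick check of the shapes quoted for the stem, the standard convolution output-size formula, floor((n + 2p - k) / s) + 1, can be applied to both layers. The helper name `conv_out_size` is made up for this sketch; it only does the arithmetic and is not part of mmdetection.

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution (dilation 1) on an input of size n."""
    return (n + 2 * padding - kernel) // stride + 1


# 7x7 conv, stride 2, padding 3 on a 511x511 input
after_conv = conv_out_size(511, kernel=7, stride=2, padding=3)
# 3x3 conv, stride 2, padding 1 inside the residual layer
after_res = conv_out_size(after_conv, kernel=3, stride=2, padding=1)

print(after_conv, after_res)  # 256 128
```

This is why a 511x511 input ends up as a 128x128 feature map after the stem.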

Two hourglass modules follow, with some extra convolutions interleaved between them. The code is as follows:

for ind in range(self.num_stacks):  # 2
    single_hourglass = self.hourglass_modules[ind]
    out_conv = self.out_convs[ind]

    hourglass_feat = single_hourglass(inter_feat)
    out_feat = out_conv(hourglass_feat)
    out_feats.append(out_feat)

    if ind < self.num_stacks - 1:  # 2-1
        inter_feat = self.conv1x1s[ind](
            inter_feat) + self.remap_convs[ind](
                out_feat)
        inter_feat = self.inters[ind](self.relu(inter_feat))
    
return out_feats  # returned after the loop: one feature map per stack

Hourglass

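The hourglass structure is shown in the figure below.

As the figure shows, the whole structure is symmetric and nested. Each level consists of five operations, up1, low1, low2, low3 and up2, and returns up1 + up2, as in the code below. The nesting in the figure is implemented by having the low2 step recursively call another hourglass of the same kind.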

def forward(self, x):
    """Forward function."""
    up1 = self.up1(x)
    low1 = self.low1(x)
    low2 = self.low2(low1)
    low3 = self.low3(low2)
    up2 = self.up2(low3)

    return up1 + up2
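
To make the recursion concrete, here is a toy 1-D sketch in plain Python. The real layers are residual blocks on 4-D tensors; in this sketch `up1` and `low3` are replaced by identity, `low1` by a pairwise max (a stand-in for stride-2 downsampling), and `up2` by nearest-neighbour repetition. All of these stand-ins, and the name `toy_hourglass`, are assumptions for illustration only.

```python
def toy_hourglass(x, depth):
    """Toy 1-D hourglass: the same five steps as the forward above, on a list."""
    up1 = list(x)  # skip branch at full resolution
    low1 = [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]  # downsample by 2
    # low2 is where the nesting happens: recurse until the innermost level
    low2 = toy_hourglass(low1, depth - 1) if depth > 1 else list(low1)
    low3 = list(low2)  # identity stand-in
    up2 = [v for v in low3 for _ in range(2)]  # upsample by 2
    return [a + b for a, b in zip(up1, up2)]


signal = [1, 0, 2, 0, 0, 3, 0, 0]  # length must be divisible by 2**depth
out = toy_hourglass(signal, depth=3)
print(len(signal), len(out))  # 8 8
```

Whatever the depth, the skip branch and the upsampled branch meet at the input resolution, so the output has the same size as the input.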

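The input and output of each hourglass module have the same shape.

The whole flow is as follows:

1. The (batch_size, 3, 511, 511) input passes through the stem module, giving an output of shape (batch_size, 256, 128, 128).
2. It passes through an hourglass module; the output shape is still (batch_size, 256, 128, 128).
3. It passes through a (3x3-c256-s1-p1) out_conv convolution, giving (batch_size, 256, 128, 128); the result is saved to a list.
4. The outputs of step 1 and step 3 each go through a (1x1-c256-s1-p0) convolution and are added together; these are conv1x1s and remap_convs in the code above. The sum then goes through a ReLU and a ResLayer, giving (batch_size, 256, 128, 128).
5. Steps 2 and 3 are repeated, and the result is appended to the list from step 3. Finally the out_feats list is returned; both feature maps in it have shape (batch_size, 256, 128, 128).

Note that the output is used differently in training and testing: in training the outputs of both hourglass modules are used, while in testing only the output of the second hourglass module is used. In training, the output here is [(batch_size, 256, 128, 128), (batch_size, 256, 128, 128)].

The hourglass output is then fed into two prediction modules, which detect the top-left corner and the bottom-right corner respectively.

The overall flow of the model is shown in the figure below.

The figure above only shows the output of the second hourglass module. During training the output of the first hourglass module is also fed into the prediction modules, and both outputs are processed identically afterwards, so the rest of this walkthrough follows a single module's output.

The top-left and bottom-right prediction modules do not share parameters, and apart from the corner pooling operation they are identical. Taking the top-left prediction module as an example, it consists of two parts: the top-left corner pooling module, and the heads after it that output the Heatmaps, Embeddings and Offsets. The top-left corner pooling module is shown in the figure below.

The code is as follows: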

direction1_conv = self.direction1_conv(x)  # torch.Size([2, 128, 128, 128])
direction2_conv = self.direction2_conv(x)  # torch.Size([2, 128, 128, 128])
direction1_feat = self.direction1_pool(direction1_conv)  # torch.Size([2, 128, 128, 128])
direction2_feat = self.direction2_pool(direction2_conv)  # torch.Size([2, 128, 128, 128])
aftpool_conv = self.aftpool_conv(direction1_feat + direction2_feat)  # torch.Size([2, 256, 128, 128])
conv1 = self.conv1(x)  # torch.Size([2, 256, 128, 128])
relu = self.relu(aftpool_conv + conv1)  # torch.Size([2, 256, 128, 128])
conv2 = self.conv2(relu)  # torch.Size([2, 256, 128, 128])
return conv2

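The backbone output, i.e. the hourglass module output, has shape (batch_size, 256, 128, 128). It is passed separately through two 3×3 Conv-BN-ReLU layers and one 1×1 Conv-BN layer, corresponding to self.direction1_conv, self.direction2_conv and self.conv1 in the code above. The outputs are (b,128,128,128), (b,128,128,128) and (b,256,128,128), which are the three feature maps in the second column of the figure above. The first two outputs are then fed into top corner pooling and left corner pooling, corresponding to self.direction1_pool and self.direction2_pool in the code above.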

Corner Pooling

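Corner pooling was proposed by the authors specifically to locate the top-left and bottom-right corners of an object using prior knowledge. As the figure below shows, standing at the top-left corner and looking from left to right identifies the object's top boundary, and looking from top to bottom identifies its left boundary.

So top-left corner pooling at a given point starts from that point and adds the maximum value seen looking horizontally to the right to the maximum value seen looking vertically down.

Taking left corner pooling as an example, the authors use a neat trick for the actual computation: traverse the pixels of each row from right to left and replace the current pixel with the largest value seen so far. Top corner pooling traverses from bottom to top instead.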
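
The right-to-left and bottom-to-top scans can be sketched in plain Python on nested lists. This is only an illustration of the running-max idea; the function names `left_pool` and `top_pool` are chosen for this sketch, and the real operators work on 4-D tensors.

```python
def left_pool(feat):
    """Each pixel becomes the max of itself and everything to its right."""
    out = [row[:] for row in feat]
    for row in out:
        for j in range(len(row) - 2, -1, -1):  # scan right to left
            row[j] = max(row[j], row[j + 1])
    return out


def top_pool(feat):
    """Each pixel becomes the max of itself and everything below it."""
    out = [row[:] for row in feat]
    for i in range(len(out) - 2, -1, -1):  # scan bottom to top
        for j in range(len(out[i])):
            out[i][j] = max(out[i][j], out[i + 1][j])
    return out


feat = [[2, 1, 3, 0, 2],
        [5, 4, 1, 1, 6]]
print(left_pool(feat))  # [[3, 3, 3, 2, 2], [6, 6, 6, 6, 6]]
print(top_pool(feat))   # [[5, 4, 3, 1, 6], [5, 4, 1, 1, 6]]
```

Each scan touches every pixel once, so the pooling costs a single pass per direction instead of a separate max over the remaining row or column for every pixel.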

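The outputs of top corner pooling and left corner pooling are added and passed through a 3×3 Conv-BN, then added to the result of the initial 1×1 Conv-BN, followed by a ReLU and another 3×3 Conv-BN-ReLU. This gives the grey feature map in the middle of the prediction module in the figure, i.e. the conv2 returned by the code, with shape (b,256,128,128).

The corner pooling result is then fed into three branches, each a 3×3 Conv-ReLU followed by a 1×1 Conv, which produce the Heatmaps, Embeddings and Offsets with shapes (b, num_class, 128, 128), (b, 1, 128, 128) and (b, 2, 128, 128).

At this point we have the final model outputs. The bottom-right prediction module produces three outputs of the same shapes, for six outputs in total.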

result_list = [tl_heat, br_heat, tl_emb, br_emb, tl_off, br_off]

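During training, the output of the first hourglass module is also passed directly into the prediction modules and yields the same six outputs as above. These outputs are then compared with the ground truth to compute the loss.

Building the labels

Before computing the loss, we need the ground truth that corresponds to the model outputs. First, the heatmap ground truth. The heatmap has one channel per class. If the original image contains an object of some class, the corresponding channel should be positive at the corner location (mapped by the stride) and negative everywhere else. This is too strict, though: a prediction built from a point near the corner can still have a very high IoU with the ground truth, say 0.9, and labelling such a point as negative clearly hurts training. So instead of treating every non-corner location as an equally penalized negative, the authors reduce the penalty for locations inside a circle centred on the corner. The radius is determined by the required IoU with the ground truth, and the labels inside the circle follow a modified Gaussian proposed by the authors: the closer to the centre, the larger the label, and the farther away, the closer it gets to the negative label.

For the radius computation, see https://zhuanlan.zhihu.com/p/96856635?from_voters_page=true. There are three cases: both corners inside the ground-truth box, both outside, and one inside with one outside. Note that although the figure draws a circle, the computation actually treats the region as a square. For the chosen IoU, a quadratic equation in r is solved for each of the three cases to obtain the final radius r. The code is as follows: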

from math import sqrt


def gaussian_radius(det_size, min_overlap):
    r"""Generate 2D gaussian radius.

    This function is modified from the `official github repo
    <https://github.com/princeton-vl/CornerNet-Lite/blob/master/core/sample/
    utils.py#L65>`_.

    Given ``min_overlap``, radius could be computed by a quadratic equation
    according to Vieta's formulas.

    There are 3 cases for computing gaussian radius, details are following:

    - Explanation of figure: ``lt`` and ``br`` indicates the left-top and
      bottom-right corner of ground truth box. ``x`` indicates the
      generated corner at the limited position when ``radius=r``.

    - Case1: one corner is inside the gt box and the other is outside.

    .. code:: text

        |<   width   >|

        lt-+----------+         -
        |  |          |         ^
        +--x----------+--+
        |  |          |  |
        |  |          |  |    height
        |  | overlap  |  |
        |  |          |  |
        |  |          |  |      v
        +--+---------br--+      -
           |          |  |
           +----------+--x

    To ensure IoU of generated box and gt box is larger than ``min_overlap``:

    .. math::
        \cfrac{(w-r)*(h-r)}{w*h+(w+h)r-r^2} \ge {iou} \quad\Rightarrow\quad
        {r^2-(w+h)r+\cfrac{1-iou}{1+iou}*w*h} \ge 0 \\
        {a} = 1,\quad{b} = {-(w+h)},\quad{c} = {\cfrac{1-iou}{1+iou}*w*h}
        {r} \le \cfrac{-b-\sqrt{b^2-4*a*c}}{2*a}

    - Case2: both two corners are inside the gt box.

    .. code:: text

        |<   width   >|

        lt-+----------+         -
        |  |          |         ^
        +--x-------+  |
        |  |       |  |
        |  |overlap|  |       height
        |  |       |  |
        |  +-------x--+
        |          |  |         v
        +----------+-br         -

    To ensure IoU of generated box and gt box is larger than ``min_overlap``:

    .. math::
        \cfrac{(w-2*r)*(h-2*r)}{w*h} \ge {iou} \quad\Rightarrow\quad
        {4r^2-2(w+h)r+(1-iou)*w*h} \ge 0 \\
        {a} = 4,\quad {b} = {-2(w+h)},\quad {c} = {(1-iou)*w*h}
        {r} \le \cfrac{-b-\sqrt{b^2-4*a*c}}{2*a}

    - Case3: both two corners are outside the gt box.

    .. code:: text

           |<   width   >|

        x--+----------------+
        |  |                |
        +-lt-------------+  |   -
        |  |             |  |   ^
        |  |             |  |
        |  |   overlap   |  | height
        |  |             |  |
        |  |             |  |   v
        |  +------------br--+   -
        |                |  |
        +----------------+--x

    To ensure IoU of generated box and gt box is larger than ``min_overlap``:

    .. math::
        \cfrac{w*h}{(w+2*r)*(h+2*r)} \ge {iou} \quad\Rightarrow\quad
        {4*iou*r^2+2*iou*(w+h)r+(iou-1)*w*h} \le 0 \\
        {a} = {4*iou},\quad {b} = {2*iou*(w+h)},\quad {c} = {(iou-1)*w*h} \\
        {r} \le \cfrac{-b+\sqrt{b^2-4*a*c}}{2*a}

    Args:
        det_size (list[int]): Shape of object.
        min_overlap (float): Min IoU with ground truth for boxes generated by
            keypoints inside the gaussian kernel.

    Returns:
        radius (int): Radius of gaussian kernel.
    """
    height, width = det_size

    a1 = 1
    b1 = (height + width)
    c1 = width * height * (1 - min_overlap) / (1 + min_overlap)
    sq1 = sqrt(b1**2 - 4 * a1 * c1)
    r1 = (b1 - sq1) / (2 * a1)

    a2 = 4
    b2 = 2 * (height + width)
    c2 = (1 - min_overlap) * width * height
    sq2 = sqrt(b2**2 - 4 * a2 * c2)
    r2 = (b2 - sq2) / (2 * a2)

    a3 = 4 * min_overlap
    b3 = -2 * min_overlap * (height + width)
    c3 = (min_overlap - 1) * width * height
    sq3 = sqrt(b3**2 - 4 * a3 * c3)
    r3 = (b3 + sq3) / (2 * a3)
    return min(r1, r2, r3)
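
As a sanity check on the three quadratics, the sketch below recomputes the three candidate radii with the same coefficients and plugs them back into the IoU expressions from the docstring. The helper names (`radius_candidates`, `iou_one_in_one_out`, and so on) and the 40x60 box are made up for this example; 0.3 is just a sample value for `min_overlap`.

```python
from math import sqrt


def radius_candidates(height, width, iou):
    """The three candidate radii, using the same coefficients as gaussian_radius."""
    s = height + width
    area = height * width
    r1 = (s - sqrt(s ** 2 - 4 * area * (1 - iou) / (1 + iou))) / 2
    r2 = (2 * s - sqrt(4 * s ** 2 - 16 * (1 - iou) * area)) / 8
    r3 = (-2 * iou * s + sqrt(4 * iou ** 2 * s ** 2 + 16 * iou * (1 - iou) * area)) / (8 * iou)
    return r1, r2, r3


def iou_one_in_one_out(h, w, r):  # Case 1
    return (w - r) * (h - r) / (w * h + (w + h) * r - r * r)


def iou_both_inside(h, w, r):  # Case 2
    return (w - 2 * r) * (h - 2 * r) / (w * h)


def iou_both_outside(h, w, r):  # Case 3
    return w * h / ((w + 2 * r) * (h + 2 * r))


h, w, min_overlap = 40, 60, 0.3
r1, r2, r3 = radius_candidates(h, w, min_overlap)
r = min(r1, r2, r3)
for fn in (iou_one_in_one_out, iou_both_inside, iou_both_outside):
    # with the smallest radius, every case stays at or above min_overlap
    print(fn.__name__, round(fn(h, w, r), 3))
```

Each candidate radius hits the target IoU exactly in its own case, and since the IoU in every case shrinks as r grows, taking the minimum keeps all three cases at or above `min_overlap`.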

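Once the radius r is fixed, the label values inside this region are computed from a 2D Gaussian $e^{-\frac{x^{2}+y^{2}}{2\sigma ^{2}}}$, where $\sigma$ is tied to the radius r (the CornerNet paper sets $\sigma$ to 1/3 of the radius).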
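
A minimal sketch of how such a label patch could be generated, in plain Python. It assumes $\sigma$ = radius / 3, following the CornerNet paper; library implementations may derive $\sigma$ slightly differently, and the function name `gaussian_kernel` is only illustrative.

```python
from math import exp


def gaussian_kernel(radius, sigma):
    """(2r+1) x (2r+1) grid of exp(-(x^2 + y^2) / (2 sigma^2)), centred on the corner."""
    coords = range(-radius, radius + 1)
    return [[exp(-(x * x + y * y) / (2 * sigma * sigma)) for x in coords]
            for y in coords]


radius = 3
kernel = gaussian_kernel(radius, sigma=radius / 3)  # assumption: sigma = r / 3
for row in kernel:
    print(" ".join(f"{v:.3f}" for v in row))
```

The centre of the patch is exactly 1 (the positive label) and values decay toward 0 at the border, so the patch can be merged into the heatmap around each ground-truth corner.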

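Next come the offset labels. These are easy to understand: when corner coordinates in the original image are mapped onto the output feature map according to the stride, they are rounded down, which introduces an error relative to the original value. The offset label is exactly this difference:

$o_{k}=\left(\frac{x_{k}}{n}-\left\lfloor\frac{x_{k}}{n}\right\rfloor,\ \frac{y_{k}}{n}-\left\lfloor\frac{y_{k}}{n}\right\rfloor\right)$

where $(x_{k}, y_{k})$ are the coordinates of corner $k$ in the original image and $n$ is the downsampling factor.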
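
A small worked example of the offset label, assuming a stride of 4 and a made-up corner at (123, 70); the helper name `corner_offset` is hypothetical.

```python
from math import floor


def corner_offset(x, y, stride):
    """Return the integer feature-map cell and the fractional part lost by flooring."""
    fx, fy = x / stride, y / stride
    ix, iy = floor(fx), floor(fy)
    return (ix, iy), (fx - ix, fy - iy)


cell, offset = corner_offset(x=123, y=70, stride=4)
print(cell, offset)  # (30, 17) (0.75, 0.5)
```

At inference the predicted offset is added back to the integer cell before scaling to image coordinates, which recovers the precision lost by the flooring.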

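The embedding labels are even simpler: they record which corners belong together. For each object, the pair of corner coordinates mapped onto the output feature map is put into a list.

corner_match.append([[top_idx, left_idx], [bottom_idx, right_idx]])

Here top_idx is the y coordinate of an object's top-left corner, mapped onto the output feature map and rounded to an integer.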

Loss

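The authors design a variant of focal loss for the corner heatmaps:

$L_{det}=\frac{-1}{N}\sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}\begin{cases}(1-p_{cij})^{\alpha}\log(p_{cij}) & \text{if } y_{cij}=1\\(1-y_{cij})^{\beta}(p_{cij})^{\alpha}\log(1-p_{cij}) & \text{otherwise}\end{cases}$

where $N$ is the number of objects in the image, $\alpha$ and $\beta$ are hyperparameters set to 2 and 4 in the paper, $p_{cij}$ is the predicted score at location $(i,j)$ for class $c$, and $y_{cij}$ is the Gaussian-smoothed ground-truth label at the same location. In the code below, $\beta$ appears as the variable gamma: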

eps = 1e-12
alpha = 2.0
gamma = 4.0
pos_weights = gaussian_target.eq(1)
neg_weights = (1 - gaussian_target).pow(gamma)
pos_loss = -(pred + eps).log() * (1 - pred).pow(alpha) * pos_weights
neg_loss = -(1 - pred + eps).log() * pred.pow(alpha) * neg_weights
heatmap_loss = pos_loss + neg_loss
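
To see how the two terms behave, here is a per-location scalar version in plain Python. It mirrors the tensor expressions above for a single heatmap cell; the function name `corner_focal_loss` and the sample values are assumptions made for this sketch.

```python
from math import log


def corner_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Loss at one location: pred in (0, 1), target is the Gaussian-smoothed label."""
    if target == 1:  # exact corner location
        return -log(pred + eps) * (1 - pred) ** alpha
    # everywhere else; (1 - target)^beta shrinks the penalty near a corner
    return -log(1 - pred + eps) * pred ** alpha * (1 - target) ** beta


print(corner_focal_loss(0.9, 1.0))  # confident hit on a corner: small loss
print(corner_focal_loss(0.9, 0.8))  # high score right next to a corner: mild penalty
print(corner_focal_loss(0.9, 0.0))  # high score far from any corner: large penalty
```

The same false-positive score of 0.9 is penalized far less when the Gaussian label is 0.8 than when it is 0, which is exactly the relaxation the label design aims for.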

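The embeddings loss is:

$L_{pull}=\frac{1}{N}\sum_{k=1}^{N}\left[(e_{t_{k}}-e_{k})^{2}+(e_{b_{k}}-e_{k})^{2}\right]$

$L_{push}=\frac{1}{N(N-1)}\sum_{k=1}^{N}\sum_{j=1,\,j\neq k}^{N}\max\left(0,\ \Delta-\left|e_{k}-e_{j}\right|\right)$

where $e_{t_{k}}$ is the predicted embedding value at the top-left corner of object $k$, $e_{b_{k}}$ is the value at its bottom-right corner, $e_{k}$ is the mean of the two, and $\Delta$ is the margin (1 in the code). The authors design two losses, pull and push: the idea is to pull the pair of corners that belong to the same object together and to push the corners of different objects apart. The code is as follows: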

import torch
import torch.nn.functional as F


def ae_loss_per_image(tl_preds, br_preds, match):
    """Associative Embedding Loss in one image.

    Associative Embedding Loss including two parts: pull loss and push loss.
    Pull loss makes embedding vectors from same object closer to each other.
    Push loss distinguish embedding vector from different objects, and makes
        the gap between them is large enough.

    During computing, usually there are 3 cases:
        - no object in image: both pull loss and push loss will be 0.
        - one object in image: push loss will be 0 and pull loss is computed
            by the two corner of the only object.
        - more than one objects in image: pull loss is computed by corner pairs
            from each object, push loss is computed by each object with all
            other objects. We use confusion matrix with 0 in diagonal to
            compute the push loss.

    Args:
        tl_preds (tensor): Embedding feature map of left-top corner.
        br_preds (tensor): Embedding feature map of bottom-right corner.
        match (list): Downsampled coordinates pair of each ground truth box.
    """

    tl_list, br_list, me_list = [], [], []
    if len(match) == 0:  # no object in image
        pull_loss = tl_preds.sum() * 0.
        push_loss = tl_preds.sum() * 0.
    else:
        for m in match:
            [tl_y, tl_x], [br_y, br_x] = m
            tl_e = tl_preds[:, tl_y, tl_x].view(-1, 1)  # torch.Size([1]) -> torch.Size([1, 1])
            # tensor([[0.0916]], device='cuda:0', grad_fn=<ViewBackward>)
            br_e = br_preds[:, br_y, br_x].view(-1, 1)
            tl_list.append(tl_e)
            br_list.append(br_e)
            me_list.append((tl_e + br_e) / 2.0)

        tl_list = torch.cat(tl_list)  # torch.Size([3, 1])
        br_list = torch.cat(br_list)
        me_list = torch.cat(me_list)

        assert tl_list.size() == br_list.size()

        # N is object number in image, M is dimension of embedding vector
        N, M = tl_list.size()  # 3,1

        pull_loss = (tl_list - me_list).pow(2) + (br_list - me_list).pow(2)
        pull_loss = pull_loss.sum() / N

        margin = 1  # exp setting of CornerNet, details in section 3.3 of paper

        # confusion matrix of push loss
        conf_mat = me_list.expand((N, N, M)).permute(1, 0, 2) - me_list
        conf_weight = 1 - torch.eye(N).type_as(me_list)
        conf_mat = conf_weight * (margin - conf_mat.sum(-1).abs())

        if N > 1:  # more than one object in current image
            push_loss = F.relu(conf_mat).sum() / (N * (N - 1))
        else:
            push_loss = tl_preds.sum() * 0.

    return pull_loss, push_loss
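
The same pull/push logic can be written out with plain Python floats, which makes the three cases (no object, one object, several objects) easy to try by hand. This is a simplified sketch for 1-D embeddings; `ae_loss_scalar` and the sample embeddings are hypothetical.

```python
def ae_loss_scalar(tl_embs, br_embs, margin=1.0):
    """Pull/push loss for 1-D embeddings; tl_embs[k] and br_embs[k] belong to object k."""
    n = len(tl_embs)
    if n == 0:  # no object in the image
        return 0.0, 0.0
    means = [(t + b) / 2 for t, b in zip(tl_embs, br_embs)]
    pull = sum((t - m) ** 2 + (b - m) ** 2
               for t, b, m in zip(tl_embs, br_embs, means)) / n
    push = 0.0
    if n > 1:  # push only exists between different objects
        push = sum(max(0.0, margin - abs(means[k] - means[j]))
                   for k in range(n) for j in range(n) if j != k) / (n * (n - 1))
    return pull, push


# Corners agree within each object and the two objects are 2.0 apart: no loss.
print(ae_loss_scalar([0.0, 2.0], [0.0, 2.0]))  # (0.0, 0.0)
# Two objects whose mean embeddings are only 0.2 apart: the push term is non-zero.
print(ae_loss_scalar([0.0, 0.2], [0.0, 0.2]))
```

The double loop with `j != k` plays the role of the confusion matrix with a zeroed diagonal in the tensor version.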

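The offsets loss is a plain SmoothL1Loss and is not covered further here.

Inference pipeline

At test time, each image is tested together with its flipped copy.
The image is not resized; it is zero-padded with 127 and then fed into the network.
NMS is applied to the heatmaps output by the network. Note that this differs from the usual IoU-based NMS in object detection: here a 3×3 max pooling is applied to the heatmap and the result is compared with the original heatmap. Locations whose value changed are not local maxima; they are set to 0 and are no longer considered as candidate corner locations. The code is as follows: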
```
def _local_maximum(self, heat, kernel=3):
    pad = (kernel - 1) // 2
    hmax = F.max_pool2d(heat, kernel, stride=1, padding=pad)
    keep = (hmax == heat).float()
    return heat * keep
```
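The top_k (100) top-left and bottom-right locations are selected by confidence, regardless of class.
These 200 locations are adjusted by their offsets and then mapped back to the original image (multiply by the stride and subtract the padding, 127/2).
Every one of the 100 top-left locations is paired with every one of the 100 bottom-right locations, giving 10000 candidate combinations. The score of a combination is the mean of its top-left and bottom-right scores.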
```
scores = (tl_scores + br_scores) / 2
```
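The distance between the embeddings of each paired corner is computed, and among the 10000 combinations those are rejected whose distance exceeds the threshold (0.5), whose two corners belong to different classes, or whose coordinates are inconsistent (the bottom-right x or y coordinate is smaller than that of the top-left).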
```
scores[cls_inds] = -1
scores[width_inds] = -1
scores[height_inds] = -1
scores[dist_inds] = -1
```
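From the 10000 combinations, the top 1000 are selected by score, and the corresponding indices are used to pick 1000 bboxes and classes.
Soft-NMS is applied per class, and the top max_per_img=100 are kept.
Detections with a score above thresh=0.3 form the final result.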
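
The pairing and filtering steps of the pipeline can be sketched in plain Python as follows. This is a simplified illustration, not the library's decoding code: rejected pairs are simply skipped instead of having their score set to -1, and the dict layout (`x`, `y`, `score`, `emb`, `cls`) and the function name `pair_corners` are assumptions made for this example.

```python
def pair_corners(tl, br, dist_thresh=0.5):
    """Pair every top-left corner with every bottom-right corner and filter."""
    boxes = []
    for t in tl:
        for b in br:
            if t["cls"] != b["cls"]:
                continue  # corners of different classes
            if abs(t["emb"] - b["emb"]) > dist_thresh:
                continue  # embeddings too far apart
            if b["x"] < t["x"] or b["y"] < t["y"]:
                continue  # bottom-right must lie below and to the right
            score = (t["score"] + b["score"]) / 2
            boxes.append((score, t["x"], t["y"], b["x"], b["y"], t["cls"]))
    return sorted(boxes, reverse=True)  # best-scoring pairs first


tl = [{"x": 10, "y": 10, "score": 0.9, "emb": 0.1, "cls": 0},
      {"x": 50, "y": 40, "score": 0.8, "emb": 1.5, "cls": 0}]
br = [{"x": 30, "y": 30, "score": 0.7, "emb": 0.2, "cls": 0},
      {"x": 90, "y": 80, "score": 0.6, "emb": 1.4, "cls": 0}]

boxes = pair_corners(tl, br)
print([b[1:5] for b in boxes])  # [(10, 10, 30, 30), (50, 40, 90, 80)]
```

Of the four possible pairs, the two cross-object pairs are dropped because their embeddings differ by 1.3, well above the 0.5 threshold.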

References

https://zhuanlan.zhihu.com/p/188587434

https://zhuanlan.zhihu.com/p/103705172
