yolox

最新推荐文章于 2024-09-14 08:24:15 发布

rrr2

最新推荐文章于 2024-09-14 08:24:15 发布

阅读量359

点赞数

分类专栏：目标检测文章标签：深度学习神经网络 pytorch

本文链接：https://blog.csdn.net/qq_35608277/article/details/120007262

版权

目标检测专栏收录该内容

90 篇文章 5 订阅

订阅专栏

网络的输入为：1×3×640×640
yolo的输出为：[1/8, 1/16, 1/32]stride, 分别的输出尺寸为：80×80,40 ×40,20×20
这里对3个分支，每个分支有两个解耦头；同时回归分支，再解耦为box + object 两个输出
我们添加了Mosaic与Mixup两种数据增广以提升YOLOX的性能。Mosaic是U版YOLOv3中引入的一种有效增广策略，后来被广泛应用于YOLOv4、YOLOv5等检测器中。MixUp早期是为图像分类设计后在BoF中进行修改用于目标检测训练。通过这种额外的数据增广，基线模型取得了42.0%AP指标。注：由于采用了更强的数据增广，我们发现ImageNet预训练将毫无意义，因此，所有模型我们均从头开始训练。

Mosaic数据增强方法

mosaic数据增强则利用了四张图片，对四张图片进行拼接，每一张图片都有其对应的框框，将四张图片拼接之后就获得一张新的图片，同时也获得这张图片对应的框框，然后我们将这样一张新的图片传入到神经网络当中去学习，相当于一下子传入四张图片进行学习了。论文中说这极大丰富了检测物体的背景！且在标准化BN计算的时候一下子会计算四张图片的数据！

代码部分在Mosaicdetection.py中

首先获得4张图片，进行Mosaic增强
然后再随机选一张进行mixup增强

# 1. Mosaic增强
            # 3 additional image indices
            indices = [idx] + [random.randint(0, len(self._dataset) - 1) for _ in range(3)]

            for i_mosaic, index in enumerate(indices):
                img, _labels, _, _ = self._dataset.pull_item(index)             # labels:x1,y1,x2,y2,idx(0~n-1): 3*5
                h0, w0 = img.shape[:2]  # orig hw
                scale = min(1. * input_h / h0, 1. * input_w / w0)
                img = cv2.resize(
                    img, (int(w0 * scale), int(h0 * scale)), interpolation=cv2.INTER_LINEAR
                )
                # generate output mosaic image
                (h, w, c) = img.shape[:3]
                if i_mosaic == 0:
                    mosaic_img = np.full((input_h * 2, input_w * 2, c), 114, dtype=np.uint8)

                # suffix l means large image, while s means small image in mosaic aug.
                (l_x1, l_y1, l_x2, l_y2), (s_x1, s_y1, s_x2, s_y2) = get_mosaic_coordinate(
                    mosaic_img, i_mosaic, xc, yc, w, h, input_h, input_w
                )

                mosaic_img[l_y1:l_y2, l_x1:l_x2] = img[s_y1:s_y2, s_x1:s_x2]
                padw, padh = l_x1 - s_x1, l_y1 - s_y1

                labels = _labels.copy()
                # Normalized xywh to pixel xyxy format
                if _labels.size > 0:
                    labels[:, 0] = scale * _labels[:, 0] + padw
                    labels[:, 1] = scale * _labels[:, 1] + padh
                    labels[:, 2] = scale * _labels[:, 2] + padw
                    labels[:, 3] = scale * _labels[:, 3] + padh
                mosaic_labels.append(labels)

            if len(mosaic_labels):
                mosaic_labels = np.concatenate(mosaic_labels, 0)
                np.clip(mosaic_labels[:, 0], 0, 2 * input_w, out=mosaic_labels[:, 0])
                np.clip(mosaic_labels[:, 1], 0, 2 * input_h, out=mosaic_labels[:, 1])
                np.clip(mosaic_labels[:, 2], 0, 2 * input_w, out=mosaic_labels[:, 2])
                np.clip(mosaic_labels[:, 3], 0, 2 * input_h, out=mosaic_labels[:, 3])

# 2. mixup
origin_img = 0.5 * origin_img + 0.5 * padded_cropped_img.astype(np.float32)

YOLOv4、YOLOv5均采用了YOLOv3原始的anchor设置。然而anchor机制存在诸多问题：(1) 为获得最优检测性能，需要在训练之前进行聚类分析以确定最佳anchor集合，这些anchor集合存在数据相关性，泛化性能较差；(2) anchor机制提升了检测头的复杂度。

Anchor-free检测器在过去两年得到了长足发展并取得了与anchor检测器相当的性能。将YOLO转换为anchor-free形式非常简单，我们将每个位置的预测从3下降为1并直接预测四个值：即两个offset以及高宽。参考FCOS，我们将每个目标的中心定位正样本并预定义一个尺度范围以便于对每个目标指派FPN水平。这种改进可以降低检测器的参数量于GFLOPs进而取得更快更优的性能：42.9%AP。

为确保与YOLOv3的一致性，前述anchor-free版本仅仅对每个目标赋予一个正样本，而忽视了其他高质量预测。参考FCOS，我们简单的赋予中心3×3区域为正样本。此时模型性能提升到45.0%，超过了当前最佳U版YOLOv3的44.3%。

对8400个yolo块的中心点，看是否在不同scale的中心区域，如果在，则将该点的pred暂时认定为正样本

        center_radius = 2.5

        gt_bboxes_per_image_l = (gt_bboxes_per_image[:, 0]).unsqueeze(1).repeat(
            1, total_num_anchors
        ) - center_radius * expanded_strides_per_image.unsqueeze(0)
        gt_bboxes_per_image_r = (gt_bboxes_per_image[:, 0]).unsqueeze(1).repeat(
            1, total_num_anchors
        ) + center_radius * expanded_strides_per_image.unsqueeze(0)
        gt_bboxes_per_image_t = (gt_bboxes_per_image[:, 1]).unsqueeze(1).repeat(
            1, total_num_anchors
        ) - center_radius * expanded_strides_per_image.unsqueeze(0)
        gt_bboxes_per_image_b = (gt_bboxes_per_image[:, 1]).unsqueeze(1).repeat(
            1, total_num_anchors
        ) + center_radius * expanded_strides_per_image.unsqueeze(0)

        c_l = x_centers_per_image - gt_bboxes_per_image_l               # 对8400个yolo块的中心点，看是否在不同scale的中心区域
        c_r = gt_bboxes_per_image_r - x_centers_per_image
        c_t = y_centers_per_image - gt_bboxes_per_image_t
        c_b = gt_bboxes_per_image_b - y_centers_per_image
        center_deltas = torch.stack([c_l, c_t, c_r, c_b], 2)
        is_in_centers = center_deltas.min(dim=-1).values > 0.0
        is_in_centers_all = is_in_centers.sum(dim=0) > 0

SimOTA标签分配
参考自:https://zhuanlan.zhihu.com/p/392221567

这里不得不提到OTA(Optimal Transport Assignment,Paper: https://arxiv.org/abs/2103.14259)，在目标检测中，有时候经常会出现一些模棱两可的anchor，如图3，即某一个anchor，按照正样本匹配规则，会匹配到两个gt，而retinanet这样基于IoU分配是会把anchor分配给IoU最大的gt，而OTA作者认为，将模糊的anchor分配给任何gt或背景都会对其他gt的梯度造成不利影响，因此，对模糊anchor样本的分配是特殊的，除了局部视图之外还需要其他信息。因此，更好的分配策略应该摆脱对每个gt对象进行最优分配的惯例，而转向全局最优的思想，换句话说，为图像中的所有gt对象找到全局的高置信度分配。（和DeTR中使用匈牙利算法一对一分配有点类似）
首先进行筛选 anchor的中心在 box & 在 box 的中心一定区域
进行 simota 标签分配
OTA分配的时候，cost是一个 n_gt × m_anchor 的矩阵。个人想法：1. 为m个anchor，每个match一个gt； 2. 通过OTA方法进行分配，为m中的m2个anchor分配目标gt，这样可以节省计算量

        # 1. 首先进行筛选 anchor的中心在 box & 在 box 的中心一定区域
        fg_mask, is_in_boxes_and_center = self.get_in_boxes_info(               # fg_mask, 用来做目标预测,初筛，anchor的中心在 box & 在 box 的中心一定区域
            gt_bboxes_per_image, expanded_strides, x_shifts, y_shifts, total_num_anchors, num_gt,
        )

        bboxes_preds_per_image = bboxes_preds_per_image[fg_mask]                # fg_mask: 3471
        cls_preds_ = cls_preds[batch_idx][fg_mask]
        obj_preds_ = obj_preds[batch_idx][fg_mask]
        num_in_boxes_anchor = bboxes_preds_per_image.shape[0]

        if mode == "cpu":
            gt_bboxes_per_image = gt_bboxes_per_image.cpu()
            bboxes_preds_per_image = bboxes_preds_per_image.cpu()

        pair_wise_ious = bboxes_iou(                                            # 3471*31
            gt_bboxes_per_image, bboxes_preds_per_image, False
        )

        gt_cls_per_image = (
            F.one_hot(gt_classes.to(torch.int64), self.num_classes).float()                 # 31*80
            .unsqueeze(1).repeat(1, num_in_boxes_anchor, 1)
        )
        pair_wise_ious_loss = -torch.log(pair_wise_ious + 1e-8)                 # 9*3471

        if mode == "cpu":
            cls_preds_, obj_preds_ = cls_preds_.cpu(), obj_preds_.cpu()

        cls_preds_ = (                                                          # 9*3471*80
            cls_preds_.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
            * obj_preds_.unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
        )
        pair_wise_cls_loss = F.binary_cross_entropy(                            # 9*3471
            cls_preds_.sqrt_(), gt_cls_per_image, reduction="none"
        ).sum(-1)
        del cls_preds_

        cost = (                                    # 9*3471, 标签分配
            pair_wise_cls_loss
            + 3.0 * pair_wise_ious_loss
            + 100000.0 * (~is_in_boxes_and_center)
        )

        # 2. 进行 simota 标签分配
        (
            num_fg, gt_matched_classes, pred_ious_this_matching, matched_gt_inds
        ) = self.dynamic_k_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask)
        del pair_wise_cls_loss, cost, pair_wise_ious, pair_wise_ious_loss

        if mode == "cpu":
            gt_matched_classes = gt_matched_classes.cuda()
            fg_mask = fg_mask.cuda()
            pred_ious_this_matching = pred_ious_this_matching.cuda()
            matched_gt_inds = matched_gt_inds.cuda()

        return gt_matched_classes, fg_mask, pred_ious_this_matching, matched_gt_inds, num_fg