RetinaFace


I. Data Processing

RetinaFace's data processing lives mainly in two scripts: wider_face.py and data_augment.py.

1. wider_face.py

wider_face.py first defines the class WiderFaceDetection(data.Dataset), which implements three methods:
(1) __init__(self, txt_path, preproc=None), the initializer;
(2) __len__(self), which returns the number of samples;
(3) __getitem__(self, index), which fetches one sample's data.

Steps:
(1) Take the path of a .txt annotation file (it lists the WIDER FACE image paths, plus the box coordinates and landmark coordinates of every face in each image);
(2) Apply data augmentation to the loaded image via the preproc class, which lives in data_augment.py and is covered below;
(3) Define one list for the image paths and another for the per-image box and landmark annotations;
(4) Parse label.txt: read all of its lines into lines and iterate. A line starting with "#" carries an image path, which goes into imgs_path; the lines that follow describe that image. Each annotation line holds the box x, y, then w, h, followed by the five landmark points, with a 0.0 flag after each point;
(5) Collect these rows in words; the isFirst flag marks when one image's block of annotations has been fully read. A sketch of this parsing loop follows.
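
For reference, here is a minimal sketch of that parsing loop. It follows the widely used Pytorch_Retinaface implementation, but treat it as an illustration rather than a verbatim copy:

import torch.utils.data as data

class WiderFaceDetection(data.Dataset):
    def __init__(self, txt_path, preproc=None):
        self.preproc = preproc
        self.imgs_path = []  # one entry per image
        self.words = []      # per-image list of [x, y, w, h, landmarks..., label] rows
        isFirst = True
        labels = []
        with open(txt_path, 'r') as f:
            for line in f:
                line = line.rstrip()
                if line.startswith('#'):  # a '#' line carries an image path
                    if isFirst:
                        isFirst = False
                    else:  # flush the previous image's annotation rows
                        self.words.append(labels.copy())
                        labels.clear()
                    path = line[2:]
                    self.imgs_path.append(txt_path.replace('label.txt', 'images/') + path)
                else:  # an annotation row belonging to the current image
                    labels.append([float(x) for x in line.split(' ')])
        self.words.append(labels)  # flush the last image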

2. data_augment.py

This script contains the individual augmentation routines (crop, color distortion, pad-to-square, mirror, mean subtraction, and so on) plus the class that ties them all together: the preproc mentioned in wider_face.py. Its __call__ runs every augmentation in turn and returns the processed image and target.
The flow, as the code shows:
(1) take the image and its annotations;
(2) augment in the order crop -> distort -> pad-to-square -> mirror -> resize and subtract mean.

class preproc(object):

    def __init__(self, img_dim, rgb_means):
        self.img_dim = img_dim
        self.rgb_means = rgb_means

    def __call__(self, image, targets):
        assert targets.shape[0] > 0, "this image does not have gt"

        # split targets: columns 0-3 are the box, columns 4..-2 the ten landmark values, the last column the class label
        boxes = targets[:, :4].copy()
        labels = targets[:, -1].copy()
        landm = targets[:, 4:-1].copy()

        image_t, boxes_t, labels_t, landm_t, pad_image_flag = _crop(image, boxes, labels, landm, self.img_dim)
        image_t = _distort(image_t)
        image_t = _pad_to_square(image_t, self.rgb_means, pad_image_flag)
        image_t, boxes_t, landm_t = _mirror(image_t, boxes_t, landm_t)
        height, width, _ = image_t.shape
        image_t = _resize_subtract_mean(image_t, self.img_dim, self.rgb_means)
        # normalize box and landmark coordinates to [0, 1]
        boxes_t[:, 0::2] /= width
        boxes_t[:, 1::2] /= height
        landm_t[:, 0::2] /= width
        landm_t[:, 1::2] /= height

        labels_t = np.expand_dims(labels_t, 1)
        targets_t = np.hstack((boxes_t, landm_t, labels_t))

        return image_t, targets_t
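
Putting the two scripts together, construction typically looks like the following. The 640 input size and the BGR means (104, 117, 123) are the values commonly used with this codebase; they are assumptions here, not something fixed by the text above:

img_dim = 640
rgb_means = (104, 117, 123)  # assumed BGR channel means from the common training config
dataset = WiderFaceDetection('data/widerface/train/label.txt', preproc(img_dim, rgb_means))
image, target = dataset[0]
# image: (3, 640, 640), mean-subtracted
# target: (num_faces, 15) = 4 box values + 10 landmark values + 1 label, coords in [0, 1]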

II. Prior Box Generation

Prior box generation lives in prior_box.py, mainly the class PriorBox(object).
Steps:
(1) The forward function builds the full set of prior boxes. It iterates over the three feature maps (of sizes [80, 40, 20]; the paper describes five pyramid levels, but this implementation uses only three, as the code below shows) and fetches each level's box sizes, i.e. min_sizes[k];

        anchors = []
        for k, f in enumerate(self.feature_maps):
            min_sizes = self.min_sizes[k]

(2) It then iterates over every cell of the feature map (itertools.product() returns the Cartesian product, pairing up the row and column indices). s_kx and s_ky are the prior box width w and height h; dense_cx and dense_cy hold the prior box centers, i.e. the feature-map cells being visited;

            for i, j in product(range(f[0]), range(f[1])):
                for min_size in min_sizes:
                    # s_kx and s_ky are the prior box width w and height h, normalized by the image size
                    s_kx = min_size / self.image_size[1]
                    s_ky = min_size / self.image_size[0]
                    # dense_cx and dense_cy hold the prior box center, i.e. the current feature-map cell
                    # x and y are scaled by steps / image_size; feature_map * steps equals the image size, so the centers tile the image evenly
                    dense_cx = [x * self.steps[k] / self.image_size[1] for x in [j + 0.5]]
                    dense_cy = [y * self.steps[k] / self.image_size[0] for y in [i + 0.5]]

(3) Finally, the anchors are reshaped into [cx, cy, w, h] rows and clamped to the range 0-1 (clamp() squeezes the input between min and max: values below min become min, values above max become max). Since the paper fixes the aspect ratio at 1:1, one box per min_size suffices and the result is returned directly.

                    # collect each anchor as [cx, cy, w, h]; the aspect ratio is fixed at 1:1, so one box per min_size
                    for cy, cx in product(dense_cy, dense_cx):
                        anchors += [cx, cy, s_kx, s_ky]

        # back to torch land
        output = torch.Tensor(anchors).view(-1, 4)
        if self.clip:
            output.clamp_(max=1, min=0)
        return output
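
With the MobileNet configuration commonly used with this code (assumed here: 640x640 input, steps [8, 16, 32], min_sizes [[16, 32], [64, 128], [256, 512]]), the total anchor count is easy to verify:

# two min_sizes per level, one 1:1 box each => 2 anchors per feature-map cell
n = 80 * 80 * 2 + 40 * 40 * 2 + 20 * 20 * 2
print(n)  # 16800, so output has shape (16800, 4)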

III. Network Architecture

The network is defined across retinaface.py and net.py. RetinaFace uses ResNet50 or MobileNet as its backbone.
Main components:
(1) MobileNet
It is built from depthwise separable convolutions, which sharply cut the parameter count (the numeric comments in the code below appear to track the growing receptive field).
[Figure: depthwise separable convolution]
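
The backbone code below leans on small helpers from net.py (conv_bn, conv_dw, conv_bn1X1, conv_bn_no_relu). A sketch of what they look like, consistent with the common implementation:

import torch.nn as nn

def conv_bn(inp, oup, stride=1, leaky=0):
    # plain 3x3 convolution + BN + LeakyReLU
    return nn.Sequential(
        nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
        nn.BatchNorm2d(oup),
        nn.LeakyReLU(negative_slope=leaky, inplace=True))

def conv_bn_no_relu(inp, oup, stride):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
        nn.BatchNorm2d(oup))

def conv_bn1X1(inp, oup, stride, leaky=0):
    # 1x1 convolution used for the FPN lateral connections
    return nn.Sequential(
        nn.Conv2d(inp, oup, 1, stride, padding=0, bias=False),
        nn.BatchNorm2d(oup),
        nn.LeakyReLU(negative_slope=leaky, inplace=True))

def conv_dw(inp, oup, stride, leaky=0.1):
    # depthwise separable convolution: per-channel 3x3 (groups=inp) followed by a 1x1 pointwise conv
    return nn.Sequential(
        nn.Conv2d(inp, inp, 3, stride, 1, groups=inp, bias=False),
        nn.BatchNorm2d(inp),
        nn.LeakyReLU(negative_slope=leaky, inplace=True),
        nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
        nn.BatchNorm2d(oup),
        nn.LeakyReLU(negative_slope=leaky, inplace=True))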

class MobileNetV1(nn.Module):
    def __init__(self):
        super(MobileNetV1, self).__init__()
        self.stage1 = nn.Sequential(
            conv_bn(3, 8, 2, leaky=0.1),  # 3
            conv_dw(8, 16, 1),  # 7
            conv_dw(16, 32, 2),  # 11
            conv_dw(32, 32, 1),  # 19
            conv_dw(32, 64, 2),  # 27
            conv_dw(64, 64, 1),  # 43
        )
        self.stage2 = nn.Sequential(
            conv_dw(64, 128, 2),  # 43 + 16 = 59
            conv_dw(128, 128, 1),  # 59 + 32 = 91
            conv_dw(128, 128, 1),  # 91 + 32 = 123
            conv_dw(128, 128, 1),  # 123 + 32 = 155
            conv_dw(128, 128, 1),  # 155 + 32 = 187
            conv_dw(128, 128, 1),  # 187 + 32 = 219
        )
        self.stage3 = nn.Sequential(
            conv_dw(128, 256, 2),  # 219 + 32 = 241
            conv_dw(256, 256, 1),  # 241 + 64 = 301
        )
        self.avg = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256, 1000)

    def forward(self, x):
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.avg(x)
        # x = self.model(x)
        x = x.view(-1, 256)
        x = self.fc(x)
        return x
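
The three stages downsample by 8, 16, and 32, so a 640x640 input produces exactly the [80, 40, 20] feature maps mentioned earlier. A quick shape check (in retinaface.py it is these intermediate stage outputs, not the classifier head, that feed the FPN):

import torch

net = MobileNetV1()
x = torch.randn(1, 3, 640, 640)
f1 = net.stage1(x)   # torch.Size([1, 64, 80, 80]),  stride 8
f2 = net.stage2(f1)  # torch.Size([1, 128, 40, 40]), stride 16
f3 = net.stage3(f2)  # torch.Size([1, 256, 20, 20]), stride 32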

(2) FPN (feature pyramid)
The pyramid exists to handle faces at different scales: its advantage is good detection across a wide range of face sizes, its drawback the added time cost.
[Figure: FPN structure]

class FPN(nn.Module):
    def __init__(self, in_channels_list, out_channels):
        super(FPN, self).__init__()
        leaky = 0
        if out_channels <= 64:
            leaky = 0.1
        self.output1 = conv_bn1X1(in_channels_list[0], out_channels, stride=1, leaky=leaky)
        self.output2 = conv_bn1X1(in_channels_list[1], out_channels, stride=1, leaky=leaky)
        self.output3 = conv_bn1X1(in_channels_list[2], out_channels, stride=1, leaky=leaky)

        self.merge1 = conv_bn(out_channels, out_channels, leaky=leaky)
        self.merge2 = conv_bn(out_channels, out_channels, leaky=leaky)

    def forward(self, input):
        # names = list(input.keys())
        input = list(input.values())

        output1 = self.output1(input[0])
        output2 = self.output2(input[1])
        output3 = self.output3(input[2])

        up3 = F.interpolate(output3, size=[output2.size(2), output2.size(3)], mode="nearest")
        output2 = output2 + up3
        output2 = self.merge2(output2)

        up2 = F.interpolate(output2, size=[output1.size(2), output1.size(3)], mode="nearest")
        output1 = output1 + up2
        output1 = self.merge1(output1)

        out = [output1, output2, output3]
        return out
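
forward expects a dict of the three backbone outputs (the reference retinaface.py builds one with torchvision's IntermediateLayerGetter). A minimal usage sketch, assuming the MobileNet channel widths and the f1/f2/f3 tensors from the backbone sketch above:

fpn = FPN(in_channels_list=[64, 128, 256], out_channels=64)
feats = {'stage1': f1, 'stage2': f2, 'stage3': f3}
p1, p2, p3 = fpn(feats)
# all three outputs now carry 64 channels, at strides 8, 16, and 32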

(3) SSH
After the FPN we have three effective feature layers. To further enlarge the receptive field, the SSH module is applied to each of them. The idea is simple: three parallel branches use stacked 3x3 convolutions to stand in for larger kernels. The left branch is a single 3x3 convolution, the middle branch uses two 3x3 convolutions in place of a 5x5, and the right branch uses three 3x3 convolutions in place of a 7x7.
[Figure: SSH module]

class SSH(nn.Module):
    def __init__(self, in_channel, out_channel):
        super(SSH, self).__init__()
        assert out_channel % 4 == 0
        leaky = 0
        if out_channel <= 64:
            leaky = 0.1
        self.conv3X3 = conv_bn_no_relu(in_channel, out_channel // 2, stride=1)

        self.conv5X5_1 = conv_bn(in_channel, out_channel // 4, stride=1, leaky=leaky)
        self.conv5X5_2 = conv_bn_no_relu(out_channel // 4, out_channel // 4, stride=1)

        self.conv7X7_2 = conv_bn(out_channel // 4, out_channel // 4, stride=1, leaky=leaky)
        self.conv7x7_3 = conv_bn_no_relu(out_channel // 4, out_channel // 4, stride=1)

    def forward(self, input):
        conv3X3 = self.conv3X3(input)

        conv5X5_1 = self.conv5X5_1(input)
        conv5X5 = self.conv5X5_2(conv5X5_1)

        conv7X7_2 = self.conv7X7_2(conv5X5_1)  # the 7x7 branch reuses the first 3x3 of the 5x5 branch
        conv7X7 = self.conv7x7_3(conv7X7_2)

        out = torch.cat([conv3X3, conv5X5, conv7X7], dim=1)
        out = F.relu(out)
        return out
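
Each FPN output goes through its own SSH block, and the channel count is preserved: the concatenation yields out_channel // 2 + out_channel // 4 + out_channel // 4 = out_channel channels. Continuing the sketch above:

ssh = SSH(in_channel=64, out_channel=64)
y = ssh(p1)  # torch.Size([1, 64, 80, 80]): same shape, larger receptive field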

IV. Loss Functions

The loss is split across two scripts: multibox_loss.py and box_utils.py.
RetinaFace uses a multi-task loss with three terms: the facial landmark regression loss loss_landm, the box regression loss loss_l, and the face classification loss loss_c.

loss_landm = F.smooth_l1_loss(landm_p, landm_t, reduction='sum')
loss_l = F.smooth_l1_loss(loc_p, loc_t, reduction='sum')
loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')

box_utils.py

Start with box_utils.py. Its core is the match function, which performs the anchor assignment; it writes its results into loc_t, conf_t, and landm_t, with the regression targets computed by the encode function (a sketch of encode follows below). Anchor selection uses the IoU (jaccard overlap): every anchor is assigned its best-overlapping ground truth, and every ground truth is in turn guaranteed a best-matching anchor.
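
For reference, encode uses the standard SSD parameterization: center offsets and log size ratios, scaled by the variances [0.1, 0.2]. A sketch consistent with the usual box_utils.py:

import torch

def encode(matched, priors, variances):
    # matched: (num_priors, 4) ground-truth boxes as (xmin, ymin, xmax, ymax)
    # priors:  (num_priors, 4) prior boxes as (cx, cy, w, h); variances = [0.1, 0.2]
    # offset of the gt center from the prior center, scaled by the prior size
    g_cxcy = (matched[:, :2] + matched[:, 2:]) / 2 - priors[:, :2]
    g_cxcy /= (variances[0] * priors[:, 2:])
    # log-ratio of gt size to prior size
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    return torch.cat([g_cxcy, g_wh], 1)  # (num_priors, 4) regression targets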

multibox_loss.py

loss_landm = F.smooth_l1_loss(landm_p, landm_t, reduction='sum')
loss_l = F.smooth_l1_loss(loc_p, loc_t, reduction='sum')

Both the landmark regression loss loss_landm and the box regression loss loss_l use the Smooth L1 loss introduced by Fast R-CNN. Compared with an L1 loss it fixes the non-smoothness at zero; compared with an L2 loss it is far less sensitive to outliers at large x, growing only linearly. The formula:

smooth_L1(x) = 0.5 * x^2,    if |x| < 1
               |x| - 0.5,    otherwise

For example, at x = 2 it contributes 1.5, where an L2 term 0.5 * x^2 would contribute 2.0.

loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')

The face classification loss loss_c is the cross-entropy loss over the two classes (face vs. background):

CE = -sum_i y_i * log(p_i)

where y_i is the one-hot ground-truth label and p_i the predicted softmax probability for class i.

The full code:

class MultiBoxLoss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching  ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold parameter
           (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of ground
           truth boxes and their matched  'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative examples
           that comes with using a large number of default bounding boxes.
           (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """

    def __init__(self, num_classes, overlap_thresh, prior_for_matching, bkg_label, neg_mining, neg_pos, neg_overlap,
                 encode_target):
        super(MultiBoxLoss, self).__init__()
        self.num_classes = num_classes
        self.threshold = overlap_thresh
        self.background_label = bkg_label
        self.encode_target = encode_target
        self.use_prior_for_matching = prior_for_matching
        self.do_neg_mining = neg_mining
        self.negpos_ratio = neg_pos
        self.neg_overlap = neg_overlap
        self.variance = [0.1, 0.2]

    def forward(self, predictions, priors, targets):
        """Multibox Loss
        Args:
            predictions (tuple): A tuple containing loc preds, conf preds,
            and prior boxes from SSD net.
                conf shape: torch.size(batch_size,num_priors,num_classes)
                loc shape: torch.size(batch_size,num_priors,4)
                priors shape: torch.size(num_priors,4)

            ground_truth (tensor): Ground truth boxes and labels for a batch,
                shape: [batch_size,num_objs,5] (last idx is the label).
        """

        loc_data, conf_data, landm_data = predictions
        num = loc_data.size(0)
        num_priors = priors.size(0)

        # match priors (default boxes) and ground truth boxes
        loc_t = torch.Tensor(num, num_priors, 4)
        landm_t = torch.Tensor(num, num_priors, 10)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            truths = targets[idx][:, :4].data
            labels = targets[idx][:, -1].data
            landms = targets[idx][:, 4:14].data
            defaults = priors.data
            match(self.threshold, truths, defaults, self.variance, labels, landms, loc_t, conf_t, landm_t, idx)
        # GPU is a module-level flag (cfg['gpu_train'] in the reference code)
        if GPU:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
            landm_t = landm_t.cuda()

        zeros = torch.tensor(0, device=conf_t.device)
        # landm Loss (Smooth L1)
        # Shape: [batch,num_priors,10]

        # pos1 selects anchors whose label is > 0; only these contribute to the landmark loss (Smooth L1)
        pos1 = conf_t > zeros
        num_pos_landm = pos1.long().sum(1, keepdim=True)
        N1 = max(num_pos_landm.data.sum().float(), 1)
        pos_idx1 = pos1.unsqueeze(pos1.dim()).expand_as(landm_data)
        landm_p = landm_data[pos_idx1].view(-1, 10)
        landm_t = landm_t[pos_idx1].view(-1, 10)
        loss_landm = F.smooth_l1_loss(landm_p, landm_t, reduction='sum')

        # any non-zero label (1 or -1) counts as a positive face for box regression and classification
        pos = conf_t != zeros
        conf_t[pos] = 1

        # Localization Loss (Smooth L1)
        # Shape: [batch,num_priors,4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)
        loc_t = loc_t[pos_idx].view(-1, 4)
        loss_l = F.smooth_l1_loss(loc_p, loc_t, reduction='sum')

        # Compute max conf across batch for hard negative mining
        batch_conf = conf_data.view(-1, self.num_classes)
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

        # Hard Negative Mining
        '''
        Zero out the positives' loss first, then sort the losses within each image and keep
        the top self.negpos_ratio * num_pos negatives as hard examples;
        loss_c is reshaped to [batch, num_priors] for the per-image sort.
        '''

        loss_c[pos.view(-1, 1)] = 0  # filter out pos boxes for now
        loss_c = loss_c.view(num, -1)
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        num_pos = pos.long().sum(1, keepdim=True)
        num_neg = torch.clamp(self.negpos_ratio * num_pos, max=pos.size(1) - 1)
        neg = idx_rank < num_neg.expand_as(idx_rank)

        # Confidence Loss Including Positive and Negative Examples
        """
        上面几步的操作就是为获得pos_idx和neg_idx
        conf_data 的shape为[batch,num_priors,num_classes]
        """
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        """
        (pos_idx+neg_idx).gt(0)的原因个人猜测可能是因为挑选的正样本和负样本可能会重复,因此将大于1的数变成1.
        但是经过实验Tensor[mask]中对于mask大于1的数也是可以的
        """
        conf_p = conf_data[(pos_idx + neg_idx).gt(0)].view(-1, self.num_classes)
        targets_weighted = conf_t[(pos + neg).gt(0)]
        loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')

        # Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        N = max(num_pos.data.sum().float(), 1)
        loss_l /= N
        loss_c /= N
        loss_landm /= N1

        return loss_l, loss_c, loss_landm
        # loss_l: box regression loss L(box), smooth L1
        # loss_c: face classification loss L(cls), cross entropy
        # loss_landm: facial landmark regression loss L(pts), smooth L1
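
A typical instantiation and use during training. The argument values follow the common train.py for this codebase and should be treated as assumptions:

criterion = MultiBoxLoss(num_classes=2, overlap_thresh=0.35, prior_for_matching=True,
                         bkg_label=0, neg_mining=True, neg_pos=7, neg_overlap=0.35,
                         encode_target=False)
priors = PriorBox(cfg, image_size=(640, 640)).forward()
loss_l, loss_c, loss_landm = criterion(net(images), priors, targets)
loss = cfg['loc_weight'] * loss_l + loss_c + loss_landm  # loc_weight is 2.0 in the reference config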

V. Comparison with MTCNN

MTCNN

MTCNN has one clear weakness: when an image contains many faces, its detection performance drops off quickly.

RetinaFace

RetinaFace is not an anchor-free detector: it relies on dense anchors, so when an image contains many target faces a large number of anchors must be processed, which drives up the compute requirement and lowers detection efficiency.
