[PyTorch] Common Loss Functions for Face Identity Recognition

Experimental setup: the MUCT face dataset (multi-pose, multi-illumination images of 276 subjects) + MobileNetV3.
Note: MUCT is a small-sample face identity dataset, so it should be trained with few-shot learning methods.

  • If you train it like a large-sample dataset (split into training and test sets and learn all face identity classes directly), MobileNetV3 + an ArcFace loss head tends to underfit and the model does not converge.
  • If you use a few-shot training method (e.g., the common meta-learning approaches MAML or Prototypical Networks) with MobileNetV3 + softmax for multi-class few-shot learning (e.g., 8-way 5-shot), the model also tends to underfit and does not converge; in other words, softmax cannot effectively separate the subtle differences between faces.
    In my experiments, MAML + softmax did not converge in the few-shot setting, while Prototypical Networks + softmax converged but with low accuracy; so I recommend Prototypical Networks for few-shot learning.
  • If you use Prototypical Networks with MobileNetV3 + an ArcFace loss head for multi-class few-shot learning (e.g., 8-way 5-shot), the model converges effectively; in other words, ArcFace loss + softmax can separate the subtle differences between faces (a minimal episode sketch follows this list).
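A minimal sketch of one few-shot episode with a prototypical-network head; `embedding_net`, the tensor names, and the episode sizes are assumptions for illustration, not code from the experiments below:

```python
import torch
import torch.nn.functional as F

def prototypical_episode_loss(embedding_net, support_x, support_y, query_x, query_y, n_way=8):
    """One n-way episode: prototypes are the mean support embeddings per class,
    and queries are classified by negative squared Euclidean distance to the prototypes.
    support_y / query_y are episode-local labels in [0, n_way)."""
    z_support = embedding_net(support_x)          # [n_way * k_shot, D]
    z_query = embedding_net(query_x)              # [n_query, D]

    # one prototype per class: mean embedding of that class's support samples
    prototypes = torch.stack([z_support[support_y == c].mean(dim=0) for c in range(n_way)])

    logits = -torch.cdist(z_query, prototypes).pow(2)   # [n_query, n_way]
    return F.cross_entropy(logits, query_y)
```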

0. Softmax (activation function)

Reference: TORCH.NN.FUNCTIONAL.SOFTMAX

To distinguish it from the softmax loss discussed below (i.e., the multi-class cross-entropy loss) and from L2-softmax loss, ArcFace loss, CosFace loss, etc., it is worth first recalling the Softmax activation function:

$$Softmax(x_i) = \frac{\exp(x_i)}{\sum_{j=1} \exp(x_j)}$$

Its purpose is to scale every element of the input into the interval [0, 1] (normalization) so that the elements sum to 1; the output vector has the same dimensionality as the input vector.

The softmax loss is given below, where $wx + b$ is the output of a fully connected layer (nn.Linear(inputSize, outputSize)); taking the logarithm turns the Softmax-normalized result into the probability that the linearly transformed feature $x_{i+1}$ belongs to each class $\{y_1, ..., y_n\}$; $m$ is the mini-batch size, and the sum accumulates the loss over the current batch:

$$L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}} = -\sum_{i=1}^{m} \log\, Softmax(x_i)$$

The loss $L_S$ is exactly the CrossEntropyLoss described below, i.e., the cross-entropy loss used for multi-class classification.

Note: it is the output of the fully connected layer that is fed into the Softmax normalization.

Suppose part of the model looks like this:

...
self.linear3 = nn.Linear(960, 1280)
self.bn3 = nn.BatchNorm1d(1280)
self.hs3 = hswish()
self.linear4 = nn.Linear(1280, 276)  # output dimension = number of classes (276)

With a batch size of 4, the input to the last fully connected layer has shape $[4, 1280]$ and its output has shape $[4, 276]$; the layer's weight $W$ has shape $[1280, 276]$, where each column of $W$ is the weight vector of the $j$-th class.

Plugging this into the Softmax function above normalizes the linearly transformed features $W^T X \in \mathbb{R}^{4 \times 276}$, as the small check below illustrates.
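A quick sanity check of this step (the shapes are illustrative, not taken from the MUCT experiment):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 276)            # output of the last nn.Linear layer, shape [batch, num_classes]
probs = F.softmax(logits, dim=1)        # normalize along the class dimension

print(probs.shape)                      # torch.Size([4, 276])
print(probs.sum(dim=1))                 # each row sums to 1
```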

1. NLLLoss (negative log likelihood loss)

NLLLoss - Negative Log Likelihood Loss. Reference: NLLLOSS - pytorch

$$\ell(x, y) = L = \{l_1, ..., l_N\}^T, \quad l_n = -w_{y_n} x_{n, y_n}, \quad w_c = weight[c] \cdot \mathbb{1}\{c \neq ignore\_index\}$$

where $x$ is the input, $y$ is the target, $w$ is the weight, $N$ is the batch size, and $C$ is the number of classes. Unless reduction = 'none', the default reduction = 'mean' averages the loss over the batch.

There are requirements on input and target (each element of target is a class index):

The input given through a forward call is expected to contain log-probabilities of each class. input has to be a Tensor of size either $(minibatch, C)$ or $(minibatch, C, d_1, d_2, ..., d_K)$ with $K \geq 1$ for the K-dimensional case. The latter is useful for higher dimension inputs, such as computing NLL loss per-pixel for 2D images.

The target that this loss expects should be a class index in the range $[0, C-1]$ where C = number of classes; if ignore_index is specified, this loss also accepts this class index (this index may not necessarily be in the class range).

  • If input has shape N x C, every element of target must satisfy 0 <= value < C;
  • If input has shape N x C x height x width, every element of target must likewise satisfy 0 <= value < C.
1) Source code walkthrough
import torch
import torch.nn as nn

#The negative log likelihood loss
#Reference: https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss
m = nn.LogSoftmax(dim=1)   #logSoftmax = log(softmax) is an activation layer
loss = nn.NLLLoss()
# input is of size N x C = 3 x 3
# input = torch.randn(3, 5, requires_grad=True)
input = torch.tensor([[-1.0,-2.0,-3.0],[1.0,2.0,3.0],[5.0,7.0,3.0]],requires_grad=True)
# each element in target has to have 0 <= value < C
target = torch.tensor([1, 0, 2])
output = loss(m(input), target)

print(f"1: NLLLoss( logSoftmax(input) = {m(input)}, target = {target} ) = {output}")

print(f"torch.nn.functional.one_hot(target) * m(input) = {torch.nn.functional.one_hot(target) * m(input)}")

print(f"-torch.mean(torch.nn.functional.one_hot(target) * m(input)) * (input.shape[1]) = {-torch.mean(torch.nn.functional.one_hot(target) * m(input)) * (input.shape[1])}")
output.backward()
---
1: NLLLoss( logSoftmax(input) = tensor([[-0.4076, -1.4076, -2.4076],
        [-2.4076, -1.4076, -0.4076],
        [-2.1429, -0.1429, -4.1429]], grad_fn=<LogSoftmaxBackward>), target = tensor([1, 0, 2]) ) = 2.652714729309082
torch.nn.functional.one_hot(target) * m(input) = tensor([[-0.0000, -1.4076, -0.0000],
        [-2.4076, -0.0000, -0.0000],
        [-0.0000, -0.0000, -4.1429]], grad_fn=<MulBackward0>)
-torch.mean(torch.nn.functional.one_hot(target) * m(input)) * (input.shape[1]) = 2.652714729309082

You can see how NLL loss is computed: first one-hot encode the target, then multiply it element-wise with $log\_softmax(input) \in (-\infty, 0]$, giving the matrix below:

tensor([[-0.0000, -1.4076, -0.0000],
        [-2.4076, -0.0000, -0.0000],
        [-0.0000, -0.0000, -4.1429]]

Then take the negative of the mean over all elements and multiply by the number of classes (3 here), which is equivalent to averaging the per-sample losses over the batch, giving the final loss value:

-torch.mean(torch.nn.functional.one_hot(target) * m(input)) * (input.shape[1])

See the official docs for more details: https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss

2) Experiment

To train MobileNetV3 with NLLLoss, make the following small changes:

class MobileNetV3_Large(nn.Module):
    ...
    def forward(self, x):
        # original MobileNetV3_Large modules
        out = self.hs1(self.bn1(self.conv1(x)))
        out = self.bneck(out)
        out = self.hs2(self.bn2(self.conv2(out)))
        out = F.avg_pool2d(out, 7)
        out = out.view(out.size(0), -1)
        out = self.hs3(self.bn3(self.linear3(out)))
        out = self.linear4(out)

        # newly added: log-softmax has no learnable parameters, so the functional form is enough
        out = F.log_softmax(out, dim=1)
        return out
...
# then in train.py use NLLLoss
out = model(sample)
loss = nll_loss(out, y)   # total loss: NLLLoss(log(softmax)) = CrossEntropyLoss

Training results:

Epoch 1/100: 100%|| 2800/2802 [01:24<00:00, 33.18
epoch = 0, train_loss = 22.36589876106807, train_acc = 0.0064285714285714285,test_loss = 7.780539038635435, test_acc = 0.001488095238095238
checkpoint1 is saved
Epoch 2/100: 100%|| 2800/2802 [01:22<00:00, 33.85
epoch = 1, train_loss = 7.284130101885115, train_acc = 0.002142857142857143,test_loss = 6.771463133039928, test_acc = 0.00744047619047619
Epoch 3/100:   0%|      | 0/2802 [00:00<?, ?img/s]checkpoint2 is saved
Epoch 3/100: 100%|| 2800/2802 [01:22<00:00, 34.05
epoch = 2, train_loss = 6.8465765258244105, train_acc = 0.005,test_loss = 6.706081149123964, test_acc = 0.00744047619047619
checkpoint3 is saved
Epoch 4/100: 100%|| 2800/2802 [01:22<00:00, 34.03
epoch = 3, train_loss = 6.661047067642212, train_acc = 0.0032142857142857142,test_loss = 6.852347408022199, test_acc = 0.002976190476190476
Epoch 5/100:   0%|      | 0/2802 [00:00<?, ?img/s]checkpoint4 is saved
Epoch 5/100: 100%|| 2800/2802 [01:22<00:00, 33.83
epoch = 4, train_loss = 6.594039328438895, train_acc = 0.002857142857142857,test_loss = 6.599982026077452, test_acc = 0.004464285714285714

2. CrossEntropyLoss (loss function)

Reference: CROSSENTROPYLOSS

$$\ell(x, y) = L = \{l_1, ..., l_N\}^T, \quad l_n = -w_{y_n} \log \frac{\exp(x_{n, y_n})}{\sum_{c=1}^{C} \exp(x_{n, c})} \cdot \mathbb{1}\{y_n \neq ignore\_index\}$$

where $x$ is the input, $y$ is the target, $w$ is the per-sample loss weight, $N$ is the batch size, and $C$ is the number of classes. Unless reduction = 'none', the default reduction = 'mean' averages the loss over the batch.

The input is expected to contain raw, unnormalized scores for each class. input has to be a Tensor of size $(C)$ for unbatched input, $(minibatch, C)$ or $(minibatch, C, d_1, d_2, ..., d_K)$ with $K \geq 1$ for the K-dimensional case. The last being useful for higher dimension inputs, such as computing cross entropy loss per-pixel for 2D images.

The target that this loss expects should be a class index in the range $[0, C-1]$ where C = number of classes;

  • If input has shape N x C, every element of target must satisfy 0 <= value < C;
  • If input has shape N x C x height x width, every element of target must likewise satisfy 0 <= value < C.
1) Source code walkthrough
import torch
import torch.nn as nn

#Reference: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss
# Example of target with class indices
loss = nn.CrossEntropyLoss()
# input = torch.randn(3, 5, requires_grad=True)
# target = torch.empty(3, dtype=torch.long).random_(5)
input = torch.tensor([[-1.0,-2.0,-3.0],[1.0,2.0,3.0],[5.0,7.0,3.0]],requires_grad=True)
# each element in target has to have 0 <= value < C
target = torch.tensor([1, 0, 2])
output = loss(input, target)
print(f"1: CrossEntropyLoss(input = {input}) = {output}")
output.backward()
---
1: CrossEntropyLoss(input = tensor([[-1., -2., -3.],
        [ 1.,  2.,  3.],
        [ 5.,  7.,  3.]], requires_grad=True)) = 2.652714729309082

You can see that CrossEntropyLoss gives exactly the same result as NLLLoss applied to the log-softmax of the input. Therefore:

CrossEntropyLoss(x, y) = NLLLoss(log(softmax(x)), y)
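A one-line check of this identity (random logits, purely illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 5)                    # raw logits
y = torch.tensor([1, 0, 4])

ce = F.cross_entropy(x, y)
nll = F.nll_loss(F.log_softmax(x, dim=1), y)
print(torch.allclose(ce, nll))           # True
```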
2) Experiment

To train MobileNetV3 with CrossEntropyLoss, make the following small change:

# in train.py, use CrossEntropyLoss directly (no model changes needed)
out = model(sample)
loss = cross_entropy_loss(out, y)   # total loss: CrossEntropyLoss = NLLLoss(log(softmax))

Training results:

Epoch 1/100: 100%|| 2800/2802 [01:24<00:00, 33.26
epoch = 0, train_loss = 23.81213624749865, train_acc = 0.003928571428571429,test_loss = 7.5758026611237295, test_acc = 0.005952380952380952
Epoch 2/100:   0%|      | 0/2802 [00:00<?, ?img/s]checkpoint1 is saved
Epoch 2/100: 100%|| 2800/2802 [01:22<00:00, 33.98
epoch = 1, train_loss = 7.077970628057208, train_acc = 0.0035714285714285713,test_loss = 6.91069389524914, test_acc = 0.002976190476190476
Epoch 3/100:   0%|      | 0/2802 [00:00<?, ?img/s]checkpoint2 is saved
Epoch 3/100: 100%|| 2800/2802 [01:22<00:00, 33.86
epoch = 2, train_loss = 6.829764229910714, train_acc = 0.005,test_loss = 6.861410782450721, test_acc = 0.0
checkpoint3 is saved
Epoch 4/100: 100%|| 2800/2802 [01:22<00:00, 33.89
epoch = 3, train_loss = 6.701256164142063, train_acc = 0.004642857142857143,test_loss = 6.797497911112649, test_acc = 0.001488095238095238
checkpoint4 is saved
Epoch 5/100: 100%|| 2800/2802 [01:23<00:00, 33.72
epoch = 4, train_loss = 6.5600184038707186, train_acc = 0.004285714285714286,test_loss = 6.7159488428206675, test_acc = 0.001488095238095238
checkpoint5 is saved

The result is essentially the same as with NLLLoss; the only difference is that NLLLoss requires adding a log_softmax call (not a module) to MobileNetV3 to transform the outputs, whereas CrossEntropyLoss needs no change to the model.

3. Center loss (loss function, 2016)

Reference

The idea of center loss is to maintain, on top of softmax loss, a class center in feature space for every class in the training set. During training, a constraint on the distance between each sample's feature (after the network mapping) and its class center is added, so that intra-class compactness and inter-class separation are both taken into account.

The final objective is a weighted sum of center loss and softmax loss, which together drive the overall classification task.

For the second part, the center loss (a direct computation of this term is sketched after this list):

  • $c_{y_i}$ is the feature center of class $y_i$ (its dimensionality matches the feature $x_i$ before the fully connected layer). The center parameters are initialized and then updated by back-propagation; the rows indexed by $y_i$ are selected to compute the distance between each feature and its class center.

  • $x_i$ is the feature before the fully connected layer, while the feature $x_{i+1}$ after the fully connected layer is used to compute the softmax loss.
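A minimal sketch of the center-loss term itself, before the custom-autograd version quoted below; `num_classes`, `feat_dim`, and the use of plain autograd for the centers are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SimpleCenterLoss(nn.Module):
    """L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2, averaged over the batch."""
    def __init__(self, num_classes=276, feat_dim=1280):
        super().__init__()
        # one learnable center per class, updated by ordinary back-propagation
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feat, label):
        centers_batch = self.centers[label]                      # [B, feat_dim]
        return ((feat - centers_batch) ** 2).sum(dim=1).mean() / 2.0

# usage: total loss = softmax loss + weight * center loss, e.g.
# loss = cross_entropy_loss(logits, y) + 0.3 * SimpleCenterLoss()(feat, y)
```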

The center loss algorithm proceeds as follows:

(Figure: center loss algorithm flow)

1) Source code walkthrough

Reference

import torch
import torch.nn as nn
from torch.autograd import Function


class CenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, size_average=True):
        super(CenterLoss, self).__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))  # Parameters are Tensor subclasses
        self.centerlossfunc = CenterlossFunc.apply  # Function.apply binds the custom autograd op defined below
        self.feat_dim = feat_dim
        self.size_average = size_average

    def forward(self, feat, label):
        batch_size = feat.size(0)
        feat = feat.view(batch_size, -1)
        # check that the feature dim matches the centers' dim
        if feat.size(1) != self.feat_dim:
            raise ValueError(
                "Center's dim: {0} should be equal to input feature's dim: {1}".format(self.feat_dim, feat.size(1)))
        loss = self.centerlossfunc(feat, label, self.centers)  # forward pass: features, labels and centers -> loss
        loss /= (batch_size if self.size_average else 1)
        return loss

#https://pytorch.org/docs/stable/notes/extending.html#extending-autograd
class CenterlossFunc(Function):
    @staticmethod
    # ctx replaces self in these static methods; it stores the tensors saved in forward() for reuse in backward()
    def forward(ctx, feature, label, centers):
        ctx.save_for_backward(feature, label, centers)
        centers_batch = centers.index_select(0, label.long())  # same as torch.index_select(centers, 0, label.long()): pick the center row of each sample's class
        return (feature - centers_batch).pow(2).sum() / 2.0

    @staticmethod
    def backward(ctx, grad_output):
        feature, label, centers = ctx.saved_tensors
        centers_batch = centers.index_select(0, label.long())
        diff = centers_batch - feature
        # init every iteration
        counts = centers.new(centers.size(0)).fill_(1)
        ones = centers.new(label.size(0)).fill_(1)
        grad_centers = centers.new(centers.size()).fill_(0)

        counts = counts.scatter_add_(0, label.long(), ones)
        grad_centers.scatter_add_(0, label.unsqueeze(1).expand(feature.size()).long(), diff)
        grad_centers = grad_centers / counts.view(-1, 1)
        return - grad_output * diff, None, grad_centers

Why, and when, should we subclass Function from torch.autograd.function? See https://pytorch.org/docs/stable/notes/extending.html#extending-autograd

If you want a newly added operator to support autograd, you need to implement a Function subclass for it:

  • In general, implement a custom function if you want to perform computations in your model that are not differentiable or rely on non-Pytorch libraries (e.g., NumPy), but still wish for your operation to chain with other ops and work with the autograd engine.
  • In some situations, custom functions can also be used to improve performance and memory usage: If you implemented your forward and backward passes using a C++ extension, you can wrap them in Function to interface with the autograd engine. If you’d like to reduce the number of buffers saved for the backward pass, custom functions can be used to combine ops together.
  • If you can already write your function in terms of PyTorch’s built-in ops, its backward graph is (most likely) already able to be recorded by autograd. In this case, you do not need to implement the backward function yourself. Consider using a plain old Python function.

In other words, if we want to implement an operator with NumPy (rather than PyTorch ops) but still want it to chain with other ops under automatic differentiation, we need to subclass torch.autograd.function.Function and implement the forward and backward passes ourselves (optionally via a C++ extension).

If instead we build the operator out of PyTorch's built-in ops, the computation graph is recorded during the forward pass and autograd differentiates it for us automatically.

See the official docs for the specific helpers (save_for_backward(), mark_dirty(), etc.). A minimal custom Function is sketched below.
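A minimal, self-contained example of the pattern described above (a custom squaring op; the names are illustrative):

```python
import torch
from torch.autograd import Function

class Square(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # stash inputs needed by backward
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x      # d(x^2)/dx = 2x, chained with the upstream gradient

x = torch.randn(5, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(Square.apply, (x,)))  # True: analytic grad matches numeric grad
```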

2) Experiment

To train MobileNetV3 with CenterLoss, make the following small changes:

class MobileNetV3_Large(nn.Module):
...
    def forward(self, x):
        out = self.hs1(self.bn1(self.conv1(x)))
        out = self.bneck(out)
        out = self.hs2(self.bn2(self.conv2(out)))
        out = F.avg_pool2d(out, 7)
        out = out.view(out.size(0), -1)
        out = self.hs3(self.bn3(self.linear3(out)))
        out1 = out                # feature used for the center loss
        out2 = self.linear4(out)  # logits used for the softmax loss
        return out1, out2
...
# then in train.py
softmax_loss = nn.CrossEntropyLoss().to(device)  # softmax loss (cross entropy)
Center_Loss = CenterLoss(num_classes=num_classes, feat_dim=1280).to(device)  # center loss
weight = 0.3  # total loss: softmax loss + weight * center loss

for epoch in range(0, epoches):
    feat, predict = model(sample)

    # Reference: https://github.com/jxgu1016/MNIST_center_loss_pytorch/blob/master/MNIST_with_centerloss.py
    loss = softmax_loss(predict, y) + weight * Center_Loss(feat, y)  # total loss: softmax loss + weight * center loss

Results:

epoch = 44, train_loss = 197.36729516165596, train_acc = 0.002142857142857143,test_loss = 821.500404267084, test_acc = 0.005952380952380952
checkpoint45 is saved
Epoch 46/100: 100%|| 2800/2802 [01:25<00:00, 32.8
epoch = 45, train_loss = 197.30666959490094, train_acc = 0.004642857142857143,test_loss = 2388.358141308739, test_acc = 0.005952380952380952
Epoch 47/100:   0%|     | 0/2802 [00:00<?, ?img/s]checkpoint46 is saved
Epoch 47/100: 100%|| 2800/2802 [01:26<00:00, 32.4
epoch = 46, train_loss = 197.23798839024136, train_acc = 0.004642857142857143,test_loss = 80160891858.68933, test_acc = 0.005952380952380952
Epoch 48/100:   0%|     | 0/2802 [00:00<?, ?img/s]checkpoint47 is saved
Epoch 48/100: 100%|| 2800/2802 [01:25<00:00, 32.5
epoch = 47, train_loss = 197.2484938921247, train_acc = 0.0032142857142857142,test_loss = 1961.587382089524, test_acc = 0.004464285714285714
Epoch 49/100:   0%|     | 0/2802 [00:00<?, ?img/s]checkpoint48 is saved
Epoch 49/100: 100%|| 2800/2802 [01:25<00:00, 32.7
epoch = 48, train_loss = 197.18114281790596, train_acc = 0.005,test_loss = 2588.5203427814304, test_acc = 0.005952380952380952
checkpoint49 is saved
Epoch 50/100: 100%|| 2800/2802 [01:28<00:00, 31.5
epoch = 49, train_loss = 197.15495856148857, train_acc = 0.0035714285714285713,test_loss = 39041072659.06345, test_acc = 0.005952380952380952
checkpoint50 is saved

On the MUCT dataset, the loss never really goes down.


4. L2-Softmax (loss constraint, 2017; feature normalization)

Reference

Background: why an L2-constrained softmax loss

Face verification works very well on LFW, but in real scenarios with large variations in viewpoint, resolution, image quality, and occlusion, it is far less reliable. Two main reasons:

  • 1. Imbalanced data quality: the commonly used public face training sets consist mostly of high-resolution, frontal faces and contain few hard, unconstrained faces. Most DCNN models trained with softmax loss on such data overfit the high-quality images and underfit the hard ones.
  • 2. Softmax loss is not well suited to face verification: softmax loss only guarantees that the learned features are separable without any metric learning; it does not guarantee that positive pairs are close enough or that negative pairs are far enough apart, so it is not a great fit for verification. Moreover, softmax loss maximizes the conditional probability of all samples in the mini-batch. Because high-quality face images have large feature norms and low-quality ones small norms, the loss can be minimized simply by giving easy samples large norms and hard samples small norms; softmax loss therefore focuses on the high-quality faces in the mini-batch and neglects the few low-quality ones.

The fix

Minimize softmax loss under the constraint that the feature $f(x_i)$ is L2-normalized to a fixed value $\alpha$, where $f(x_i)$ is the feature extracted by the penultimate layer of the DCNN.

There are two ways to set $\alpha$: fix it during training, or learn it. A learned $\alpha$ tends to become large, which makes the constraint too loose, so the authors recommend a relatively small fixed value.

The authors also observe that if $\alpha$ is too small, the surface area of the hypersphere is too small for the features to spread out, and the final verification accuracy suffers.

Figure (b) of the paper (not reproduced here) shows that, for a target classification probability p = 0.9, the larger the number of classes C, the larger the required $\alpha$.

The authors' suggested lower bound for $\alpha$ is:

$$\alpha_{low} = \log \frac{p(C - 2)}{1 - p}$$

In implementation terms, this just means adding an L2-normalization layer that rescales the features and then multiplies them by $\alpha$.

F.normalize can be used for the $L_p$ normalization here; see https://pytorch.org/docs/stable/generated/torch.nn.functional.normalize.html?highlight=normalize#torch.nn.functional.normalize. A small sketch of both pieces follows.
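A minimal sketch, with C = 276 (MUCT) and p = 0.9 taken as assumed values:

```python
import math
import torch
import torch.nn.functional as F

C, p = 276, 0.9
alpha_low = math.log(p * (C - 2) / (1 - p))   # suggested lower bound, ~7.8 for C = 276
alpha = 64.0                                  # fixed scale; must exceed alpha_low

feat = torch.randn(4, 1280)                   # penultimate-layer features
feat = F.normalize(feat, p=2, dim=1) * alpha  # L2-normalize each feature, then rescale to norm alpha
print(alpha_low, feat.norm(dim=1))            # every row norm equals alpha
```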

Benefits of this constraint:

  • Because all face features are forced to have the same norm, softmax loss no longer concentrates on the easy samples; it also learns from the difficult samples;
  • For the same reason, all features lie on a hypersphere of fixed radius, so minimizing softmax loss becomes equivalent to maximizing the cosine similarity of positive pairs while minimizing the cosine similarity of negative pairs.

In short: because softmax loss mainly attends to high-quality images with large feature norms and ignores images with small norms, the features are normalized so that the model learns from high-quality and low-quality images alike.

1) Source code
#L2-normalization linear layer
class NormLinear(nn.Module):
    def __init__(self, in_features, classes, weight_norm=False, feature_norm=False):
        super(NormLinear, self).__init__()
        self.weight_norm = weight_norm
        self.feature_norm = feature_norm

        self.classes = classes
        self.in_features = in_features

        self.weight = nn.Parameter(torch.Tensor(classes, in_features))
        nn.init.normal_(self.weight, std=0.01)

    def forward(self, x):
        weight = F.normalize(self.weight, 2, dim=-1) if self.weight_norm else self.weight
        if self.feature_norm:
            x = F.normalize(x, 2, dim=-1)

        return F.linear(x, weight)

    def extra_repr(self):
        return 'in_features={}, out_features={}'.format(self.in_features, self.classes)

class L2Softmax(nn.Module):
    r"""L2Softmax from
    `"L2-constrained Softmax Loss for Discriminative Face Verification"
    <https://arxiv.org/abs/1703.09507>`_ paper.

    Parameters
    ----------
    classes: int.
        Number of classes.
    alpha: float.
        The scaling parameter, a hypersphere with small alpha
        will limit surface area for embedding features.
    p: float, default is 0.9.
        The expected average softmax probability for correctly
        classifying a feature.
    from_normx: bool, default is False.
         Whether input has already been normalized.

    Outputs:
        - **loss**: loss tensor with shape (1,). Dimensions other than
          batch_axis are averaged out.
    """
    def __init__(self, embedding_size, classes, alpha=64, p=0.9):
        super(L2Softmax, self).__init__()
        alpha_low = math.log(p * (classes - 2) / (1 - p))
        assert alpha > alpha_low, "For given probability of p={}, alpha should higher than {}.".format(p, alpha_low)
        self.alpha = alpha
        self.linear = NormLinear(embedding_size, classes, True, True)

    def forward(self, x, target):
        x = self.linear(x)    # weight- and feature-normalized linear layer -> cosine logits
        x = x * self.alpha    # rescale to the fixed norm alpha
        return x              # scaled logits; feed them into CrossEntropyLoss (target is unused here)
2) Experiment

To train MobileNetV3 with L2-Softmax, make the following small changes:

class MobileNetV3_Large(nn.Module):
    def __init__(self, num_classes=1000):
        ...
        self.linear4 = nn.Linear(1280, num_classes)
        self.l2_softmax = L2Softmax(embedding_size=num_classes, classes=num_classes)

    def forward(self, x):
        out = self.hs1(self.bn1(self.conv1(x)))
        out = self.bneck(out)
        out = self.hs2(self.bn2(self.conv2(out)))
        out = F.avg_pool2d(out, 7)
        out = out.view(out.size(0), -1)
        out = self.hs3(self.bn3(self.linear3(out)))
        out = self.linear4(out)
        out = self.l2_softmax(out, None)  # L2-normalize the features and scale by alpha
        return out
...
# then in train.py
cross_entropy_loss = nn.CrossEntropyLoss().to(device)  # softmax loss (cross entropy)

for epoch in range(0, epoches):
    out = model(sample)

    loss = cross_entropy_loss(out, y)

Results:

epoch = 302, train_loss = 0.8345055354858881, train_acc = 0.7542857142857143,test_loss = 24.628671884536743, test_acc = 0.002976190476190476
Epoch 304/500:   0%|    | 0/2802 [00:00<?, ?img/s]checkpoint303 is saved
Epoch 304/500: 100%|| 2800/2802 [01:22<00:00, 34.
epoch = 303, train_loss = 0.8892281786871276, train_acc = 0.7414285714285714,test_loss = 24.245324452718098, test_acc = 0.004464285714285714
Epoch 305/500:   0%|    | 0/2802 [00:00<?, ?img/s]checkpoint304 is saved
Epoch 305/500: 100%|| 2800/2802 [01:22<00:00, 34.
epoch = 304, train_loss = 0.8428794197631734, train_acc = 0.7507142857142857,test_loss = 22.60235471384866, test_acc = 0.005952380952380952
checkpoint305 is saved
Epoch 306/500: 100%|| 2800/2802 [01:22<00:00, 33.
epoch = 305, train_loss = 0.8745315343341125, train_acc = 0.7621428571428571,test_loss = 23.958800395329792, test_acc = 0.002976190476190476
Epoch 307/500:   0%|    | 0/2802 [00:00<?, ?img/s]checkpoint306 is saved

Notice that train_loss and test_loss differ hugely: the model is overfitting.


5. SphereFace loss (loss constraint, 2017; weight normalization)

Reference

The authors propose weight normalization (normalize weights and zero biases) and an angular margin; based on these two ideas they modify the traditional softmax so that the maximum intra-class distance becomes smaller than the minimum inter-class distance, yielding the angular-margin softmax (A-Softmax) loss.

Adding the constraints $||W|| = 1$ and $b = 0$ to softmax loss and introducing the angle gives the Modified Softmax Loss:

$$L_{modified} = \frac{1}{N} \sum_i -\log \frac{e^{||x_i||\cos(\theta_{y_i, i})}}{\sum_j e^{||x_i||\cos(\theta_{j, i})}}$$

Since $||w_i|| = 1$, the exponent can be written as below, so the Modified Softmax Loss still respects the basic condition that the output of the fully connected layer is the input of the softmax function:

$$||x_i||\cos(\theta_{y_i, i}) = \frac{w_i \cdot x_i}{||w_i||} = w_i \cdot x_i$$

原始softmax lossModified Softmax Loss的特征分布结果如下:发现经过M-softmax之后不同类别的特征区域大小基本一致。

On top of this, an angular margin, denoted m, is introduced:

$$L_{ang} = \frac{1}{N} \sum_i -\log \frac{e^{||x_i||\cos(m\theta_{y_i, i})}}{e^{||x_i||\cos(m\theta_{y_i, i})} + \sum_{j \neq y_i} e^{||x_i||\cos(\theta_{j, i})}}$$

Replacing $\cos(m\theta_{y_i, i})$ with the piecewise-monotonic surrogate $\psi(\theta_{y_i, i})$ finally gives the A-Softmax loss:

$$L_{ang} = \frac{1}{N} \sum_i -\log \frac{e^{||x_i||\psi(\theta_{y_i, i})}}{e^{||x_i||\psi(\theta_{y_i, i})} + \sum_{j \neq y_i} e^{||x_i||\cos(\theta_{j, i})}}$$

where $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$ for $\theta \in [\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$, $k \in [0, m-1]$ (this is exactly what phi_theta computes in the code below).

The feature distribution under the Angular Softmax Loss is shown below: A-Softmax not only produces regions of roughly equal size for the different classes, but also clear margins between the classes (intra-class compact, inter-class separated).

(Figure: feature distribution under A-Softmax)
Note: how these 2-D scatter plots are produced (the network outputs 2-D features) and how the points are mapped onto the circle (normalization + decision boundary) is explained below; see also 【Paper】SphereFace: A-Softmax Loss理解.

  1. The paper first asks whether loss functions based on Euclidean margins can really separate features effectively; the authors study softmax loss visually on a binary classification problem.

  2. Recall the softmax activation (which does not change the feature dimensionality). In the binary case, the softmax loss can be written as follows (formula not reproduced here):

    where x is the output of the penultimate layer (say 512-dimensional), W1 is the weight connecting x to class 1, and W2 is the weight connecting x to class 2; both weights are 512-dimensional.

  3. The loss can therefore be written as $L(W, x)$, where W consists of W1 and W2; the softmax loss itself decomposes into softmax + NLLLoss, as shown earlier.

  4. The decision boundary is $W_2 x + b_2 = W_1 x + b_1$, i.e., $(W_2 - W_1)x + b_2 - b_1 = 0$, where $W_1$ and $W_2$ are the softmax-loss weights and $b_1$, $b_2$ are the biases.

  5. Under the constraints $||W_2|| = ||W_1|| = 1$ and $b_1 = b_2 = 0$, the cosine form turns the decision boundary into $||x||(\cos(\theta_2) - \cos(\theta_1)) = 0$, where $\theta$ is the angle between the weight $W$ and the feature $x$. The boundary is then determined solely by $\theta_1$ and $\theta_2$, so the inter-class distance can be enlarged by widening the angular margin $m$, i.e., $||x||(\cos(m\theta_2) - \cos(\theta_1)) = 0$ or $||x||(\cos(\theta_2) - \cos(m\theta_1)) = 0$; a small numeric check follows this list.
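A small numeric illustration of the stricter boundary (the angles are arbitrary, and the check assumes $m\theta_1$ stays within $[0, \pi]$):

```python
import math

theta1, theta2, m = math.radians(30), math.radians(50), 2

# plain boundary: the sample goes to class 1 when cos(theta1) > cos(theta2), i.e. theta1 < theta2
print(math.cos(theta1) > math.cos(theta2))          # True  -> class 1 under Modified Softmax

# A-Softmax boundary for class 1: requires cos(m*theta1) > cos(theta2), i.e. theta1 < theta2 / m
print(math.cos(m * theta1) > math.cos(theta2))      # False -> 30 deg is no longer confidently class 1
```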

Difference between A-Softmax and L-Softmax

The biggest difference is that A-Softmax normalizes the weights while L-Softmax does not. Normalizing the weights maps the features onto the unit hypersphere, whereas L-Softmax has no such restriction, which gives the two methods different geometric interpretations. If the features of two classes fall into the same region during training (figure 1 of the reference, not reproduced), A-Softmax can separate the two classes only by angle, i.e., by direction (figure 2), while L-Softmax can separate them both by angle and by the magnitude (length) of the weights (figure 3). With a dataset of fixed size, L-Softmax has two ways to separate the classes, and training may fail to achieve separation in both angle and magnitude, so its accuracy may end up lower than A-Softmax's.

1) Source code
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
from torch.nn import Parameter
import math

def myphi(x,m):
    x = x * m
    return 1-x**2/math.factorial(2)+x**4/math.factorial(4)-x**6/math.factorial(6) + \
            x**8/math.factorial(8) - x**9/math.factorial(9)

class AngleLinear(nn.Module):
    def __init__(self, in_features, out_features, m = 4, phiflag=True):
        super(AngleLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.Tensor(in_features,out_features))
        self.weight.data.uniform_(-1, 1).renorm_(2,1,1e-5).mul_(1e5)
        self.phiflag = phiflag
        self.m = m
        self.mlambda = [
            lambda x: x**0,
            lambda x: x**1,
            lambda x: 2*x**2-1,
            lambda x: 4*x**3-3*x,
            lambda x: 8*x**4-8*x**2+1,
            lambda x: 16*x**5-20*x**3+5*x
        ]

    def forward(self, input):
        x = input   # size=(B,F)    F is feature len
        w = self.weight # size=(F,Classnum) F=in_features Classnum=out_features

        ww = w.renorm(2,1,1e-5).mul(1e5)
        xlen = x.pow(2).sum(1).pow(0.5) # size=B
        wlen = ww.pow(2).sum(0).pow(0.5) # size=Classnum

        cos_theta = x.mm(ww) # size=(B,Classnum)
        cos_theta = cos_theta / xlen.view(-1,1) / wlen.view(1,-1)
        cos_theta = cos_theta.clamp(-1,1)

        if self.phiflag:
            cos_m_theta = self.mlambda[self.m](cos_theta)
            theta = Variable(cos_theta.data.acos())
            k = (self.m*theta/3.14159265).floor()
            n_one = k*0.0 - 1
            phi_theta = (n_one**k) * cos_m_theta - 2*k
        else:
            theta = cos_theta.acos()
            phi_theta = myphi(theta,self.m)
            phi_theta = phi_theta.clamp(-1*self.m,1)

        cos_theta = cos_theta * xlen.view(-1,1)
        phi_theta = phi_theta * xlen.view(-1,1)
        output = (cos_theta,phi_theta)
        return output # size=(B,Classnum,2)


class AngleLoss(nn.Module):
    def __init__(self, gamma=0):
        super(AngleLoss, self).__init__()
        self.gamma   = gamma
        self.it = 0
        self.LambdaMin = 5.0
        self.LambdaMax = 1500.0
        self.lamb = 1500.0

    def forward(self, input, target):
        self.it += 1
        cos_theta,phi_theta = input
        target = target.view(-1,1) #size=(B,1)

        index = cos_theta.data * 0.0 #size=(B,Classnum)
        index.scatter_(1,target.data.view(-1,1),1)
        index = index.byte()
        index = Variable(index)

        self.lamb = max(self.LambdaMin,self.LambdaMax/(1+0.1*self.it ))
        output = cos_theta * 1.0 #size=(B,Classnum)
        output[index] -= cos_theta[index]*(1.0+0)/(1+self.lamb)
        output[index] += phi_theta[index]*(1.0+0)/(1+self.lamb)

        logpt = F.log_softmax(output, dim=1)
        logpt = logpt.gather(1,target)
        logpt = logpt.view(-1)
        pt = Variable(logpt.data.exp())

        loss = -1 * (1-pt)**self.gamma * logpt
        loss = loss.mean()

        return loss

2) Experiment

To train MobileNetV3 with the angular-margin softmax (A-Softmax), make the following small changes:

class MobileNetV3_Large(nn.Module):
    def __init__(self, num_classes=1000):
        ...
        self.linear4 = nn.Linear(1280, 512)
        self.angleLinear = AngleLinear(in_features=512, out_features=num_classes)

    def forward(self, x):
        out = self.hs1(self.bn1(self.conv1(x)))
        out = self.bneck(out)
        out = self.hs2(self.bn2(self.conv2(out)))
        out = F.avg_pool2d(out, 7)
        out = out.view(out.size(0), -1)
        out = self.hs3(self.bn3(self.linear3(out)))
        out = self.linear4(out)
        out = self.angleLinear(out)  # returns the tuple (cos_theta, phi_theta), each of size (B, Classnum)
        return out
...
# then in train.py
angle_loss = AngleLoss().to(device)  # A-Softmax loss: applies log_softmax over the margin-adjusted logits internally

for epoch in range(0, epoches):
    ...
    out = model(sample)  # (cos_theta, phi_theta)

    loss = angle_loss(out, y)  # A-Softmax loss

    ...
    train_loss += loss.item()
    pred, pred_index = out[0].max(axis=1)  # use cos_theta (out[0]) for the accuracy computation

Results:

epoch = 5, train_loss = 6.160322679110935, train_acc = 0.004642857142857143,test_loss = 6.061383976822808, test_acc = 0.001488095238095238
checkpoint6 is saved
epoch = 6, train_loss = 6.13382915019989, train_acc = 0.007142857142857143,test_loss = 6.065926023891994, test_acc = 0.005952380952380952
checkpoint7 is saved
...

6. CosFace loss (loss constraint, 2018; feature and weight normalization)

Reference

CosFace uses an additive cosine margin. Its large margin cosine loss $L_{MCL}$ normalizes the weights, normalizes the feature vectors to a fixed value s, and applies the margin m to $\cos(\theta)$ (note: the margin acts on the cosine) in the softmax loss:

$$L_{lmc} = \frac{1}{N} \sum_i -\log \frac{e^{s(\cos(\theta_{y_i, i}) - m)}}{e^{s(\cos(\theta_{y_i, i}) - m)} + \sum_{j \neq y_i} e^{s\cos(\theta_{j, i})}}$$

Note: $\cos(\theta_{j, i})$ here comes from the linear layer, because in softmax loss the cosine appears through

$$f_j = W_j^T x = ||W_j|| \cdot ||x|| \cdot \cos\theta_j$$

Comparison of the different loss functions:

NSL is the feature-normalized softmax (as in L2-Softmax); A-Softmax is the SphereFace loss; the grey region is the decision margin.

(Figure: decision margins of Softmax, NSL, A-Softmax, and LMCL)

Why feature normalization is necessary:

Without feature normalization, the original softmax loss learns both the L2 norm of the feature vector and the angle between the feature and the weights. Emphasizing the L2 norm to reduce the overall loss weakens the cosine constraint: for example, if during training the feature norms of easily separable samples become much larger than those of hard samples, the effect of the cosine term is largely masked. With the norm constraint added, the cosine value alone determines the classification probability; after training, features of the same class cluster together on the hypersphere while features of different classes stay far apart.

The scale s also has to be large enough so that all class clusters can spread out on a hypersphere of sufficiently large radius:

$$s \geq \frac{C - 1}{C} \log \frac{(C - 1) P_W}{1 - P_W}$$

where C is the number of classes and $P_W$ is the expected minimum classification probability for each class; a quick check of this bound follows.
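A quick computation of this bound, using C = 276 (MUCT) and $P_W$ = 0.9 as assumed values:

```python
import math

C, P_W = 276, 0.9
s_min = (C - 1) / C * math.log((C - 1) * P_W / (1 - P_W))
print(s_min)   # ~7.8, so the common choice s = 30 in the code below easily satisfies the bound
```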

How to set the hyperparameter m

Consider the binary case. The NSL decision boundary is $\cos(\theta_1) - \cos(\theta_2) = 0$. For samples near this boundary, $\cos(\theta_1)$ and $\cos(\theta_2)$ are very close and the class assignment is ambiguous: either label is acceptable. For LMCL, the decision boundary for class 1 is $\cos(\theta_1) - \cos(\theta_2) = m$, which requires $\theta_1$ to be much smaller than $\theta_2$. The intra-class variation space is therefore compressed and the inter-class space enlarged.

Ideally, the feature vectors of each class all make a small angle with that class's weight vector W, i.e., they cluster around it. Theoretically, m should satisfy $0 \leq m \leq (1 - \max(W_i^T W_j))$. The lower bound is obvious; the upper bound holds because, in the best case, the samples of different classes are distributed around their own class weight vectors $W_i$, and $W_i^T W_j = ||W_i||\,||W_j||\cos(\theta_{ij}) = \cos(\theta_{ij})$, where $\theta_{ij}$ is the angle between two class weight vectors, so m should not exceed $1 - \cos(\theta_{ij})$. The range of m is therefore (formula not reproduced here; it depends on C and the feature dimension K):

C is the number of classes and K is the dimensionality of the learned features. As an example, the authors show the feature distributions learned for 8 face classes under different values of m (figure not reproduced here).
With C = 8 and K = 2 (chosen for easy visualization), $m \leq 1 - \cos(2\pi/8) \approx 0.29$,
so the authors set m to 0, 0.1, and 0.2.

The figure shows that the larger m is, the more discriminative the learned features.

1) Source code
import torch
import torch.nn as nn
import torch.nn.functional as F

#Reference: https://blog.csdn.net/qq_34914551/article/details/104522030
class CosFaceLoss(nn.Module):
    r"""Implement of large margin cosine distance: :
    Args:
        in_features: size of each input sample
        out_features: size of each output sample
        s: norm of input feature
        m: margin
        cos(theta) - m
    """

    def __init__(self, in_features, out_features, s=30.0, m=0.40):
        super(CosFaceLoss, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s
        self.m = m
        self.weight = nn.Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, input, label):
        # --------------------------- cos(theta) & phi(theta) ---------------------------
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        phi = cosine - self.m
        # --------------------------- convert label to one-hot ---------------------------
        one_hot = torch.zeros(cosine.size(), device=cosine.device)  # create on the same device as the logits instead of hard-coding 'cuda'
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        # -------------torch.where(out_i = {x_i if condition_i else y_i) -------------
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        # you can use torch.where if your torch.__version__ is 0.4
        output *= self.s
        # print(output)

        return output

    def __repr__(self):
        return self.__class__.__name__ + '(' \
               + 'in_features=' + str(self.in_features) \
               + ', out_features=' + str(self.out_features) \
               + ', s=' + str(self.s) \
               + ', m=' + str(self.m) + ')'

2) Experiment

To train MobileNetV3 with CosFace loss, make the following small changes:

class MobileNetV3_Large(nn.Module):
    def __init__(self, num_classes=1000):
        ...
        self.linear4 = nn.Linear(1280, 512)  # output dim changed from num_classes to 512 (embedding)
        self.cosFace_loss = CosFaceLoss(in_features=512, out_features=num_classes)

    def forward(self, x, y):
        out = self.hs1(self.bn1(self.conv1(x)))
        out = self.bneck(out)
        out = self.hs2(self.bn2(self.conv2(out)))
        out = F.avg_pool2d(out, 7)
        out = out.view(out.size(0), -1)
        out = self.hs3(self.bn3(self.linear3(out)))
        out = self.linear4(out)
        out = self.cosFace_loss(out, y)  # returns the scaled, margin-adjusted logits
        return out
...
# then in train.py
cross_entropy_loss = nn.CrossEntropyLoss().to(device)  # softmax loss (cross entropy)

for epoch in range(0, epoches):
    ...
    out = model(sample, y)
    loss = cross_entropy_loss(out, y)

Results:

epoch = 0, train_loss = 19.490150450297765, train_acc = 0.0,test_loss = 18.7509761991955, test_acc = 0.0
checkpoint1 is saved
Epoch 2/500: 100%|| 2800/2802 [01:26<00:00, 32.31
epoch = 1, train_loss = 18.732188301086424, train_acc = 0.0,test_loss = 18.819916827338083, test_acc = 0.0
checkpoint2 is saved
Epoch 3/500: 100%|| 2800/2802 [01:25<00:00, 32.62
epoch = 2, train_loss = 18.447026476178852, train_acc = 0.0,test_loss = 18.377974646432058, test_acc = 0.0
checkpoint3 is saved
Epoch 4/500: 100%|| 2800/2802 [01:25<00:00, 32.57
epoch = 3, train_loss = 18.306991988590784, train_acc = 0.0,test_loss = 18.21899235816229, test_acc = 0.0
checkpoint4 is saved
Epoch 5/500: 100%|| 2800/2802 [01:25<00:00, 32.61
epoch = 4, train_loss = 18.136429609571184, train_acc = 0.0,test_loss = 18.920511461439588, test_acc = 0.0
Epoch 6/500:   0%|      | 0/2802 [00:00<?, ?img/s]checkpoint5 is saved

7. ArcFace loss (loss constraint, 2018; feature and weight normalization)

Original paper: ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Rough summary of the paper (read alongside the original):

  1. The ArcFace model, together with its normalization layer and back-propagation, can be used to clean public face datasets (MS1MV0, Celeb500K), reducing the cost of manual cleaning;
  2. The ArcFace loss (Eq. L3 in the paper is derived from Eq. L2, where $\theta$ is the angle between the feature x and the weight W) effectively pulls samples of the same class together (reducing the angular margin between a sample and its class center, as in Eq. L5) and pushes different classes apart (increasing the angular margin between a sample and the other class centers, as in Eq. L6);
  3. SphereFace, ArcFace and CosFace use three different types of margin penalty, which can be written as one unified formula (Eq. L4): a multiplicative angular margin m1, an additive angular margin m2, and an additive cosine margin m3;
  4. Although ArcFace loss is effective at shrinking intra-class and enlarging inter-class margins, shrinking the intra-class margin can be problematic on noisy datasets. Sub-center ArcFace therefore assigns K centers to each identity (the sub-class), i.e. $W \in \mathbb{R}^{512 \times N \times K}$, and counts how the samples of a sub-class cluster (K set to 1, 3, 10): the center with the most samples is the dominant center and the rest are non-dominant (Fig. 6). The sub-center ArcFace loss (Eq. L7) then takes a max over the K centers to compute the angle $\theta$ to the dominant center;
  5. In the experiments, the training sets include CASIA, MS1MV3, IBUG-500K, etc., and the validation sets include LFW, AgeDB, IJB-B, etc.; RetinaFace crops the faces, and the embedding network is the ArcFace model with the BN-FC head removed, used to extract 512-dimensional features;
  6. With the CASIA training set, several losses are used to train different models ([CASIA, ResNet50, Loss\*]); at validation time, 512-dimensional features are extracted on LFW and each identity's feature center is computed to perform identification (the paper does not seem to spell out how the centers are used for inference on LFW; my understanding is that the Euclidean distance between the embedding and the centers is used);
  7. For how the ArcFace decision boundaries are drawn and how points of different classes are placed on the circle, see the SphereFace loss section above.

Reference: ArcFace解析

ArcFace loss (Additive Angular Margin Loss) normalizes both the feature vectors and the weights and adds an angular margin m to θ; an angular margin influences the angle more directly than a cosine margin, and geometrically it corresponds to a constant linear angular margin.

  • ArcFace maximizes the classification margin directly in the angle space θ, whereas CosFace maximizes it in the cosine space cos(θ);
  • Preprocessing (face alignment): facial landmarks are detected by MTCNN, and a similarity transform produces the cropped, aligned face;
  • Training (face classifier): ResNet50 + ArcFace loss;
  • Testing: the 512-dimensional embedding is taken from the FC1 layer of the classifier, and the cosine distance between two embeddings is used for face verification and identification;
  • In the actual code, training is split into resnet model + arc head + softmax loss: the resnet model outputs the features, the arc head adds the angular margin between the features and the weights and outputs the predicted logits (these are used to compute ACC), and the softmax loss measures the error between predictions and labels;
  • 99.83% on LFW, 98.02% on YTF.

ArcFace loss pipeline

(Figure: ArcFace training pipeline)

The ArcFace loss is:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\cos(\theta_{y_i} + m))}}{e^{s(\cos(\theta_{y_i} + m))} + \sum_{j=1, j \neq y_i}^{n} e^{s\cos\theta_j}}$$

  • An angular margin m is added to the angle θ between $x_i$ and $W_{y_i}$ (note: it is added to the angle θ itself), penalizing the angle between the deep feature and its corresponding weight in an additive way, which simultaneously strengthens intra-class compactness and inter-class discrepancy;

  • "Penalizing the angle" means that adding m during training forces θ to decrease;

  • How the margin produces intra-class compactness and inter-class separation: suppose training has brought the loss down to some fixed value, so the exponential terms with and without the margin are equal; then $\theta_{y_i}$ with the margin must be correspondingly smaller. Training with the margin therefore squeezes the angle $\theta_{y_i}$ between the class-i features and the class-i weight; as the angle plots show, the margin makes $\theta_{y_i}$ more compact within the class and more separated from the other classes' angles (a tiny numeric check follows this list);

  • L2 normalization fixes each weight $||W_j|| = 1$ (similar in spirit to L2-Softmax, which also normalizes before softmax loss) and also normalizes the embedding feature $||x_i||$ and rescales it to s. Normalizing both the features and the weights makes the prediction depend only on the angle between them, so the learned embeddings lie on a hypersphere of radius s;

  • Because the proposed additive angular margin penalty equals the geodesic distance margin penalty on the normalized hypersphere, the method is named ArcFace.
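A tiny numeric check of the margin's effect, using the same angle-addition identity as the code below (the values are arbitrary):

```python
import math

theta, m, s = math.radians(60), 0.5, 32.0

logit_plain  = s * math.cos(theta)        # target-class logit without the margin
logit_margin = s * math.cos(theta + m)    # ArcFace target logit: cos(theta+m) = cos(theta)cos(m) - sin(theta)sin(m)

print(logit_plain, logit_margin)          # 16.0 vs ~0.8: the margin shrinks the target logit,
                                          # so the network must reduce theta to reach the same loss
```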

Advantages of ArcFace

  • High performance, easy to implement, low complexity, efficient training;

  • ArcFace directly optimizes the geodesic distance margin (arc length), thanks to the exact correspondence between angles and arcs on the normalized hypersphere. Its feature distribution is more compact than softmax's and the decision boundary clearer: one arc per class;

  • For stable performance, ArcFace does not need joint supervision with other loss functions and converges easily on any training set.

1) Source code walkthrough
#Reference: https://blog.csdn.net/qq_40859461/article/details/86771136
import math
import torch
from torch import nn
from torch.nn import Parameter
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    def __init__(self, in_feature=128, out_feature=10575, s=32.0, m=0.50, easy_margin=False):
        super(ArcFaceLoss, self).__init__()
        self.in_feature = in_feature
        self.out_feature = out_feature
        self.s = s
        self.m = m
        self.weight = Parameter(torch.Tensor(out_feature, in_feature))
        nn.init.xavier_uniform_(self.weight)  # Xavier init: keeps the variance of each layer roughly equal so information flows well

        self.easy_margin = easy_margin
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)

        # make the function cos(theta+m) monotonic decreasing while theta in [0°,180°]
        self.th = math.cos(math.pi - m)
        self.mm = math.sin(math.pi - m) * m

    def forward(self, x, label):
        # cos(theta)
        cosine = F.linear(F.normalize(x), F.normalize(self.weight))   # normalizes both the features and the weights
        # cos(theta + m)
        sine = torch.sqrt(1.0 - torch.pow(cosine, 2))
        phi = cosine * self.cos_m - sine * self.sin_m

        if self.easy_margin:
            phi = torch.where(cosine > 0, phi, cosine)
        else:
            phi = torch.where((cosine - self.th) > 0, phi, cosine - self.mm)

        #one_hot = torch.zeros(cosine.size(), device='cuda' if torch.cuda.is_available() else 'cpu')
        one_hot = torch.zeros_like(cosine)
        one_hot.scatter_(1, label.view(-1, 1), 1)
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        output = output * self.s

        return output
2) Experiment
class MobileNetV3_Large(nn.Module):
    def __init__(self, num_classes=1000):
        ...
        self.linear4 = nn.Linear(1280, 512)  # output dim changed from num_classes to 512 (embedding)
        self.arc_loss = ArcFaceLoss(in_feature=512, out_feature=num_classes,
                                    m=0.5)  # ArcFace head: 512-d embedding in, num_classes logits out, margin m = 0.5 (features and weights are normalized internally)

    def forward(self, x, y):
        out = self.hs1(self.bn1(self.conv1(x)))
        out = self.bneck(out)
        out = self.hs2(self.bn2(self.conv2(out)))
        out = F.avg_pool2d(out, 7)
        out = out.view(out.size(0), -1)
        out = self.hs3(self.bn3(self.linear3(out)))
        out = self.linear4(out)
        out = self.arc_loss(out, y)  # ArcFace margin-adjusted logits
        return out
...
# then in train.py
cross_entropy_loss = nn.CrossEntropyLoss().to(device)  # softmax loss (cross entropy)

for epoch in range(0, epoches):
    out = model(sample, y)

    loss = cross_entropy_loss(out, y)

Results:

Epoch 1/100: 100%|| 47568/47571 [46:39<00:00, 16.
epoch = 0, train_loss = 15.873860453981925, train_acc = 0.0,test_loss = 14.844074659837949, test_acc = 0.0
Epoch 2/100:   0%|     | 0/47571 [00:00<?, ?img/s]checkpoint1 is saved
Epoch 2/100: 100%|| 47568/47571 [25:02<00:00, 31.
epoch = 1, train_loss = 13.400763185586426, train_acc = 0.0,test_loss = 14.149169119580776, test_acc = 0.0
checkpoint2 is saved
Epoch 3/100: 100%|| 47568/47571 [22:34<00:00, 35.WARNING:root:NaN or Inf found in input tensor.
Epoch 3/100: 100%|| 47568/47571 [25:00<00:00, 31.
epoch = 2, train_loss = nan, train_acc = 0.0009670366633030609,test_loss = nan, test_acc = 0.0015405864853378665
checkpoint3 is saved
Epoch 4/100: 100%|| 47568/47571 [22:39<00:00, 35.WARNING:root:NaN or Inf found in input tensor.
Epoch 4/100: 100%|| 47568/47571 [25:05<00:00, 31.
epoch = 3, train_loss = nan, train_acc = 0.0015556676757484023,test_loss = nan, test_acc = 0.0015405864853378665

Notice that with lr = 0.01, the ArcFace loss becomes NaN at epoch 2. After browsing around, the fixes are roughly as follows.

3) How to fix NaN in the ArcFace loss

Reference

The fix

Load the model from the L2-Softmax pretrained weights (see the L2-Softmax experiment above, where the model overfit) and lower lr to 0.005; train_loss and test_loss finally both decrease (I suspect the learning rate is what actually matters). A minimal sketch of this setup follows.
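A minimal sketch of this setup; the checkpoint path and optimizer choice are assumptions, not taken from the original training script:

```python
import torch
from torch import optim

model = MobileNetV3_Large(num_classes=276)

# warm-start from the (overfit) L2-Softmax run instead of random init; path is hypothetical
state = torch.load("checkpoints/l2_softmax_pretrained.pth", map_location="cpu")
model.load_state_dict(state, strict=False)   # strict=False: the ArcFace head's weights are new

# a smaller learning rate than the 0.01 that produced NaN
optimizer = optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
```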

Epoch 1/500: 100%|| 2800/2802 [01:28<00:00, 31.53
epoch = 0, train_loss = 10.648999228818075, train_acc = 0.016428571428571428,test_loss = 9.672125000329245, test_acc = 0.025297619047619048
checkpoint1 is saved
Epoch 2/500: 100%|| 2800/2802 [01:27<00:00, 32.07
epoch = 1, train_loss = 10.117614848954338, train_acc = 0.01892857142857143,test_loss = 9.23866881359191, test_acc = 0.022321428571428572
checkpoint2 is saved
Epoch 3/500: 100%|| 2800/2802 [01:26<00:00, 32.20
epoch = 2, train_loss = 9.56589084471975, train_acc = 0.02142857142857143,test_loss = 8.703474456355686, test_acc = 0.03571428571428571
checkpoint3 is saved
...
epoch = 34, train_loss = 6.7769753401620045, train_acc = 0.09892857142857144,test_loss = 7.465910258747282, test_acc = 0.11160714285714286
checkpoint35 is saved
Epoch 36/500: 100%|| 2800/2802 [01:25<00:00, 32.6
epoch = 35, train_loss = 6.83953810266086, train_acc = 0.10107142857142858,test_loss = 7.755035051277706, test_acc = 0.09523809523809523
checkpoint36 is saved
Epoch 37/500: 100%|| 2800/2802 [01:25<00:00, 32.6
epoch = 36, train_loss = 7.095788996134486, train_acc = 0.09142857142857143,test_loss = 8.262529448384331, test_acc = 0.08928571428571429
Epoch 38/500:   0%|     | 0/2802 [00:00<?, ?img/s]checkpoint37 is saved

However, a model trained with MobileNetV3 + ArcFace loss still overfits easily: train_loss and test_loss both decrease, but the gap between them remains quite large.

Epoch 126/500:   0%|    | 0/2802 [00:00<?, ?img/s]checkpoint125 is saved
Epoch 126/500: 100%|| 2800/2802 [01:25<00:00, 32.
epoch = 125, train_loss = 4.749639714328306, train_acc = 0.20285714285714285,test_loss = 7.780712046438739, test_acc = 0.15625
checkpoint126 is saved
Epoch 127/500: 100%|| 2800/2802 [01:25<00:00, 32.
epoch = 126, train_loss = 4.593655687602503, train_acc = 0.21392857142857144,test_loss = 7.834138387725467, test_acc = 0.11011904761904762
Epoch 128/500:   0%|    | 0/2802 [00:00<?, ?img/s]checkpoint127 is saved
Epoch 128/500: 100%|| 2800/2802 [01:25<00:00, 32.
epoch = 127, train_loss = 4.784141867070326, train_acc = 0.18321428571428572,test_loss = 7.470722722155707, test_acc = 0.14732142857142858

Consider training with HRNetV2 instead.

8. Reference source code
