Understanding Cross Entropy Loss, Center Loss, CosFace Loss, and ArcFace Loss

Face recognition boils down to extracting facial features and telling them apart at a fine-grained level.
Early losses for training face models, such as contrastive loss and triplet loss, feed several samples into the model together and try to learn what makes them different.
Today we focus on a few losses from recent years that matter in practice (common in interviews and in real work).

1. Center Loss

1 The gap between Cross Entropy Loss and what a face recognition network needs

The authors point out in the introduction that many tasks are close-set identification, e.g., image classification.
Close-set identification means the predicted result must belong to one of the classes in the training set: with ImageNet, for example, the target in any test image must be one of the dataset's 1000 classes.
This raises the question of generalization.

The essence of face recognition is to distinguish every individual: each person's face is effectively its own class, and identities unseen during training can appear at test time, so face recognition is not close-set identification. The way to solve such an open-set problem is to make the features themselves more discriminative, i.e., each person's face features should be compact within the identity and clearly separated from other identities.
So, although both are classification, the authors contrast two properties a feature space can have:

Separable and Discriminative. In the figure below, both feature sets can be classified, but the latter is obviously stronger for recognition.
From this, the authors introduce center loss, a loss designed to give the network good discriminative power.

[Figure: separable features vs. discriminative features]

Cross Entropy Loss

For an ordinary classification model we use cross entropy loss. Figures (a) and (b) below plot the learned deep features on the training set and the test set, respectively, when training with cross entropy loss.

[Figure: deep features under cross entropy loss, (a) training set, (b) test set]

Cross entropy loss takes the form of a negative log-likelihood (NLL) loss, where the prediction is the softmax output.

On the relationship between NLL and cross entropy (CE): NLL is a special case of CE. CE measures the cross entropy between the true distribution and the predicted distribution. In information theory, if an event $k$ occurs with probability $P_k$, its information content is $-\log P_k$; for the distribution $p$, the entropy $H(p)$ is then

$$H(p) = -\sum_{k} P_k \log P_k$$

If an event $k$ has two different probability distributions $p$ and $q$, their cross entropy $H(p, q)$ is defined as

$$H(p, q) = -\sum_{k} p_k \log q_k$$

If we treat $p$ as the ground-truth distribution of the training set and $q$ as the predicted distribution, then when $p$ is a one-hot vector the cross entropy loss simplifies to the first form below. The second form is how the original paper writes it; note that $m$ and $n$ are the batch size and the number of classes respectively, and the loss is computed over one mini-batch.

$$H(p, q) = -\log q_{y_i}$$

$$\mathcal{L}_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}}$$
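To make the one-hot simplification concrete, here is a tiny numeric check (the logits are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 2.5, 0.3]])   # raw scores for 3 classes
label = torch.tensor([1])                  # one-hot target: class 1
q = F.softmax(logits, dim=1)
manual = -torch.log(q[0, 1])               # -log q_y, the one-hot form above
builtin = F.cross_entropy(logits, label)
assert torch.allclose(manual, builtin)
```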

Cross Entropy Loss in practice

When PyTorch and similar frameworks implement torch.nn.CrossEntropyLoss, it actually combines LogSoftmax() and NLLLoss() (see the post “pytorch中的一些loss函数”). It therefore takes the raw outputs of the model's last layer, not softmax probabilities, because the softmax is folded into torch.nn.CrossEntropyLoss itself. This is also why people sometimes call it “softmax loss”, though that name is informal.
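A minimal sketch to verify this equivalence (shapes and values are arbitrary):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)                 # raw model outputs, no softmax applied
labels = torch.randint(0, 10, (4,))

ce = nn.CrossEntropyLoss()(logits, labels)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)
assert torch.allclose(ce, nll)              # identical up to floating-point error
```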

Conclusion from the paper: “the learned features are separable, but deep features are not discriminative enough, since they still show significant intra-class variations. Consequently, it is not suitable to directly use these features for recognition.”

2 Introducing Center Loss

The formula below is center loss, and it is easy to understand: $c_{y_i}$ is the center of the features of all samples whose class is $y_i$, and the loss pulls each feature $x_i$ as close to its center as possible, thereby reducing intra-class distance (noise).

$$\mathcal{L}_C = \frac{1}{2}\sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$$
Below are three questions and their answers; the answers were written by GPT, and I think they are quite good.

  1. Explain center loss
  2. Why can't center loss be used on its own
  3. Why does center loss have a regularization effect
1. Explain center loss
Center loss is a loss function used in deep learning to improve the performance of models on classification tasks, particularly for face recognition. The idea behind center loss is to learn a set of features that better represent the different classes in the dataset by encouraging the model to group together features of the same class and separate those of different classes.

To do this, center loss calculates a "center" for each class, which is a vector that represents the average of all the feature vectors of that class. The goal of the loss function is then to minimize the distance between a given input feature vector and its corresponding class center, while maximizing the distance between feature vectors of different classes.

The formula for center loss can be written as:

$$L_C = \frac{1}{2} \sum_{i=1}^{N} \lVert x_i - c_{y_i} \rVert^2$$

where $L_C$ is the center loss, $N$ is the number of samples in a mini-batch, $x_i$ is the feature vector of the i-th sample, $c_{y_i}$ is the center of the class to which the i-th sample belongs, and $\lVert\cdot\rVert^2$ denotes the squared Euclidean distance.

During training, center loss is typically used in conjunction with a softmax cross-entropy loss, which is a standard loss function for classification tasks. The softmax cross-entropy loss encourages the model to correctly predict the class label of each input, while center loss helps to learn better feature representations.

By combining these two losses, the model learns to both accurately classify inputs and separate the feature representations of different classes. This can result in improved performance on face recognition and other classification tasks.




2. Why center loss cannot be used directly

Center loss is a popular method in deep learning for improving the performance of 
face recognition and other related tasks. It aims to learn a set of feature 
representations such that the distance between the features of the same class is 
minimized while the distance between the features of different classes is 
maximized. This is achieved by adding a center loss term to the standard softmax 
cross-entropy loss.

While center loss can be effective in improving the performance of deep learning 
models, it cannot be used directly in all cases. One reason is that center loss 
is a form of regularization, which means that it is intended to prevent 
overfitting by constraining the model's learned representations. However, if the 
model is already underfitting or is not complex enough to learn meaningful 
features, center loss may not be effective.

Another reason why center loss cannot be used directly is that it may not be 
appropriate for all types of tasks. For example, center loss is most commonly 
used for face recognition tasks, where the goal is to identify individuals based 
on their facial features. However, for tasks that require more general feature 
learning, such as object detection or semantic segmentation, other loss functions 
may be more appropriate.

Additionally, center loss may require careful tuning of hyperparameters, such as 
the weight assigned to the center loss term, in order to achieve the best 
performance. This tuning process can be time-consuming and computationally 
expensive.

In summary, while center loss can be a useful tool in deep learning, it is not a 
one-size-fits-all solution and may require careful consideration and adaptation 
to different tasks and models.




3. Why center loss has a regularization effect

Center loss can be considered as a form of regularization in deep learning 
because it helps to prevent overfitting of the model by constraining the learned 
representations.


During training, the center loss encourages the model to learn features that are 
more representative of each class by minimizing the distance between an input 
feature vector and its corresponding class center. By doing so, the model is 
encouraged to group together features of the same class and separate those of 
different classes.


This has a regularizing effect because it encourages the model to learn more 
generalizable features that are less likely to be specific to the training data 
and more likely to be applicable to new, unseen data. In other words, the model 
is encouraged to learn features that are better at generalizing to new data and 
less likely to memorize specific examples from the training data.


Additionally, center loss can help to reduce the impact of noisy or mislabeled 
data on the model's learned representations. Since the loss is based on the 
distance between input feature vectors and class centers, it can help to pull 
misclassified or noisy samples towards the correct class center and away from the 
incorrect one.


Overall, by encouraging the model to learn better, more generalizable feature 
representations, center loss can help to prevent overfitting and improve the 
performance of deep learning models on classification tasks.

To make center loss usable in training, the authors adopt two adjustments (quoted from the paper):

  1. First, instead of updating the centers with respect to the entire training set, we perform the update based on mini-batch. In each iteration, the centers are computed by averaging the features of the corresponding classes (in this case, some of the centers may not update).
  2. Second, to avoid large perturbations caused by few mislabelled samples, we use a scalar α to control the learning rate of the centers.

The gradient and the center-update rule are derived in the original paper (various blog posts walk through them as well):

$$\frac{\partial \mathcal{L}_C}{\partial x_i} = x_i - c_{y_i}, \qquad \Delta c_j = \frac{\sum_{i=1}^{m} \delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m} \delta(y_i = j)}$$

Training with different values of $\lambda$ (the weight of the center term in the joint loss $\mathcal{L} = \mathcal{L}_S + \lambda \mathcal{L}_C$) gives different feature distributions, and the results match our expectations.

[Figure: feature distributions for different values of $\lambda$]
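Putting the pieces together, here is a minimal PyTorch sketch of center loss. The class is illustrative, not the paper's official code: the centers live in a buffer and are updated manually with the Δc rule above under the rate α; the defaults α = 0.5 and λ ≈ 0.003 are only commonly cited values and should be treated as tunable.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Minimal sketch of center loss over a mini-batch (mean instead of sum)."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        # centers sit in a buffer and are updated manually with rate alpha,
        # mirroring the paper's mini-batch update rather than plain backprop
        self.register_buffer('centers', torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        centers_batch = self.centers[labels]            # each sample's class center
        return 0.5 * ((features - centers_batch) ** 2).sum(dim=1).mean()

    @torch.no_grad()
    def update_centers(self, features, labels, alpha=0.5):
        # delta_c_j = sum_{y_i = j} (c_j - x_i) / (1 + count_j)
        for j in labels.unique():
            mask = labels == j
            delta = (self.centers[j] - features[mask]).sum(0) / (1 + mask.sum())
            self.centers[j] -= alpha * delta
```

In each iteration one would compute `total = cross_entropy(logits, labels) + lam * center(features, labels)`, backpropagate, step the optimizer, and then call `center.update_centers(features.detach(), labels)`.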

2. CosFace

Form of the loss:

$$\mathcal{L}_{lmc} = -\frac{1}{N}\sum_{i} \log \frac{e^{s(\cos\theta_{y_i, i} - m)}}{e^{s(\cos\theta_{y_i, i} - m)} + \sum_{j \neq y_i} e^{s \cos\theta_{j, i}}}$$

where both the weights and the features are L2-normalized, so $\cos\theta_{j,i} = W_j^T x_i$; $s$ is the feature scale and $m$ is the cosine margin.

In face recognition algorithms such as CosFace and ArcFace, what does the angle $\theta$ mean? An answer follows.

In face recognition systems such as CosFace and ArcFace, the theta angle refers 
to the angle between the feature embedding vector of a face and the weight vector
associated with the ground truth label of that face.

In more detail, the feature embedding vector is a high-dimensional representation 
of a face that is generated by a deep neural network, while the weight vector is 
a learned parameter vector associated with the ground truth label of that face. 
The cosine of the theta angle is then the cosine similarity between these two 
vectors, which measures the degree of alignment between the feature embedding 
vector and the weight vector.

The use of the theta angle in face recognition systems allows for a more 
discriminative and robust feature representation of faces, leading to improved 
performance in face recognition tasks.

Then what is the feature embedding vector? An answer follows.


The feature embedding vector is a high-dimensional numerical representation of a 
face that is learned by a deep neural network. In face recognition systems, the 
neural network is typically trained on a large dataset of face images in an 
unsupervised or supervised manner to learn a feature representation that is 
highly discriminative and robust to variations in lighting, pose, and other 
factors.


The output of the neural network for a given face image is a feature embedding 
vector, which can be thought of as a set of numerical values that encode the 
salient characteristics of the face. These values can be used to compare and 
match faces for tasks such as identification or verification.

The feature embedding vector is typically a high-dimensional vector, ranging from 
a few hundred to a few thousand dimensions, depending on the specific neural 
network architecture used. The dimensions of the vector are not directly 
interpretable, but the values of the vector are learned to be highly informative 
for distinguishing between different faces.
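In face verification, for instance, two such embeddings are typically compared by cosine similarity (a sketch with made-up vectors and an illustrative threshold):

```python
import torch
import torch.nn.functional as F

emb_a = F.normalize(torch.randn(1, 512), dim=1)   # embedding of face A
emb_b = F.normalize(torch.randn(1, 512), dim=1)   # embedding of face B
similarity = (emb_a * emb_b).sum()                # cosine similarity in [-1, 1]
is_same_person = similarity > 0.5                 # threshold is dataset-dependent
```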

Having understood what $\theta$ means, we can better follow how these losses are designed. Look back at the CosFace loss given above.

Design idea:
Suppose there are two angles $\theta_1$ and $\theta_2$. We know $\theta$ is the angle between the output feature vector and the weight vector of the ground-truth class. For the target class, reaching a given logit value with $\cos(\theta) - m$ requires a smaller angle $\theta_2$ than the angle $\theta_1$ that plain $\cos(\theta)$ would need. In other words, the margin $m$ makes the criterion harder to satisfy, and once the model has adapted to $m$, $\theta$ ends up smaller, i.e., each class becomes more compact. As a quick numeric intuition (made-up numbers): with $m = 0.35$, beating a competing class at $\theta_j = 70°$ ($\cos 70° \approx 0.34$) used to require only $\theta < 70°$; with the margin it requires $\cos\theta > 0.34 + 0.35 = 0.69$, i.e., roughly $\theta < 46°$.

[Figure: geometric comparison of the decision margins]
The CosFace loss implementation is below; look at the forward part:
first, the features and the weights are both L2-normalized;
then multiplying the two matrices directly yields the cosine values, since the norms are 1;
next, subtract m, and use the one-hot labels to pick out the entries that get the margin while keeping the plain cosine everywhere else;
finally multiply by s, return, and feed the result into the cross entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter


class AddMarginProduct(nn.Module):
    r"""Implementation of the large margin cosine distance (CosFace).
    Args:
        in_features: size of each input sample
        out_features: size of each output sample
        s: norm of input feature
        m: margin
        cos(theta) - m
    """

    def __init__(self, in_features, out_features, s=30.0, m=0.40):
        super(AddMarginProduct, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s
        self.m = m
        self.weight = Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, input, label):
        # --------------------------- cos(theta) & phi(theta) ---------------------------
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        phi = cosine - self.m
        # --------------------------- convert label to one-hot ---------------------------
        one_hot = torch.zeros(cosine.size(), device='cuda')
        # one_hot = one_hot.cuda() if cosine.is_cuda else one_hot
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        # torch.where(condition, x, y): out_i = x_i if condition_i else y_i
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        # equivalently: output = torch.where(one_hot.bool(), phi, cosine)
        output *= self.s
        # print(output)

        return output

    def __repr__(self):
        return self.__class__.__name__ + '(' \
               + 'in_features=' + str(self.in_features) \
               + ', out_features=' + str(self.out_features) \
               + ', s=' + str(self.s) \
               + ', m=' + str(self.m) + ')'
```
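A quick usage sketch (the 512-d embedding size and 1000 identities are illustrative; note the class above hard-codes device='cuda', so this assumes a GPU):

```python
margin_head = AddMarginProduct(in_features=512, out_features=1000).cuda()
criterion = nn.CrossEntropyLoss()

embeddings = torch.randn(8, 512).cuda()    # features from the backbone
labels = torch.randint(0, 1000, (8,)).cuda()

logits = margin_head(embeddings, labels)   # s * (cos(theta) - m) at target entries
loss = criterion(logits, labels)           # standard cross entropy on margined logits
```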

3. ArcFace

Let's look at the ArcFace pipeline. Much of it is similar to CosFace; the difference lies in how the angle is handled.

[Figure: ArcFace training pipeline]

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\,\cos(\theta_{y_i} + m)}}{e^{s\,\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s\,\cos\theta_j}}$$
The idea is much the same as CosFace: ArcFace adds a margin $m$ to $\theta$ itself, so that $\cos(\theta + m) < \cos(\theta)$ (on the monotonic interval $[0, \pi]$). The loss therefore grows, the learning task gets harder, and the model is pushed to make $\theta$ smaller.

[Figure: effect of the additive angular margin]

The ArcFace loss code is below.
Here m is an angle in radians, set to 0.5 in the experiments.
The only potentially confusing part is easy_margin. If easy_margin is true, the margin m is added to the angle only when the angle is below 90° (i.e., cosine > 0). If it is false, self.th acts as the threshold: sketch the cosine curve and you will see that on the monotonic interval $[0, \pi]$, cosine > th = $\cos(\pi - m)$ is equivalent to $\theta < \pi - m$. So th guarantees that $\theta + m$ still lies within the monotonic interval $[0, \pi]$; otherwise the code falls back to the linear penalty cosine − mm.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter


class ArcMarginProduct(nn.Module):
    r"""Implementation of the large margin arc distance (ArcFace).
        Args:
            in_features: size of each input sample
            out_features: size of each output sample
            s: norm of input feature
            m: margin
            cos(theta + m)
        """

    def __init__(self, in_features, out_features, s=30.0, m=0.50, easy_margin=False):
        super(ArcMarginProduct, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s
        self.m = m
        self.weight = Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

        self.easy_margin = easy_margin
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)
        self.th = math.cos(math.pi - m)
        self.mm = math.sin(math.pi - m) * m

    def forward(self, input, label):
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        # clamp to avoid NaN from sqrt of a tiny negative when cos is close to ±1
        sine = torch.sqrt((1.0 - torch.pow(cosine, 2)).clamp(0, 1))
        # cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
        phi = cosine * self.cos_m - sine * self.sin_m
        if self.easy_margin:
            phi = torch.where(cosine > 0, phi, cosine)
        else:
            phi = torch.where(cosine > self.th, phi, cosine - self.mm)
        one_hot = torch.zeros(cosine.size(), device='cuda')
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        output *= self.s
        return output
```
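A tiny numeric check of the threshold logic, outside the class (the two angles are made up):

```python
import math
import torch

theta = torch.tensor([0.3, 2.8])          # one easy angle, one close to pi
cosine = torch.cos(theta)
m = 0.5
phi = torch.cos(theta + m)                # target logit with the angular margin
th = math.cos(math.pi - m)                # passes only if theta < pi - m
mm = math.sin(math.pi - m) * m
# where theta + m would leave [0, pi], fall back to the linear penalty cos(theta) - mm
safe = torch.where(cosine > th, phi, cosine - mm)
print(safe)  # first entry keeps cos(theta + m); the second uses the fallback
```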
