Understanding Cross Entropy Loss, Center Loss, CosFace Loss, and ArcFace Loss

Face recognition boils down to extracting facial features and telling them apart at a fine-grained level.
Early losses for training face models, such as contrastive loss and triplet loss, feed several samples into the model together and try to learn what makes them different.
Today we focus on a few losses from recent years that matter in practice (common in interviews and in real work).

1. Center Loss

1 The gap between Cross Entropy Loss and what a face recognition network needs

The authors point out in the introduction that many tasks are close-set identification, e.g., image classification.
Close-set identification means the predicted result must belong to one of the classes in the training set: with ImageNet, for example, the target in any test image must be one of the dataset's 1000 classes.
This raises the question of generalization.

The essence of face recognition is to distinguish every individual: each person's face is effectively its own class, and identities unseen during training can appear at test time, so face recognition is not close-set identification. The way to solve such an open-set problem is to make the features themselves more discriminative, i.e., each person's face features should be compact within the identity and clearly separated from other identities.
So, although both are classification, the authors contrast two properties a feature space can have:

Separable and Discriminative. In the figure below, both feature sets can be classified, but the latter is obviously stronger for recognition.
From this, the authors introduce center loss, a loss designed to give the network good discriminative power.

[Figure: separable features vs. discriminative features]

Cross Entropy Loss

For an ordinary classification model we use cross entropy loss. Figures (a) and (b) below plot the learned deep features on the training set and the test set, respectively, when training with cross entropy loss.

[Figure: deep features under cross entropy loss, (a) training set, (b) test set]

Cross entropy loss takes the form of a negative log-likelihood (NLL) loss, where the prediction is the softmax output.

On the relationship between NLL and cross entropy (CE): NLL is a special case of CE. CE measures the cross entropy between the true distribution and the predicted distribution. In information theory, if an event $k$ occurs with probability $P_k$, its information content is $-\log P_k$; for the distribution $p$, the entropy $H(p)$ is then

$$H(p) = -\sum_{k} P_k \log P_k$$

If an event $k$ has two different probability distributions $p$ and $q$, their cross entropy $H(p, q)$ is defined as

$$H(p, q) = -\sum_{k} p_k \log q_k$$

If we treat $p$ as the ground-truth distribution of the training set and $q$ as the predicted distribution, then when $p$ is a one-hot vector the cross entropy loss simplifies to the first form below. The second form is how the original paper writes it; note that $m$ and $n$ are the batch size and the number of classes respectively, and the loss is computed over one mini-batch.

$$H(p, q) = -\log q_{y_i}$$

$$\mathcal{L}_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}}$$
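To make the one-hot simplification concrete, here is a tiny numeric check (the logits are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 2.5, 0.3]])   # raw scores for 3 classes
label = torch.tensor([1])                  # one-hot target: class 1
q = F.softmax(logits, dim=1)
manual = -torch.log(q[0, 1])               # -log q_y, the one-hot form above
builtin = F.cross_entropy(logits, label)
assert torch.allclose(manual, builtin)
```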

Cross Entropy Loss in practice

When PyTorch and similar frameworks implement torch.nn.CrossEntropyLoss, it actually combines LogSoftmax() and NLLLoss() (see the post “pytorch中的一些loss函数”). It therefore takes the raw outputs of the model's last layer, not softmax probabilities, because the softmax is folded into torch.nn.CrossEntropyLoss itself. This is also why people sometimes call it “softmax loss”, though that name is informal.
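A minimal sketch to verify this equivalence (shapes and values are arbitrary):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)                 # raw model outputs, no softmax applied
labels = torch.randint(0, 10, (4,))

ce = nn.CrossEntropyLoss()(logits, labels)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)
assert torch.allclose(ce, nll)              # identical up to floating-point error
```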

Conclusion from the paper: “the learned features are separable, but deep features are not discriminative enough, since they still show significant intra-class variations. Consequently, it is not suitable to directly use these features for recognition.”

2 Introducing Center Loss

The formula below is center loss, and it is easy to understand: $c_{y_i}$ is the center of the features of all samples whose class is $y_i$, and the loss pulls each feature $x_i$ as close to its center as possible, thereby reducing intra-class distance (noise).

$$\mathcal{L}_C = \frac{1}{2}\sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$$
Below are three questions and their answers; the answers were written by GPT, and I think they are quite good.

  1. Explain center loss
  2. Why can't center loss be used on its own
  3. Why does center loss have a regularization effect
1. Explain center loss
Center loss is a loss function used in deep learning to improve the performance of models on classification tasks, particularly for face recognition. The idea behind center loss is to learn a set of features that better represent the different classes in the dataset by encouraging the model to group together features of the same class and separate those of different classes.

To do this, center loss calculates a "center" for each class, which is a vector that represents the average of all the feature vectors of that class. The goal of the loss function is then to minimize the distance between a given input feature vector and its corresponding class center, while maximizing the distance between feature vectors of different classes.

The formula for center loss can be written as:

$$L_C = \frac{1}{2} \sum_{i=1}^{N} \lVert x_i - c_{y_i} \rVert^2$$

where $L_C$ is the center loss, $N$ is the number of samples in a mini-batch, $x_i$ is the feature vector of the i-th sample, $c_{y_i}$ is the center of the class to which the i-th sample belongs, and $\lVert\cdot\rVert^2$ denotes the squared Euclidean distance.

During training, center loss is typically used in conjunction with a softmax cross-entropy loss, which is a standard loss function for classification tasks. The softmax cross-entropy loss encourages the model to correctly predict the class label of each input, while center loss helps to learn better feature representations.

By combining these two losses, the model learns to both accurately classify inputs and separate the feature representations of different classes. This can result in improved performance on face recognition and other classification tasks.




2. Why center loss cannot be used directly

Center loss is a popular method in deep learning for improving the performance of 
face recognition and other related tasks. It aims to learn a set of feature 
representations such that the distance between the features of the same class is 
minimized while the distance between the features of different classes is 
maximized. This is achieved by adding a center loss term to the standard softmax 
cross-entropy loss.

While center loss can be effective in improving the performance of deep learning 
models, it cannot be used directly in all cases. One reason is that center loss 
is a form of regularization, which means that it is intended to prevent 
overfitting by constraining the model's learned representations. However, if the 
model is already underfitting or is not complex enough to learn meaningful 
features, center loss may not be effective.

Another reason why center loss cannot be used directly is that it may not be 
appropriate for all types of tasks. For example, center loss is most commonly 
used for face recognition tasks, where the goal is to identify individuals based 
on their facial features. However, for tasks that require more general feature 
learning, such as object detection or semantic segmentation, other loss functions 
may be more appropriate.

Additionally, center loss may require careful tuning of hyperparameters, such as 
the weight assigned to the center loss term, in order to achieve the best 
performance. This tuning process can be time-consuming and computationally 
expensive.

In summary, while center loss can be a useful tool in deep learning, it is not a 
one-size-fits-all solution and may require careful consideration and adaptation 
to different tasks and models.




3. Why center loss has a regularization effect

Center loss can be considered as a form of regularization in deep learning 
because it helps to prevent overfitting of the model by constraining the learned 
representations.


During training, the center loss encourages the model to learn features that are 
more representative of each class by minimizing the distance between an input 
feature vector and its corresponding class center. By doing so, the model is 
encouraged to group together features of the same class and separate those of 
different classes.


This has a regularizing effect because it encourages the model to learn more 
generalizable features that are less likely to be specific to the training data 
and more likely to be applicable to new, unseen data. In other words, the model 
is encouraged to learn features that are better at generalizing to new data and 
less likely to memorize specific examples from the training data.


Additionally, center loss can help to reduce the impact of noisy or mislabeled 
data on the model's learned representations. Since the loss is based on the 
distance between input feature vectors and class centers, it can help to pull 
misclassified or noisy samples towards the correct class center and away from the 
incorrect one.


Overall, by encouraging the model to learn better, more generalizable feature 
representations, center loss can help to prevent overfitting and improve the 
performance of deep learning models on classification tasks.

To make center loss usable in training, the authors adopt two adjustments (quoted from the paper):

  1. First, instead of updating the centers with respect to the entire training set, we perform the update based on mini-batch. In each iteration, the centers are computed by averaging the features of the corresponding classes (in this case, some of the centers may not update).
  2. Second, to avoid large perturbations caused by few mislabelled samples, we use a scalar α to control the learning rate of the centers.

The gradient and the center-update rule are derived in the original paper (various blog posts walk through them as well):

$$\frac{\partial \mathcal{L}_C}{\partial x_i} = x_i - c_{y_i}, \qquad \Delta c_j = \frac{\sum_{i=1}^{m} \delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m} \delta(y_i = j)}$$

Training with different values of $\lambda$ (the weight of the center term in the joint loss $\mathcal{L} = \mathcal{L}_S + \lambda \mathcal{L}_C$) gives different feature distributions, and the results match our expectations.

[Figure: feature distributions for different values of $\lambda$]
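Putting the pieces together, here is a minimal PyTorch sketch of center loss. The class is illustrative, not the paper's official code: the centers live in a buffer and are updated manually with the Δc rule above under the rate α; the defaults α = 0.5 and λ ≈ 0.003 are only commonly cited values and should be treated as tunable.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Minimal sketch of center loss over a mini-batch (mean instead of sum)."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        # centers sit in a buffer and are updated manually with rate alpha,
        # mirroring the paper's mini-batch update rather than plain backprop
        self.register_buffer('centers', torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        centers_batch = self.centers[labels]            # each sample's class center
        return 0.5 * ((features - centers_batch) ** 2).sum(dim=1).mean()

    @torch.no_grad()
    def update_centers(self, features, labels, alpha=0.5):
        # delta_c_j = sum_{y_i = j} (c_j - x_i) / (1 + count_j)
        for j in labels.unique():
            mask = labels == j
            delta = (self.centers[j] - features[mask]).sum(0) / (1 + mask.sum())
            self.centers[j] -= alpha * delta
```

In each iteration one would compute `total = cross_entropy(logits, labels) + lam * center(features, labels)`, backpropagate, step the optimizer, and then call `center.update_centers(features.detach(), labels)`.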

2. CosFace

Form of the loss:

$$\mathcal{L}_{lmc} = -\frac{1}{N}\sum_{i} \log \frac{e^{s(\cos\theta_{y_i, i} - m)}}{e^{s(\cos\theta_{y_i, i} - m)} + \sum_{j \neq y_i} e^{s \cos\theta_{j, i}}}$$

where both the weights and the features are L2-normalized, so $\cos\theta_{j,i} = W_j^T x_i$; $s$ is the feature scale and $m$ is the cosine margin.

In face recognition algorithms such as CosFace and ArcFace, what does the angle $\theta$ mean? An answer follows.

In face recognition systems such as CosFace and ArcFace, the theta angle refers 
to the angle between the feature embedding vector of a face and the weight vector
associated with the ground truth label of that face.

In more detail, the feature embedding vector is a high-dimensional representation 
of a face that is generated by a deep neural network, while the weight vector is 
a learned parameter vector associated with the ground truth label of that face. 
The cosine of the theta angle is then the cosine similarity between these two 
vectors, which measures the degree of alignment between the feature embedding 
vector and the weight vector.

The use of the theta angle in face recognition systems allows for a more 
discriminative and robust feature representation of faces, leading to improved 
performance in face recognition tasks.

Then what is the feature embedding vector? An answer follows.


The feature embedding vector is a high-dimensional numerical representation of a 
face that is learned by a deep neural network. In face recognition systems, the 
neural network is typically trained on a large dataset of face images in an 
unsupervised or supervised manner to learn a feature representation that is 
highly discriminative and robust to variations in lighting, pose, and other 
factors.


The output of the neural network for a given face image is a feature embedding 
vector, which can be thought of as a set of numerical values that encode the 
salient characteristics of the face. These values can be used to compare and 
match faces for tasks such as identification or verification.

The feature embedding vector is typically a high-dimensional vector, ranging from 
a few hundred to a few thousand dimensions, depending on the specific neural 
network architecture used. The dimensions of the vector are not directly 
interpretable, but the values of the vector are learned to be highly informative 
for distinguishing between different faces.
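In face verification, for instance, two such embeddings are typically compared by cosine similarity (a sketch with made-up vectors and an illustrative threshold):

```python
import torch
import torch.nn.functional as F

emb_a = F.normalize(torch.randn(1, 512), dim=1)   # embedding of face A
emb_b = F.normalize(torch.randn(1, 512), dim=1)   # embedding of face B
similarity = (emb_a * emb_b).sum()                # cosine similarity in [-1, 1]
is_same_person = similarity > 0.5                 # threshold is dataset-dependent
```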

Having understood what $\theta$ means, we can better follow how these losses are designed. Look back at the CosFace loss given above.

Design idea:
Suppose there are two angles $\theta_1$ and $\theta_2$. We know $\theta$ is the angle between the output feature vector and the weight vector of the ground-truth class. For the target class, reaching a given logit value with $\cos(\theta) - m$ requires a smaller angle $\theta_2$ than the angle $\theta_1$ that plain $\cos(\theta)$ would need. In other words, the margin $m$ makes the criterion harder to satisfy, and once the model has adapted to $m$, $\theta$ ends up smaller, i.e., each class becomes more compact. As a quick numeric intuition (made-up numbers): with $m = 0.35$, beating a competing class at $\theta_j = 70°$ ($\cos 70° \approx 0.34$) used to require only $\theta < 70°$; with the margin it requires $\cos\theta > 0.34 + 0.35 = 0.69$, i.e., roughly $\theta < 46°$.

[Figure: geometric comparison of the decision margins]
The CosFace loss implementation is below; look at the forward part:
first, the features and the weights are both L2-normalized;
then multiplying the two matrices directly yields the cosine values, since the norms are 1;
next, subtract m, and use the one-hot labels to pick out the entries that get the margin while keeping the plain cosine everywhere else;
finally multiply by s, return, and feed the result into the cross entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter


class AddMarginProduct(nn.Module):
    r"""Implementation of the large margin cosine distance (CosFace).
    Args:
        in_features: size of each input sample
        out_features: size of each output sample
        s: norm of input feature
        m: margin
        cos(theta) - m
    """

    def __init__(self, in_features, out_features, s=30.0, m=0.40):
        super(AddMarginProduct, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s
        self.m = m
        self.weight = Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, input, label):
        # --------------------------- cos(theta) & phi(theta) ---------------------------
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        phi = cosine - self.m
        # --------------------------- convert label to one-hot ---------------------------
        one_hot = torch.zeros(cosine.size(), device='cuda')
        # one_hot = one_hot.cuda() if cosine.is_cuda else one_hot
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        # torch.where(condition, x, y): out_i = x_i if condition_i else y_i
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        # equivalently: output = torch.where(one_hot.bool(), phi, cosine)
        output *= self.s
        # print(output)

        return output

    def __repr__(self):
        return self.__class__.__name__ + '(' \
               + 'in_features=' + str(self.in_features) \
               + ', out_features=' + str(self.out_features) \
               + ', s=' + str(self.s) \
               + ', m=' + str(self.m) + ')'
```
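A quick usage sketch (the 512-d embedding size and 1000 identities are illustrative; note the class above hard-codes device='cuda', so this assumes a GPU):

```python
margin_head = AddMarginProduct(in_features=512, out_features=1000).cuda()
criterion = nn.CrossEntropyLoss()

embeddings = torch.randn(8, 512).cuda()    # features from the backbone
labels = torch.randint(0, 1000, (8,)).cuda()

logits = margin_head(embeddings, labels)   # s * (cos(theta) - m) at target entries
loss = criterion(logits, labels)           # standard cross entropy on margined logits
```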

3. ArcFace

Let's look at the ArcFace pipeline. Much of it is similar to CosFace; the difference lies in how the angle is handled.

[Figure: ArcFace training pipeline]

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\,\cos(\theta_{y_i} + m)}}{e^{s\,\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s\,\cos\theta_j}}$$
The idea is much the same as CosFace: ArcFace adds a margin $m$ to $\theta$ itself, so that $\cos(\theta + m) < \cos(\theta)$ (on the monotonic interval $[0, \pi]$). The loss therefore grows, the learning task gets harder, and the model is pushed to make $\theta$ smaller.

[Figure: effect of the additive angular margin]

The ArcFace loss code is below.
Here m is an angle in radians, set to 0.5 in the experiments.
The only potentially confusing part is easy_margin. If easy_margin is true, the margin m is added to the angle only when the angle is below 90° (i.e., cosine > 0). If it is false, self.th acts as the threshold: sketch the cosine curve and you will see that on the monotonic interval $[0, \pi]$, cosine > th = $\cos(\pi - m)$ is equivalent to $\theta < \pi - m$. So th guarantees that $\theta + m$ still lies within the monotonic interval $[0, \pi]$; otherwise the code falls back to the linear penalty cosine − mm.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter


class ArcMarginProduct(nn.Module):
    r"""Implementation of the large margin arc distance (ArcFace).
        Args:
            in_features: size of each input sample
            out_features: size of each output sample
            s: norm of input feature
            m: margin
            cos(theta + m)
        """

    def __init__(self, in_features, out_features, s=30.0, m=0.50, easy_margin=False):
        super(ArcMarginProduct, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s
        self.m = m
        self.weight = Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

        self.easy_margin = easy_margin
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)
        self.th = math.cos(math.pi - m)
        self.mm = math.sin(math.pi - m) * m

    def forward(self, input, label):
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        # clamp to avoid NaN from sqrt of a tiny negative when cos is close to ±1
        sine = torch.sqrt((1.0 - torch.pow(cosine, 2)).clamp(0, 1))
        # cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
        phi = cosine * self.cos_m - sine * self.sin_m
        if self.easy_margin:
            phi = torch.where(cosine > 0, phi, cosine)
        else:
            phi = torch.where(cosine > self.th, phi, cosine - self.mm)
        one_hot = torch.zeros(cosine.size(), device='cuda')
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        output *= self.s
        return output
```
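A tiny numeric check of the threshold logic, outside the class (the two angles are made up):

```python
import math
import torch

theta = torch.tensor([0.3, 2.8])          # one easy angle, one close to pi
cosine = torch.cos(theta)
m = 0.5
phi = torch.cos(theta + m)                # target logit with the angular margin
th = math.cos(math.pi - m)                # passes only if theta < pi - m
mm = math.sin(math.pi - m) * m
# where theta + m would leave [0, pi], fall back to the linear penalty cos(theta) - mm
safe = torch.where(cosine > th, phi, cosine - mm)
print(safe)  # first entry keeps cos(theta + m); the second uses the fallback
```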
