Arcface v3 论文翻译与解读

论文地址:http://arxiv.org/pdf/1801.07698.pdf

Arcface v3 与 Arcface v1的内容有较大不同。建议先阅读Arcface v1 的论文,再看v3。可以参考我之前写的Arcface v1 论文翻译与解读


ArcFace: Additive Angular Margin Loss for Deep Face Recognition

目录

Abstract

1. Introduction

2. Proposed Approach

2.1. ArcFace

2.2. Comparison with SphereFace and CosFace

2.3. Comparison with Other Losses

3. Experiments

3.1. Implementation Details

3.2. Ablation Study on Losses

3.3. Evaluation Results

4. Conclusions

5. Appendix

5.1. Parallel Acceleration

5.2. Feature Space Analysis



Abstract

One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that enhance discriminative power. Centre loss penalises the distance between the deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in an angular space and penalises the angles between the deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to the exact correspondence to the geodesic distance on the hypersphere. We present arguably the most extensive experimental evaluation of all the recent state-of-the-art face recognition methods on over 10 face recognition benchmarks including a new large-scale image database with trillion level of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state-of-the-art and can be easily implemented with negligible computational overhead. We release all refined training data, training codes, pretrained models and training logs, which will help reproduce the results in this paper.
 

摘要

使用深度卷积神经网络(DCNNs)进行大规模人脸识别的特征学习,所面临的一大挑战是如何设计适当的损失函数,来增强类别之间的识别能力。中心损失惩罚了欧几里得空间中深层特征与其相应类中心之间的距离,以实现类内紧凑。SphereFace假设最后一个完全连接层中的线性变换矩阵(W)可以表示角度空间中的类别中心,并以乘法的方式惩罚深度特征与其相应权重之间的角度。最近,一个流行的研究方向是将margin纳入到已确立的损失函数中,以最大限度地提高人脸类别的可分性。在本文中,我们提出了一个加法角度间隔的损失(ArcFace),以获得人脸识别的高判别性特征。由于提出的ArcFace精确对应超球面的测地距离,因此具有清晰的几何解释。我们在超过10个人脸识别基准(包括一个具有万亿对人脸的新大规模图像数据库和一个大规模的视频数据集)上,对所有最新的state-of-the-art人脸识别方法进行了最广泛的实验评估。我们表明ArcFace始终优于state-of-the-art,并且可以用很小的计算成本轻松实现。我们发布了所有经过清洗的训练数据、训练代码、预训练模型和训练日志,这将有助于复现本文的结果。

1. Introduction

Face representation using Deep Convolutional Neural Network (DCNN) embedding is the method of choice for face recognition [32, 33, 29, 24]. DCNNs map the face image, typically after a pose normalisation step [45], into a feature that has small intra-class and large inter-class distance.

1. 介绍

使用深度卷积神经网络(Deep Convolutional Neural Network, DCNN)嵌入的人脸表示是人脸识别的首选方法[32, 33, 29, 24]。DCNNs将人脸图像(通常经过姿态归一化步骤[45])映射成一个具有类内小距离和类间大距离的特征。

图1. 基于中心[18]和特征[37]归一化,所有身份都分布在超球面上。为了提高类内紧密性和类间差异性,我们考虑了四种测地距离(GDis)约束。(A)Margin损失:在样本和中心之间插入测地距离margin。(B)类内损失:减小样本和相应中心之间的测地距离。(C)类间损失:增大不同类别中心之间的测地距离。(D)Triplet损失:在三元组样本之间插入测地距离margin。为了提高人脸识别模型的判别能力,本文提出了一种加法角度间隔损失(ArcFace),它精确对应(A)中的测地距离(弧)margin惩罚。大量的实验结果表明(A)策略是最有效的。

 

There are two main lines of research to train DCNNs for face recognition. Those that train a multi-class classifier which can separate different identities in the training set, such as by using a softmax classifier [33, 24, 6], and those that learn directly an embedding, such as the triplet loss [29]. Based on the large-scale training data and the elaborate DCNN architectures, both the softmax-loss-based methods [6] and the triplet-loss-based methods [29] can obtain excellent performance on face recognition. However, both the softmax loss and the triplet loss have some drawbacks. For the softmax loss: (1) the size of the linear transformation matrix W\in \mathbb{R}^{d\times n} increases linearly with the identities number n; (2) the learned features are separable for the closed-set classification problem but not discriminative enough for the open-set face recognition problem. For the triplet loss: (1) there is a combinatorial explosion in the number of face triplets especially for large-scale datasets, leading to a significant increase in the number of iteration steps; (2) semi-hard sample mining is a quite difficult problem for effective model training.

训练用于人脸识别的DCNNs有两个主要的研究方向:一类是训练一个可以在训练集中分离不同身份的多类分类器,例如使用softmax分类器[33, 24, 6];另一类是直接学习嵌入(embedding),例如triplet loss[29]。基于大规模训练数据集和精心设计的DCNN架构,基于softmax loss的方法[6]和基于triplet loss的方法[29]都可以在人脸识别上获得优异的性能。然而,softmax loss和triplet loss都有一些缺点。

对于Softmax loss:(1)线性变换矩阵W\in \mathbb{R}^{d\times n} 的大小随着身份数量n的增加而线性增大;(2)对于closed-set 分类问题,所学习的特征是可分离的,但对于open-set人脸识别问题,判别性不够。

对于triplet loss:(1)特别是对于大型数据集,人脸三元组的数量出现爆炸式增长,导致迭代次数显著增加;(2) 对于有效训练模型来说,semi-hard样本挖掘是一个相当困难的问题。

 

Several variants [38, 9, 46, 18, 37, 35, 7, 34, 27] have been proposed to enhance the discriminative power of the softmax loss. Wen et al. [36] pioneered the centre loss, the Euclidean distance between each feature vector and its class centre, to obtain intra-class compactness while the inter-class dispersion is guaranteed by the joint penalisation of the softmax loss. Nevertheless, updating the actual centres during training is extremely difficult as the number of face classes available for training has recently dramatically increased.

已经有几种变体[38, 9, 46, 18, 37, 35, 7, 34, 27]被提出来增强softmax loss的判别能力。Wen等人[36]率先提出了中心损失,即每个特征向量与其类中心之间的欧氏距离,用来获得类内紧凑性,而类间分散性则由softmax loss的联合惩罚来保证。然而,随着可供训练的人脸类别数量近来急剧增加,在训练期间更新实际的类别中心变得非常困难。

 

By observing that the weights from the last fully connected layer of a classification DCNN trained on the softmax loss bear conceptual similarities with the centres of each face class, the works in [18, 19] proposed a multiplicative angular margin penalty to enforce extra intra-class compactness and inter-class discrepancy simultaneously, leading to a better discriminative power of the trained model. Even though SphereFace [18] introduced the important idea of angular margin, their loss function required a series of approximations in order to be computed, which resulted in an unstable training of the network. In order to stabilise training, they proposed a hybrid loss function which includes the standard softmax loss. Empirically, the softmax loss dominates the training process, because the integer-based multiplicative angular margin makes the target logit curve very precipitous and thus hinders convergence. CosFace [37, 35] directly adds cosine margin penalty to the target logit, which obtains better performance compared to SphereFace but admits much easier implementation and relieves the need for joint supervision from the softmax loss.

通过观察发现,用softmax loss训练的分类DCNN的最后一个全连接层的权重,与每个人脸类别的中心在概念上具有相似性,[18, 19]中据此提出了一种乘法角度间隔惩罚,以同时加强类内紧凑和类间差异,使训练出的模型具有更好的判别能力。尽管SphereFace[18]引入了角度间隔这一重要概念,但它的损失函数需要一系列近似才能计算,从而导致网络训练不稳定。为了稳定训练,他们提出了一个混合损失函数,其中包括标准的softmax loss。经验表明,softmax loss在训练过程中占主导地位,因为基于整数的乘法角度间隔使得目标logit曲线非常陡峭,从而阻碍了收敛。CosFace[37, 35]直接在目标logit上添加余弦间隔惩罚,与SphereFace相比,它获得了更好的性能,实现也更容易,并且去除了softmax loss联合监督的需要。

图2. 在ArcFace损失的监督下,训练一个用于人脸识别的DCNN。基于特征 x_{i} 和权重 W 的归一化,我们得到每个类别的logit cos\theta _{j}(即 W_{j}^{T}x_{i})。对目标类别的logit取反余弦,得到特征 x_{i} 和真实类别权重 W_{y_{i}} 之间的夹角 \theta _{y_{i}}。事实上,W_{j} 为每个类别提供了一个中心。然后,我们在目标(ground truth)角 \theta _{y_{i}} 上添加一个角度间隔惩罚 m,计算 cos\left ( \theta _{y_{i}}+m \right ),并将所有的logit乘以特征缩放因子 s。之后这些logit经过softmax函数,得到的概率再参与交叉熵损失的计算。

算法1  ArcFace在MxNet上的伪代码

输入:特征缩放因子s ,式子3中的间隔参数 m,类别数量 n,真实ID gt 。

x = mx.symbol.L2Normalization(x, mode='instance')
W = mx.symbol.L2Normalization(W, mode='instance')
fc7 = mx.sym.FullyConnected(data=x, weight=W, no_bias=True, num_hidden=n)
original_target_logit = mx.sym.pick(fc7, gt, axis=1)
theta = mx.sym.arccos(original_target_logit)
marginal_target_logit = mx.sym.cos(theta + m)
one_hot = mx.sym.one_hot(gt, depth=n, on_value=1.0, off_value=0.0)
fc7 = fc7 + mx.sym.broadcast_mul(one_hot, mx.sym.expand_dims(marginal_target_logit - original_target_logit, 1))
fc7 = fc7 * s

输出:各类别的相似度得分(class-wise affinity score)fc7。
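
为便于核对算法1的逻辑,下面给出一段等价的NumPy示意代码(非原文内容,函数名与变量名均为示意),演示对目标logit的修改过程:

import numpy as np

def arcface_logits(x, W, gt, s=64.0, m=0.5):
    """示意:对真实类别的logit加上角度间隔m,再整体乘以缩放因子s。"""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)           # 特征按行做L2归一化
    W = W / np.linalg.norm(W, axis=0, keepdims=True)           # 权重按列做L2归一化
    cos_theta = x @ W                                          # fc7:各类别的cos(theta_j)
    idx = np.arange(len(gt))
    theta = np.arccos(np.clip(cos_theta[idx, gt], -1.0, 1.0))  # 目标类别的角度
    cos_theta[idx, gt] = np.cos(theta + m)                     # 只替换真实类别对应的logit
    return s * cos_theta                                       # 乘以特征缩放因子s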

 

In this paper, we propose an Additive Angular Margin Loss (ArcFace) to further improve the discriminative power of the face recognition model and to stabilise the training process. As illustrated in Figure 2, the dot product between the DCNN feature and the last fully connected layer is equal to the cosine distance after feature and weight normalisation. We utilise the arc-cosine function to calculate the angle between the current feature and the target weight. Afterwards, we add an additive angular margin to the target angle, and we get the target logit back again by the cosine function. Then, we re-scale all logits by a fixed feature norm, and the subsequent steps are exactly the same as in the softmax loss. The advantages of the proposed ArcFace can be summarised as follows:

Engaging. ArcFace directly optimises the geodesic distance margin by virtue of the exact correspondence between the angle and arc in the normalised hypersphere. We intuitively illustrate what happens in the 512-D space via analysing the angle statistics between features and weights.

Effective. ArcFace achieves state-of-the-art performance on ten face recognition benchmarks including large-scale image and video datasets.

Easy. ArcFace only needs several lines of code as given in Algorithm 1 and is extremely easy to implement in the computational-graph-based deep learning frameworks, e.g. MxNet [8], Pytorch [25] and Tensorflow [4]. Furthermore, contrary to the works in [18, 19], ArcFace does not need to be combined with other loss functions in order to have stable performance, and can easily converge on any training datasets.

Efficient. ArcFace only adds negligible computational complexity during training. Current GPUs can easily support millions of identities for training and the model parallel strategy can easily support many more identities.

为了进一步提高人脸识别模型的判别能力,稳定训练过程,本文提出了一种加法角度间隔损失 (ArcFace) 算法。如图2所示,DCNN特征和最后一个完全连接层之间的点积等于特征和权重归一化后的余弦距离。我们使用反余弦函数来计算当前特征和目标权重之间的夹角。然后,我们在目标角上添加一个加法角度间隔,然后通过余弦函数复原,获得目标logit。然后,我们用一个固定的特征范数重新缩放所有的logit,随后的步骤与Softmax loss中的步骤完全相同。提出的ArcFace的优点可以总结如下:

吸引人。由于归一化超球面中的角和弧精确对应,ArcFace可以直接优化测地距离margin。我们通过分析特征和权重之间的角度统计,直观地展示了512维空间中发生了什么。

有效。 Arcface在10个人脸识别基准上实现了state-of-the-art的性能,包括大规模图像和视频数据集。

容易。ArcFace只需要如算法1所示的几行代码,并且非常容易在基于计算图(computational-graph-based)的深度学习框架中实现,例如MxNet[8]、PyTorch[25]和TensorFlow[4]。此外,与[18, 19]中的工作相比,ArcFace不需要与其他损失函数相结合来获得稳定的性能,并且可以很容易地在任何训练数据集上收敛。

高效。在训练过程中,Arcface只增加了可忽略的计算复杂性。当前的GPU可以轻松支持数百万个身份训练,模型并行策略可以轻松支持更多的身份。

2. Proposed Approach

2.1. ArcFace

The most widely used classification loss function, softmax loss, is presented as follows:
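
L_{1}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_{i}}^{T}x_{i}+b_{y_{i}}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_{i}+b_{j}}} \qquad (1)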

where x_{i} \in \mathbb{R}^{d}  denotes the deep feature of the i-th sample, belonging to the y_{i} -th class. The embedding feature dimension d is set to 512 in this paper following [38, 46, 18, 37]. W_{j} \in \mathbb{R}^{d} denotes the j-th column of the weight W\in \mathbb{R}^{d\times n} and b_{j} \in \mathbb{R}^{n} is the bias term. The batch size and the class number are N and n, respectively. Traditional softmax loss is widely used in deep face recognition [22, 5]. However,the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations (e.g. pose variations [30, 48] and age gaps [22, 49]) and large-scale test scenarios (e.g. million [15, 39, 21] or trillion pairs [2]).

2. 提出方法

2.1. ArcFace

最广泛使用的分类损失函数Softmax Loss如下所示:

式中,x_{i} \in \mathbb{R}^{d} 表示第 i 个样本的深层特征,属于第 y_{i} 类。参照[38, 46, 18, 37],本文将嵌入特征维数 d 设为512。W_{j} \in \mathbb{R}^{d} 表示权重 W\in \mathbb{R}^{d\times n} 的第 j 列,b_{j} \in \mathbb{R}^{n} 为偏置项。批大小和类别数量分别为 N 和 n。传统的softmax loss在深度人脸识别中得到了广泛应用[22, 5]。然而,softmax loss函数并没有显式地优化特征嵌入,使类内样本具有更高的相似度、类间样本具有更大的差异性,这导致在类内外观变化较大(例如姿态变化[30, 48]和年龄差异[22, 49])以及大规模测试场景(例如百万对[15, 39, 21]或万亿对[2])下,深度人脸识别存在性能差距。

 

For simplicity, we fix the bias b_{j}=0 as in [18]. Then, we transform the logit [26] as W_{j}^{T}x_{i}=||W_{j}||||x_{i}||cos\theta _{j} , where \theta _{j} is the angle between the weight W_{j} and the feature x_{i} . Following [18, 37, 36], we fix the individual weight ||W_{j}||=1 by l_{2} normalisation. Following [28, 37, 36, 35], we also fix the embedding feature ||x_{i}|| by l_{2} normalisation and rescale it to s. The normalisation step on features and weights makes the predictions only depend on the angle between the feature and the weight. The learned embedding features are thus distributed on a hypersphere with a radius of s.

为了简单起见,像在[18]中那样,我们将偏差固定为b_{j}=0 。然后,我们将 logit [26] 变换为 W_{j}^{T}x_{i}=||W_{j}||||x_{i}||cos\theta _{j} ,其中\theta _{j} 是权重 W_{j} 和 特征 x_{i} 之间的角度。参照[18, 37, 36],我们通过L2归一化固定单个权重 ||W_{j}||=1 。参照 [28, 37, 36, 35],我们也通过L2归一化来固定嵌入特征||x_{i}|| ,并将其重新缩放成 s 。特征和权重的归一化步骤使预测仅取决于特征和权重之间的角度。因此,所学的嵌入特征分布在半径为s的超球体上。
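
经过上述归一化并将特征缩放为 s 之后,损失函数即原文中的式2(Norm-Softmax):

L_{2}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta _{y_{i}}}}{e^{s\cos\theta _{y_{i}}}+\sum_{j=1,j\neq y_{i}}^{n}e^{s\cos\theta _{j}}}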

 

As the embedding features are distributed around each feature centre on the hypersphere, we add an additive angular margin penalty m between x_{i} and W_{y_{i}} to simultaneously enhance the intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is equal to the geodesic distance margin penalty in the normalised hypersphere, we name our method as ArcFace.

由于嵌入特征分布在超球面上各个特征中心的周围,我们在 x_{i} 和 W_{y_{i}} 之间增加了一个加法角度间隔惩罚 m,从而同时增强类内紧凑性和类间差异性。由于所提出的加法角度间隔惩罚等于归一化超球面中的测地距离间隔惩罚,我们将该方法命名为ArcFace。
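
加入角度间隔 m 之后,即得到原文中的式3(ArcFace损失):

L_{3}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\left ( \theta _{y_{i}}+m \right )}}{e^{s\cos\left ( \theta _{y_{i}}+m \right )}+\sum_{j=1,j\neq y_{i}}^{n}e^{s\cos\theta _{j}}}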

 

We select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2D feature embedding networks with the softmax and ArcFace loss, respectively. As illustrated in Figure 3, the softmax loss provides roughly separable feature embedding but produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss can obviously enforce a more evident gap between the nearest classes.

我们从8个不同的身份中选择了包含足够样本的人脸图像(大约1500张/类),分别用softmax和ArcFace损失来训练2D特征嵌入网络。如图3所示,softmax损失提供了粗略可分离的嵌入特征,但在决策边界上产生了明显的模糊性,而所提出的ArcFace损失显然会在最接近的类别间产生更明显的差距。

图3. 在softmax和ArcFace损失下,8个身份的2D特征的玩具实验。点表示样本,线表示每个身份的中心方向。基于特征归一化,所有的人脸特征都被推到一个固定半径的弧(arc)空间上。当引入加法角度间隔惩罚后,最接近的类别之间的测地距离差距变得明显。

 

2.2. Comparison with SphereFace and CosFace

Numerical Similarity. In SphereFace [18, 19], ArcFace, and CosFace [37, 35], three different kinds of margin penalty are proposed, e.g. multiplicative angular margin m_{1} , additive angular margin m_{2} , and additive cosine margin m_{3} , respectively. From the view of numerical analysis, different margin penalties, no matter add on the angle [18] or cosine space [37], all enforce the intra-class compactness and inter-class diversity by penalising the target logit [26]. In Figure 4(b), we plot the target logit curves of SphereFace, ArcFace and CosFace under their best margin settings. We only show these target logit curves within \left [ 20^{\circ },100^{\circ } \right ] because the angles between W_{y_{i}} and x_{i} start from around 90^{\circ } (random initialisation) and end at around 30^{\circ } during ArcFace training as shown in Figure 4(a). Intuitively, there are three factors in the target logit curves that affect the performance, i.e. the starting point, the end point and the slope.

2.2. 与SphereFace和CosFace进行比较

数值相似性。在SphereFace[18, 19]、ArcFace和CosFace[37, 35]中,分别提出了乘法角度间隔 m_{1}、加法角度间隔 m_{2} 和加法余弦间隔 m_{3} 三种不同的间隔惩罚。从数值分析的角度看,不同的间隔惩罚,无论是加在角度空间[18]还是余弦空间[37],都通过惩罚目标logit [26]来加强类内紧凑性和类间差异性。在图4(b)中,我们绘制了SphereFace、ArcFace和CosFace在各自最佳margin设置下的目标logit曲线。我们只在 \left [ 20^{\circ },100^{\circ } \right ] 范围内显示这些目标logit曲线,因为如图4(a)所示,在ArcFace训练过程中,W_{y_{i}} 和 x_{i} 之间的角度从大约 90^{\circ }(随机初始化)开始,到训练结束时降到 30^{\circ } 左右。从直观上看,目标logit曲线中影响性能的因素有三个,即起点、终点和斜率。

图4. 目标logit分析。(a)Arcface训练期间,从开始到结束 \theta _{j} 的分布。(b)对于softmax、SphereFace、ArcFace、CosFace和联合margin惩罚 \left ( cos\left ( m_{1}\theta +m_{2} \right )-m_{3} \right ) 的目标logit曲线。

 

By combining all of the margin penalties, we implement SphereFace, ArcFace and CosFace in an united framework with m_{1}, m_{2 } and m_{3} as the hyper-parameters.

As shown in Figure 4(b), by combining all of the above-mentioned margins \left ( cos\left ( m_{1}\theta +m_{2} \right )-m_{3} \right ), we can easily get some other target logit curves which also have high performance.

通过合并全部的间隔惩罚,我们以 m_{1}、m_{2} 和 m_{3} 作为超参数,在一个统一的框架下实现了SphereFace、ArcFace和CosFace。

如图4(b)所示,通过合并上述所有的margin,我们可以很容易地得到其他一些同样具有高性能的目标logit曲线。
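
统一框架对应原文中的式4,即把目标logit替换为 cos\left ( m_{1}\theta _{y_{i}}+m_{2} \right )-m_{3}:

L_{4}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\left ( \cos\left ( m_{1}\theta _{y_{i}}+m_{2} \right )-m_{3} \right )}}{e^{s\left ( \cos\left ( m_{1}\theta _{y_{i}}+m_{2} \right )-m_{3} \right )}+\sum_{j=1,j\neq y_{i}}^{n}e^{s\cos\theta _{j}}}

下面用一小段NumPy示意代码(非原文内容,函数名为示意)说明三种方法如何作为该框架的特例:

import numpy as np

def combined_target_logit(cos_theta, m1=1.0, m2=0.0, m3=0.0):
    # 统一框架:对目标类别的logit计算 cos(m1*theta + m2) - m3
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return np.cos(m1 * theta + m2) - m3

# SphereFace: m1=1.35, m2=0,   m3=0
# ArcFace:    m1=1,    m2=0.5, m3=0
# CosFace:    m1=1,    m2=0,   m3=0.35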

 

Geometric Difference. Despite the numerical similarity between ArcFace and previous works, the proposed additive angular margin has a better geometric attribute as the angular margin has the exact correspondence to the geodesic distance. As illustrated in Figure 5, we compare the decision boundaries under the binary classification case. The proposed ArcFace has a constant linear angular margin throughout the whole interval. By contrast, SphereFace and CosFace only have a nonlinear angular margin.

几何差异。尽管ArcFace在数值上与以往的工作有相似之处,由于角度间隔与测地距离的精确对应,所提出的加法角度间隔具有更好的几何属性。如图5所示,我们比较了二分类情况下的决策边界。所提出的Arcface在整个区间内具有恒定的线性角度间隔。相比之下,SphereFace和CosFace只有一个非线性的角度间隔。

图5. 二分类情况下不同损失函数的决策间隔。虚线表示决策边界,灰色区域表示决策间隔。

 

The minor difference in margin designs can have “butterfly effect” on the model training. For example, the original SphereFace [18] employs an annealing optimisation strategy. To avoid divergence at the beginning of training, joint supervision from softmax is used in SphereFace to weaken the multiplicative margin penalty. We implement a new version of SphereFace without the integer requirement on the margin by employing the arc-cosine function instead of using the complex double angle formula. In our implementation, we find that m = 1.35 can obtain similar performance compared to the original SphereFace without any convergence difficulty.

间隔设计的细微差异会对模型训练产生“蝴蝶效应”。例如,原始的SphereFace[18]采用了退火优化策略。为了避免在训练开始时出现发散,在SphereFace中使用了softmax的联合监督来弱化乘法角度间隔惩罚。利用反余弦函数代替复杂的倍角公式,实现了一种对margin无整数要求的新版本SphereFace。在我们的实现中,我们发现 m = 1.35 可以获得与original SphereFace相似的性能,并且没有任何收敛困难。
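
作为示意(非原文代码,变量名沿用算法1,属于假设),用反余弦函数实现不要求整数margin的SphereFace目标logit大致如下:

theta = mx.sym.arccos(original_target_logit)
marginal_target_logit = mx.sym.cos(m * theta)   # m 不再要求为整数,例如 m = 1.35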

 

2.3. Comparison with Other Losses

Other loss functions can be designed based on the angular representation of features and weight-vectors. For examples, we can design a loss to enforce intra-class compactness and inter-class discrepancy on the hypersphere. As shown in Figure 1, we compare with three other losses in this paper.

Intra-Loss is designed to improve the intra-class compactness by decreasing the angle/arc between the sample and the ground truth centre.

Inter-Loss targets at enhancing inter-class discrepancy by increasing the angle/arc between different centres.

The Inter-Loss here is a special case of the Minimum Hyper-spherical Energy (MHE) method [17]. In [17], both hidden layers and output layers are regularised by MHE. In the MHE paper, a special case of loss function was also proposed by combining the SphereFace loss with MHE loss on the last layer of the network.

Triplet-loss aims at enlarging the angle/arc margin between triplet samples. In FaceNet [29], Euclidean margin is applied on the normalised features. Here, we employ the triplet-loss by the angular representation of our features as arccos\left ( x_{i}^{pos}x_{i} \right )+m\leqslant arccos\left ( x_{i}^{neg}x_{i} \right ).

2.3. 与其他损失函数的比较

其他损失函数可以根据特征和权重向量的角度表示来设计。例如,我们可以设计损失来加强超球面上的类内紧凑性和类间差异性。如图1所示,本文与其他三种损失进行了比较。

Intra-Loss的目的是通过减小样本与真实中心之间的角度/弧来提高类内紧凑性。

Inter-Loss的目标是通过加大不同中心之间的角度/弧来加强类间差异。

这里的Inter-Loss是最小超球能量(MHE)方法[17]的一个特例。在[17]中,隐藏层和输出层都由MHE进行正则化。文中还提出了一种特殊情况下的损失函数,将SphereFace loss和MHE loss在网络的最后一层相结合。

Triplet-loss旨在扩大三元组样本之间的角度/弧间隔。在FaceNet[29]中,欧几里得间隔被应用在归一化的特征上。在这里,我们用特征的角度表示来使用triplet loss,即 arccos\left ( x_{i}^{pos}x_{i} \right )+m\leqslant arccos\left ( x_{i}^{neg}x_{i} \right )。
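
下面是一段示意性的NumPy代码(非原文内容,变量名为示意),按上式用角度表示计算triplet损失:

import numpy as np

def angular_triplet_loss(anchor, pos, neg, m=0.5):
    # 假设三个特征矩阵均已L2归一化,逐行点积即余弦相似度
    theta_pos = np.arccos(np.clip(np.sum(anchor * pos, axis=1), -1.0, 1.0))
    theta_neg = np.arccos(np.clip(np.sum(anchor * neg, axis=1), -1.0, 1.0))
    return np.mean(np.maximum(theta_pos + m - theta_neg, 0.0))  # 违反角度间隔约束时产生损失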

 

3. Experiments

3.1. Implementation Details

Datasets. As given in Table 1, we separately employ CASIA [43], VGGFace2 [6], MS1MV2 and DeepGlint-Face (including MS1M-DeepGlint and Asian-DeepGlint) [2] as our training data in order to conduct fair comparison with other methods. Please note that the proposed MS1MV2 is a semi-automatic refined version of the MS-Celeb-1M dataset [10]. To best of our knowledge, we are the first to employ ethnicity-specific annotators for large-scale face image annotations, as the boundary cases (e.g. hard samples and noisy samples) are very hard to distinguish if the annotator is not familiar with the identity. During training, we explore efficient face verification datasets (e.g. LFW [13], CFP-FP [30], AgeDB-30 [22]) to check the improvement from different settings. Besides the most widely used LFW [13] and YTF [40] datasets, we also report the performance of ArcFace on the recent large-pose and large-age datasets(e.g. CPLFW [48] and CALFW [49]). We also extensively test the proposed ArcFace on large-scale image datasets (e.g. MegaFace [15], IJB-B [39], IJB-C [21] and Trillion-Pairs [2]) and video datasets (iQIYI-VID [20]).

3. 实验

3.1. 实现细节

数据集。如表1所示,我们分别使用CASIA[43]、VGGFace2[6]、MS1MV2和DeepGlint-Face(包括MS1M-DeepGlint和Asian-DeepGlint)[2]作为训练数据,以便与其他方法进行公平比较。请注意,提出的MS1MV2是MS-Celeb-1M数据集[10]经过半自动清洗后的版本。据我们所知,我们是第一个雇佣特定种族的标注者进行大规模人脸图像标注的工作,因为如果标注者不熟悉该身份,边界情况(例如困难样本和噪声样本)是很难区分的。在训练中,我们使用高效的人脸验证数据集(如LFW[13]、CFP-FP[30]、AgeDB-30[22])来检查不同设置下的改进。除了最广泛使用的LFW[13]和YTF[40]数据集外,我们还报告了ArcFace在最近的大姿态和大年龄跨度数据集(例如CPLFW[48]和CALFW[49])上的性能。我们还在大规模图像数据集(如MegaFace[15]、IJB-B[39]、IJB-C[21]和Trillion-Pairs[2])和视频数据集(iQIYI-VID[20])上广泛测试了所提出的ArcFace。

表1. 用于训练和测试的人脸数据集。“(P)”和“(G)”分别指probe集(测试图像集)和gallery集(参考图像集)。

有关gallery和probe的概念,可以参考这篇文章:https://blog.csdn.net/u011557212/article/details/60963237

 

Experimental Settings. For data preprocessing, we follow the recent papers [18, 37] to generate the normalised face crops (112 × 112) by utilising five facial points. For the embedding network, we employ the widely used CNN architectures, ResNet50 and ResNet100 [12, 11]. After the last convolutional layer, we explore the BN [14]-Dropout [31]-FC-BN structure to get the final 512-D embedding feature. In this paper, we use ([training dataset, network structure, loss]) to facilitate understanding of the experimental settings.

实验设置。在数据预处理方面,我们按照最近的文献[18, 37],利用5个面部关键点生成归一化的人脸裁剪图(112×112)。对于嵌入网络,我们采用广泛使用的CNN架构ResNet50和ResNet100[12, 11]。在最后一个卷积层之后,我们使用BN[14]-Dropout[31]-FC-BN结构,得到最终的512维嵌入特征。在本文中,我们使用([训练数据集, 网络结构, 损失])的形式来帮助理解实验设置。

 

We follow [37] to set the feature scale s to 64 and choose the angular margin m of ArcFace at 0.5. All experiments in this paper are implemented by MXNet [8]. We set the batch size to 512 and train models on four NVIDIA Tesla P40 (24GB) GPUs. On CASIA, the learning rate starts from 0.1 and is divided by 10 at 20K, 28K iterations. The training process is finished at 32K iterations. On MS1MV2, we divide the learning rate at 100K,160K iterations and finish at 180K iterations. We set momentum to 0.9 and weight decay to 5e - 4. During testing, we only keep the feature embedding network without the fully connected layer (160MB for ResNet50 and 250MB for ResNet100) and extract the 512-D features (8.9 ms/face for ResNet50 and 15.4 ms/face for ResNet100) for each normalised face. To get the embedding features for templates (e.g. IJB-B and IJB-C) or videos (e.g. YTF and iQIYI-VID), we simply calculate the feature centre of all images from the template or all frames from the video. Note that, overlap identities between the training set and the test set are removed for strict evaluations, and we only use a single crop for all testing.

我们按照[37]将特征缩放因子 s 设置为64,并将ArcFace的角度间隔 m 选为0.5。本文所有的实验均用MXNet[8]实现。我们将批大小设置为512,并在4块NVIDIA Tesla P40(24GB)GPU上训练模型。在CASIA上,学习率从0.1开始,在20K、28K次迭代时除以10,训练过程在32K次迭代时结束。在MS1MV2上,我们在100K、160K次迭代时将学习率除以10,在180K次迭代时结束。我们设置动量为0.9,权重衰减为5e-4。在测试过程中,我们只保留去掉全连接层的特征嵌入网络(ResNet50为160MB,ResNet100为250MB),并对每张归一化的人脸提取512维特征(ResNet50为8.9 ms/张,ResNet100为15.4 ms/张)。为了获取模板(如IJB-B和IJB-C)或视频(如YTF和iQIYI-VID)的嵌入特征,我们简单地计算模板中所有图像或视频中所有帧的特征中心。注意,为了严格评估,训练集和测试集之间重叠的身份会被删除,并且所有测试都只使用单一裁剪(single crop)。
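
按照上述超参数,一个示意性的MXNet优化器配置大致如下(非原文代码,仅作示例,具体以官方开源训练脚本为准):

import mxnet as mx

# MS1MV2上的学习率策略:从0.1开始,在100K、160K次迭代时除以10
lr_scheduler = mx.lr_scheduler.MultiFactorScheduler(step=[100000, 160000], factor=0.1)
optimizer = mx.optimizer.SGD(learning_rate=0.1, momentum=0.9, wd=5e-4,
                             lr_scheduler=lr_scheduler)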

 

3.2. Ablation Study on Losses

In Table 2, we first explore the angular margin setting for ArcFace on the CASIA dataset with ResNet50. The best margin observed in our experiments was 0.5. Using the proposed combined margin framework in Eq. 4, it is easier to set the margin of SphereFace and CosFace which we found to have optimal performance when setting at 1.35 and 0.35, respectively. Our implementations for both SphereFace and CosFace can lead to excellent performance without observing any difficulty in convergence. The proposed ArcFace achieves the highest verification accuracy on all three test sets. In addition, we performed extensive experiments with the combined margin framework (some of the best performance was observed for CM1 (1, 0.3, 0.2) and CM2 (0.9, 0.4, 0.15)) guided by the target logit curves in Figure 4(b). The combined margin framework led to better performance than individual SphereFace and CosFace but upper-bounded by the performance of ArcFace.

3.2. 损失函数的消融研究

如表2所示,我们首先使用ResNet50在CASIA数据集上探索ArcFace的角度间隔设置。在我们的实验中观察到的最佳间隔是0.5。使用式4中提出的联合间隔(combined margin)框架,可以更容易地设置SphereFace和CosFace的margin,我们发现分别设置为1.35和0.35时性能最佳。我们对SphereFace和CosFace的实现都能获得出色的性能,并且没有观察到任何收敛困难。提出的ArcFace在全部三个测试集上都达到了最高的验证精度。此外,我们在图4(b)中目标logit曲线的引导下,对联合间隔框架进行了大量实验(观察到CM1(1, 0.3, 0.2)和CM2(0.9, 0.4, 0.15)时性能最佳)。联合间隔框架比单独的SphereFace和CosFace具有更好的性能,但其上限是ArcFace的性能。

表2. 不同损失函数的验证结果(%)([CASIA,ResNet50, loss*])

 

Besides the comparison with margin-based methods, we conduct a further comparison between ArcFace and other losses which aim at enforcing intra-class compactness (Eq. 5) and inter-class discrepancy (Eq. 6). As the baseline we have chosen the softmax loss and we have observed performance drop on CFP-FP and AgeDB-30 after weight and feature normalisation. By combining the softmax with the intra-class loss, the performance improves on CFP-FP and AgeDB-30. However, combining the softmax with the inter-class loss only slightly improves the accuracy. The fact that Triplet-loss outperforms Norm-Softmax loss indicates the importance of margin in improving the performance. However, employing margin penalty within triplet samples is less effective than inserting margin between samples and centres as in ArcFace. Finally, we incorporate the Intra-loss, Inter-loss and Triplet-loss into ArcFace, but no improvement is observed, which leads us to believe that ArcFace is already enforcing intra-class compactness, inter-class discrepancy and classification margin.

除了与基于margin的方法比较,我们还进一步对比了ArcFace与其他旨在增强类内紧凑性(式5)和类间差异性(式6)的损失。我们选择softmax损失作为baseline,并观察到在权重和特征归一化后,CFP-FP和AgeDB-30上的性能有所下降。通过将softmax与类内损失(Intra-Loss)相结合,可以提高CFP-FP和AgeDB-30上的性能。然而,将softmax与类间损失(Inter-Loss)结合只能略微提高准确率。Triplet-loss优于Norm-Softmax这一事实表明了margin对于提升性能的重要性。然而,在三元组样本之间使用间隔惩罚,不如像ArcFace那样在样本和中心之间插入间隔有效。最后,我们将Intra-loss、Inter-loss和Triplet-loss合并到ArcFace中,但没有观察到提升,这使我们相信ArcFace已经在加强类内紧凑性、类间差异性和分类间隔。

 

To get a better understanding of ArcFace's superiority, we give the detailed angle statistics on training data (CASIA) and test data (LFW) under different losses in Table 3. We find that (1) W_{j} is nearly synchronised with the embedding feature centre for ArcFace ( 14.29^{\circ } ), but there is an obvious deviation (44.26^{\circ }) between W_{j} and the embedding feature centre for Norm-Softmax. Therefore, the angles between W_{j} cannot absolutely represent the inter-class discrepancy on training data. Alternatively, the embedding feature centres calculated by the trained network are more representative. (2) Intra-Loss can effectively compress intra-class variations but also brings in smaller inter-class angles. (3) Inter-Loss can slightly increase inter-class discrepancy on both W (directly) and the embedding network (indirectly), but also raises intra-class angles. (4) ArcFace already has very good intra-class compactness and inter-class discrepancy. (5) Triplet-Loss has similar intra-class compactness but inferior inter-class discrepancy compared to ArcFace. In addition, ArcFace has a more distinct margin than Triplet-Loss on the test set as illustrated in Figure 6.

为了更好地了解ArcFace的优势,我们在表3中给出了不同损失下训练数据(CASIA)和测试数据(LFW)的详细角度统计。我们发现:(1)对于ArcFace,W_{j} 与嵌入特征中心几乎是同步的(14.29^{\circ }),但对于Norm-Softmax,W_{j} 与嵌入特征中心存在明显的偏差(44.26^{\circ })。因此,W_{j} 之间的角度不能完全代表训练数据上的类间差异;相比之下,由训练好的网络计算出的嵌入特征中心更具代表性。(2)Intra-Loss可以有效压缩类内变化,但也会带来更小的类间角度。(3)Inter-Loss可以略微增加 W 上(直接)和嵌入网络上(间接)的类间差异,但也会增大类内角度。(4)ArcFace已经具有很好的类内紧凑性和类间差异性。(5)与ArcFace相比,Triplet-Loss具有相似的类内紧凑性,但类间差异性较差。此外,如图6所示,ArcFace在测试集上具有比Triplet-Loss更明显的间隔。

图6. LFW中所有正对和随机负对(~0.5M)的角度分布。红色区域表示正对,蓝色表示负对。所有的角都用度数表示。([CASIA ResNet50,loss*])。

表3. 不同损失下的角度统计([CASIA,ResNet50, loss*])。每一列表示一个特定的损失。“WEC”是指 W_{j} 对应嵌入特征中心夹角的平均值。“W-Inter”是指 W_{j} 之间最小角度的平均值。“Intra1”和“Intra2”分别为 x_{i} 与CASIA和LFW上嵌入特征中心夹角的平均值。“Inter1”和“Inter2”分别是指嵌入特征中心在CASIA和LFW上的最小角度的平均值。

 

3.3. Evaluation Results

Results on LFW, YTF, CALFW and CPLFW. LFW [13] and YTF [40] datasets are the most widely used benchmarks for unconstrained face verification on images and videos. In this paper, we follow the unrestricted with labelled outside data protocol to report the performance. As reported in Table 4, ArcFace trained on MS1MV2 with ResNet100 beats the baselines (e.g. SphereFace [18] and CosFace [37]) by a significant margin on both LFW and YTF, which shows that the additive angular margin penalty can notably enhance the discriminative power of deeply learned features, demonstrating the effectiveness of ArcFace.

3.3. 结果评估

在LFW、YTF、CALFW和CPLFW上的结果。LFW[13]和YTF[40]数据集是应用最广泛的图像和视频无约束人脸验证基准。在本文中,我们采用unrestricted with labelled outside data协议来报告性能。如表4所示,使用ResNet100在MS1MV2上训练的ArcFace,在LFW和YTF上都以明显的优势(margin)击败了baseline(例如SphereFace[18]和CosFace[37]),这表明加法角度间隔惩罚可以明显提高深度学习特征的判别能力,证明了ArcFace的有效性。

表4. 不同方法在LFW和YTF上的验证性能(%)

 

Besides on LFW and YTF datasets, we also report the performance of ArcFace on the recently introduced datasets (e.g. CPLFW [48] and CALFW [49]) which show higher pose and age variations with same identities from LFW. Among all of the open-sourced face recognition models, the ArcFace model is evaluated as the top-ranked face recognition model as shown in Table 5, outperforming counterparts by an obvious margin. In Figure 7, we illustrate the angle distributions (predicted by the ArcFace model trained on MS1MV2 with ResNet100) of both positive and negative pairs on LFW, CFP-FP, AgeDB-30, YTF, CPLFW and CALFW. We can clearly find that the intra-variance due to pose and age gaps significantly increases the angles between positive pairs, thus making the best threshold for face verification increase and generating more confusion regions on the histogram.

除了LFW和YTF数据集,我们还报告了ArcFace在最近引入的数据集(如CPLFW[48]和CALFW[49])上的性能,这些数据集与LFW采用相同的身份,但具有更大的姿态和年龄变化。在所有开源的人脸识别模型中,如表5所示,ArcFace模型被评估为排名第一的人脸识别模型,以明显的优势优于其他模型。在图7中,我们展示了LFW、CFP-FP、AgeDB-30、YTF、CPLFW和CALFW上正对和负对的角度分布(由使用ResNet100在MS1MV2上训练的ArcFace模型预测)。我们可以清楚地发现,由姿态和年龄差距引起的类内方差显著增大了正对之间的角度(从图上可以看到红色区域往两边伸展),从而使得人脸验证的最佳阈值增大,并在直方图上产生更多的混淆区域(红色区域和蓝色区域重叠的部分)。

表5. 基于LFW、CALFW和CPLFW的开源人脸识别模型的验证性能(%)

图7. LFW、CFP-FP、AgeDB-30、YTF、CPLFW、CALFW上正对和负对的角度分布。红色区域表示正对,蓝色表示负对。所有的角都用度数表示。([MS1MV2、ResNet100 ArcFace])

 

Results on MegaFace. The MegaFace dataset [15] includes 1M images of 690K different individuals as the gallery set and 100K photos of 530 unique individuals from FaceScrub [23] as the probe set. On MegaFace, there are two testing scenarios (identification and verification) under two protocols (large or small training set). The training set is defined as large if it contains more than 0.5M images. For the fair comparison, we train ArcFace on CASIA and MS1MV2 under the small protocol and large protocol, respectively. In Table 6, ArcFace trained on CASIA achieves the best single-model identification and verification performance, not only surpassing the strong baselines (e.g. SphereFace [18] and CosFace [37]) but also outperforming other published methods [38, 17].

在MegaFace数据集上的结果。MegaFace数据集[15]包括作为gallery集的69万个不同个体的100万张图像,以及作为probe集的来自FaceScrub[23]的530个个体的10万张照片。在MegaFace上,有两种测试场景(识别和验证),分别在两种协议(大训练集或小训练集)下进行。如果训练集包含超过50万张图像,则定义为大训练集。为了公平比较,我们分别在小协议下使用CASIA、在大协议下使用MS1MV2训练ArcFace。在表6中,在CASIA上训练的ArcFace实现了最好的单模型识别和验证性能,不仅超过了强大的baseline(如SphereFace[18]和CosFace[37]),还超过了其他已发表的方法[38, 17]。

表6. 以FaceScrub为probe集,MegaFace Challenge 1上不同方法的人脸识别与验证评估。“Id”表示含1M干扰项时的rank-1识别精度,“Ver”表示FAR为10^{-6}时的人脸验证TAR。“R”表示对probe集和1M干扰集进行了数据清洗。ArcFace在大、小协议下都获得了state-of-the-art的性能。

 

As we observed an obvious performance gap between identification and verification, we performed a thorough manual check in the whole MegaFace dataset and found many face images with wrong labels, which significantly affects the performance. Therefore, we manually refined the whole MegaFace dataset and report the correct performance of ArcFace on MegaFace. On the refined MegaFace,ArcFace still clearly outperforms CosFace and achieves the best performance on both verification and identification.

由于我们观察到识别和验证之间存在一个明显的性能差距,我们对整个MegaFace数据集进行了彻底的人工检查,发现很多人脸图像的标签都是错误的,这对性能有很大的影响。因此,我们手动清洗了整个MegaFace数据集,并给出了ArcFace在MegaFace上的正确性能。在清洗的MegaFace上,ArcFace仍然明显优于CosFace,在验证和识别上都取得了最好的性能。

Under the large protocol, ArcFace surpasses FaceNet [29] by a clear margin and obtains comparable results on identification and better results on verification compared to CosFace [37]. Since CosFace employs a private training data, we retrain CosFace on our MS1MV2 dataset with ResNet100. Under fair comparison, ArcFace shows superiority over CosFace and forms an upper envelope of CosFace under both identification and verification scenarios as shown in Figure 8.

在大协议下,ArcFace以明显的优势(margin)超过FaceNet[29],并获得了与CosFace[37]相当的识别结果和优于CosFace的验证结果。由于CosFace使用的是私有训练数据,我们使用ResNet100在我们的MS1MV2数据集上重新训练了CosFace。在公平比较下,ArcFace显示出对CosFace的优势,并且如图8所示,在识别和验证场景下ArcFace都形成了CosFace的上包络(也就是说ArcFace的曲线始终高于CosFace的曲线)。

图8. 不同模型在MegaFace上的CMC和ROC曲线。结果在原始和清洗后的MegaFace数据集上进行评估。

 

Results on IJB-B and IJB-C. The IJB-B dataset [39]contains 1,845 subjects with 21.8K still images and 55K frames from 7,011 videos. In total, there are 12,115 templates with 10,270 genuine matches and 8M impostor matches. The IJB-C dataset [39] is a further extension of IJB-B, having 3,531 subjects with 31.3K still images and 117.5K frames from 11,779 videos. In total, there are 23,124 templates with 19,557 genuine matches and 15,639K impostor matches.

在IJB-B和IJB-C上的结果。IJB-B数据集[39]包含1845个受试者、21.8K张静态图像和来自7011个视频的55K帧。总共有12115个模板,包含10270个真实匹配(genuine match)和800万个冒名匹配(impostor match)。IJB-C数据集[39]是IJB-B的进一步扩展,有3531个受试者、31.3K张静态图像和来自11779个视频的117.5K帧。总共有23124个模板,其中有19557个真实匹配和15639K个冒名匹配。

 

On the IJB-B and IJB-C datasets, we employ the VGG2 dataset as the training data and the ResNet50 as the embedding network to train ArcFace for the fair comparison with the most recent methods [6, 42, 41]. In Table 7, we compare the TAR (@FAR=1e-4) of ArcFace with the previous state-of-the-art models [6, 42, 41]. ArcFace can obviously boost the performance on both IJB-B and IJB-C (about 3~5%, which is a significant reduction in the error). Drawing support from more training data (MS1MV2) and deeper neural network (ResNet100), ArcFace can further improve the TAR (@FAR=1e-4) to 94.2% and 95.6% on IJB-B and IJB-C, respectively. In Figure 9, we show the full ROC curves of the proposed ArcFace on IJB-B and IJB-C, and ArcFace achieves impressive performance even at FAR=1e-6, setting a new baseline.

在IJB-B和IJB-C数据集上,我们使用VGG2数据集作为训练数据,使用ResNet50作为嵌入网络来训练ArcFace,以便与最新的方法[6, 42, 41]进行公平比较。在表7中,我们将ArcFace的TAR(@FAR=1e-4)与以往state-of-the-art的模型[6, 42, 41]进行比较。ArcFace可以明显地提高在IJB-B和IJB-C上的性能(约3~5%,误差显著降低)。借助更多的训练数据(MS1MV2)和更深的神经网络(ResNet100),ArcFace可以将TAR(@FAR=1e-4)在IJB-B和IJB-C上分别进一步提高到94.2%和95.6%。在图9中,我们展示了ArcFace在IJB-B和IJB-C上的完整ROC曲线,即使在FAR=1e-6时,ArcFace也取得了令人印象深刻的性能,树立了新的baseline。

表7. 在IJB-B和IJB-C数据集上的1:1验证TAR(@FAR=1e-4)。

图9. 在IJB-B和IJB-C数据集上1:1验证协议的ROC曲线。

 

Results on Trillion-Pairs. The Trillion-Pairs dataset [2] provides 1.58M images from Flickr as the gallery set and 274K images from 5.7K LFW [13] identities as the probe set. Every pair between gallery and probe set is used for evaluation (0.4 trillion pairs in total). In Table 8, we compare the performance of ArcFace trained on different datasets. The proposed MS1MV2 dataset obviously boosts the performance compared to CASIA and even slightly outperforms the DeepGlint-Face dataset, which has a double identity number. When combining all identities from MS1MV2 and Asian celebrities from DeepGlint, ArcFace achieves the best identification performance 84.840% (@FPR=1e-3) and comparable verification performance compared to the most recent submission (CIGIT IRSEC) from the leaderboard.

在Trillion-Pairs上的结果。Trillion-Pairs数据集[2]提供了来自Flickr的158万张图片作为gallery集,以及来自5.7K个LFW[13]身份的27.4万张图片作为probe集。gallery集和probe集之间的每一对都用于评估(总计0.4万亿对)。在表8中,我们比较了在不同数据集上训练的ArcFace的性能。与CASIA相比,提出的MS1MV2数据集明显提升了性能,甚至略优于身份数量是其两倍的DeepGlint-Face数据集。当把MS1MV2的所有身份与DeepGlint的亚洲名人结合起来时,ArcFace获得了最佳的识别性能84.840%(@FPR=1e-3),验证性能也与排行榜上最新的提交(CIGIT IRSEC)不相上下。

表8. 对Trillion-Pairs数据集的识别和验证结果(%) ([数据集*,ResNet100 ArcFace])

 

Results on iQIYI-VID. The iQIYI-VID challenge [20] contains 565,372 video clips (training set 219,677, validation set 172,860, and test set 172,835) of 4,934 identities from iQIYI variety shows, films and television dramas. The length of each video ranges from 1 to 30 seconds. This dataset supplies multi-modal cues, including face, cloth, voice, gait and subtitles, for character identification. The iQIYI-VID dataset employs MAP@100 as the evaluation indicator. MAP (Mean Average Precision) refers to the overall average accuracy rate, which is the mean of the average accuracy rate of the corresponding videos of person ID retrieved in the test set for each person ID (as the query) in the training set.

在iQIYI-VID上的结果。iQIYI-VID挑战赛[20]包含了来自iQIYI综艺节目、电影和电视剧的4934个身份的565,372个视频片段(训练集219,677个,验证集172,860个,测试集172,835个)。每个视频的长度从1秒到30秒不等。该数据集为角色识别提供了多模态线索,包括人脸、衣着、声音、步态和字幕。iQIYI-VID数据集使用MAP@100作为评价指标。MAP(Mean Average Precision,平均精度均值)是指:以训练集中每个人物ID作为查询,在测试集中检索出对应视频的平均准确率,再对所有人物ID取均值。

 

As shown in Table 9, ArcFace trained on combined MS1MV2 and Asian datasets with ResNet100 sets a high baseline (MAP=79.80%). Based on the embedding feature for each training video, we train an additional three-layer fully connected network with a classification loss to get the customised feature descriptor on the iQIYI-VID dataset. The MLP learned on the iQIYI-VID training set significantly boosts the MAP by 6.60%. Drawing support from the model ensemble and context features from the off-the-shelf object and scene classifier [1], our final result surpasses the runner-up by a clear margin (0.99%).

如表9所示(表中第一行为MS1MV2+Asian, R100, ArcFace),结合MS1MV2和亚洲数据集、使用ResNet100训练的ArcFace建立了一个很高的baseline(MAP=79.80%)。基于每个训练视频的嵌入特征,我们额外训练了一个带分类损失的三层全连接网络,以得到iQIYI-VID数据集上定制的特征描述符。在iQIYI-VID训练集上学习的MLP将MAP显著提升了6.60%(79.80%+6.60%=86.40%)。借助模型集成(ensemble)以及来自现成的物体和场景分类器[1]的上下文特征,我们的最终结果以明显的优势(0.99%)超过了亚军。

表9. 我们的方法在iQIYI-VID测试集上的MAP。MLP是指在iQIYI-VID训练数据集上训练的三层全连接网络。

 

4. Conclusions

In this paper, we proposed an Additive Angular Margin Loss function, which can effectively enhance the discriminative power of feature embeddings learned via DCNNs for face recognition. In the most comprehensive experiments reported in the literature we demonstrate that our method consistently outperforms the state-of-the-art. Code and details have been released under the MIT license.

4. 结论

在本文中,我们提出了一种加法角度间隔损失函数,它可以有效增强通过DCNN学习到的人脸识别特征嵌入的判别能力。在文献中报告的最全面的实验中,我们证明了我们的方法始终优于state-of-the-art方法。代码和细节已在MIT许可下发布。

 

5. Appendix


5.1. Parallel Acceleration

Can we apply ArcFace on large-scale identities? Yes,millions of identities are not a problem.

The concept of Centre (W) is indispensable in ArcFace,but the parameter size of Centre (W) is proportional to the number of classes. When there are millions of identities in the training data, the proposed ArcFace confronts with substantial training difficulties, e.g. excessive GPU memory consumption and massive computational cost, even at a prohibitive level.

In our implementation, we employ a parallel acceleration strategy [44] to relieve this problem. We optimise our training code to easily and efficiently support million level identities on a single machine by parallel acceleration on both the feature x (known as the general data parallel strategy) and the centre W (we name it the centre parallel strategy). As shown in Figure 10, our parallel acceleration on both feature x and centre W can significantly decrease the GPU memory consumption and accelerate the training speed. Even for one million identities trained on 8*1080ti (11GB), our implementation (ResNet 50, batch size 8*64, feature dimension 512 and float point 32) can still run at 800 samples per second. Compared to the approximate acceleration method proposed in [47], our implementation has no performance drop.

 

5. 附录

5.1. 并行加速

我们可以将ArcFace应用于大规模的身份吗? 是的,数百万的身份不是问题。

中心( W )的概念在ArcFace中是必不可少的,但是中心( W )的参数大小与类的数量成正比。当训练数据中存在数百万个身份时,所提出的ArcFace面临着大量的训练困难,如GPU内存消耗过大,计算量巨大,甚至到了令人望而却步的程度。

在我们的实现中,我们使用了一个并行加速策略[44]来缓解这个问题。我们优化了我们的训练代码,通过在特征 x (它被称为通用数据并行策略)和中心 W (我们将其命名为中心并行策略)上并行加速,在一台机器上轻松有效地支持上百万级的身份。如图10所示,我们在特征 x 和中心 W 上的并行加速可以显著降低GPU内存消耗,加速训练速度。即使在8*1080ti (11GB)上训练100万个身份,我们的实现(ResNet 50、批大小8*64、特征维数为512和浮点数32)仍然可以以每秒800个样本运行。与[47]中提出的近似加速方法相比,我们的实现没有性能下降。

图10. 在特征x和中心W上的并行加速。设置:ResNet 50,批大小 8*64,特征维度 512,浮点数32,GPU 8*P40 (24GB)

 

In Figure 11, we illustrate the main calculation steps of the parallel acceleration by simple matrix partition, which can be easily grasped and reproduced by beginners [3].

(1) Get feature (x). Face embedding features are aggregated into one feature matrix (batch size 8*64 × feature dimension 512) from 8 GPU cards. The size of the aggregated feature matrix is only 1MB, and the communication cost is negligible when we transfer the feature matrix.

(2) Get similarity score matrix (score = xW). We copy the feature matrix into each GPU, and concurrently multiply the feature matrix by the centre sub-matrix (feature dimension 512 × identity number 1M/8) to get the similarity score sub-matrix (batch size 512 × identity number 1M/8) on each GPU. The similarity score matrix goes forward to calculate the ArcFace loss and the gradient. Here, we conduct a simple matrix partition on the centre matrix and the similarity score matrix along the identity dimension, and there is no communication cost on the centre and similarity score matrix. Both the centre sub-matrix and the similarity score sub-matrix are only 256MB on each GPU.

(3) Get gradient on centre (dW). We transpose the feature matrix on each GPU, and concurrently multiply the transposed feature matrix by the gradient sub-matrix of the similarity score.

(4) Get gradient on feature (x). We concurrently multiply the gradient sub-matrix of similarity score by the transposed centre sub-matrix and sum up the outputs from 8 GPU cards to get the gradient on feature x. Considering the communication cost (MB level), our implementation of ArcFace can be easily and efficiently trained on millions of identities by clusters.

图11中,我们通过简单的矩阵划分说明了并行加速的主要计算步骤,这对于初学者[3]来说很容易掌握和再现。

(1)如图11(a)所示,获取特征(x),将8张GPU卡的人脸嵌入特征聚合为一个特征矩阵(批大小8*64×特征维数512)。聚合的特征矩阵的大小只有1MB,当我们传输特征矩阵时,通信成本可以忽略不计。

(2)如图11(b)所示,获取相似度得分矩阵(score = xW)。我们将特征矩阵复制到每个GPU中,同时将特征矩阵乘以中心W的子矩阵(特征维数512×身份数量1M/8),得到每个GPU上的相似度得分子矩阵(批大小512×身份数量1M/8)。相似度得分矩阵接着前向传播计算ArcFace损失和梯度。在这里,我们沿着身份维度对中心W矩阵和相似度得分矩阵进行简单的矩阵划分,并且在中心矩阵和相似度得分矩阵上没有通信成本。中心子矩阵和相似度得分子矩阵在每个GPU上都只有256MB。

(3)如图11(c)所示,获取中心W上的梯度(dW)。我们对每个GPU上的特征矩阵进行转置,同时将转置后的特征矩阵乘以相似度得分子矩阵的梯度。

(4)如图11(d)所示,获得特征(x)的梯度(dx),我们同时将相似度得分子矩阵的梯度乘以转置后的中心W子矩阵,将8张GPU卡的输出相加得到特征x的梯度。考虑到通信成本(MB级),通过集群我们的ArcFace实现可以轻松有效地在数百万个身份上进行训练。
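
下面用一段NumPy小例子示意上述矩阵划分的计算流程(非原文代码,把8块GPU简化为8个子矩阵,身份数量也做了缩小):

import numpy as np

num_gpu, batch, dim, num_id = 8, 512, 512, 8000              # 实际约为1M身份,这里缩小规模
x = np.random.randn(batch, dim).astype(np.float32)            # (1) 聚合后的特征矩阵
W_parts = [np.random.randn(dim, num_id // num_gpu).astype(np.float32)
           for _ in range(num_gpu)]                            # 中心W沿身份维度切分到各GPU

scores = [x @ W_p for W_p in W_parts]                          # (2) 各GPU并行计算相似度得分子矩阵
grad_scores = [np.random.randn(*s.shape).astype(np.float32)
               for s in scores]                                # 这里用随机数代替损失对得分的梯度

dW_parts = [x.T @ g for g in grad_scores]                      # (3) 各GPU并行得到中心子矩阵的梯度
dx = sum(g @ W_p.T for g, W_p in zip(grad_scores, W_parts))    # (4) 汇总各GPU输出得到特征x的梯度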


图11.  通过简单的矩阵划分并行计算。设置:ResNet 50,批大小8*64,特征维数512,浮点数32,身份数量100万,GPU 8* 1080ti (11GB)。通讯成本:1MB(特征x),训练速度:每秒800个样本。

5.2. Feature Space Analysis

Is the 512-d hypersphere space large enough to hold large-scale identities? Theoretically, Yes.

We assume that the identity centre W_{j} ’s follow a realistically spherical uniform distribution, the expectation of the nearest neighbour separation[5] is

where d is the space dimension, n is the identity number, and \theta \left ( W_{j} \right )=min_{1\leqslant i,j\leqslant n,i\neq j}arccos\left ( W_{i},W_{j} \right ), \forall i,j. In Figure 12, we give E\left [ \theta \left ( W_{j} \right ) \right ] in the 128-d, 256-d and 512-d space with the class number ranging from 10K to 100M. The high-dimensional space is so large that E\left [ \theta \left ( W_{j} \right ) \right ] decreases slowly when the class number increases exponentially.

 

5.2. 特征空间分析

512-d超球空间是否足够大来容纳大规模的身份? 从理论上说,是的。

我们假设身份中心W_{j} 服从一个现实的球面均匀分布,最近邻分离[5]的期望是

其中 d 是空间维数,n 是身份数量,并且 \theta \left ( W_{j} \right )=min_{1\leqslant i,j\leqslant n,i\neq j}arccos\left ( W_{i},W_{j} \right ),\forall i,j 。在图12中,我们给出了类别数量从10K到100M时,128-d、256-d和512-d空间中的 E\left [ \theta \left ( W_{j} \right ) \right ] 。高维空间非常大,当类别数量呈指数增长时,E\left [ \theta \left ( W_{j} \right ) \right ] 下降得很缓慢。
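
这里给出一段蒙特卡洛模拟的示意代码(非原文内容,函数名为示意),可以用随机采样近似估计 n 个类别中心在 d 维超球面上的最近邻夹角期望:

import numpy as np

def mean_nearest_angle(n, d, trials=5):
    """随机采样n个d维单位向量,返回各向量到最近邻夹角(度)的平均值。"""
    results = []
    for _ in range(trials):
        W = np.random.randn(n, d)
        W /= np.linalg.norm(W, axis=1, keepdims=True)        # 近似球面均匀分布
        cos = np.clip(W @ W.T, -1.0, 1.0)
        np.fill_diagonal(cos, -1.0)                          # 排除自身
        nearest = np.degrees(np.arccos(cos.max(axis=1)))     # 每个中心到最近邻的夹角
        results.append(nearest.mean())
    return float(np.mean(results))

# 例如 mean_nearest_angle(10000, 512) 可近似图12中10K类、512维时的期望夹角(类别数更大时需更高效的实现)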

图12. 高维空间是如此大,以至于类别数量呈指数增长时,最接近的角度的均值缓慢下降。
