DeepID1 Translation: Deep Learning Face Representation from Predicting 10,000 Classes

Abstract

This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features (DeepID), for face verification. We argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and new identities unseen in the training set. Moreover, the generalization capability of DeepID increases as more face classes are to be predicted at training. DeepID features are taken from the last hidden layer neuron activations of deep convolutional networks (ConvNets). When learned as classifiers to recognize about 10,000 face identities in the training set and configured to keep reducing the neuron numbers along the feature extraction hierarchy, these deep ConvNets gradually form compact identity-related features in the top layers with only a small number of hidden neurons. The proposed features are extracted from various face regions to form complementary and over-complete representations. Any state-of-the-art classifier can be learned based on these high-level representations for face verification. 97.45% verification accuracy on LFW is achieved with only weakly aligned faces.

1. Introduction

Face verification in unconstrained conditions has been studied extensively in recent years [21, 15, 7, 34, 17, 26, 18, 8, 2, 9, 3, 29, 6] due to its practical applications and the publishing of LFW [19], an extensively reported dataset for face verification algorithms. The current best-performing face verification algorithms typically represent faces with over-complete low-level features, followed by shallow models [9, 29, 6]. Recently, deep models such as ConvNets [24] have been proved effective for extracting high-level visual features [11, 20, 14] and are used for face verification [18, 5, 31, 32, 36]. Huang et al. [18] learned a generative deep model without supervision. Cai et al. [5] learned deep nonlinear metrics.

In [31], the deep models are supervised by the binary face verification target. Differently, in this paper we propose to learn high-level face identity features with deep models through face identification, i.e. classifying a training image into one of n identities (n ≈ 10,000 in this work). This high-dimensional prediction task is much more challenging than face verification; however, it leads to good generalization of the learned feature representations. Although learned through identification, these features are shown to be effective for face verification and new faces unseen in the training set.

Figure 1. An illustration of the feature extraction process. Arrows indicate forward propagation directions. The number of neurons in each layer of the multiple deep ConvNets is labeled beside each layer. The DeepID features are taken from the last hidden layer of each ConvNet and predict a large number of identity classes. Feature numbers continue to reduce along the feature extraction cascade till the DeepID layer.

We propose an effective way to learn high-level over-complete features with deep ConvNets. A high-level illustration of our feature extraction process is shown in Figure 1. The ConvNets are learned to classify all the faces available for training by their identities, with the last hidden layer neuron activations as features (referred to as Deep hidden IDentity features or DeepID). Each ConvNet takes a face patch as input and extracts local low-level features in the bottom layers. Feature numbers continue to reduce along the feature extraction cascade, while gradually more global and high-level features are formed in the top layers. Highly compact 160-dimensional DeepID features are acquired at the end of the cascade, which contain rich identity information and directly predict a much larger number (e.g., 10,000) of identity classes. Classifying all the identities simultaneously, instead of training binary classifiers as in [21, 2, 3], is based on two considerations. First, it is much more difficult to predict a training sample into one of many classes than to perform binary classification. This challenging task can make full use of the super learning capacity of neural networks to extract effective features for face recognition. Second, it implicitly adds a strong regularization to ConvNets, which helps to form shared hidden representations that can classify all the identities well. Therefore, the learned high-level features have good generalization ability and do not over-fit to a small subset of training faces. We constrain the number of DeepID features to be significantly smaller than the number of identity classes they predict, which is key to learning highly compact and discriminative features. We further concatenate the DeepID extracted from various face regions to form complementary and over-complete representations. The learned features can be well generalized to new identities in test, which are not seen in training, and can be readily integrated with any state-of-the-art face classifiers (e.g., Joint Bayesian [8]) for face verification.

Our method achieves 97.45% face verification accuracy on LFW using only weakly aligned faces, which is almost as good as the human performance of 97.53%. We also observe that as the number of training identities increases, the verification performance steadily improves. Although the prediction task at the training stage becomes more challenging, the discrimination and generalization ability of the learned features increases. It leaves the door wide open for future improvement of accuracy with more training data.

2. Related work

[1] descriptors, and learned sparse Mahalanobis metrics for face verification.

Some previous studies have further learned identity-related features based on low-level features. Kumar et al. [21] trained attribute and simile classifiers to detect facial attributes and measure face similarities to a set of reference people. Berg and Belhumeur [2, 3] trained classifiers to distinguish the faces from two different people. Features are outputs of the learned classifiers. They used SVM classifiers, which are shallow structures, and their learned features are still relatively low-level. In contrast, we classify all the identities from the training set simultaneously. Moreover, we use the last hidden layer activations as features instead of the classifier outputs. In our ConvNets, the neuron number of the last hidden layer is much smaller than that of the output, which forces the last hidden layer to learn shared hidden representations for faces of different people in order to classify all of them well, resulting in highly discriminative and compact features with good generalization ability.

A few deep models have been used for face verification or identification. Chopra et al. [10] used a Siamese network [4] for deep metric learning. The Siamese network extracts features separately from two compared inputs with two identical sub-networks, taking the distance between the outputs of the two sub-networks as the dissimilarity. [10] used deep ConvNets as the sub-networks. In contrast to the Siamese network, in which feature extraction and recognition are jointly learned with the face verification target, we conduct feature extraction and recognition in two steps, with the first feature extraction step learned with the target of face identification, which is a much stronger supervision signal than verification. Huang et al. [18] generatively learned features with CDBNs [25], then used ITML [13] and linear SVM for face verification. Cai et al. [5] also learned deep metrics under the Siamese network framework as [10], but used a two-level ISA network [23] as the sub-networks instead. Zhu et al. [35, 36] learned deep neural networks to transform faces in arbitrary poses and illumination to frontal faces with normal illumination, and then used the last hidden layer features or the transformed faces for face recognition. Sun et al. [31] used multiple deep ConvNets to learn high-level face similarity features and trained a classification RBM [22] for face verification. Their features are jointly extracted from a pair of faces instead of from a single face.

3. Learning DeepID for face verification

3.1. Deep ConvNets

Our deep ConvNets contain four convolutional layers (with max-pooling) to extract features hierarchically, followed by the fully-connected DeepID layer and the softmax output layer indicating identity classes. The input is 39 × 31 × k for rectangle patches and 31 × 31 × k for square patches, where k = 3 for color patches and k = 1 for gray patches. Figure 2 shows the detailed structure of the ConvNet which takes 39 × 31 × 1 input and predicts n (e.g., n = 4349) identity classes. When the input sizes change, the height and width of maps in the following layers will change accordingly. The dimension of the DeepID layer is fixed to 160, while the dimension of the output layer varies according to the number of classes it predicts. Feature numbers continue to reduce along the feature extraction hierarchy until the last hidden layer (the DeepID layer), where highly compact and predictive features are formed, which predict a much larger number of identity classes with only a few features.

Figure 2. ConvNet structure. The length, width, and height of each cuboid denote the map number and the dimension of each map for all input, convolutional, and max-pooling layers. The inside small cuboids and squares denote the 3D convolution kernel sizes and the 2D pooling region sizes of the convolutional and max-pooling layers, respectively. Neuron numbers of the last two fully-connected layers are marked beside each layer.

The convolution operation is expressed as

$$ y^{j(r)} = \max\!\left(0,\; b^{j(r)} + \sum_i k^{ij(r)} * x^{i(r)}\right), \tag{1} $$

where $x^i$ and $y^j$ are the i-th input map and the j-th output map, respectively. $k^{ij}$ is the convolution kernel between the i-th input map and the j-th output map. $*$ denotes convolution. $b^j$ is the bias of the j-th output map. We use ReLU nonlinearity ($y = \max(0, x)$) for hidden neurons, which is shown to have better fitting abilities than the sigmoid function [20]. Weights in higher convolutional layers of our ConvNets are locally shared to learn different mid- or high-level features in different regions [18]. $r$ in Equation (1) indicates a local region where weights are shared. In the third convolutional layer, weights are locally shared in every 2 × 2 regions, while weights in the fourth convolutional layer are totally unshared. Max-pooling is formulated as

$$ y^{i}_{j,k} = \max_{0 \le m,\, n < s} \left\{ x^{i}_{j \cdot s + m,\; k \cdot s + n} \right\}, \tag{2} $$

where each neuron in the i-th output map $y^i$ pools over an $s \times s$ non-overlapping local region in the i-th input map $x^i$.

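To make Equations (1) and (2) concrete, here is a minimal NumPy sketch of one convolutional layer with ReLU and one non-overlapping max-pooling layer. This is my own illustration, not the authors' code; it implements only the fully shared-weight case, whereas the paper additionally shares weights only locally in the third convolutional layer and unshares them in the fourth.

```python
import numpy as np

def conv_relu(x, k, b):
    """One convolutional layer (Equation 1) with fully shared weights.

    x: (in_maps, H, W) input maps; k: (out_maps, in_maps, kh, kw) kernels;
    b: (out_maps,) biases. Returns max(0, b_j + sum_i k_ij * x_i).
    """
    in_maps, H, W = x.shape
    out_maps, _, kh, kw = k.shape
    y = np.zeros((out_maps, H - kh + 1, W - kw + 1))
    for j in range(out_maps):
        for r in range(H - kh + 1):
            for c in range(W - kw + 1):
                y[j, r, c] = b[j] + np.sum(k[j] * x[:, r:r + kh, c:c + kw])
    return np.maximum(y, 0)  # ReLU nonlinearity

def max_pool(x, s=2):
    """s x s non-overlapping max-pooling per map (Equation 2)."""
    m, H, W = x.shape
    x = x[:, :H - H % s, :W - W % s]  # drop rows/cols that do not fit
    return x.reshape(m, H // s, s, W // s, s).max(axis=(2, 4))
```
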
Figure 3. Top: ten face regions of medium scales. The five regions in the top left are global regions taken from the weakly aligned faces; the other five in the top right are local regions centered around the five facial landmarks (two eye centers, nose tip, and two mouth corners). Bottom: three scales of two particular patches.

The last hidden layer of DeepID is fully connected to both the third and fourth convolutional layers (after max-pooling), such that it sees multi-scale features [28] (features in the fourth convolutional layer are more global than those in the third one). This is critical to feature learning because after successive down-sampling along the cascade, the fourth convolutional layer contains too few neurons and becomes the bottleneck for information propagation. Adding the bypassing connections between the third convolutional layer (referred to as the skipping layer) and the last hidden layer reduces the possible information loss in the fourth convolutional layer. The last hidden layer takes the function

$$ y_j = \max\!\left(0,\; \sum_i x^{1}_{i} \cdot w^{1}_{i,j} + \sum_i x^{2}_{i} \cdot w^{2}_{i,j} + b_j\right), \tag{3} $$

where $x^1$, $w^1$ and $x^2$, $w^2$ denote the neurons and weights in the third and fourth convolutional layers, respectively. It linearly combines features in the previous two convolutional layers, followed by ReLU non-linearity.

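The following is a minimal sketch of the whole ConvNet in PyTorch (an assumed framework; the original work predates it). It reproduces the layer sizes of Figure 2 for a 39 × 31 gray patch and the multi-scale connection of Equation (3); for simplicity it uses ordinary convolutions where the paper uses locally shared weights in the third and fourth convolutional layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepIDNet(nn.Module):
    """Sketch of the DeepID ConvNet for 39x31 gray patches."""
    def __init__(self, n_classes=4349):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 4)   # 39x31 -> 36x28
        self.conv2 = nn.Conv2d(20, 40, 3)  # 18x14 -> 16x12
        self.conv3 = nn.Conv2d(40, 60, 3)  # 8x6   -> 6x4
        self.conv4 = nn.Conv2d(60, 80, 2)  # 3x2   -> 2x1
        self.pool = nn.MaxPool2d(2)
        # The DeepID layer sees both conv3 (after pooling) and conv4
        # features: the multi-scale / skipping connection of Equation (3).
        self.deepid = nn.Linear(60 * 3 * 2 + 80 * 2 * 1, 160)
        self.ident = nn.Linear(160, n_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x3 = self.pool(F.relu(self.conv3(x)))  # skipping layer
        x4 = F.relu(self.conv4(x3))
        multi = torch.cat([x3.flatten(1), x4.flatten(1)], dim=1)
        feat = F.relu(self.deepid(multi))      # 160-d DeepID features
        return feat, self.ident(feat)          # logits for n-way softmax

# Training minimizes -log y_t, i.e. cross-entropy over identity labels:
# feat, logits = model(batch); loss = F.cross_entropy(logits, labels)
```
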
The ConvNet output is an n-way softmax predicting the probability distribution over n different identities,

$$ y_j = \frac{\exp(y'_j)}{\sum_{k=1}^{n} \exp(y'_k)}, \tag{4} $$

where $y'_j = \sum_{i=1}^{160} x_i \cdot w_{i,j} + b_j$ linearly combines the 160 DeepID features $x_i$ as the input of neuron j, and $y_j$ is its output. The ConvNet is learned by minimizing $-\log y_t$, with t being the index of the target class. Stochastic gradient descent is used, with gradients calculated by back-propagation.

3.2. Feature extraction

We detect five facial landmarks, including the two eye centers, the nose tip, and the two mouth corners, with the facial point detection method proposed by Sun et al. [30]. Faces are globally aligned by similarity transformation according to the two eye centers and the mid-point of the two mouth corners. Features are extracted from 60 face patches with ten regions, three scales, and RGB or gray channels. Figure 3 shows the ten face regions and the three scales of two particular face regions. We trained 60 ConvNets, each of which extracts two 160-dimensional DeepID vectors from a particular patch and its horizontally flipped counterpart. A special case is the patches around the two eye centers and the two mouth corners, which are not flipped themselves, but paired with the patches symmetric to them (for example, the flipped counterpart of the patch centered on the left eye is derived by flipping the patch centered on the right eye). The total length of DeepID is 19,200 (160 × 2 × 60), which is ready for the final face verification.

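A sketch of this extraction pipeline, assuming hypothetical `patch_croppers` callables for the 60 patch croppings and trained `convnets` that each map a patch to its 160-d DeepID vector (the special handling of eye- and mouth-centered patches is omitted for brevity):

```python
import numpy as np

def extract_deepid(face, patch_croppers, convnets):
    """Assemble the 19,200-d DeepID: 60 patches x 2 views x 160 dims.

    patch_croppers[i] (hypothetical) crops the i-th of the 60 patches from
    a weakly aligned face; convnets[i] returns its 160-d DeepID vector.
    """
    feats = []
    for crop, net in zip(patch_croppers, convnets):
        patch = crop(face)
        feats.append(net(patch))                  # original view
        feats.append(net(patch[:, ::-1].copy()))  # horizontally flipped view
    return np.concatenate(feats)                  # 160 * 2 * 60 = 19,200
```
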
3.3. Face verification

We use the Joint Bayesian [8] technique for face verification based on the DeepID. Joint Bayesian has been highly successful for face verification [9, 6]. It represents the extracted facial features x (after subtracting the mean) as the sum of two independent Gaussian variables,

$$ x = \mu + \epsilon, \tag{5} $$

where $\mu \sim N(0, S_\mu)$ represents the face identity and $\epsilon \sim N(0, S_\epsilon)$ the intra-personal variations. Joint Bayesian models the joint probability of two faces given the intra- or extra-personal variation hypotheses, $P(x_1, x_2 \mid H_I)$ and $P(x_1, x_2 \mid H_E)$. It is readily shown from Equation (5) that these two probabilities are also Gaussian, with covariances

$$ \Sigma_I = \begin{bmatrix} S_\mu + S_\epsilon & S_\mu \\ S_\mu & S_\mu + S_\epsilon \end{bmatrix} \quad \text{and} \quad \Sigma_E = \begin{bmatrix} S_\mu + S_\epsilon & 0 \\ 0 & S_\mu + S_\epsilon \end{bmatrix}, \tag{6} $$

respectively. $S_\mu$ and $S_\epsilon$ can be learned from data with the EM algorithm. In test, it calculates the likelihood ratio

$$ r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)}, \tag{7} $$

which has a closed-form solution and is efficient.

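As an illustration of the verification step, the log-likelihood ratio of Equation (7) can be evaluated directly from the block covariances of Equation (6). This is a sketch assuming S_mu and S_eps have already been estimated (e.g., by EM); the closed-form solution mentioned above avoids building the 2d × 2d matrices explicitly.

```python
import numpy as np
from scipy.stats import multivariate_normal

def jb_log_likelihood_ratio(x1, x2, S_mu, S_eps):
    """Joint Bayesian ratio r(x1, x2) of Equation (7).

    x1, x2: mean-subtracted feature vectors of the two compared faces;
    S_mu, S_eps: identity / intra-personal covariance estimates.
    """
    d = len(x1)
    S = S_mu + S_eps
    # Covariances of [x1; x2] under the intra- and extra-personal
    # hypotheses (Equation 6); the mean is zero in both cases.
    sigma_I = np.block([[S, S_mu], [S_mu, S]])
    sigma_E = np.block([[S, np.zeros((d, d))], [np.zeros((d, d)), S]])
    z = np.concatenate([x1, x2])
    return (multivariate_normal.logpdf(z, cov=sigma_I)
            - multivariate_normal.logpdf(z, cov=sigma_E))

# Verification: same identity if the ratio exceeds a threshold tuned
# on training pairs.
```
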
We also train a neural network for verification and compare it to Joint Bayesian, to see whether other models can also learn from the extracted features, and how much the features and a good face verification model each contribute to the performance. The neural network contains one input layer taking the DeepID, one locally-connected layer, one fully-connected layer, and a single output neuron indicating face similarity. The input features are divided into 60 groups, each of which contains 640 features extracted from a particular patch pair with a particular ConvNet. Features in the same group are highly correlated. Neurons in the locally-connected layer only connect to a single group of features, to learn their local relations and reduce the feature dimension at the same time. The second hidden layer is fully connected to the first hidden layer to learn global relations. The single output neuron is fully connected to the second hidden layer. The hidden neurons are ReLUs and the output neuron is sigmoid. An illustration of the neural network structure is shown in Figure 4. It has 38,400 input neurons, with 19,200 DeepID features from each face of the compared pair, and 4800 neurons in each of the following two hidden layers, with every 80 neurons in the first hidden layer locally connected to one of the 60 groups of input neurons.

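A PyTorch sketch of this verification network, under the same framework assumption as above (my own rendering; the 60 per-group linear maps implement the locally-connected layer):

```python
import torch
import torch.nn as nn

class VerificationNet(nn.Module):
    """Sketch of the verification network of Figure 4.

    Input: 38,400 dims = 60 groups x 640 features (one group per
    patch pair / ConvNet). Each group is locally connected to its
    own 80 hidden units; layers 2-3 are fully connected.
    """
    def __init__(self, groups=60, group_dim=640, group_hidden=80):
        super().__init__()
        self.local = nn.ModuleList(
            [nn.Linear(group_dim, group_hidden) for _ in range(groups)])
        self.fc = nn.Linear(groups * group_hidden, 4800)
        self.out = nn.Linear(4800, 1)
        self.drop = nn.Dropout(0.5)  # dropout on hidden neurons only
        self.group_dim = group_dim

    def forward(self, x):                        # x: (N, 38400)
        chunks = x.split(self.group_dim, dim=1)  # 60 groups of 640
        h1 = torch.cat([torch.relu(l(c)) for l, c in zip(self.local, chunks)],
                       dim=1)                    # (N, 4800)
        h2 = torch.relu(self.fc(self.drop(h1)))
        return torch.sigmoid(self.out(self.drop(h2)))  # face similarity
```
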
Figure 4. The structure of the neural network used for face verification. The layer type and dimension are labeled beside each layer. The solid neurons form a subnetwork.

Dropout learning [16] is used for all the hidden neurons. The input neurons cannot be dropped because the learned features are compact and distributed representations (representing a large number of identities with very few neurons) and have to collaborate with each other to represent the identities well. On the other hand, learning high-dimensional features without dropout is difficult due to gradient diffusion. To solve this problem, we first train 60 subnetworks, each with the features of a single group as input. A particular subnetwork is illustrated in Figure 4. We then use the first-layer weights of the subnetworks to initialize those of the original network, and tune the second and third layers of the original network with the first-layer weights fixed.

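The two-stage procedure can be sketched as follows, reusing the VerificationNet sketch above and assuming a hypothetical `subnets` list holding the 60 trained single-group subnetworks, each exposing its first-layer weights as `local[0]`:

```python
import torch

def build_full_network(subnets):
    """Hypothetical two-stage setup: initialize the locally-connected
    layer of the full network from the 60 trained single-group
    subnetworks, then freeze it while tuning the upper layers."""
    full = VerificationNet()
    for i, sub in enumerate(subnets):
        full.local[i].load_state_dict(sub.local[0].state_dict())
    for p in full.local.parameters():
        p.requires_grad = False  # first-layer weights kept fixed
    trainable = [p for p in full.parameters() if p.requires_grad]
    return full, torch.optim.SGD(trainable, lr=0.01)
```
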
4. Experiments

We evaluate our algorithm on LFW, which reveals the state-of-the-art of face verification in the wild. Though LFW contains 5749 people, only 85 have more than 15 images, and 4069 people have only one image. It is inadequate to train identity classifiers with so few images per person. Instead, we trained our model on CelebFaces [31] and tested on LFW (Sections 4.1 - 4.3). CelebFaces contains 87,628 face images of 5436 celebrities from the Internet, with approximately 16 images per person on average. People in LFW and CelebFaces are mutually exclusive.

We randomly choose 80% (4349) of the people from CelebFaces to learn the DeepID, and use the remaining 20% to learn the face verification model (Joint Bayesian or neural networks). For feature learning, ConvNets are supervised to classify the 4349 people simultaneously from a particular kind of face patch and its flipped counterpart. We randomly select 10% of the images of each training person to generate the validation data. After each training epoch, we observe the top-1 validation set error rates and select the model that provides the lowest one.

In face verification, our feature dimension is reduced to 150 by PCA before learning the Joint Bayesian model. Performance is almost retained over a wide range of dimensions. In test, each face pair is classified by comparing the Joint Bayesian likelihood ratio to a threshold optimized on the training data.

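For instance, with scikit-learn (an assumed tooling choice; `train_feats` and `test_feats` stand for hypothetical arrays of extracted DeepID vectors):

```python
from sklearn.decomposition import PCA

def reduce_for_joint_bayesian(train_feats, test_feats, dims=150):
    """Reduce the 19,200-d concatenated DeepID to 150 dims before
    learning Joint Bayesian (a sketch; inputs are (n_samples, 19200)
    arrays of extracted features)."""
    pca = PCA(n_components=dims).fit(train_feats)
    return pca.transform(train_feats), pca.transform(test_feats)
```
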
To evaluate the performance of our approach at an even larger training scale in Section 4.4, we extend CelebFaces to the CelebFaces+ dataset, which contains 202,599 face images of 10,177 celebrities. Again, people in LFW and CelebFaces+ are mutually exclusive. The ConvNet structure and feature extraction process described in the previous section remain unchanged.

4.1. Multi-scale ConvNets

We verify the effectiveness of directly connecting neurons in the third convolutional layer (after max-pooling) to the last hidden layer (the DeepID layer), such that it sees both the third and fourth convolutional layer features, forming the so-called multi-scale ConvNets. This also reduces feature numbers from the convolutional layers to the DeepID layer (shown in Figure 1), which helps the latter to learn higher-level features in order to represent the face identities well with fewer neurons. Figure 5 compares the top-1 validation set error rates of the 60 ConvNets learned to classify the 4349 classes of identities, either with or without the skipping layer. The lower error rates indicate that better hidden features are learned. Allowing the DeepID to pool over multi-scale features reduces validation errors by an average of 4.72%. It also improves the final face verification accuracy from 95.35% to 96.05% when concatenating the DeepID from the 60 ConvNets and using Joint Bayesian for face verification.

4.2. Learning effective features

Classifying a large number of identities simultaneously is key to learning discriminative and compact hidden features. To verify this, we increase the identity classes for training exponentially (and the output neuron numbers correspondingly) from 136 to 4349, while fixing the neuron numbers in all previous layers (the DeepID is kept 160-dimensional). We observe the classification ability of ConvNets (measured by the top-1 validation set error rates) and the effectiveness of the learned hidden representations for face verification (measured by the test set verification accuracy) as the identity classes increase. The input is a single patch covering the whole face in this experiment. As shown in Figure 6, both Joint Bayesian and the neural network improve linearly in verification accuracy when the identity classes double. The improvement is significant. When identity classes increase 32 times from 136 to 4349, the accuracy increases by 10.13% and 8.42% for Joint Bayesian and neural networks, respectively, or 2.03% and 1.68% on average each time the identity classes double. At the same time, the validation set error rates drop, even when the predicted classes are tens of times more than the last hidden layer neurons, as shown in Figure 7. This phenomenon indicates that ConvNets can learn from classifying each identity and form shared hidden representations that can classify all the identities well. More identity classes help to learn better hidden representations that can distinguish more people (discriminative) without increasing the feature length (compact). The linear increase of test accuracy with respect to the exponentially increasing training data indicates that our features would be further improved if even more identities were available. Examples of the 160-dimensional DeepID learned from the 4349 training identities and extracted from LFW test pairs are shown in Figure 8. We find that faces of the same identity tend to have more commonly activated neurons (positive features in the same positions) than those of different identities. So the learned features extract identity information.

We also test the 4349-dimensional classifier outputs as features for face verification. Joint Bayesian only achieves approximately 66% accuracy on these features, while the neural network fails, classifying all the face pairs as either positive or negative. With so many classes and few samples for each class, the classifier outputs are diverse and unreliable, and therefore cannot be used as features.

4.3. Over-complete representation

We evaluate how much combining features extracted from various face patches contributes to the performance. We train the face verification model with features from k patches (k = 1, 5, 15, 30, 60). It is impossible to enumerate all the possible combinations of patches, so we select the most representative ones. We report the best-performing single patch (k = 1), the global color patches in a single scale (k = 5), all the global color patches (k = 15), all the color patches (k = 30), and all the patches (k = 60). As shown in Figure 9, adding more features from various regions, scales, and color channels consistently improves the performance. Combining 60 patches increases the accuracy by 4.53% and 5.27% over the best single patch for Joint Bayesian and neural networks, respectively. We achieve 96.05% and 94.32% accuracy using Joint Bayesian and neural networks, respectively. The curves show that the performance may be further improved if more features are extracted.

Figure 8. Examples of the learned 160-dimensional DeepID. The left column shows three test pairs in LFW. The first two pairs are of the same identity, the third one is of different identities. The corresponding features extracted from each patch are shown on the right. The features are one-dimensional; we rearrange them into a 2D grid for the convenience of illustration. The feature values are non-negative since they are taken from the ReLUs. Approximately 40% of the features have positive values. The brighter squares indicate higher values.

4.4. Method comparison

To show how our algorithm would benefit from more training data, we enlarge the CelebFaces dataset to CelebFaces+, which contains 202,599 face images of 10,177 celebrities. People in CelebFaces+ and LFW are mutually exclusive. We randomly choose 8700 people from CelebFaces+ to learn the DeepID, and use the remaining 1477 people to learn Joint Bayesian for face verification. Since extracting DeepID from many different face patches also helps, we increase the patch number to 100 by using five different scales of patches instead of three. This results in a 32,000-dimensional DeepID feature vector, which is then reduced to 150 dimensions by PCA. Joint Bayesian learned on this 150-dimensional feature vector achieves 97.20% test accuracy on LFW.

Due to the difference in data distributions, models well fitted to CelebFaces+ may not have equal generalization ability on LFW. To solve this problem, Cao et al. [6] proposed a practical transfer learning algorithm to adapt the Joint Bayesian model from the source domain to the target domain. We implemented their algorithm by using the 1477 people from CelebFaces+ as the source domain data and nine out of ten folds from LFW as the target domain data for the transfer learning Joint Bayesian, and conduct ten-fold cross validation on LFW. The transfer learning Joint Bayesian based on our DeepID features achieves 97.45% test accuracy on LFW, which is on par with the human-level performance of 97.53%.

We compare with the state-of-the-art face verification methods on LFW. In the comparison, we report three results. The first two are trained on CelebFaces and CelebFaces+, respectively, without transfer learning, and tested on LFW. The third one is trained on CelebFaces+ with transfer learning on LFW. Table 1 comprehensively compares the accuracies, the number of facial points used for alignment, the number of outside training images (if applicable), and the final feature dimensions for each face (if applicable). Low feature dimensions indicate efficient face recognition systems. Figure 10 compares the ROC curves. Our DeepID learning method achieves the best performance on LFW. The four best methods compared used dense facial landmarks, while our faces are weakly aligned with only five points. The deep learning work (DeepFace) [32], independently developed by Facebook at the same time as this paper, achieved the second best performance of 97.25% accuracy on LFW. It utilized 3D alignment and pose transform as preprocessing, and more than seven million outside training images plus training images from LFW.

5. Conclusion and Discussion

This paper proposed to learn effective high-level features that reveal identities for face verification. The features are built on top of the feature extraction hierarchy of deep ConvNets and are summarized from multi-scale mid-level features. By representing a large number of different identities with a small number of hidden variables, highly compact and discriminative features are acquired. The features extracted from different face regions are complementary and further boost the performance. The method achieved 97.45% face verification accuracy on LFW, while only requiring weakly aligned faces.

Even more compact and discriminative DeepID can be learned if more identities are available to increase the dimensionality of prediction at the training stage. We look forward to larger training sets to further boost our performance. A recent work [27] reported 98.52% accuracy on LFW with Gaussian Processes and multi-source training sets, even surpassing human performance. This could be due to the fact that the nonparametric Bayesian kernel method can adapt model complexity to the data distribution. Gaussian processes can also be modeled with deep learning [12]. This could be another interesting direction to explore in the future.

Figure 10. ROC comparison with the state-of-the-art face verification methods on LFW. TL in our method means transfer learning Joint Bayesian.

Table 1. Comparison of state-of-the-art face verification methods on LFW. Column 2 compares accuracy. Letters in the parentheses denote the training protocols used: r denotes the restricted training protocol, where the 6000 face pairs given by LFW are used for ten-fold cross-validation; u denotes the unrestricted protocol, where additional training pairs can be generated from LFW using the identity information; o denotes using outside training data, without using training data from LFW; o+r denotes using both outside data and LFW data in the restricted protocol for training; o+u denotes using both outside data and LFW data in the unrestricted protocol for training. Column 3 compares the number of facial points used for alignment. Column 4 compares the number of outside images used for training (if applicable). The last column compares the final feature dimensions for each face (if applicable). DeepFace used six 2D points and 67 3D points for alignment. TL in our method means transfer learning Joint Bayesian.
