Original paper: FaceNet: A Unified Embedding for Face Recognition and Clustering
FaceNet
Abstract
Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face.
On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.
We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.
1. Introduction
In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.
Once this embedding has been produced, the aforementioned tasks become straightforward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
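To make the verification and recognition steps concrete, here is a minimal NumPy sketch. It assumes embeddings are already available as plain vectors; the function names (`squared_l2`, `same_person`, `knn_identify`) are hypothetical, and the default threshold of 1.1 is borrowed from the Figure 1 caption purely for illustration, not as a recommended operating point.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each vector (last axis) to unit length."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    d = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return float(np.dot(d, d))

def same_person(emb_a, emb_b, threshold=1.1):
    """Verification: two faces match if their distance is under the threshold."""
    return squared_l2(emb_a, emb_b) <= threshold

def knn_identify(query, gallery, labels, k=1):
    """Recognition: majority vote among the k nearest gallery embeddings."""
    gallery = np.asarray(gallery, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    dists = np.sum((gallery - query) ** 2, axis=1)
    votes = {}
    for i in np.argsort(dists)[:k]:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

# Toy 2-D "embeddings" standing in for real 128-D FaceNet outputs.
gallery = l2_normalize([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
labels = ["alice", "bob", "alice"]
query = l2_normalize([1.0, 0.05])
predicted = knn_identify(query, gallery, labels, k=1)
```

Clustering would follow the same pattern: the embeddings are handed as feature vectors to any standard k-means or agglomerative implementation.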
Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.
In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment is performed, other than scale and translation.
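The idea of separating the positive pair from the negative by a margin can be sketched as a hinge on squared distances. The snippet below is a minimal NumPy illustration, not the authors' training code; the function name `triplet_loss` and the default margin of 0.2 are assumptions made for this example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared L2 distances.

    Each argument is an embedding of shape (d,) or a batch of shape (n, d).
    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an assumed value in this sketch).
    """
    anchor = np.asarray(anchor, dtype=np.float64)
    positive = np.asarray(positive, dtype=np.float64)
    negative = np.asarray(negative, dtype=np.float64)
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(pos_dist - neg_dist + margin, 0.0)

# An easy triplet (negative far away) and a violated one (negative on top of the anchor).
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0])
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Only triplets that violate the margin produce a non-zero loss, which is why the choice of triplets matters so much for training.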
Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
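As a rough illustration of mining negatives online, the sketch below picks, for every anchor in a mini-batch, the closest embedding that belongs to a different identity. This is a simplified in-batch "hardest negative" rule chosen for the example; the exact selection rule used by the paper is not described in this section, and the function name `hardest_negatives` is hypothetical.

```python
import numpy as np

def hardest_negatives(embeddings, labels):
    """For each anchor, return the index of the nearest different-identity sample.

    Returns -1 for anchors that have no negative in the batch.
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    # Pairwise squared L2 distances, shape (n, n).
    diff = emb[:, None, :] - emb[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)
    # Mask out same-identity pairs (including each sample with itself).
    same = labels[:, None] == labels[None, :]
    masked = np.where(same, np.inf, dists)
    idx = np.argmin(masked, axis=1)
    has_negative = ~same.all(axis=1)
    return np.where(has_negative, idx, -1)

batch_emb = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.5, 0.0]]
batch_ids = [0, 0, 1, 1]
neg_idx = hardest_negatives(batch_emb, batch_ids)
```

Because the embeddings change as the network trains, recomputing this selection on every mini-batch keeps the chosen negatives difficult relative to the current model.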
Figure 1. Illumination and Pose invariance.
Pose and illumination have been a long-standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical; 4.0 corresponds to the opposite end of the spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.
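The 0.0 to 4.0 range quoted in the caption is what one would expect if the embeddings a and b are normalized to unit length. That normalization is an assumption here, since this section does not state it. Under that assumption, a short derivation:

```latex
\|a - b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 - 2\,a^\top b = 2 - 2\,a^\top b,
\qquad -1 \le a^\top b \le 1
\;\Longrightarrow\; 0 \le \|a - b\|_2^2 \le 4 .
```

The lower bound is reached when the two embeddings coincide and the upper bound when they point in opposite directions.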
As an illustration of the incredible variability that our method can handle, see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.
An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally, in sections 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.