Paper Reading Notes (23): FaceNet: A Unified Embedding for Face Recognition and Clustering

__Sunshine__

Article link: https://blog.csdn.net/sunshine_010/article/details/80012216

Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face.

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.
We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.

In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.

Once this embedding has been produced, then the aforementioned tasks become straightforward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
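
To make the verification and recognition steps concrete, here is a minimal NumPy sketch. It assumes embeddings are already available as plain vectors; the function names, the toy 2-D gallery, and the value `k=3` are hypothetical choices for illustration, and the default threshold of 1.1 is borrowed from the Figure 1 caption rather than being a tuned value.

```python
import numpy as np

def squared_l2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.dot(diff, diff))

def same_person(emb_a, emb_b, threshold=1.1):
    """Verification: accept the pair if its distance is below the threshold."""
    return squared_l2(emb_a, emb_b) < threshold

def knn_identify(query, gallery, labels, k=3):
    """Recognition: majority vote among the k nearest gallery embeddings."""
    gallery = np.asarray(gallery, dtype=float)
    dists = np.sum((gallery - np.asarray(query, dtype=float)) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

# Toy 2-D "embeddings" (hypothetical; FaceNet embeddings are 128-D).
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["alice", "alice", "bob", "bob"]
query = np.array([0.95, 0.05])

print(same_person(gallery[0], gallery[1]))   # two nearby embeddings
print(same_person(gallery[0], gallery[2]))   # two distant embeddings
print(knn_identify(query, gallery, labels))
```

Clustering would follow the same pattern: the embedding matrix can be handed to any standard k-means or agglomerative clustering routine without face-specific changes.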

Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.
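
The remark that PCA is a linear transformation a single network layer could learn can be checked numerically. The sketch below uses random data as a stand-in for bottleneck features (the shapes and the choice of 3 output dimensions are arbitrary assumptions) and shows that a PCA projection is the same as one affine layer `y = x W + b`.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 10))   # hypothetical bottleneck features
k = 3                                   # hypothetical target dimensionality

# Classical PCA: centre the data, then project onto the top-k principal axes.
mean = features.mean(axis=0)
_, _, vt = np.linalg.svd(features - mean, full_matrices=False)
components = vt[:k]                     # shape (k, 10)
pca_out = (features - mean) @ components.T

# The same map written as a single linear layer y = x W + b.
W = components.T                        # shape (10, k)
b = -mean @ components.T                # shape (k,)
layer_out = features @ W + b

print(np.allclose(pca_out, layer_out))
```

Because the two outputs agree, a network with one extra fully connected layer can represent the PCA projection, and training can adjust it further.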

In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed.
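
A triplet loss with a distance margin can be sketched in a few lines of NumPy. This is an illustrative hinge-style version, not necessarily the exact training code: the function name, the averaging over the batch, and the margin value of 0.2 are assumptions made here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss averaged over a batch.

    Each argument has shape (batch, dim). The loss is zero for a triplet
    once the negative is farther from the anchor than the positive by at
    least `margin` (in squared L2 distance).
    """
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(pos_dist - neg_dist + margin, 0.0)))

anchor = np.array([[1.0, 0.0]])
positive = np.array([[0.8, 0.6]])        # same identity as the anchor
negative_far = np.array([[-1.0, 0.0]])   # well separated: no loss
negative_near = np.array([[0.8, -0.6]])  # as close as the positive: loss = margin

print(triplet_loss(anchor, positive, negative_far))
print(triplet_loss(anchor, positive, negative_near))
```

Only triplets that violate the margin contribute a gradient, which is why the choice of triplets matters so much during training.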

Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
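
One way to picture online negative mining is to pick, inside each mini-batch, a negative that is informative but not extreme. The sketch below uses a "semi-hard" rule (farther than the positive, yet still inside the margin). The rule, the function name, the margin, and the toy data are assumptions for illustration and are not taken from the passage above.

```python
import numpy as np

def pick_semi_hard_negative(embeddings, labels, anchor_idx, positive_idx, margin=0.2):
    """Return the index of a semi-hard negative for one anchor-positive pair.

    A negative counts as semi-hard here when it is farther from the anchor
    than the positive but still within `margin` of it (squared L2 distance).
    Returns None if the batch contains no such negative.
    """
    dists = np.sum((embeddings - embeddings[anchor_idx]) ** 2, axis=1)
    pos_dist = dists[positive_idx]
    is_negative = labels != labels[anchor_idx]
    semi_hard = is_negative & (dists > pos_dist) & (dists < pos_dist + margin)
    candidates = np.flatnonzero(semi_hard)
    if candidates.size == 0:
        return None
    # Among the candidates, take the one closest to the anchor.
    return int(candidates[np.argmin(dists[candidates])])

embeddings = np.array([
    [0.0, 0.0],   # 0: anchor, identity 0
    [0.3, 0.0],   # 1: positive, identity 0
    [0.1, 0.0],   # 2: negative closer than the positive
    [0.4, 0.0],   # 3: negative just beyond the positive
    [2.0, 0.0],   # 4: negative far away
])
labels = np.array([0, 0, 1, 1, 1])

print(pick_semi_hard_negative(embeddings, labels, anchor_idx=0, positive_idx=1))
```

Because the selection uses the current embeddings, the chosen negatives change as the network improves, which is the sense in which the mining is online.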

As an illustration of the incredible variability that our method can handle see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.

An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally in section 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.

Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.

In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
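
A 1×1×d convolution is just the same linear map applied to the channel vector at every spatial position. The NumPy sketch below illustrates this with made-up shapes (a 4×5 feature map with 8 input channels and d = 3 output channels); it is not tied to any particular layer of the networks described here.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 5, 8))   # height, width, input channels
weights = rng.normal(size=(8, 3))          # input channels -> d = 3 output channels

# 1x1 convolution: one matrix multiply over the channel axis at every pixel.
conv_out = np.einsum("hwc,cd->hwd", feature_map, weights)

# Equivalent explicit loop over pixels, for comparison.
loop_out = np.empty((4, 5, 3))
for i in range(4):
    for j in range(5):
        loop_out[i, j] = feature_map[i, j] @ weights

print(conv_out.shape, np.allclose(conv_out, loop_out))
```

Choosing d smaller than the input channel count makes such a layer a cheap way to reduce dimensionality before more expensive convolutions.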

There is a vast corpus of face verification and recognition works. Reviewing it is out of the scope of this paper so we will only briefly discuss the most relevant recent work.

The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.

Zhenyao et al. [23] employ a deep network to “warp” faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.

Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so-called Siamese network where they directly optimize the L1-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the χ² kernel) of those networks are combined using a non-linear SVM.

Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these networks, each operating on a different face patch. For their final performance on LFW (99.47% [15]) the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2] that effectively correspond to a linear transform in the embedding space are employed. Their method does not require explicit 2D/3D alignment. The networks are trained by using a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L₂-distance between faces of the same identity and enforces a margin between the distance of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.

A similar loss to the one used here was explored in Wang et al. [18] for ranking images by semantic and visual similarity.

FaceNet uses a deep convolutional network. We discuss two different core architectures: The Zeiler&Fergus [22] style networks and the recent Inception [16] type networks. The details of these networks are described in section 3.3.

Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding f(x), from an image x into a feature space R^d, such that the squared distance between all faces, independent of imaging conditions, of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.
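
The goal stated in words above can be written as a constraint on triplets. In the notation below, which is introduced here for illustration and is not part of the passage, x^a is an anchor image, x^p an image of the same identity, x^n an image of a different identity, α a margin, and N the number of triplets:

```latex
\[
\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha \;<\; \|f(x_i^a) - f(x_i^n)\|_2^2
\]
\[
L = \sum_{i=1}^{N} \max\!\Bigl(0,\; \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha\Bigr)
\]
```

The loss is the hinge relaxation of the constraint: a triplet that already satisfies the inequality contributes zero, and a violating triplet contributes the amount by which it falls short.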

Although we did not directly compare to other losses, e.g. the one using pairs of positives and negatives, as used in [14] Eq. (2), we believe that the triplet loss is more suitable for face verification. The motivation is that the loss from [14] encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.

The following section describes this triplet loss and how it can be learned efficiently at scale.

These are very interesting findings and it is somewhat surprising that it works so well. Future work can explore how far this idea can be extended. Presumably there is a limit as to how much the v2 embedding can improve over v1, while still being compatible. Additionally it would be interesting to train small networks that can run on a mobile phone and are compatible to a larger server side model.

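We provide a method for directly learning an embedding into a Euclidean space for face verification. This sets it apart from other methods [15, 17] that use a CNN bottleneck layer or require additional post-processing, such as concatenating multiple models and PCA, as well as SVM classification. Our end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.
Another strength of our model is that it requires only minimal alignment (a tight crop around the face area). [17], for example, performs a complex 3D alignment. We also experimented with a similarity-transform alignment and noticed that it can actually improve performance slightly. It is not clear whether it is worth the extra complexity.
Future work will focus on better understanding the error cases, further improving the model, and reducing model size and CPU requirements. We will also look into ways of improving the currently extremely long training times, e.g. variations of our curriculum learning with smaller batch sizes, and offline as well as online positive and negative mining.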

Figure 1. Illumination and Pose invariance. Pose and illumination have been a long standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical, 4.0 corresponds to the opposite spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.

Figure 2. Model structure. Our network consists of a batch input layer and a deep CNN followed by L₂ normalization, which results in the face embedding. This is followed by the triplet loss during training.
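
The L₂ normalization step also explains the 0.0 to 4.0 distance range quoted in the Figure 1 caption: for unit-length vectors, the squared distance equals 2 − 2(a · b), which lies between 0 and 4. A small NumPy check, using made-up 2-D vectors in place of real CNN outputs:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

raw = np.array([[3.0, 4.0], [-6.0, -8.0], [5.0, 0.0]])  # hypothetical CNN outputs
emb = l2_normalize(raw)

# Squared distances between unit vectors fall in [0, 4].
d_identical = float(np.sum((emb[0] - emb[0]) ** 2))   # same point
d_opposite = float(np.sum((emb[0] - emb[1]) ** 2))    # antipodal points

print(np.linalg.norm(emb, axis=1))
print(d_identical, d_opposite)
```

Normalizing this way puts every embedding on the unit hypersphere, so a single distance threshold has the same meaning for every pair of faces.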

Figure 3. The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.
