An Analysis of Google's FaceNet Face Recognition Paper

FaceNet paper link: facenet; download of this analysis (PDF version): paper analysis

FaceNet: A Unified Embedding for Face Recognition and Clustering 

Abstract

Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches (where the bottleneck layer serves as the face representation and a classification layer as the output). To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face.

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On the YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.

We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.

1. Introduction

In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.
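As a minimal sketch of this distance measure (in NumPy; `normalize` and `squared_l2` are hypothetical helper names, not from the paper), note that because FaceNet constrains embeddings to the unit hypersphere, the squared L2 distance is bounded:

```python
import numpy as np

def normalize(x):
    """Project an embedding onto the unit hypersphere, as FaceNet
    constrains its 128-D embeddings to satisfy ||x||_2 = 1."""
    return x / np.linalg.norm(x)

def squared_l2(a, b):
    """Squared L2 distance used as the face-similarity measure."""
    d = a - b
    return float(np.dot(d, d))

rng = np.random.default_rng(0)
x = normalize(rng.normal(size=128))
y = normalize(rng.normal(size=128))
# On the unit hypersphere the squared distance is bounded: 0 <= d <= 4,
# which is why Figure 1 later reports distances on a 0.0-4.0 scale.
d = squared_l2(x, y)
```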

Once this embedding has been produced, then the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
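A sketch of how verification and recognition reduce to simple operations on embeddings, using toy 3-D vectors in place of real 128-D FaceNet outputs (the `verify`/`recognize` helpers and the gallery are illustrative, not the paper's code; the 1.1 threshold is the value the paper later reads off Figure 1):

```python
import numpy as np

def verify(e1, e2, threshold=1.1):
    """Verification: same person iff squared L2 distance < threshold."""
    return float(np.sum((np.asarray(e1) - np.asarray(e2)) ** 2)) < threshold

def recognize(query, gallery, labels):
    """Recognition as nearest-neighbour (k-NN with k=1) classification
    against a gallery of embeddings of known identities."""
    dists = np.sum((gallery - np.asarray(query)) ** 2, axis=1)
    return labels[int(np.argmin(dists))]

# Toy 3-D embeddings standing in for FaceNet outputs.
gallery = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = ["alice", "bob"]
who = recognize([0.9, 0.1, 0.0], gallery, labels)  # nearest: "alice"
```

Clustering would likewise run an off-the-shelf algorithm (e.g. k-means) directly on the embedding vectors.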

Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.

In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed.
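The loss for a single triplet can be sketched as follows; `triplet_loss` is an illustrative NumPy function, with the margin alpha = 0.2 as used in the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Single-triplet hinge loss:
    max(||a - p||^2 - ||a - n||^2 + alpha, 0),
    where alpha is the distance margin separating the positive pair
    from the negative (the paper uses 0.2)."""
    a, p, n = map(np.asarray, (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)
    d_neg = np.sum((a - n) ** 2)
    return float(max(d_pos - d_neg + alpha, 0.0))

# An "easy" triplet (negative far beyond the margin) contributes no loss.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
# A violating triplet (negative closer than the positive) is penalised.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.1, 0.0])
```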

Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
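One common reading of "semi-hard" negative selection — a negative farther from the anchor than the positive, but still inside the margin — can be sketched like this; the function name and candidate-selection details are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def semi_hard_negative(anchor, positive, negatives, alpha=0.2):
    """Pick a 'semi-hard' negative satisfying
    ||a-p||^2 < ||a-n||^2 < ||a-p||^2 + alpha,
    i.e. farther than the positive yet still violating the margin.
    Returns the index of the chosen negative, or None if no candidate."""
    d_pos = np.sum((np.asarray(anchor) - np.asarray(positive)) ** 2)
    d_negs = np.sum((np.asarray(negatives) - np.asarray(anchor)) ** 2, axis=1)
    mask = (d_negs > d_pos) & (d_negs < d_pos + alpha)
    if not mask.any():
        return None
    idx = np.where(mask)[0]
    # Take the hardest (closest) among the semi-hard candidates.
    return int(idx[np.argmin(d_negs[idx])])

anchor, positive = np.zeros(2), np.array([1.0, 0.0])   # d_pos = 1.0
negatives = np.array([[1.05, 0.0],   # d = 1.1025: semi-hard
                      [2.0, 0.0],    # d = 4.0: too easy
                      [0.5, 0.0]])   # d = 0.25: closer than positive
chosen = semi_hard_negative(anchor, positive, negatives)
```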

As an illustration of the incredible variability that our method can handle, see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.

 

Figure 1. Illumination and pose invariance. Pose and illumination have been a long-standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical, 4.0 corresponds to the opposite spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.

An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally in sections 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.

2. Related Work

Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data-driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.

In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
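The parameter savings from 1×1 "bottleneck" convolutions can be illustrated with a quick count; the channel sizes below are made up for the example, not taken from either architecture:

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolution layer (biases ignored)."""
    return k * k * in_ch * out_ch

# A direct 3x3 convolution on 256 input/output channels...
direct = conv_params(256, 256, 3)
# ...versus first reducing to 64 channels with a 1x1 convolution,
# then applying the 3x3 convolution on the reduced volume.
bottleneck = conv_params(256, 64, 1) + conv_params(64, 256, 3)
# direct = 589824 weights, bottleneck = 163840: about 3.6x fewer here,
# and stacking such blocks compounds the savings.
```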

There is a vast corpus of face verification and recognition works. Reviewing it is outside the scope of this paper, so we will only briefly discuss the most relevant recent work.

The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.

Zhenyao et al. [23] employ a deep network to "warp" faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.

Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so-called Siamese network where they directly optimize the L1-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the χ² kernel) of those networks are combined using a non-linear SVM.
