An Analysis of Google's FaceNet Face Recognition Paper

FaceNet paper link: facenet; download of this analysis (PDF version): paper analysis

FaceNet: A Unified Embedding for Face Recognition and Clustering 

Abstract

Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches (where the bottleneck layer serves as the face representation and a classification layer as the output). To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face.

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On the YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.

We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.

1. Introduction

In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.
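As a minimal sketch of this distance measure (in NumPy; `normalize` and `squared_l2` are hypothetical helper names, not from the paper), note that because FaceNet constrains embeddings to the unit hypersphere, the squared L2 distance is bounded:

```python
import numpy as np

def normalize(x):
    """Project an embedding onto the unit hypersphere, as FaceNet
    constrains its 128-D embeddings to satisfy ||x||_2 = 1."""
    return x / np.linalg.norm(x)

def squared_l2(a, b):
    """Squared L2 distance used as the face-similarity measure."""
    d = a - b
    return float(np.dot(d, d))

rng = np.random.default_rng(0)
x = normalize(rng.normal(size=128))
y = normalize(rng.normal(size=128))
# On the unit hypersphere the squared distance is bounded: 0 <= d <= 4,
# which is why Figure 1 later reports distances on a 0.0-4.0 scale.
d = squared_l2(x, y)
```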

Once this embedding has been produced, then the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
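A sketch of how verification and recognition reduce to simple operations on embeddings, using toy 3-D vectors in place of real 128-D FaceNet outputs (the `verify`/`recognize` helpers and the gallery are illustrative, not the paper's code; the 1.1 threshold is the value the paper later reads off Figure 1):

```python
import numpy as np

def verify(e1, e2, threshold=1.1):
    """Verification: same person iff squared L2 distance < threshold."""
    return float(np.sum((np.asarray(e1) - np.asarray(e2)) ** 2)) < threshold

def recognize(query, gallery, labels):
    """Recognition as nearest-neighbour (k-NN with k=1) classification
    against a gallery of embeddings of known identities."""
    dists = np.sum((gallery - np.asarray(query)) ** 2, axis=1)
    return labels[int(np.argmin(dists))]

# Toy 3-D embeddings standing in for FaceNet outputs.
gallery = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = ["alice", "bob"]
who = recognize([0.9, 0.1, 0.0], gallery, labels)  # nearest: "alice"
```

Clustering would likewise run an off-the-shelf algorithm (e.g. k-means) directly on the embedding vectors.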

Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.

In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed.
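The loss for a single triplet can be sketched as follows; `triplet_loss` is an illustrative NumPy function, with the margin alpha = 0.2 as used in the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Single-triplet hinge loss:
    max(||a - p||^2 - ||a - n||^2 + alpha, 0),
    where alpha is the distance margin separating the positive pair
    from the negative (the paper uses 0.2)."""
    a, p, n = map(np.asarray, (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)
    d_neg = np.sum((a - n) ** 2)
    return float(max(d_pos - d_neg + alpha, 0.0))

# An "easy" triplet (negative far beyond the margin) contributes no loss.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
# A violating triplet (negative closer than the positive) is penalised.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.1, 0.0])
```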

Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
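One common reading of "semi-hard" negative selection — a negative farther from the anchor than the positive, but still inside the margin — can be sketched like this; the function name and candidate-selection details are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def semi_hard_negative(anchor, positive, negatives, alpha=0.2):
    """Pick a 'semi-hard' negative satisfying
    ||a-p||^2 < ||a-n||^2 < ||a-p||^2 + alpha,
    i.e. farther than the positive yet still violating the margin.
    Returns the index of the chosen negative, or None if no candidate."""
    d_pos = np.sum((np.asarray(anchor) - np.asarray(positive)) ** 2)
    d_negs = np.sum((np.asarray(negatives) - np.asarray(anchor)) ** 2, axis=1)
    mask = (d_negs > d_pos) & (d_negs < d_pos + alpha)
    if not mask.any():
        return None
    idx = np.where(mask)[0]
    # Take the hardest (closest) among the semi-hard candidates.
    return int(idx[np.argmin(d_negs[idx])])

anchor, positive = np.zeros(2), np.array([1.0, 0.0])   # d_pos = 1.0
negatives = np.array([[1.05, 0.0],   # d = 1.1025: semi-hard
                      [2.0, 0.0],    # d = 4.0: too easy
                      [0.5, 0.0]])   # d = 0.25: closer than positive
chosen = semi_hard_negative(anchor, positive, negatives)
```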

As an illustration of the incredible variability that our method can handle, see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.

 

Figure 1. Illumination and pose invariance. Pose and illumination have been a long-standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical, 4.0 corresponds to the opposite spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.

An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally in sections 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.

2. Related Work

Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data-driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.

In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
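The parameter savings from 1×1 "bottleneck" convolutions can be illustrated with a quick count; the channel sizes below are made up for the example, not taken from either architecture:

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolution layer (biases ignored)."""
    return k * k * in_ch * out_ch

# A direct 3x3 convolution on 256 input/output channels...
direct = conv_params(256, 256, 3)
# ...versus first reducing to 64 channels with a 1x1 convolution,
# then applying the 3x3 convolution on the reduced volume.
bottleneck = conv_params(256, 64, 1) + conv_params(64, 256, 3)
# direct = 589824 weights, bottleneck = 163840: about 3.6x fewer here,
# and stacking such blocks compounds the savings.
```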

There is a vast corpus of face verification and recognition works. Reviewing it is outside the scope of this paper, so we will only briefly discuss the most relevant recent work.

The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.

Zhenyao et al. [23] employ a deep network to "warp" faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.

Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so-called Siamese network where they directly optimize the L1-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the χ² kernel) of those networks are combined using a non-linear SVM.
