FaceNet Paper Translation Study (Part 2)

Original paper: FaceNet: A Unified Embedding for Face Recognition and Clustering

2. Related Work

Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.

In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several $1 \times 1 \times d$ convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.

There is a vast corpus of face verification and recognition works. Reviewing it is out of the scope of this paper so we will only briefly discuss the most relevant recent work.

The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.

Zhenyao et al. [23] employ a deep network to “warp” faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.

Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so called Siamese network where they directly optimize the $L_1$-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the $\chi^2$ kernel) of those networks are combined using a non-linear SVM.

Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these networks, each operating on a different face patch. For their final performance on LFW (99.47% [15]) the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2] that effectively correspond to a linear transform in the embedding space are employed. Their method does not require explicit 2D/3D alignment. The networks are trained by using a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.

A similar loss to the one used here was explored in Wang et al. [18] for ranking images by semantic and visual similarity.

3. Method

FaceNet uses a deep convolutional network. We discuss two different core architectures: The Zeiler&Fergus [22] style networks and the recent Inception [16] type networks. The details of these networks are described in section 3.3.

Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding $f(x)$, from an image $x$ into a feature space $\Re^d$, such that the squared distance between all faces, independent of imaging conditions, of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.

Figure 2. Model structure.
Our network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.
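To make the pipeline in Figure 2 concrete, here is a minimal sketch in PyTorch (an assumption on my part; the paper predates it, and `FaceEmbedder` with its toy backbone is a hypothetical stand-in for the networks of section 3.3). The key detail is the final L2 normalization that yields the face embedding:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceEmbedder(nn.Module):
    """Batch of face crops -> deep CNN -> L2-normalized embedding."""

    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        # Toy backbone; the paper uses much deeper Zeiler&Fergus or
        # Inception-style networks (see section 3.3).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embedding_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Constrain the embedding to the unit hypersphere: ||f(x)||_2 = 1.
        return F.normalize(self.backbone(x), p=2, dim=1)
```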

Although we did not directly compare to other losses, e.g. the one using pairs of positives and negatives, as used in [14] Eq. (2), we believe that the triplet loss is more suitable for face verification. The motivation is that the loss from [14] encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.

The following section describes this triplet loss and how it can be learned efficiently at scale.

3.1. Triplet Loss

The embedding is represented by $f(x) \in \Re^d$. It embeds an image $x$ into a $d$-dimensional Euclidean space. Additionally, we constrain this embedding to live on the $d$-dimensional hypersphere, i.e. $\|f(x)\|_2 = 1$. This loss is motivated in [19] in the context of nearest-neighbor classification. Here we want to ensure that an image $x_i^a$ (anchor) of a specific person is closer to all other images $x_i^p$ (positive) of the same person than it is to any image $x_i^n$ (negative) of any other person. This is visualized in Figure 3.
Thus we want,
$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2, \quad (1)$$

$$\forall \, \left(f(x_i^a), f(x_i^p), f(x_i^n)\right) \in \tau. \quad (2)$$

where $\alpha$ is a margin that is enforced between positive and negative pairs. $\tau$ is the set of all possible triplets in the training set and has cardinality $N$.

The loss that is being minimized is then

$$L = \sum_{i}^{N} \left[\, \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \,\right]_{+}. \quad (3)$$
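Read literally, Eq. (3) translates into a few lines of code. A sketch assuming PyTorch and already L2-normalized embedding tensors of shape (N, d); whether to sum or average over the batch is an implementation choice:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Eq. (3): sum_i [ ||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + alpha ]_+ ."""
    pos_dist = (anchor - positive).pow(2).sum(dim=1)  # squared L2 distances
    neg_dist = (anchor - negative).pow(2).sum(dim=1)
    # The hinge [.]_+ zeroes out triplets that already satisfy Eq. (1).
    return F.relu(pos_dist - neg_dist + alpha).sum()
```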
Generating all possible triplets would result in many triplets that are easily satisfied (i.e. fulfill the constraint in Eq. (1)). These triplets would not contribute to the training and result in slower convergence, as they would still be passed through the network. It is crucial to select hard triplets, that are active and can therefore contribute to improving the model. The following section talks about the different approaches we use for the triplet selection.

3.2. Triplet Selection

In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (1). This means that, given $x_i^a$, we want to select an $x_i^p$ (hard positive) such that $\mathrm{argmax}_{x_i^p} \|f(x_i^a) - f(x_i^p)\|_2^2$ and similarly $x_i^n$ (hard negative) such that $\mathrm{argmin}_{x_i^n} \|f(x_i^a) - f(x_i^n)\|_2^2$.

It is infeasible to compute the $\mathrm{argmin}$ and $\mathrm{argmax}$ across the whole training set. Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:

  • Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
  • Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

Here, we focus on the online generation and use large mini-batches in the order of a few thousand exemplars and only compute the argmin and argmax within a mini-batch.

To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per minibatch. Additionally, randomly sampled negative faces are added to each mini-batch.
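A hypothetical sampler illustrating this scheme (plain Python; the dictionary layout, the counts, and the function name are my assumptions, not from the paper):

```python
import random

def sample_minibatch(paths_by_identity: dict, faces_per_identity: int = 40,
                     num_identities: int = 40,
                     num_random_negatives: int = 200) -> list:
    """Draw ~40 faces for each sampled identity, then add random negatives."""
    chosen = random.sample(list(paths_by_identity), num_identities)
    batch = []
    for ident in chosen:
        faces = paths_by_identity[ident]
        k = min(faces_per_identity, len(faces))
        batch += [(path, ident) for path in random.sample(faces, k)]
    # Randomly sampled negative faces drawn from the remaining identities.
    others = [(path, ident) for ident, faces in paths_by_identity.items()
              if ident not in chosen for path in faces]
    batch += random.sample(others, min(num_random_negatives, len(others)))
    random.shuffle(batch)
    return batch
```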

Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We don’t have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training.

We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive.

Selecting the hardest negatives can in practice lead to bad local minima early on in training, specifically it can result in a collapsed model (i.e. $f(x) = 0$). In order to mitigate this, it helps to select $x_i^n$ such that

$$\|f(x_i^a) - f(x_i^p)\|_2^2 < \|f(x_i^a) - f(x_i^n)\|_2^2. \quad (4)$$

We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin $\alpha$.
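Putting the rules of this subsection together, here is a sketch of online mining within a mini-batch: every anchor-positive pair is used (as described above), and for each pair a semi-hard negative satisfying Eq. (4) and lying inside the margin is picked. PyTorch is assumed and the function name is hypothetical:

```python
import torch

def mine_semi_hard_triplets(embeddings: torch.Tensor, labels: torch.Tensor,
                            alpha: float = 0.2) -> list:
    """Return (anchor, positive, negative) index triplets for one mini-batch."""
    dist = torch.cdist(embeddings, embeddings).pow(2)  # squared L2 distances
    triplets = []
    for a in range(len(labels)):
        for p in torch.nonzero(labels == labels[a]).flatten().tolist():
            if p == a:
                continue
            d_ap = dist[a, p]
            # Semi-hard: farther than the positive (Eq. (4)), inside the margin.
            mask = (labels != labels[a]) & (dist[a] > d_ap) & (dist[a] < d_ap + alpha)
            candidates = torch.nonzero(mask).flatten()
            if len(candidates) > 0:
                n = candidates[torch.argmin(dist[a, candidates])]
                triplets.append((a, p, int(n)))
    return triplets
```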

As mentioned before, correct triplet selection is crucial for fast convergence. On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20]. On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient. The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches. In most experiments we use a batch size of around 1,800 exemplars.

3.3. Deep Convolutional Networks

In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05 which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500h of training, but additional training can still significantly improve performance. The margin $\alpha$ is set to 0.2.
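Under the same assumptions as the earlier sketches (the hypothetical `FaceEmbedder`, `triplet_loss`, and `mine_semi_hard_triplets`), one training step with the stated hyperparameters might look like:

```python
import torch

model = FaceEmbedder(embedding_dim=128)
# The paper trains with standard backprop and AdaGrad, starting at lr = 0.05.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

def train_step(images: torch.Tensor, labels: torch.Tensor,
               alpha: float = 0.2) -> float:
    emb = model(images)                                   # L2-normalized
    triplets = mine_semi_hard_triplets(emb.detach(), labels, alpha)
    if not triplets:
        return 0.0  # nothing hard enough in this mini-batch
    a, p, n = (list(idx) for idx in zip(*triplets))
    loss = triplet_loss(emb[a], emb[p], emb[n], alpha)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```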

We used two types of architectures and explore their trade-offs in more detail in the experimental section. Their practical differences lie in the difference of parameters and FLOPS. The best model may be different depending on the application. E.g. a model running in a datacenter can have many parameters and require a large number of FLOPS, whereas a model running on a mobile phone needs to have few parameters, so that it can fit into memory. All our models use rectified linear units as the non-linear activation function.
Table 1. NN1. This table shows the structure of our Zeiler&Fergus [22] based model with $1 \times 1$ convolutions inspired by [9]. The input and output sizes are described in $rows \times cols \times filters$. The kernel is specified as $rows \times cols, stride$ and the maxout [6] pooling size as $p = 2$.

The first category, shown in Table 1, adds $1 \times 1 \times d$ convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.

The second category we use is based on GoogLeNet style Inception models [16]. These models have $20\times$ fewer parameters (around 6.6M-7.5M) and up to $5\times$ fewer FLOPS (between 500M-1.6B). Some of these models are dramatically reduced in size (both depth and number of filters), so that they can be run on a mobile phone. One, NNS1, has 26M parameters and only requires 220M FLOPS per image. The other, NNS2, has 4.3M parameters and 20M FLOPS. Table 2 describes NN2, our largest network, in detail. NN3 is identical in architecture but has a reduced input size of 160x160. NN4 has an input size of only 96x96, thereby drastically reducing the CPU requirements (285M FLOPS vs 1.6B for NN2). In addition to the reduced input size it does not use 5x5 convolutions in the higher layers as the receptive field is already too small by then. Generally we found that the 5x5 convolutions can be removed throughout with only a minor drop in accuracy. Figure 4 compares all our models.
Table 2. NN2. Details of the NN2 Inception incarnation. This model is almost identical to the one described in [16]. The two major differences are the use of $L_2$ pooling instead of max pooling (m), here specified. I.e. instead of taking the spatial max the $L_2$ norm is computed. The pooling is always 3×3 (aside from the final average pooling) and in parallel to the convolutional modules inside each Inception module. If there is a dimensionality reduction after the pooling it is denoted with p. 1×1, 3×3, and 5×5 pooling are then concatenated to get the final output.
Figure 4. FLOPS vs. Accuracy trade-off. Shown is the trade-off between FLOPS and accuracy for a wide range of different model sizes and architectures. Highlighted are the four models that we focus on in our experiments.

