Towards Flops-constrained Face Recognition (Paper Translation)

The images failed to upload; a version with the figures is available at:

https://note.youdao.com/ynoteshare1/index.html?id=1755761dbb6032f160ba30975afa985d&type=note

 

Paper: https://arxiv.org/abs/1909.00632

Challenge: https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/

 

This model won first place in the video-based face recognition track of the 2019 Lightweight Face Recognition (LFR) challenge.

Towards Flops-constrained Face Recognition


 

Abstract


Large scale face recognition is challenging, especially when the computational budget is limited. Given a flops upper bound, the key is to find the optimal neural network architecture and optimization method. In this article, we briefly introduce the solutions of team 'trojans' for the ICCV19 Lightweight Face Recognition Challenge [2]. The challenge requires each submission to be one single model with a computational budget no higher than 30 Gflops. We introduce a searched network architecture 'Efficient PolyFace' based on the flops constraint, a novel loss function 'ArcNegFace', a novel frame aggregation method 'QAN++', together with a bag of useful tricks used in our implementation (augmentations, regular face, label smoothing, anchor finetuning, etc.). Our basic model, 'Efficient PolyFace', takes 28.25 Gflops for the 'deepglint-large' image-based track, and the 'PolyFace+QAN++' solution takes 24.12 Gflops for the 'iQiyi-large' video-based track. These two solutions achieve 94.198% @ 1e-8 and 72.981% @ 1e-4 in the two tracks respectively, which are the state-of-the-art results in this competition.

 


 

1. Lightweight Face Recognition Challenge


The ICCV19 Lightweight Face Recognition Challenge [2] is one of the strictest competitions in open-set face recognition. It requires strict consistency of the training data [4], face detector [3], and alignment method between different submissions. There are four tracks in this competition: small image-based, large image-based, small video-based, and large video-based. The computational budget is 1 Gflops and 30 Gflops for the small and large tracks, respectively.

 


 

2. Image-based baseline model


We adopt two different CNN architectures, R100 [1] and the proposed PolyFace, as our base models. The input size of both architectures is 112 × 112, as required by the challenge [2].

PolyFace. Similar to the structure of PolyNet [11], the basic PolyFace is designed by repeating its basic blocks. Details of the basic blocks are shown in Fig 1. In the stem block of the proposed PolyFace, the spatial size is first upsampled to 235 × 235 and then downsized to 112 × 112 by an upsampling layer and a convolutional layer, which we call the 'stem-enrichment block'. The data flow in the whole PolyFace is:

Stem block → A × blockA → blockA2B → B × blockB → blockB2C → C × blockC

At the end of every backbone, a fully connected layer with 256 output channels is adopted to generate the representation, followed by a BatchNorm1d layer. The block numbers [A, B, C] in the base model are [10, 20, 10].
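For concreteness, here is a minimal PyTorch sketch of the macro-structure described above. It is a sketch under stated assumptions, not the authors' implementation: the internal PolyNet-style design of blockA/blockB/blockC is replaced by stand-in residual blocks, the channel widths and the global pooling before the fully connected layer are guesses, and the stem kernel size is chosen only so that 235 × 235 maps back to 112 × 112.

```python
import torch
import torch.nn as nn

class StandInBlock(nn.Module):
    """Placeholder for the PolyNet-style blockA/B/C (hypothetical design)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class PolyFaceSketch(nn.Module):
    def __init__(self, blocks=(10, 20, 10), channels=(64, 128, 256), embed_dim=256):
        super().__init__()
        # 'stem-enrichment block': upsample 112 -> 235, conv back down to 112
        # ((235 - 13) // 2 + 1 = 112; the kernel size is an assumption)
        self.stem = nn.Sequential(
            nn.Upsample(size=(235, 235), mode='bilinear', align_corners=False),
            nn.Conv2d(3, channels[0], 13, stride=2, bias=False),
            nn.BatchNorm2d(channels[0]), nn.ReLU(inplace=True))
        stages = []
        for i, (n, c) in enumerate(zip(blocks, channels)):
            stages += [StandInBlock(c) for _ in range(n)]      # A/B/C x block
            if i + 1 < len(channels):                          # blockA2B, blockB2C
                stages.append(nn.Sequential(
                    nn.Conv2d(c, channels[i + 1], 3, stride=2, padding=1, bias=False),
                    nn.BatchNorm2d(channels[i + 1]), nn.ReLU(inplace=True)))
        self.stages = nn.Sequential(*stages)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels[-1], embed_dim)   # 256-d representation
        self.bn = nn.BatchNorm1d(embed_dim)

    def forward(self, x):                    # x: (N, 3, 112, 112)
        h = self.pool(self.stages(self.stem(x))).flatten(1)
        return self.bn(self.fc(h))           # (N, 256) embedding
```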

Training details. During the training of the base models, 16 GPUs are used to enable a global batch size of 1,024. Synchronized BN is used with group size 1. The total number of training iterations is set to 100,000; the initial learning rate is 0.001 and warms up to 0.4 during the first 10,000 iterations. The weight decay is set to 1e-5 and the momentum to 0.9. Dropout with a drop rate of 0.4 on the final embedding is used to prevent overfitting.
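The warmup can be expressed as a simple function of the iteration. The sketch below assumes a linear warmup shape and a flat rate afterwards; the paper only gives the two endpoints, and Section 5.9 explores a cosine decay instead.

```python
def base_lr(it, warmup=10_000, lr_start=0.001, lr_peak=0.4):
    """Learning rate at iteration `it` (warmup shape is an assumption)."""
    if it < warmup:
        return lr_start + (lr_peak - lr_start) * it / warmup
    return lr_peak  # post-warmup decay is not specified for the base setting
```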

The results of the two base models on the challenge test server [2] are shown in Tab 1.

 


 

3. New loss function: ArcNegFace


We introduce a new robust loss named ArcNegFace in this section. Unlike most recent novel losses, which try to find an 'optimal' logit curve to regularize the margin between embeddings and class anchors, ArcNegFace also takes the distance between anchors into consideration.

Define $\theta_{y_i}$ as the angle between the feature $f_i$ with label $y_i$ and the anchor weight $W_{y_i}$; the original ArcFace can be defined as:

$$L_{arc} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}} \tag{1}$$

where the hyperparameters $s$ and $m$ represent the scale and the margin. In order to exploit hard negative mining and weaken the influence of label errors, we improve ArcFace into ArcNegFace which, given the definitions below (the original formula image failed to upload), takes the form:

$$L_{neg} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}t_{j,y_i}\,e^{s\cos\theta_j}} \tag{2}$$

where $t_{j,y_i} = G(C_j, C_{y_i})$, with $C_j = \cos\theta_j$ and $C_{y_i} = \cos(\theta_{y_i}+m)$. The function $G(\cdot,\cdot)$ is a Gaussian, formulated as:

$$G(x, y) = \alpha\,\exp\!\left(-\frac{(x - y - \mu)^2}{2\sigma^2}\right) \tag{3}$$

where $\alpha$, $\mu$, and $\sigma$ are set to 1.2, 0, and 1, respectively. The performance of ArcNegFace is shown in Tab 2.
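Below is a PyTorch sketch of the loss as reconstructed above. Because the original formula images are missing, the exact placement of $t_{j,y_i}$ is an assumption (here it weights each negative's exponential term, computed in log space for stability); the Gaussian weighting follows Eq 3 with α = 1.2, µ = 0, σ = 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcNegFace(nn.Module):
    """Sketch of ArcNegFace as reconstructed in Eqs 2-3; not the authors' code."""
    def __init__(self, feat_dim, num_classes, s=64.0, m=0.5,
                 alpha=1.2, mu=0.0, sigma=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m
        self.alpha, self.mu, self.sigma = alpha, mu, sigma

    def forward(self, feats, labels):
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))   # (N, C)
        cos_y = cos.gather(1, labels.view(-1, 1)).clamp(-1 + 1e-7, 1 - 1e-7)
        c_y = torch.cos(torch.acos(cos_y) + self.m)        # cos(theta_y + m)
        # t_{j,y} = G(cos_j, c_y): emphasizes hard negatives near the margin
        # boundary and down-weights easy or suspicious (mislabeled) ones.
        t = self.alpha * torch.exp(-((cos - c_y) - self.mu) ** 2
                                   / (2 * self.sigma ** 2))
        neg = self.s * cos + torch.log(t)                  # log(t * e^{s cos})
        neg = neg.masked_fill(F.one_hot(labels, cos.size(1)).bool(), float('-inf'))
        denom = torch.logsumexp(torch.cat([self.s * c_y, neg], dim=1), dim=1)
        return (denom - self.s * c_y.squeeze(1)).mean()    # -log softmax(target)
```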

 

 


 

4. Efficient PolyFace


Inspired by the idea of EfficientNet [10], we launch a NAS process to expand the basic models in depth and width under the computation-budget constraint. Some selected results on R100 are shown in Tab 3. Note that all experiments are trained under the same basic setting. Finally, we found that one of the expanded PolyFace models outperforms all searched candidates at the same Flops (∼28 Gflops), so we adopt it, called Efficient PolyFace, as the final backbone. Some selected results are shown in Tab 7.
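The paper does not describe the search implementation; below is a sketch of what a budget-constrained expansion loop could look like, reusing the hypothetical PolyFaceSketch constructor from Section 2 and the third-party thop package for flop counting (both assumptions).

```python
import itertools
import torch
from thop import profile  # third-party flop counter (pip install thop)

def gflops(model):
    macs, _ = profile(model, inputs=(torch.randn(1, 3, 112, 112),), verbose=False)
    return 2.0 * macs / 1e9  # ~2 flops per multiply-accumulate

# Enumerate block numbers [A, B, C] (width scaling would be a second axis)
# and keep every candidate that fits the 30 Gflops challenge budget.
candidates = []
for blocks in itertools.product([8, 10, 12], [16, 20, 24], [8, 10, 12]):
    model = PolyFaceSketch(blocks=blocks)
    g = gflops(model)
    if g <= 30.0:
        candidates.append((blocks, g))
# Each surviving candidate is then trained under the same basic setting.
```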

 

 

 

 


 

5. Bag of tricks


5.1. Anchor finetuning


We introduce a new regularization method named anchor finetuning. Given a converged model, we extract the features of the training set and re-initialize the weight W in the classification layer with the mean feature of the corresponding identity. The model is then finetuned from this state, as shown in Tab 6.
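A minimal sketch of the re-initialization step, assuming a `backbone` that yields embeddings and a classification `head` whose weight rows are the identity anchors (the names are hypothetical):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reinit_anchors(backbone, head, loader, num_classes, feat_dim, device):
    # Accumulate per-identity feature sums over the training set.
    sums = torch.zeros(num_classes, feat_dim, device=device)
    counts = torch.zeros(num_classes, device=device)
    backbone.eval()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        sums.index_add_(0, labels, backbone(images))
        counts.index_add_(0, labels, torch.ones_like(labels, dtype=torch.float))
    means = sums / counts.clamp(min=1).unsqueeze(1)
    # Re-init W with the (normalized) mean feature of each identity, then finetune.
    head.weight.copy_(F.normalize(means))
```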

 

 


 

5.2. Scale & Shift augmentations


Data augmentation is used during training in all settings. The original image is randomly re-scaled and shifted within ±1%. The performance is shown in Tab 6.
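In torchvision terms, the ±1% scale and shift could look like the following (a sketch; the paper does not state its implementation):

```python
from torchvision import transforms

# Random shift of up to 1% of the image size, random re-scale within [0.99, 1.01].
scale_shift = transforms.RandomAffine(degrees=0,
                                      translate=(0.01, 0.01),
                                      scale=(0.99, 1.01))
```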


5.3. Color jitter


The brightness, contrast, and saturation are each set to 0.125 when adding color jitter.
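With torchvision this corresponds to:

```python
from torchvision import transforms

# Brightness, contrast, and saturation each jittered by up to 0.125.
jitter = transforms.ColorJitter(brightness=0.125, contrast=0.125, saturation=0.125)
```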


5.4. Flip strategy


The flip strategy is adopted during the training stage. During inference, we extract features for both the original and the flipped image; the final feature is their average. Results are shown in Tab 6.
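A sketch of the inference-time feature averaging, assuming NCHW image batches:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def flip_feature(model, images):
    # Average the embeddings of each image and its horizontal flip.
    f = model(images) + model(torch.flip(images, dims=[3]))  # flip along width
    return F.normalize(f / 2, dim=1)
```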


5.5. Regular face

Regular face [12] is adopted to constrain the inter-class distance, but we find that it rarely brings improvement while consuming a large amount of memory.


5.6. Label smoothing


We explore the label smoothing strategy, which is widely used in ImageNet classification. The result is shown in Tab 6.
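In PyTorch (1.10+) label smoothing is built into the cross-entropy loss; the smoothing factor below is an assumed value, since the paper does not report one.

```python
import torch.nn as nn

# Spread a fraction epsilon of the target probability over non-target classes.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # epsilon is an assumption
```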


5.7. AdaBN

Considering the domain shift between the training set and the test set, we perform AdaBN [7] on the converged model to improve its performance. Results are shown in Tab 4.
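A common way to apply AdaBN is to reset the BatchNorm running statistics and re-estimate them on the target domain while all learned weights stay frozen; a sketch:

```python
import torch

@torch.no_grad()
def adabn(model, target_loader, device):
    # Reset BN running stats; momentum=None accumulates a plain average.
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None
    model.train()  # BN layers update running stats only in train mode
    for images, _ in target_loader:
        model(images.to(device))
    model.eval()
```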


5.8. Modification of margin


We modify the margin in ArcFace, which brings a slight improvement, as shown in Tab 5.


5.9. Cosine learning rate and stochastic depth


We explore cosine learning rate decay and stochastic depth [6] to achieve further gains. The keep rate for stochastic depth is set to 0.8 in all experiments. The learning rate as a function of the iteration is shown in Fig 2, and results are shown in Tab 7. The losses during the training of the basic PolyFace are also shown in Fig 2.
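A sketch of the cosine schedule (keeping the same 10k-iteration warmup) and of a stochastic-depth residual update with keep rate 0.8. Details such as per-sample versus per-batch dropping and the test-time convention are assumptions:

```python
import math
import torch

def cosine_lr(it, total=100_000, warmup=10_000, lr_start=0.001, lr_peak=0.4):
    if it < warmup:  # same linear warmup as the base setting
        return lr_start + (lr_peak - lr_start) * it / warmup
    t = (it - warmup) / (total - warmup)
    return 0.5 * lr_peak * (1.0 + math.cos(math.pi * t))  # decays to 0

def stochastic_depth(x, residual, keep_prob=0.8, training=True):
    # Randomly drop the residual branch of a block during training.
    if not training:
        return x + residual
    mask = (torch.rand(x.size(0), 1, 1, 1, device=x.device) < keep_prob).float()
    return x + mask / keep_prob * residual  # inverted scaling keeps the expectation
```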

 


6. Enhanced quality aware network for video face recognition


To generate robust video representations for set-to-set recognition in the IQIYI track [2], inspired by QAN and RQEN [8, 9], we propose a new quality estimation strategy called the enhanced quality aware network (QAN++) to approximate the quality of each image. The representation of an image set is then aggregated as the quality-weighted sum of its frame representations.

Different from subjective judgments of image quality, our method derives the image quality from the discriminability of the feature. Define the dataset D with C identities and the anchor weights $W_i, i \in [1, C]$, in the final classification layer. The quality of an image I with identity c is computed from its feature $f_I$ and the anchor $W_c$; the formula image (Eq 4) failed to upload, and a reconstruction consistent with the text is the cosine similarity:

$$Q_I = \frac{W_c^{\top} f_I}{\|W_c\|\,\|f_I\|} \tag{4}$$

The image quality is computed on the training set. In order to obtain the image quality during inference, we add a lightweight quality-generation branch that regresses the quality value computed on the training set. To make the quality easier to regress, we normalize it as:

$$\hat{Q} = \sigma\!\left(\frac{Q - \mathrm{mean}(Q)}{\mathrm{std}(Q)}\right) \tag{5}$$

where $\sigma(\cdot)$, $\mathrm{mean}(Q)$, and $\mathrm{std}(Q)$ denote the sigmoid function and the mean and standard deviation over the whole training set, respectively. The L2 loss is adopted as the training loss.

During inference, given a video $\{I_i\}, i \in [1, n]$, where n is the total number of frames, and the corresponding feature representations $F_i$, we extract the quality value $Q_i$ of each $I_i$. The quality values are then re-scaled before aggregation (Eqs 6-8 and the aggregation Eq 9; the formula images failed to upload).

If the number of frames n in the image set is less than 3, we directly adopt Eq 9 to aggregate them without re-scaling the quality values.
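Since the formula images for Eqs 6-9 are missing, the sketch below encodes only what the text states: the qualities are re-scaled when n ≥ 3 (the min-max step stands in for the missing Eqs 6-8) and the set feature is the quality-weighted sum of the frame features (Eq 9).

```python
import torch

def aggregate_set(feats, quality):
    """feats: (n, d) frame embeddings; quality: (n,) predicted qualities."""
    q = quality
    if feats.size(0) >= 3:
        # Placeholder re-scaling for the missing Eqs 6-8 (assumed min-max).
        q = (q - q.min()) / (q.max() - q.min() + 1e-6)
    w = q / q.sum().clamp(min=1e-6)          # normalized frame weights
    return (w.unsqueeze(1) * feats).sum(0)   # Eq 9: weighted sum of frames
```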

 


 

6.1. Performance of different aggregation strategies


We evaluate the effectiveness of the proposed quality estimation strategy on IQIYI in LFR. Results are shown in Tab 8. We embed a new quality branch into PolyFace. The new branch looks like a tiny version of ResNet-18: the block numbers in the four stages are [2, 2, 2, 2] and the channel numbers are [8, 16, 32, 48]. After global average pooling, a fully connected layer with a single output regresses the quality. The quality net costs 81.9 Mflops, and its input is the same as that of PolyFace.

Table 8. Comparison of different quality strategies on the IQIYI-large track in LFR. The score of 72.981 won 1st place in this competition.
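A sketch matching the stated configuration (four stages with [2, 2, 2, 2] blocks and [8, 16, 32, 48] channels, global average pooling, then a single-output fully connected layer); the strides and the absence of a separate stem are assumptions:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-18-style basic block."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride, 1, bias=False), nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, 1, 1, bias=False), nn.BatchNorm2d(cout))
        self.short = (nn.Identity() if stride == 1 and cin == cout else
                      nn.Sequential(nn.Conv2d(cin, cout, 1, stride, bias=False),
                                    nn.BatchNorm2d(cout)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.short(x))

class QualityBranch(nn.Module):
    """Tiny ResNet-18-like branch that regresses one quality value per image."""
    def __init__(self, channels=(8, 16, 32, 48), blocks=(2, 2, 2, 2)):
        super().__init__()
        layers, cin = [], 3
        for c, b in zip(channels, blocks):
            for i in range(b):
                layers.append(BasicBlock(cin, c, stride=2 if i == 0 else 1))
                cin = c
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels[-1], 1)

    def forward(self, x):                        # x: (N, 3, 112, 112)
        h = self.pool(self.features(x)).flatten(1)
        return self.fc(h).squeeze(1)             # (N,) quality scores
```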

 


7. Conclusion


In this article, we present the details of our solutions to the ICCV19 LFR challenge. For both the image-based and video-based tracks, we introduce a new backbone, Efficient PolyFace, and a new loss function, ArcNegFace. For the video-based track, we propose a novel quality estimator, QAN++, to generate a quality score for each frame. Besides, we also explore some useful tricks for face recognition models. Results on the challenge test server demonstrate the effectiveness of the proposed methods.


 

 

 

 

 

 
