[Paper Notes] Adversarial Discriminative Heterogeneous Face Recognition
1.Abstract
The gap between the sensing patterns of different face modalities remains a challenging problem in heterogeneous face recognition (HFR). This paper proposes an adversarial feature learning framework operating in both the raw-pixel space and the compact feature space, narrowing the sensing gap through adversarial learning.
The framework integrates cross-spectral face hallucination and discriminative feature learning into an end-to-end adversarial network. In the pixel space, a generative adversarial network is used to achieve cross-spectral face hallucination, and an elaborate two-path model that considers both global structure and local texture is proposed to alleviate the lack of paired images. In the feature space, an adversarial loss and a high-order variance discrepancy loss are used to measure the global and local discrepancy between the two heterogeneous feature distributions, respectively. These two losses enhance domain-invariant feature learning and modality-independent noise removal.
2.Introduction
Typical heterogeneous face recognition (HFR) tasks:
1.visual versus near infrared (VIS-NIR) face recognition
2.visual versus thermal infrared (VIS-TIR) face recognition
3.face photo versus face sketch
4.face recognition across pose
VIS-NIR HFR is the most popular and representative HFR task, because NIR imaging provides a low-cost and effective solution for acquiring high-quality images under low-light conditions, and it is widely deployed in today's surveillance systems. However, NIR images are far less ubiquitous than VIS images, and most existing datasets lie in the VIS domain.
To address two challenges — (1) the sensing gap between different face modalities (most existing methods focus only on reducing the sensing gap, without emphasizing discrimination among different subjects, so performance degrades as the number of subjects grows) and (2) the lack of paired training data — this paper proposes an adversarial discriminative feature learning framework for HFR by introducing adversarial learning in both the raw-pixel space and the compact feature space.
-
In the pixel space
A generative adversarial network (GAN) is used as a sub-network to achieve cross-spectral face hallucination.
An elaborate two-path model is introduced in this sub-network to alleviate the lack of paired images; it considers both global structure and local texture, and produces better VIS results.
-
In the feature space
An adversarial loss and a high-order variance discrepancy loss are used to measure the global and local discrepancy between the two heterogeneous feature distributions, respectively.
3.The Proposed Approach
3.1 Cross-spectral Face Hallucination
A major challenge in NIR-VIS image translation is that image pairs are not precisely aligned in most databases. Although images can be aligned according to facial landmarks, large variations in pose and facial expression remain for the same subject. We therefore build on the CycleGAN framework to handle the unpaired image translation task.
$$
L_{G-adv}=-E_{I\sim P(I)}\log D(G(I))
$$

$$
L_{D-adv}=-E_{I'\sim P(I')}\log D(I')-E_{I\sim P(I)}\log(1-D(G(I)))
$$
$$
L_{cyc}=E_{I\sim P(I)}||I-F(G(I))||_1
$$
(where F is the opposite generator to G)
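The three losses above can be sketched in NumPy; this is a minimal illustration (function names are my own, and the discriminator outputs are assumed to be probabilities in (0, 1)), not the paper's implementation:

```python
import numpy as np

def g_adv_loss(d_fake):
    """Generator adversarial loss: -E[log D(G(I))], where d_fake = D(G(I))."""
    return -np.mean(np.log(d_fake + 1e-8))

def d_adv_loss(d_real, d_fake):
    """Discriminator loss: -E[log D(I')] - E[log(1 - D(G(I)))]."""
    return -np.mean(np.log(d_real + 1e-8)) - np.mean(np.log(1.0 - d_fake + 1e-8))

def cycle_loss(images, reconstructed):
    """Cycle-consistency loss: E[||I - F(G(I))||_1], with reconstructed = F(G(I))."""
    return np.mean(np.abs(images - reconstructed))
```

A well-trained discriminator (real scores near 1, fake scores near 0) drives `d_adv_loss` toward 0, while the generator lowers `g_adv_loss` by fooling it.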
We find that a single generator struggles to synthesize high-quality cross-spectral images with both global structure and local details. A possible explanation is that convolutional filters are shared across all spatial locations, which is hardly suitable for recovering global and local information simultaneously. We therefore adopt a two-path architecture, as shown in Figure 2. Since, among facial regions, the periocular regions exhibit a particular correspondence between NIR and VIS images, we add a local path around the eyes to recover the periocular regions precisely.
Since VIS and NIR images differ mainly in spectrum, structural information should be preserved after cross-spectral translation; an intensity-preserving term is adopted in the global path to guarantee structural consistency.
$$
L_{intensity}=E_{I\sim P(I)}||Y(I)-Y(G(I))||_1
$$
Y(·) denotes the Y channel of an image in YCbCr space.
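The intensity-preserving term can be sketched as follows; a minimal NumPy version, assuming RGB inputs in [0, 1] and the standard ITU-R BT.601 luma coefficients for the Y channel:

```python
import numpy as np

def rgb_to_y(img):
    """Y (luma) channel of an RGB image, per ITU-R BT.601: Y = 0.299R + 0.587G + 0.114B."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def intensity_loss(real, generated):
    """L_intensity = E[||Y(I) - Y(G(I))||_1]: penalize luma changes after translation."""
    return np.mean(np.abs(rgb_to_y(real) - rgb_to_y(generated)))
```

Because the coefficients sum to 1, a gray pixel (R = G = B = v) keeps Y = v, so the loss only reacts to genuine structural/brightness changes.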
In summary, the full generator loss is:
$$
L_G=L_{G-adv}+\alpha_1 L_{cyc}+\alpha_2 L_{intensity}
$$
3.2 Adversarial Discriminative Feature Learning
3.2.1 Adversarial Loss
As mentioned above, through a simple min-max two-player game, GANs have a strong ability to fit a target distribution. In this part we use a GAN for cross-modality feature learning, to eliminate the domain discrepancy. As shown in Figure 1, an additional discriminator $D_F$ is adopted as the adversary of our feature extractor. $D_F$ outputs a scalar value indicating the probability that a feature belongs to the VIS feature space.
$$
L_{F-adv}=-E_{I^N\sim P(I^N)}\log D_F(F(G_V(I^N)))
$$
(Here F is the feature extractor labeled F in Figure 1.)
3.2.2 Variance Discrepancy
Considering that the feature distribution of the same subject should be as close as possible ideally, we employ the class-wise variance discrepancy (CVD) to enforce the consistency of subject-related variation with the guide of identity label information.
$$
\sigma(F)=E((F-E(F))^2),
$$

$$
L_{CVD}=\sum_{c=1}^{C}E(||\sigma(F_c^V)-\sigma(F_c^N)||_2)
$$
where $\sigma(\cdot)$ is the variance function, and $F_c^V, F_c^N$ denote the feature observations belonging to the c-th class in the VIS and NIR domains, respectively.
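A minimal NumPy sketch of the class-wise variance discrepancy; this is my own reading of the formula (per-dimension variance over each class's samples, then an L2 norm of the VIS/NIR difference, summed over classes), not the authors' code:

```python
import numpy as np

def class_variance(features):
    """Per-dimension variance sigma(F) = E[(F - E[F])^2], taken over the sample axis."""
    return np.mean((features - features.mean(axis=0)) ** 2, axis=0)

def cvd_loss(vis_feats, nir_feats, vis_labels, nir_labels):
    """L_CVD: sum over classes c of ||sigma(F_c^V) - sigma(F_c^N)||_2."""
    total = 0.0
    for c in np.unique(vis_labels):
        fv = vis_feats[vis_labels == c]   # VIS features of class c
        fn = nir_feats[nir_labels == c]   # NIR features of class c
        total += np.linalg.norm(class_variance(fv) - class_variance(fn))
    return total
```

Note the loss is shift-invariant: translating one modality's features leaves per-class variances, and hence the loss, unchanged; only differences in within-class spread are penalized.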
3.2.3 Cross-Entropy Loss
Since the adversarial loss and the variance discrepancy penalty cannot guarantee inter-class diversity, we also adopt a classification architecture to reinforce the discriminability and compactness of the learned features.
$$
L_{cls}=\frac{1}{|N|+|V|}\sum_{i\in N\cup V}L(WF_i,y_i)
$$
L(·, ·) is the cross-entropy loss function, and W is the parameter of the softmax normalization.
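The classification term above is ordinary softmax cross-entropy over the pooled NIR and VIS samples; a minimal NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cls_loss(features, labels, W):
    """L_cls: mean cross-entropy L(W F_i, y_i) over stacked NIR + VIS samples.
    features: (n, d) feature matrix, W: (d, num_classes), labels: (n,) class ids."""
    probs = softmax(features @ W)
    n = features.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-8))
```

With uniform logits the loss equals log(num_classes); confident, correct logits drive it toward 0.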
The final loss function is a weighted sum of all the losses defined above: $L_{F-adv}$ to remove the modality gap, $L_{CVD}$ to guarantee intra-class consistency, and $L_{cls}$ to preserve identity discrimination.
$$
L=L_{F-adv}+\lambda_1 L_{CVD}+\lambda_2 L_{cls}
$$
4. Experiments
4.1 Datasets and Protocols
4.2 Implementation Details
Our cross-spectral hallucination network is trained on the CASIA NIR-VIS 2.0 face dataset.
The feature extraction network is pre-trained on the MS-Celeb-1M dataset, and fine-tuned on each testing dataset respectively.
All the face images are normalized by similarity transformation using the locations of the two eyes, and then cropped to 144 × 144, from which 128 × 128 sub-images are selected by random cropping in training and center cropping in testing. For the local path, 32 × 32 patches are cropped around the two eyes, then flipped to the same side. As mentioned above, in the cross-spectral hallucination module, images are encoded in YCbCr space; in the feature extraction step, grayscale images are used as input.
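The two cropping schemes can be sketched as follows; a minimal NumPy version with the sizes stated above (the function names are mine):

```python
import numpy as np

def random_crop(img, size=128):
    """Training: sample a random size x size sub-image from a 144 x 144 face."""
    h, w = img.shape[:2]
    top = np.random.randint(0, h - size + 1)   # randint upper bound is exclusive
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size]

def center_crop(img, size=128):
    """Testing: deterministic center size x size crop."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]
```

Random cropping at train time acts as a light spatial augmentation, while the deterministic center crop keeps evaluation reproducible.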
Network architecture:
Our cross-spectral hallucination networks adopt a ResNet-style architecture, where the global path comprises 6 residual blocks and the local path contains 3 residual blocks. The output of the local path is fed into the global path before the last block.
In the adversarial discriminative feature learning module, we employ the model-B of the Light CNN as our basic model, which includes 9 convolution layers, 4 max-pooling and one fully-connected layer.