ArcFace: Additive Angular Margin Loss for Deep Face Recognition
This post traces how loss functions for face recognition moved from Euclidean space to angular space, and how to make sense of that progression.
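For image classification problems such as face recognition, the usual loss function in deep learning models is cross-entropy, and the choice of loss has a major impact on network performance. Center loss [1] penalizes the Euclidean distance between each deep face feature and its class center, which reduces the intra-class distance. SphereFace [2] assumes that the linear transformation matrix of the last fully connected layer can represent the class centers in angular space, and penalizes the angle between a deep feature and its corresponding weight in a multiplicative way. The current line of research adds a margin to existing loss functions to make face classes easier to separate. The ArcFace paper proposes the Additive Angular Margin Loss (ArcFace) [3] to make the features more discriminative for face classification, reportedly improving face recognition by 10 points. Links to papers and code are given wherever possible.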
If anything is missing or wrong, corrections are welcome!
Paper: https://arxiv.org/abs/1801.07698
Official code: https://github.com/deepinsight/insightface
So how did these face losses evolve?
The classic classification loss is built on softmax:
$$\mathrm{softmax}(f_i) = \frac{e^{f_i}}{\sum \limits_{c=1}^{C} e^{f_c}} \tag{1}$$
$$L_{S} = -\frac{1}{N}\sum \limits_{i=1}^{N}\log \frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum \limits_{j=1}^{n}e^{W_{j}^{T}x_i+b_{j}}} = -\frac{1}{N}\sum \limits_{i=1}^{N}\log \frac{e^{\|W_{y_i}\|\|x_i\|\cos\theta_{y_i}}}{\sum \limits_{j=1}^{n} e^{\|W_{j}\|\|x_i\|\cos\theta_{j}}} \tag{2}$$
The second form sets the biases to zero and writes each inner product as $W_j^T x_i = \|W_j\|\|x_i\|\cos\theta_j$, where $\theta_j$ is the angle between $W_j$ and $x_i$.
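This is the traditional softmax loss, where $W_{y_i}^{T}x_i+b_{y_i}$ is the output of the fully connected layer, from which the probability of each class is computed. This formulation only cares about whether a sample is classified correctly; it places no constraint on intra-class or inter-class distance.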
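To make Eq. (1) and Eq. (2) concrete, here is a minimal pure-Python sketch of softmax and the averaged cross-entropy loss. The logits and labels are made-up values for illustration, and the function names `softmax` and `softmax_loss` are chosen here, not taken from any of the cited repositories.

```python
import math


def softmax(logits):
    """Softmax probabilities for one sample's logits (Eq. 1)."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(f - shift) for f in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(batch_logits, labels):
    """Mean negative log-probability of the true class over a batch (Eq. 2)."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Hypothetical fully connected outputs for 2 samples and 3 classes.
batch_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
labels = [0, 1]

probs = softmax(batch_logits[0])
loss = softmax_loss(batch_logits, labels)
print(probs, loss)
```

Nothing in this loss looks at how far apart the features are; it only rewards a large true-class probability, which is the gap the later losses try to close.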
CenterLoss
SIAT, CUHK (Yu Qiao), ECCV 2016
Paper: A Discriminative Feature Learning Approach for Deep Face Recognition
code: https://github.com/ydwen/caffe-face
In [A Discriminative Feature Learning Approach for Deep Face Recognition][1], the authors use a network deeper than LeNet and run a small experiment on MNIST to show the gap between the features learned with softmax and the ideal case.
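The experiment shows that features trained with plain softmax still have a large intra-class distance. In other words, adding an intra-class distance constraint to the loss function can pay off more, for less effort, than changing the network architecture. The authors therefore propose Center Loss and justify the improvement from several angles.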
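The overall idea of Center Loss is that, within a batch, the sum of squared distances between each sample's feature and the center of the features of its class should be as small as possible; that is, the smaller the intra-class distance, the better. The final loss proposed by the authors combines the softmax loss and the center loss, with a parameter λ controlling the balance between the two:
$$L = L_S + \lambda L_C, \qquad L_C = \frac{1}{2}\sum \limits_{i=1}^{N} \|x_i - c_{y_i}\|_2^2$$
where $N$ is the batch size and $c_{y_i}$ is the center of class $y_i$.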
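The class centers are updated on each mini-batch as follows:
$$\Delta c_j = \frac{\sum \limits_{i=1}^{N} \delta(y_i = j)\,(c_j - x_i)}{1 + \sum \limits_{i=1}^{N} \delta(y_i = j)}, \qquad c_j \leftarrow c_j - \alpha\,\Delta c_j$$
where $N$ is the batch size, $\delta(\cdot)$ is 1 when the condition holds and 0 otherwise, and $\alpha$ is the learning rate of the centers.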
Adding Center Loss constrains the intra-class distance, so the features of samples from the same class become more compact.
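As a rough illustration of the idea, the sketch below computes the center-loss term (half the sum of squared distances to the class centers) and applies one center update of the form described in the Center Loss paper. The features, labels, initial centers, and the step size `alpha` are hypothetical values, and the helper names are chosen here for illustration.

```python
def center_loss(features, labels, centers):
    """0.5 * sum of squared distances from each feature to its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += 0.5 * sum((xd - cd) ** 2 for xd, cd in zip(x, centers[y]))
    return total


def update_centers(features, labels, centers, alpha=0.5):
    """Move each center toward the batch samples of its class (one step)."""
    updated = []
    for j, c in enumerate(centers):
        members = [x for x, y in zip(features, labels) if y == j]
        # delta_j = sum_i (c_j - x_i) / (1 + count_j)
        delta = [sum(c[d] - x[d] for x in members) / (1 + len(members))
                 for d in range(len(c))]
        updated.append([c[d] - alpha * delta[d] for d in range(len(c))])
    return updated


# Hypothetical 2-D features for a batch of 3 samples and 2 classes.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
centers = [[0.0, 0.0], [0.0, 0.0]]

before = center_loss(features, labels, centers)
new_centers = update_centers(features, labels, centers)
after = center_loss(features, labels, new_centers)
print(before, new_centers, after)
```

In training, this term would be added to the softmax loss with weight λ; here only the center part is shown, and one update step already lowers it.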
Large Margin softmax loss: L-softmax loss
Peking University, South China University of Technology, Shenzhen University, Carnegie Mellon University; 2016.12
Reference: Liu W, Wen Y, Yu Z, et al. Large-Margin Softmax Loss for Convolutional Neural Networks[C]//Proceedings of The 33rd International Conference on Machine Learning (ICML). 2016: 507-516.
Paper: Large-Margin Softmax Loss for Convolutional Neural Networks
code: https://github.com/wy1iu/LargeMargin_Softmax_Loss
Consider a binary classification problem in which x belongs to class 1. The original softmax then wants:
$$W_{1}^{T} x > W_{2}^{T} x \tag{3}$$
That is, the probability of class 1 should be greater than that of class 2. This inequality is equivalent to:
$$\|W_1\|\|x\|\cos(\theta_1) > \|W_2\|\|x\|\cos(\theta_2) \tag{4}$$
Large-margin softmax replaces the inequality above with:
$$\|W_1\|\|x\|\cos(m\theta_1) > \|W_2\|\|x\|\cos(\theta_2), \quad 0 < \theta_1 < \frac{\pi}{m} \tag{5}$$
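Here m is a positive integer. Because the cosine is monotonically decreasing on [0, π], cos(mθ) is no greater than cos(θ) for θ in [0, π/m], so the new inequality is harder to satisfy than the original one. Defining the loss this way forces the model to learn features with larger inter-class distance and smaller intra-class distance.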
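The claim that cos(mθ) never exceeds cos(θ) on [0, π/m] is easy to check numerically. The sketch below samples that interval for one value of m (the value 4 is an arbitrary choice for illustration) and measures the gap cos(θ) - cos(mθ), which is the extra amount the class-1 score must cover under Eq. (5).

```python
import math


def margin_gap(theta, m):
    """cos(theta) - cos(m * theta): how much stricter Eq. (5) is than Eq. (4)."""
    return math.cos(theta) - math.cos(m * theta)


m = 4  # hypothetical margin, any positive integer works
thetas = [k * (math.pi / m) / 100 for k in range(101)]  # samples of [0, pi/m]
gaps = [margin_gap(t, m) for t in thetas]

print(min(gaps), max(gaps))
```

The gap is zero only at θ = 0 and grows toward the end of the interval, so samples with a larger angle to their class weight are penalized more.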
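Geometrically, the difference between the two losses is this: with cos(mθ), the learned W becomes flatter, which increases the inter-class distance between samples.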
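Experimental results for Large-Margin Softmax: the proposed large-margin softmax (L-Softmax) loss effectively guides the network to learn features with small intra-class distance and large inter-class distance. L-Softmax can also be tuned to different margins and helps prevent overfitting. Its forward and backward passes can be derived and trained with stochastic gradient descent, and experiments show that features learned with L-Softmax are more discriminative and outperform softmax on both classification and verification tasks.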
A-softmax: SphereFace Loss
Georgia Institute of Technology, Carnegie Mellon University, Sun Yat-sen University; 2017.4, CVPR
SphereFace: Deep Hypersphere Embedding for Face Recognition
Paper: Deep Hypersphere Embedding for Face Recognition
Code: https://github.com/wy1iu/sphereface
Put simply, A-Softmax loss adds two constraints to large-margin softmax loss, ||W|| = 1 and b = 0, so that the prediction depends only on the angle between W and x.
Suppose softmax is computed under two constraints, $\|W_1\| = \|W_2\| = 1$ and $b_1 = b_2 = 0$.
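The decision boundary then becomes $\|x\|(\cos\theta_1 - \cos\theta_2) = 0$, which depends only on the angles, and the loss function becomes:
$$L_{modified} = -\frac{1}{N}\sum \limits_{i=1}^{N}\log \frac{e^{\|x_i\|\cos\theta_{y_i}}}{\sum_j e^{\|x_i\|\cos\theta_{j}}}$$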
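On top of these two constraints, the authors add the same angular parameter m as in large-margin softmax loss, so the loss becomes:
$$L_{ang} = -\frac{1}{N}\sum \limits_{i=1}^{N}\log \frac{e^{\|x_i\|\psi(\theta_{y_i})}}{e^{\|x_i\|\psi(\theta_{y_i})} + \sum_{j \neq y_i} e^{\|x_i\|\cos\theta_{j}}}$$
where $\psi(\theta) = (-1)^k\cos(m\theta) - 2k$ for $\theta \in [\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$ and $k \in [0, m-1]$, which extends $\cos(m\theta)$ to a function that decreases monotonically over the whole range $[0, \pi]$.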
Borrowed code, PyTorch implementation:
    # SphereFace
    import math

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.nn import Parameter


    class SphereProduct(nn.Module):
        r"""Implementation of the multiplicative angular margin used by SphereFace.

        Args:
            in_features: size of each input sample
            out_features: size of each output sample
            m: integer margin, applied as cos(m * theta)
        """

        def __init__(self, in_features, out_features, m=4):
            super(SphereProduct, self).__init__()
            self.in_features = in_features
            self.out_features = out_features
            self.m = m
            # Annealing schedule that blends the margin in gradually (see forward).
            self.base = 1000.0
            self.gamma = 0.12
            self.power = 1
            self.LambdaMin = 5.0
            self.iter = 0
            self.weight = Parameter(torch.FloatTensor(out_features, in_features))
            nn.init.xavier_uniform_(self.weight)

            # Multiple-angle formulas: mlambda[k](cos(theta)) == cos(k * theta),
            # each mapping x in [-1, 1] to y in [-1, 1].
            self.mlambda = [
                lambda x: x ** 0,
                lambda x: x ** 1,
                lambda x: 2 * x ** 2 - 1,
                lambda x: 4 * x ** 3 - 3 * x,
                lambda x: 8 * x ** 4 - 8 * x ** 2 + 1,
                lambda x: 16 * x ** 5 - 20 * x ** 3 + 5 * x
            ]
            # To get a feel for mlambda, run this snippet on its own,
            # with mlambda defined as the list of lambdas above:
            #
            #   import matplotlib.pyplot as plt
            #   x = [0.01 * i for i in range(-100, 101)]
            #   for f in mlambda:
            #       plt.plot(x, [f(i) for i in x])
            #   plt.show()

        def forward(self, input, label):
            # lambda = max(lambda_min, base * (1 + gamma * iteration) ^ (-power))
            self.iter += 1
            self.lamb = max(self.LambdaMin, self.base * (1 + self.gamma * self.iter) ** (-1 * self.power))

            # --------------------------- cos(theta) & phi(theta) ---------------------------
            cos_theta = F.linear(F.normalize(input), F.normalize(self.weight))
            cos_theta = cos_theta.clamp(-1, 1)
            cos_m_theta = self.mlambda[self.m](cos_theta)
            # theta is only used to pick the interval index k, so no gradient is needed.
            theta = cos_theta.detach().acos()
            k = (self.m * theta / math.pi).floor()
            phi_theta = ((-1.0) ** k) * cos_m_theta - 2 * k
            NormOfFeature = torch.norm(input, 2, 1)

            # --------------------------- convert label to one-hot ---------------------------
            # zeros_like keeps the one-hot mask on the same device as cos_theta.
            one_hot = torch.zeros_like(cos_theta)
            one_hot.scatter_(1, label.view(-1, 1).long(), 1)

            # --------------------------- Calculate output ---------------------------
            # Only the target-class entry is moved from cos(theta) toward phi(theta).
            output = (one_hot * (phi_theta - cos_theta) / (1 + self.lamb)) + cos_theta
            output *= NormOfFeature.view(-1, 1)
            return output

        def __repr__(self):
            return self.__class__.__name__ + '(' \
                   + 'in_features=' + str(self.in_features) \
                   + ', out_features=' + str(self.out_features) \
                   + ', m=' + str(self.m) + ')'
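
Two details of the `forward` pass above can be checked without PyTorch: that the polynomials in `mlambda` really compute cos(mθ) from cos(θ), and that `phi_theta = (-1)^k * cos(mθ) - 2k` decreases monotonically over [0, π]. The following is a minimal sketch in plain Python; the names `CHEBYSHEV` and `psi` and the choice m = 4 are made up for this illustration.

```python
import math

# Multiple-angle polynomials: CHEBYSHEV[m](cos(theta)) == cos(m * theta).
CHEBYSHEV = [
    lambda x: 1.0,
    lambda x: x,
    lambda x: 2 * x ** 2 - 1,
    lambda x: 4 * x ** 3 - 3 * x,
    lambda x: 8 * x ** 4 - 8 * x ** 2 + 1,
    lambda x: 16 * x ** 5 - 20 * x ** 3 + 5 * x,
]


def psi(theta, m):
    """(-1)^k * cos(m * theta) - 2k, with k the index of the pi/m-wide interval."""
    cos_m_theta = CHEBYSHEV[m](math.cos(theta))
    k = math.floor(m * theta / math.pi)
    return ((-1) ** k) * cos_m_theta - 2 * k


m = 4  # hypothetical margin
thetas = [i * math.pi / 200 for i in range(201)]  # samples of [0, pi]
values = [psi(t, m) for t in thetas]

print(values[0], values[-1])
```

Plain cos(mθ) oscillates on [0, π]; flipping the sign and subtracting 2k on each interval stitches the pieces into one decreasing curve, so a larger angle always gives a smaller target logit.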
AM-softmax: cosFace Loss
Tencent, Columbia University; 2018.1, CVPR
Paper: CosFace: Large Margin Cosine Loss for Deep Face Recognition
Building on A-Softmax, cos(mθ) is replaced with a new function:
$$\psi(\theta) = \cos\theta - m$$
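As with the definition in A-Softmax, this lowers the probability of the ground-truth class term and raises the loss, which helps samples of the same class cluster together. Then, following NormFace, the feature f is normalized and multiplied by a scale factor s, giving the final loss:
$$L_{AMS} = -\frac{1}{N}\sum \limits_{i=1}^{N}\log \frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j \neq y_i} e^{s\cos\theta_{j}}}$$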
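The advantage is that A-Softmax has to compute cos(mθ) through the multiple-angle formulas, which makes differentiation in the backward pass awkward, whereas subtracting a constant m leaves the derivative unchanged.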
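A-Softmax multiplies θ by m, while AM-Softmax subtracts m from cos θ. This is the biggest difference between the two: one puts the margin on the angle, the other on the cosine.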
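The reason for choosing $\cos\theta - m$ rather than $\cos(\theta - m)$ is that what the network gives us is the inner product of $W$ and $f$; optimizing $\cos(\theta - m)$ would involve an $\arccos$ operation, which is too expensive.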
Borrowed code, PyTorch implementation:
    # CosFace
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.nn import Parameter


    class AddMarginProduct(nn.Module):
        r"""Implementation of the large margin cosine distance (CosFace).

        Args:
            in_features: size of each input sample
            out_features: size of each output sample
            s: norm of input feature
            m: margin, applied as cos(theta) - m
        """

        def __init__(self, in_features, out_features, s=30.0, m=0.40):
            super(AddMarginProduct, self).__init__()
            self.in_features = in_features
            self.out_features = out_features
            self.s = s
            self.m = m
            self.weight = Parameter(torch.FloatTensor(out_features, in_features))
            nn.init.xavier_uniform_(self.weight)

        def forward(self, input, label):
            # --------------------------- cos(theta) & phi(theta) ---------------------------
            cosine = F.linear(F.normalize(input), F.normalize(self.weight))
            phi = cosine - self.m

            # --------------------------- convert label to one-hot ---------------------------
            # zeros_like keeps the mask on the same device as the logits
            # (the original hard-coded device='cuda', which fails on CPU).
            one_hot = torch.zeros_like(cosine)
            one_hot.scatter_(1, label.view(-1, 1).long(), 1)

            # out_i = phi_i for the target class, cosine_i for every other class;
            # torch.where can be used instead from torch 0.4 on.
            output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
            output *= self.s
            return output

        def __repr__(self):
            return self.__class__.__name__ + '(' \
                   + 'in_features=' + str(self.in_features) \
                   + ', out_features=' + str(self.out_features) \
                   + ', s=' + str(self.s) \
                   + ', m=' + str(self.m) + ')'
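
To see what the margin does to the loss itself, the sketch below feeds the logits that `AddMarginProduct` would produce into a cross-entropy, for a single sample and in plain Python. The cosine values are invented for illustration, and `am_softmax_loss` is a helper name chosen here; the defaults s = 30 and m = 0.40 mirror the constructor above.

```python
import math


def am_softmax_loss(cosines, label, s=30.0, m=0.40):
    """Cross-entropy on s * (cos(theta_y) - m) for the target, s * cos(theta_j) otherwise."""
    logits = [s * (c - m) if j == label else s * c
              for j, c in enumerate(cosines)]
    shift = max(logits)  # log-sum-exp with a shift for numerical stability
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum - logits[label]


# Hypothetical cos(theta_j) between one feature and 3 class weights.
cosines = [0.7, 0.5, -0.2]

plain = am_softmax_loss(cosines, label=0, m=0.0)
with_margin = am_softmax_loss(cosines, label=0, m=0.40)
print(plain, with_margin)
```

Without the margin this sample is already classified correctly and contributes almost no loss. With the margin, the target cosine must beat the others by at least m, so the same sample still produces a large loss and keeps being pulled toward its class weight.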
Arcface Loss
Jiankang Deng et al., Imperial College London; 2018.01, CVPR
Paper: ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Official code: https://github.com/deepinsight/insightface
The logit for the ground-truth class becomes:
$$s\cos(\theta_{y_i} + m)$$
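The cosine is monotonically decreasing on [0, π], so adding m to the angle makes this value smaller (as long as θ + m stays within [0, π]), and the loss becomes larger. The reason for this change is that an angular margin acts on the angle more directly than a cosine margin does.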
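A small numeric sketch of the additive angular margin follows: recover θ from cos θ, add m, and take the cosine again. The scale s = 64 and margin m = 0.5 are assumed values used only for illustration, and the function name is made up here.

```python
import math


def arcface_target_logit(cos_theta, s=64.0, m=0.5):
    """s * cos(theta + m) for the ground-truth class, given cos(theta)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding
    return s * math.cos(theta + m)


cos_theta = 0.8  # hypothetical cosine between a feature and its class weight
plain = 64.0 * cos_theta
arc = arcface_target_logit(cos_theta)
print(plain, arc)
```

By the angle-addition identity, cos(θ + m) = cos θ cos m - sin θ sin m, so the penalty depends on θ itself rather than being a constant offset as in cos θ - m.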
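SphereFace, ArcFace, and CosFace can be combined in a single framework, in which the logit for the ground-truth class is:
$$s\,(\cos(m_1\theta_{y_i} + m_2) - m_3)$$
where $m_1$, $m_2$, and $m_3$ are the SphereFace, ArcFace, and CosFace margins respectively.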
Borrowed code:
    import mxnet as mx

    # `config` and the network modules named by config.net_name are expected to
    # come from the surrounding insightface training script.


    def get_symbol(args):
        # Build the backbone and get the embedding (feature vector)
        embedding = eval(config.net_name).get_symbol()
        # Placeholder variable that holds the labels
        all_label = mx.symbol.Variable('softmax_label')
        gt_label = all_label
        is_softmax = True
        # Plain softmax loss
        if config.loss_name == 'softmax':
            # Weight of the final fully connected (classification) layer
            _weight = mx.symbol.Variable("fc7_weight", shape=(config.num_classes, config.emb_size),
                                         lr_mult=config.fc7_lr_mult, wd_mult=config.fc7_wd_mult, init=mx.init.Normal(0.01))
            # Without bias: the fully connected layer gives one logit per identity
            if config.fc7_no_bias:
                fc7 = mx.sym.FullyConnected(data=embedding, weight=_weight, no_bias=True, num_hidden=config.num_classes,
                                            name='fc7')
            # With bias: same layer plus a learnable bias term
            else:
                _bias = mx.symbol.Variable('fc7_bias', lr_mult=2.0, wd_mult=0.0)
                fc7 = mx.sym.FullyConnected(data=embedding, weight=_weight, bias=_bias, num_hidden=config.num_classes,
                                            name='fc7')
        # Margin-based softmax loss
        elif config.loss_name == 'margin_softmax':
            # Weight of the final fully connected (classification) layer
            _weight = mx.symbol.Variable("fc7_weight", shape=(config.num_classes, config.emb_size),
                                         lr_mult=config.fc7_lr_mult, wd_mult=config.fc7_wd_mult, init=mx.init.Normal(0.01))
            # Feature scale s used by the loss
            s = config.loss_s
            # L2-normalize the weights and the embedding, then scale the embedding by s
            _weight = mx.symbol.L2Normalization(_weight, mode='instance')
            nembedding = mx.symbol.L2Normalization(embedding, mode='instance', name='fc1n') * s
            # With both sides normalized, fc7 is s * cos(theta) for each identity
            fc7 = mx.sym.FullyConnected(data=nembedding, weight=_weight, no_bias=True, num_hidden=config.num_classes,
                                        name='fc7')
            # Debug: print the output shape of fc7 for a batch of 2 images
            in_shape, out_shape, aux_shape = fc7.infer_shape(data=(2, 3, 112, 112))
            print('fc7', out_shape)
            # m1, m2, m3 exist so that SphereFace, ArcFace, and CosFace share one code path
            if config.loss_m1 != 1.0 or config.loss_m2 != 0.0 or config.loss_m3 != 0.0:
                # CosFace only: subtract s * m3 from the ground-truth logit
                if config.loss_m1 == 1.0 and config.loss_m2 == 0.0:
                    s_m = s * config.loss_m3
                    gt_one_hot = mx.sym.one_hot(gt_label, depth=config.num_classes, on_value=s_m, off_value=0.0)
                    fc7 = fc7 - gt_one_hot
                # ArcFace or a combined margin
                else:
                    # Pick the ground-truth logit of each row of fc7, i.e. s * cos(theta)
                    zy = mx.sym.pick(fc7, gt_label, axis=1)
                    in_shape, out_shape, aux_shape = zy.infer_shape(data=(2, 3, 112, 112), softmax_label=(2,))
                    print('zy', out_shape)
                    # Undo the scaling by s; cos_t lies in [-1, 1]
                    cos_t = zy / s
                    # Recover the angle t in [0, pi] so the margins can be applied to it
                    t = mx.sym.arccos(cos_t)
                    # m1: SphereFace (multiplicative angular margin)
                    if config.loss_m1 != 1.0:
                        t = t * config.loss_m1
                    # m2: ArcFace (additive angular margin)
                    if config.loss_m2 > 0.0:
                        t = t + config.loss_m2
                    # Cosine of the margin-adjusted angle
                    body = mx.sym.cos(t)
                    # m3: CosFace (additive cosine margin)
                    if config.loss_m3 > 0.0:
                        body = body - config.loss_m3
                    new_zy = body * s
                    # Difference between the margin logit and the original logit
                    diff = new_zy - zy
                    # Add a trailing dimension so diff broadcasts against the one-hot mask
                    diff = mx.sym.expand_dims(diff, 1)
                    # One-hot encode the labels
                    gt_one_hot = mx.sym.one_hot(gt_label, depth=config.num_classes, on_value=1.0, off_value=0.0)
                    # Apply the difference to the ground-truth logit only
                    body = mx.sym.broadcast_mul(gt_one_hot, diff)
                    fc7 = fc7 + body
        # Triplet loss
        elif config.loss_name.find('triplet') >= 0:
            is_softmax = False
            nembedding = mx.symbol.L2Normalization(embedding, mode='instance', name='fc1n')
            # The batch is laid out as [anchors | positives | negatives]
            anchor = mx.symbol.slice_axis(nembedding, axis=0, begin=0, end=args.per_batch_size // 3)
            positive = mx.symbol.slice_axis(nembedding, axis=0, begin=args.per_batch_size // 3,
                                            end=2 * args.per_batch_size // 3)
            negative = mx.symbol.slice_axis(nembedding, axis=0, begin=2 * args.per_batch_size // 3, end=args.per_batch_size)
            if config.loss_name == 'triplet':
                # Squared Euclidean distances
                ap = anchor - positive
                an = anchor - negative
                ap = ap * ap
                an = an * an
                ap = mx.symbol.sum(ap, axis=1, keepdims=1)  # (T,1)
                an = mx.symbol.sum(an, axis=1, keepdims=1)  # (T,1)
                triplet_loss = mx.symbol.Activation(data=(ap - an + config.triplet_alpha), act_type='relu')
                triplet_loss = mx.symbol.mean(triplet_loss)
            else:
                # Angular distances: arccos of the cosine similarity
                ap = anchor * positive
                an = anchor * negative
                ap = mx.symbol.sum(ap, axis=1, keepdims=1)  # (T,1)
                an = mx.symbol.sum(an, axis=1, keepdims=1)  # (T,1)
                ap = mx.sym.arccos(ap)
                an = mx.sym.arccos(an)
                triplet_loss = mx.symbol.Activation(data=(ap - an + config.triplet_alpha), act_type='relu')
                triplet_loss = mx.symbol.mean(triplet_loss)
            triplet_loss = mx.symbol.MakeLoss(triplet_loss)
        out_list = [mx.symbol.BlockGrad(embedding)]
        # Softmax-based losses
        if is_softmax:
            softmax = mx.symbol.SoftmaxOutput(data=fc7, label=gt_label, name='softmax', normalization='valid')
            out_list.append(softmax)
            if config.ce_loss:
                # ce_loss = mx.symbol.softmax_cross_entropy(data=fc7, label = gt_label, name='ce_loss')/args.per_batch_size
                body = mx.symbol.SoftmaxActivation(data=fc7)
                body = mx.symbol.log(body)
                _label = mx.sym.one_hot(gt_label, depth=config.num_classes, on_value=-1.0, off_value=0.0)
                body = body * _label
                ce_loss = mx.symbol.sum(body) / args.per_batch_size
                out_list.append(mx.symbol.BlockGrad(ce_loss))
        # Triplet loss
        else:
            out_list.append(mx.sym.BlockGrad(gt_label))
            out_list.append(triplet_loss)
        # Group all output symbols
        out = mx.symbol.Group(out_list)
        return out
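
The margin branch of `get_symbol` boils down to one scalar formula for the ground-truth logit, s * (cos(m1 * θ + m2) - m3). The sketch below reproduces it in plain Python so the three special cases can be compared side by side. The parameter names mirror `config.loss_m1`, `config.loss_m2`, `config.loss_m3`, and `config.loss_s`, but the numeric values and the function name are hypothetical choices made only for this illustration.

```python
import math


def combined_margin_logit(cos_theta, s=64.0, m1=1.0, m2=0.0, m3=0.0):
    """s * (cos(m1 * theta + m2) - m3) for the ground-truth class."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding
    return s * (math.cos(m1 * theta + m2) - m3)


cos_theta = 0.8  # hypothetical cosine between a feature and its class weight

plain = combined_margin_logit(cos_theta)               # no margin
sphereface = combined_margin_logit(cos_theta, m1=2.0)  # multiplicative angular margin
arcface = combined_margin_logit(cos_theta, m2=0.5)     # additive angular margin
cosface = combined_margin_logit(cos_theta, m3=0.35)    # additive cosine margin
print(plain, sphereface, arcface, cosface)
```

All three margins lower the ground-truth logit for the same cos θ; they differ in whether the penalty is applied to the angle or to the cosine. Like the MXNet code, this sketch applies cos directly and does not include the piecewise ψ(θ) correction used in the PyTorch SphereFace layer.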
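References:
[1] Blog post: 人脸识别损失函数笔记 (Notes on face recognition loss functions)
[2] Blog post: insightFace-损失函数arcface (insightFace: the ArcFace loss function)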