Prior works
- MV-Softmax [2020-AAAI] [strongly recommended]
Motivation
- Problem with MV-Softmax: it emphasizes semi-hard/hard samples from the very beginning of training, which can cause convergence problems.
- Insight: easy samples first, hard samples later!
Code
- CurricularFace [PyTorch version]
import math

import torch
import torch.nn as nn
from torch.nn import Parameter


def l2_norm(x, axis=1):
    # Helper not shown in the original notes; assumed to L2-normalize along `axis`.
    return x / torch.norm(x, 2, axis, keepdim=True)


class CurricularFace(nn.Module):
    def __init__(self, in_features, out_features, m=0.5, s=64.):
        super(CurricularFace, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.m = m  # angular margin
        self.s = s  # logit scale
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)
        # Fallback used when theta + m would exceed pi
        self.threshold = math.cos(math.pi - m)
        self.mm = math.sin(math.pi - m) * m
        self.kernel = Parameter(torch.Tensor(in_features, out_features))
        # t is a running statistic, not a learnable parameter
        self.register_buffer('t', torch.zeros(1))
        nn.init.normal_(self.kernel, std=0.01)

    def forward(self, embbedings, label):
        embbedings = l2_norm(embbedings, axis=1)
        kernel_norm = l2_norm(self.kernel, axis=0)
        cos_theta = torch.mm(embbedings, kernel_norm)
        cos_theta = cos_theta.clamp(-1, 1)  # for numerical stability
        with torch.no_grad():
            origin_cos = cos_theta.clone()
        # Cosine between each sample and its ground-truth class, shape (N, 1)
        target_logit = cos_theta[torch.arange(0, embbedings.size(0)), label].view(-1, 1)

        sin_theta = torch.sqrt(1.0 - torch.pow(target_logit, 2))
        cos_theta_m = target_logit * self.cos_m - sin_theta * self.sin_m  # cos(target + margin)
        # Hard entries: cosine larger than the margin-penalized target cosine
        mask = cos_theta > cos_theta_m
        final_target_logit = torch.where(target_logit > self.threshold, cos_theta_m, target_logit - self.mm)

        hard_example = cos_theta[mask]
        with torch.no_grad():
            # Moving average of the mean positive cosine
            self.t = target_logit.mean() * 0.01 + (1 - 0.01) * self.t
        # Re-weight hard negatives by (t + cos); easy negatives stay unchanged
        cos_theta[mask] = hard_example * (self.t + hard_example)
        # Overwrite the ground-truth column with the margin-penalized target logit
        cos_theta.scatter_(1, label.view(-1, 1).long(), final_target_logit)
        output = cos_theta * self.s
        return output, origin_cos * self.s
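To make the effect of the forward pass concrete, here is a minimal pure-Python sketch for a single sample. It assumes the same defaults as the listing (m = 0.5, s = 64). The function name `curricular_logits` and the toy cosine values are illustrative and do not come from the original code.

```python
import math


def curricular_logits(cosines, label, t, m=0.5, s=64.0):
    """Scaled logits for ONE sample, mirroring the forward pass above.

    cosines: cos(theta_j) between the embedding and every class centre.
    label:   index of the ground-truth class.
    t:       current value of the adaptive parameter.
    """
    cos_y = cosines[label]
    sin_y = math.sqrt(max(0.0, 1.0 - cos_y * cos_y))
    cos_y_m = cos_y * math.cos(m) - sin_y * math.sin(m)  # cos(theta_y + m)

    # Same fallback as the listing for the case theta_y + m > pi.
    threshold = math.cos(math.pi - m)
    target = cos_y_m if cos_y > threshold else cos_y - math.sin(math.pi - m) * m

    logits = []
    for j, c in enumerate(cosines):
        if j == label:
            logits.append(s * target)
        elif c > cos_y_m:
            logits.append(s * c * (t + c))  # hard negative: re-weighted
        else:
            logits.append(s * c)            # easy negative: unchanged
    return logits


# Toy sample (made-up values): class 0 is the ground truth,
# class 1 is a hard negative, class 2 an easy negative.
cosines = [0.6, 0.5, -0.2]
early = curricular_logits(cosines, label=0, t=0.0)
late = curricular_logits(cosines, label=0, t=0.9)
print(early)
print(late)
```

With these toy values the hard negative's logit rises from about 16 at t = 0 to about 44.8 at t = 0.9, while the easy negative and the target logit are the same in both runs.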
Details
- Curricular Loss
where T(cos(θ_y)) = cos(θ_y + m), I(t, cos(θ_j)) is the sample weighting function, and N(t, cos(θ_j)) is defined as follows:
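The formulas this item refers to are not reproduced in these notes. A reconstruction that is consistent with the code listing in the Code section might read as follows. It assumes the returned logits are fed to a standard softmax cross-entropy, which is not part of the listing.

```latex
\begin{aligned}
L &= -\log \frac{e^{s\,T(\cos\theta_{y})}}
                {e^{s\,T(\cos\theta_{y})} + \sum_{j \neq y} e^{s\,N(t,\,\cos\theta_j)}}, \\[4pt]
T(\cos\theta_{y}) &= \cos(\theta_{y} + m), \\[4pt]
N(t, \cos\theta_j) &= I(t, \cos\theta_j)\,\cos\theta_j, \qquad
I(t, \cos\theta_j) =
\begin{cases}
1, & T(\cos\theta_{y}) - \cos\theta_j \ge 0 \quad \text{(easy)} \\
t + \cos\theta_j, & T(\cos\theta_{y}) - \cos\theta_j < 0 \quad \text{(hard)}
\end{cases}
\end{aligned}
```

The hard case corresponds to the line `cos_theta[mask] = hard_example * (self.t + hard_example)` in the listing, and the condition corresponds to `mask = cos_theta > cos_theta_m`.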
- Training Curve
- x-axis: iterations; y-axis: modulation coefficient applied to hard samples;
- t: adaptive parameter; M(MV-Arc-Softmax): gradient modulation coefficient of MV-Arc-Softmax; M(ours): gradient modulation coefficient of CurricularFace;
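- Early in training, t → 0, so I(t, cos(θ_j)) ≤ 1 (in the code above, hard negatives are scaled by t + cos(θ_j) and easy ones are left unchanged), and the model can rely on easy samples to converge faster; in the middle and late stages t keeps growing until I(t, cos(θ_j)) > 1 for hard negatives, so the model pays more attention to hard samples.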
(Figure panels: early stage, later stage)
Note: in each pair (a, b), a is the ratio of the Curricular loss to the ArcFace loss at a given moment during training; b is max{cos(θ_j), j ≠ y_i}.
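The pair (a, b) from the note can be computed by hand for a single sample. The following is a minimal sketch under stated assumptions: both losses are plain softmax cross-entropy over scaled logits, θ_y + m stays below π, and ArcFace differs from CurricularFace only in leaving negative cosines untouched. The names `cross_entropy` and `loss_ratio` and the toy cosines are illustrative.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (log-sum-exp for stability)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def loss_ratio(cosines, label, t, m=0.5, s=64.0):
    """Return (a, b): a = Curricular loss / ArcFace loss, b = max negative cosine."""
    target = math.cos(math.acos(cosines[label]) + m)  # cos(theta_y + m)
    arc, cur = [], []
    for j, c in enumerate(cosines):
        if j == label:
            arc.append(s * target)
            cur.append(s * target)
        else:
            arc.append(s * c)  # ArcFace keeps negatives as they are
            cur.append(s * c * (t + c) if c > target else s * c)
    b = max(c for j, c in enumerate(cosines) if j != label)
    return cross_entropy(cur, label) / cross_entropy(arc, label), b


# Toy sample (made-up values): class 0 is the ground truth.
cosines = [0.6, 0.5, -0.2]
print(loss_ratio(cosines, 0, t=0.0))  # early training: a is below 1
print(loss_ratio(cosines, 0, t=0.9))  # later training: a is above 1
```

For this sample the ratio a is below 1 when t = 0 and above 1 when t = 0.9, which matches the early/later behaviour described above: the same hard sample contributes less loss than under ArcFace at first and more later on.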
- Adaptive Estimation of t
- r^(k) is the mean positive cosine similarity (the cosine between each sample and its ground-truth class) in the k-th mini-batch, with r^(0) = 0;
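- t^(k) = α t^(k-1) + (1 − α) r^(k), α = 0.99, written here to match the update in the code above (0.99 * t + 0.01 * batch mean). [Food for thought: why does t^(k) tend to increase monotonically as k grows?]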
- Different strategies or values of t
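The adaptive estimation of t can be sketched in a few lines of plain Python. The update follows the code listing (0.99 * t + 0.01 * batch mean), starting from t = 0. The batch means below are invented purely for illustration, and the names `update_t` and `simulate` are hypothetical.

```python
def update_t(t_prev, target_cosines, momentum=0.99):
    """One update of t from the positive cosines of a mini-batch."""
    r = sum(target_cosines) / len(target_cosines)  # r^(k): batch mean
    return momentum * t_prev + (1.0 - momentum) * r


def simulate(num_batches=1000):
    """Track t while the (made-up) batch mean rises from 0.1 towards 0.8."""
    t, history = 0.0, []
    for k in range(num_batches):
        r = min(0.8, 0.1 + 0.001 * k)
        t = update_t(t, [r - 0.05, r, r + 0.05])
        history.append(t)
    return history


history = simulate()
print(history[0], history[len(history) // 2], history[-1])
```

This also suggests an answer to the question about monotonicity. Rearranging the update gives t_new − t_prev = (1 − momentum) · (r − t_prev), so t rises whenever the current batch mean exceeds the running value. Since t starts at 0 and the positive cosine similarity generally improves as the model trains, the running average keeps trailing below the batch mean and therefore keeps increasing.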
Experiment
- Benchmark
- Challenge
Reference
[1]. CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition[2020-CVPR]