Project repository: CircleLoss
Key ideas
- Re-weight each similarity score so that less-optimized scores are emphasized; because the resulting decision boundary is a circle, the loss is named Circle Loss
- Unify learning from class-level labels and pair-wise labels under one formula
Rationale
A recognition task (e.g. 1:1 verification) should decrease the between-class similarity $s_n$ and increase the within-class similarity $s_p$. However, existing losses apply identical gradients to $s_n$ and $s_p$ during back-propagation, so some points are updated inefficiently: for instance, when both $s_n$ and $s_p$ are close to 0, $s_n$ needs no further optimization and the effort should go into increasing $s_p$.
- In Fig. (a), the three points share the same $(s_n - s_p)$ and hence the same update direction, yet point A's $s_n$ is already well optimized and mainly needs a larger $s_p$, while point B is the opposite
- In Fig. (a), every point on the decision boundary (T, T') can serve as the convergence target, so the gradient direction is ambiguous; in Fig. (b), the point T is a single, clearly defined target
- In Figs. (a)(b), before the boundary is reached, the gradients on $s_n$ and $s_p$ are identical
- In Figs. (a)(b), before the boundary is reached, the gradient magnitude is essentially constant and then drops abruptly at convergence; point B is closer to the boundary yet receives the same gradient as point A
- In Figs. (a)(b), the decision boundary (white dashed line) is parallel to $s_p - s_n = m$, and any point on it can be the target
Implementation
- Define a unified loss function:

$$\begin{aligned} \mathcal{L}_{uni} &=\log \left[1+\sum_{i=1}^{K} \sum_{j=1}^{L} \exp \left(\gamma\left(s_{n}^{j}-s_{p}^{i}+m\right)\right)\right] \\ &=\log \left[1+\sum_{j=1}^{L} \exp \left(\gamma\left(s_{n}^{j}+m\right)\right) \sum_{i=1}^{K} \exp \left(-\gamma s_{p}^{i}\right)\right] \end{aligned}$$

where $\gamma$ is a scale factor, $K$ is the number of within-class similarity scores $s_p^i$, and $L$ is the number of between-class similarity scores $s_n^j$.
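As a concrete illustration, the unified loss can be sketched in PyTorch (a minimal re-derivation from the formula above, not the authors' code; `sp` and `sn` below are made-up similarity tensors):

```python
import torch

def unified_loss(sp: torch.Tensor, sn: torch.Tensor,
                 gamma: float = 80.0, m: float = 0.25) -> torch.Tensor:
    """log[1 + sum_i sum_j exp(gamma * (s_n^j - s_p^i + m))]."""
    # All pairwise terms s_n^j - s_p^i + m, shape (L, K).
    diff = sn.unsqueeze(1) - sp.unsqueeze(0) + m
    # Fold the constant 1 (= exp(0)) into a single logsumexp,
    # which keeps the computation numerically stable.
    logits = torch.cat([torch.zeros(1), gamma * diff.flatten()])
    return torch.logsumexp(logits, dim=0)

sp = torch.tensor([0.8, 0.9])        # K = 2 within-class scores
sn = torch.tensor([0.3, 0.2, 0.1])   # L = 3 between-class scores
print(unified_loss(sp, sn, gamma=2.0, m=0.25))
```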
- For class-level labels, e.g. AM-Softmax:

$$\begin{aligned} \mathcal{L}_{am} &=\log \left[1+\sum_{j=1}^{N-1} \exp \left(\gamma\left(s_{n}^{j}+m\right)\right) \exp \left(-\gamma s_{p}\right)\right] \\ &=-\log \frac{\exp \left(\gamma\left(s_{p}-m\right)\right)}{\exp \left(\gamma\left(s_{p}-m\right)\right)+\sum_{j=1}^{N-1} \exp \left(\gamma s_{n}^{j}\right)} \end{aligned}$$
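The equivalence of the two lines can be checked numerically (a toy sketch with made-up scores, not from the paper):

```python
import math

def am_unified(sp, sn_list, gamma, m):
    # First form: log[1 + sum_j exp(gamma (s_n^j + m)) * exp(-gamma s_p)]
    return math.log(1 + sum(math.exp(gamma * (sn + m)) for sn in sn_list)
                    * math.exp(-gamma * sp))

def am_softmax(sp, sn_list, gamma, m):
    # Second form: the familiar AM-Softmax cross-entropy term
    num = math.exp(gamma * (sp - m))
    den = num + sum(math.exp(gamma * sn) for sn in sn_list)
    return -math.log(num / den)

args = (0.9, [0.3, 0.1], 30.0, 0.35)
assert math.isclose(am_unified(*args), am_softmax(*args))
```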
- For pair-wise labels, e.g. triplet loss:

$$\begin{aligned} \mathcal{L}_{tri} &=\lim _{\gamma \rightarrow+\infty} \frac{1}{\gamma} \mathcal{L}_{uni} \\ &=\lim _{\gamma \rightarrow+\infty} \frac{1}{\gamma} \log \left[1+\sum_{i=1}^{K} \sum_{j=1}^{L} \exp \left(\gamma\left(s_{n}^{j}-s_{p}^{i}+m\right)\right)\right] \\ &=\max \left[s_{n}^{j}-s_{p}^{i}+m\right]_{+} \end{aligned}$$
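This limit can be observed numerically: as $\gamma$ grows, $\frac{1}{\gamma}\mathcal{L}_{uni}$ approaches the hardest-pair hinge (toy similarity values, assumed here for illustration):

```python
import math

def scaled_uni(sp_list, sn_list, gamma, m):
    # (1/gamma) * log[1 + sum over pairs of exp(gamma (s_n - s_p + m))]
    s = sum(math.exp(gamma * (sn - sp + m))
            for sn in sn_list for sp in sp_list)
    return math.log(1 + s) / gamma

sp_list, sn_list, m = [0.7, 0.8], [0.5, 0.4], 0.25
hinge = max(0.0, max(sn - sp + m for sn in sn_list for sp in sp_list))
for gamma in (1.0, 10.0, 100.0, 1000.0):
    print(gamma, scaled_uni(sp_list, sn_list, gamma, m))
print("limit:", hinge)  # the scaled loss converges to this value
```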
- Replace $(s_n - s_p)$ with $(\alpha_n s_n - \alpha_p s_p)$, giving the two similarities different effective learning paces:

$$\left\{\begin{array}{l} \alpha_{p}^{i}=\left[O_{p}-s_{p}^{i}\right]_{+} \\ \alpha_{n}^{j}=\left[s_{n}^{j}-O_{n}\right]_{+} \end{array}\right.$$

where $[\cdot]_{+}$ denotes clipping at zero, and $O_p$ and $O_n$ are the optima of $s_p$ and $s_n$, respectively.
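A sketch of these self-paced weights in PyTorch (an assumption here, following common implementations: the weights are treated as constants during back-propagation, hence the `detach`; scores and optima are made-up):

```python
import torch

def circle_weights(sp: torch.Tensor, sn: torch.Tensor,
                   Op: float, On: float):
    # alpha_p^i = [O_p - s_p^i]_+ , alpha_n^j = [s_n^j - O_n]_+
    # detach(): no gradient flows through the weights themselves.
    alpha_p = torch.clamp(Op - sp, min=0.0).detach()
    alpha_n = torch.clamp(sn - On, min=0.0).detach()
    return alpha_p, alpha_n

sp = torch.tensor([0.9, 0.4])
sn = torch.tensor([0.6, -0.3])
alpha_p, alpha_n = circle_weights(sp, sn, Op=1.25, On=-0.25)
# A score already past its optimum is clipped to weight 0,
# a far-from-optimal score receives a large weight.
print(alpha_p, alpha_n)
```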
$$\begin{aligned} \mathcal{L}_{circle} &=\log \left[1+\sum_{i=1}^{K} \sum_{j=1}^{L} \exp \left(\gamma\left(\alpha_{n}^{j} s_{n}^{j}-\alpha_{p}^{i} s_{p}^{i}\right)\right)\right] \\ &=\log \left[1+\sum_{j=1}^{L} \exp \left(\gamma \alpha_{n}^{j} s_{n}^{j}\right) \sum_{i=1}^{K} \exp \left(-\gamma \alpha_{p}^{i} s_{p}^{i}\right)\right] \end{aligned}$$

- Assign separate margins to $s_n$ and $s_p$:
$$\mathcal{L}_{circle}=\log \left[1+\sum_{j=1}^{L} \exp \left(\gamma \alpha_{n}^{j}\left(s_{n}^{j}-\Delta_{n}\right)\right) \sum_{i=1}^{K} \exp \left(-\gamma \alpha_{p}^{i}\left(s_{p}^{i}-\Delta_{p}\right)\right)\right]$$

- Substituting the weight definitions into the decision boundary $\alpha_{n}\left(s_{n}-\Delta_{n}\right)-\alpha_{p}\left(s_{p}-\Delta_{p}\right)=0$ yields
$$\left(s_{n}-\frac{O_{n}+\Delta_{n}}{2}\right)^{2}+\left(s_{p}-\frac{O_{p}+\Delta_{p}}{2}\right)^{2}=C$$

where $C=\left(\left(O_{n}-\Delta_{n}\right)^{2}+\left(O_{p}-\Delta_{p}\right)^{2}\right) / 4$, so the decision boundary is a circle.
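The algebra can be spot-checked numerically (ignoring the $[\cdot]_+$ clipping, with an arbitrary choice of the four constants): sampling points on this circle and substituting them back into the decision boundary gives zero.

```python
import math

Op, On, Dp, Dn = 1.25, -0.25, 0.75, 0.25      # arbitrary example constants
C = ((On - Dn) ** 2 + (Op - Dp) ** 2) / 4
cn, cp = (On + Dn) / 2, (Op + Dp) / 2          # circle centre
for t in (0.0, 1.0, 2.5, 4.0):
    # A point on the circle of radius sqrt(C) around (cn, cp).
    sn = cn + math.sqrt(C) * math.cos(t)
    sp = cp + math.sqrt(C) * math.sin(t)
    # Decision boundary with alpha_n = s_n - O_n and alpha_p = O_p - s_p.
    boundary = (sn - On) * (sn - Dn) - (Op - sp) * (sp - Dp)
    assert abs(boundary) < 1e-12               # ~0 on every sampled point
```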
- To avoid having too many hyperparameters, all four are derived from a single margin $m$: with $O_p = 1+m$, $O_n = -m$, $\Delta_p = 1-m$, $\Delta_n = m$, the boundary becomes
$$\left(s_{n}-0\right)^{2}+\left(s_{p}-1\right)^{2}=2 m^{2}$$

Only two hyperparameters remain: the scale factor $\gamma$ and the margin $m$.
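Putting the pieces together, a compact PyTorch sketch of the final loss under this single-margin parameterization (a re-implementation from the formulas above, not the authors' code; `gamma` and `m` defaults are illustrative, and detaching the weights is assumed):

```python
import torch
import torch.nn.functional as F

def circle_loss(sp: torch.Tensor, sn: torch.Tensor,
                gamma: float = 256.0, m: float = 0.25) -> torch.Tensor:
    # O_p = 1 + m, O_n = -m, Delta_p = 1 - m, Delta_n = m.
    # The weights act as constants during back-propagation.
    alpha_p = torch.clamp(1 + m - sp, min=0.0).detach()
    alpha_n = torch.clamp(sn + m, min=0.0).detach()
    logit_p = -gamma * alpha_p * (sp - (1 - m))
    logit_n = gamma * alpha_n * (sn - m)
    # log[1 + (sum_j e^{logit_n^j}) * (sum_i e^{logit_p^i})], computed
    # stably as softplus(logsumexp_n + logsumexp_p).
    return F.softplus(torch.logsumexp(logit_n, dim=0)
                      + torch.logsumexp(logit_p, dim=0))

sp = torch.tensor([0.9, 0.7], requires_grad=True)
sn = torch.tensor([0.4, 0.1, -0.2], requires_grad=True)
loss = circle_loss(sp, sn)
loss.backward()   # gradients flow through sp and sn only
```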
Further reading
- *Adaptive Margin Circle Loss for Speaker Verification*
  - When the margin is too small, the gradient degenerates toward a linear function; when it is too large, the gradient drops rapidly to zero near the decision boundary [is that actually harmful?]
  - During training, the margin is decreased from large to small
  - Small speech chunks are hard to train on, so the margin is additionally adapted to the chunk size:
$$m=\left(1-\lambda \frac{L-L_{\min }}{L_{\max }-L_{\min }}\right) m_{0}$$
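A one-liner sketch of this schedule (`lam` and `m0` stand for the paper's $\lambda$ and $m_0$; the numeric values here are placeholders):

```python
def adaptive_margin(L: float, L_min: float, L_max: float,
                    lam: float = 0.5, m0: float = 0.35) -> float:
    """m shrinks linearly from m0 (at L_min) to (1 - lam) * m0 (at L_max)."""
    frac = (L - L_min) / (L_max - L_min)
    return (1 - lam * frac) * m0

print(adaptive_margin(2.0, L_min=2.0, L_max=4.0))  # m0 at the shortest chunk
print(adaptive_margin(4.0, L_min=2.0, L_max=4.0))  # (1 - lam) * m0 at the longest
```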