[Paper Notes] Large-Margin Softmax Loss for Convolutional Neural Networks

有来有去-CV

Article link: https://blog.csdn.net/shaoxiaohu1/article/details/53325945

Reference: Liu W, Wen Y, Yu Z, et al. Large-Margin Softmax Loss for Convolutional Neural Networks[C]//Proceedings of The 33rd International Conference on Machine Learning. 2016: 507-516.

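Abstract

The softmax loss is widely used in convolutional neural networks because it is simple and practical, but it does not explicitly encourage the network to learn discriminative features. This paper proposes the large-margin softmax (L-Softmax) loss, which explicitly encourages features with small intra-class distance and large inter-class distance. L-Softmax can also adjust the desired margin and helps prevent overfitting. Its forward and backward propagation can be derived, so it can be optimized with ordinary stochastic gradient descent. Experiments show that features learned with L-Softmax are more discriminative and outperform softmax on both classification and verification tasks.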

Algorithm

1. Softmax Loss Review

Before introducing L-Softmax, let us review the softmax loss. Given the $i$-th input feature $\mathbf{x}_i$ and its label $y_i$, the softmax loss is:

$L= \frac{1}{N} \sum_{i}{L_i}=\frac{1}{N} \sum_{i}{-\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)}$

where $f_j$ is the $j$-th element of the class score vector $\mathbf{f}$ produced by the final fully connected layer, and $N$ is the number of training samples. Because $\mathbf{f}$ is the output of a fully connected layer with weights $\mathbf{W}$, $f_{y_i}$ can be written as $f_{y_i}=\mathbf{W}_{y_i}^{T}\mathbf{x}_i$, and the per-sample loss becomes:

$L_i= -\log\left(\frac{e^{\Vert\mathbf{W}_{y_i}\Vert\Vert\mathbf{x}_i\Vert \cos(\theta_{y_i})}} {\sum_j{e^{\Vert\mathbf{W}_j\Vert\Vert\mathbf{x}_i\Vert \cos(\theta_j)}}}\right)$

where $\theta_j$ ($0\le\theta_j\le\pi$) is the angle between $\mathbf{W}_j$ and $\mathbf{x}_i$. Although softmax is widely used in deep convolutional neural networks, this form does not effectively encourage features that are compact within a class and well separated between classes.
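To make the rewrite concrete, here is a minimal sketch in plain Python checking that the dot-product form and the norm-times-cosine form of the logits give the same per-sample loss. The weights, feature vector, and helper names (`dot`, `norm`, `softmax_loss`, `cosine_logits`) are made up for illustration and do not come from the paper.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def softmax_loss(logits, label):
    """Per-sample softmax loss: -log(e^{f_y} / sum_j e^{f_j})."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(f - peak) for f in logits))
    return log_sum - logits[label]

def cosine_logits(W, x):
    """Logits written as ||W_j|| * ||x|| * cos(theta_j)."""
    out = []
    for w in W:
        cos_theta = dot(w, x) / (norm(w) * norm(x))
        out.append(norm(w) * norm(x) * cos_theta)
    return out

# Toy values (illustrative only): 3 classes, 2-D features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 1.0]
label = 0

plain = [dot(w, x) for w in W]      # f_j = W_j^T x
angular = cosine_logits(W, x)       # f_j = ||W_j|| ||x|| cos(theta_j)
loss_plain = softmax_loss(plain, label)
loss_angular = softmax_loss(angular, label)
print(loss_plain, loss_angular)     # the two values agree up to rounding
```

The angular form changes nothing numerically; it only exposes $\theta_j$ so that a margin can later be placed on the angle.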

2. Motivation

The original softmax aims for $\mathbf{W}_1^{T}\mathbf{x} \gt \mathbf{W}_2^{T}\mathbf{x}$, i.e. $\Vert\mathbf{W}_1\Vert\Vert\mathbf{x}\Vert \cos(\theta_1) \gt \Vert\mathbf{W}_2\Vert\Vert\mathbf{x}\Vert \cos(\theta_2)$, so that a sample $\mathbf{x}$ from class 1 is classified correctly. The motivation behind the large-margin softmax loss is to introduce a positive integer $m$ that creates a decision margin and constrains this inequality more strictly:

$\Vert\mathbf{W}_1\Vert \Vert\mathbf{x}\Vert \cos(\theta_1) \ge \Vert\mathbf{W}_1\Vert \Vert\mathbf{x}\Vert \cos(m\theta_1)\gt \Vert\mathbf{W}_2\Vert \Vert\mathbf{x}\Vert \cos(\theta_2)$

where $0\le\theta_1\lt \frac{\pi}{m}$. If $\mathbf{W}_1$ and $\mathbf{W}_2$ satisfy $\Vert\mathbf{W}_1\Vert\Vert\mathbf{x}\Vert \cos(m\theta_1)\gt \Vert\mathbf{W}_2\Vert\Vert\mathbf{x}\Vert \cos(\theta_2)$, then they necessarily satisfy $\Vert\mathbf{W}_1\Vert\Vert\mathbf{x}\Vert \cos(\theta_1) \gt \Vert\mathbf{W}_2\Vert\Vert\mathbf{x}\Vert \cos(\theta_2)$. This stricter constraint makes learning $\mathbf{W}_1$ and $\mathbf{W}_2$ harder, and in turn yields a wider decision margin between class 1 and class 2.
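The implication above follows from $\cos(m\theta_1)\le\cos(\theta_1)$ on $[0, \frac{\pi}{m})$. The following sketch checks it numerically on a grid. The function names, the unit norms, and the sample angles are assumptions chosen only for illustration.

```python
import math

def softmax_ok(n1, n2, norm_x, theta1, theta2):
    """Original softmax condition for a class-1 sample."""
    return n1 * norm_x * math.cos(theta1) > n2 * norm_x * math.cos(theta2)

def large_margin_ok(n1, n2, norm_x, theta1, theta2, m):
    """Stricter condition with margin m; assumes 0 <= theta1 < pi/m."""
    return n1 * norm_x * math.cos(m * theta1) > n2 * norm_x * math.cos(theta2)

m = 2
n1 = n2 = norm_x = 1.0  # toy norms, chosen only for illustration

# A sample that is classified correctly but does not meet the margin.
plain_result = softmax_ok(n1, n2, norm_x, 0.5, 0.8)           # cos(0.5) vs cos(0.8)
margin_result = large_margin_ok(n1, n2, norm_x, 0.5, 0.8, m)  # cos(1.0) vs cos(0.8)
print(plain_result, margin_result)

# Grid check: whenever the stricter condition holds, the original one holds too.
theta1_grid = [i * (math.pi / m) / 50 for i in range(50)]  # samples of [0, pi/m)
theta2_grid = [j * math.pi / 50 for j in range(51)]        # samples of [0, pi]
implication_holds = all(
    softmax_ok(n1, n2, norm_x, t1, t2)
    for t1 in theta1_grid
    for t2 in theta2_grid
    if large_margin_ok(n1, n2, norm_x, t1, t2, m)
)
print(implication_holds)
```

The sample at $\theta_1 = 0.5$, $\theta_2 = 0.8$ is the interesting case: plain softmax is already satisfied, while the margin version still asks the network to shrink $\theta_1$.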

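(Put plainly: with the softmax loss, same-class and different-class samples are handled by exactly the same expression, so the learned features are not strongly discriminative either within or between classes. This paper deliberately makes the same-class term harder to satisfy than the different-class terms, and that unequal treatment is what makes the features more discriminative. It is a bit like parenting: stricter with your own kids, more lenient with other people's, haha.)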

Large-Margin Softmax Loss

Following the idea in the previous section, the L-Softmax loss can be written as:

$L_i= -\log\left(\frac{e^{\Vert\mathbf{W}_{y_i}\Vert\Vert\mathbf{x}_i\Vert\psi(\theta_{y_i})}} {e^{\Vert\mathbf{W}_{y_i}\Vert\Vert\mathbf{x}_i\Vert\psi(\theta_{y_i})}+\sum_{j\neq y_i}{e^{\Vert\mathbf{W}_{j}\Vert\Vert\mathbf{x}_i\Vert \cos(\theta_j)}}}\right)$

Here $\psi(\theta)$ is defined as:

$\psi(\theta)= \begin{cases} \cos(m\theta), & 0 \le \theta \le \frac{\pi}{m} \\ \mathcal{D}(\theta), & \frac{\pi}{m} \lt \theta \le \pi \end{cases}$

The larger $m$ is, the larger the classification margin, and naturally the harder the learning objective. $\mathcal{D}(\theta)$ must be a monotonically decreasing function with $\mathcal{D}(\frac{\pi}{m})=\cos(\frac{\pi}{m})$, so that $\psi(\theta)$ is continuous. (This requirement keeps $\psi(\theta)$ similar in shape to $\cos(\theta)$; I am not entirely clear on the underlying mathematical rationale.)

To simplify forward and backward propagation, the authors construct the following specific form of $\psi(\theta)$:

$\psi(\theta) = (-1)^k \cos(m\theta)-2k, \quad \theta \in\left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$

where $k$ is an integer and $k \in [0,m-1]$. The figure below compares the softmax loss and the L-Softmax loss.

[Figure: comparison of softmax loss and L-Softmax loss]
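The piecewise form above is easy to evaluate directly. Below is a minimal sketch, not the authors' implementation, of $\psi(\theta)$ and the resulting per-sample loss in plain Python. It takes the norms and angles as inputs, and the function names and toy values are assumptions for illustration. At each breakpoint $\theta = \frac{k\pi}{m}$ the two neighbouring pieces agree (for example, both give $-1$ at $\theta = \frac{\pi}{m}$), so the index $k$ can be picked by simple truncation.

```python
import math

def psi(theta, m):
    """psi(theta) = (-1)^k cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m]."""
    k = min(int(theta * m / math.pi), m - 1)  # interval index, clamped at theta = pi
    return (-1) ** k * math.cos(m * theta) - 2 * k

def l_softmax_loss(norms_w, norm_x, thetas, label, m):
    """Per-sample L-Softmax loss; only the target-class logit uses psi."""
    logits = [nw * norm_x * math.cos(t) for nw, t in zip(norms_w, thetas)]
    logits[label] = norms_w[label] * norm_x * psi(thetas[label], m)
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(f - peak) for f in logits))
    return log_sum - logits[label]

# psi is non-increasing on [0, pi] (checked on a grid for m = 4).
grid = [i * math.pi / 400 for i in range(401)]
values = [psi(t, 4) for t in grid]
is_monotone = all(a >= b - 1e-12 for a, b in zip(values, values[1:]))
print(is_monotone)

# Toy two-class sample (illustrative values): the loss grows as m increases.
norms_w, norm_x, thetas, label = [1.0, 1.0], 1.0, [0.3, 1.2], 0
losses = [l_softmax_loss(norms_w, norm_x, thetas, label, m) for m in (1, 2, 3, 4)]
print(losses)
```

With $m = 1$ the function reduces to $\cos(\theta)$, so the same code also reproduces the ordinary softmax loss.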

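Next, $\cos(\theta_j)$ is replaced by $\frac{\mathbf{W}_j^T \mathbf{x}_i} {\Vert \mathbf{W}_j\Vert \Vert \mathbf{x}_i \Vert}$, and $\cos(m\theta_{y_i})$ is rewritten as a function of $\cos(\theta_{y_i})$ and $m$ (the paper gives the expression; it is too long, so I will not type it out here). With these substitutions, the final L-Softmax loss can be differentiated with respect to $\mathbf{x}$ and $\mathbf{W}$. For the rest of the derivation, see the original paper (too many formulas, and I am too lazy).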
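The notes omit the expansion of $\cos(m\theta)$, so as a hedged illustration here is the standard multiple-angle identity $\cos(m\theta)=\sum_{n=0}^{\lfloor m/2\rfloor}(-1)^n\binom{m}{2n}\cos^{m-2n}\theta\,(1-\cos^2\theta)^n$, which I assume is equivalent to the form used in the paper. The sketch computes $\cos(m\theta)$ from $\cos(\theta)$ alone, meaning from $\mathbf{W}$ and $\mathbf{x}$ without ever calling an inverse cosine. The function name `cos_m_theta` is hypothetical.

```python
import math

def cos_m_theta(cos_theta, m):
    """cos(m*theta) as a polynomial in cos(theta) (multiple-angle identity)."""
    sin_sq = 1.0 - cos_theta ** 2  # sin^2(theta)
    total = 0.0
    for n in range(m // 2 + 1):
        total += ((-1) ** n * math.comb(m, 2 * n)
                  * cos_theta ** (m - 2 * n) * sin_sq ** n)
    return total

# Compare against the direct evaluation for a few angles and margins.
max_error = max(
    abs(cos_m_theta(math.cos(theta), m) - math.cos(m * theta))
    for m in (1, 2, 3, 4, 5)
    for theta in (0.0, 0.3, 1.0, 2.0, math.pi)
)
print(max_error)  # small, on the order of floating-point rounding
```

Because the result is a polynomial in $\cos(\theta_{y_i}) = \frac{\mathbf{W}_{y_i}^T \mathbf{x}_i}{\Vert \mathbf{W}_{y_i}\Vert \Vert \mathbf{x}_i \Vert}$, differentiating with respect to $\mathbf{x}$ and $\mathbf{W}$ reduces to the chain rule on that ratio.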

A Brief Analysis

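To show the effectiveness of the L-Softmax loss simply and clearly, the authors consider a binary classification problem involving only $\mathbf{W}_1$ and $\mathbf{W}_2$. The analysis is shown in the figure below.

[Figure: analysis of the binary classification case]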

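During training, when $\Vert\mathbf{W}_1\Vert=\Vert\mathbf{W}_2\Vert$, the softmax loss requires $\theta_1 \lt \theta_2$ for a class-1 sample, whereas L-Softmax requires $m\theta_1 \lt \theta_2$; the figure shows that L-Softmax yields a stricter classification criterion. When $\Vert\mathbf{W}_1\Vert \gt \Vert\mathbf{W}_2\Vert$ or $\Vert\mathbf{W}_1\Vert\lt \Vert\mathbf{W}_2\Vert$, the situation is more complicated, but L-Softmax still produces a larger decision margin.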
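As a worked example of the equal-norm case, assume a 2-D setting where $\mathbf{x}$ lies between $\mathbf{W}_1$ and $\mathbf{W}_2$ and the angle between the two weight vectors is $\alpha$, so that $\theta_2 = \alpha - \theta_1$. Under that assumption (mine, for illustration), $\theta_1 \lt \theta_2$ becomes $\theta_1 \lt \frac{\alpha}{2}$, while $m\theta_1 \lt \theta_2$ becomes $\theta_1 \lt \frac{\alpha}{m+1}$. The sketch below checks this by scanning $\theta_1$ with the piecewise $\psi$. The helper name `largest_accepted_angle` is hypothetical.

```python
import math

def psi(theta, m):
    """psi(theta) = (-1)^k cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m]."""
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

def largest_accepted_angle(alpha, m, steps=10000):
    """Scan theta1 in [0, alpha]; return the largest one with psi(theta1) > cos(theta2)."""
    best = 0.0
    for i in range(steps + 1):
        theta1 = alpha * i / steps
        theta2 = alpha - theta1  # assumes x lies between W1 and W2 in the plane
        if psi(theta1, m) > math.cos(theta2):
            best = theta1
    return best

alpha = math.pi / 2  # illustrative angle between W1 and W2
steps = 10000
boundaries = {m: largest_accepted_angle(alpha, m, steps) for m in (1, 2, 4)}
for m, bound in boundaries.items():
    # Numerical boundary next to the closed-form value alpha / (m + 1).
    print(m, bound, alpha / (m + 1))
```

For $m = 1$ the boundary sits at the bisector $\frac{\alpha}{2}$, as with plain softmax; larger $m$ pulls the region accepted for class 1 closer to $\mathbf{W}_1$, which is the larger decision margin described above.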