Deriving the Backward Pass of the Multi-Class (Softmax) Cross-Entropy Loss

In my previous post I claimed that the backward pass of the multi-class (softmax) cross-entropy loss is y - t, but gave no proof. The proof follows.

First, the softmax function for multi-class classification is:

y_{i} = \frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}}

The cross-entropy loss is:

L = -\sum_{i}t_{i}\ln y_{i}
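As a quick sanity check of the two definitions above, here is a minimal NumPy sketch of the forward pass (the function names and sample values are my own illustration, not from the original post):

```python
import numpy as np

def softmax(x):
    # Subtracting the max is for numerical stability only; softmax is
    # invariant to adding a constant to every logit.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(y, t):
    # L = -sum_i t_i * ln(y_i); a tiny epsilon guards against log(0).
    return -np.sum(t * np.log(y + 1e-12))

x = np.array([2.0, 1.0, 0.1])   # logits
t = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
y = softmax(x)
print(y, cross_entropy(y, t))   # y sums to 1; loss = -ln(y[0])
```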

Here x is the input (the logits), with k indexing its k-th dimension; y is the predicted distribution, with i indexing the i-th label; and t is the ground truth. The backward pass we want is, in fact, the partial derivative of the loss L with respect to each input dimension k, \frac{\partial L}{\partial x_{k}}. By the chain rule:

\frac{\partial L}{\partial x_{k}} = \sum_{i}^{}\frac{\partial L}{\partial y_{i}}\frac{\partial y_{i}}{\partial x_{k}}

\frac{\partial L}{\partial y_{i}} = -\frac{t_{i}}{y_{i}}

\frac{\partial y_{i}}{\partial x_{k}} = \frac{\partial }{\partial x_{k}}\left (\frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}} \right )

Therefore,

\frac{\partial L}{\partial x_{k}} = -\sum_{i}\frac{t_{i}}{y_{i}}\frac{\partial }{\partial x_{k}}\left (\frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}} \right )

Note that the form of the partial derivative on the right-hand side depends on whether i equals k, so we split the sum into the i = k term and the i \neq k terms:

\frac{\partial L}{\partial x_{k}} = -\left (\frac{t_{k}}{y_{k}}\frac{\partial }{\partial x_{k}}\left (\frac{e^{x_{k}}}{\sum_{j}e^{x_{j}}} \right ) + \sum_{i\neq k}^{}\frac{t_{i}}{y_{i}}\frac{\partial }{\partial x_{k}}\left (\frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}} \right ) \right )

For the i = k term, the quotient rule gives:

\frac{\partial }{\partial x_{k}}\left (\frac{e^{x_{k}}}{\sum_{j}e^{x_{j}}} \right ) = \frac{e^{x_{k}}\left ( \sum_{j}e^{x_{j}} - e^{x_{k}} \right )}{\left (\sum_{j}e^{x_{j}} \right )^{2}} = y_{k}\left ( 1 - y_{k} \right )

For the i \neq k terms, only the denominator depends on x_{k}:

\frac{\partial }{\partial x_{k}}\left (\frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}} \right ) = - \frac{e^{x_{i}}e^{x_{k}}}{\left (\sum_{j}e^{x_{j}} \right )^{2}} = -y_{i}y_{k}
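The two cases combine into the softmax Jacobian \frac{\partial y_{i}}{\partial x_{k}} = y_{i}\left ( \delta_{ik} - y_{k} \right ), i.e. \mathrm{diag}(y) - yy^{T}. Below is a small sketch (sample logits are arbitrary) that checks both cases against central finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
y = softmax(x)

# Analytic Jacobian: J[i, k] = y_i * (1 - y_k) if i == k, else -y_i * y_k,
# which is exactly diag(y) - outer(y, y).
J = np.diag(y) - np.outer(y, y)

# Numerical Jacobian via central differences.
eps = 1e-6
J_num = np.zeros_like(J)
for k in range(len(x)):
    d = np.zeros_like(x)
    d[k] = eps
    J_num[:, k] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.max(np.abs(J - J_num)))  # should be ~1e-10 or smaller
```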

Therefore,

\frac{\partial L}{\partial x_{k}} = -\left (\frac{t_{k}}{y_{k}} y_{k}\left ( 1 - y_{k} \right ) - \sum_{i\neq k}\frac{t_{i}}{y_{i}}y_{i}y_{k} \right )

=\sum_{i\neq k}^{}t_{i}y_{k} - t_{k}\left ( 1 - y_{k} \right )

= y_{k}\sum_{i}t_{i} - t_{k}

Because the ground truth t in a multi-class task is a one-hot vector, \sum_{i}t_{i} = 1. Therefore,

\frac{\partial L}{\partial x_{k}} = y_{k}- t_{k}
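The result y_{k} - t_{k} is easy to verify numerically. A minimal gradient check (sample values are my own):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def loss(x, t):
    return -np.sum(t * np.log(softmax(x) + 1e-12))

x = np.array([2.0, 1.0, 0.1])
t = np.array([0.0, 1.0, 0.0])   # one-hot label

grad = softmax(x) - t           # analytic gradient from the derivation: y - t

# Central-difference check, one input dimension at a time.
eps = 1e-6
grad_num = np.zeros_like(x)
for k in range(len(x)):
    d = np.zeros_like(x)
    d[k] = eps
    grad_num[k] = (loss(x + d, t) - loss(x - d, t)) / (2 * eps)

print(np.max(np.abs(grad - grad_num)))  # should be ~1e-9 or smaller
```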

This also serves as a reminder that the softmax cross-entropy loss only applies to single-label multi-class tasks; do not use it to train multi-label classifiers (a painful lesson I learned firsthand). Keep this firmly in mind. Comments and corrections are welcome.

The same result extends one layer further back. If the logits are produced by a fully connected layer, z_{j} = \sum_{k}w_{jk}x_{k} + b_{j} (here x denotes the layer's input and z its output, which feeds the softmax), then \frac{\partial L}{\partial z_{j}} = y_{j} - t_{j} as derived above, and by the chain rule:

\frac{\partial L}{\partial w_{jk}} = \frac{\partial L}{\partial z_{j}}\frac{\partial z_{j}}{\partial w_{jk}} = x_{k}\left ( y_{j} - t_{j} \right )

\frac{\partial L}{\partial b_{j}} = \frac{\partial L}{\partial z_{j}}\frac{\partial z_{j}}{\partial b_{j}} = y_{j} - t_{j}

where w_{jk} is the weight connecting the j-th neuron to the k-th input, and b_{j} is the bias of the j-th neuron.
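A sketch checking these parameter gradients against finite differences (the layer sizes and random values are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # layer input
W = rng.normal(size=(3, 4))     # z_j = sum_k W[j, k] * x[k] + b[j]
b = rng.normal(size=3)
t = np.array([0.0, 1.0, 0.0])   # one-hot label

y = softmax(W @ x + b)
dz = y - t                      # dL/dz, as derived above
dW = np.outer(dz, x)            # dL/dW[j, k] = x_k * (y_j - t_j)
db = dz                         # dL/db_j  = y_j - t_j

def loss(W, b):
    return -np.sum(t * np.log(softmax(W @ x + b) + 1e-12))

# Spot-check one weight entry with a central difference.
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[1, 2] += eps
Wm[1, 2] -= eps
print(dW[1, 2], (loss(Wp, b) - loss(Wm, b)) / (2 * eps))  # should match
```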