Welcome to star my Machine Learning Blog: https://github.com/purepisces/Wenqing-Machine_Learning_Blog.
Cross-Entropy Loss
Cross-entropy loss is one of the most commonly used loss functions for probability-based classification problems.
Cross-Entropy Loss Forward Equation
First, we use the softmax function to convert the raw model output $A$ into a probability distribution over the $C$ classes, determined by the exponentials of the input values.
$\iota_N$ and $\iota_C$ are column vectors of size $N$ and $C$, respectively, filled with ones.
$$\text{softmax}(A) = \sigma(A) = \frac{\exp(A)}{\sum\limits_{j=1}^{C} \exp(A_{ij})}$$
Now each row of $\sigma(A)$ represents the model's predicted probability distribution, while each row of $Y$ represents the target distribution for one input.
Then, we compute the cross-entropy $H(A, Y)$ of each distribution $A_i$ relative to its target distribution $Y_i$, for $i = 1, \ldots, N$:
$$\text{crossentropy} = H(A, Y) = (-Y \odot \log(\sigma(A))) \cdot \iota_C$$
Remember that the output of a loss function is a scalar, but here we have a column matrix of size $N$. To convert it to a scalar, we can take either the sum or the mean of all the cross-entropies.
Here we choose the mean cross-entropy as the cross-entropy loss, which is also PyTorch's default:
$$\text{sumcrossentropyloss} := \iota_N^T \cdot H(A, Y) = SCE(A, Y)$$
$$\text{meancrossentropyloss} := \frac{SCE(A, Y)}{N}$$
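As a quick illustration, here is a minimal NumPy sketch of these forward equations (the array names `A`, `Y`, `iota_C`, `iota_N` and the example values are chosen here for illustration; they are not part of any particular library API):

```python
import numpy as np

# Hypothetical batch: N = 2 samples, C = 3 classes (values chosen for illustration).
A = np.array([[2.0, 1.0, 0.1],
              [0.1, 2.0, 1.9]])   # raw logits
Y = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])   # one-hot targets
N, C = A.shape

# softmax(A): exponentiate and normalize each row
sigma = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)

# H(A, Y) = (-Y ⊙ log(σ(A))) · ι_C, a column of per-sample cross-entropies
iota_C = np.ones((C, 1))
H = (-Y * np.log(sigma)) @ iota_C          # shape (N, 1)

# Sum and mean cross-entropy losses
iota_N = np.ones((N, 1))
SCE = (iota_N.T @ H).item()                # sum cross-entropy
MCE = SCE / N                              # mean cross-entropy (PyTorch's default reduction)
print(MCE)
```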
Cross-Entropy Loss Backward Equation
$$\text{xent.backward}() = \frac{\sigma(A) - Y}{N}$$
Derivation of the Gradient (My Proof)
To find the gradient of the cross-entropy loss with respect to the logits $A_i$, we need to compute the derivative $\frac{\partial H}{\partial A_i}$. This involves applying the chain rule to the composition of the logarithm and the softmax function.
Note that for a single logit $A_{ic}$, the softmax function is defined as:
$$\sigma(A_{ic}) = \frac{e^{A_{ic}}}{\sum\limits_{j=1}^{C} e^{A_{ij}}}$$
Step 1: Apply the Chain Rule
First, note that we apply the chain rule to the derivative of a composite function, the logarithm of the softmax output:
$$H(A_i, Y_i) = -\sum_{c=1}^{C} Y_{ic} \log(\sigma(A_{ic}))$$
$$\begin{align*} \frac{\partial H}{\partial A_{ic}} &= \frac{\partial \left(-Y_{i1}\log(\sigma(A_{i1})) - Y_{i2}\log(\sigma(A_{i2})) - \ldots - Y_{iC}\log(\sigma(A_{iC}))\right)}{\partial A_{ic}} \\ &= \frac{\partial (-Y_{i1}\log(\sigma(A_{i1})))}{\partial A_{ic}} + \frac{\partial (-Y_{i2}\log(\sigma(A_{i2})))}{\partial A_{ic}} + \ldots + \frac{\partial (-Y_{iC}\log(\sigma(A_{iC})))}{\partial A_{ic}} \\ &= -Y_{i1}\frac{\partial \log(\sigma(A_{i1}))}{\partial A_{ic}} - Y_{i2}\frac{\partial \log(\sigma(A_{i2}))}{\partial A_{ic}} - \ldots - Y_{iC}\frac{\partial \log(\sigma(A_{iC}))}{\partial A_{ic}} \\ &= -\sum_{k=1}^{C} Y_{ik} \frac{\partial \log(\sigma(A_{ik}))}{\partial A_{ic}} \end{align*}$$
$$\frac{\partial H}{\partial A_{ic}} = -\sum_{k=1}^{C} Y_{ik} \frac{\partial \log(\sigma(A_{ik}))}{\partial A_{ic}}$$
Step 2: Derivative of the Logarithm of the Softmax
The derivative of $\log(\sigma(A_{ik}))$ with respect to $A_{ic}$ involves two cases: $k = c$ and $k \neq c$.
When $k = c$, using the derivative of the logarithm, $\frac{\partial \log(x)}{\partial x} = \frac{1}{x}$, and the definition of softmax, we get:
$$\begin{align*} \frac{\partial \log(\sigma(A_{ik}))}{\partial A_{ic}} &= \frac{\partial \log(\sigma(A_{ic}))}{\partial \sigma(A_{ic})} \cdot \frac{\partial \sigma(A_{ic})}{\partial A_{ic}} \\ &= \frac{1}{\sigma(A_{ic})} \cdot \sigma(A_{ic}) \cdot (1 - \sigma(A_{ic})) \\ &= 1 - \sigma(A_{ic}) \end{align*}$$
When $k \neq c$, the derivative involves the softmax of a different class, and the result is:
$$\begin{align*} \frac{\partial \log(\sigma(A_{ik}))}{\partial A_{ic}} &= \frac{\partial \log(\sigma(A_{ik}))}{\partial \sigma(A_{ik})} \cdot \frac{\partial \sigma(A_{ik})}{\partial A_{ic}} \\ &= \frac{1}{\sigma(A_{ik})} \cdot \left(-\sigma(A_{ik}) \cdot \sigma(A_{ic})\right) \\ &= -\sigma(A_{ic}) \end{align*}$$
Step 3: Combining the Cases
Since $Y_{ik}$ is 1 only when $k$ is the true class and 0 otherwise, the sum keeps a single term and this simplifies to:
$$\begin{align*} \frac{\partial H}{\partial A_{ic}} &= -\sum_{k=1}^{C} Y_{ik} \frac{\partial \log(\sigma(A_{ik}))}{\partial A_{ic}} \\ &= \begin{cases} -1 \cdot (1 - \sigma(A_{ic})) = \sigma(A_{ic}) - 1, & \text{for } Y_{ic} = 1 \\ -1 \cdot (-\sigma(A_{ic})) = \sigma(A_{ic}) - 0, & \text{for } Y_{ic} = 0 \end{cases} \\ &= \sigma(A_{ic}) - Y_{ic} \end{align*}$$
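To sanity-check this result, the following sketch (with illustrative logits and a one-hot target; all names are hypothetical) compares the analytic gradient $\sigma(A_i) - Y_i$ against a central finite-difference approximation of $H$ for a single sample:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def H(a, y):
    # Cross-entropy of softmax(a) against a one-hot target y
    return -np.sum(y * np.log(softmax(a)))

a = np.random.randn(5)            # illustrative logits A_i
y = np.zeros(5)
y[2] = 1.0                        # one-hot target Y_i (true class = 3)

analytic = softmax(a) - y         # sigma(A_i) - Y_i from the derivation above

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(a)
for c in range(len(a)):
    a_plus, a_minus = a.copy(), a.copy()
    a_plus[c] += eps
    a_minus[c] -= eps
    numeric[c] = (H(a_plus, y) - H(a_minus, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # expected: True
```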
Examples
Let me give a concrete example to illustrate it.
Example 1
Consider the case $Y = [1, 0, 0]$ and $A = [2, 1, -1]$.
$$Y_{11} = 1, \quad Y_{12} = 0, \quad Y_{13} = 0$$
$$A_{11} = 2, \quad A_{12} = 1, \quad A_{13} = -1$$
Then, when we compute:
$$\begin{align*} \frac{\partial H}{\partial A_{13}} &= -\sum\limits_{k=1}^{C} Y_{1k} \frac{\partial \log(\sigma(A_{1k}))}{\partial A_{13}} \\ &= -Y_{11}\frac{\partial \log(\sigma(A_{11}))}{\partial A_{13}} - Y_{12}\frac{\partial \log(\sigma(A_{12}))}{\partial A_{13}} - Y_{13}\frac{\partial \log(\sigma(A_{13}))}{\partial A_{13}} \\ &= -1 \cdot (-\sigma(A_{13})) - 0 - 0 \\ &= \sigma(A_{13}) - 0 \\ &= \sigma(A_{13}) - Y_{13} \end{align*}$$
Example 2
Consider the case $Y = [0, 0, 1]$ and $A = [2, 1, -1]$.
$$Y_{11} = 0, \quad Y_{12} = 0, \quad Y_{13} = 1$$
$$A_{11} = 2, \quad A_{12} = 1, \quad A_{13} = -1$$
Then, when we compute:
$$\begin{align*} \frac{\partial H}{\partial A_{13}} &= -\sum\limits_{k=1}^{C} Y_{1k} \frac{\partial \log(\sigma(A_{1k}))}{\partial A_{13}} \\ &= -Y_{11}\frac{\partial \log(\sigma(A_{11}))}{\partial A_{13}} - Y_{12}\frac{\partial \log(\sigma(A_{12}))}{\partial A_{13}} - Y_{13}\frac{\partial \log(\sigma(A_{13}))}{\partial A_{13}} \\ &= -0 - 0 - 1 \cdot (1 - \sigma(A_{13})) \\ &= \sigma(A_{13}) - 1 \\ &= \sigma(A_{13}) - Y_{13} \end{align*}$$
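The same conclusion can be checked numerically for the two examples above. In this small sketch (helper names are illustrative), $\sigma(A_{13})$ is simply the third softmax component of $A = [2, 1, -1]$, and a finite-difference estimate of $\frac{\partial H}{\partial A_{13}}$ matches $\sigma(A_{13}) - Y_{13}$ in both cases:

```python
import numpy as np

def H(a, y):
    s = np.exp(a) / np.exp(a).sum()
    return -np.sum(y * np.log(s))

A1 = np.array([2.0, 1.0, -1.0])
sigma = np.exp(A1) / np.exp(A1).sum()

eps = 1e-6
for Y1 in (np.array([1.0, 0.0, 0.0]),      # Example 1
           np.array([0.0, 0.0, 1.0])):     # Example 2
    A_plus, A_minus = A1.copy(), A1.copy()
    A_plus[2] += eps
    A_minus[2] -= eps
    dH_dA13 = (H(A_plus, Y1) - H(A_minus, Y1)) / (2 * eps)   # numerical dH/dA_13
    print(np.isclose(dH_dA13, sigma[2] - Y1[2]))             # sigma(A_13) - Y_13; expected: True
```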
Derivation of the Gradient (Proof from YouTube)
- The softmax function outputs a vector.
- Each element $\frac{e^{z_k}}{\sum\limits_{c=1}^{C} e^{z_c}}$ depends on all of the input elements because of the denominator.
- The gradient of a vector with respect to a vector is a matrix.
- To simplify and solidify this concept, let's make it concrete by looking at a vector of size $3$:
$$\begin{pmatrix} z_1\\ z_2\\ z_3 \end{pmatrix} \rightarrow \begin{pmatrix} \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}\\ \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}}\\ \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}} \end{pmatrix} = \begin{pmatrix} a_1\\ a_2\\ a_3 \end{pmatrix} = \begin{pmatrix} \hat{y}_1\\ \hat{y}_2\\ \hat{y}_3 \end{pmatrix}$$
What happens on the diagonal of the matrix? There we differentiate with respect to the same element that appears in the numerator. For example, for $\frac{\partial a_1}{\partial z_1}$ we get:
$$\frac{\partial a_1}{\partial z_1} = \frac{e^{z_1}(e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1}e^{z_1}}{(e^{z_1} + e^{z_2} + e^{z_3})^2} = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} \cdot \frac{e^{z_1} + e^{z_2} + e^{z_3} - e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} = a_1(1 - a_1)$$
So we get something very close to the derivative of the sigmoid.
What happens for the off-diagonal elements? For example, for $\frac{\partial a_1}{\partial z_2}$ we get:
$$\frac{\partial a_1}{\partial z_2} = \frac{0 \cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1}e^{z_2}}{(e^{z_1} + e^{z_2} + e^{z_3})^2} = -\frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} \cdot \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}} = -a_1 a_2$$
For our $3 \times 3$ matrix, we therefore get:
$$\frac{\partial \mathbf{a}}{\partial \mathbf{z}} = \begin{pmatrix} a_1(1 - a_1) & -a_1a_2 & -a_1a_3 \\ -a_2a_1 & a_2(1 - a_2) & -a_2a_3 \\ -a_3a_1 & -a_3a_2 & a_3(1 - a_3) \end{pmatrix}$$
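In code, this Jacobian is just `np.diag(a) - np.outer(a, a)` applied to the softmax output. Here is a small sketch (with illustrative values) that also verifies it against finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])            # illustrative logits
a = softmax(z)

# Analytic Jacobian: diagonal a_i(1 - a_i), off-diagonal -a_i a_j
J = np.diag(a) - np.outer(a, a)

# Numerical Jacobian by central finite differences
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    J_num[:, j] = (softmax(z_plus) - softmax(z_minus)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-5))   # expected: True
```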
For the derivative of the loss with respect to the final output, we have a scalar differentiated with respect to a vector, so the result will also be a vector:
$$\frac{\partial \mathcal{L}}{\partial a_L} = \begin{bmatrix} \frac{\partial}{\partial a_{L1}} \left(-\sum\limits_{c=1}^{C} y_c \log a_{Lc}\right) \\ \vdots \\ \frac{\partial}{\partial a_{LC}} \left(-\sum\limits_{c=1}^{C} y_c \log a_{Lc}\right) \end{bmatrix} = -\begin{bmatrix} \frac{y_1}{a_{L1}} \\ \vdots \\ \frac{y_C}{a_{LC}} \end{bmatrix}$$
Remember that for each one-hot vector $y$, exactly one element equals 1 and the rest are 0.
Going back to our concrete $3 \times 3$ example and putting everything together, we get:
$$\begin{align} \frac{\partial \mathcal{L}}{\partial z_L} &= \frac{\partial \mathcal{L}}{\partial a_L} \frac{\partial a_L}{\partial z_L} \\ &= -\begin{bmatrix} \frac{y_1}{a_1} & \frac{y_2}{a_2} & \frac{y_3}{a_3} \end{bmatrix} \begin{pmatrix} a_1(1 - a_1) & -a_1a_2 & -a_1a_3 \\ -a_2a_1 & a_2(1 - a_2) & -a_2a_3 \\ -a_3a_1 & -a_3a_2 & a_3(1 - a_3) \end{pmatrix} \\ &= -\begin{bmatrix} y_1 - a_1(y_1 + y_2 + y_3) & y_2 - a_2(y_1 + y_2 + y_3) & y_3 - a_3(y_1 + y_2 + y_3) \end{bmatrix} \\ &= \mathbf{a} - \mathbf{y} \end{align}$$
Note that here $\mathbf{a}$ and $\mathbf{y}$ are vectors, not scalars.
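Numerically, the vector-Jacobian product does collapse to $\mathbf{a} - \mathbf{y}$. Here is a brief sketch with an illustrative one-hot $\mathbf{y}$, reusing the Jacobian form above:

```python
import numpy as np

z = np.array([2.0, 1.0, -1.0])       # illustrative logits z_L
a = np.exp(z) / np.exp(z).sum()      # softmax output a_L
y = np.array([0.0, 1.0, 0.0])        # illustrative one-hot target

dL_da = -y / a                        # dL/da_L, the vector derived above
J = np.diag(a) - np.outer(a, a)       # da_L/dz_L, the softmax Jacobian

dL_dz = dL_da @ J                     # vector-Jacobian product
print(np.allclose(dL_dz, a - y))      # expected: True
```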
Example of Cross-Entropy Loss
To illustrate cross-entropy loss, let's consider a concrete example with a small dataset. Suppose we have a simple classification problem with three classes ($C = 3$) and we are processing a batch of two samples ($N = 2$). The raw model output scores ($A$) and the corresponding true labels ($Y$) for these two samples might look like this:
- Raw model outputs ($A$) for the two samples:
  - Sample 1: $[2.0, 1.0, 0.1]$
  - Sample 2: $[0.1, 2.0, 1.9]$
- True class distributions ($Y$, one-hot encoded):
  - Sample 1: $[0, 1, 0]$ (class 2 is the true class)
  - Sample 2: $[1, 0, 0]$ (class 1 is the true class)
Let's compute the cross-entropy loss for this example step by step:
1. Apply Softmax
First, we apply the softmax function to the raw outputs to obtain the predicted probability of each class.
For sample 1, the softmax is computed as:
$$\sigma(A_1) = \left[\frac{e^{2.0}}{e^{2.0} + e^{1.0} + e^{0.1}}, \frac{e^{1.0}}{e^{2.0} + e^{1.0} + e^{0.1}}, \frac{e^{0.1}}{e^{2.0} + e^{1.0} + e^{0.1}}\right]$$
For sample 2, similarly:
$$\sigma(A_2) = \left[\frac{e^{0.1}}{e^{0.1} + e^{2.0} + e^{1.9}}, \frac{e^{2.0}}{e^{0.1} + e^{2.0} + e^{1.9}}, \frac{e^{1.9}}{e^{0.1} + e^{2.0} + e^{1.9}}\right]$$
2. Compute the Cross-Entropy Loss
Next, we compute the cross-entropy loss for each sample. The loss for a single sample is given by:
$$H(A_i, Y_i) = -\sum_{c=1}^{C} Y_{ic} \log(\sigma(A_{ic}))$$
For sample 1:
$$H(A_1, Y_1) = -[0 \times \log(\sigma(A_{11})) + 1 \times \log(\sigma(A_{12})) + 0 \times \log(\sigma(A_{13}))]$$
For sample 2:
$$H(A_2, Y_2) = -[1 \times \log(\sigma(A_{21})) + 0 \times \log(\sigma(A_{22})) + 0 \times \log(\sigma(A_{23}))]$$
3. Compute the Mean Cross-Entropy Loss
Finally, we average these losses to obtain the mean cross-entropy loss for the batch:
$$\text{meancrossentropyloss} = \frac{H(A_1, Y_1) + H(A_2, Y_2)}{2}$$
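A short sketch that carries out these three steps for the batch above (same $A$ and $Y$ as in this example; variable names are illustrative):

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.1],
              [0.1, 2.0, 1.9]])
Y = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])

sigma = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)   # step 1: softmax per row
H = -(Y * np.log(sigma)).sum(axis=1)                        # step 2: per-sample cross-entropy
print(H)          # approximately [1.417, 2.620]
print(H.mean())   # approximately 2.019 (step 3: mean cross-entropy)
```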
4. Backward Pass of the Cross-Entropy Loss
For backpropagation, the gradient of the cross-entropy loss with respect to the raw model outputs (before softmax) is given by:
$$\frac{\partial \text{Loss}}{\partial A} = \frac{\sigma(A) - Y}{N}$$
For each sample in the batch, we compute:
- For sample 1: $\frac{\sigma(A_1) - Y_1}{2}$
- For sample 2: $\frac{\sigma(A_2) - Y_2}{2}$
This gives us the gradients that need to be backpropagated through the network.
Based on these calculations:
- The softmax probabilities for the two samples are approximately:
  - Sample 1: $[0.659, 0.242, 0.099]$
  - Sample 2: $[0.073, 0.487, 0.440]$
- The cross-entropy losses for the two samples are:
  - Sample 1: 1.417
  - Sample 2: 2.620
- The mean cross-entropy loss for this batch is approximately 2.019.
- The gradients of the loss with respect to the raw model outputs ($A$) are:
  - For sample 1: $[0.330, -0.379, 0.049]$
  - For sample 2: $[-0.464, 0.243, 0.220]$
These results give us the predicted probability of each class from the softmax function, the individual cross-entropy loss of each sample, the overall mean cross-entropy loss of the batch, and the gradients needed for backpropagation. Negative values in the gradient indicate the direction in which the model parameters should be adjusted to reduce the loss, while positive values suggest the opposite direction.
```python
import numpy as np

class CrossEntropyLoss:
    def softmax(self, x):
        # Improve the numerical stability of softmax by subtracting the row-wise
        # maximum from each input vector. This prevents potential overflow when
        # exponentiating large positive numbers.
        e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return e_x / e_x.sum(axis=1, keepdims=True)

    def forward(self, A, Y):
        self.A = A
        self.Y = Y
        # Store the probabilities in their own attribute so the softmax method
        # is not overwritten and forward can be called more than once.
        self.softmax_probs = self.softmax(A)
        crossentropy = -Y * np.log(self.softmax_probs)
        # Average the loss over the batch
        L = np.sum(crossentropy) / A.shape[0]
        return L

    def backward(self):
        # Gradient of the loss with respect to the logits (pre-softmax activations) A.
        # This gradient also includes the averaging over the batch.
        dLdA = (self.softmax_probs - self.Y) / self.A.shape[0]
        return dLdA
```
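For instance, running this class on the batch from the worked example above (a usage sketch; printed values are approximate) should reproduce the mean loss of about 2.019 and the gradients listed earlier:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.1],
              [0.1, 2.0, 1.9]])
Y = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])

loss_fn = CrossEntropyLoss()
print(loss_fn.forward(A, Y))   # approximately 2.019
print(loss_fn.backward())      # approximately [[ 0.330, -0.379,  0.049],
                               #                [-0.464,  0.243,  0.220]]
```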
References:
- Watch the video on YouTube
- CMU_11785_Introduction_To_Deep_Learning