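I recently worked through Tianqi Chen's Deep Learning Systems course, and part of HW2 asks for the derivative of the LogSumExp (LSE) operator.
LSE is used very widely (for example, Softmax in multi-class classification can be computed through LSE to avoid overflow).
So this post works through the derivative of LSE (the write-up is a bit long-winded),
and doubles as some LaTeX practice 😄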
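To make the overflow point concrete, here is a minimal NumPy sketch. The function name `softmax_via_lse` and the sample logits are made up for illustration and are not taken from the course code. It computes Softmax as `exp(z - LSE(z))`, with LSE evaluated after subtracting the maximum, and compares it with the naive formula on large inputs.

```python
import numpy as np

def softmax_via_lse(z):
    """Softmax computed as exp(z - LSE(z)); LSE uses the max-shift trick."""
    z = np.asarray(z, dtype=np.float64)
    z_max = np.max(z)
    # LSE(z) = max(z) + log(sum(exp(z - max(z)))); every exponent is <= 0 here
    lse = z_max + np.log(np.sum(np.exp(z - z_max)))
    return np.exp(z - lse)

# Hypothetical logits, chosen large enough that exp() overflows in float64
logits = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.sum(np.exp(logits))

stable = softmax_via_lse(logits)

print("naive :", naive)
print("stable:", stable)
```

The stable version only ever exponentiates non-positive numbers, so it stays finite where the naive one breaks down.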
Notation:

$$
\begin{aligned}
&\text{input: } z \in \mathbb{R}^n \\
&\operatorname{argmax}(z) = j, \quad \max z = z_j \\
&\hat{z}_i = z_i - \max z = z_i - z_j \\
&\operatorname{LogSumExp}(z) = \log\left(\sum_{k=1}^{n}\exp(z_k - \max z)\right) + \max z = \log\left(\sum_{k=1}^{n}\exp(\hat{z}_k)\right) + z_j \\
&LSE = \operatorname{LogSumExp}
\end{aligned}
$$
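As a quick sanity check (not part of the original derivation), the shifted form above is the same function as the plain $\log\sum_k \exp(z_k)$; factoring $\exp(z_j)$ out of the sum gives:

$$
\begin{aligned}
\log\sum_{k=1}^{n}\exp(z_k)
&= \log\left(\exp(z_j)\sum_{k=1}^{n}\exp(z_k - z_j)\right) \\
&= z_j + \log\sum_{k=1}^{n}\exp(\hat{z}_k)
\end{aligned}
$$

Since every $\hat{z}_k \le 0$, each $\exp(\hat{z}_k)$ lies in $(0, 1]$ and the $k = j$ term equals $1$, so the sum lies in $[1, n]$ and cannot overflow.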
- When $i \neq j$:
$$
\begin{aligned}
\frac{\partial LSE}{\partial z_i}
&= \frac{\partial LSE}{\partial \log\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \frac{\partial \log\sum_{k=1}^{n}\exp(\hat{z}_k)}{\partial z_i} + \frac{\partial LSE}{\partial \max z} \cdot \frac{\partial \max z}{\partial z_i} \\
&= 1 \cdot \frac{\partial \log\sum_{k=1}^{n}\exp(\hat{z}_k)}{\partial \sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \frac{\partial \sum_{k=1}^{n}\exp(\hat{z}_k)}{\partial z_i} + 1 \cdot 0 \\
&= \frac{\partial \log\sum_{k=1}^{n}\exp(\hat{z}_k)}{\partial \sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \sum_{k=1}^{n}\left(\frac{\partial \exp(\hat{z}_k)}{\partial z_i}\right) \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \sum_{k=1}^{n}\left(\frac{\partial \exp(\hat{z}_k)}{\partial \hat{z}_k} \cdot \frac{\partial \hat{z}_k}{\partial z_i}\right) \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \sum_{k=1}^{n}\left(\exp(\hat{z}_k) \cdot \frac{\partial (z_k - \max z)}{\partial z_i}\right) \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \sum_{k=1}^{n}\left(\exp(\hat{z}_k) \cdot \mathbb{I}(k = i)\right) \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \exp(\hat{z}_i) \\
&= \frac{\exp(\hat{z}_i)}{\sum_{k=1}^{n}\exp(\hat{z}_k)}
\end{aligned}
$$

- When $i = j$, i.e. $z_i = z_j = \max z$:
$$
\begin{aligned}
\frac{\partial LSE}{\partial z_i}
&= \frac{\partial LSE}{\partial \log\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \frac{\partial \log\sum_{k=1}^{n}\exp(\hat{z}_k)}{\partial z_i} + \frac{\partial LSE}{\partial \max z} \cdot \frac{\partial \max z}{\partial z_i} \\
&= 1 \cdot \frac{\partial \log\sum_{k=1}^{n}\exp(\hat{z}_k)}{\partial \sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \frac{\partial \sum_{k=1}^{n}\exp(\hat{z}_k)}{\partial z_i} + 1 \cdot 1 \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \sum_{k=1}^{n}\left(\frac{\partial \exp(\hat{z}_k)}{\partial z_i}\right) + 1 \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \sum_{k=1}^{n}\left(\exp(\hat{z}_k) \cdot \frac{\partial (z_k - \max z)}{\partial z_i}\right) + 1 \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \left(\exp(\hat{z}_i) \cdot \frac{\partial (z_i - \max z)}{\partial z_i} + \sum_{k=1, k \neq i}^{n}\left(\exp(\hat{z}_k) \cdot \frac{\partial (z_k - \max z)}{\partial z_i}\right)\right) + 1 \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \left(\exp(\hat{z}_i) \cdot \frac{\partial (z_i - z_i)}{\partial z_i} + \sum_{k=1, k \neq i}^{n}\left(\exp(\hat{z}_k) \cdot \frac{\partial (z_k - z_i)}{\partial z_i}\right)\right) + 1 \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \left(\exp(\hat{z}_i) \cdot 0 + \sum_{k=1, k \neq i}^{n}\left(\exp(\hat{z}_k) \cdot (-1)\right)\right) + 1 \\
&= \frac{1}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \cdot \left(-\sum_{k=1, k \neq i}^{n}\exp(\hat{z}_k)\right) + 1 \\
&= \frac{-\sum_{k=1, k \neq i}^{n}\exp(\hat{z}_k)}{\sum_{k=1}^{n}\exp(\hat{z}_k)} + \frac{\sum_{k=1}^{n}\exp(\hat{z}_k)}{\sum_{k=1}^{n}\exp(\hat{z}_k)} \\
&= \frac{\exp(\hat{z}_i)}{\sum_{k=1}^{n}\exp(\hat{z}_k)}
\end{aligned}
$$
So in both cases the derivative of LSE comes out to $\frac{\exp(\hat{z}_i)}{\sum_{k=1}^{n}\exp(\hat{z}_k)}$, which is the $i$-th component of $\mathrm{Softmax}(z)$.
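The result can be checked numerically. The sketch below is an illustration in plain NumPy, not the HW2 solution: the helper names `logsumexp`, `lse_grad`, and `numerical_grad` and the test vector are assumptions made up for this example. It compares the closed-form gradient with central finite differences on a vector whose entries cover both the $i \neq j$ and $i = j$ cases.

```python
import numpy as np

def logsumexp(z):
    """LSE(z) = log(sum(exp(z - max z))) + max z for a 1-D array."""
    z_max = np.max(z)
    return np.log(np.sum(np.exp(z - z_max))) + z_max

def lse_grad(z):
    """Closed-form gradient derived above: exp(z_hat_i) / sum_k exp(z_hat_k)."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def numerical_grad(f, z, eps=1e-6):
    """Central finite-difference estimate of the gradient of scalar f at z."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad

# Hypothetical test point; index 2 holds the maximum (the i = j case)
z = np.array([0.5, -1.2, 3.0, 2.1])

analytic = lse_grad(z)
numeric = numerical_grad(logsumexp, z)
max_abs_err = np.max(np.abs(analytic - numeric))

print("analytic:", analytic)
print("numeric :", numeric)
print("max abs error:", max_abs_err)
```

The two gradients should agree up to finite-difference error, including at the argmax entry, and the analytic gradient sums to 1 as expected of a Softmax output.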
P.S. Cover image source: What Is a Gradient in Machine Learning?
References: