A detailed derivation of equations (12.13) through (12.16) in Section 12.4.3.3 of the book *Deep Learning*.
One way to accelerate the training of neural language models is to avoid explicitly computing the gradient contribution of every word that does not appear in the next position. Each incorrect word has low probability under the model, and enumerating all of them is computationally expensive. Instead, we can sample only a subset of the words. The gradient is derived as follows:
$$
\begin{aligned}
\frac{\partial \log P(y\mid C)}{\partial \theta}
&= \frac{\partial \log \operatorname{softmax}_y(a)}{\partial \theta} \\
&= \frac{\partial}{\partial \theta} \log \frac{e^{a_y}}{\sum_i e^{a_i}} \\
&= \frac{\partial}{\partial \theta}\Bigl(a_y - \log \sum_i e^{a_i}\Bigr) \\
&= \frac{\partial a_y}{\partial \theta} - \frac{\partial}{\partial \theta}\log \sum_i e^{a_i} \\
&= \frac{\partial a_y}{\partial \theta} - \frac{1}{\sum_j e^{a_j}}\,\frac{\partial}{\partial \theta}\sum_i e^{a_i} \\
&= \frac{\partial a_y}{\partial \theta} - \frac{1}{\sum_j e^{a_j}}\sum_i \frac{\partial e^{a_i}}{\partial \theta} \\
&= \frac{\partial a_y}{\partial \theta} - \frac{1}{\sum_j e^{a_j}}\sum_i e^{a_i}\,\frac{\partial a_i}{\partial \theta} \\
&= \frac{\partial a_y}{\partial \theta} - \sum_i \frac{e^{a_i}}{\sum_j e^{a_j}}\,\frac{\partial a_i}{\partial \theta} \\
&= \frac{\partial a_y}{\partial \theta} - \sum_i P(y=i\mid C)\,\frac{\partial a_i}{\partial \theta}
\end{aligned}
$$
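The final line can be checked numerically, and the expensive sum over the vocabulary can be replaced by the sampled estimate the section motivates. A minimal NumPy sketch, assuming a toy linear score model $a = W\theta$ (the vocabulary size, dimensions, and uniform proposal $q$ are illustrative choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 3                        # toy vocabulary size and parameter dimension
W = rng.normal(size=(V, D))        # assumed linear score model: a = W @ theta
theta = rng.normal(size=D)
y = 2                              # index of the correct next word

def log_p(theta):
    a = W @ theta
    return a[y] - np.log(np.sum(np.exp(a)))

# Analytic gradient from the last line of the derivation:
# da_i/dtheta = W[i], and P(y=i|C) = softmax(a)_i
a = W @ theta
p = np.exp(a) / np.sum(np.exp(a))
grad_analytic = W[y] - p @ W

# Finite-difference check of the closed form above.
eps = 1e-6
grad_fd = np.array([
    (log_p(theta + eps * e) - log_p(theta - eps * e)) / (2 * eps)
    for e in np.eye(D)
])
print(np.allclose(grad_analytic, grad_fd, atol=1e-5))

# Self-normalized importance-sampling estimate of the second term:
# draw m words from a proposal q, weight each by e^{a_i}/q_i, and
# normalize by the sum of the weights over the sample.
q = np.full(V, 1.0 / V)            # uniform proposal (an assumption)
m = 100_000
samples = rng.choice(V, size=m, p=q)
w = np.exp(a[samples]) / q[samples]              # unnormalized weights
grad_is = W[y] - (w[:, None] * W[samples]).sum(axis=0) / w.sum()
print(np.abs(grad_is - grad_analytic).max())     # shrinks as m grows
```

The last estimator only needs the scores $a_i$ of the $m$ sampled words rather than all $V$, which is the point of the technique: for a realistic vocabulary, $m \ll V$ samples replace the full softmax normalization in the gradient.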
Content editor: Zheng Dulei
Natural Language Processing: Gradient Derivation for Importance Sampling