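While reading *Dive into Deep Learning* (《动手学深度学习》), I came across its simplified derivation of the gradients of an RNN. For the gradient with respect to the hidden state, the authors only say "expand the recurrence above" and then state the result directly. Below I fill in the intermediate steps in detail.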
$$\frac{\partial L}{\partial h_t} = W^{\top}_{hh} \cdot \frac{\partial L}{\partial h_{t+1}} + W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_t}$$
$$= W^{\top}_{hh} \cdot \left(W^{\top}_{hh} \cdot \frac{\partial L}{\partial h_{t+2}} + W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_{t+1}}\right) + W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_t}$$
$$= (W^{\top}_{hh})^2 \cdot \frac{\partial L}{\partial h_{t+2}} + W^{\top}_{hh} \cdot W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_{t+1}} + W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_t}$$
$$= W^{\top}_{hh} \cdot \left(W^{\top}_{hh} \cdot \left(W^{\top}_{hh} \cdot \frac{\partial L}{\partial h_{t+3}} + W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_{t+2}}\right) + W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_{t+1}}\right) + W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_t}$$
$$= (W^{\top}_{hh})^3 \cdot \frac{\partial L}{\partial h_{t+3}} + (W^{\top}_{hh})^2 \cdot W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_{t+2}} + W^{\top}_{hh} \cdot W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_{t+1}} + W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_t}$$
$$= \cdots\cdots$$
$$= (W^{\top}_{hh})^{T-t} \cdot \frac{\partial L}{\partial h_T} + \sum_{i=t+1}^{T} \left[(W^{\top}_{hh})^{T-i} \cdot W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_{T+t-i}}\right]$$
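The elided steps all follow one pattern, which can be written out explicitly. As a sketch (the step counter $k$ and the summation index $j$ are introduced here for illustration and do not appear in the book), after $k$ substitutions of the recurrence the expansion reads:

$$\frac{\partial L}{\partial h_t} = (W^{\top}_{hh})^{k} \cdot \frac{\partial L}{\partial h_{t+k}} + \sum_{j=0}^{k-1} (W^{\top}_{hh})^{j} \cdot W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_{t+j}}$$

Setting $k = T - t$ and re-indexing with $i = T - j$ (so that $j = T - i$ and $O_{t+j} = O_{T+t-i}$) gives the general form with $i$ running from $t+1$ to $T$.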
Moreover, at the final time step $T$ the hidden state affects the loss only through $O_T$, so

$$\frac{\partial L}{\partial h_T} = W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_T}$$
Substituting this into the unrolled expression for $\frac{\partial L}{\partial h_t}$ gives:
$$\frac{\partial L}{\partial h_t} = \sum_{i=t}^{T} \left[(W^{\top}_{hh})^{T-i} \cdot W^{\top}_{qh} \cdot \frac{\partial L}{\partial O_{T+t-i}}\right]$$
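As a sanity check on the derivation, the recurrence and the closed-form sum can be compared numerically. The following is a minimal NumPy sketch, not code from the book: the sizes `h`, `q`, `T`, the random stand-ins for $\frac{\partial L}{\partial O_t}$, and the helper names `grad_recursive` and `grad_closed_form` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
h, q, T = 4, 3, 6  # hidden size, output size, number of time steps (assumed)

W_hh = rng.normal(size=(h, h)) * 0.5  # hidden-to-hidden weights
W_qh = rng.normal(size=(q, h))        # hidden-to-output weights
# dL_dO[t] is a random stand-in for dL/dO_t, t = 1..T (index 0 is unused)
dL_dO = rng.normal(size=(T + 1, q))

def grad_recursive(t):
    """dL/dh_t from the recurrence, starting at dL/dh_T = W_qh^T dL/dO_T."""
    g = W_qh.T @ dL_dO[T]
    for s in range(T - 1, t - 1, -1):
        g = W_hh.T @ g + W_qh.T @ dL_dO[s]
    return g

def grad_closed_form(t):
    """dL/dh_t from the sum over i = t..T derived above."""
    total = np.zeros(h)
    for i in range(t, T + 1):
        power = np.linalg.matrix_power(W_hh.T, T - i)
        total += power @ W_qh.T @ dL_dO[T + t - i]
    return total

# Largest absolute difference between the two over all time steps
max_gap = max(
    np.max(np.abs(grad_recursive(t) - grad_closed_form(t)))
    for t in range(1, T + 1)
)
print(max_gap)
```

If the closed form is right, the printed gap should be at the level of floating-point rounding for every $t$ from 1 to $T$.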
Computing the gradient with respect to the hidden layer in an RNN