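Deriving the Backpropagation Formulas for RNN and LSTM
This post derives the backpropagation formulas for the RNN and the LSTM, to build a better understanding of how both networks compute.
I. RNN
1. RNN forward propagation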
$$\widetilde h_{t}=Wx_{t}+Uh_{t-1}+b$$
$$h_{t}=\tanh(\widetilde h_{t})$$
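The two forward equations can be sketched in a few lines of Python. This is a minimal illustration for the scalar case only (hidden size 1, so `W`, `U`, `b` are plain floats); the function name `rnn_forward`, the sample inputs and the weight values are made up for the example and do not come from the original post.

```python
import math

def rnn_forward(xs, W, U, b, h0=0.0):
    """Run a scalar RNN over xs; return the pre-activations and the hidden states."""
    h_tilde, hs = [], [h0]
    for x in xs:
        a = W * x + U * hs[-1] + b   # pre-activation: W x_t + U h_{t-1} + b
        h_tilde.append(a)
        hs.append(math.tanh(a))      # h_t = tanh(pre-activation)
    return h_tilde, hs

# Hypothetical inputs and weights, chosen only for illustration.
h_tilde, hs = rnn_forward([1.0, -0.5, 0.25], W=0.8, U=0.5, b=0.1)
print(hs[-1])
```

`hs[0]` holds the initial state $h_{0}$, so `hs[t]` lines up with $h_{t}$ in the equations.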
2. RNN gradient computation
Let $\tanh'(x)$ denote the derivative of $\tanh(x)$. The gradients of the trainable parameters of the RNN are then:
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}W}=\tanh'(\widetilde h_{t})\left[x_{t}x^{T}_{t}+U^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W}\right]$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}U}=\tanh'(\widetilde h_{t})\left[h_{t-1}h^{T}_{t-1}+U^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U}\right]$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}b}=\tanh'(\widetilde h_{t})\left[1+U^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b}\right]$$
where the initial values are:
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}W}=\tanh'(\widetilde h_{1})\left[x_{1}x^{T}_{1}\right]$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}U}=\tanh'(\widetilde h_{1})\left[h_{0}h^{T}_{0}\right]$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}b}=\tanh'(\widetilde h_{1})$$
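The recursions above can be checked numerically. The sketch below is restricted to the scalar case (hidden size 1) and uses $\tanh'(x)=1-\tanh^{2}(x)$. With scalar weights the direct dependence of $\widetilde h_{t}$ on $W$ is $x_{t}$ and on $U$ is $h_{t-1}$, so the sketch uses those in place of the bracketed products $x_{t}x^{T}_{t}$ and $h_{t-1}h^{T}_{t-1}$; that reading is an assumption of this example. The function names and all numeric values are hypothetical.

```python
import math

def rnn_last_h(xs, W, U, b, h0=0.0):
    """Final hidden state of a scalar RNN."""
    h = h0
    for x in xs:
        h = math.tanh(W * x + U * h + b)
    return h

def rnn_grads(xs, W, U, b, h0=0.0):
    """Forward recursion for dh_T/dW, dh_T/dU, dh_T/db in the scalar case."""
    h, dW, dU, db = h0, 0.0, 0.0, 0.0   # h_0 is a constant, so its derivatives are 0
    for x in xs:
        a = W * x + U * h + b
        d = 1.0 - math.tanh(a) ** 2     # tanh'(pre-activation)
        # The right-hand sides still see h_{t-1} and the previous-step derivatives.
        dW, dU, db = d * (x + U * dW), d * (h + U * dU), d * (1.0 + U * db)
        h = math.tanh(a)
    return dW, dU, db

def finite_diff(f, v, eps=1e-6):
    """Central finite difference of f at v."""
    return (f(v + eps) - f(v - eps)) / (2 * eps)

# Hypothetical inputs and weights, chosen only for illustration.
xs, W, U, b = [1.0, -0.5, 0.25], 0.8, 0.5, 0.1
dW, dU, db = rnn_grads(xs, W, U, b)
num_dW = finite_diff(lambda w: rnn_last_h(xs, w, U, b), W)
num_dU = finite_diff(lambda u: rnn_last_h(xs, W, u, b), U)
num_db = finite_diff(lambda c: rnn_last_h(xs, W, U, c), b)
print(abs(dW - num_dW), abs(dU - num_dU), abs(db - num_db))
```

At the first step the previous derivatives are zero, so the loop reproduces the scalar form of the initial values without a special case.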
II. LSTM
1. LSTM forward propagation
To keep the derivatives easy to write, the LSTM is expressed in the following form:
$$\widetilde i_{t}=W^{i}x_{t}+U^{i}h_{t-1}+b^{i}$$
$$\widetilde f_{t}=W^{f}x_{t}+U^{f}h_{t-1}+b^{f}$$
$$\widetilde o_{t}=W^{o}x_{t}+U^{o}h_{t-1}+b^{o}$$
$$\widetilde g_{t}=W^{c}x_{t}+U^{c}h_{t-1}+b^{c}$$
$$i_{t}=\sigma(\widetilde i_{t})$$
$$f_{t}=\sigma(\widetilde f_{t})$$
$$o_{t}=\sigma(\widetilde o_{t})$$
$$g_{t}=\tanh(\widetilde g_{t})$$
$$c_{t}=f_{t}\otimes c_{t-1}+i_{t}\otimes g_{t}$$
$$h_{t}=o_{t}\otimes \tanh(c_{t})$$
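The forward pass above can be written as a single step function. The sketch below covers the scalar case only (hidden size 1, so $\otimes$ becomes ordinary multiplication). The candidate $g_{t}$ is computed with $\tanh$, matching the $\tanh'(\widetilde g_{t})$ factor that appears in the gradient formulas; that choice, the dictionary layout `params` and every numeric value are assumptions made for this example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps 'Wi', 'Ui', 'bi', ... to floats."""
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["Wc"] * x + p["Uc"] * h_prev + p["bc"])  # candidate memory cell
    c = f * c_prev + i * g                                   # new cell state
    h = o * math.tanh(c)                                     # new hidden state
    return h, c

# Hypothetical parameters and inputs, chosen only for illustration.
params = {"Wi": 0.5, "Ui": -0.3, "bi": 0.1, "Wf": 0.4, "Uf": 0.2, "bf": 0.0,
          "Wo": -0.6, "Uo": 0.7, "bo": 0.2, "Wc": 0.9, "Uc": -0.4, "bc": -0.1}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```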
2. LSTM backpropagation
Let $\tanh'(x)$ denote the derivative of $\tanh(x)$ and $\mathrm{sigm}'(x)$ the derivative of $\sigma(x)$. The gradients of the trainable parameters of the LSTM are then:
Input gate:
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}W^{i}}=\left[\mathrm{sigm}'(\widetilde o_{t}){U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{i}}\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}W^{i}}$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}U^{i}}=\left[\mathrm{sigm}'(\widetilde o_{t}){U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{i}}\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}U^{i}}$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}b^{i}}=\left[\mathrm{sigm}'(\widetilde o_{t}){U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{i}}\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}b^{i}}$$
Forget gate:
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}W^{f}}=\left[\mathrm{sigm}'(\widetilde o_{t}){U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{f}}\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}W^{f}}$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}U^{f}}=\left[\mathrm{sigm}'(\widetilde o_{t}){U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{f}}\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}U^{f}}$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}b^{f}}=\left[\mathrm{sigm}'(\widetilde o_{t}){U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{f}}\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}b^{f}}$$
Candidate memory cell:
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}W^{c}}=\left[\mathrm{sigm}'(\widetilde o_{t}){U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{c}}\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}W^{c}}$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}U^{c}}=\left[\mathrm{sigm}'(\widetilde o_{t}){U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{c}}\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}U^{c}}$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}b^{c}}=\left[\mathrm{sigm}'(\widetilde o_{t}){U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{c}}\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}b^{c}}$$
Output gate:
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}W^{o}}=\left[\mathrm{sigm}'(\widetilde o_{t})\left(x_{t}x^{T}_{t}+{U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{o}}\right)\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}W^{o}}$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}U^{o}}=\left[\mathrm{sigm}'(\widetilde o_{t})\left(h_{t-1}h^{T}_{t-1}+{U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{o}}\right)\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}U^{o}}$$
$$\frac{\mathrm{d}h_{t}}{\mathrm{d}b^{o}}=\left[\mathrm{sigm}'(\widetilde o_{t})\left(1+{U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{o}}\right)\right]\tanh(c_{t})+o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}b^{o}}$$
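The twelve formulas above share one pattern, which may be easier to see in compact form. The restatement below applies the product rule to $h_{t}=o_{t}\otimes\tanh(c_{t})$ for a generic parameter; the symbol $\theta$ and the partial-derivative notation for the direct term are introduced here for illustration and are not part of the original derivation.

```latex
\frac{\mathrm{d}h_{t}}{\mathrm{d}\theta}
  = \frac{\mathrm{d}o_{t}}{\mathrm{d}\theta}\tanh(c_{t})
  + o_{t}\tanh'(c_{t})\frac{\mathrm{d}c_{t}}{\mathrm{d}\theta},
\qquad
\frac{\mathrm{d}o_{t}}{\mathrm{d}\theta}
  = \mathrm{sigm}'(\widetilde o_{t})
    \left[\frac{\partial \widetilde o_{t}}{\partial \theta}
    + {U^{o}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}\theta}\right]
```

The direct term $\partial\widetilde o_{t}/\partial\theta$ is nonzero only when $\theta$ is one of $W^{o}$, $U^{o}$, $b^{o}$, which is why only the output-gate formulas carry the extra $x_{t}x^{T}_{t}$, $h_{t-1}h^{T}_{t-1}$ and $1$ terms.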
All of the formulas above contain $\mathrm{d}c_{t}$, so the derivatives of $c_{t}$ must be computed as well:
Input gate:
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}W^{i}}=\mathrm{sigm}'(\widetilde i_{t})\left[x_{t}x^{T}_{t}+{U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{i}}\right]g_{t}+i_{t}\tanh'(\widetilde g_{t}){U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{i}}+\mathrm{sigm}'(\widetilde f_{t})\left[{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{i}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}W^{i}}$$
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}U^{i}}=\mathrm{sigm}'(\widetilde i_{t})\left[h_{t-1}h^{T}_{t-1}+{U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{i}}\right]g_{t}+i_{t}\tanh'(\widetilde g_{t}){U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{i}}+\mathrm{sigm}'(\widetilde f_{t})\left[{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{i}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}U^{i}}$$
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}b^{i}}=\mathrm{sigm}'(\widetilde i_{t})\left[1+{U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{i}}\right]g_{t}+i_{t}\tanh'(\widetilde g_{t}){U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{i}}+\mathrm{sigm}'(\widetilde f_{t})\left[{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{i}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}b^{i}}$$
Forget gate:
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}W^{f}}=\mathrm{sigm}'(\widetilde i_{t}){U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{f}}g_{t}+i_{t}\tanh'(\widetilde g_{t}){U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{f}}+\mathrm{sigm}'(\widetilde f_{t})\left[x_{t}x^{T}_{t}+{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{f}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}W^{f}}$$
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}U^{f}}=\mathrm{sigm}'(\widetilde i_{t}){U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{f}}g_{t}+i_{t}\tanh'(\widetilde g_{t}){U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{f}}+\mathrm{sigm}'(\widetilde f_{t})\left[h_{t-1}h^{T}_{t-1}+{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{f}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}U^{f}}$$
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}b^{f}}=\mathrm{sigm}'(\widetilde i_{t}){U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{f}}g_{t}+i_{t}\tanh'(\widetilde g_{t}){U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{f}}+\mathrm{sigm}'(\widetilde f_{t})\left[1+{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{f}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}b^{f}}$$
Output gate:
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}W^{o}}=\mathrm{sigm}'(\widetilde i_{t}){U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{o}}g_{t}+i_{t}\tanh'(\widetilde g_{t}){U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{o}}+\mathrm{sigm}'(\widetilde f_{t})\left[{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{o}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}W^{o}}$$
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}U^{o}}=\mathrm{sigm}'(\widetilde i_{t}){U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{o}}g_{t}+i_{t}\tanh'(\widetilde g_{t}){U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{o}}+\mathrm{sigm}'(\widetilde f_{t})\left[{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{o}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}U^{o}}$$
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}b^{o}}=\mathrm{sigm}'(\widetilde i_{t}){U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{o}}g_{t}+i_{t}\tanh'(\widetilde g_{t}){U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{o}}+\mathrm{sigm}'(\widetilde f_{t})\left[{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{o}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}b^{o}}$$
Candidate memory cell:
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}W^{c}}=\mathrm{sigm}'(\widetilde i_{t}){U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{c}}g_{t}+i_{t}\tanh'(\widetilde g_{t})\left[x_{t}x^{T}_{t}+{U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{c}}\right]+\mathrm{sigm}'(\widetilde f_{t})\left[{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}W^{c}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}W^{c}}$$
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}U^{c}}=\mathrm{sigm}'(\widetilde i_{t}){U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{c}}g_{t}+i_{t}\tanh'(\widetilde g_{t})\left[h_{t-1}h^{T}_{t-1}+{U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{c}}\right]+\mathrm{sigm}'(\widetilde f_{t})\left[{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}U^{c}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}U^{c}}$$
$$\frac{\mathrm{d}c_{t}}{\mathrm{d}b^{c}}=\mathrm{sigm}'(\widetilde i_{t}){U^{i}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{c}}g_{t}+i_{t}\tanh'(\widetilde g_{t})\left[1+{U^{c}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{c}}\right]+\mathrm{sigm}'(\widetilde f_{t})\left[{U^{f}}^{T}\frac{\mathrm{d}h_{t-1}}{\mathrm{d}b^{c}}\right]c_{t-1}+f_{t}\frac{\mathrm{d}c_{t-1}}{\mathrm{d}b^{c}}$$
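The coupled recursions for $\mathrm{d}h_{t}$ and $\mathrm{d}c_{t}$ can be run forward in time alongside the forward pass and compared against finite differences. The sketch below does this for the scalar case only (hidden size 1), where the direct terms $x_{t}x^{T}_{t}$ and $h_{t-1}h^{T}_{t-1}$ are taken to be $x_{t}$ and $h_{t-1}$ and the candidate uses $\tanh$; both are assumptions of this example. The parameter names (`"Wi"`, `"Uc"`, ...), the function name and all numeric values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

GATES = ("i", "f", "o", "c")

def lstm_forward_with_grads(xs, p):
    """Scalar LSTM; returns h_T and {name: dh_T/d(name)} via the forward recursions."""
    h, c = 0.0, 0.0
    dh = {k: 0.0 for k in p}   # derivatives of h_0 are zero
    dc = {k: 0.0 for k in p}   # derivatives of c_0 are zero
    for x in xs:
        pre = {gt: p["W" + gt] * x + p["U" + gt] * h + p["b" + gt] for gt in GATES}
        i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        g = math.tanh(pre["c"])
        # Derivative of each activation with respect to its own pre-activation.
        act_prime = {"i": i * (1 - i), "f": f * (1 - f), "o": o * (1 - o), "c": 1 - g * g}
        c_new = f * c + i * g
        tanh_c = math.tanh(c_new)
        new_dh, new_dc = {}, {}
        for name in p:
            # Direct term: x_t for W, h_{t-1} for U, 1 for b (only for the owning gate).
            direct = {"W": x, "U": h, "b": 1.0}[name[0]]
            d = {gt: act_prime[gt] * ((direct if name[1] == gt else 0.0)
                                      + p["U" + gt] * dh[name]) for gt in GATES}
            new_dc[name] = d["i"] * g + i * d["c"] + d["f"] * c + f * dc[name]
            new_dh[name] = d["o"] * tanh_c + o * (1 - tanh_c ** 2) * new_dc[name]
        h, c, dh, dc = o * tanh_c, c_new, new_dh, new_dc
    return h, dh

# Hypothetical inputs and parameters, chosen only for illustration.
xs = [1.0, -0.5, 0.25]
params = {"Wi": 0.5, "Ui": -0.3, "bi": 0.1, "Wf": 0.4, "Uf": 0.2, "bf": 0.0,
          "Wo": -0.6, "Uo": 0.7, "bo": 0.2, "Wc": 0.9, "Uc": -0.4, "bc": -0.1}
h_T, analytic = lstm_forward_with_grads(xs, params)

eps, max_err = 1e-6, 0.0
for name in params:
    plus = dict(params, **{name: params[name] + eps})
    minus = dict(params, **{name: params[name] - eps})
    numeric = (lstm_forward_with_grads(xs, plus)[0]
               - lstm_forward_with_grads(xs, minus)[0]) / (2 * eps)
    max_err = max(max_err, abs(numeric - analytic[name]))
print(max_err)
```

Carrying one pair of derivatives per parameter forward in time mirrors the way the formulas are written. Practical training code usually accumulates gradients backward from a loss instead, which is a different organisation of the same chain rule.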
Finally, the initial values, obtained by setting $t=1$ in the recursions above with all derivatives of $h_{0}$ and $c_{0}$ equal to zero:
Input gate:
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}W^{i}}=\mathrm{sigm}'(\widetilde i_{1})\left[x_{1}x^{T}_{1}\right]g_{1}$$
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}U^{i}}=\mathrm{sigm}'(\widetilde i_{1})\left[h_{0}h^{T}_{0}\right]g_{1}$$
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}b^{i}}=\mathrm{sigm}'(\widetilde i_{1})\,g_{1}$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}W^{i}}=o_{1}\tanh'(c_{1})\frac{\mathrm{d}c_{1}}{\mathrm{d}W^{i}}$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}U^{i}}=o_{1}\tanh'(c_{1})\frac{\mathrm{d}c_{1}}{\mathrm{d}U^{i}}$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}b^{i}}=o_{1}\tanh'(c_{1})\frac{\mathrm{d}c_{1}}{\mathrm{d}b^{i}}$$
Forget gate:
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}W^{f}}=\mathrm{sigm}'(\widetilde f_{1})\left[x_{1}x^{T}_{1}\right]c_{0}$$
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}U^{f}}=\mathrm{sigm}'(\widetilde f_{1})\left[h_{0}h^{T}_{0}\right]c_{0}$$
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}b^{f}}=\mathrm{sigm}'(\widetilde f_{1})\,c_{0}$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}W^{f}}=o_{1}\tanh'(c_{1})\frac{\mathrm{d}c_{1}}{\mathrm{d}W^{f}}$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}U^{f}}=o_{1}\tanh'(c_{1})\frac{\mathrm{d}c_{1}}{\mathrm{d}U^{f}}$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}b^{f}}=o_{1}\tanh'(c_{1})\frac{\mathrm{d}c_{1}}{\mathrm{d}b^{f}}$$
Output gate:
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}W^{o}}=0$$
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}U^{o}}=0$$
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}b^{o}}=0$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}W^{o}}=\mathrm{sigm}'(\widetilde o_{1})\left[x_{1}x^{T}_{1}\right]\tanh(c_{1})$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}U^{o}}=\mathrm{sigm}'(\widetilde o_{1})\left[h_{0}h^{T}_{0}\right]\tanh(c_{1})$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}b^{o}}=\mathrm{sigm}'(\widetilde o_{1})\tanh(c_{1})$$
Candidate memory cell:
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}W^{c}}=i_{1}\tanh'(\widetilde g_{1})\left[x_{1}x^{T}_{1}\right]$$
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}U^{c}}=i_{1}\tanh'(\widetilde g_{1})\left[h_{0}h^{T}_{0}\right]$$
$$\frac{\mathrm{d}c_{1}}{\mathrm{d}b^{c}}=i_{1}\tanh'(\widetilde g_{1})$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}W^{c}}=o_{1}\tanh'(c_{1})\frac{\mathrm{d}c_{1}}{\mathrm{d}W^{c}}$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}U^{c}}=o_{1}\tanh'(c_{1})\frac{\mathrm{d}c_{1}}{\mathrm{d}U^{c}}$$
$$\frac{\mathrm{d}h_{1}}{\mathrm{d}b^{c}}=o_{1}\tanh'(c_{1})\frac{\mathrm{d}c_{1}}{\mathrm{d}b^{c}}$$
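As a sanity check on the first-step values of $\mathrm{d}c_{1}$, the sketch below evaluates the general recursions at $t=1$ for a scalar LSTM and compares the resulting closed forms with finite differences. It assumes $h_{0}=0$, zero biases, a nonzero $c_{0}$ and a $\tanh$ candidate; the constants `X1`, `C0` and the weights are made-up values. In this scalar setting each gate's closed form includes the derivative of its own sigmoid, $\mathrm{sigm}'(\widetilde i_{1})$ or $\mathrm{sigm}'(\widetilde f_{1})$, and the direct term is $x_{1}$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

X1, C0 = 0.7, 0.3   # hypothetical first input and initial cell state; h_0 = 0, biases = 0

def c1(Wi, Wf, Wc):
    """c_1 = f_1 * c_0 + i_1 * g_1 for a scalar LSTM with h_0 = 0."""
    return sigmoid(Wf * X1) * C0 + sigmoid(Wi * X1) * math.tanh(Wc * X1)

Wi, Wf, Wc, eps = 0.5, 0.4, 0.9, 1e-6
i1, f1, g1 = sigmoid(Wi * X1), sigmoid(Wf * X1), math.tanh(Wc * X1)

# Closed forms from the recursions at t = 1 (scalar case).
dWi = i1 * (1 - i1) * X1 * g1   # sigm'(i~_1) * x_1 * g_1
dWf = f1 * (1 - f1) * X1 * C0   # sigm'(f~_1) * x_1 * c_0
dWc = i1 * (1 - g1 ** 2) * X1   # i_1 * tanh'(g~_1) * x_1

# Central finite differences of c_1 with respect to each weight.
num_dWi = (c1(Wi + eps, Wf, Wc) - c1(Wi - eps, Wf, Wc)) / (2 * eps)
num_dWf = (c1(Wi, Wf + eps, Wc) - c1(Wi, Wf - eps, Wc)) / (2 * eps)
num_dWc = (c1(Wi, Wf, Wc + eps) - c1(Wi, Wf, Wc - eps)) / (2 * eps)
print(abs(dWi - num_dWi), abs(dWf - num_dWf), abs(dWc - num_dWc))
```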
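Summary
That completes the derivation of the backpropagation derivatives; every formula comes from the chain rule. If you spot a mistake, corrections are welcome.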