Lecture 11: Review GRU & LSTM
The original video also touches on some other MT topics, which I omit here.
GRU
idea: Perhaps we could use shortcut connections to keep the model from suffering vanishing gradients -> adaptive shortcut connections ($u_t$).
$$
\begin{aligned}
f(h_{t-1}, x_t) &= u_t \odot \hat h_t + (1 - u_t) \odot h_{t-1} \\
\hat h_t &= \tanh(W[x_t] + U h_{t-1} + b) \\
u_t &= \sigma(W_u[x_t] + U_u h_{t-1} + b_u)
\end{aligned}
$$
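As a concrete illustration, here is a minimal NumPy sketch of one step of this update-gate cell. It is not from the lecture: the dimensions, the random initialization, and the `step` helper are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x_t, h_prev, W, U, b, W_u, U_u, b_u):
    """One recurrent step: gate between the candidate and the old state."""
    h_hat = np.tanh(W @ x_t + U @ h_prev + b)      # candidate state
    u_t = sigmoid(W_u @ x_t + U_u @ h_prev + b_u)  # update gate in (0, 1)
    return u_t * h_hat + (1.0 - u_t) * h_prev      # adaptive shortcut

# Illustrative sizes: 4-dim input, 3-dim hidden state.
d_x, d_h = 4, 3
rng = np.random.default_rng(0)
params = [rng.normal(size=s) * 0.1 for s in
          [(d_h, d_x), (d_h, d_h), (d_h,), (d_h, d_x), (d_h, d_h), (d_h,)]]
h = step(rng.normal(size=d_x), np.zeros(d_h), *params)
```

When $u_t$ is near 0 the cell simply copies $h_{t-1}$ forward, so gradients can flow through that path without shrinking.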
idea: Prune unnecessary connections adaptively with a reset gate ($r_t$).
$$
\begin{aligned}
\hat h_t &= \tanh(W[x_t] + U(r_t \odot h_{t-1}) + b) \\
r_t &= \sigma(W_r[x_t] + U_r h_{t-1} + b_r) \\
u_t &= \sigma(W_u[x_t] + U_u h_{t-1} + b_u)
\end{aligned}
$$
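These three equations together form the full GRU, which standard libraries already implement. As a hedged usage sketch (the sizes here are illustrative, and PyTorch's parameterization differs in minor details such as where the reset gate multiplies the bias), `nn.GRUCell` runs one such step:

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=4, hidden_size=3)  # update + reset gated cell
x = torch.randn(2, 4)   # a batch of 2 input vectors
h = torch.zeros(2, 3)   # initial hidden state
h = cell(x, h)          # one recurrent step, shape (2, 3)
```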
Some tricks for training RNNs
- Use LSTM or GRU cells
- Initialize recurrent matrices to be orthogonal
- Initialize other matrices with a sensible (small) scale
- Initialize the forget gate bias to 1: default to remembering
- Use adaptive optimizers: Adam, Adadelta
- Clip the gradient norm
- Apply dropout vertically (between stacked layers, not along the time dimension)
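A hedged PyTorch sketch applying several of these tricks to an LSTM; the layer sizes, learning rate, and clipping threshold are illustrative assumptions, and the forget-gate slice relies on PyTorch's `(i, f, g, o)` gate ordering in the bias vectors.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2,
               dropout=0.3)  # dropout applied vertically, between layers

for name, p in lstm.named_parameters():
    if "weight_hh" in name:        # recurrent matrices: orthogonal init
        nn.init.orthogonal_(p)
    elif "weight_ih" in name:      # other matrices: a sensible scale
        nn.init.xavier_uniform_(p)
    elif "bias" in name:           # bias layout is [b_i | b_f | b_g | b_o]
        nn.init.zeros_(p)
        hidden = lstm.hidden_size
        p.data[hidden:2 * hidden].fill_(1.0)  # forget gate bias = 1

optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-3)

# Inside the training loop, clip the gradient norm before stepping:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=5.0)
#   optimizer.step()
```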
Ensembling: train several models independently and average their predictions.
MT evaluation
- Manual evaluation
- Testing in an application that uses MT as one sub-component
- Automatic metrics
  - WER (Word Error Rate)
  - BLEU (Bilingual Evaluation Understudy): modified n-gram precision against reference translations, with a brevity penalty for short outputs
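For concreteness, here is a pure-Python sketch of both automatic metrics. It is a simplification of the real definitions (single reference, unsmoothed BLEU), and the function names are mine.

```python
import math
from collections import Counter

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def bleu(ref, hyp, max_n=4):
    """Simplified sentence BLEU: geometric mean of modified n-gram
    precisions times a brevity penalty (single reference, no smoothing)."""
    r, h = ref.split(), hyp.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_ngrams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
        if overlap == 0:
            return 0.0
        log_prec += math.log(overlap / sum(h_ngrams.values())) / max_n
    bp = min(1.0, math.exp(1 - len(r) / max(len(h), 1)))  # brevity penalty
    return bp * math.exp(log_prec)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```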