This section covers several classic RNN variants: the GRU, the LSTM, deep recurrent neural networks, and bidirectional recurrent neural networks. In Andrew Ng's deep learning course, the GRU is described as a simplified alternative to the LSTM: the GRU cell is simpler, which makes it easier to build large networks, while the LSTM cell is more expressive. A deep RNN stacks multiple recurrent layers at each time step, giving it a stronger ability to extract features. A bidirectional RNN is especially effective in tasks such as machine translation, where determining the meaning of a word requires both the preceding and the following words. It is also worth noting that the gating mechanisms in these networks provide a form of long-term memory and help mitigate the vanishing gradient problem.
GRU
$$
\begin{aligned}
R_t &= \sigma\left(X_t W_{xr} + H_{t-1} W_{hr} + b_r\right) \\
Z_t &= \sigma\left(X_t W_{xz} + H_{t-1} W_{hz} + b_z\right) \\
\widetilde{H}_t &= \tanh\left(X_t W_{xh} + \left(R_t \odot H_{t-1}\right) W_{hh} + b_h\right) \\
H_t &= Z_t \odot H_{t-1} + \left(1 - Z_t\right) \odot \widetilde{H}_t
\end{aligned}
$$
The GRU's gates are what give it a form of long-term memory, and the key is the sigmoid activation, which outputs a value in [0, 1] for each element. In the reset gate R_t, values close to 0 mean the previous hidden state is mostly forgotten when computing the candidate state; in the update gate Z_t, values close to 1 mean the previous hidden state H_{t-1} is carried over largely unchanged.
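As a tiny numerical illustration (not from the original text), the final equation H_t = Z_t ⊙ H_{t-1} + (1 − Z_t) ⊙ H̃_t blends the old state and the candidate elementwise, and the sigmoid keeps each gate value inside [0, 1]:

```python
import torch

# Made-up states for illustration: the old state is all ones, the candidate all zeros.
H_prev = torch.tensor([1.0, 1.0, 1.0])
H_cand = torch.tensor([0.0, 0.0, 0.0])
Z = torch.sigmoid(torch.tensor([-5.0, 0.0, 5.0]))  # ≈ [0.007, 0.500, 0.993]
H = Z * H_prev + (1 - Z) * H_cand
print(H)  # ≈ tensor([0.0067, 0.5000, 0.9933]); each entry keeps a fraction Z of the old state
```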
```python
def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        Z = torch.sigmoid(torch.matmul(X, W_xz) + torch.matmul(H, W_hz) + b_z)  # update gate
        R = torch.sigmoid(torch.matmul(X, W_xr) + torch.matmul(H, W_hr) + b_r)  # reset gate
        H_tilda = torch.tanh(torch.matmul(X, W_xh) + R * torch.matmul(H, W_hh) + b_h)  # candidate state
        H = Z * H + (1 - Z) * H_tilda    # blend old state and candidate
        Y = torch.matmul(H, W_hq) + b_q  # output layer
        outputs.append(Y)
    return outputs, (H,)
```
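For readers who want to sanity-check the function above outside the full training loop, here is a minimal sketch. The dimensions, the `_three` helper, and the random initialisation are assumptions made only for this example; they are not part of the original notebook.

```python
import torch

vocab_size, num_hiddens, batch_size, num_steps = 1027, 256, 2, 5  # assumed sizes

def _three():
    # one (input-to-hidden, hidden-to-hidden, bias) triple per gate
    return (torch.randn(vocab_size, num_hiddens) * 0.01,
            torch.randn(num_hiddens, num_hiddens) * 0.01,
            torch.zeros(num_hiddens))

W_xz, W_hz, b_z = _three()  # update gate
W_xr, W_hr, b_r = _three()  # reset gate
W_xh, W_hh, b_h = _three()  # candidate hidden state
W_hq = torch.randn(num_hiddens, vocab_size) * 0.01  # output layer
b_q = torch.zeros(vocab_size)
params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]

# one-hot inputs: a list of num_steps tensors of shape (batch_size, vocab_size)
inputs = [torch.nn.functional.one_hot(
              torch.randint(0, vocab_size, (batch_size,)), vocab_size).float()
          for _ in range(num_steps)]
state = (torch.zeros(batch_size, num_hiddens),)

outputs, (H,) = gru(inputs, state, params)
print(len(outputs), outputs[0].shape, H.shape)  # 5 torch.Size([2, 1027]) torch.Size([2, 256])
```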
LSTM
$$
\begin{aligned}
I_t &= \sigma\left(X_t W_{xi} + H_{t-1} W_{hi} + b_i\right) \\
F_t &= \sigma\left(X_t W_{xf} + H_{t-1} W_{hf} + b_f\right) \\
O_t &= \sigma\left(X_t W_{xo} + H_{t-1} W_{ho} + b_o\right) \\
\widetilde{C}_t &= \tanh\left(X_t W_{xc} + H_{t-1} W_{hc} + b_c\right) \\
C_t &= F_t \odot C_{t-1} + I_t \odot \widetilde{C}_t \\
H_t &= O_t \odot \tanh\left(C_t\right)
\end{aligned}
$$
```python
def lstm(inputs, state, params):
    [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c, W_hq, b_q] = params
    (H, C) = state
    outputs = []
    for X in inputs:
        I = torch.sigmoid(torch.matmul(X, W_xi) + torch.matmul(H, W_hi) + b_i)  # input gate
        F = torch.sigmoid(torch.matmul(X, W_xf) + torch.matmul(H, W_hf) + b_f)  # forget gate
        O = torch.sigmoid(torch.matmul(X, W_xo) + torch.matmul(H, W_ho) + b_o)  # output gate
        C_tilda = torch.tanh(torch.matmul(X, W_xc) + torch.matmul(H, W_hc) + b_c)  # candidate cell state
        C = F * C + I * C_tilda          # update the memory cell
        H = O * torch.tanh(C)            # hidden state exposed to the output layer
        Y = torch.matmul(H, W_hq) + b_q  # output layer
        outputs.append(Y)
    return outputs, (H, C)
```
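As with the GRU, a quick smoke test can confirm the shapes and the two-part (H, C) state. The sizes and random parameters below are assumptions made for illustration only.

```python
import torch

vocab_size, num_hiddens, batch_size, num_steps = 1027, 256, 2, 5  # assumed sizes

# Build the 14 parameters in the order the function unpacks them:
# (W_x*, W_h*, b_*) for input, forget, output gates and candidate cell, then W_hq, b_q.
shapes = [(vocab_size, num_hiddens), (num_hiddens, num_hiddens), (num_hiddens,)] * 4
params = [torch.randn(*s) * 0.01 if len(s) == 2 else torch.zeros(*s) for s in shapes]
params += [torch.randn(num_hiddens, vocab_size) * 0.01, torch.zeros(vocab_size)]

inputs = [torch.randn(batch_size, vocab_size) for _ in range(num_steps)]
state = (torch.zeros(batch_size, num_hiddens),   # H
         torch.zeros(batch_size, num_hiddens))   # C

outputs, (H, C) = lstm(inputs, state, params)
print(outputs[0].shape, H.shape, C.shape)
# torch.Size([2, 1027]) torch.Size([2, 256]) torch.Size([2, 256])
```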
Deep Recurrent Neural Networks
```python
num_hiddens = 256
num_epochs, num_steps, batch_size, lr, clipping_theta = 160, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 40, 50, ['分开', '不分开']
lr = 1e-2  # note: use a smaller learning rate for the nn-module implementation

# Stacked LSTM cells: num_layers sets the depth of the deep RNN
lstm_layer = nn.LSTM(input_size=vocab_size, hidden_size=num_hiddens, num_layers=2)
model = d2l.RNNModel(lstm_layer, vocab_size).to(device)
d2l.train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                                  corpus_indices, idx_to_char, char_to_idx,
                                  num_epochs, num_steps, lr, clipping_theta,
                                  batch_size, pred_period, pred_len, prefixes)
```
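To see what "deeper" means concretely, a quick shape check on a fresh two-layer LSTM (a separate, untrained instance created only for this illustration, assuming `vocab_size` and the variables above are already defined) shows one hidden-state slice per layer:

```python
import torch
from torch import nn

demo_lstm = nn.LSTM(input_size=vocab_size, hidden_size=num_hiddens, num_layers=2)
X = torch.randn(num_steps, batch_size, vocab_size)  # (num_steps, batch_size, vocab_size)
Y, (H, C) = demo_lstm(X)
print(Y.shape)  # (num_steps, batch_size, num_hiddens): outputs come from the top layer only
print(H.shape)  # (num_layers, batch_size, num_hiddens): one hidden state per layer
```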
Bidirectional Recurrent Neural Networks
$$
\begin{aligned}
\overrightarrow{H}_t &= \phi\left(X_t W_{xh}^{(f)} + \overrightarrow{H}_{t-1} W_{hh}^{(f)} + b_h^{(f)}\right) \\
\overleftarrow{H}_t &= \phi\left(X_t W_{xh}^{(b)} + \overleftarrow{H}_{t+1} W_{hh}^{(b)} + b_h^{(b)}\right) \\
H_t &= \left(\overrightarrow{H}_t, \overleftarrow{H}_t\right) \\
O_t &= H_t W_{hq} + b_q
\end{aligned}
$$
A quick note: the step H_t = (H→_t, H←_t) concatenates the forward and backward hidden states along the feature dimension. Concatenation is also used in several of the networks covered earlier, and it is a simple, effective way to combine features.
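As a small illustration (the tensors below are made up for the example), that concatenation is just `torch.cat` along the feature dimension:

```python
import torch

# Hypothetical forward/backward hidden states for one time step,
# each of shape (batch_size, num_hiddens).
H_forward = torch.randn(2, 4)
H_backward = torch.randn(2, 4)

# H_t = (H→_t, H←_t): join along the feature dimension.
H_t = torch.cat((H_forward, H_backward), dim=1)
print(H_t.shape)  # torch.Size([2, 8]): the output layer then sees 2 * num_hiddens features
```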
```python
num_hiddens = 128
num_epochs, num_steps, batch_size, lr, clipping_theta = 160, 35, 32, 1e-2, 1e-2
pred_period, pred_len, prefixes = 40, 50, ['分开', '不分开']
lr = 1e-2  # note: adjust the learning rate accordingly

# Bidirectional GRU: bidirectional=True adds a second GRU that reads the sequence backwards
gru_layer = nn.GRU(input_size=vocab_size, hidden_size=num_hiddens, bidirectional=True)
model = d2l.RNNModel(gru_layer, vocab_size).to(device)
d2l.train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                                  corpus_indices, idx_to_char, char_to_idx,
                                  num_epochs, num_steps, lr, clipping_theta,
                                  batch_size, pred_period, pred_len, prefixes)
```
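One practical consequence, shown in the shape check below (a fresh, untrained GRU created only for this illustration), is that bidirectional=True doubles the per-step output width to 2 * num_hiddens, which the output layer in d2l.RNNModel has to match:

```python
import torch
from torch import nn

demo_gru = nn.GRU(input_size=vocab_size, hidden_size=num_hiddens, bidirectional=True)
X = torch.randn(num_steps, batch_size, vocab_size)  # (num_steps, batch_size, vocab_size)
Y, H = demo_gru(X)
print(Y.shape)  # (num_steps, batch_size, 2 * num_hiddens): forward and backward outputs concatenated
print(H.shape)  # (num_directions, batch_size, num_hiddens): final state for each direction
```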
Some closing remarks
Studying RNNs is not really my cup of tea, so I will simply leave a couple of questions:
- How exactly do LSTM/GRU cells maintain long-term memory?
- What makes the concat operation so effective?