Modern Recurrent Neural Networks

Reference:

https://d2l.ai/chapter_recurrent-modern/index.html 9.1-9.4

Motivation

  • Some tokens may carry no pertinent observation → a mechanism for skipping such tokens in the latent state representation
  • There is a logical break between parts of a sequence → a means of resetting our internal state representation
  • An early observation is highly significant for predicting all future observations → a mechanism for storing vital early information in a memory cell

Gated Recurrent Units (GRU)

Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

A GRU supports gating of the hidden state: dedicated mechanisms decide when the hidden state should be updated and when it should be reset.

Reset Gate and Update Gate

  • Reset gate:
    • controls how much of the previous state we might still want to remember
    • helps capture short-term dependencies in sequences
  • Update gate:
    • controls how much of the new state is just a copy of the old state
    • helps capture long-term dependencies in sequences

Given the input of the current time step and the hidden state of the previous time step, the outputs of two gates are given by two fully-connected layers with a sigmoid activation function:

$\mathbf X_t \in \mathbb R^{n\times d}$: a minibatch with batch size $n$ and $d$ inputs

$\mathbf H_{t-1}\in \mathbb R^{n\times h}$: the hidden state of the previous time step ($h$ = number of hidden units)

$\mathbf R_t\in \mathbb R^{n\times h}$: the reset gate

$\mathbf Z_t \in \mathbb R^{n\times h}$: the update gate

$\mathbf W_{xr},\mathbf W_{xz}\in \mathbb R^{d\times h}$ and $\mathbf W_{hr}, \mathbf W_{hz}\in \mathbb R^{h\times h}$: weight parameters

$\mathbf b_r, \mathbf b_z\in \mathbb R^{1\times h}$: biases

$$
\begin{aligned}
\mathbf R_t&=\sigma(\mathbf X_t \mathbf W_{xr}+\mathbf H_{t-1}\mathbf W_{hr}+\mathbf b_r)\\
\mathbf Z_t&=\sigma(\mathbf X_t \mathbf W_{xz}+\mathbf H_{t-1}\mathbf W_{hz}+\mathbf b_z)
\end{aligned}\tag{GRU.1}
$$
../_images/gru-1.svg
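
To make the shapes in (GRU.1) concrete, here is a minimal NumPy sketch; the batch size, input and hidden dimensions, and the random initialization are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes (assumptions): batch n=2, inputs d=5, hidden units h=4.
rng = np.random.default_rng(0)
n, d, h = 2, 5, 4
X_t = rng.normal(size=(n, d))          # current minibatch X_t
H_prev = np.zeros((n, h))              # previous hidden state H_{t-1}

# Randomly initialized parameters stand in for learned weights and biases.
W_xr, W_hr, b_r = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros((1, h))
W_xz, W_hz, b_z = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros((1, h))

# (GRU.1): both gates are elementwise in (0, 1).
R_t = sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)   # reset gate, shape (n, h)
Z_t = sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)   # update gate, shape (n, h)
```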

Hidden State

Next, let us integrate the reset gate $\mathbf R_t$ with the regular latent state updating mechanism. This leads to the following candidate hidden state $\tilde {\mathbf H}_t\in \mathbb R^{n\times h}$ at time step $t$:

$$
\tilde{\mathbf{H}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + \left(\mathbf{R}_t \odot \mathbf{H}_{t-1}\right) \mathbf{W}_{hh} + \mathbf{b}_h) \tag{GRU.2}
$$

where $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ are weight parameters, $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$ is the bias, and the symbol $\odot$ denotes the Hadamard (elementwise) product.

../_images/gru-2.svg

Finally, we need to incorporate the effect of the update gate $\mathbf Z_t$. This determines the extent to which the new hidden state $\mathbf H_t\in \mathbb R^{n\times h}$ is just the old state $\mathbf H_{t-1}$ and by how much the new candidate state $\tilde {\mathbf H}_t$ is used:

$$
\mathbf H_t = \mathbf Z_t \odot \mathbf H_{t-1}+(1-\mathbf Z_t)\odot \tilde {\mathbf H}_t \tag{GRU.3}
$$
../_images/gru-3.svg
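
Putting (GRU.1)-(GRU.3) together, one time step of a GRU could be sketched as follows; the parameter layout, toy sizes, and random weights are assumptions made only for illustration, and in practice one would use a framework's built-in GRU layer rather than this hand-rolled version.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X_t, H_prev, p):
    """One GRU time step implementing (GRU.1)-(GRU.3); p is a dict of parameters."""
    R_t = sigmoid(X_t @ p["W_xr"] + H_prev @ p["W_hr"] + p["b_r"])              # reset gate
    Z_t = sigmoid(X_t @ p["W_xz"] + H_prev @ p["W_hz"] + p["b_z"])              # update gate
    H_tilde = np.tanh(X_t @ p["W_xh"] + (R_t * H_prev) @ p["W_hh"] + p["b_h"])  # (GRU.2)
    return Z_t * H_prev + (1.0 - Z_t) * H_tilde                                 # (GRU.3)

# Illustrative usage: toy sizes and random parameters (assumptions for the sketch).
rng = np.random.default_rng(0)
n, d, h, T = 2, 5, 4, 3
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in [
    ("W_xr", (d, h)), ("W_hr", (h, h)), ("b_r", (1, h)),
    ("W_xz", (d, h)), ("W_hz", (h, h)), ("b_z", (1, h)),
    ("W_xh", (d, h)), ("W_hh", (h, h)), ("b_h", (1, h)),
]}
H = np.zeros((n, h))
for X_t in rng.normal(size=(T, n, d)):   # run the recurrence over a toy sequence
    H = gru_step(X_t, H, p)
print(H.shape)                           # (2, 4)
```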

Long Short-Term Memory (LSTM)

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.

The LSTM introduces a memory cell that has the same shape as the hidden state (some literature treats the memory cell as a special type of hidden state), engineered to record additional information.

Input Gate, Forget Gate, and Output Gate

To control the memory cell we need a number of gates.

  • Input gate: to decide when to read data into the cell
  • Output gate: to read out the entries from the cell
  • Forget gate: to reset the content of the cell

The motivation for such a design is the same as that of GRUs, namely to be able to decide when to remember and when to ignore inputs in the hidden state via a dedicated mechanism.

Just like in GRUs, the data feeding into the LSTM gates are the input at the current time step and the hidden state of the previous time step. They are processed by three fully-connected layers with a sigmoid activation function to compute the values of the input, forget, and output gates. As a result, the values of the three gates are in the range $(0,1)$.

$\mathbf I_t\in \mathbb R^{n\times h}$: the input gate

$\mathbf O_t\in \mathbb R^{n\times h}$: the output gate

$\mathbf F_t \in \mathbb R^{n\times h}$: the forget gate

$\mathbf W_{xi},\mathbf W_{xo},\mathbf W_{xf}\in \mathbb R^{d\times h}$ and $\mathbf W_{hi}, \mathbf W_{ho}, \mathbf W_{hf}\in \mathbb R^{h\times h}$: weight parameters

$\mathbf b_i, \mathbf b_o, \mathbf b_f\in \mathbb R^{1\times h}$: biases

$$
\begin{aligned}
\mathbf I_t&=\sigma(\mathbf X_t \mathbf W_{xi}+\mathbf H_{t-1}\mathbf W_{hi}+\mathbf b_i)\\
\mathbf O_t&=\sigma(\mathbf X_t \mathbf W_{xo}+\mathbf H_{t-1}\mathbf W_{ho}+\mathbf b_o)\\
\mathbf F_t&=\sigma(\mathbf X_t \mathbf W_{xf}+\mathbf H_{t-1}\mathbf W_{hf}+\mathbf b_f)
\end{aligned}\tag{LSTM.1}
$$
../_images/lstm-0.svg
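
As with the GRU gates, the three gate computations in (LSTM.1) can be sketched directly in NumPy; the sizes and random parameters below are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy sizes and randomly initialized parameters (assumptions made for this sketch).
rng = np.random.default_rng(0)
n, d, h = 2, 5, 4
X_t = rng.normal(size=(n, d))     # current minibatch X_t
H_prev = np.zeros((n, h))         # previous hidden state H_{t-1}
W_xi, W_hi, b_i = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros((1, h))
W_xo, W_ho, b_o = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros((1, h))
W_xf, W_hf, b_f = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros((1, h))

# (LSTM.1): all three gates share the same inputs and are elementwise in (0, 1).
I_t = sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)   # input gate
O_t = sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)   # output gate
F_t = sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)   # forget gate
```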

Memory Cell

Next we design the memory cell. Since we have not yet specified the action of the various gates, we first introduce the candidate memory cell $\tilde {\mathbf C}_t\in\mathbb R^{n\times h}$. Its computation is similar to that of the three gates described above, but it uses a $\tanh$ function with a value range of $(-1,1)$ as the activation function:

$$
\tilde {\mathbf C}_t =\tanh(\mathbf X_t \mathbf W_{xc}+\mathbf H_{t-1}\mathbf W_{hc}+\mathbf b_c) \tag{LSTM.2}
$$

where $\mathbf W_{xc}\in \mathbb R^{d\times h}$ and $\mathbf W_{hc}\in \mathbb R^{h\times h}$ are weight parameters and $\mathbf b_c \in \mathbb R^{1\times h}$ is a bias parameter.

../_images/lstm-1.svg

In GRUs, we have a mechanism to govern input and forgetting (or skipping). Similarly, in LSTMs we have two dedicated gates for such purposes: the input gate $\mathbf I_t$ governs how much we take new data into account via $\tilde{\mathbf C}_t$, and the forget gate $\mathbf F_t$ addresses how much of the old memory cell content $\mathbf C_{t-1}\in \mathbb R^{n\times h}$ we retain:

$$
\mathbf C_t=\mathbf F_t\odot \mathbf C_{t-1}+\mathbf I_t \odot \tilde {\mathbf C}_t \tag{LSTM.3}
$$
This design is introduced to alleviate the vanishing gradient problem and to better capture long-range dependencies within sequences.

../_images/lstm-2.svg
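
A tiny numerical illustration of (LSTM.3), with made-up gate values, shows why this gating helps retain information over long ranges: if the forget gate stays near 1 and the input gate near 0, the cell content is carried across many steps almost unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, T = 2, 4, 50
C = rng.normal(size=(n, h))            # some initial memory cell content
C_start = C.copy()

for _ in range(T):
    F_t = np.full((n, h), 0.999)       # forget gate ~ 1: keep the old memory
    I_t = np.full((n, h), 0.001)       # input gate ~ 0: ignore new candidates
    C_tilde = np.tanh(rng.normal(size=(n, h)))
    C = F_t * C + I_t * C_tilde        # (LSTM.3)

print(np.max(np.abs(C - C_start)))     # small: early information survives 50 steps
```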

Hidden State

Last, we need to define how to compute the hidden state $\mathbf H_t \in \mathbb R^{n\times h}$. This is where the output gate comes into play. In the LSTM it is simply a gated version of the $\tanh$ of the memory cell. This ensures that the values of $\mathbf H_t$ are always in the interval $(-1,1)$:

$$
\mathbf H_t=\mathbf O_t \odot \tanh (\mathbf C_t) \tag{LSTM.4}
$$
Whenever the output gate is close to 1, we effectively pass all memory information through to the predictor, whereas when it is close to 0, we retain all the information within the memory cell and perform no further processing.

../_images/lstm-3.svg
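
Combining (LSTM.1)-(LSTM.4), one LSTM time step might be sketched as follows; the parameter names, toy sizes, and random initialization are assumptions for the sketch, not part of the reference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X_t, H_prev, C_prev, p):
    """One LSTM time step implementing (LSTM.1)-(LSTM.4)."""
    I_t = sigmoid(X_t @ p["W_xi"] + H_prev @ p["W_hi"] + p["b_i"])      # input gate
    F_t = sigmoid(X_t @ p["W_xf"] + H_prev @ p["W_hf"] + p["b_f"])      # forget gate
    O_t = sigmoid(X_t @ p["W_xo"] + H_prev @ p["W_ho"] + p["b_o"])      # output gate
    C_tilde = np.tanh(X_t @ p["W_xc"] + H_prev @ p["W_hc"] + p["b_c"])  # (LSTM.2)
    C_t = F_t * C_prev + I_t * C_tilde                                  # (LSTM.3)
    H_t = O_t * np.tanh(C_t)                                            # (LSTM.4)
    return H_t, C_t

# Illustrative usage with toy sizes and random parameters (assumptions).
rng = np.random.default_rng(0)
n, d, h, T = 2, 5, 4, 3
names = ["W_xi", "W_hi", "b_i", "W_xf", "W_hf", "b_f",
         "W_xo", "W_ho", "b_o", "W_xc", "W_hc", "b_c"]
shapes = [(d, h), (h, h), (1, h)] * 4
p = {k: rng.normal(scale=0.1, size=s) for k, s in zip(names, shapes)}
H, C = np.zeros((n, h)), np.zeros((n, h))
for X_t in rng.normal(size=(T, n, d)):
    H, C = lstm_step(X_t, H, C, p)
print(H.shape, C.shape)   # (2, 4) (2, 4)
```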

Deep Recurrent Neural Networks

We could stack multiple layers of RNNs on top of each other. Each hidden state is continuously passed to both the next time step of the current layer and the current time step of the next layer.

../_images/deep-rnn.svg

$\mathbf X_t \in \mathbb R^{n\times d}$: a minibatch with batch size $n$ and $d$ inputs

$\mathbf H_{t}^{(l)}\in \mathbb R^{n\times h}$: the hidden state of the $l^{th}$ hidden layer ($h$ = number of hidden units), with $\mathbf H_{t}^{(0)}=\mathbf X_t$.

$$
\mathbf H_{t}^{(l)}=\phi_l(\mathbf H_{t}^{(l-1)}\mathbf W_{xh}^{(l)}+\mathbf H_{t-1}^{(l)} \mathbf W_{hh}^{(l)}+\mathbf b_{h}^{(l)}) \tag{DRNN.1}
$$

In the end, the calculation of the output layer is only based on the hidden state of the final ($L^{th}$) hidden layer:

$$
\mathbf O_t=\mathbf H_{t}^{(L)}\mathbf W_{hq}+\mathbf b_q \tag{DRNN.2}
$$
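
A minimal sketch of (DRNN.1)-(DRNN.2), assuming two layers, toy sizes, and random parameters chosen purely for illustration:

```python
import numpy as np

def deep_rnn_step(X_t, H_prev, p, phi=np.tanh):
    """One time step of an L-layer RNN implementing (DRNN.1)-(DRNN.2).

    H_prev is the list of per-layer hidden states from the previous time step."""
    H_new, inp = [], X_t                      # layer 0 "hidden state" is the input
    for l in range(len(H_prev)):
        H_l = phi(inp @ p["W_xh"][l] + H_prev[l] @ p["W_hh"][l] + p["b_h"][l])  # (DRNN.1)
        H_new.append(H_l)
        inp = H_l                             # feed into the next layer
    O_t = H_new[-1] @ p["W_hq"] + p["b_q"]    # (DRNN.2): output uses the top layer only
    return O_t, H_new

# Illustrative usage: L=2 layers, toy sizes, random parameters (assumptions).
rng = np.random.default_rng(0)
n, d, h, q, L = 2, 5, 4, 3, 2
p = {
    "W_xh": [rng.normal(scale=0.1, size=(d if l == 0 else h, h)) for l in range(L)],
    "W_hh": [rng.normal(scale=0.1, size=(h, h)) for l in range(L)],
    "b_h":  [np.zeros((1, h)) for _ in range(L)],
    "W_hq": rng.normal(scale=0.1, size=(h, q)),
    "b_q":  np.zeros((1, q)),
}
H = [np.zeros((n, h)) for _ in range(L)]
for X_t in rng.normal(size=(3, n, d)):
    O_t, H = deep_rnn_step(X_t, H, p)
print(O_t.shape)   # (2, 3)
```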

Bidirectional Recurrent Neural Networks

Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.

To motivate why one might pick this specific architecture, we can first take a detour to probabilistic models.

Dynamic Programming in Hidden Markov Models

../_images/hmm.svg

For a sequence of $T$ observations we have the following joint probability distribution over the observed and hidden states:

$$
P(x_1,\cdots, x_T,h_1,\cdots,h_T)=\prod_{t=1}^T P(h_t\mid h_{t-1})P(x_t\mid h_t)\tag{HMM.1}
$$

where $P(h_1\mid h_0)=P(h_1)$.

Now assume that we observe all $x_i$ with the exception of some $x_j$, and our goal is to compute $P(x_j\mid x_{-j})$, where $x_{-j}=(x_1,\cdots, x_{j-1},x_{j+1},\cdots, x_T)$.

Since there is no latent variable in $P(x_j\mid x_{-j})$, we consider summing over all possible combinations of choices for $h_1,\cdots,h_T$. If each $h_i$ can take on $k$ distinct values, this means that we need to sum over $k^T$ terms, which is usually infeasible. Fortunately, there is an elegant solution for this: dynamic programming.

  • Forward recursion
    $$
    \begin{aligned}
    P(x_1, \ldots, x_T) &= \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
    &= \sum_{h_1, \ldots, h_T} \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
    &= \sum_{h_2, \ldots, h_T} \underbrace{\left[\sum_{h_1} P(h_1) P(x_1 \mid h_1) P(h_2 \mid h_1)\right]}_{\pi_2(h_2) \stackrel{\mathrm{def}}{=}} P(x_2 \mid h_2) \prod_{t=3}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
    &= \sum_{h_3, \ldots, h_T} \underbrace{\left[\sum_{h_2} \pi_2(h_2) P(x_2 \mid h_2) P(h_3 \mid h_2)\right]}_{\pi_3(h_3)\stackrel{\mathrm{def}}{=}} P(x_3 \mid h_3) \prod_{t=4}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t)\\
    &= \dots \\
    &= \sum_{h_T} \pi_T(h_T) P(x_T \mid h_T).
    \end{aligned}\tag{HMM.2}
    $$
    In general we have the forward recursion as
    $$
    \pi_{t+1}(h_{t+1})=\sum_{h_t} \pi_t(h_t)P(x_t \mid h_t)P(h_{t+1}\mid h_t)\tag{HMM.3}
    $$
    The recursion is initialized as $\pi_1(h_1)=P(h_1)$. In abstract terms this can be written as $\pi_{t+1}=f(\pi_t,x_t)$, where $f$ is some learnable function. This looks very much like the update equation in the latent variable models we discussed so far in the context of RNNs!

  • Backward recursion
    $$
    \begin{aligned}
    P(x_1, \ldots, x_T) &= \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
    &= \sum_{h_1, \ldots, h_T} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot P(h_T \mid h_{T-1}) P(x_T \mid h_T) \\
    &= \sum_{h_1, \ldots, h_{T-1}} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_T} P(h_T \mid h_{T-1}) P(x_T \mid h_T)\right]}_{\rho_{T-1}(h_{T-1})\stackrel{\mathrm{def}}{=}} \\
    &= \sum_{h_1, \ldots, h_{T-2}} \prod_{t=1}^{T-2} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_{T-1}} P(h_{T-1} \mid h_{T-2}) P(x_{T-1} \mid h_{T-1}) \rho_{T-1}(h_{T-1}) \right]}_{\rho_{T-2}(h_{T-2})\stackrel{\mathrm{def}}{=}} \\
    &= \ldots \\
    &= \sum_{h_1} P(h_1) P(x_1 \mid h_1)\rho_{1}(h_{1})
    \end{aligned}\tag{HMM.4}
    $$
    We can thus write the backward recursion as
    $$
    \rho_{t-1}(h_{t-1})=\sum_{h_t} P(h_t \mid h_{t-1})P(x_{t}\mid h_t)\rho_{t}(h_{t}) \tag{HMM.5}
    $$
    with initialization $\rho_T(h_T)=1$. In abstract terms the backward recursion can be written as $\rho_{t-1}=g(\rho_t,x_t)$, where $g$ is a learnable function. Again, this looks very much like an update equation, just running backwards, unlike what we have seen so far in RNNs.

  • Forward and backward recursion
    $$
    \begin{aligned}
    P(x_1, \ldots, x_T) &= \sum_{h_1, \ldots, h_T} \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
    &= \sum_{h_j} \underbrace{\left[\sum_{h_1, \ldots, h_{j-1}} \prod_{t=1}^{j-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot P(h_j \mid h_{j-1})\right]}_{\pi_j(h_j)} P(x_j \mid h_j) \underbrace{\left[\sum_{h_{j+1}, \ldots, h_{T}} \prod_{t=j+1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \right]}_{\rho_j(h_j)} \\
    &= \sum_{h_j} \pi_j(h_j) P(x_j\mid h_j) \rho_j(h_j)
    \end{aligned}\tag{HMM.6}
    $$

From the results above, we are able to compute

$$
P(x_j\mid x_{-j})\propto \sum_{h_j} \pi_j(h_j) P(x_j\mid h_j) \rho_j(h_j) \tag{HMM.7}
$$

These recursions allow us to sum over the $T$ latent variables in $\mathcal O(kT)$ (linear) time over all values of $(h_1,\cdots,h_T)$ rather than in exponential time.
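
The two recursions are easy to verify numerically. The sketch below uses made-up HMM parameters (an assumption purely for illustration) and checks that combining the forward and backward quantities as in (HMM.6) reproduces the brute-force sum over all $k^T$ hidden-state sequences.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
k, T = 3, 6                                   # k hidden states, T time steps (toy sizes)
# Made-up HMM parameters: P(h_1), P(h_t | h_{t-1}), P(x_t | h_t) for fixed observed x_1..x_T.
p1 = rng.dirichlet(np.ones(k))                # initial distribution over h_1
A = rng.dirichlet(np.ones(k), size=k)         # A[i, j] = P(h_t = j | h_{t-1} = i)
emit = rng.uniform(0.1, 1.0, size=(T, k))     # emit[t, i] = P(x_t | h_t = i)

# Forward recursion (HMM.3): pi_{t+1}(h) = sum_h' pi_t(h') P(x_t|h') P(h|h').
pi = [p1]
for t in range(T - 1):
    pi.append((pi[t] * emit[t]) @ A)

# Backward recursion (HMM.5): rho_{t-1}(h) = sum_h' P(h'|h) P(x_t|h') rho_t(h').
rho = [np.ones(k)]
for t in range(T - 1, 0, -1):
    rho.insert(0, A @ (emit[t] * rho[0]))

# P(x_1..x_T) via (HMM.6) at any j agrees with brute-force summation over k**T paths.
j = 2
dp = np.sum(pi[j] * emit[j] * rho[j])
brute = sum(p1[hs[0]] * np.prod([A[hs[t - 1], hs[t]] for t in range(1, T)])
            * np.prod([emit[t, hs[t]] for t in range(T)])
            for hs in product(range(k), repeat=T))
print(np.isclose(dp, brute))                  # True
```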

Bidirectional Model

If we want to have a mechanism in RNNs that offers comparable look-ahead ability as in hidden Markov models, we need to modify the RNN design that we have seen so far. Instead of running an RNN only in the forward mode starting from the first token, we start another one from the last token running from back to front. Bidirectional RNNs add a hidden layer that passes information in a backward direction to more flexibly process such information.

../_images/birnn.svg

In fact, this is not too dissimilar to the forward and backward recursions in the dynamic programming of hidden Markov models.

$\mathbf X_t \in \mathbb R^{n\times d}$: a minibatch with batch size $n$ and $d$ inputs

$\overrightarrow{\mathbf{H}}_t\in \mathbb R^{n\times h}, \overleftarrow{\mathbf{H}}_t\in \mathbb R^{n\times h}$: the forward and backward hidden states for this time step

$\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}, \mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}, \mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}, \mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$: weights

$\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h}, \mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$: biases

$$
\begin{aligned}
\overrightarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{hh}^{(f)} + \mathbf{b}_h^{(f)}),\\
\overleftarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{hh}^{(b)} + \mathbf{b}_h^{(b)}),
\end{aligned} \tag{BM.1}
$$

Next, we concatenate the forward and backward hidden states $\overrightarrow{\mathbf{H}}_t$ and $\overleftarrow{\mathbf{H}}_t$ to obtain the hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times 2h}$. Last, the output layer computes the output $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$):

$$
\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q \tag{BM.2}
$$

Here, the weight matrix $\mathbf{W}_{hq} \in \mathbb{R}^{2h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer. In fact, the two directions can have different numbers of hidden units.
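
A minimal sketch of (BM.1)-(BM.2), assuming toy sizes and random parameters: the forward states are computed left to right, the backward states right to left, and the two are concatenated before the output layer. Because the backward states depend on later inputs, the whole sequence must be available before any output can be produced.

```python
import numpy as np

def birnn_forward(X_seq, p, phi=np.tanh):
    """Bidirectional RNN over a whole sequence, implementing (BM.1)-(BM.2).

    X_seq has shape (T, n, d); returns outputs of shape (T, n, q)."""
    T, n, _ = X_seq.shape
    h = p["W_hh_f"].shape[0]
    H_fwd, H_bwd = [None] * T, [None] * T

    H = np.zeros((n, h))                       # forward pass, t = 0 .. T-1
    for t in range(T):
        H = phi(X_seq[t] @ p["W_xh_f"] + H @ p["W_hh_f"] + p["b_h_f"])
        H_fwd[t] = H

    H = np.zeros((n, h))                       # backward pass, t = T-1 .. 0
    for t in reversed(range(T)):
        H = phi(X_seq[t] @ p["W_xh_b"] + H @ p["W_hh_b"] + p["b_h_b"])
        H_bwd[t] = H

    # Concatenate both directions, then apply the output layer (BM.2).
    return np.stack([np.concatenate([H_fwd[t], H_bwd[t]], axis=1) @ p["W_hq"] + p["b_q"]
                     for t in range(T)])

# Illustrative usage with toy sizes and random parameters (assumptions).
rng = np.random.default_rng(0)
T, n, d, h, q = 4, 2, 5, 3, 2
p = {"W_xh_f": rng.normal(scale=0.1, size=(d, h)), "W_hh_f": rng.normal(scale=0.1, size=(h, h)),
     "b_h_f": np.zeros((1, h)),
     "W_xh_b": rng.normal(scale=0.1, size=(d, h)), "W_hh_b": rng.normal(scale=0.1, size=(h, h)),
     "b_h_b": np.zeros((1, h)),
     "W_hq": rng.normal(scale=0.1, size=(2 * h, q)), "b_q": np.zeros((1, q))}
print(birnn_forward(rng.normal(size=(T, n, d)), p).shape)   # (4, 2, 2)
```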
