Chapter 14 Recurrent Neural Networks

Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow.

Recurrent Neural Networks (RNNs) are a class of networks that can predict the future. They can:

  • analyze time series data
  • anticipate car trajectories in autonomous driving systems
  • work on sequences of arbitrary lengths

14.1 Recurrent Neurons

The simplest possible RNN is composed of just one neuron receiving inputs, producing an output, and sending that output back to itself, as shown in Figure 14-1 (left). At each time step $t$ (also called a frame), this recurrent neuron receives the inputs $\textbf x_{(t)}$ as well as its own output from the previous time step, $y_{(t-1)}$. We can represent this tiny network against the time axis, as shown in Figure 14-1 (right). This is called unrolling the network through time.

You can easily create a layer of recurrent neurons. At each time step $t$, every neuron receives both the input vector $\textbf x_{(t)}$ and the output vector from the previous time step $\textbf y_{(t-1)}$, as shown in Figure 14-2. Note that both the inputs and outputs are vectors now (when there was just a single neuron, the output was a scalar).

Each recurrent neuron has two sets of weights: one for the inputs $\textbf x_{(t)}$ and the other for the outputs of the previous time step, $\textbf y_{(t-1)}$. Let’s call these weight vectors $\textbf w_x$ and $\textbf w_y$.

Equation 14-1. Output of a single recurrent neuron for a single instance
$$\textbf y_{(t)}=\phi\left(\textbf x_{(t)}^T\cdot\textbf w_x+\textbf y_{(t-1)}^T\cdot\textbf w_y+b\right)$$

  • $\textbf x_{(t)}$ is an $n_{\textrm{inputs}}\times 1$ vector
  • $\textbf w_x$ is an $n_{\textrm{inputs}}\times 1$ vector
  • $\textbf y_{(t-1)}$, $\textbf y_{(t)}$, $\textbf w_y$, and $b$ are scalars
  • $\phi$ is an activation function, e.g., ReLU

Equation 14-2. Outputs of a layer of recurrent neurons for all instances in a mini-batch
$$
\textbf Y_{(t)}=\phi\left(\textbf X_{(t)}\cdot\textbf W_x+\textbf Y_{(t-1)}\cdot\textbf W_y+\textbf b\right)=\phi\left(\left[\textbf X_{(t)}\ \ \textbf Y_{(t-1)}\right]\cdot\textbf W+\textbf b\right)\\
\textrm{with }\textbf W=\left[\begin{matrix}\textbf W_x\\\textbf W_y\end{matrix}\right]
$$

  • $\textbf Y_{(t)}$ is an $m\times n_\textrm{neurons}$ matrix containing the layer’s outputs at time step $t$ for each instance in the mini-batch ($m$ is the number of instances in the mini-batch and $n_\textrm{neurons}$ is the number of neurons).
  • $\textbf X_{(t)}$ is an $m\times n_\textrm{inputs}$ matrix containing the inputs for all instances ($n_\textrm{inputs}$ is the number of input features).
  • $\textbf W_x$ is an $n_\textrm{inputs}\times n_\textrm{neurons}$ matrix containing the connection weights for the inputs of the current time step.
  • $\textbf W_y$ is an $n_\textrm{neurons}\times n_\textrm{neurons}$ matrix containing the connection weights for the outputs of the previous time step.
  • The weight matrices $\textbf W_x$ and $\textbf W_y$ are often concatenated into a single weight matrix $\textbf W$ of shape $(n_\textrm{inputs}+n_\textrm{neurons})\times n_\textrm{neurons}$ (see the second line of Equation 14-2).
  • $\textbf b$ is a vector of size $n_\textrm{neurons}$ containing each neuron’s bias term (the same for every instance in the mini-batch).

Notice that Y(t) is a function of X(t) and Y(t–1), which is a function of X(t–1) and Y(t–2), which is a function of X(t–2) and Y(t–3), and so on. This makes Y(t) a function of all the inputs since time t = 0 (that is, X(0), X(1), …, X(t)). At the first time step, t = 0, there are no previous outputs, so they are typically assumed to be all zeros.
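
To make this concrete, here is a minimal NumPy sketch (not from the book) of Equation 14-2: a layer of 5 recurrent neurons unrolled over two time steps, starting from a zero state.

import numpy as np

# Minimal sketch of Equation 14-2 with random stand-in weights (illustrative only).
rng = np.random.RandomState(42)
n_inputs, n_neurons, m = 3, 5, 4            # 4 instances in the mini-batch

Wx = rng.randn(n_inputs, n_neurons)         # weights for the current inputs
Wy = rng.randn(n_neurons, n_neurons)        # weights for the previous outputs
b = np.zeros((1, n_neurons))                # bias term

X0 = rng.rand(m, n_inputs)                  # inputs at t = 0
X1 = rng.rand(m, n_inputs)                  # inputs at t = 1

Y_init = np.zeros((m, n_neurons))           # no previous outputs at t = 0
Y0 = np.tanh(X0 @ Wx + Y_init @ Wy + b)     # outputs at t = 0
Y1 = np.tanh(X1 @ Wx + Y0 @ Wy + b)         # depends on X1 and Y0, hence on X0 too
print(Y0.shape, Y1.shape)                   # (4, 5) (4, 5)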

14.1.1 Memory Cells

A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell).

A cell’s state at time step $t$, denoted $\textbf h_{(t)}$ (the “h” stands for “hidden”), is a function of some inputs at that time step and its state at the previous time step: $\textbf h_{(t)}=f(\textbf h_{(t-1)}, \textbf x_{(t)})$. Its output at time step $t$, denoted $\textbf y_{(t)}$, is also a function of the previous state and the current inputs. In the case of the basic cells we have discussed so far, the output is simply equal to the state, but in more complex cells this is not always the case, as shown in Figure 14-3.

14.1.2 Input and Output Sequences

sequence-to-sequence networks: stock price prediction. Feed the network the prices over the last $N$ days, and it must output the prices shifted by one day into the future.

sequence-to-vector: feed the network a sequence of words corresponding to a movie review, and output a sentiment score (e.g., from –1 [hate] to +1 [love]).

vector-to-sequence: the input could be an image, and the output could be a caption for that image.

Encoder–Decoder: a sequence-to-vector network, called an encoder, followed by a vector-to-sequence network, called a decoder. This can be used for translating a sentence from one language to another. You would feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation, and then the decoder would decode this vector into a sentence in another language. Encoder-Decoder works much better than trying to translate on the fly with a single sequence-to-sequence RNN, since the last
words of a sentence can affect the first words of the translation, so you need to wait until you have heard the whole sentence before translating it.

14.2 Basic RNNs in TensorFlow

import tensorflow as tf
import numpy as np

n_inputs=3
n_neurons=5

tf.reset_default_graph()

X0=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X0")
X1=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X1")

#tf.Variable are model parameters
Wx=tf.Variable(tf.random_normal(shape=(n_inputs,n_neurons),dtype=tf.float32))
Wy=tf.Variable(tf.random_normal(shape=(n_neurons,n_neurons),dtype=tf.float32))
b=tf.Variable(tf.zeros((1,n_neurons),dtype=tf.float32))

Y0=tf.tanh(tf.matmul(X0,Wx)+b)
Y1=tf.tanh(tf.matmul(X1,Wx)+tf.matmul(Y0,Wy)+b)

init=tf.global_variables_initializer()

#mini-batch:        instance 0,instance 1,instance 2,instance 3
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t = 1

with tf.Session() as sess:
    init.run()
    Y0_val,Y1_val=sess.run([Y0,Y1],feed_dict={X0:X0_batch,X1:X1_batch})
print(Y0_val)
print(Y1_val)

Output:

# output at t = 0
[[-0.4560281  -0.6136596  -0.9905464  -0.77564955  0.9993651 ] #instance0
 [-0.9549303   0.9859206  -1.         -0.99163646  1.        ] #instance1
 [-0.99715877  0.99997604 -1.         -0.9997209   1.        ] #instance2
 [-0.789206    0.99960274 -1.         -0.86458516 -0.9993621 ]]#instance3
# output at t = 1
[[-0.9913576   0.99999964 -1.         -0.9999859   1.        ] #instance0
 [-0.5038604  -0.99144584  0.7425929  -0.8621742   0.9904313 ] #instance1
 [-0.9935018   0.99963355 -1.         -0.99864584  1.        ] #instance2
 [-0.99863243  0.99194056 -0.9966107  -0.81543124  0.95412266]]#instance3

14.2.1 Static Unrolling Through Time

tf.reset_default_graph()

X0=tf.placeholder(tf.float32,shape=(None,n_inputs))
X1=tf.placeholder(tf.float32,shape=(None,n_inputs))

basic_cell= tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
output_seqs,states =tf.nn.static_rnn(basic_cell,[X0,X1],
                                    dtype=tf.float32)
Y0,Y1=output_seqs

init=tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    Y0_val,Y1_val=sess.run([Y0,Y1],feed_dict={X0:X0_batch,X1:X1_batch})
    states_val=sess.run(states,feed_dict={X0:X0_batch,X1:X1_batch})
print(Y0_val)
print(Y1_val)
print(states_val)

Outputs:

[[-0.8599234   0.7902698   0.6345394   0.14662108  0.67297184]
 [-0.9711478   0.96196294 -0.39417985  0.24596632 -0.4447789 ]
 [-0.9943267   0.99360377 -0.9189848   0.34039935 -0.9438828 ]
 [ 0.9999017  -0.9993643  -0.9997142   0.97324    -0.99972504]]
[[-0.9426516   0.91199005 -0.99943197 -0.37630665 -0.99996483]
 [-0.22798909  0.6336572  -0.49914655  0.43264344 -0.38135213]
 [-0.3446843   0.9518936  -0.99922     0.7550421  -0.9986572 ]
 [ 0.952438   -0.22679709 -0.91934645  0.5048858  -0.9532887 ]]
[[-0.9426516   0.91199005 -0.99943197 -0.37630665 -0.99996483]
 [-0.22798909  0.6336572  -0.49914655  0.43264344 -0.38135213]
 [-0.3446843   0.9518936  -0.99922     0.7550421  -0.9986572 ]
 [ 0.952438   -0.22679709 -0.91934645  0.5048858  -0.9532887 ]]

BasicRNNCell can be thought of as a factory that creates copies of the cell to build the unrolled RNN (one for each time step). Then we call static_rnn(), giving it the cell factory
and the input tensors, and telling it the data type of the inputs (this is used to create the initial state matrix, which by default is full of zeros). The static_rnn() function calls the cell factory’s __call__() function once per input, creating two copies of the cell (each containing a layer of five recurrent neurons), with shared weights and bias terms, and it chains them just like we did earlier. The static_rnn() function returns two objects. The first is a Python list containing the output tensors for each time step. The second is a tensor containing the final states of the network. When you are using basic cells, the final state is simply equal to the last output.

The following code builds the same RNN again, but this time it takes a single input placeholder of shape [None, n_steps, n_inputs] where the first dimension is the mini-batch size. Then it extracts the list of input sequences for each time step. X_seqs is a Python list of n_steps tensors of shape [None, n_inputs], where once again the first dimension is the mini-batch size. To do this, we first swap the first two dimensions using the transpose() function, so that the time steps are now the first dimension. Then we extract a Python list of tensors along the first dimension (i.e., one tensor per time step) using the unstack() function. The next two lines are the same as before. Finally, we merge all the output tensors into a single tensor using the stack() function, and we swap the first two dimensions to get a final outputs tensor of shape [None, n_steps, n_neurons] (again the first dimension is the mini-batch size).

n_steps=2
n_inputs=3
n_neurons=5

tf.reset_default_graph()

X=tf.placeholder(tf.float32,[None,n_steps,n_inputs])
X_seqs=tf.unstack(tf.transpose(X,perm=[1,0,2]))

basic_cell=tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
output_seqs,states=tf.nn.static_rnn(basic_cell,X_seqs,
                                    dtype=tf.float32)
outputs=tf.transpose(tf.stack(output_seqs),perm=[1,0,2])

init=tf.global_variables_initializer()

X_batch=np.array([
    
    # t = 0     t = 1
    [[0, 1, 2], [9, 8, 7]], # instance 1
    [[3, 4, 5], [0, 0, 0]], # instance 2
    [[6, 7, 8], [6, 5, 4]], # instance 3
    [[9, 0, 1], [3, 2, 1]], # instance 4    
])

with tf.Session() as sess:
    init.run()
    outputs_val=outputs.eval(feed_dict={X:X_batch})

print(outputs_val)

Outputs:

[[[-0.31644356 -0.9353902   0.22770049  0.79628426 -0.4265463 ]
  [-0.99981856 -1.          0.99999344  0.99937767  0.9994444 ]]

 [[-0.9790442  -0.9999978   0.9789977   0.9964669   0.58451307]
  [ 0.78789717  0.7575957   0.29761294 -0.52544653  0.56956077]]

 [[-0.9995683  -1.          0.99964195  0.99994475  0.94620717]
  [-0.98962724 -0.99999964  0.9993963   0.98958486  0.9984626 ]]

 [[-0.9998599  -0.9999702   0.9996103  -0.90244126  0.9998807 ]
  [-0.95623946 -0.99702513  0.9783163   0.94345295  0.9598671 ]]]
print(np.transpose(outputs_val,axes=[1,0,2])[1])

Outputs:

[[-0.99981856 -1.          0.99999344  0.99937767  0.9994444 ]
 [ 0.78789717  0.7575957   0.29761294 -0.52544653  0.56956077]
 [-0.98962724 -0.99999964  0.9993963   0.98958486  0.9984626 ]
 [-0.95623946 -0.99702513  0.9783163   0.94345295  0.9598671 ]]

14.2.2 Dynamic Unrolling Through Time

The dynamic_rnn() function uses a while_loop() operation to run over the cell the
appropriate number of times, and you can set swap_memory=True if you want it to swap the GPU’s memory to the CPU’s memory during backpropagation to avoid OOM errors. Conveniently, it also accepts a single tensor for all inputs at every time step (shape [None, n_steps, n_inputs]) and it outputs a single tensor for all outputs at every time step (shape [None, n_steps, n_neurons]); there is no need to stack, unstack, or transpose. The following code creates the same RNN as earlier using the dynamic_rnn() function.

tf.reset_default_graph()
X=tf.placeholder(tf.float32,[None,n_steps,n_inputs])

basic_cell=tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs,states =tf.nn.dynamic_rnn(basic_cell,X,dtype=tf.float32)

init=tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    outputs_val=outputs.eval(feed_dict={X:X_batch})
print(outputs_val)

14.2.3 Handling Variable-Length Input Sequences

#sequence_length must be a 1D tensor indicating the length of the input
#sequence for each instance.
tf.reset_default_graph()

seq_length = tf.placeholder(tf.int32,[None])

X=tf.placeholder(tf.float32,[None,n_steps,n_inputs])

basic_cell= tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs,states=tf.nn.dynamic_rnn(basic_cell,X,dtype=tf.float32,
                                sequence_length=seq_length)

X_batch = np.array([
        # step 0     step 1
        [[0, 1, 2], [9, 8, 7]], # instance 0
        [[3, 4, 5], [0, 0, 0]], # instance 1 (padded with a zero vector)
        [[6, 7, 8], [6, 5, 4]], # instance 2
        [[9, 0, 1], [3, 2, 1]], # instance 3
])

seq_length_batch = np.array([2, 1, 2, 2])

init=tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    outputs_val,states_val=sess.run([outputs,states],feed_dict={X:X_batch,seq_length:seq_length_batch})

print(outputs_val)
print(states_val)

Outputs:

[[[ 0.7506502  -0.73288244  0.6886906  -0.7844114   0.46216345]
  [ 1.          0.99868387 -0.7892326  -0.9867889  -0.99078304]]# final state

 [[ 0.9998594  -0.25786185  0.5101236  -0.96387774 -0.39268893]# final state
  [ 0.          0.          0.          0.          0.        ]]#zero state

 [[ 1.0000001   0.38617173  0.2732133  -0.9944151  -0.86925167]
  [ 0.99999636  0.99638826 -0.8844117  -0.69347805 -0.83941436]]# final state

 [[ 0.99975586  0.999811   -0.9798193   0.9988527  -0.9977313 ]
  [ 0.9997027   0.8874132  -0.80941224  0.87725645 -0.54265726]]]# final state
[[ 1.          0.99868387 -0.7892326  -0.9867889  -0.99078304]# t = 1
 [ 0.9998594  -0.25786185  0.5101236  -0.96387774 -0.39268893]# t = 0!!!
 [ 0.99999636  0.99638826 -0.8844117  -0.69347805 -0.83941436]# t = 1
 [ 0.9997027   0.8874132  -0.80941224  0.87725645 -0.54265726]]# t = 1

14.2.4 Handling Variable-Length Output Sequences

The output sequence may have a different length from the input sequence; for example, the length of a translated sentence generally differs from the length of the input sentence. In this case, the most common solution is to define a special output called an end-of-sequence token (EOS token). Any output past the EOS should be ignored.
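
As a rough sketch (not the book's code), assuming you train on padded target sequences and know each target's true length, you can zero out the loss terms past the EOS token with a sequence mask; target_lengths and per_step_loss below are hypothetical tensors:

# Hedged sketch: ignore loss terms past each target sequence's EOS token.
target_lengths = tf.placeholder(tf.int32, [None])            # true length per instance
per_step_loss = tf.placeholder(tf.float32, [None, n_steps])  # loss per instance and step

mask = tf.sequence_mask(target_lengths, maxlen=n_steps, dtype=tf.float32)
masked_loss = tf.reduce_sum(per_step_loss * mask) / tf.reduce_sum(mask)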

14.3 Training RNNs

Backpropagation through time (BPTT): unroll the RNN through time and then simply use regular backpropagation. Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output.

14.3.1 Training a Sequence Classifier

import tensorflow as tf
import numpy as np

tf.reset_default_graph()

n_steps=28
n_inputs=28
n_neurons=150
n_outputs=10

learning_rate=0.001

X=tf.placeholder(tf.float32, [None,n_steps,n_inputs])
y=tf.placeholder(tf.int32,[None])

basic_cell=tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs,states=tf.nn.dynamic_rnn(basic_cell,X,dtype=tf.float32)

#Note that the fully connected layer is connected to the states tensor,
#which contains only the final state of the RNN 
logits=tf.layers.dense(states,n_outputs)
xentropy=tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                       labels=y)
loss=tf.reduce_mean(xentropy)
optimizer=tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op=optimizer.minimize(loss)
correct=tf.nn.in_top_k(logits,y,1)
accuracy=tf.reduce_mean(tf.cast(correct,tf.float32))

init=tf.global_variables_initializer()

#read data
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch
        
n_epochs=100
batch_size=150
X_test=X_test.reshape((-1,n_steps,n_inputs))

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_size):
            X_batch=X_batch.reshape((-1,n_steps,n_inputs))
            sess.run(training_op,feed_dict={X:X_batch,y:y_batch})    
        acc_train=accuracy.eval(feed_dict={X:X_batch,y:y_batch})    
        acc_test=accuracy.eval(feed_dict={X:X_test,y:y_test})
        print(epoch, "Train accuracy:",acc_train,"Test accuracy:",acc_test)

You would certainly get a better result by tuning the hyperparameters, initializing the RNN weights using He initialization, training longer, or adding a bit of regularization (e.g., dropout).

You can specify an initializer for the RNN by wrapping its construction code in a variable scope (e.g., use variable_scope("rnn", initializer=variance_scaling_initializer()) to use He initialization).
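
For example, a rough sketch (reusing the placeholders from the example above) might look like this; the exact scope name is not important:

# Hedged sketch: wrap the RNN construction in a variable scope whose default
# initializer is a He (variance-scaling) initializer.
he_init = tf.variance_scaling_initializer()
with tf.variable_scope("rnn", initializer=he_init):
    basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)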

14.3.2 Training to Predict Time Series

Each training instance is a randomly selected sequence of 20 consecutive values from the time series, and the target sequence is the same as the input sequence, except it is shifted by one time step into the future.

At each time step we now have an output vector of size 100. But what we actually want is a single output value at each time step. The simplest solution is to wrap the cell in an OutputProjectionWrapper. A cell wrapper acts like a normal cell, proxying every method call to an underlying cell, but it also adds some functionality. The OutputProjectionWrapper adds a fully connected layer of linear neurons (i.e., without any activation function) on top of each output (but it does not affect the cell state). All these fully connected layers share the same (trainable) weights and bias terms. The resulting RNN is represented in Figure 14-8.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
n_steps=20
n_inputs=1
n_neurons=100
n_outputs=1

t_min, t_max=0, 30
resolution=0.1

def time_series(t):
    return t*np.sin(t)/3 +2*np.sin(t*5)

def next_batch(batch_size,n_steps):
    t0 = np.random.rand(batch_size, 1) * (t_max - t_min - n_steps * resolution)
    Ts = t0 + np.arange(0., n_steps + 1) * resolution
    ys = time_series(Ts)
    return ys[:, :-1].reshape(-1, n_steps, 1), ys[:, 1:].reshape(-1, n_steps, 1)

t = np.linspace(t_min, t_max, int((t_max - t_min) / resolution))
t_instance = np.linspace(12.2, 12.2 + resolution * (n_steps + 1), n_steps + 1)

tf.reset_default_graph()

X=tf.placeholder(tf.float32,[None,n_steps,n_inputs])
y=tf.placeholder(tf.float32,[None,n_steps,n_outputs])
cell =tf.contrib.rnn.OutputProjectionWrapper(
    tf.nn.rnn_cell.BasicRNNCell(num_units= n_neurons,activation=tf.nn.relu),
                               output_size=n_outputs)

outputs,states =tf.nn.dynamic_rnn(cell,X,dtype=tf.float32)

learning_rate=0.001

loss=tf.reduce_mean(tf.square(outputs-y)) #MSE
optimizer =tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op=optimizer.minimize(loss)

init=tf.global_variables_initializer()
saver=tf.train.Saver()

n_iterations =1500
batch_size=50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch,y_batch=next_batch(batch_size,n_steps)
        sess.run(training_op,feed_dict={X:X_batch,y:y_batch})
        if iteration % 100 ==0:
            mse=loss.eval(feed_dict={X:X_batch,y:y_batch})
            print(iteration, "\tMSE:",mse)
    saver.save(sess, "./my_time_series_model")
with tf.Session() as sess:
    saver.restore(sess,save_path='./my_time_series_model')
    X_new =time_series(np.array(t_instance[:-1].reshape(-1,n_steps,n_inputs)))
    y_pred=sess.run(outputs,feed_dict={X:X_new})

plt.title("Testing the model", fontsize=14)
plt.plot(t_instance[:-1], time_series(t_instance[:-1]), "bo", markersize=10, label="instance")
plt.plot(t_instance[1:], time_series(t_instance[1:]), "w*", markersize=10, label="target")
plt.plot(t_instance[1:], y_pred[0,:,0], "r.", markersize=10, label="prediction")
plt.legend(loc="upper left")
plt.xlabel("Time")

plt.show();

Although using an OutputProjectionWrapper is the simplest solution to reduce the
dimensionality of the RNN’s output sequences down to just one value per time step (per instance), it is not the most efficient. There is a trickier but more efficient solution: you can reshape the RNN outputs from [batch_size, n_steps, n_neurons] to [batch_size * n_steps, n_neurons], then apply a single fully connected layer with the appropriate output size (in our case just 1), which will result in an output tensor of shape [batch_size * n_steps, n_outputs], and then reshape this tensor to [batch_size, n_steps, n_outputs]. These operations are represented in Figure 14-10.

cell =tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons,activation=tf.nn.relu)
rnn_outputs,states =tf.nn.dynamic_rnn(cell,X,dtype=tf.float32)
stacked_rnn_outputs=tf.reshape(rnn_outputs,[-1,n_neurons])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs,n_outputs)
outputs=tf.reshape(stacked_outputs,[-1,n_steps,n_outputs])

14.3.3 Creative RNN

All we need is to provide it with a seed sequence containing n_steps values (e.g., full of zeros), use the model to predict the next value, append this predicted value to the sequence, feed the last n_steps values to the model to predict the next value, and so on. This process generates a new sequence that has some resemblance to the original time series (see Figure 14-11).

with tf.Session() as sess:
    saver.restore(sess, "./my_time_series_model")
    
    sequence = [0.]*n_steps
    for iteration in range(100):
        X_batch = np.array(sequence[-n_steps:]).reshape(1,n_steps,1)
        y_pred= sess.run(outputs,feed_dict={X:X_batch})
        sequence.append(y_pred[0,-1,0])

14.4 Deep RNNs

n_inputs=2
n_steps=5
n_neurons =100
n_layers = 3

tf.reset_default_graph()

X=tf.placeholder(tf.float32, [None, n_steps, n_inputs])
layers = [tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
          for layer in range(n_layers)] 
multi_layer_cell = tf.nn.rnn_cell.MultiRNNCell(layers)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

init=tf.global_variables_initializer()
X_batch = np.random.rand(2, n_steps, n_inputs)
with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run([outputs, states], feed_dict={X: X_batch})
print(outputs_val.shape) # (2, 5, 100)

The states variable is a tuple containing one tensor per layer, each representing the final state of that layer’s cell (with shape [batch_size, n_neurons]). If you set state_is_tuple = False when creating the MultiRNNCell, then states becomes a single tensor containing the states from every layer, concatenated along the column axis (i.e., its shape is [batch_size, n_layers * n_neurons]).
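
Continuing the example above, here is a quick sketch (with the default state_is_tuple=True) of how to access those states:

# states is a tuple of n_layers tensors, each of shape [batch_size, n_neurons];
# states[-1] is the final state of the top layer.
top_layer_state = states[-1]
print(len(states_val), states_val[-1].shape)  # 3 (2, 100)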

14.4.1 Distributing a Deep RNN Across Multiple GPUs

import tensorflow as tf
import numpy as np

class DeviceCellWrapper(tf.nn.rnn_cell.RNNCell):
    def __init__(self,device,cell):
        self._cell=cell
        self._device=device
    
    @property
    def state_size(self):
        return self._cell.state_size
    
    @property
    def output_size(self):
        return self._cell.output_size
    
    def __call__(self,inputs,state,scope=None):
        with tf.device(self._device):
            return self._cell(inputs,state,scope)
        
tf.reset_default_graph()

n_inputs =5
n_steps=20
n_neurons=100

X=tf.placeholder(tf.float32,[None,n_steps,n_inputs])
devices=["/cpu:0","/cpu:0","/cpu:0"]#replace with ["/gpu:0", "/gpu:1", "/gpu:2"] if you have 3 GPUs
cells=[DeviceCellWrapper(dev,tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons))
       for dev in devices]
multi_layer_cell =tf.nn.rnn_cell.MultiRNNCell(cells)
outputs,states = tf.nn.dynamic_rnn(multi_layer_cell,X,dtype=tf.float32)

init=tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    print(sess.run(outputs,feed_dict={X:np.random.rand(2,n_steps,n_inputs)}))

Do not set state_is_tuple=False, or the MultiRNNCell will concatenate all the cell states into a single tensor, on a single GPU.

14.4.2 Applying Dropout

If you also want to apply dropout between the RNN layers, you need to use a DropoutWrapper.

tf.reset_default_graph()

n_inputs=1
n_steps=20
n_neurons=100
n_layers=3
n_outputs=1

X=tf.placeholder(tf.float32,[None,n_steps,n_inputs])
y=tf.placeholder(tf.float32,[None,n_steps,n_outputs])

keep_prob=tf.placeholder_with_default(1.0,[])
cells=[tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
      for layer in range(n_layers)]
cells_drop = [tf.nn.rnn_cell.DropoutWrapper(cell,input_keep_prob=keep_prob)
              for cell in cells]
multi_layer_cell=tf.nn.rnn_cell.MultiRNNCell(cells_drop)
rnn_outputs,states = tf.nn.dynamic_rnn(multi_layer_cell,X,
                                   dtype=tf.float32)
learning_rate=0.01

stacked_rnn_outputs=tf.reshape(rnn_outputs,[-1,n_neurons])
stacked_outputs=tf.layers.dense(stacked_rnn_outputs,n_outputs)
outputs=tf.reshape(stacked_outputs,[-1,n_steps,n_outputs])

loss=tf.reduce_mean(tf.square(outputs-y))
optimizer =tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op =optimizer.minimize(loss)

init=tf.global_variables_initializer()
saver=tf.train.Saver()
n_iterations=1500
batch_size=50
train_keep_prob=0.5

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch,y_batch=next_batch(batch_size,n_steps)
        _,mse=sess.run([training_op,loss],
                      feed_dict={X:X_batch,y:y_batch,keep_prob:train_keep_prob})
        if iteration%100==0:
            print(iteration,"Training MSE:",mse)
    saver.save(sess,"./my_dropout_time_series_model")

14.4.3 The Difficulty of Training over Many Time Steps

To train an RNN on long sequences, you need to unroll it over many time steps, which makes training very slow. One solution is called truncated backpropagation through time: unroll the RNN only over a limited number of time steps during training.

The downside is that the model will not be able to learn long-term patterns. One workaround is to make sure that these shortened sequences contain both old and recent data, though it is not obvious how far back the old data should go. A second problem is that the memory of the first inputs gradually fades away as data traverses the RNN.
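
A rough sketch of truncated BPTT (not from the book): slice a long series into consecutive windows of n_steps, and carry the final state of each window into the next one through an initial_state placeholder, so that gradients only flow within a window.

# Hedged sketch: the graph is unrolled over only n_steps time steps.
init_state = tf.placeholder(tf.float32, [None, n_neurons])
cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, final_state = tf.nn.dynamic_rnn(cell, X, initial_state=init_state)

# Training loop (window_batches is a hypothetical iterator over consecutive windows):
# state_val = np.zeros((batch_size, n_neurons))
# for X_window, y_window in window_batches:
#     _, state_val = sess.run([training_op, final_state],
#                             feed_dict={X: X_window, y: y_window,
#                                        init_state: state_val})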

14.5 LSTM Cell


The state of an LSTM cell is split into two vectors: $\textbf h_{(t)}$ and $\textbf c_{(t)}$ (“c” stands for “cell”). You can think of $\textbf h_{(t)}$ as the short-term state and $\textbf c_{(t)}$ as the long-term state.

The key idea is that the network can learn what to store in the long-term state, what to throw away, and what to read from it. As the long-term state c(t–1) traverses the network from left to right, you can see that it first goes through a forget gate, dropping some memories, and then it adds some new memories via the addition operation (which adds the memories that were selected by an input gate). The result c(t) is sent straight out, without any further transformation. So, at each time step, some memories are dropped and some memories are added. Moreover, after the addition operation, the long-term state is copied and passed through the tanh function, and then the result is filtered by the output gate. This produces the short-term state h(t) (which is equal to the cell’s output for this time step y(t)). Now let’s look at where new memories come from and how the gates work.

First, the current input vector x(t) and the previous short-term state h(t–1) are fed to four different fully connected layers. They all serve a different purpose:

  • The main layer is the one that outputs g(t). It has the usual role of analyzing the current inputs x(t) and the previous (short-term) state h(t–1). In a basic cell, there is nothing other than this layer, and its output goes straight out to y(t) and h(t). In contrast, in an LSTM cell this layer’s output does not go straight out, but instead it is partially stored in the long-term state.
  • The three other layers are gate controllers. Since they use the logistic activation function, their outputs range from 0 to 1. As you can see, their outputs are fed to element-wise multiplication operations, so if they output 0s, they close the gate, and if they output 1s, they open it. Specifically:
    • The forget gate (controlled by f(t)) controls which parts of the long-term state should be erased.
    • The input gate (controlled by i(t)) controls which parts of g(t) should be added to the long-term state (this is why we said it was only “partially stored”).
    • Finally, the output gate (controlled by o(t)) controls which parts of the long-term state should be read and output at this time step (both to h(t) and y(t)).

In short, an LSTM cell can learn to recognize an important input (that’s the role of the input gate), store it in the long-term state, learn to preserve it for as long as it is needed (that’s the role of the forget gate), and learn to extract it whenever it is needed. This explains why they have been amazingly successful at capturing long-term patterns in time series, long texts, audio recordings, and more.

Equation 14-3 summarizes how to compute the cell’s long-term state, its short-term
state, and its output at each time step for a single instance (the equations for a whole
mini-batch are very similar).

Equation 14-3. LSTM computations
$$
\textbf i_{(t)}=\sigma\left(\textbf W_{xi}^T\cdot \textbf x_{(t)}+\textbf W_{hi}^T\cdot \textbf h_{(t-1)}+\textbf b_i\right)\\
\textbf f_{(t)}=\sigma\left(\textbf W_{xf}^T\cdot \textbf x_{(t)}+\textbf W_{hf}^T\cdot \textbf h_{(t-1)}+\textbf b_f\right)\\
\textbf o_{(t)}=\sigma\left(\textbf W_{xo}^T\cdot \textbf x_{(t)}+\textbf W_{ho}^T\cdot \textbf h_{(t-1)}+\textbf b_o\right)\\
\textbf g_{(t)}=\tanh\left(\textbf W_{xg}^T\cdot \textbf x_{(t)}+\textbf W_{hg}^T\cdot \textbf h_{(t-1)}+\textbf b_g\right)\\
\textbf c_{(t)}=\textbf f_{(t)}\otimes \textbf c_{(t-1)}+\textbf i_{(t)}\otimes \textbf g_{(t)}\\
\textbf y_{(t)}=\textbf h_{(t)}=\textbf o_{(t)}\otimes \tanh\left(\textbf c_{(t)}\right)
$$

  • Wxi, Wxf, Wxo, Wxg are the weight matrices of each of the four layers for their connection to the input vector x(t).
  • Whi, Whf, Who, and Whg are the weight matrices of each of the four layers for their
    connection to the previous short-term state h(t–1).
  • bi, bf, bo, and bg are the bias terms for each of the four layers. Note that TensorFlow initializes bf to a vector full of 1s instead of 0s. This prevents forgetting everything at the beginning of training.
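
To make Equation 14-3 concrete, here is a minimal NumPy sketch (not from the book) of a single LSTM step for one instance; the weight matrices are random stand-ins:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    i = sigmoid(W["xi"].T @ x + W["hi"].T @ h_prev + b["i"])   # input gate
    f = sigmoid(W["xf"].T @ x + W["hf"].T @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["xo"].T @ x + W["ho"].T @ h_prev + b["o"])   # output gate
    g = np.tanh(W["xg"].T @ x + W["hg"].T @ h_prev + b["g"])   # candidate memories
    c = f * c_prev + i * g                                     # new long-term state
    y = h = o * np.tanh(c)                                     # new short-term state = output
    return y, h, c

n_inputs, n_neurons = 3, 5
rng = np.random.RandomState(42)
W = {k: rng.randn(n_inputs, n_neurons) for k in ("xi", "xf", "xo", "xg")}
W.update({k: rng.randn(n_neurons, n_neurons) for k in ("hi", "hf", "ho", "hg")})
b = {"i": np.zeros(n_neurons), "f": np.ones(n_neurons),   # b_f starts at 1s
     "o": np.zeros(n_neurons), "g": np.zeros(n_neurons)}

y, h, c = lstm_step(rng.rand(n_inputs), np.zeros(n_neurons), np.zeros(n_neurons), W, b)
print(y.shape, c.shape)  # (5,) (5,)

In TensorFlow, you simply use an LSTMCell instead of a BasicRNNCell. The following code builds and trains a deep LSTM classifier on MNIST (reusing X_train, y_train, X_test, y_test, and shuffle_batch() from the sequence-classifier example):
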
tf.reset_default_graph()

n_inputs=28
n_steps=28
n_layers=3
n_neurons=150
n_outputs=10

learning_rate=0.001

X=tf.placeholder(tf.float32,[None,n_steps,n_inputs])
y=tf.placeholder(tf.int32,[None])

lstm_cells= [tf.nn.rnn_cell.LSTMCell(num_units=n_neurons)
            for layer in range(n_layers)]
multi_cell=tf.nn.rnn_cell.MultiRNNCell(lstm_cells)
outputs,states = tf.nn.dynamic_rnn(multi_cell,X,dtype=tf.float32)
top_layer_h_state=states[-1][1]
logits=tf.layers.dense(top_layer_h_state,n_outputs,name="softmax")
xentropy=tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,logits=logits)
loss=tf.reduce_mean(xentropy,name="loss")
optimizer=tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op=optimizer.minimize(loss)
correct=tf.nn.in_top_k(logits,y,1)
accuracy=tf.reduce_mean(tf.cast(correct,tf.float32))

init=tf.global_variables_initializer()

n_epochs=10
batch_size=150
X_test=X_test.reshape((-1,n_steps,n_inputs))
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch,y_batch in shuffle_batch(X_train,y_train,batch_size):
            X_batch=X_batch.reshape((-1,n_steps,n_inputs))
            sess.run(training_op,feed_dict={X:X_batch,y:y_batch})
        acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Last batch accuracy:", acc_batch, "Test accuracy:", acc_test)

14.5.1 Peephole Connections

A popular LSTM variant adds extra connections called peephole connections: the previous long-term state c(t–1) is added as an input to the controllers of the forget gate and the input gate, and the current long-term state c(t) is added as an input to the controller of the output gate.

lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=n_neurons, use_peepholes=True)

14.6 GRU Cell

The GRU (Gated Recurrent Unit) cell is a simplified version of the LSTM cell, and it seems to perform just as well (which explains its growing popularity).


The main simplifications are:

  • Both state vectors are merged into a single vector h(t).
  • A single gate controller controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed. If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a frequent variant of the LSTM cell in and of itself.
  • There is no output gate; the full state vector is output at every time step. However, there is a new gate controller that controls which part of the previous state will be shown to the main layer.

Equation 14-4. GRU computations
$$
\textbf z_{(t)}=\sigma\left(\textbf W_{xz}^T\cdot \textbf x_{(t)}+\textbf W_{hz}^T\cdot \textbf h_{(t-1)}\right)\\
\textbf r_{(t)}=\sigma\left(\textbf W_{xr}^T\cdot \textbf x_{(t)}+\textbf W_{hr}^T\cdot \textbf h_{(t-1)}\right)\\
\textbf g_{(t)}=\tanh\left(\textbf W_{xg}^T\cdot \textbf x_{(t)}+\textbf W_{hg}^T\cdot \left(\textbf r_{(t)}\otimes\textbf h_{(t-1)}\right)\right)\\
\textbf h_{(t)}=\textbf h_{(t-1)}\otimes\left(1-\textbf z_{(t)}\right)+\textbf z_{(t)}\otimes \textbf g_{(t)}
$$
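
Analogously to the LSTM sketch above (reusing its sigmoid helper and rng), a single GRU step for one instance can be sketched in NumPy as follows; again the weights are random stand-ins:

def gru_step(x, h_prev, W):
    z = sigmoid(W["xz"].T @ x + W["hz"].T @ h_prev)        # update gate controller
    r = sigmoid(W["xr"].T @ x + W["hr"].T @ h_prev)        # reset gate controller
    g = np.tanh(W["xg"].T @ x + W["hg"].T @ (r * h_prev))  # candidate state
    return h_prev * (1 - z) + z * g                        # new state (= output)

n_inputs, n_neurons = 3, 5
Wg = {k: rng.randn(n_inputs, n_neurons) for k in ("xz", "xr", "xg")}
Wg.update({k: rng.randn(n_neurons, n_neurons) for k in ("hz", "hr", "hg")})
h = gru_step(rng.rand(n_inputs), np.zeros(n_neurons), Wg)
print(h.shape)  # (5,)

In TensorFlow, creating a GRU cell is a one-liner: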

gru_cell = tf.nn.rnn_cell.GRUCell(num_units=n_neurons)

14.7 Natural Language Processing

For the full details, check out TensorFlow's Word2Vec and Seq2Seq tutorials.

14.7.1 Word Embeddings

For more details, check out Christopher Olah’s great post, or Sebastian Ruder’s series of posts.

14.7.2 An Encoder-Decoder Network for Machine Translation

The English sentences are fed to the encoder, and the decoder outputs the French translations. Note that the French translations are also used as inputs to the decoder, but pushed back by one step. In other words, the decoder is given as input the word that it should have output at the previous step (regardless of what it actually output). For the very first word, it is given a token that represents the beginning of the sentence (e.g., “<go>”). The decoder is expected to end the sentence with an end-of-sequence (EOS) token (e.g., “<eos>”).

Note that the English sentences are reversed before they are fed to the encoder. For example, “I drink milk” is reversed to “milk drink I.” This ensures that the beginning of the English sentence will be fed last to the encoder, which is useful because that’s generally the first thing that the decoder needs to translate.

Each word is initially represented by a simple integer identifier (e.g., 288 for the word “milk”). Next, an embedding lookup returns the word embedding (as explained earlier, this is a dense, fairly low-dimensional vector). These word embeddings are what is actually fed to the encoder and the decoder.

At each step, the decoder outputs a score for each word in the output vocabulary (i.e., French), and then the Softmax layer turns these scores into probabilities. For example, at the first step the word “Je” may have a probability of 20%, “Tu” may have a probability of 1%, and so on. The word with the highest probability is output. This is very much like a regular classification task, so you can train the model using the softmax_cross_entropy_with_logits() function.

Note that at inference time (after training), you will not have the target sentence to feed to the decoder. Instead, simply feed the decoder the word that it output at the previous step, as shown in Figure 14-16 (this will require an embedding lookup that is not shown on the diagram).
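
As a very rough sketch (not the tutorial's implementation; no buckets, no attention), an Encoder–Decoder can be wired up in TensorFlow 1.x by feeding the encoder's final state in as the decoder's initial state. The vocabulary sizes, embedding size, and placeholder names below are illustrative assumptions:

# Hedged sketch of an Encoder-Decoder trained with teacher forcing.
n_neurons = 512
embedding_size = 128
en_vocab_size, fr_vocab_size = 10000, 10000

encoder_inputs = tf.placeholder(tf.int32, [None, None])  # reversed English word ids
decoder_inputs = tf.placeholder(tf.int32, [None, None])  # French word ids, starting with <go>

en_embeddings = tf.Variable(tf.random_uniform([en_vocab_size, embedding_size], -1.0, 1.0))
fr_embeddings = tf.Variable(tf.random_uniform([fr_vocab_size, embedding_size], -1.0, 1.0))
encoder_in = tf.nn.embedding_lookup(en_embeddings, encoder_inputs)
decoder_in = tf.nn.embedding_lookup(fr_embeddings, decoder_inputs)

with tf.variable_scope("encoder"):
    encoder_cell = tf.nn.rnn_cell.GRUCell(num_units=n_neurons)
    _, encoder_state = tf.nn.dynamic_rnn(encoder_cell, encoder_in, dtype=tf.float32)

with tf.variable_scope("decoder"):
    decoder_cell = tf.nn.rnn_cell.GRUCell(num_units=n_neurons)
    decoder_outputs, _ = tf.nn.dynamic_rnn(decoder_cell, decoder_in, initial_state=encoder_state)

logits = tf.layers.dense(decoder_outputs, fr_vocab_size)  # one score per French word, per step

Training would then apply sparse_softmax_cross_entropy_with_logits() to these logits against the target French word ids (the decoder inputs shifted by one step), masking out any padded positions.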

If you go through TensorFlow’s sequence-to-sequence tutorial and you look at the code in rnn/translate/seq2seq_model.py (in the TensorFlow models), you will notice a few important differences:

  • First, there is the problem of variable-length input sequences. Solution: sentences are grouped into buckets of similar lengths (e.g., a bucket for the 1- to 6-word sentences, another for the 7- to 12-word sentences, and so on), and the shorter sentences are padded using a special padding token (e.g., “<pad>”). For example, “I drink milk” becomes “<pad> <pad> <pad> milk drink I”, and its translation becomes “<go> Je bois du lait <eos> <pad>”. Of course, we want to ignore any output past the EOS token. For this, the tutorial’s implementation uses a target_weights vector. For example, for the target sentence “Je bois du lait <eos> <pad>”, the weights would be set to [1.0, 1.0, 1.0, 1.0, 1.0, 0.0] (notice the weight 0.0 that corresponds to the padding token in the target sentence). Simply multiplying the losses by the target weights will zero out the losses that correspond to words past EOS tokens.
  • Second, computing probabilities over a large vocabulary is very computationally intensive. Solution: let the decoder output much smaller vectors, such as 1,000-dimensional vectors, then use a sampling technique to estimate the loss without having to compute it over every single word in the target vocabulary. In TensorFlow you can use the sampled_softmax_loss() function (see the sketch after this list).
  • Third, the tutorial’s implementation uses an attention mechanism that lets the decoder peek into the input sequence.
  • Finally, the tutorial’s implementation makes use of the tf.nn.legacy_seq2seq module, which provides tools to build various Encoder–Decoder models easily.
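
As a rough sketch of the sampled-softmax idea (the tensor names and sizes below are illustrative assumptions, not the tutorial's code): during training the loss is estimated from a small sample of classes, while at inference time the full logits are computed.

# Hedged sketch: sampled softmax over a large output vocabulary.
vocab_size, n_neurons, num_sampled = 50000, 512, 1000

proj_w = tf.Variable(tf.random_normal([vocab_size, n_neurons]))   # output projection weights
proj_b = tf.Variable(tf.zeros([vocab_size]))

decoder_output = tf.placeholder(tf.float32, [None, n_neurons])    # decoder output for one step
target_word_ids = tf.placeholder(tf.int64, [None])                # correct French word ids

train_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=proj_w, biases=proj_b,
    labels=tf.reshape(target_word_ids, [-1, 1]),                  # shape [batch_size, 1]
    inputs=decoder_output,
    num_sampled=num_sampled, num_classes=vocab_size))

# At inference time, compute the full logits instead:
full_logits = tf.matmul(decoder_output, proj_w, transpose_b=True) + proj_b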