Building your Recurrent Neural Network - Step by Step
Welcome to Course 5’s first assignment! In this assignment, you will implement your first Recurrent Neural Network in numpy.
Recurrent Neural Networks (RNN) are very effective for Natural Language Processing and other sequence tasks because they have “memory”. They can read inputs $x^{\langle t \rangle}$ (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a uni-directional RNN to take information from the past to process later inputs. A bidirectional RNN can take context from both the past and the future.
Notation:
- Superscript $[l]$ denotes an object associated with the $l^{th}$ layer.
  - Example: $a^{[4]}$ is the $4^{th}$ layer activation. $W^{[5]}$ and $b^{[5]}$ are the $5^{th}$ layer parameters.
- Superscript $(i)$ denotes an object associated with the $i^{th}$ example.
  - Example: $x^{(i)}$ is the $i^{th}$ training example input.
- Superscript $\langle t \rangle$ denotes an object at the $t^{th}$ time-step.
  - Example: $x^{\langle t \rangle}$ is the input $x$ at the $t^{th}$ time-step. $x^{(i)\langle t \rangle}$ is the input at the $t^{th}$ time-step of example $i$.
- Subscript $i$ denotes the $i^{th}$ entry of a vector.
  - Example: $a^{[l]}_i$ denotes the $i^{th}$ entry of the activations in layer $l$.
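As a rough illustration of how this notation maps onto numpy indexing, the sketch below assumes a batch stored as an array of shape (n_x, m, T_x). That layout, the sizes, and the variable names are assumptions made for illustration; they are not defined in the text above. Note that numpy indices are zero-based, while the notation counts from one.

```python
import numpy as np

# Illustrative sizes (assumed): input dimension, number of examples, time-steps
n_x, m, T_x = 3, 4, 5
x = np.arange(n_x * m * T_x, dtype=float).reshape(n_x, m, T_x)

i, t = 2, 1            # zero-based indices of an example and a time-step
x_t = x[:, :, t]       # x^<t>: inputs of all examples at time-step t, shape (n_x, m)
x_i_t = x[:, i, t]     # x^(i)<t>: input of example i at time-step t, shape (n_x,)
entry = x_i_t[0]       # subscript: first entry of that vector

print(x_t.shape, x_i_t.shape, entry)
```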
We assume that you are already familiar with numpy and/or have completed the previous courses of the specialization. Let’s get started!
rnn_utils.py
import numpy as np
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def smooth(loss, cur_loss):
    return loss * 0.999 + cur_loss * 0.001

def print_sample(sample_ix, ix_to_char):
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print('----\n %s \n----' % (txt, ))

def get_initial_loss(vocab_size, seq_length):
    return -np.log(1.0/vocab_size)*seq_length
def initialize_parameters(n_a, n_x, n_y):
    """
    Initialize parameters with small random values

    Returns:
    parameters -- python dictionary containing:
        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
        b -- Bias, numpy array of shape (n_a, 1)
        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    """
    np.random.seed(1)
    Wax = np.random.randn(n_a, n_x)*0.01  # input to hidden
    Waa = np.random.randn(n_a, n_a)*0.01  # hidden to hidden
    Wya = np.random.randn(n_y, n_a)*0.01  # hidden to output
    b = np.zeros((n_a, 1))  # hidden bias
    by = np.zeros((n_y, 1))  # output bias
    parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b, "by": by}
    return parameters
def rnn_step_forward(parameters, a_prev, x):
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)  # hidden state
    p_t = softmax(np.dot(Wya, a_next) + by)  # probabilities for next chars
    return a_next, p_t
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next']  # backprop into h
    daraw = (1 - a * a) * da  # backprop through tanh nonlinearity
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    return gradients
def update_parameters(parameters, gradients, lr):
    parameters['Wax'] += -lr * gradients['dWax']
    parameters['Waa'] += -lr * gradients['dWaa']
    parameters['Wya'] += -lr * gradients['dWya']
    parameters['b'] += -lr * gradients['db']
    parameters['by'] += -lr * gradients['dby']
    return parameters
def rnn_forward(X, Y, a0, parameters, vocab_size = 71):
    # Initialize x, a and y_hat as empty dictionaries
    x, a, y_hat = {}, {}, {}
    a[-1] = np.copy(a0)
    # initialize your loss to 0
    loss = 0
    for t in range(len(X)):
        # Set x[t] to be the one-hot vector representation of the t'th character in X.
        x[t] = np.zeros((vocab_size,1))
        x[t][X[t]] = 1
        # Run one step forward of the RNN
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t-1], x[t])
        # Update the loss by subtracting the log-probability of the correct character
        # at this time-step (i.e. adding its cross-entropy term).
        loss -= np.log(y_hat[t][Y[t],0])
    cache = (y_hat, a, x)
    return loss, cache
def rnn_backward(X, Y, parameters, cache):
    # Initialize gradients as an empty dictionary
    gradients = {}
    # Retrieve from cache and parameters
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    # each one should be initialized to zeros of the same dimension as its corresponding parameter
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])
    ### START CODE HERE ###
    # Backpropagate through time
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])
    ### END CODE HERE ###
    return gradients, a
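The softmax helper in rnn_utils.py subtracts the maximum before exponentiating. A small sketch of why that shift matters follows. Here `naive_softmax` is a hypothetical comparison function written for this illustration, not part of the utilities, and the input values are made up.

```python
import numpy as np

def naive_softmax(x):
    # Hypothetical comparison: exponentiates directly, so large inputs overflow.
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=0)

def softmax(x):
    # Same definition as in rnn_utils.py: shift by the max, then normalize over axis 0.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

z = np.array([[1000.0], [1001.0], [1002.0]])   # one column of large scores

# Silence the overflow/invalid warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(z)                   # exp(1000) overflows to inf; inf/inf is nan

stable = softmax(z)                            # shifted scores are [-2, -1, 0]

print("naive result is all nan:", np.isnan(naive).all())
print("stable result:", stable.ravel(), "sum:", stable.sum())
```

Subtracting a constant from every score leaves the softmax output unchanged mathematically, so the shift only affects numerical behaviour.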
Let’s first import all the packages that you will need during this assignment.
import numpy as np
from rnn_utils import *
1 - Forward propagation for the basic Recurrent Neural Network
Later this week, you will generate music using an RNN. The basic RNN that you will implement has the structure below. In this example, $T_x = T_y$.
Here’s how you can implement an RNN:
Steps:
- Implement the calculations needed for one time-step of the RNN.
- Implement a loop over $T_x$ time-steps in order to process all the inputs, one at a time.
Let’s go!
1.1 - RNN cell
A recurrent neural network can be seen as the repetition of a single cell. You are first going to implement the computations for a single time-step. The following figure describes the operations for a single time-step of an RNN cell.
Exercise: Implement the RNN-cell described in Figure (2).
Instructions:
- Compute the hidden state with tanh activation: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$.
- Using your new hidden state $a^{\langle t \rangle}$, compute the prediction $\hat{y}^{\langle t \rangle} = \mathrm{softmax}(W_{ya} a^{\langle t \rangle} + b_y)$. We provide you with a function: softmax.
- Store $(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$ in cache.
- Return $a^{\langle t \rangle}$, $\hat{y}^{\langle t \rangle}$ and cache.
We will vectorize over $m$ examples. Thus, $x^{\langle t \rangle}$ will have dimension $(n_x, m)$, and $a^{\langle t \rangle}$ will have dimension $(n_a, m)$.
# GRADED FUNCTION: rnn_cell_forward
def rnn_cell_forward(xt, a_prev, parameters):
    """
    Implements a single forward step of the RNN-cell as described in Figure (2)

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
        ba -- Bias, numpy array of shape (n_a, 1)
        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]
    ### START CODE HERE ### (≈2 lines)
    # compute next activation state using the formula given above
    # (np.tanh avoids the overflow that an explicit exp-based formula can hit for large inputs)
    a_next = np.tanh(np.dot(Wax, xt) + np.dot(Waa, a_prev) + ba)
    # compute output of the current cell using the formula given above
    yt_pred = softmax(np.dot(Wya, a_next) + by)
    ### END CODE HERE ###
    # store values you need for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)
    return a_next, yt_pred, cache
Input:
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
Waa = np.random.randn(5,5)
Wax = np.random.randn(5,3)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}
a_next, yt_pred, cache = rnn_cell_forward(xt, a_prev, parameters)
print("a_next[4] = ", a_next[4])
print("a_next.shape = ", a_next.shape)
print("yt_pred[1] =", yt_pred[1])
print("yt_pred.shape = ", yt_pred.shape)
Output:
a_next[4] = [ 0.59584544 0.18141802 0.61311866 0.99808218 0.85016201 0.99980978 -0.18887155 0.99815551 0.6531151 0.82872037]
a_next.shape = (5, 10)
yt_pred[1] = [ 0.9888161 0.01682021 0.21140899 0.36817467 0.98988387 0.88945212
0.36920224 0.9966312 0.9982559 0.17746526]
yt_pred.shape = (2, 10)
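Beyond comparing against the printed numbers, a cell like this can be sanity-checked through properties that should hold by construction: the hidden state lies within [-1, 1] because of tanh, and each column of the prediction sums to 1 because of softmax. The sketch below re-implements the cell compactly so that it runs on its own; `cell_forward` and the random inputs are assumptions made for illustration, not the graded function.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def cell_forward(xt, a_prev, Wax, Waa, Wya, ba, by):
    # Hypothetical compact version of the cell: tanh hidden state, softmax output.
    a_next = np.tanh(Wax @ xt + Waa @ a_prev + ba)
    yt_pred = softmax(Wya @ a_next + by)
    return a_next, yt_pred

rng = np.random.default_rng(0)
n_x, n_a, n_y, m = 3, 5, 2, 10        # same sizes as the test input above
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((n_y, n_a))
ba = rng.standard_normal((n_a, 1))    # broadcast across the m columns
by = rng.standard_normal((n_y, 1))

a_next, yt_pred = cell_forward(xt, a_prev, Wax, Waa, Wya, ba, by)

print("shapes:", a_next.shape, yt_pred.shape)
print("hidden state within [-1, 1]:", bool(np.all(np.abs(a_next) <= 1.0)))
print("prediction columns sum to 1:", bool(np.allclose(yt_pred.sum(axis=0), 1.0)))
```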
1.2 - RNN forward pass
You can see an RNN as the repetition of the cell you’ve just built. If your input sequence of data is carried over 10 time steps, then you will copy the RNN cell 10 times. Each cell takes as input the hidden state from the previous cell ($a^{\langle t-1 \rangle}$) and the current time-step’s input data ($x^{\langle t \rangle}$). It outputs a hidden state ($a^{\langle t \rangle}$) and a prediction ($\hat{y}^{\langle t \rangle}$) for this time-step.
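The repetition described above can be sketched as a plain loop that threads the hidden state from one step to the next. This is a minimal sketch under assumptions not stated in the text: the inputs are stacked in an array `x` of shape (n_x, m, T_x), the initial hidden state `a0` is all zeros, and `unrolled_forward` is a hypothetical name. It also skips the caches that a backward pass would need.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def unrolled_forward(x, a0, Wax, Waa, Wya, ba, by):
    # x: (n_x, m, T_x) is an assumed layout; a0: (n_a, m)
    n_a, m = a0.shape
    n_y = Wya.shape[0]
    T_x = x.shape[2]
    a = np.zeros((n_a, m, T_x))          # hidden states for every time-step
    y_pred = np.zeros((n_y, m, T_x))     # predictions for every time-step
    a_prev = a0
    for t in range(T_x):
        # One cell: the previous hidden state and the current input give the new hidden state.
        a_prev = np.tanh(Wax @ x[:, :, t] + Waa @ a_prev + ba)
        a[:, :, t] = a_prev
        y_pred[:, :, t] = softmax(Wya @ a_prev + by)
    return a, y_pred

rng = np.random.default_rng(1)
n_x, n_a, n_y, m, T_x = 3, 5, 2, 10, 10   # 10 time-steps, as in the example above
x = rng.standard_normal((n_x, m, T_x))
a0 = np.zeros((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((n_y, n_a))
ba = rng.standard_normal((n_a, 1))
by = rng.standard_normal((n_y, 1))

a, y_pred = unrolled_forward(x, a0, Wax, Waa, Wya, ba, by)
print(a.shape, y_pred.shape)
```

The same weights are reused at every step; only the hidden state and the input change from one iteration to the next.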
Exercise: Code the forward propagation of the RNN described in Figure (3).
Instructions:
- Create a vector of zeros ($a$) that will store all the hidden