"""layer.py: step-by-step implementations of the individual layers.

This file defines layer types that are commonly used for recurrent neural
networks.
"""
import numpy as np

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    next_h, cache = None, None
    ##############################################################################
    # TODO: Implement a single forward step for the vanilla RNN. Store the next  #
    # hidden state and any values you need for the backward pass in the next_h   #
    # and cache variables respectively.                                          #
    ##############################################################################
    # Pre-activation: combine the previous hidden state and the current input.
    a = prev_h.dot(Wh) + x.dot(Wx) + b
    next_h = np.tanh(a)
    cache = (x, prev_h, Wh, Wx, b, next_h)
    return next_h, cache


def rnn_step_backward(dnext_h, cache):
    """Backward pass for a single timestep of a vanilla RNN.
    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a single step of a vanilla RNN.      #
    #                                                                            #
    # HINT: For the tanh function, you can compute the local derivative in terms #
    # of the output value from tanh.                                             #
    ##############################################################################
    x, prev_h, Wh, Wx, b, next_h = cache
    # Backprop through tanh: d/da tanh(a) = 1 - tanh(a)^2 = 1 - next_h^2.
    da = dnext_h * (1 - next_h * next_h)
    dx = da.dot(Wx.T)
    dprev_h = da.dot(Wh.T)
    dWx = x.T.dot(da)
    dWh = prev_h.T.dot(da)
    db = np.sum(da, axis=0)
    return dx, dprev_h, dWx, dWh, db
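
# A minimal usage sketch (not part of the original assignment code): it runs a
# single RNN step on random data and checks that the backward pass returns
# gradients with the same shapes as the corresponding inputs. The helper name
# `_demo_rnn_step` is illustrative, not a required API.
def _demo_rnn_step():
    N, D, H = 3, 10, 4
    x = np.random.randn(N, D)
    prev_h = np.random.randn(N, H)
    Wx = np.random.randn(D, H)
    Wh = np.random.randn(H, H)
    b = np.random.randn(H)
    next_h, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
    dx, dprev_h, dWx, dWh, db = rnn_step_backward(np.random.randn(N, H), cache)
    assert next_h.shape == (N, H)
    assert dx.shape == x.shape and dprev_h.shape == prev_h.shape
    assert dWx.shape == Wx.shape and dWh.shape == Wh.shape and db.shape == b.shape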

def rnn_forward(x, h0, Wx, Wh, b):
    """Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    h, cache = None, None
    ##############################################################################
    # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
    # input data. You should use the rnn_step_forward function that you defined  #
    # above. You can use a for loop to help compute the forward pass.            #
    ##############################################################################
    N, T, D = x.shape
    H = b.shape[0]
    h = np.zeros((N, T, H))
    prev_h = h0
    cache = []
    # Step through the sequence, reusing the single-timestep forward pass.
    for t in range(T):
        xt = x[:, t, :]
        next_h, step_cache = rnn_step_forward(xt, prev_h, Wx, Wh, b)
        cache.append(step_cache)
        h[:, t, :] = next_h
        prev_h = next_h
    return h, cache


def rnn_backward(dh, cache):
    """Compute the backward pass for a vanilla RNN over an entire sequence of data.
    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H)

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a vanilla RNN running an entire      #
    # sequence of data. You should use the rnn_step_backward function that you   #
    # defined above. You can use a for loop to help compute the backward pass.   #
    ##############################################################################
    N, T, H = dh.shape
    # Recover D from the cached input of the first timestep.
    D = cache[0][0].shape[1]
    dprev_h = np.zeros((N, H))
    dx = np.zeros((N, T, D))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros((H,))
    # Walk backward through time, accumulating the shared weight gradients.
    for t in reversed(range(T)):
        dx[:, t, :], dprev_h, dWxt, dWht, dbt = rnn_step_backward(dh[:, t, :] + dprev_h, cache[t])
        dWx, dWh, db = dWx + dWxt, dWh + dWht, db + dbt
    dh0 = dprev_h
    return dx, dh0, dWx, dWh, db
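
# A minimal usage sketch (not part of the original assignment code): it unrolls
# the RNN over a short random sequence and checks that rnn_backward returns
# gradients matching the input shapes. The helper name `_demo_rnn_sequence` is
# illustrative, not a required API.
def _demo_rnn_sequence():
    N, T, D, H = 2, 5, 6, 4
    x = np.random.randn(N, T, D)
    h0 = np.random.randn(N, H)
    Wx = np.random.randn(D, H)
    Wh = np.random.randn(H, H)
    b = np.random.randn(H)
    h, cache = rnn_forward(x, h0, Wx, Wh, b)
    dx, dh0, dWx, dWh, db = rnn_backward(np.random.randn(N, T, H), cache)
    assert h.shape == (N, T, H)
    assert dx.shape == x.shape and dh0.shape == h0.shape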

def word_embedding_forward(x, W):
    """Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out, cache = None, None
    ##############################################################################
    # TODO: Implement the forward pass for word embeddings.                      #
    ##############################################################################
    N, T = x.shape
    V, D = W.shape
    out = np.zeros((N, T, D))
    # Look up the embedding vector for each word index.
    for i in range(N):
        for j in range(T):
            out[i, j] = W[x[i, j]]
    cache = (x, W.shape)
    return out, cache


def word_embedding_backward(dout, cache):
    """Backward pass for word embeddings. We cannot back-propagate into the words
    since they are integers, so we only return gradient for the word embedding
    matrix.

    HINT: Look up the function np.add.at

    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass

    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D).
    """
    dW = None
    ##############################################################################
    # TODO: Implement the backward pass for word embeddings.                     #
    #                                                                            #
    # Note that words can appear more than once in a sequence.                   #
    # HINT: Look up the function np.add.at                                       #
    ##############################################################################
    x, W_shape = cache
    dW = np.zeros(W_shape)
    # Scatter-add: repeated word indices accumulate their gradients in dW.
    np.add.at(dW, x, dout)
    return dW
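
# A minimal usage sketch (not part of the original assignment code): it embeds a
# tiny batch of word indices and checks that repeated words accumulate gradient
# via np.add.at. The helper name `_demo_word_embedding` is illustrative.
def _demo_word_embedding():
    V, D = 5, 3
    W = np.random.randn(V, D)
    x = np.array([[0, 3, 1, 2], [2, 1, 0, 3]])  # indices 0..3 each appear twice
    out, cache = word_embedding_forward(x, W)
    dW = word_embedding_backward(np.ones_like(out), cache)
    assert out.shape == (2, 4, D)
    # Indices 0..3 each occur twice, so their gradient rows are all 2; index 4 is unused.
    assert np.allclose(dW[:4], 2.0) and np.allclose(dW[4], 0.0)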

def sigmoid(x):
    """A numerically stable version of the logistic sigmoid function."""
    pos_mask = (x >= 0)
    neg_mask = (x < 0)
    z = np.zeros_like(x)
    z[pos_mask] = np.exp(-x[pos_mask])
    z[neg_mask] = np.exp(x[neg_mask])
    top = np.ones_like(x)
    top[neg_mask] = z[neg_mask]
    return top / (1 + z)
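
# A minimal usage sketch (not part of the original assignment code): it evaluates
# the stable sigmoid at large-magnitude inputs, where a naive 1 / (1 + exp(-x))
# could overflow, and checks the expected saturation values.
def _demo_sigmoid():
    x = np.array([-1000.0, 0.0, 1000.0])
    s = sigmoid(x)
    assert np.allclose(s, [0.0, 0.5, 1.0])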

def temporal_affine_forward(x, w, b):
    """Forward pass for a temporal affine layer. The input is a set of D-dimensional
    vectors arranged into a minibatch of N timeseries, each of length T. We use
    an affine function to transform each of those vectors into a new vector of
    dimension M.

    Inputs:
    - x: Input data of shape (N, T, D)
    - w: Weights of shape (D, M)
    - b: Biases of shape (M,)

    Returns a tuple of:
    - out: Output data of shape (N, T, M)
    - cache: Values needed for the backward pass
    """
    N, T, D = x.shape
    M = b.shape[0]
    # Flatten the time dimension, apply the affine transform, then restore it.
    out = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
    cache = x, w, b, out
    return out, cache


def temporal_affine_backward(dout, cache):
    """Backward pass for temporal affine layer.
    Input:
    - dout: Upstream gradients of shape (N, T, M)
    - cache: Values from forward pass

    Returns a tuple of:
    - dx: Gradient of input, of shape (N, T, D)
    - dw: Gradient of weights, of shape (D, M)
    - db: Gradient of biases, of shape (M,)
    """
    x, w, b, out = cache
    N, T, D = x.shape
    M = b.shape[0]
    dx = dout.reshape(N * T, M).dot(w.T).reshape(N, T, D)
    dw = dout.reshape(N * T, M).T.dot(x.reshape(N * T, D)).T
    db = dout.sum(axis=(0, 1))
    return dx, dw, db
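
# A minimal usage sketch (not part of the original assignment code): it applies
# the temporal affine layer to random data and checks forward/backward shapes.
# The helper name `_demo_temporal_affine` is illustrative.
def _demo_temporal_affine():
    N, T, D, M = 2, 3, 4, 5
    x = np.random.randn(N, T, D)
    w = np.random.randn(D, M)
    b = np.random.randn(M)
    out, cache = temporal_affine_forward(x, w, b)
    dx, dw, db = temporal_affine_backward(np.random.randn(N, T, M), cache)
    assert out.shape == (N, T, M)
    assert dx.shape == x.shape and dw.shape == w.shape and db.shape == b.shape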

def temporal_softmax_loss(x, y, mask, verbose=False):
    """A temporal version of softmax loss for use in RNNs. We assume that we are
    making predictions over a vocabulary of size V for each timestep of a
    timeseries of length T, over a minibatch of size N. The input x gives scores
    for all vocabulary elements at all timesteps, and y gives the indices of the
    ground-truth element at each timestep. We use a cross-entropy loss at each
    timestep, summing the loss over all timesteps and averaging across the
    minibatch.

    As an additional complication, we may want to ignore the model output at some
    timesteps, since sequences of different length may have been combined into a
    minibatch and padded with NULL tokens. The optional mask argument tells us
    which elements should contribute to the loss.

    Inputs:
    - x: Input scores, of shape (N, T, V)
    - y: Ground-truth indices, of shape (N, T) where each element is in the range
      0 <= y[i, t] < V
    - mask: Boolean array of shape (N, T) where mask[i, t] tells whether or not
      the scores at x[i, t] should contribute to the loss.

    Returns a tuple of:
    - loss: Scalar giving loss
    - dx: Gradient of loss with respect to scores x.
    """
    N, T, V = x.shape

    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    mask_flat = mask.reshape(N * T)

    # Numerically stable softmax over the vocabulary dimension.
    probs = np.exp(x_flat - np.max(x_flat, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    # Cross-entropy over unmasked timesteps, averaged over the minibatch.
    loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N

    dx_flat = probs.copy()
    dx_flat[np.arange(N * T), y_flat] -= 1
    dx_flat /= N
    dx_flat *= mask_flat[:, None]

    if verbose:
        print('dx_flat:', dx_flat.shape)

    dx = dx_flat.reshape(N, T, V)
    return loss, dx
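
# A minimal usage sketch (not part of the original assignment code): with
# near-uniform random scores over V classes, the per-timestep cross-entropy is
# roughly log(V), so the loss (summed over T, averaged over N) is about
# T * log(V). The helper name `_demo_temporal_softmax_loss` is illustrative.
def _demo_temporal_softmax_loss():
    N, T, V = 4, 6, 10
    x = 0.001 * np.random.randn(N, T, V)
    y = np.random.randint(V, size=(N, T))
    mask = np.ones((N, T), dtype=bool)
    loss, dx = temporal_softmax_loss(x, y, mask)
    assert dx.shape == x.shape
    assert abs(loss - T * np.log(V)) < 0.1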