2020 cs231n Assignment 3 Notes: RNN_Captioning

RNN(Recurrent Neural Network)

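1. Introduction

A recurrent neural network (RNN) is a class of neural networks for processing sequential data. Just as a convolutional network is specialized for processing grid-structured data such as an image, a recurrent network is specialized for processing a sequence x^{(1)}, ..., x^{(k)}. Convolutional networks scale readily to images with large width and height and can handle images of variable size; recurrent networks scale to much longer sequences than would be practical for networks without sequence-based specialization, and most recurrent networks can also process sequences of variable length. A recurrent network introduces a state variable that stores information about the past and combines it with the current input to determine the current output.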

[Figure: network model]

[Figure: computational graph]

2. Assignment implementation

This assignment implements an RNN that writes captions for images.

2.1. Vanilla RNN: step forward

Forward pass for a single element (one timestep) of the sequence:

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    next_h, cache = None, None
    ##############################################################################
    # TODO: Implement a single forward step for the vanilla RNN. Store the next  #
    # hidden state and any values you need for the backward pass in the next_h   #
    # and cache variables respectively.                                          #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    next_h = np.tanh(np.dot(x,Wx) + np.dot(prev_h,Wh) + b)
    cache = x, Wx, prev_h, Wh, b
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return next_h, cache
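
Written as an equation, the step above computes the following update. The symbols x_t and h_{t-1} are just names for the arguments x and prev_h at timestep t; the shapes are the ones given in the docstring.

```latex
h_t = \tanh\left(x_t W_x + h_{t-1} W_h + b\right),
\qquad
x_t \in \mathbb{R}^{N \times D},\;
h_{t-1}, h_t \in \mathbb{R}^{N \times H},\;
W_x \in \mathbb{R}^{D \times H},\;
W_h \in \mathbb{R}^{H \times H},\;
b \in \mathbb{R}^{H}
```

The bias b is broadcast over the N rows of the minibatch.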

2.2. Vanilla RNN: step backward

Backward pass for a single element (one timestep) of the sequence:

def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a single step of a vanilla RNN.      #
    #                                                                            #
    # HINT: For the tanh function, you can compute the local derivative in terms #
    # of the output value from tanh.                                             #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    x, Wx, prev_h, Wh, b = cache
    # Recompute the tanh output; its local derivative is 1 - tanh^2.
    next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)  # (N, H)
    # Gradient with respect to the pre-activation, shared by every term below.
    dz = dnext_h * (1 - np.square(next_h))  # (N, H)
    dx = dz @ Wx.T
    dWx = x.T @ dz
    dprev_h = dz @ Wh.T
    dWh = prev_h.T @ dz
    db = np.sum(dz, axis=0)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dx, dprev_h, dWx, dWh, db
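
One way to gain confidence in the backward step is a numerical gradient check. The sketch below is self-contained and does not use the assignment's own helpers: step_forward, step_backward, numeric_grad and rel_error are names made up for this example, and the two step functions simply restate the formulas used above.

```python
import numpy as np

def step_forward(x, prev_h, Wx, Wh, b):
    # Same update as the forward step above.
    return np.tanh(x @ Wx + prev_h @ Wh + b)

def step_backward(dnext_h, x, prev_h, Wx, Wh, b):
    # Same gradients as the backward step above, in the order of the inputs.
    next_h = step_forward(x, prev_h, Wx, Wh, b)
    dz = dnext_h * (1 - next_h ** 2)
    return dz @ Wx.T, dz @ Wh.T, x.T @ dz, prev_h.T @ dz, dz.sum(axis=0)

def numeric_grad(f, a, dout, eps=1e-5):
    # Centered finite differences of sum(f() * dout) with respect to array a.
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        old = a[idx]
        a[idx] = old + eps
        plus = f()
        a[idx] = old - eps
        minus = f()
        a[idx] = old  # restore the original value
        grad[idx] = np.sum((plus - minus) * dout) / (2 * eps)
    return grad

def rel_error(a, b):
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

rng = np.random.default_rng(0)
N, D, H = 3, 4, 5
x, prev_h = rng.standard_normal((N, D)), rng.standard_normal((N, H))
Wx, Wh = rng.standard_normal((D, H)), rng.standard_normal((H, H))
b = rng.standard_normal(H)
dnext_h = rng.standard_normal((N, H))

analytic = step_backward(dnext_h, x, prev_h, Wx, Wh, b)
f = lambda: step_forward(x, prev_h, Wx, Wh, b)
errors = [rel_error(numeric_grad(f, p, dnext_h), g)
          for p, g in zip([x, prev_h, Wx, Wh, b], analytic)]
print(errors)  # each entry should be very small if the gradients agree
```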

2.3. Vanilla RNN: forward

Forward pass over the entire sequence:

def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    h, cache = None, None
    ##############################################################################
    # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
    # input data. You should use the rnn_step_forward function that you defined  #
    # above. You can use a for loop to help compute the forward pass.            #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    cache = []
    N, T, D = x.shape
    H = h0.shape[1]
    h = np.zeros([N, T, H])
    prev_h = h0
    for t in range(T):
        # Each step consumes the previous hidden state and stores the new one.
        prev_h, cache_t = rnn_step_forward(x[:, t, :], prev_h, Wx, Wh, b)
        h[:, t, :] = prev_h
        cache.append(cache_t)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return h, cache

2.4. Vanilla RNN: backward

Backward pass over the entire sequence:

def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H). 
    
    NOTE: 'dh' contains the upstream gradients produced by the 
    individual loss functions at each timestep, *not* the gradients
    being passed between timesteps (which you'll have to compute yourself
    by calling rnn_step_backward in a loop).

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a vanilla RNN running an entire      #
    # sequence of data. You should use the rnn_step_backward function that you   #
    # defined above. You can use a for loop to help compute the backward pass.   #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    N, T, H = dh.shape
    _, Wx, _, _, _ = cache[0]
    D = Wx.shape[0]
    dx = np.zeros([N, T, D])
    dWx = np.zeros([D, H])
    dWh = np.zeros([H, H])
    dh0 = np.zeros([N, H])
    db = np.zeros([H])
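    dh_t = dh.copy()  # work on a copy so the caller's dh is left untouched
    for i in range(T)[::-1]:
      dx[:,i,:], dprev_h, dWx_t, dWh_t, db_t = rnn_step_backward(dh_t[:,i,:], cache[i])
      if i!=0:
        # Add the gradient flowing back from step i to step i-1. Accumulate into
        # the copy: adding to dh itself would modify the caller's array in place.
        dh_t[:,i-1,:] += dprev_h
      else:
        dh0 += dprev_h
      # The weights are shared by all timesteps, so their gradients are summed.
      dWx += dWx_t
      dWh += dWh_t
      db += db_t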
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dx, dh0, dWx, dWh, db
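
The loop above can be summarized by the recursion below. Here \delta_t is introduced only for this note as the total gradient reaching h_t (the value held in dh_t[:, t, :] when step t is processed), dh_t denotes the upstream gradient passed in for timestep t, and timesteps are numbered 1 to T with h_0 the initial state.

```latex
\begin{aligned}
\delta_T &= \mathrm{dh}_T, \\
\delta_{t} &= \mathrm{dh}_{t} + \big(\delta_{t+1} \odot (1 - h_{t+1}^{2})\big) W_h^{\top}, \qquad 1 \le t < T, \\
\frac{\partial L}{\partial h_0} &= \big(\delta_{1} \odot (1 - h_{1}^{2})\big) W_h^{\top}, \\
\frac{\partial L}{\partial W_x} &= \sum_{t=1}^{T} x_t^{\top} \big(\delta_t \odot (1 - h_t^{2})\big), \qquad
\frac{\partial L}{\partial W_h} = \sum_{t=1}^{T} h_{t-1}^{\top} \big(\delta_t \odot (1 - h_t^{2})\big).
\end{aligned}
```

The sums over t correspond to the `+=` accumulation of dWx, dWh and db inside the loop.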

2.5. Word embedding: forward

The elements of x, of shape (N, T), are word indices, and each index has to be mapped to a vector. With vectors of dimension D, the output out has shape (N, T, D):

def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    word to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out, cache = None, None
    ##############################################################################
    # TODO: Implement the forward pass for word embeddings.                      #
    #                                                                            #
    # HINT: This can be done in one line using NumPy's array indexing.           #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    out = W[x,:]
    cache = (W,x)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return out, cache
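
To see what the one-line lookup does, here is a tiny example with made-up sizes (V = 4 words, D = 3 dimensions). Indexing a (V, D) matrix with an (N, T) integer array returns an (N, T, D) array whose entry [n, t] is the row of W selected by x[n, t].

```python
import numpy as np

V, D = 4, 3
W = np.arange(V * D, dtype=float).reshape(V, D)  # row v is the vector of word v
x = np.array([[0, 2],
              [3, 3]])                           # N = 2 captions, T = 2 words each

out = W[x]            # same result as W[x, :]
print(out.shape)      # (2, 2, 3)
print(out[1, 0])      # the row W[3], i.e. [ 9. 10. 11.]
```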

2.6. Word embedding: backward

def word_embedding_backward(dout, cache):
    """
    Backward pass for word embeddings. We cannot back-propagate into the words
    since they are integers, so we only return gradient for the word embedding
    matrix.

    HINT: Look up the function np.add.at

    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass

    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D).
    """
    dW = None
    ##############################################################################
    # TODO: Implement the backward pass for word embeddings.                     #
    #                                                                            #
    # Note that words can appear more than once in a sequence.                   #
    # HINT: Look up the function np.add.at                                       #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    W,x = cache
    dW = np.zeros(W.shape)
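    # Add dout into the rows of dW selected by x; np.add.at accumulates
    # correctly when the same word index appears more than once.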
    np.add.at(dW,x,dout)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dW
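
The reason for np.add.at rather than a plain `dW[x] += dout` is that fancy-indexed in-place addition does not accumulate over repeated indices. A small example with made-up sizes, in which word 1 appears twice:

```python
import numpy as np

V, D = 3, 2
x = np.array([[1, 1, 2]])        # one caption; word 1 appears twice
dout = np.ones((1, 3, D))        # upstream gradient of shape (N, T, D)

dW_plain = np.zeros((V, D))
dW_plain[x] += dout              # repeated index 1 is only counted once

dW_exact = np.zeros((V, D))
np.add.at(dW_exact, x, dout)     # unbuffered: every occurrence is added

print(dW_plain[1])   # [1. 1.]
print(dW_exact[1])   # [2. 2.]
```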

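2.7. Implementing the RNN loss function

The overall flow is roughly:

1. Turn the image features into the initial hidden state h0, of shape (N, H)

2. Use the word embedding to get the input word vectors for the hidden layer, of shape (N, T, W)

3. Run the RNN to get the hidden states for all timesteps, of shape (N, T, H)

4. Compute the scores of the predicted words, score, of shape (N, T, V)

5. Use temporal_softmax_loss to compute loss and dscore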
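
Before these steps, the loss function splits each caption into an input part and a target part that are offset by one word. The toy example below shows that split and the resulting mask. The token ids NULL = 0, START = 1, END = 2 and the word ids 5 and 7 are assumptions chosen for illustration; in the model they come from self._null and the vocabulary.

```python
import numpy as np

NULL, START, END = 0, 1, 2                               # assumed token ids
captions = np.array([[START, 5, 7, END, NULL, NULL]])    # shape (N, T + 1)

captions_in = captions[:, :-1]    # fed to the RNN:       [[1 5 7 2 0]]
captions_out = captions[:, 1:]    # expected predictions: [[5 7 2 0 0]]
mask = captions_out != NULL       # positions that count toward the loss

print(captions_in)
print(captions_out)
print(mask)           # [[ True  True  True False False]]
```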

    def loss(self, features, captions):
        """
        Compute training-time loss for the RNN. We input image features and
        ground-truth captions for those images, and use an RNN (or LSTM) to compute
        loss and gradients on all parameters.

        Inputs:
        - features: Input image features, of shape (N, D)
        - captions: Ground-truth captions; an integer array of shape (N, T + 1) where
          each element is in the range 0 <= y[i, t] < V

        Returns a tuple of:
        - loss: Scalar loss
        - grads: Dictionary of gradients parallel to self.params
        """
        # Cut captions into two pieces: captions_in has everything but the last word
        # and will be input to the RNN; captions_out has everything but the first
        # word and this is what we will expect the RNN to generate. These are offset
        # by one relative to each other because the RNN should produce word (t+1)
        # after receiving word t. The first element of captions_in will be the START
        # token, and the first element of captions_out will be the first word.
        captions_in = captions[:, :-1]
        captions_out = captions[:, 1:]

        # You'll need this
        mask = captions_out != self._null

        # Weight and bias for the affine transform from image features to initial
        # hidden state
        W_proj, b_proj = self.params["W_proj"], self.params["b_proj"]

        # Word embedding matrix
        W_embed = self.params["W_embed"]

        # Input-to-hidden, hidden-to-hidden, and biases for the RNN
        Wx, Wh, b = self.params["Wx"], self.params["Wh"], self.params["b"]

        # Weight and bias for the hidden-to-vocab transformation.
        W_vocab, b_vocab = self.params["W_vocab"], self.params["b_vocab"]

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
        # In the forward pass you will need to do the following:                   #
        # (1) Use an affine transformation to compute the initial hidden state     #
        #     from the image features. This should produce an array of shape (N, H)#
        # (2) Use a word embedding layer to transform the words in captions_in     #
        #     from indices to vectors, giving an array of shape (N, T, W).         #
        # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
        #     process the sequence of input word vectors and produce hidden state  #
        #     vectors for all timesteps, producing an array of shape (N, T, H).    #
        # (4) Use a (temporal) affine transformation to compute scores over the    #
        #     vocabulary at every timestep using the hidden states, giving an      #
        #     array of shape (N, T, V).                                            #
        # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
        #     the points where the output word is <NULL> using the mask above.     #
        #                                                                          #
        #                                                                          #
        # Do not worry about regularizing the weights or their gradients!          #
        #                                                                          #
        # In the backward pass you will need to compute the gradient of the loss   #
        # with respect to all model parameters. Use the loss and grads variables   #
        # defined above to store loss and gradients; grads[k] should give the      #
        # gradients for self.params[k].                                            #
        #                                                                          #
        # Note also that you are allowed to make use of functions from layers.py   #
        # in your implementation, if needed.                                       #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # forward
        # 1. Compute the hidden layer's initial state h0 from the image features: (N, H)
        h0, h0_cache = affine_forward(features, W_proj, b_proj)
        # 2. Word embedding gives the input word vectors for the hidden layer: (N, T, W)
        h_in, h_in_cache = word_embedding_forward(captions_in, W_embed)
        # 3. Run the RNN (or LSTM) to get the hidden states for all timesteps: (N, T, H)
        if self.cell_type == 'rnn':
          h_out, h_out_cache = rnn_forward(h_in, h0, Wx, Wh, b)
        elif self.cell_type == 'lstm':
          h_out, h_out_cache = lstm_forward(h_in, h0, Wx, Wh, b)
        # 4. Compute the scores of the predicted words: (N, T, V)
        score, score_cache = temporal_affine_forward(h_out, W_vocab, b_vocab)
        # 5. temporal_softmax_loss computes the loss and the gradient of the scores
        loss, dscore = temporal_softmax_loss(score, captions_out, mask)

        # backward: walk through the same layers in reverse order
        dh_out, dW_vocab, db_vocab = temporal_affine_backward(dscore, score_cache)
        if self.cell_type == 'rnn':
          dh_in, dh0, dWx, dWh, db = rnn_backward(dh_out, h_out_cache)
        elif self.cell_type == 'lstm':
          dh_in, dh0, dWx, dWh, db = lstm_backward(dh_out, h_out_cache)
        dW_embed = word_embedding_backward(dh_in, h_in_cache)
        _, dW_proj, db_proj = affine_backward(dh0, h0_cache)
        grads['W_vocab'] = dW_vocab
        grads['b_vocab'] = db_vocab
        grads['Wx'] = dWx
        grads['Wh'] = dWh
        grads['b'] = db
        grads['W_embed'] = dW_embed
        grads['W_proj'] = dW_proj
        grads['b_proj'] = db_proj


        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
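
Step 5 relies on temporal_softmax_loss from the assignment's layer code, which is not reproduced in these notes. As a rough idea of what a masked per-timestep softmax loss can look like, here is a minimal sketch. The function name masked_softmax_loss is made up, and dividing by N (rather than by the number of unmasked words) is an assumption about the normalization.

```python
import numpy as np

def masked_softmax_loss(scores, y, mask):
    """scores: (N, T, V) floats, y: (N, T) ints, mask: (N, T) booleans."""
    N, T, V = scores.shape
    flat = scores.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    m = mask.reshape(N * T).astype(scores.dtype)

    # Numerically stable softmax over the vocabulary dimension.
    shifted = flat - flat.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    rows = np.arange(N * T)
    # Masked positions contribute nothing to the loss.
    loss = -np.sum(m * np.log(probs[rows, y_flat])) / N

    # Gradient of softmax cross-entropy, zeroed at masked positions.
    dflat = probs.copy()
    dflat[rows, y_flat] -= 1
    dflat *= m[:, None] / N
    return loss, dflat.reshape(N, T, V)

scores = np.zeros((1, 3, 4))                 # uniform scores over V = 4 words
y = np.array([[2, 1, 0]])
mask = np.array([[True, True, False]])       # last position is <NULL>
loss, dscores = masked_softmax_loss(scores, y, mask)
print(loss)          # two unmasked words, each costing log(4)
```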

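2.8. Implementing the RNN sample function, which generates captions for images

The flow is roughly:

First compute the initial hidden state h0 from the image features, then:

   for t in range(T):

            1. Use the word embedding to get the input word vector for this step, of shape (N, W)

            2. Use a single RNN step to get the next hidden state, of shape (N, H)

            3. Compute the scores of the predicted word, of shape (N, V)

            4. Store the index of the highest-scoring word in captions[:, t]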

    def sample(self, features, max_length=30):
        """
        Run a test-time forward pass for the model, sampling captions for input
        feature vectors.

        At each timestep, we embed the current word, pass it and the previous hidden
        state to the RNN to get the next hidden state, use the hidden state to get
        scores for all vocab words, and choose the word with the highest score as
        the next word. The initial hidden state is computed by applying an affine
        transform to the input image features, and the initial word is the <START>
        token.

        For LSTMs you will also have to keep track of the cell state; in that case
        the initial cell state should be zero.

        Inputs:
        - features: Array of input image features of shape (N, D).
        - max_length: Maximum length T of generated captions.

        Returns:
        - captions: Array of shape (N, max_length) giving sampled captions,
          where each element is an integer in the range [0, V). The first element
          of captions should be the first sampled word, not the <START> token.
        """
        N = features.shape[0]
        captions = self._null * np.ones((N, max_length), dtype=np.int32)

        # Unpack parameters
        W_proj, b_proj = self.params["W_proj"], self.params["b_proj"]
        W_embed = self.params["W_embed"]
        Wx, Wh, b = self.params["Wx"], self.params["Wh"], self.params["b"]
        W_vocab, b_vocab = self.params["W_vocab"], self.params["b_vocab"]

        ###########################################################################
        # TODO: Implement test-time sampling for the model. You will need to      #
        # initialize the hidden state of the RNN by applying the learned affine   #
        # transform to the input image features. The first word that you feed to  #
        # the RNN should be the <START> token; its value is stored in the         #
        # variable self._start. At each timestep you will need to do to:          #
        # (1) Embed the previous word using the learned word embeddings           #
        # (2) Make an RNN step using the previous hidden state and the embedded   #
        #     current word to get the next hidden state.                          #
        # (3) Apply the learned affine transformation to the next hidden state to #
        #     get scores for all words in the vocabulary                          #
        # (4) Select the word with the highest score as the next word, writing it #
        #     (the word index) to the appropriate slot in the captions variable   #
        #                                                                         #
        # For simplicity, you do not need to stop generating after an <END> token #
        # is sampled, but you can if you want to.                                 #
        #                                                                         #
        # HINT: You will not be able to use the rnn_forward or lstm_forward       #
        # functions; you'll need to call rnn_step_forward or lstm_step_forward in #
        # a loop.                                                                 #
        #                                                                         #
        # NOTE: we are still working over minibatches in this function. Also if   #
        # you are using an LSTM, initialize the first cell state to zeros.        #
        ###########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        T = max_length
        # Compute the initial hidden state h0 from the image features
        h0, _ = affine_forward(features, W_proj, b_proj)
        h_out = h0
        # Column t holds the word fed to the RNN at step t; column 0 is <START>
        captions_in = self._start * np.ones((N, T), dtype=np.int32)
        H = W_vocab.shape[0]
        c = np.zeros([N,H])  # initial LSTM cell state (unused by the vanilla RNN)
        for t in range(T):
          # 1. Word embedding gives the input word vector for this step: (N, W)
          h_in, _ = word_embedding_forward(captions_in[:,t], W_embed)
          # 2. A single RNN (or LSTM) step gives the next hidden state: (N, H)
          if self.cell_type == 'rnn':
            h_out, _ = rnn_step_forward(h_in, h_out, Wx, Wh, b)
          elif self.cell_type == 'lstm':
            h_out, c, _ = lstm_step_forward(h_in, h_out, c, Wx, Wh, b)
          # 3. Compute the scores of the predicted word: (N, V)
          score, _ = affine_forward(h_out, W_vocab, b_vocab)
          # 4. Store the index of the highest-scoring word
          captions[:,t] = np.argmax(score,axis=1)
          # The sampled word becomes the input of the next step
          if t!=T-1:
            captions_in[:,t+1] = captions[:,t]
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return captions
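
The sampling loop above keeps generating words after an <END> token, which the assignment explicitly allows. If cleaner output is wanted, the captions can be post-processed afterwards. The helper below is a hypothetical sketch, not part of the assignment code: truncate_at_end and the token ids used in the example are assumptions.

```python
import numpy as np

def truncate_at_end(captions, end_token, null_token):
    """Replace everything after the first end_token in each row by null_token."""
    out = captions.copy()
    for row in out:  # each row is a view into out, so edits are kept
        hits = np.flatnonzero(row == end_token)
        if hits.size:
            row[hits[0] + 1:] = null_token
    return out

sampled = np.array([[5, 2, 7, 7],
                    [4, 4, 4, 4]])
cleaned = truncate_at_end(sampled, end_token=2, null_token=0)
print(cleaned)   # first row becomes [5 2 0 0]; second row has no end token and is unchanged
```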

 

 

 

 

 

 
