2020 cs231n 作业3 笔记 LSTM_Captioning

最新推荐文章于 2024-07-07 14:06:00 发布

cheetah023

最新推荐文章于 2024-07-07 14:06:00 发布

阅读量637

点赞数

分类专栏： cs231n 文章标签： cs231n lstm rnn python 深度学习

本文链接：https://blog.csdn.net/cheetah023/article/details/107621829

版权

cs231n 专栏收录该内容

15 篇文章 16 订阅

订阅专栏

LSTM_Captioning

1、简介

LSTM里面加入了输入门 $I_{t}$ ，遗忘门 $F_{t}$ ，输出门 $O_{t}$ ，候选记忆细胞 $\tilde{C}$ (cs231n里面叫g)

遗忘门F控制上一时间步的记忆细胞 $C_{t-1}$ 中的信息是否传递到当前时间步，而输入门则控制当前时间步的输入 $X_{t}$ 通过候选记忆细胞 $\tilde{C}$ 如何流入当前时间步的记忆细胞。

如果遗忘门一直近似1且输入门一直近似0，过去的记忆细胞将一直通过时间保存并传递至当前时间步，这个设计可以应对RNN中的梯度衰减问题，并更好地捕捉时间序列中时间步距离较大的依赖关系。

2、作业代码实现

2.1、LSTM: step forward

对序列中的一个数据进行前向传播：

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """
    Forward pass for a single timestep of an LSTM.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Note that a sigmoid() function has already been provided for you in this file.

    Inputs:
    - x: Input data, of shape (N, D)
    - prev_h: Previous hidden state, of shape (N, H)
    - prev_c: previous cell state, of shape (N, H)
    - Wx: Input-to-hidden weights, of shape (D, 4H)
    - Wh: Hidden-to-hidden weights, of shape (H, 4H)
    - b: Biases, of shape (4H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - next_c: Next cell state, of shape (N, H)
    - cache: Tuple of values needed for backward pass.
    """
    next_h, next_c, cache = None, None, None
    #############################################################################
    # TODO: Implement the forward pass for a single timestep of an LSTM.        #
    # You may want to use the numerically stable sigmoid implementation above.  #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    H = prev_h.shape[1]
    a = x.dot(Wx) + prev_h.dot(Wh) + b # N*4H
    # N*H
    i = sigmoid(a[:, 0:H])
    f = sigmoid(a[:, H:2*H])
    o = sigmoid(a[:, 2*H:3*H])
    g = np.tanh(a[:, 3*H:4*H])
    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)
    cache = next_c, x, prev_h, prev_c, i, f, o, g, Wx, Wh, b, a
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return next_h, next_c, cache

2.2、LSTM: step backward

对序列中的一个数据进行反向传播：

def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dprev_h, dprev_c, dWx, dWh, db = None, None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for a single timestep of an LSTM.       #
    #                                                                           #
    # HINT: For sigmoid and tanh you can compute local derivatives in terms of  #
    # the output value from the nonlinearity.                                   #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    next_c, x, prev_h, prev_c, i, f, o, g, Wx, Wh, b, a = cache
    # N*H
    H = prev_h.shape[1]
    do = dnext_h * np.tanh(next_c)
    dnext_ctemp = dnext_h * o * (1 - np.square(np.tanh(next_c))) + dnext_c
    df = dnext_ctemp * prev_c
    dprev_c = dnext_ctemp * f
    di = dnext_ctemp * g
    dg = dnext_ctemp * i
    ai = a[:,0:H]
    af = a[:,H:2*H]
    ao = a[:,2*H:3*H]
    ag = a[:,3*H:4*H]
    dg_tanh = (1 - np.square(np.tanh(ag))) * dg
    di_sigmoid = sigmoid(ai) * (1 - sigmoid(ai)) * di
    df_sigmoid = sigmoid(af) * (1 - sigmoid(af)) * df
    do_sigmoid = sigmoid(ao) * (1 - sigmoid(ao)) * do
    # N*4H
    da = np.concatenate((di_sigmoid,df_sigmoid,do_sigmoid,dg_tanh),axis=1)
    dx = np.dot(da, Wx.T)
    dWx = np.dot(x.T, da)
    dprev_h = np.dot(da, Wh.T)
    dWh = np.dot(prev_h.T, da)
    db = np.sum(da,axis=0)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return dx, dprev_h, dprev_c, dWx, dWh, db

2.3、LSTM: forward

对整个序列数据进行前向传播：

def lstm_forward(x, h0, Wx, Wh, b):
    """
    Forward pass for an LSTM over an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The LSTM uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the LSTM forward, we return the hidden states for all timesteps.

    Note that the initial cell state is passed as input, but the initial cell
    state is set to zero. Also note that the cell state is not returned; it is
    an internal variable to the LSTM and is not accessed from outside.

    Inputs:
    - x: Input data of shape (N, T, D)
    - h0: Initial hidden state of shape (N, H)
    - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
    - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
    - b: Biases of shape (4H,)

    Returns a tuple of:
    - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
    - cache: Values needed for the backward pass.
    """
    h, cache = None, None
    #############################################################################
    # TODO: Implement the forward pass for an LSTM over an entire timeseries.   #
    # You should use the lstm_step_forward function that you just defined.      #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    cache = []
    N, T, D = x.shape
    H = h0.shape[1]
    c = np.zeros([N,H])
    h = np.zeros([N,T,H])
    h[:,0,:], c, cache_temp = lstm_step_forward(x[:,0,:], h0, c, Wx, Wh, b)
    cache.append(cache_temp)
    for t in range(T-1):
      h[:,t+1,:], c, cache_temp = lstm_step_forward(x[:,t+1,:], h[:,t,:], c, Wx, Wh, b)
      cache.append(cache_temp)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return h, cache

2.4、LSTM: backward

对整个序列数据进行反向传播：

def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.]

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for an LSTM over an entire timeseries.  #
    # You should use the lstm_step_backward function that you just defined.     #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    N,T,H = dh.shape
    cache_t = cache[0]
    x_t = cache_t[1]
    D = x_t.shape[1]
    dx = np.zeros([N,T,D])
    dc = np.zeros([N,H])
    dh_t = dh.copy()
    dWx = np.zeros([D,4*H])
    dWh = np.zeros([H,4*H])
    db = np.zeros(4*H)
    for t in range(T)[::-1]:
      dx[:,t,:], dprev_h, dc, dWx_t, dWh_t, db_t =\
           lstm_step_backward(dh_t[:,t,:], dc, cache.pop())
      if t!=0:
        dh_t[:,t-1,:] += dprev_h
      else:
        dh0 = dprev_h
      dWx += dWx_t
      dWh += dWh_t
      db += db_t
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return dx, dh0, dWx, dWh, db

2.5、在CaptioningRNN网络中加入LSTM

因为这部分的作业结构在前面的作业已经加好了，只需要再添加if分支把LSTM加上去就好了，下面是代码：

class CaptioningRNN(object):
    """
    A CaptioningRNN produces captions from image features using a recurrent
    neural network.

    The RNN receives input vectors of size D, has a vocab size of V, works on
    sequences of length T, has an RNN hidden dimension of H, uses word vectors
    of dimension W, and operates on minibatches of size N.

    Note that we don't use any regularization for the CaptioningRNN.
    """

    def __init__(
        self,
        word_to_idx,
        input_dim=512,
        wordvec_dim=128,
        hidden_dim=128,
        cell_type="rnn",
        dtype=np.float32,
    ):
        """
        Construct a new CaptioningRNN instance.

        Inputs:
        - word_to_idx: A dictionary giving the vocabulary. It contains V entries,
          and maps each string to a unique integer in the range [0, V).
        - input_dim: Dimension D of input image feature vectors.
        - wordvec_dim: Dimension W of word vectors.
        - hidden_dim: Dimension H for the hidden state of the RNN.
        - cell_type: What type of RNN to use; either 'rnn' or 'lstm'.
        - dtype: numpy datatype to use; use float32 for training and float64 for
          numeric gradient checking.
        """
        if cell_type not in {"rnn", "lstm"}:
            raise ValueError('Invalid cell_type "%s"' % cell_type)

        self.cell_type = cell_type
        self.dtype = dtype
        self.word_to_idx = word_to_idx
        self.idx_to_word = {i: w for w, i in word_to_idx.items()}
        self.params = {}

        vocab_size = len(word_to_idx)

        self._null = word_to_idx["<NULL>"]
        self._start = word_to_idx.get("<START>", None)
        self._end = word_to_idx.get("<END>", None)

        # Initialize word vectors
        self.params["W_embed"] = np.random.randn(vocab_size, wordvec_dim)
        self.params["W_embed"] /= 100

        # Initialize CNN -> hidden state projection parameters
        self.params["W_proj"] = np.random.randn(input_dim, hidden_dim)
        self.params["W_proj"] /= np.sqrt(input_dim)
        self.params["b_proj"] = np.zeros(hidden_dim)

        # Initialize parameters for the RNN
        dim_mul = {"lstm": 4, "rnn": 1}[cell_type]
        self.params["Wx"] = np.random.randn(wordvec_dim, dim_mul * hidden_dim)
        self.params["Wx"] /= np.sqrt(wordvec_dim)
        self.params["Wh"] = np.random.randn(hidden_dim, dim_mul * hidden_dim)
        self.params["Wh"] /= np.sqrt(hidden_dim)
        self.params["b"] = np.zeros(dim_mul * hidden_dim)

        # Initialize output to vocab weights
        self.params["W_vocab"] = np.random.randn(hidden_dim, vocab_size)
        self.params["W_vocab"] /= np.sqrt(hidden_dim)
        self.params["b_vocab"] = np.zeros(vocab_size)

        # Cast parameters to correct dtype
        for k, v in self.params.items():
            self.params[k] = v.astype(self.dtype)

    def loss(self, features, captions):
        """
        Compute training-time loss for the RNN. We input image features and
        ground-truth captions for those images, and use an RNN (or LSTM) to compute
        loss and gradients on all parameters.

        Inputs:
        - features: Input image features, of shape (N, D)
        - captions: Ground-truth captions; an integer array of shape (N, T + 1) where
          each element is in the range 0 <= y[i, t] < V

        Returns a tuple of:
        - loss: Scalar loss
        - grads: Dictionary of gradients parallel to self.params
        """
        # Cut captions into two pieces: captions_in has everything but the last word
        # and will be input to the RNN; captions_out has everything but the first
        # word and this is what we will expect the RNN to generate. These are offset
        # by one relative to each other because the RNN should produce word (t+1)
        # after receiving word t. The first element of captions_in will be the START
        # token, and the first element of captions_out will be the first word.
        captions_in = captions[:, :-1]
        captions_out = captions[:, 1:]

        # You'll need this
        mask = captions_out != self._null

        # Weight and bias for the affine transform from image features to initial
        # hidden state
        W_proj, b_proj = self.params["W_proj"], self.params["b_proj"]

        # Word embedding matrix
        W_embed = self.params["W_embed"]

        # Input-to-hidden, hidden-to-hidden, and biases for the RNN
        Wx, Wh, b = self.params["Wx"], self.params["Wh"], self.params["b"]

        # Weight and bias for the hidden-to-vocab transformation.
        W_vocab, b_vocab = self.params["W_vocab"], self.params["b_vocab"]

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
        # In the forward pass you will need to do the following:                   #
        # (1) Use an affine transformation to compute the initial hidden state     #
        #     from the image features. This should produce an array of shape (N, H)#
        # (2) Use a word embedding layer to transform the words in captions_in     #
        #     from indices to vectors, giving an array of shape (N, T, W).         #
        # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
        #     process the sequence of input word vectors and produce hidden state  #
        #     vectors for all timesteps, producing an array of shape (N, T, H).    #
        # (4) Use a (temporal) affine transformation to compute scores over the    #
        #     vocabulary at every timestep using the hidden states, giving an      #
        #     array of shape (N, T, V).                                            #
        # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
        #     the points where the output word is <NULL> using the mask above.     #
        #                                                                          #
        #                                                                          #
        # Do not worry about regularizing the weights or their gradients!          #
        #                                                                          #
        # In the backward pass you will need to compute the gradient of the loss   #
        # with respect to all model parameters. Use the loss and grads variables   #
        # defined above to store loss and gradients; grads[k] should give the      #
        # gradients for self.params[k].                                            #
        #                                                                          #
        # Note also that you are allowed to make use of functions from layers.py   #
        # in your implementation, if needed.                                       #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #forward
        #1、从图片特征中生成hidden layer的h0: (N,H))
        h0, h0_cache = affine_forward(features, W_proj, b_proj)
        #2、word embedding获取hidden layer的输入词向量 (N,T,W)
        h_in, h_in_cache = word_embedding_forward(captions_in, W_embed)
        #3、使用rnn生成hidden layer输出的词向量 (N,T,H)
        if self.cell_type == 'rnn':
          h_out, h_out_cache = rnn_forward(h_in, h0, Wx, Wh, b)
        elif self.cell_type == 'lstm':
          h_out, h_out_cache = lstm_forward(h_in, h0, Wx, Wh, b)
        #4、生成预测的词的分数 (N,T,V)
        score, score_cache = temporal_affine_forward(h_out, W_vocab, b_vocab)
        #5，temporal_softmax_loss，计算loss和grad
        loss, dscore = temporal_softmax_loss(score, captions_out, mask)

        #backward 
        dh_out, dW_vocab, db_vocab = temporal_affine_backward(dscore, score_cache)
        if self.cell_type == 'rnn':
          dh_in, dh0, dWx, dWh, db = rnn_backward(dh_out, h_out_cache)
        elif self.cell_type == 'lstm':
          dh_in, dh0, dWx, dWh, db = lstm_backward(dh_out, h_out_cache)
        dW_embed = word_embedding_backward(dh_in, h_in_cache)
        _, dW_proj, db_proj = affine_backward(dh0, h0_cache)
        grads['W_vocab'] = dW_vocab
        grads['b_vocab'] = db_vocab
        grads['Wx'] = dWx
        grads['Wh'] = dWh
        grads['b'] = db
        grads['W_embed'] = dW_embed
        grads['W_proj'] = dW_proj
        grads['b_proj'] = db_proj


        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads

    def sample(self, features, max_length=30):
        """
        Run a test-time forward pass for the model, sampling captions for input
        feature vectors.

        At each timestep, we embed the current word, pass it and the previous hidden
        state to the RNN to get the next hidden state, use the hidden state to get
        scores for all vocab words, and choose the word with the highest score as
        the next word. The initial hidden state is computed by applying an affine
        transform to the input image features, and the initial word is the <START>
        token.

        For LSTMs you will also have to keep track of the cell state; in that case
        the initial cell state should be zero.

        Inputs:
        - features: Array of input image features of shape (N, D).
        - max_length: Maximum length T of generated captions.

        Returns:
        - captions: Array of shape (N, max_length) giving sampled captions,
          where each element is an integer in the range [0, V). The first element
          of captions should be the first sampled word, not the <START> token.
        """
        N = features.shape[0]
        captions = self._null * np.ones((N, max_length), dtype=np.int32)

        # Unpack parameters
        W_proj, b_proj = self.params["W_proj"], self.params["b_proj"]
        W_embed = self.params["W_embed"]
        Wx, Wh, b = self.params["Wx"], self.params["Wh"], self.params["b"]
        W_vocab, b_vocab = self.params["W_vocab"], self.params["b_vocab"]

        ###########################################################################
        # TODO: Implement test-time sampling for the model. You will need to      #
        # initialize the hidden state of the RNN by applying the learned affine   #
        # transform to the input image features. The first word that you feed to  #
        # the RNN should be the <START> token; its value is stored in the         #
        # variable self._start. At each timestep you will need to do to:          #
        # (1) Embed the previous word using the learned word embeddings           #
        # (2) Make an RNN step using the previous hidden state and the embedded   #
        #     current word to get the next hidden state.                          #
        # (3) Apply the learned affine transformation to the next hidden state to #
        #     get scores for all words in the vocabulary                          #
        # (4) Select the word with the highest score as the next word, writing it #
        #     (the word index) to the appropriate slot in the captions variable   #
        #                                                                         #
        # For simplicity, you do not need to stop generating after an <END> token #
        # is sampled, but you can if you want to.                                 #
        #                                                                         #
        # HINT: You will not be able to use the rnn_forward or lstm_forward       #
        # functions; you'll need to call rnn_step_forward or lstm_step_forward in #
        # a loop.                                                                 #
        #                                                                         #
        # NOTE: we are still working over minibatches in this function. Also if   #
        # you are using an LSTM, initialize the first cell state to zeros.        #
        ###########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        T = max_length
        #先从图片feature中提取初始隐藏状态h0
        h0, _ = affine_forward(features, W_proj, b_proj)
        h_out = h0
        captions_in = self._start * np.ones((N, T), dtype=np.int32)
        H = W_vocab.shape[0]
        c = np.zeros([N,H])
        for t in range(T):
          #1、word embedding获取hidden layer的输入词向量 (N,W)
          h_in, _ = word_embedding_forward(captions_in[:,t], W_embed)
          #2、使用rnn_step生成hidden layer输出的词向量 (N,H)
          if self.cell_type == 'rnn':
            h_out, _ = rnn_step_forward(h_in, h_out, Wx, Wh, b)
          elif self.cell_type == 'lstm':
            h_out, c, _ = lstm_step_forward(h_in, h_out, c, Wx, Wh, b)
          #3、生成预测的词分数 (N,V)
          score, _ = affine_forward(h_out, W_vocab, b_vocab)
          #4、找到得分最高的词索引captions[t]
          captions[:,t] = np.argmax(score,axis=1)
          if t!=T-1:
            captions_in[:,t+1] = captions[:,t]
        print('captions',captions.shape)
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return captions