CS231n assignment3 Q1 Image Captioning with Vanilla RNNs

This is the final assignment. The first two questions are again implemented in NumPy (an RNN and an LSTM), while the remaining three use TensorFlow/PyTorch. So far I have only completed the TensorFlow versions; I may get to the PyTorch ones later.

Preface

The first task is image captioning with an RNN. Image captioning is a classic application of RNN-family networks and follows the encoder-decoder pattern: the encoder is a CNN such as VGG16, using the fully connected layer just before the softmax as the image representation, and the decoder is an RNN-family network. It is a one-to-many model.
Show and Tell: A Neural Image Caption Generator
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
These are two classic image-captioning papers. The first won the 2015 COCO captioning challenge (helped by a number of tricks); the second introduced the attention mechanism and explains the difference between soft attention and hard attention.

The assignment

Loading the data

The encoder part is already provided: each image is represented by the 4096-dimensional fully connected layer just before the softmax, reduced to 512 dimensions with PCA to keep the computation manageable.
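
As a rough sketch of how such features could be produced (this is not the assignment's actual preprocessing code; the fc7_features array and the use of scikit-learn's PCA are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA

# fc7_features: hypothetical (num_images, 4096) array of CNN fc7 activations
fc7_features = np.random.randn(1000, 4096).astype(np.float32)

# Project the 4096-dim CNN features down to 512 dims
pca = PCA(n_components=512)
features = pca.fit_transform(fc7_features)
print(features.shape)  # (1000, 512)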
<START> marks the beginning of a caption, <END> the end, <UNK> an unknown word, and <NULL> is used to pad short captions.

val_urls <class 'numpy.ndarray'> (40504,) <U63
word_to_idx <class 'dict'> 1004
val_features <class 'numpy.ndarray'> (40504, 512) float32
train_urls <class 'numpy.ndarray'> (82783,) <U63
val_image_idxs <class 'numpy.ndarray'> (195954,) int32
train_image_idxs <class 'numpy.ndarray'> (400135,) int32
train_features <class 'numpy.ndarray'> (82783, 512) float32
val_captions <class 'numpy.ndarray'> (195954, 17) int32
idx_to_word <class 'list'> 1004
train_captions <class 'numpy.ndarray'> (400135, 17) int32
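
To sanity-check the loaded captions, the index arrays can be mapped back to words with idx_to_word. A minimal sketch (the assignment ships a similar decode_captions helper in coco_utils):

def decode_caption(caption, idx_to_word):
    """Turn one row of train_captions (a length-17 int array) back into a string."""
    words = []
    for idx in caption:
        word = idx_to_word[idx]
        if word == '<NULL>':
            continue  # padding carries no content
        words.append(word)
        if word == '<END>':
            break
    return ' '.join(words)

# Usage: decode_caption(data['train_captions'][0], data['idx_to_word'])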

(Sample training images with their ground-truth captions were shown here.)

RNN forward step
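
A single step computes next_h = tanh(x · Wx + prev_h · Wh + b): the current input is projected into the hidden space, added to the transformed previous hidden state, and squashed by tanh.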

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    next_h, cache = None, None
    ##############################################################################
    # TODO: Implement a single forward step for the vanilla RNN. Store the next  #
    # hidden state and any values you need for the backward pass in the next_h   #
    # and cache variables respectively.                                          #
    ##############################################################################
    temp1 = np.dot(x, Wx)       # x (N, D) · Wx (D, H) -> (N, H): project the input into hidden space
    temp2 = np.dot(prev_h, Wh)  # prev_h (N, H) · Wh (H, H) -> (N, H): hidden dimension unchanged
    cache = (x, prev_h, Wx, Wh, temp1 + temp2 + b)  # values the backward pass will need
    next_h = np.tanh(temp1 + temp2 + b)             # tanh activation
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return next_h, cache

next_h error: 6.292421426471037e-09

RNN backward step

def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a single step of a vanilla RNN.      #
    #                                                                            #
    # HINT: For the tanh function, you can compute the local derivative in terms #
    # of the output value from tanh.                                             #
    ##############################################################################
    # h = tanh(x · Wx + prev_h · Wh + b)
    x, prev_h, Wx, Wh, affine = cache
    N, H = prev_h.shape
    # Derivative of tanh: f'(z) = 1 - f(z)**2
    temp = np.ones((N, H)) - np.square(np.tanh(affine))
    delta = np.multiply(temp, dnext_h)  # multiply by the upstream gradient

    # Gradient w.r.t. x (chain rule through the affine transform)
    dx = np.dot(Wx, delta.T).T

    # Gradient w.r.t. the previous hidden state
    dprev_h = np.dot(Wh, delta.T).T

    # Gradient w.r.t. Wx
    dWx = np.dot(x.T, delta)

    # Gradient w.r.t. Wh
    dWh = np.dot(prev_h.T, delta)

    # Gradient w.r.t. b: sum over the batch dimension
    db = np.sum(delta, axis=0)
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dx, dprev_h, dWx, dWh, db

dx error: 4.0192769090159184e-10
dprev_h error: 2.5632975303201374e-10
dWx error: 8.820222259148609e-10
dWh error: 4.703287554560559e-10
db error: 7.30162216654e-11
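
The reported errors come from comparing analytic gradients against numeric ones. A minimal sketch of such a check (the assignment uses its own eval_numerical_gradient_array; this centered-difference version is just illustrative):

import numpy as np

def numeric_gradient(f, x, df, h=1e-5):
    """Centered-difference numeric gradient of f at x, contracted with the upstream gradient df."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        pos = f(x).copy()
        x[ix] = old - h
        neg = f(x).copy()
        x[ix] = old  # restore
        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad

# Usage sketch: compare dx from rnn_step_backward against
#   numeric_gradient(lambda z: rnn_step_forward(z, prev_h, Wx, Wh, b)[0], x, dnext_h)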

RNN forward pass (full sequence)

def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    h, cache = None, None
    ##############################################################################
    # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
    # input data. You should use the rnn_step_forward function that you defined  #
    # above. You can use a for loop to help compute the forward pass.            #
    ##############################################################################
    N, T, D = x.shape
    H = h0.shape[1]
    prev_h = h0
    h_pre = np.empty([N, T, H])   # tanh pre-activation at each timestep
    h_cur = np.empty([N, T, H])   # hidden state produced at each timestep
    h_prev = np.empty([N, T, H])  # hidden state entering each timestep
    for i in range(T):
        temp_h, cache_temp = rnn_step_forward(x[:, i, :], prev_h, Wx, Wh, b)  # one forward step
        h_prev[:, i, :] = prev_h        # hidden state entering this step
        prev_h = temp_h
        h_cur[:, i, :] = temp_h         # hidden state leaving this step
        h_pre[:, i, :] = cache_temp[4]  # the tanh pre-activation, saved for backprop
    cache = (x, h_prev, Wx, Wh, h_pre)
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return h_cur, cache

h error: 7.728466158305164e-08

RNN backward pass (full sequence)

def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H). 
    
    NOTE: 'dh' contains the upstream gradients produced by the 
    individual loss functions at each timestep, *not* the gradients
    being passed between timesteps (which you'll have to compute yourself
    by calling rnn_step_backward in a loop).

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a vanilla RNN running an entire      #
    # sequence of data. You should use the rnn_step_backward function that you   #
    # defined above. You can use a for loop to help compute the backward pass.   #
    ##############################################################################
    x = cache[0]
    N, T, D = x.shape
    N, T, H = dh.shape
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    dout = dh
    dx = np.empty([N, T, D])
    hnow = np.zeros([N, H])  # gradient flowing into the hidden state at the current timestep
    for k in range(T):
        i = T - 1 - k
        # Besides the gradient passed back from the next timestep, every timestep
        # also receives a gradient from its own loss term.
        hnow = hnow + dout[:, i, :]
        cacheT = (cache[0][:, i, :], cache[1][:, i, :], cache[2], cache[3], cache[4][:, i, :])
        dx_temp, dprev_h, dWx_temp, dWh_temp, db_temp = rnn_step_backward(hnow, cacheT)
        hnow = dprev_h
        dx[:, i, :] = dx_temp
        dWx = dWx + dWx_temp  # shared weights: accumulate gradients over timesteps
        dWh = dWh + dWh_temp
        db = db + db_temp
    dh0 = hnow
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dx, dh0, dWx, dWh, db

dx error: 1.5382468491701097e-09
dh0 error: 3.3839681556240896e-09
dWx error: 7.150535245339328e-09
dWh error: 1.297338408201546e-07
db error: 1.4889022954777414e-10
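
Because Wx, Wh, and b are shared across timesteps, their gradients are sums of the per-step gradients (for example dWx = Σ_t x_tᵀ · δ_t), which is exactly what the dWx = dWx + dWx_temp accumulation in the loop implements.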

Word-embedding forward pass

The word vectors themselves are also trained.

def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    word to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out, cache = None, None
    ##############################################################################
    # TODO: Implement the forward pass for word embeddings.                      #
    #                                                                            #
    # HINT: This can be done in one line using NumPy's array indexing.           #
    ##############################################################################
    out = W[x, :]  # fancy indexing: one embedding row per word index
    cache = (x,W)
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return out, cache

out error: 1.0000000094736443e-08

Word-embedding backward pass

def word_embedding_backward(dout, cache):
    """
    Backward pass for word embeddings. We cannot back-propagate into the words
    since they are integers, so we only return gradient for the word embedding
    matrix.

    HINT: Look up the function np.add.at

    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass

    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D).
    """
    dW = None
    ##############################################################################
    # TODO: Implement the backward pass for word embeddings.                     #
    #                                                                            #
    # Note that words can appear more than once in a sequence.                   #
    # HINT: Look up the function np.add.at                                       #
    ##############################################################################
    x, W = cache
    dW = np.zeros_like(W)
    np.add.at(dW, x, dout)  # scatter-add dout into dW at the rows given by x
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dW

dW error: 3.2774595693100364e-12
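
The reason for np.add.at rather than a plain fancy-indexed assignment is that repeated indices must accumulate. A quick demonstration:

import numpy as np

dW = np.zeros(4)
idx = np.array([0, 2, 2])     # index 2 appears twice (a repeated word)
dout = np.array([1.0, 1.0, 1.0])

dW[idx] += dout               # buffered: the repeat at index 2 is added only once
print(dW)                     # [1. 0. 1. 0.]

dW = np.zeros(4)
np.add.at(dW, idx, dout)      # unbuffered: repeats accumulate correctly
print(dW)                     # [1. 0. 2. 0.]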

Temporal affine layer

An affine transform converts the RNN hidden vector at each timestep into scores over every word in the vocabulary.

def temporal_affine_forward(x, w, b):
    """
    Forward pass for a temporal affine layer. The input is a set of D-dimensional
    vectors arranged into a minibatch of N timeseries, each of length T. We use
    an affine function to transform each of those vectors into a new vector of
    dimension M.

    Inputs:
    - x: Input data of shape (N, T, D)
    - w: Weights of shape (D, M)
    - b: Biases of shape (M,)

    Returns a tuple of:
    - out: Output data of shape (N, T, M)
    - cache: Values needed for the backward pass
    """
    N, T, D = x.shape
    M = b.shape[0]
    out = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
    cache = x, w, b, out
    return out, cache


def temporal_affine_backward(dout, cache):
    """
    Backward pass for temporal affine layer.

    Input:
    - dout: Upstream gradients of shape (N, T, M)
    - cache: Values from forward pass

    Returns a tuple of:
    - dx: Gradient of input, of shape (N, T, D)
    - dw: Gradient of weights, of shape (D, M)
    - db: Gradient of biases, of shape (M,)
    """
    x, w, b, out = cache
    N, T, D = x.shape
    M = b.shape[0]

    dx = dout.reshape(N * T, M).dot(w.T).reshape(N, T, D)
    dw = dout.reshape(N * T, M).T.dot(x.reshape(N * T, D)).T
    db = dout.sum(axis=(0, 1))

    return dx, dw, db

dx error: 2.9215945034030545e-10
dw error: 1.5772088618663602e-10
db error: 3.252200556967514e-11
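
The reshape trick in temporal_affine_forward just folds the batch and time axes together before the matrix multiply; it is equivalent to applying the same affine transform at every timestep, as this quick check shows:

import numpy as np

N, T, D, M = 2, 3, 4, 5
x = np.random.randn(N, T, D)
w = np.random.randn(D, M)
b = np.random.randn(M)

out_reshape = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
out_einsum = np.einsum('ntd,dm->ntm', x, w) + b
print(np.allclose(out_reshape, out_einsum))  # True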

Temporal softmax loss

def temporal_softmax_loss(x, y, mask, verbose=False):
    """
    A temporal version of softmax loss for use in RNNs. We assume that we are
    making predictions over a vocabulary of size V for each timestep of a
    timeseries of length T, over a minibatch of size N. The input x gives scores
    for all vocabulary elements at all timesteps, and y gives the indices of the
    ground-truth element at each timestep. We use a cross-entropy loss at each
    timestep, summing the loss over all timesteps and averaging across the
    minibatch.

    As an additional complication, we may want to ignore the model output at some
    timesteps, since sequences of different length may have been combined into a
    minibatch and padded with NULL tokens. The optional mask argument tells us
    which elements should contribute to the loss.

    Inputs:
    - x: Input scores, of shape (N, T, V)
    - y: Ground-truth indices, of shape (N, T) where each element is in the range
         0 <= y[i, t] < V
    - mask: Boolean array of shape (N, T) where mask[i, t] tells whether or not
      the scores at x[i, t] should contribute to the loss.

    Returns a tuple of:
    - loss: Scalar giving loss
    - dx: Gradient of loss with respect to scores x.
    """

    N, T, V = x.shape

    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    mask_flat = mask.reshape(N * T)

    probs = np.exp(x_flat - np.max(x_flat, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N
    dx_flat = probs.copy()
    dx_flat[np.arange(N * T), y_flat] -= 1
    dx_flat /= N
    dx_flat *= mask_flat[:, None]

    if verbose: print('dx_flat: ', dx_flat.shape)

    dx = dx_flat.reshape(N, T, V)

    return loss, dx

2.3027781774290146
23.025985953127226
2.2643611790293394
dx error: 2.583585303524283e-08
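
In short, with probabilities given by a softmax over the V scores at each (sample, timestep) pair, the loss is L = -(1/N) Σ_{i,t} mask[i,t] · log p[i, t, y[i, t]]. Subtracting the row-wise max before exponentiating does not change the softmax output; it is the standard trick to keep np.exp from overflowing.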

The loss function

    def loss(self, features, captions):
        """
        Compute the training-time loss for the RNN/LSTM. We feed in image
        features and the ground-truth captions, and use the RNN/LSTM to compute
        the loss and the gradients of all parameters.

        Inputs:
        - features: Input image features, of shape (N, D)
        - captions: Ground-truth captions; an integer array of shape (N, T)

        Returns a tuple of:
        - loss: Scalar loss
        - grads: Dictionary of gradients for all parameters
        """
        # Split captions into two pieces: captions_in is everything except the
        # last word and is fed into the RNN/LSTM; captions_out is everything
        # except the first word and is what the RNN/LSTM is expected to produce.
        captions_in = captions[:, :-1]
        captions_out = captions[:, 1:]

        # You'll need this
        mask = (captions_out != self._null)

        # Weight and bias for the affine transform from image features to the initial hidden state
        W_proj, b_proj = self.params['W_proj'], self.params['b_proj']

        # Word-embedding matrix
        W_embed = self.params['W_embed']

        # RNN/LSTM parameters
        Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

        # Weight and bias for the hidden-to-vocabulary affine transform
        W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
        # In the forward pass you will need to do the following:                   #
        # (1) Use an affine transformation to compute the initial hidden state     #
        #     from the image features. This should produce an array of shape (N, H)#
        # (2) Use a word embedding layer to transform the words in captions_in     #
        #     from indices to vectors, giving an array of shape (N, T, W).         #
        # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
        #     process the sequence of input word vectors and produce hidden state  #
        #     vectors for all timesteps, producing an array of shape (N, T, H).    #
        # (4) Use a (temporal) affine transformation to compute scores over the    #
        #     vocabulary at every timestep using the hidden states, giving an      #
        #     array of shape (N, T, V).                                            #
        # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
        #     the points where the output word is <NULL> using the mask above.     #
        #                                                                          #
        # In the backward pass you will need to compute the gradient of the loss   #
        # with respect to all model parameters. Use the loss and grads variables   #
        # defined above to store loss and gradients; grads[k] should give the      #
        # gradients for self.params[k].                                            #
        #                                                                          #
        # Note also that you are allowed to make use of functions from layers.py   #
        # in your implementation, if needed.                                       #
        ############################################################################
        N, D = features.shape
        # (1) Affine transform from the image features to the initial hidden state, shape (N, H)
        out, cache_affine = temporal_affine_forward(features.reshape(N, 1, D), W_proj, b_proj)
        N, T, H = out.shape
        h0 = out.reshape(N, H)

        # (2) Word-embedding layer: captions_in indices -> vectors, shape (N, T, W)
        word_out, cache_word = word_embedding_forward(captions_in, W_embed)

        # (3) RNN/LSTM over the word vectors, producing hidden states of shape (N, T, H)
        if self.cell_type == 'rnn':
            hidden, cache_hidden = rnn_forward(word_out, h0, Wx, Wh, b)
        else:
            hidden, cache_hidden = lstm_forward(word_out, h0, Wx, Wh, b)

        # (4) Temporal affine transform: hidden states -> vocabulary scores, shape (N, T, V)
        out_vo, cache_vo = temporal_affine_forward(hidden, W_vocab, b_vocab)

        # (5) Temporal softmax loss against captions_out, masking out <NULL> positions
        loss, dx = temporal_softmax_loss(out_vo, captions_out, mask, verbose=False)

        # Backward pass: propagate gradients back to every parameter
        dx_affine, dW_vocab, db_vocab = temporal_affine_backward(dx, cache_vo)
        grads['W_vocab'] = dW_vocab
        grads['b_vocab'] = db_vocab

        if self.cell_type == 'rnn':
            dx_hidden, dh0, dWx, dWh, db = rnn_backward(dx_affine, cache_hidden)
        else:
            dx_hidden, dh0, dWx, dWh, db = lstm_backward(dx_affine, cache_hidden)

        grads['Wx'] = dWx
        grads['Wh'] = dWh
        grads['b'] = db

        dW_embed = word_embedding_backward(dx_hidden, cache_word)
        grads['W_embed'] = dW_embed

        dx_initial, dW_proj, db_proj = temporal_affine_backward(dh0.reshape(N, T, H), cache_affine)
        grads['W_proj'] = dW_proj
        grads['b_proj'] = db_proj
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads

loss: 9.832355910027387
expected loss: 9.83235591003
difference: 2.6130209107577684e-12

W_embed relative error: 2.331070e-09
W_proj relative error: 1.112417e-08
W_vocab relative error: 4.274379e-09
Wh relative error: 5.858117e-09
Wx relative error: 1.590657e-06
b relative error: 9.727211e-10
b_proj relative error: 1.934807e-08
b_vocab relative error: 7.087097e-11

Overfitting a small dataset
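
For reference, the run below was produced with roughly the following setup (a sketch from memory of the notebook's CaptioningSolver usage; the exact hyperparameters may differ):

small_data = load_coco_data(max_train=50)  # from the assignment's coco_utils

small_rnn_model = CaptioningRNN(
    cell_type='rnn',
    word_to_idx=data['word_to_idx'],
    input_dim=data['train_features'].shape[1],
    hidden_dim=512,
    wordvec_dim=256,
)

small_rnn_solver = CaptioningSolver(
    small_rnn_model, small_data,
    update_rule='adam',
    num_epochs=50,
    batch_size=25,
    optim_config={'learning_rate': 5e-3},
    lr_decay=0.95,
    verbose=True, print_every=10,
)
small_rnn_solver.train()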

(Iteration 1 / 100) loss: 76.913487
(Iteration 11 / 100) loss: 21.063245
(Iteration 21 / 100) loss: 4.016209
(Iteration 31 / 100) loss: 0.567061
(Iteration 41 / 100) loss: 0.239461
(Iteration 51 / 100) loss: 0.162024
(Iteration 61 / 100) loss: 0.111548
(Iteration 71 / 100) loss: 0.097589
(Iteration 81 / 100) loss: 0.099104
(Iteration 91 / 100) loss: 0.073981

(Training-loss curve for the overfitting run.)

Success!

Test-time sampling

In this task, the RNN's input at every training step is the ground-truth word, while at test time each step's input is a word generated by the model itself (greedy argmax here; the papers above use beam search). Generating word by word this way can easily amplify an early mistake.
The approach in Show and Tell: A Neural Image Caption Generator is to randomly feed in generated words instead of ground-truth words during training; the authors argue this forces the model to learn to cope with its own errors.
This part of the assignment illustrates that difference.

    def sample(self, features, max_length=30):
        """
        Run a test-time forward pass for the model, sampling captions for input
        feature vectors.

        At each timestep, we embed the current word, pass it and the previous hidden
        state to the RNN to get the next hidden state, use the hidden state to get
        scores for all vocab words, and choose the word with the highest score as
        the next word. The initial hidden state is computed by applying an affine
        transform to the input image features, and the initial word is the <START>
        token.

        For LSTMs you will also have to keep track of the cell state; in that case
        the initial cell state should be zero.

        Inputs:
        - features: Input image features, of shape (N, D)
        - max_length: Maximum length of the generated captions

        Returns:
        - captions: Sampled captions, of shape (N, max_length); each element
          is a word index.
        """
        N = features.shape[0]
        captions = self._null * np.ones((N, max_length), dtype=np.int32)

        # Unpack parameters
        W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
        W_embed = self.params['W_embed']
        Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
        W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

        ###########################################################################
        # TODO: Implement test-time sampling for the model. You will need to      #
        # initialize the hidden state of the RNN by applying the learned affine   #
        # transform to the input image features. The first word that you feed to  #
        # the RNN should be the <START> token; its value is stored in the         #
        # variable self._start. At each timestep you will need to do to:          #
        # (1) Embed the previous word using the learned word embeddings           #
        # (2) Make an RNN step using the previous hidden state and the embedded   #
        #     current word to get the next hidden state.                          #
        # (3) Apply the learned affine transformation to the next hidden state to #
        #     get scores for all words in the vocabulary                          #
        # (4) Select the word with the highest score as the next word, writing it #
        #     (the word index) to the appropriate slot in the captions variable   #
        #                                                                         #
        # For simplicity, you do not need to stop generating after an <END> token #
        # is sampled, but you can if you want to.                                 #
        #                                                                         #
        # HINT: You will not be able to use the rnn_forward or lstm_forward       #
        # functions; you'll need to call rnn_step_forward or lstm_step_forward in #
        # a loop.                                                                 #
        #                                                                         #
        # NOTE: we are still working over minibatches in this function. Also if   #
        # you are using an LSTM, initialize the first cell state to zeros.        #
        ###########################################################################
        # (1) Affine transform from the image features to the initial hidden state, shape (N, H)
        N, D = features.shape
        out, cache_affine = temporal_affine_forward(features.reshape(N, 1, D), W_proj, b_proj)
        N, T, H = out.shape
        h0 = out.reshape(N, H)
        h = h0

        # Initial input: the <START> token for every sequence in the minibatch.
        # (The original code hardcoded index 1 for a batch of size 2; using
        # self._start works for any batch size.)
        start = np.full(N, self._start, dtype=np.int32)
        x_input = W_embed[start, :]
        captions[:, 0] = start
        prev_c = np.zeros_like(h)  # only used by the LSTM

        # (2) Run one RNN/LSTM step per generated word
        for i in range(max_length - 1):
            if self.cell_type == 'rnn':
                next_h, _ = rnn_step_forward(x_input, h, Wx, Wh, b)
            else:
                next_h, next_c, _ = lstm_step_forward(x_input, h, prev_c, Wx, Wh, b)
                prev_c = next_c

            # (3) Scores over the vocabulary for this timestep
            out_vo, cache_vo = temporal_affine_forward(next_h.reshape(N, 1, H), W_vocab, b_vocab)

            # (4) Greedily take the highest-scoring word as the next word
            index = np.argmax(out_vo, axis=2).reshape(N)
            x_input = W_embed[index, :]
            h = next_h
            captions[:, i + 1] = index  # record the word index
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return captions

(Sampled captions overlaid on training and validation images were shown here.)

The results above show this difference clearly.

The inline question asks about the difference between word-level and character-level tokenization and the pros and cons of each. Tokenization for various languages is in fact still an active area of research.

Reposted from: https://www.cnblogs.com/bernieloveslife/p/10221140.html
