CS224N Exercises: Assignment 3.2, Recurrent Neural Nets for NER


Assignment #3

2. Recurrent neural nets for NER

Each RNN cell combines the hidden-state vector and the input using a sigmoid, and the hidden state at each time step is then used to predict the output:

e^{(t)} = x^{(t)} L
h^{(t)} = \sigma(h^{(t-1)} W_h + e^{(t)} W_x + b_1)
\hat{y}^{(t)} = \mathrm{softmax}(h^{(t)} U + b_2)

where L \in \mathbb{R}^{V\times D} is the matrix of word embeddings, W_h \in \mathbb{R}^{H\times H}, W_x \in \mathbb{R}^{D\times H} and b_1 \in \mathbb{R}^{H} are parameters of the RNN cell, and U \in \mathbb{R}^{H\times C} and b_2 \in \mathbb{R}^{C} are parameters of the softmax layer. As before, V is the size of the vocabulary, D is the size of the word embeddings, H is the size of the hidden layer, and C is the number of predicted classes (5 here).

To train the model, we use a cross-entropy loss on every predicted token:

J = \sum_{t=1}^{T} \mathrm{CE}(y^{(t)}, \hat{y}^{(t)}) = -\sum_{t=1}^{T} \sum_{i=1}^{C} y_i^{(t)} \log \hat{y}_i^{(t)}
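
To make these equations concrete, here is a minimal NumPy sketch of a single time step and its cross-entropy term. The sizes (D=3, H=4, C=5) and variable names are made up for illustration and are not part of the assignment code.

import numpy as np

# One RNN time step and its cross-entropy term (hypothetical sizes).
D, H, C = 3, 4, 5                                  # embedding size, hidden size, number of classes
rng = np.random.default_rng(0)

e_t = rng.normal(size=(1, D))                      # e^{(t)} = x^{(t)} L, the embedded input
h_prev = np.zeros((1, H))                          # h^{(t-1)}
W_x, W_h = rng.normal(size=(D, H)), rng.normal(size=(H, H))
b_1, U, b_2 = np.zeros(H), rng.normal(size=(H, C)), np.zeros(C)

h_t = 1.0 / (1.0 + np.exp(-(h_prev @ W_h + e_t @ W_x + b_1)))        # sigmoid hidden update
logits = h_t @ U + b_2
y_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax over C classes

y = np.zeros(C)
y[2] = 1.0                                         # one-hot gold label for this token
ce_t = -np.sum(y * np.log(y_hat))                  # CE(y^{(t)}, yhat^{(t)}); summing over t gives J
print(h_t.shape, y_hat.shape, float(ce_t))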

(a)

i. How many more parameters does the RNN model have compared with the window-based model?

It adds W_h with H×H parameters, while its W_x has only D×H parameters, compared with the window-based model's W, which has (2w+1)D×H parameters. The net difference is therefore H×H + D×H - (2w+1)D×H = H^2 - 2wDH parameters.

ii. What is the computational complexity of predicting the labels for a sequence of length T?

O((D+H)HT): each time step costs O(DH) for e^{(t)} W_x and O(H^2) for h^{(t-1)} W_h (plus a smaller O(HC) term for the softmax projection), and there are T time steps.

(b) Why is it hard to optimize F1 directly?

First, F_1 is not differentiable. Second, it is hard to optimize F_1 directly because it has to be computed from predictions over the entire corpus, which makes batching and parallelization very difficult.
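
For reference, F_1 is the harmonic mean of precision and recall, both computed from corpus-level counts of true positives (TP), false positives (FP) and false negatives (FN):

P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}, \quad F_1 = \frac{2PR}{P+R}

Since these counts depend on argmax predictions over the whole corpus, F_1 is a piecewise-constant function of the parameters, which is why it cannot be optimized directly with gradients.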

(c) Implement the RNN cell using the equations above.

    def __call__(self, inputs, state, scope=None):
        """Updates the state using the previous @state and @inputs.
        Remember the RNN equations are:

        h_t = sigmoid(x_t W_x + h_{t-1} W_h + b)

        TODO: In the code below, implement an RNN cell using @inputs
        (x_t above) and the state (h_{t-1} above).
            - Define W_x, W_h, b to be variables of the appropriate shape
              using the `tf.get_variable` function. Make sure you use
              the names "W_x", "W_h" and "b"!
            - Compute @new_state (h_t) defined above
        Tips:
            - Remember to initialize your matrices using the xavier
              initialization as before.
        Args:
            inputs: is the input vector of size [None, self.input_size]
            state: is the previous state vector of size [None, self.state_size]
            scope: is the name of the scope to be used when defining the variables inside.
        Returns:
            a pair of the output vector and the new state vector.
        """
        scope = scope or type(self).__name__

        # It's always a good idea to scope variables in functions lest they
        # be defined elsewhere!
        with tf.variable_scope(scope):
            # YOUR CODE HERE (~6-10 lines)
            W_x = tf.get_variable(initializer=tf.contrib.layers.xavier_initializer(),
                                  shape=[self.input_size, self.state_size],
                                  name='W_x')
            W_h = tf.get_variable(initializer=tf.contrib.layers.xavier_initializer(),
                                  shape=[self.state_size, self.state_size],
                                  name='W_h')
            b = tf.get_variable(initializer=tf.zeros(self.state_size),
                                name='b')
            new_state = tf.sigmoid(tf.matmul(inputs, W_x) + tf.matmul(state, W_h) + b)
            # END YOUR CODE ###
        # For an RNN, the output and state are the same (N.B. this
        # isn't true for an LSTM, though we aren't using one of those in
        # our assignment)
        output = new_state
        return output, new_state
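
For reference, a hypothetical sketch of how this cell could be wired up for a single step (TF1-style placeholders with made-up sizes; it assumes the RNNCell class above and the assignment's TensorFlow 1.x setup). The real unrolling happens in add_prediction_op further below.

# Hypothetical one-step wiring of the cell above (made-up sizes: input 6, hidden 4).
x_t = tf.placeholder(tf.float32, [None, 6])      # one time step of features, n_features * embed_size = 6
h_prev = tf.placeholder(tf.float32, [None, 4])   # previous hidden state, hidden_size = 4
cell = RNNCell(6, 4)                             # the cell defined above
output, h_t = cell(x_t, h_prev, scope="rnn_cell_demo")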

(d) Implementing an RNN requires us to unroll the computation over the whole sentence. Unfortunately, sentences can be of arbitrary length, which would cause the RNN to be unrolled a different number of times for different sentences and make it impossible to batch the data.

The most common way to address this problem is to pad the input with zeros. Suppose the longest input sequence is M tokens long; then for an input of length T we need to:

  1. Pad both x and y with zero vectors so that they are M tokens long. These zero vectors are still one-hot vectors, representing a new NULL token.
  2. Create a masking vector (m^{(t)})_{t=1}^{M} which is 1 for all t \leq T and 0 for all t > T. This mask lets us ignore the predictions the network makes on the padded inputs.
  3. Of course, extending the input and output by M-T tokens might change the loss and the gradient updates. To address this, we use the masking vector to change the loss to:

J = \sum_{t=1}^{M} m^{(t)} \, \mathrm{CE}(y^{(t)}, \hat{y}^{(t)})

i. How would the loss and gradient updates change if the masking vector were not used? How does the mask solve this problem?

The loss would include the model's performance on predicting the extra padded (NULL) labels, and gradients from the padded inputs would flow through the hidden states and affect how the parameters are learned. By masking the loss, we zero out the loss (and therefore the gradients) contributed by these extra labels, which solves the problem.
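
A tiny NumPy illustration of this point, with made-up per-token losses: averaging without the mask lets the padded positions distort the loss (and its gradients), while masking keeps only the real tokens.

import numpy as np

# Per-token cross-entropy losses for one padded sentence: T=3 real tokens, M=5 total.
token_losses = np.array([0.9, 0.4, 0.7, 2.5, 2.5])   # last two positions come from padding
mask = np.array([True, True, True, False, False])    # m^{(t)}: True for t <= T

unmasked_loss = token_losses.mean()        # 1.4: padding inflates the loss
masked_loss = token_losses[mask].mean()    # ~0.667: only real tokens contribute
print(unmasked_loss, masked_loss)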

ii. Implement pad_sequences.

def pad_sequences(data, max_length):
    """Ensures each input-output seqeunce pair in @data is of length
    @max_length by padding it with zeros and truncating the rest of the
    sequence.

    TODO: In the code below, for every sentence, labels pair in @data,
    (a) create a new sentence which appends zero feature vectors until
    the sentence is of length @max_length. If the sentence is longer
    than @max_length, simply truncate the sentence to be @max_length
    long.
    (b) create a new label sequence similarly.
    (c) create a _masking_ sequence that has a True wherever there was a
    token in the original sequence, and a False for every padded input.

    Example: for the (sentence, labels) pair: [[4,1], [6,0], [7,0]], [1,
    0, 0], and max_length = 5, we would construct
        - a new sentence: [[4,1], [6,0], [7,0], [0,0], [0,0]]
        - a new label sequence: [1, 0, 0, 4, 4], and
        - a masking sequence: [True, True, True, False, False].

    Args:
        data: is a list of (sentence, labels) tuples. @sentence is a list
            containing the words in the sentence and @label is a list of
            output labels. Each word is itself a list of
            @n_features features. For example, the sentence "Chris
            Manning is amazing" and labels "PER PER O O" would become
            ([[1,9], [2,9], [3,8], [4,8]], [1, 1, 4, 4]). Here "Chris"
            the word has been featurized as "[1, 9]", and "[1, 1, 4, 4]"
            is the list of labels. 
        max_length: the desired length for all input/output sequences.
    Returns:
        a new list of data points of the structure (sentence', labels', mask).
        Each of sentence', labels' and mask are of length @max_length.
        See the example above for more details.
    """
    ret = []

    # Use this zero vector when padding sequences.
    zero_vector = [0] * Config.n_features
    zero_label = 4 # corresponds to the 'O' tag

    for sentence, labels in data:
        # YOUR CODE HERE (~4-6 lines)
        length = len(sentence)
        if length > max_length:
            new_sentence = sentence[:max_length]
            new_label = labels[:max_length]
            mask = [True] * max_length
        else:
            new_sentence = sentence + [zero_vector] * (max_length - length)
            new_label = labels + [zero_label] * (max_length - length)
            mask = [True] * length + [False] * (max_length - length)
        ret.append((new_sentence, new_label, mask))
        # END YOUR CODE ###
    return ret
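
A quick sanity check of pad_sequences on the docstring example (this assumes Config.n_features == 2, so zero_vector is [0, 0]):

example = [([[4, 1], [6, 0], [7, 0]], [1, 0, 0])]
padded = pad_sequences(example, max_length=5)
# padded[0] should be:
#   ([[4, 1], [6, 0], [7, 0], [0, 0], [0, 0]],   # sentence padded with zero feature vectors
#    [1, 0, 0, 4, 4],                            # labels padded with the 'O' tag id (4)
#    [True, True, True, False, False])           # mask is True only for the real tokens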

(e) Implement the rest of the RNN model, assuming fixed-length inputs only. This includes:

1. Implement the add_placeholders, add_embedding and add_training_op functions.

class RNNModel(NERModel):
    """
    Implements a recursive neural network with an embedding layer and
    single hidden layer.
    This network will predict a sequence of labels (e.g. PER) for a
    given token (e.g. Henry) using a featurized window around the token.
    """

    def add_placeholders(self):
        """Generates placeholder variables to represent the input tensors

        These placeholders are used as inputs by the rest of the model building and will be fed
        data during training.  Note that when "None" is in a placeholder's shape, it's flexible
        (so we can use different batch sizes without rebuilding the model).

        Adds following nodes to the computational graph

        input_placeholder: Input placeholder tensor of  shape (None, self.max_length, n_features), type tf.int32
        labels_placeholder: Labels placeholder tensor of shape (None, self.max_length), type tf.int32
        mask_placeholder:  Mask placeholder tensor of shape (None, self.max_length), type tf.bool
        dropout_placeholder: Dropout value placeholder (scalar), type tf.float32

        TODO: Add these placeholders to self as the instance variables
            self.input_placeholder
            self.labels_placeholder
            self.mask_placeholder
            self.dropout_placeholder

        HINTS:
            - Remember to use self.max_length NOT Config.max_length

        (Don't change the variable names)
        """
        ### YOUR CODE HERE (~4-6 lines)
        self.input_placeholder = tf.placeholder(dtype=tf.int32,
                                                shape=[None, self.max_length, Config.n_features])
        self.labels_placeholder = tf.placeholder(dtype=tf.int32,
                                                 shape=[None, self.max_length])
        self.mask_placeholder = tf.placeholder(dtype=tf.bool,
                                               shape=[None, self.max_length])
        self.dropout_placeholder = tf.placeholder(dtype=tf.float32)
        ### END YOUR CODE


    def add_embedding(self):
        """Adds an embedding layer that maps from input tokens (integers) to vectors and then
        concatenates those vectors:

        TODO:
            - Create an embedding tensor and initialize it with self.pretrained_embeddings.
            - Use the input_placeholder to index into the embeddings tensor, resulting in a
              tensor of shape (None, max_length, n_features, embed_size).
            - Concatenates the embeddings by reshaping the embeddings tensor to shape
              (None, max_length, n_features * embed_size).

        HINTS:
            - You might find tf.nn.embedding_lookup useful.
            - You can use tf.reshape to concatenate the vectors. See
              following link to understand what -1 in a shape means.
              https://www.tensorflow.org/api_docs/python/array_ops/shapes_and_shaping#reshape.

        Returns:
            embeddings: tf.Tensor of shape (None, max_length, n_features*embed_size)
        """
        ### YOUR CODE HERE (~4-6 lines)
        embedding = tf.get_variable(initializer=self.pretrained_embeddings, name='embedding')
        embeddings_4d = tf.nn.embedding_lookup(embedding, self.input_placeholder)
        embeddings = tf.reshape(embeddings_4d, shape=[-1, self.max_length, Config.n_features * Config.embed_size])
        ### END YOUR CODE
        return embeddings


    def add_training_op(self, loss):
        """Sets up the training Ops.

        Creates an optimizer and applies the gradients to all trainable variables.
        The Op returned by this function is what must be passed to the
        `sess.run()` call to cause the model to train. See

        https://www.tensorflow.org/versions/r0.7/api_docs/python/train.html#Optimizer

        for more information.

        Use tf.train.AdamOptimizer for this model.
        Calling optimizer.minimize() will return a train_op object.

        Args:
            loss: Loss tensor, from cross_entropy_loss.
        Returns:
            train_op: The Op for training.
        """
        ### YOUR CODE HERE (~1-2 lines)
        train_op = tf.train.AdamOptimizer(Config.lr).minimize(loss)
        ### END YOUR CODE
        return train_op
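
To visualize the lookup-and-reshape in add_embedding above, here is a NumPy sketch of the same shape transformation with made-up sizes (batch=2, max_length=3, n_features=2, embed_size=4); none of these names come from the assignment code.

import numpy as np

ids = np.random.randint(0, 100, size=(2, 3, 2))   # token ids, analogous to input_placeholder
emb = np.random.randn(100, 4)                     # pretrained embedding matrix, V=100, D=4
looked_up = emb[ids]                              # (2, 3, 2, 4), like tf.nn.embedding_lookup
flat = looked_up.reshape(-1, 3, 2 * 4)            # (2, 3, 8) = (batch, max_length, n_features*embed_size)
print(looked_up.shape, flat.shape)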

2. Implement the add_prediction_op operation, which unrolls the RNN loop self.max_length times. Remember to reuse the variables in the variable scope from the second time step onward, so that the RNN cell weights W_x and W_h are shared across all time steps.

    def add_prediction_op(self):
        """Adds the unrolled RNN:
            h_0 = 0
            for t in 1 to T:
                o_t, h_t = cell(x_t, h_{t-1})
                o_drop_t = Dropout(o_t, dropout_rate)
                y_t = o_drop_t U + b_2

        TODO: There are quite a few things you'll need to do in this function:
            - Define the variables U, b_2.
            - Define the vector h as a constant and initialize it with
              zeros. See tf.zeros and tf.shape for information on how
              to initialize this variable to be of the right shape.
              https://www.tensorflow.org/api_docs/python/constant_op/constant_value_tensors#zeros
              https://www.tensorflow.org/api_docs/python/array_ops/shapes_and_shaping#shape
            - In a for loop, begin to unroll the RNN sequence. Collect
              the predictions in a list.
            - When unrolling the loop, from the second iteration
              onwards, you will HAVE to call
              tf.get_variable_scope().reuse_variables() so that you do
              not create new variables in the RNN cell.
              See https://www.tensorflow.org/versions/master/how_tos/variable_scope/
            - Concatenate and reshape the predictions into a predictions
              tensor.
        Hint: You will find the function tf.stack (similar to np.asarray)
              useful to assemble a list of tensors into a larger tensor.
              https://www.tensorflow.org/api_docs/python/array_ops/slicing_and_joining#pack
        Hint: You will find the function tf.transpose and the perms
              argument useful to shuffle the indices of the tensor.
              https://www.tensorflow.org/api_docs/python/array_ops/slicing_and_joining#transpose

        Remember:
            * Use the xavier initialization for matrices.
            * Note that tf.nn.dropout takes the keep probability (1 - p_drop) as an argument.
            The keep probability should be set to the value of self.dropout_placeholder

        Returns:
            pred: tf.Tensor of shape (batch_size, max_length, n_classes)
        """

        x = self.add_embedding()
        dropout_rate = self.dropout_placeholder

        preds = [] # Predicted output at each timestep should go here!

        # Use the cell defined below. For Q2, we will just be using the
        # RNNCell you defined, but for Q3, we will run this code again
        # with a GRU cell!
        if self.config.cell == "rnn":
            cell = RNNCell(Config.n_features * Config.embed_size, Config.hidden_size)
        elif self.config.cell == "gru":
            cell = GRUCell(Config.n_features * Config.embed_size, Config.hidden_size)
        else:
            raise ValueError("Unsupported cell type: " + self.config.cell)

        # Define U and b2 as variables.
        # Initialize state as vector of zeros.
        ### YOUR CODE HERE (~4-6 lines)
        U = tf.get_variable(initializer=tf.contrib.layers.xavier_initializer(),
                            shape=[Config.hidden_size, Config.n_classes],
                            name='U')
        b2 = tf.get_variable(initializer=tf.zeros(Config.n_classes), name='b2')
        h = tf.zeros(shape=(tf.shape(x)[0], Config.hidden_size), name='h')
        ### END YOUR CODE

        with tf.variable_scope("RNN"):
            for time_step in range(self.max_length):
                ### YOUR CODE HERE (~6-10 lines)
                if time_step > 0:
                    tf.get_variable_scope().reuse_variables()
                y_var, h = cell(x[:, time_step, :], h)  # feed and return h for the whole batch at once
                preds.append(tf.matmul(tf.nn.dropout(y_var, dropout_rate), U) + b2)
                ### END YOUR CODE

        # Make sure to reshape @preds here.
        ### YOUR CODE HERE (~2-4 lines)
        preds = tf.stack(preds)                         # (max_length, batch_size, n_classes)
        preds = tf.transpose(preds, perm=[1, 0, 2])     # (batch_size, max_length, n_classes)
        ### END YOUR CODE

        assert preds.get_shape().as_list() == [None, self.max_length, self.config.n_classes], "predictions are not of the right shape. Expected {}, got {}".format([None, self.max_length, self.config.n_classes], preds.get_shape().as_list())
        return preds
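
The stack-and-transpose step at the end can be checked with a small NumPy analogue (hypothetical sizes): a list of per-timestep predictions of shape (batch, n_classes) becomes a (batch, max_length, n_classes) tensor.

import numpy as np

per_step = [np.random.randn(2, 5) for _ in range(3)]   # one (batch, n_classes) prediction per time step
stacked = np.stack(per_step)                            # (max_length, batch, n_classes) = (3, 2, 5)
final = np.transpose(stacked, (1, 0, 2))                # (batch, max_length, n_classes) = (2, 3, 5)
print(stacked.shape, final.shape)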

3. Implement add_loss_op to handle the masking vectors returned by the previous part.

    def add_loss_op(self, preds):
        """Adds Ops for the loss function to the computational graph.

        TODO: Compute averaged cross entropy loss for the predictions.
        Importantly, you must ignore the loss for any masked tokens.

        Hint: You might find tf.boolean_mask useful to mask the losses on masked tokens.
        Hint: You can use tf.nn.sparse_softmax_cross_entropy_with_logits to simplify your
                    implementation. You might find tf.reduce_mean useful.
        Args:
            pred: A tensor of shape (batch_size, max_length, n_classes) containing the output of the neural
                  network before the softmax layer.
        Returns:
            loss: A 0-d tensor (scalar)
        """
        ### YOUR CODE HERE (~2-4 lines)
        #tf.dynamic_partition()
        mask_pred = tf.boolean_mask(preds, self.mask_placeholder)
        mask_label = tf.boolean_mask(self.labels_placeholder, self.mask_placeholder)
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=mask_pred,
                                                                             labels=mask_label))
        ### END YOUR CODE
        return loss
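
The way tf.boolean_mask flattens a 3-D logits tensor with a 2-D mask is easy to check with a NumPy analogue (made-up sizes: batch=2, max_length=3, n_classes=5):

import numpy as np

preds = np.random.randn(2, 3, 5)                                # logits from add_prediction_op
labels = np.random.randint(0, 5, size=(2, 3))
mask = np.array([[True, True, False], [True, False, False]])    # mask_placeholder analogue
masked_preds = preds[mask]                                      # (3, 5): one row per unmasked token
masked_labels = labels[mask]                                    # (3,)
print(masked_preds.shape, masked_labels.shape)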

(f) Train the model. This takes about two hours on a CPU and 10 to 20 minutes on a GPU; the F1 score on the dev set should be at least 85%.

(g) Describe two limitations of the RNN model above and propose improvements to overcome them.

  1. The RNN cannot use information from the future when making a prediction; a bidirectional RNN (biRNN) addresses this.
  2. The model does not enforce that adjacent tokens have consistent labels (e.g. tokens inside the same entity); a conditional random field (CRF) loss over the label sequence can be used.

 

On tf.get_variable_scope().reuse_variables():

https://blog.csdn.net/xinjieyuan/article/details/89789740

On tf.stack:

https://tensorflow.google.cn/api_docs/python/tf/stack

On tf.transpose:

https://tensorflow.google.cn/api_docs/python/tf/transpose

On tf.boolean_mask:

https://blog.csdn.net/m0_37393514/article/details/81674489

https://tensorflow.google.cn/api_docs/python/tf/boolean_mask
