CS224n Assignment 2 Reference Solutions

Assignment #2 solution, by Jonariguez

The code for all programming questions has been uploaded to github/CS224n/Jonariguez.
1a

Solution:
(Hint: the keepdims argument makes this more convenient.)

def softmax(x):
    """
    Compute the softmax function in tensorflow.

    You might find the tensorflow functions tf.exp, tf.reduce_max,
    tf.reduce_sum, tf.expand_dims useful. (Many solutions are possible, so you may
    not need to use all of these functions). Recall also that many common
    tensorflow operations are sugared (e.g. x * y does a tensor multiplication
    if x and y are both tensors). Make sure to implement the numerical stability
    fixes as in the previous homework!

    Args:
        x:   tf.Tensor with shape (n_samples, n_features). Note feature vectors are
                  represented by row-vectors. (For simplicity, no need to handle 1-d
                  input as in the previous homework)
    Returns:
        out: tf.Tensor with shape (n_sample, n_features). You need to construct this
                  tensor in this problem.
    """

    ### YOUR CODE HERE
    """
    跟作业1一样,要先减去每行的最大值,然后再做softmax
    用keepdims=True可以保持之前的形状,而不会变成行向量
    """
    x_max = tf.reduce_max(x,axis=1,keepdims=True)
    x = tf.exp(x-x_max)
    x_sum = tf.reduce_sum(x,axis=1,keepdims=True)
    out = x/x_sum
    #out = x/tf.reshape(tf.reduce_sum(x,axis=1),(x.shape[0],1))
    ### END YOUR CODE

    return out
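A quick sanity check of the function above (a sketch assuming TF 1.x; the test values are made up):

import numpy as np
import tensorflow as tf

with tf.Session() as sess:
    logits = tf.constant([[1001., 1002.], [3., 4.]], dtype=tf.float32)
    probs = sess.run(softmax(logits))
    # Each row sums to 1, and the large inputs do not overflow thanks to the max-subtraction trick.
    assert np.allclose(probs, [[0.26894, 0.73106], [0.26894, 0.73106]], atol=1e-4)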

1b

Solution:
(Useful background:

  • tf.multiply() is element-wise multiplication and requires both tensors to have the same shape.
  • tf.matmul() is matrix multiplication.
    Both require the two operands to have the same element dtype; see the small sketch below.)
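A tiny illustration of the difference (a sketch assuming TF 1.x; the values are hypothetical):

import tensorflow as tf

a = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[10., 20.], [30., 40.]])

elementwise = tf.multiply(a, b)  # [[10, 40], [90, 160]] -- same shape required
matrix_prod = tf.matmul(a, b)    # [[70, 100], [150, 220]] -- ordinary matrix product

with tf.Session() as sess:
    print(sess.run([elementwise, matrix_prod]))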
def cross_entropy_loss(y, yhat):
    """
    Compute the cross entropy loss in tensorflow.
    The loss should be summed over the current minibatch.

    y is a one-hot tensor of shape (n_samples, n_classes) and yhat is a tensor
    of shape (n_samples, n_classes). y should be of dtype tf.int32, and yhat should
    be of dtype tf.float32.

    The functions tf.to_float, tf.reduce_sum, and tf.log might prove useful. (Many
    solutions are possible, so you may not need to use all of these functions).

    Note: You are NOT allowed to use the tensorflow built-in cross-entropy
                functions.

    Args:
        y:    tf.Tensor with shape (n_samples, n_classes). One-hot encoded.
        yhat: tf.Tensorwith shape (n_sample, n_classes). Each row encodes a
                    probability distribution and should sum to 1.
    Returns:
        out:  tf.Tensor with shape (1,) (Scalar output). You need to construct this
                    tensor in the problem.
    """

    ### YOUR CODE HERE
    """
    y和y_hat的第一维都是n_samples,这是一个batch的大小,也即batch_size,那么对于
    每一个sample都要计算一个交叉熵(标量,实数),然后再把这n_samples个交叉熵求和,最终也是个标量(实数)
    """
    single_CE = tf.multiply(tf.log(yhat),tf.to_float(y))
    out = tf.negative(tf.reduce_sum(single_CE))
    ### END YOUR CODE

    return out
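Again a quick sanity check (a sketch assuming TF 1.x; the values are made up):

import numpy as np
import tensorflow as tf

with tf.Session() as sess:
    y = tf.constant([[0, 1], [1, 0]], dtype=tf.int32)
    yhat = tf.constant([[0.5, 0.5], [0.5, 0.5]], dtype=tf.float32)
    loss = sess.run(cross_entropy_loss(y, yhat))
    # Two samples, each contributing -log(0.5), summed over the minibatch.
    assert np.isclose(loss, -2 * np.log(0.5))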

1c

Solution:
Placeholders and feed_dict let us feed data into the computation graph dynamically at run time. (TensorFlow 1.x builds a static graph: the graph is defined first and only executed later inside a session.)

def add_placeholders(self):
    """Generates placeholder variables to represent the input tensors.

    These placeholders are used as inputs by the rest of the model building
    and will be fed data during training.

    Adds following nodes to the computational graph

    input_placeholder: Input placeholder tensor of shape
                                          (batch_size, n_features), type tf.float32
    labels_placeholder: Labels placeholder tensor of shape
                                          (batch_size, n_classes), type tf.int32

    Add these placeholders to self as the instance variables
        self.input_placeholder
        self.labels_placeholder
    """
    ### YOUR CODE HERE
    self.input_placeholder = tf.placeholder(tf.float32,shape=[self.config.batch_size,self.config.n_features],name='input_placeholder')
    self.labels_placeholder = tf.placeholder(tf.int32,shape=[self.config.batch_size,self.config.n_classes],name='labels_placeholder')
    ### END YOUR CODE
def create_feed_dict(self, inputs_batch, labels_batch=None):
    """Creates the feed_dict for training the given step.

    A feed_dict takes the form of:
    feed_dict = {
            <placeholder>: <tensor of values to be passed for placeholder>,
            ....
    }

    If label_batch is None, then no labels are added to feed_dict.

    Hint: The keys for the feed_dict should be the placeholder
            tensors created in add_placeholders.

    Args:
        inputs_batch: A batch of input data.
        labels_batch: A batch of label data.
    Returns:
        feed_dict: The feed dictionary mapping from placeholders to values.
    """
    ### YOUR CODE HERE
    """
    feed_dict其实就是python里面字典类型
    注意:feed_dict的键是我们之前定义过的tf.placeholder对象,而不是tf.placeholder的str类型的名字
    """
    feed_dict = {
   
        self.input_placeholder:inputs_batch,
        self.labels_placeholder:labels_batch
    }
    ### END YOUR CODE
    return feed_dict
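A sketch of how the placeholders and the feed dict fit together at run time (model.train_op, model.loss, model.pred, sess, and the batch variables are assumed names for illustration, not handout code):

# Inside a hypothetical training loop:
feed = model.create_feed_dict(inputs_batch, labels_batch=labels_batch)
_, loss = sess.run([model.train_op, model.loss], feed_dict=feed)

# At prediction time no labels are available, so they are simply left out of the feed dict:
feed = model.create_feed_dict(inputs_batch)
predictions = sess.run(model.pred, feed_dict=feed)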

1d

def add_prediction_op(self):
     """Adds the core transformation for this model which transforms a batch of input
     data into a batch of predictions. In this case, the transformation is a linear layer plus a
     softmax transformation:

     y = softmax(Wx + b)

     Hint: Make sure to create tf.Variables as needed.
     Hint: For this simple use-case, it's sufficient to initialize both weights W
                 and biases b with zeros.

     Args:
         input_data: A tensor of shape (batch_size, n_features).
     Returns:
         pred: A tensor of shape (batch_size, n_classes)
     """
     ### YOUR CODE HERE
     """
     x是输入,即占位符input_placeholder
     而W和b是要定义的变量,也是要训练的变量
     pred = softmax(xW+b)
     """
     with tf.variable_scope('softmax_classifier'):
         W = tf.Variable(tf.zeros([self.config.n_features,self.config.n_classes],dtype=tf.float32))
         b = tf.Variable(tf.zeros([self.config.n_classes],dtype=tf.float32))
         # print(W.name)
         # print(b.name)
         Z = tf.matmul(self.input_placeholder,W)+b
         pred = softmax(Z)
     ### END YOUR CODE
     return pred
def add_loss_op(self, pred):
        """Adds cross_entropy_loss ops to the computational graph.

        Hint: Use the cross_entropy_loss function we defined. This should be a very
                    short function.
        Args:
            pred: A tensor of shape (batch_size, n_classes)
        Returns:
            loss: A 0-d tensor (scalar)
        """
        ### YOUR CODE HERE
        """
        因为我们已经在q1_softmax.py中定义并实现了cross_entropy_loss()函数,所以这里可以直接调用
        self.labels_placeholder 是"喂"进来的真实标记
        pred    是我们预测的
        """
        loss = cross_entropy_loss(self.labels_placeholder,pred)
        ### END YOUR CODE
        return loss

1e

def add_training_op(self, loss):
    """Sets up the training Ops.

    Creates an optimizer and applies the gradients to all trainable variables.
    The Op returned by this function is what must be passed to the
    `sess.run()` call to cause the model to train. See

    https://www.tensorflow.org/versions/r0.7/api_docs/python/train.html#Optimizer

    for more information.

    Hint: Use tf.train.GradientDescentOptimizer to get an optimizer object.
                Calling optimizer.minimize() will return a train_op object.

    Args:
        loss: Loss tensor, from cross_entropy_loss.
    Returns:
        train_op: The Op for training.
    """
    ### YOUR CODE HERE
    # Create a GradientDescentOptimizer and return the training Op produced by minimize().
    train_op = tf.train.GradientDescentOptimizer(self.config.lr).minimize(loss)
    ### END YOUR CODE
    return train_op

Solution:
TensorFlow's automatic differentiation means we only define the nodes of the computation graph; we never implement gradient computation ourselves. Backpropagation and differentiation are carried out automatically by TensorFlow.
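A minimal illustration of this automatic differentiation (a sketch assuming TF 1.x; the toy function is made up):

import tensorflow as tf

x = tf.Variable(3.0)
y = x * x + 2.0 * x              # y = x^2 + 2x

grad = tf.gradients(y, [x])[0]   # dy/dx = 2x + 2, derived automatically from the graph

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))        # 8.0 when x = 3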

2

2a
Solution:

stack                           | buffer                | new dependency      | transition
[ROOT, parsed, this]            | [sentence, correctly] |                     | SHIFT
[ROOT, parsed, this, sentence]  | [correctly]           |                     | SHIFT
[ROOT, parsed, sentence]        | [correctly]           | sentence -> this    | LEFT-ARC
[ROOT, parsed]                  | [correctly]           | parsed -> sentence  | RIGHT-ARC
[ROOT, parsed, correctly]       | []                    |                     | SHIFT
[ROOT, parsed]                  | []                    | parsed -> correctly | RIGHT-ARC
[ROOT]                          | []                    | ROOT -> parsed      | RIGHT-ARC

2b
Solution:
2n steps in total.

  • Every word must be pushed onto the stack exactly once, so there are n SHIFT steps.
  • At the end only ROOT remains on the stack, and every LEFT-ARC/RIGHT-ARC removes exactly one word from the stack, so together there are n arc steps.

2c

def __init__(self, sentence):
    """Initializes this partial parse.

    Your code should initialize the following fields:
        self.stack: The current stack represented as a list with the top of the stack as the
                    last element of the list.
        self.buffer: The current buffer represented as a list with the first item on the
                     buffer as the first item of the list
        self.dependencies: The list of dependencies produced so far. Represented as a list of
                tuples where each tuple is of the form (head, dependent).
                Order for this list doesn't matter.

    The root token should be represented with the string "ROOT"

    Args:
        sentence: The sentence to be parsed as a list of words.
                  Your code should not modify the sentence.
    """
    # The sentence being parsed is kept for bookkeeping purposes. Do not use it in your code.
    self.sentence = sentence

    ### YOUR CODE HERE
    self.stack = ['ROOT']
    # Do NOT write self.buffer = sentence: that would make self.buffer a reference to
    # sentence, so mutating self.buffer would also mutate sentence, which violates the
    # requirement above ("Your code should not modify the sentence").
    self.buffer = [word for word in self.sentence]
    self.dependencies = []
    ### END YOUR CODE

def parse_step(self, transition):
        """Performs a single parse step by applying the given transition to this partial parse

        Args:
            transition: A string that equals "S", "LA", or "RA" representing the shift, left-arc,
                        and right-arc transitions.
        """
        ### YOUR CODE HERE
        """
        重申一下操作:
        S   从buffer的最左边取出一个word,放入stack最右边
        LA  将stack最右边的word作为head,第二个word作为dependent
        RA  将stack最右边的word作为dependent,第二个word作为head
        LA和RA操作都在self.dependencies中添加(head,dependent)元组
        利用list的pop()函数具有返回值的特性可以简化代码
        """
        if transition=='S':
            self.stack.append(self.buffer.pop(0))
        elif transition=='LA':
            dependent = (self.stack[-1],self.stack.pop(-2))
            self.dependencies.append(dependent)
        else :
            dependent = (self.stack[-2],self.stack.pop(-1))
            self.dependencies.append(dependent)
        ### END YOUR CODE
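A quick check of __init__ and parse_step together, replaying the tail of the 2a table (a sketch; the shortened sentence omits the word "I" for brevity):

pp = PartialParse(["parsed", "this", "sentence", "correctly"])
for t in ["S", "S", "S", "LA", "RA", "S", "RA", "RA"]:   # 2n = 8 transitions for n = 4 words
    pp.parse_step(t)
print(pp.dependencies)
# [('sentence', 'this'), ('parsed', 'sentence'), ('parsed', 'correctly'), ('ROOT', 'parsed')]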

2d


def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.

    Args:
        sentences: A list of sentences to be parsed (each sentence is a list of words)
        model: The model that makes parsing decisions. It is assumed to have a function
               model.predict(partial_parses) that takes in a list of PartialParses as input and
               returns a list of transitions predicted for each parse. That is, after calling
                   transitions = model.predict(partial_parses)
               transitions[i] will be the next transition to apply to partial_parses[i].
        batch_size: The number of PartialParses to include in each minibatch
    Returns:
        dependencies: A list where each element is the dependencies list for a parsed sentence.
                      Ordering should be the same as in sentences (i.e., dependencies[i] should
                      contain the parse for sentences[i]).
    """

    ### YOUR CODE HERE
    start_idx,end_idx=0,0
    PartialParses = [PartialParse(sentence) for sentence in sentences]
    dependencies = []
    while end_idx<len(sentences):
        end_idx = min(start_idx+batch_size,len(sentences))
        # Take the next batch_size partial parses
        # (a PartialParse was created above for every sentence).
        batch_PartialParses = PartialParses[start_idx:end_idx]
        # Use the model to predict the next transition for every parse in the batch
        # (transitions[i] belongs to batch_PartialParses[i]).
        # Note: model.predict(x) returns only ONE transition per parser, so each call
        # advances every parse in the batch by a single step.
        while len(batch_PartialParses)>0:
            transitions = model.predict(batch_PartialParses)
            for i in range(len(transitions)):
                batch_PartialParses[i].parse_step(transitions[i])
            # Drop the parses that have finished and keep the unfinished ones.
            # A parse is finished when its buffer is empty and its stack contains only ROOT
            # (i.e. len(buffer) == 0 and len(stack) == 1).
            batch_PartialParses = [parse for parse in batch_PartialParses if len(parse.buffer)>0 or len(parse.stack)>1]

        dependencies.extend([parse.dependencies for parse in PartialParses[start_idx:end_idx]])
        # Remember to advance start_idx to the next batch.
        start_idx+=batch_size
    ### END YOUR CODE

    return dependencies
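A tiny smoke test for minibatch_parse (a sketch; DummyModel below is a made-up stand-in that shifts while a parse still has buffered words and otherwise right-arcs, not the model from the assignment's tests):

class DummyModel(object):
    def predict(self, partial_parses):
        return ["S" if len(p.buffer) > 0 else "RA" for p in partial_parses]

deps = minibatch_parse([["right", "arcs", "only"]], DummyModel(), batch_size=1)
print(deps)  # [[('arcs', 'only'), ('right', 'arcs'), ('ROOT', 'right')]]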

2e

def xavier_weight_init():
    """Returns function that creates random tensor.

    The specified function will take in a shape (tuple or 1-d array) and
    returns a random tensor of the specified shape drawn from the
    Xavier initialization distribution.

    Hint: You might find tf.random_uniform useful.
    """
    def _xavier_initializer(shape, **kwargs):
        """Defines an initializer for the Xavier distribution.
        Specifically, the output should be sampled uniformly from [-epsilon, epsilon] where
            epsilon = sqrt(6) / <sum of the sizes of shape's dimensions>
        e.g., if shape = (2, 3), epsilon = sqrt(6 / (2 + 3))

        This function will be used as a variable initializer.

        Args:
            shape: Tuple or 1-d array that species the dimensions of the requested tensor.
        Returns:
            out: tf.Tensor of specified shape sampled from the Xavier distribution.
        """
        ### YOUR CODE HERE
        epsilon = tf.sqrt(6.0 / tf.to_float(tf.reduce_sum(shape)))
        # Return a plain tensor (not a tf.Variable); the caller wraps it in a Variable if needed.
        out = tf.random_uniform(shape, minval=-epsilon, maxval=epsilon)
        ### END YOUR CODE
        return out
    # Returns defined initializer function.
    return _xavier_initializer
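Typical usage (a sketch; the shape below is hypothetical):

xavier_init = xavier_weight_init()
# A weight matrix sampled uniformly from [-eps, eps] with eps = sqrt(6 / (100 + 200)).
W = tf.Variable(xavier_init((100, 200)))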

2f
Solution:
$$\mathbb{E}_{p_{drop}}[\mathbf{h}_{drop}] = \mathbb{E}_{p_{drop}}[\gamma\, \mathbf{d}\circ\mathbf{h}] = p_{drop}\cdot\vec{0} + (1-p_{drop})\cdot\gamma\cdot\mathbf{h}$$
Requiring this expectation to equal $\mathbf{h}$ gives
$$\gamma = \frac{1}{1-p_{drop}}$$
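A quick numerical check of this expectation (a NumPy sketch; p_drop and h are made-up values):

import numpy as np

p_drop = 0.3
gamma = 1.0 / (1.0 - p_drop)
h = np.array([1.0, 2.0, 3.0])

# Average gamma * d * h over many random dropout masks d (1 = keep, 0 = drop).
masks = np.random.binomial(1, 1.0 - p_drop, size=(100000, h.size))
print((gamma * masks * h).mean(axis=0))  # close to [1.0, 2.0, 3.0], i.e. E[h_drop] ≈ h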

2g1

Solution:
$\mathbf{m}$ is an exponentially weighted average of all previous gradients (update directions), so it reflects the overall trend of the gradient rather than any single noisy estimate. Using $\mathbf{m}$ instead of the raw gradient reduces the variance of the updates and damps oscillation.
$\beta_1$ is typically chosen close to 1.

2g2
Solution:

  • Update direction $\mathbf{m}$: a moving (exponentially weighted) average of the gradients.
  • Scaling term $\mathbf{v}$: a moving average of the element-wise squared gradients, which adapts the per-parameter learning rate.

Since the update is divided by $\sqrt{\mathbf{v}}$, the parameters whose gradients have been smallest on average receive the largest relative updates. In other words, the model can still make rapid progress toward the optimum even in flat regions of the loss where the gradients are small; a sketch of the update rule follows below.
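A minimal NumPy sketch of this Adam-style update (bias correction is omitted; the hyperparameter values are the common defaults but are assumptions here, not from the handout):

import numpy as np

beta1, beta2, lr, eps = 0.9, 0.999, 0.001, 1e-8

def adam_step(theta, grad, m, v):
    m = beta1 * m + (1 - beta1) * grad           # moving average of gradients (update direction)
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of squared gradients (per-parameter scale)
    theta = theta - lr * m / (np.sqrt(v) + eps)  # small v  =>  larger effective step
    return theta, m, v

theta = np.zeros(3)
m, v = np.zeros_like(theta), np.zeros_like(theta)
theta, m, v = adam_step(theta, np.array([0.1, 0.01, 1.0]), m, v)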

2h
Solution:
My results:

Epoch 10 out of 10
924/924 [============================>.] - ETA: 0s - train loss: 0.0654
Evaluating on dev set - dev UAS: 88.37
New best dev UAS! Saving model in ./data/weights/parser.weights

===========================================================================
TESTING
===========================================================================
Restoring the best model weights found on the dev set
Final evaluation on test set
- test UAS: 88.84

Running time: about 15 minutes.

3
Reading the problem
First, pin down the dimensions of all the quantities involved.
The problem states that $x^{(t)}$ is a one-hot row vector and that the hidden state is also a row vector, so:
$$x^{(t)} \in \mathbb{R}^{1\times |V|}, \qquad h^{(t)} \in \mathbb{R}^{1\times D_h}$$

$\hat{y}^{(t)}$ is the output, i.e. the probability distribution over the vocabulary (after the softmax), so:
$$\hat{y}^{(t)} \in \mathbb{R}^{1\times |V|}$$
From this we get:
$$L \in \mathbb{R}^{|V|\times d}, \quad e^{(t)} \in \mathbb{R}^{1\times d}, \quad I \in \mathbb{R}^{d\times D_h}, \quad H \in \mathbb{R}^{D_h\times D_h}, \quad b_1 \in \mathbb{R}^{1\times D_h}, \quad U \in \mathbb{R}^{D_h\times |V|}, \quad b_2 \in \mathbb{R}^{1\times |V|}$$

where $d$ is the word-vector size, i.e. embed_size in the code.
Keeping these dimensions in mind makes the later derivatives much easier to follow.

Because sentences have different lengths and the loss is defined per word, the per-sentence loss is the sum over its words; to compare models we average that sum to obtain the mean loss per word.
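A small NumPy sketch of one forward step that checks these shapes (the layer sizes are made up; the recurrence $e^{(t)} = x^{(t)}L$, $h^{(t)} = \sigma(h^{(t-1)}H + e^{(t)}I + b_1)$, $\hat{y}^{(t)} = \mathrm{softmax}(h^{(t)}U + b_2)$ is the one consistent with the dimensions listed above):

import numpy as np

V, d, D_h = 10, 5, 7                    # hypothetical |V|, embed_size, hidden size
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

x = np.zeros((1, V)); x[0, 3] = 1.0     # one-hot row vector x^(t)
L = np.random.randn(V, d)
I, H = np.random.randn(d, D_h), np.random.randn(D_h, D_h)
b1, U, b2 = np.random.randn(1, D_h), np.random.randn(D_h, V), np.random.randn(1, V)
h_prev = np.zeros((1, D_h))

e = x.dot(L)                                 # (1, d)
h = sigmoid(h_prev.dot(H) + e.dot(I) + b1)   # (1, D_h)
y_hat = softmax(h.dot(U) + b2)               # (1, |V|), sums to 1
print(e.shape, h.shape, y_hat.shape)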

3a
Solution:
Since the label $y^{(t)}$ is a one-hot vector, suppose its true class is $k$. Then:
$$J^{(t)}(\theta) = CE(y^{(t)}, \hat{y}^{(t)}) = -\log \hat{y}^{(t)}_k = \log \frac{1}{\hat{y}^{(t)}_k}$$
