CS224n - Assignment1 - Sentiment Analysis

yyyybupt

Article link: https://blog.csdn.net/qq_41747565/article/details/94410905

Code link: the original assignment code with a few small modifications so that it runs on Python 3.6.

4 Sentiment Analysis (20')

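For each sentence in the Stanford Sentiment Treebank dataset, we use the average of all the word vectors in that sentence as its features and predict the sentiment level from them.

We train a softmax classifier and run train / dev validation to improve how well the classifier generalizes.

(a) Sentence feature representation: take the average of the word vectors in the sentence

See q4_sentiment.py for details: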

import numpy as np

def getSentenceFeatures(tokens, wordVectors, sentence):
    """
    Obtain the sentence feature for sentiment analysis by averaging its word vectors.
    :param tokens: a dictionary that maps words to their indices in the word vector list
    :param wordVectors: word vectors (each row) for all tokens
    :param sentence: a list of words in the sentence of interest
    :return: sentVector: feature vector for the sentence
    """
    sentVector = np.zeros((wordVectors.shape[1],))

    ### YOUR CODE HERE
    # Sum the vector of every word, then divide by the number of words.
    for s in sentence:
        sentVector += wordVectors[tokens[s], :]
    sentVector *= 1.0 / len(sentence)
    ### END YOUR CODE

    assert sentVector.shape == (wordVectors.shape[1],)
    return sentVector
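
The same average can be written without an explicit loop. The following is a minimal sketch, not part of the assignment code: the function name sentence_features_vectorized and the toy vocabulary are made up for illustration.

```python
import numpy as np

def sentence_features_vectorized(tokens, word_vectors, sentence):
    """Average the word vectors of `sentence` with one indexing step."""
    # Look up the row index of every word in the sentence.
    indices = [tokens[word] for word in sentence]
    # Select those rows and average over the word axis.
    return word_vectors[indices].mean(axis=0)

# Toy data, invented for this example only.
tokens = {"a": 0, "warm": 1, "film": 2}
word_vectors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [2.0, 2.0]])
features = sentence_features_vectorized(tokens, word_vectors, ["a", "warm", "film"])
print(features)  # [1. 1.]
```

A word that appears twice in the sentence is selected twice, so it is weighted exactly as in the loop version.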

(b) Why regularize:

To avoid overfitting and improve generalization to unseen examples.
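
To make the effect of regularization concrete, here is a minimal sketch of a softmax cost with an L2 penalty. It assumes the common form 0.5 * reg * sum(W^2); the function names and the toy inputs are hypothetical, and the assignment's own cost function is not shown in this post.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def regularized_softmax_cost(features, labels, weights, reg):
    """Mean cross-entropy plus an L2 penalty on the weights (assumed form)."""
    probs = softmax(features @ weights)
    n = features.shape[0]
    # Negative log-probability of the correct class, averaged over examples.
    data_cost = -np.log(probs[np.arange(n), labels]).mean()
    # The penalty grows with the size of the weights, which discourages
    # the model from fitting the training set too closely.
    penalty = 0.5 * reg * np.sum(weights ** 2)
    return data_cost + penalty

# Toy inputs, invented for this example only.
features = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
weights = np.array([[2.0, 0.0], [0.0, 2.0]])

cost_plain = regularized_softmax_cost(features, labels, weights, reg=0.0)
cost_reg = regularized_softmax_cost(features, labels, weights, reg=0.1)
print(cost_reg - cost_plain)  # the penalty term, 0.5 * 0.1 * 8 = 0.4 up to rounding
```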

Searching for the "best" regularization parameter:

def getRegularizationValues():
    """Try different regularizations
    :return: sorted: a sorted list of values to try
    """
    values = None  # Assign a list of floats in the block below
    ### YOUR CODE HERE
    # 100 values spaced evenly on a log scale from 10^-4 to 10^2.
    values = np.logspace(-4, 2, num=100, base=10)
    ### END YOUR CODE
    return sorted(values)
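
Once a classifier has been trained for each value, the "best" one is usually the one with the highest dev accuracy. The sketch below assumes each run is stored as a dictionary with "reg", "train" and "dev" keys; that layout, the function name, and the numbers are placeholders for illustration, not results from the assignment.

```python
def choose_best_by_dev(results):
    """Return the run with the highest dev accuracy (the first one wins ties)."""
    return max(results, key=lambda run: run["dev"])

# Placeholder numbers, invented for this example only.
results = [
    {"reg": 0.01, "train": 0.40, "dev": 0.30},
    {"reg": 1.0, "train": 0.38, "dev": 0.36},
    {"reg": 100.0, "train": 0.29, "dev": 0.28},
]
best = choose_best_by_dev(results)
print(best["reg"])  # 1.0
```

Selecting on dev accuracy rather than training accuracy is what keeps the search from simply rewarding the least regularized model.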

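(1) numpy.arange creates a range of evenly spaced values and returns it as an ndarray:

numpy.arange(start, stop, step, dtype)

Parameter	Description
`start`	Start value; default `0`
`stop`	End value (not included)
`step`	Step size; default `1`
`dtype`	Data type of the returned `ndarray`; if not given, it is inferred from the input arguments.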
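
A short example of numpy.arange, with values chosen only for illustration:

```python
import numpy as np

# Integers from 0 up to (but not including) 10, in steps of 2.
ints = np.arange(0, 10, 2)
print(ints.tolist())    # [0, 2, 4, 6, 8]

# Float arguments give a float array; the stop value 1.0 is still excluded.
floats = np.arange(0.0, 1.0, 0.25)
print(floats.tolist())  # [0.0, 0.25, 0.5, 0.75]
```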

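(2) numpy.linspace creates a one-dimensional array whose elements form an arithmetic sequence:

np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)

Parameter	Description
`start`	Start value of the sequence
`stop`	End value of the sequence; included in the sequence if `endpoint` is `True`
`num`	Number of evenly spaced samples to generate; default `50`
`endpoint`	If `True`, the sequence includes the `stop` value, otherwise it does not; default `True`.
`retstep`	If `True`, the spacing between samples is returned together with the array, otherwise only the array is returned.
`dtype`	Data type of the `ndarray`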
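
A short example of numpy.linspace showing the effect of endpoint and retstep, with values chosen only for illustration:

```python
import numpy as np

# Five samples from 0 to 1, with the stop value included.
closed = np.linspace(0.0, 1.0, num=5)
print(closed.tolist())       # [0.0, 0.25, 0.5, 0.75, 1.0]

# With endpoint=False the stop value is left out, so the spacing shrinks to 0.2.
half_open = np.linspace(0.0, 1.0, num=5, endpoint=False)

# With retstep=True the function returns a (samples, step) pair.
samples, step = np.linspace(0.0, 1.0, num=5, retstep=True)
print(step)                  # 0.25
```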

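(3) numpy.logspace creates a geometric sequence:

logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None)

The base parameter is the base of the logarithm: the samples are evenly spaced in log space, and each one equals base raised to an exponent between start and stop.

Parameter	Description
`start`	The sequence starts at base ** start
`stop`	The sequence ends at base ** stop; this value is included in the sequence if `endpoint` is `True`
`num`	Number of samples to generate, evenly spaced on a log scale; default `50`
`endpoint`	If `True`, the sequence includes the `stop` value, otherwise it does not; default `True`.
`base`	Base of the logarithm.
`dtype`	Data type of the `ndarray`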
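
A short example of numpy.logspace, including the grid used in getRegularizationValues. It also checks that logspace gives the same values as raising the base to the output of linspace:

```python
import numpy as np

# One sample per decade from 10^-4 to 10^2.
decades = np.logspace(-4, 2, num=7)

# Same values, written as the base raised to evenly spaced exponents.
same = 10.0 ** np.linspace(-4, 2, num=7)
print(np.allclose(decades, same))   # True

# Consecutive samples have a constant ratio, i.e. a geometric sequence.
ratios = decades[1:] / decades[:-1]

# The regularization grid from the code above: 100 values in [1e-4, 1e2].
grid = np.logspace(-4, 2, num=100, base=10)
print(len(grid))                    # 100
```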

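(d) Running the model

Add the Python install location to the system PATH environment variable
Open a command prompt and change to the folder that contains q4_sentiment.py
Run python q4_sentiment.py --yourvectors to train the model with your own word vectors
Run python q4_sentiment.py --pretrained to train the model with the GloVe word vectors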
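
For readers curious how a script can accept exactly one of these two flags, here is a minimal sketch using argparse. It is an assumption about one reasonable way to do it, not a copy of q4_sentiment.py; the function name parse_vector_choice is hypothetical.

```python
import argparse

def parse_vector_choice(argv):
    """Parse the flags, requiring exactly one of --pretrained / --yourvectors."""
    parser = argparse.ArgumentParser(description="Choose which word vectors to use.")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--pretrained", action="store_true",
                       help="use pretrained GloVe vectors")
    group.add_argument("--yourvectors", action="store_true",
                       help="use the word vectors you trained yourself")
    return parser.parse_args(argv)

args = parse_vector_choice(["--pretrained"])
print(args.pretrained, args.yourvectors)  # True False
```

Note the space before each flag on the command line; without it the shell passes a single unrecognized file name to Python.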

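We expect the pretrained vectors to give better results, because:

Higher-dimensional word vectors can encode more information
The GloVe vectors were trained on a much larger corpus
The GloVe and Word2Vec training methods differ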

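(e) For the pretrained GloVe vectors, plot classification accuracy on the training set and the dev set against the regularization value, and save the plot as q4_reg_acc.png.

As the regularization parameter grows, the model moves from overfitting to the best fit and then to underfitting; the best fit is reached at a regularization value of 10^(1)
Overfitting -> best fit: as the regularization parameter goes from 10^(-4) to 10^(1), training accuracy drops slightly and dev accuracy rises
Best fit -> underfitting: as the regularization parameter goes from 10^(1) to 10^(2), accuracy drops on both the training set and the dev set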

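(f) Running python q4_sentiment.py --pretrained also produces the image q4_dev_conf.png:

The cells on the main diagonal, outlined in blue in the figure, are the correctly predicted cases
The farther a cell lies from the blue outline, the worse the prediction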
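
The idea behind the confusion matrix can be sketched in a few lines. This example assumes five sentiment classes labeled 0 to 4, with rows as true labels and columns as predicted labels; the function name and the label lists are made up for illustration and are not the assignment's plotting code.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count examples per (true label, predicted label) pair."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true, pred in zip(true_labels, predicted_labels):
        matrix[true, pred] += 1
    return matrix

# Invented labels, for this example only.
true_labels = [4, 4, 3, 0, 1]
predicted_labels = [4, 1, 3, 0, 2]
cm = confusion_matrix(true_labels, predicted_labels, num_classes=5)

# Correct predictions sit on the diagonal.
correct = int(np.trace(cm))
print(correct)  # 3

# Distance from the diagonal measures how far off each prediction is.
classes = np.arange(5)
distance = np.abs(classes[:, None] - classes[None, :])
mean_distance = (cm * distance).sum() / cm.sum()
print(mean_distance)  # 0.8
```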

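(g) Choose 3 examples, covering both cases where the classifier is wrong and cases where it is right, and briefly explain the cause of each error and what features would be needed to classify it correctly

Each example is listed as true label, predicted label, sentence.

Correct example: 4 4 a warm , funny , engaging film .

Incorrect example: 4 1 it 's refreshing to see a girl-power movie that does n't feel it has to prove anything .

Analysis: averaging the word vectors discards word order and cannot handle the negation "does n't".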
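
The loss of word order is easy to demonstrate. In the sketch below the word vectors are random stand-ins rather than trained vectors, and the two short sentences are invented for illustration; they have opposite meanings but the same averaged feature vector.

```python
import numpy as np

# Random stand-in vectors (not trained), fixed seed for repeatability.
rng = np.random.RandomState(0)
vectors = {word: rng.randn(3) for word in ["good", "not", "bad"]}

def average_vector(sentence):
    """Average the vectors of the words in `sentence`."""
    return np.mean([vectors[word] for word in sentence], axis=0)

first = average_vector(["good", "not", "bad"])   # roughly "good, not bad"
second = average_vector(["bad", "not", "good"])  # roughly "bad, not good"

# Same bag of words, so the averaged features are identical.
print(np.allclose(first, second))  # True
```

Any classifier built on these features has to give both sentences the same label, which is why a feature that tracks what a negation applies to would be needed.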
