Python训练文本情感分析模型

最新推荐文章于 2024-02-09 16:10:15 发布

S_Masons

最新推荐文章于 2024-02-09 16:10:15 发布

阅读量2.6w

点赞数 48

分类专栏：数据分析文章标签： python 文本分析模型训练

本文链接：https://blog.csdn.net/S_Masons/article/details/100009452

版权

数据分析专栏收录该内容

2 篇文章 4 订阅

订阅专栏

最近闲来无事，看了王树义老师的一篇文章《如何用Python和机器学习训练中文文本情感分类模型》，跟着步骤做了一个demo，此demo是爬取了美团用户的评论，对评论进行情感分析，收获很大，特此做下了笔记。

首先导入库

import pandas as pd
import numpy as np
from pandas import DataFrame, Series

读取评论数据，数据在这里

data = pd.read_csv("data.csv", encoding='GB18030')
data

数据如图所示
在这里插入图片描述

根据评分，使用lambda匿名函数，把评分>3的，取值1，当作正向情感，评分<3的，取值0，当作负向情感

def make_label(df):
    df["sentiment"] = df["star"].apply(lambda x: 1 if x > 3 else 0)

调用方法，并查看结果

make_label(data)
data

特征、标签分开赋值：

X = data[["comment"]]
y = data.sentiment

导入 jieba分词库，创建分词函数，将评论拆分，并用空格连接
通过 apply 调用函数，并新增列，填充值：

import jieba
def chinese_word_cut(mytext):
    return " ".join(jieba.cut(mytext))

X["cuted_comment"] = X.comment.apply(chinese_word_cut)

接下来要将一团的数据，拆分成训练数据集、测试数据集
从sklearn.model_selection导入数据拆分函数train_test_split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

查看数据集形状：

X_train.shape

可知道，train_test_split在默认模式下，训练数据集、测试数据集比例是3：1。

接下来要处理中文停用词，可使用第三方停用词表，能在这个GitHub上找到
创建停词函数，将停用词转成列表形式返回：

def get_custom_stopword(stop_word_file):
    with open(stop_word_file) as f:
        stop_word = f.read()
        
    stop_word_list = stop_word.split("/n")
    custom_stopword = [i for i in stop_word_list]
    return custom_stopword

stopwords = get_custom_stopword("哈工大停用词表.txt")

导入 CountVectorizer函数，将中文词语向量化：

from sklearn.feature_extraction.text import CountVectorizer

默认参数向量化

vect = CountVectorizer()
term_matrix = DataFrame(vect.fit_transform(X_train.cuted_comment).toarray(), columns=vect.get_feature_names())

数据集长这样：
在这里插入图片描述

向量化后的数据集

term_matrix.shape

(1500, 7305)

发现有特别多的数字存在，这些数字并不具有特征性，若是不过滤，会影响后面模型训练的效果

将一部分特征向量过滤，
给CountVectorizer添加参数，并重新对数据集向量化：

max_df = 0.8 # 在超过这一比例的文档中出现的关键词（过于平凡），去除掉。
min_df = 3 # 在低于这一数量的文档中出现的关键词（过于独特），去除掉。

vect = CountVectorizer(max_df = max_df,
                       min_df = min_df,
                       token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',
                       stop_words=frozenset(stopwords))

term_matrix = DataFrame(vect.fit_transform(X_train.cuted_comment).toarray(), columns=vect.get_feature_names())

向量化后的数据集

term_matrix.shape

(1500, 1972)

过滤了很多词汇，好棒！

训练集已向量化完成，现在使用此特征矩阵训练模型
导入 朴素贝叶斯函数，建立分类模型

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

注意，我们的处理数据流程是：
1、特征向量化
2、贝叶斯分类
如果每修改一次参数，就要重新运行以上函数，会十分头疼

sklearn提供了一个管道pipeline功能，能将顺序工作连接起来

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(vect, nb)
pipe

现在我们就可以把pipe当成一个完整的模型来使用了。

将未特征向量化的数据输入，验证模型的准确率：

from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy').mean()

得分：

0.8333673633410742

到此为止，模型已经初步搭建好了。【鼓掌】

但是我们用的都是训练过的数据集来测试的，准确率真的有这么高吗，来，我们进行下一步测试。

先用训练集拟合数据：

pipe.fit(X_train.cuted_comment, y_train)

测试集预测结果：

y_pred = pipe.predict(x_test.cuted_comment)

结果是，都是 0，1，0，1…：
在这里插入图片描述

使用 metrics测度工具查看评分

from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)

评分：

0.866

结果显示，我们的模型对未曾见过的数据，预测的精确度达86.6%。

混淆矩阵验证

metrics.confusion_matrix(y_test, y_pred)

结果

array([[200,  37],
       [ 30, 233]], dtype=int64)

混淆矩阵中的数字从上到下，从左到右分别表示：

本来是正向，预测也是正向
本来是正向，预测却是反向
本来是反向，预测确实正向
本来是反向，预测也是反向

可见我们的模型性能还是挺不错的。

下面我们来用 snowNLP 来做对比：

from snownlp import SnowNLP
def get_sentiment(text):
    return SnowNLP(text).sentiments

使用测试集跑一遍：

y_pred_snow = X_test.comment.apply(get_sentiment)

结果：
在这里插入图片描述
snowNLP 返回的结果是0-1之间的数，而不是0、1，因此我们需要将数据转换一下，大于0.5为1，小于0.5为0。

y_pred_snow_norm = y_pred_snow.apply(lambda x: 1 if x>0.5 else 0)

结果：
在这里插入图片描述
这下好看多啦。
查看snowNLP的评分：

metrics.accuracy_score(y_test, y_pred_snow_norm)

0.77

en，比我们的模型差点。。

混淆矩阵：

metrics.confusion_matrix(y_test, y_pred_snow_norm)

array([[189,  48],
       [ 67, 196]], dtype=int64)

en，确实比我们的模型差点。。

以上。

S_Masons

关注

48
点赞
踩
429

收藏

觉得还不错? 一键收藏
19
评论
Python训练文本情感分析模型

最近闲来无事，看了王树义老师的一篇文章《如何用Python和机器学习训练中文文本情感分类模型》，跟着步骤做了一个demo，此demo是爬取了美团用户的评论，对评论进行情感分析，收获很大，特此做下了笔记。首先导入库import pandas as pdimport numpy as npfrom pandas import DataFrame, Series读取评论数据，数据在 ...
复制链接

扫一扫