Sentiment Classification Example: Logistic Regression and SVC

Artoria____

Article link: https://blog.csdn.net/artoria_qzh/article/details/103364041

Sentiment classification is, at heart, an ordinary classification problem, so this post tackles it with two classification models.

# Load the dataset
import pandas as pd

# Label positive reviews as 1 and negative reviews as 0
dfpos = pd.read_excel('./购物评论.xlsx', sheet_name="正向", header=None)
dfpos['y'] = 1
dfneg = pd.read_excel('./购物评论.xlsx', sheet_name="负向", header=None)
dfneg['y'] = 0
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df0 = pd.concat([dfpos, dfneg], ignore_index=True)

df0.head()

Next, tokenize each review and store the tokenized text in a separate column.

# Tokenization and preprocessing
import jieba

# Some stop words may carry sentiment,
# so stop words are not removed here.
cuttxt = lambda x: ' '.join(jieba.lcut(x))
df0['cleantxt'] = df0[0].apply(cuttxt)
df0.head()

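The following steps use sklearn, which expects a document-term (d2m) matrix as input.
The tokenized corpus therefore has to be converted into a d2m matrix, with a first round of filtering applied along the way.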
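
To make the idea of a document-term matrix concrete, here is a minimal plain-Python sketch. The helper `build_doc_term_matrix` and the toy English tokens are assumptions made for illustration only; the real pipeline uses jieba-tokenized Chinese reviews, and sklearn's `CountVectorizer` returns a sparse matrix and applies its own tokenization rules.

```python
from collections import Counter


def build_doc_term_matrix(docs, min_df=1):
    """Build a dense count matrix from whitespace-tokenized documents.

    Hypothetical helper for illustration: it keeps only terms that occur
    in at least `min_df` documents, then counts them per document.
    """
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)
    index = {term: i for i, term in enumerate(vocab)}

    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term in tokens:
            if term in index:  # terms filtered out by min_df are skipped
                row[index[term]] += 1
        matrix.append(row)
    return vocab, matrix


docs = ["quality good good", "quality bad", "shipping good"]
vocab, matrix = build_doc_term_matrix(docs, min_df=2)
print(vocab)   # terms that survive the min_df filter
print(matrix)  # one row per document, one column per surviving term
```

Note that the filter removes columns (terms), not rows: every document still gets a row, even if all of its terms were dropped.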

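# Convert to a document-term matrix
from sklearn.feature_extraction.text import CountVectorizer

# min_df=5 keeps only terms that appear in at least 5 reviews.
# The default token_pattern also drops single-character tokens.
countvec = CountVectorizer(min_df=5)
# Fit the vocabulary and transform the corpus
wordmtx = countvec.fit_transform(df0.cleantxt)
wordmtx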

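The resulting matrix covers 20582 reviews in total (min_df filters vocabulary terms, not reviews). It is then split into a training set and a test set.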

# Split into training and test sets
from sklearn.model_selection import train_test_split

# test_size=0.4 gives a 6:4 train/test split
x_train, x_test, y_train, y_test = train_test_split(wordmtx, df0.y,
                                                    test_size=0.4, random_state=1013)
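
As a rough illustration of what a seeded shuffle split does, the sketch below splits row indices in plain Python. `split_indices` is a hypothetical helper, not sklearn's implementation, so the indices it picks will not match those chosen by `train_test_split`; it only shows how `test_size=0.4` translates into a 60/40 division and why a fixed seed makes the split repeatable.

```python
import random


def split_indices(n_samples, test_size=0.4, seed=1013):
    """Shuffle row indices and split them into train and test parts.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the result is repeatable
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.4)
print(len(train_idx), len(test_idx))
```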

The data can now be fed into the models for training.

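1. Logistic Regression

The overall steps are the same as above, so only the final result is shown here.

The parameters of logistic regression are explained in this post:
https://blog.csdn.net/jark_/article/details/78342644

The dataset is not large, so the liblinear solver is used. liblinear was the default solver in scikit-learn releases before 0.22; later releases default to lbfgs, so pass solver='liblinear' explicitly to get the same behavior.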

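# Predict with a logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# liblinear was the default solver before scikit-learn 0.22
logitmodel = LogisticRegression(solver='liblinear')
# Fit the model
logitmodel.fit(x_train, y_train)
print(classification_report(y_test, logitmodel.predict(x_test)))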

The result is better than naive Bayes (not covered in this post), with precision reaching 0.90.
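
For readers unsure what the numbers in a classification report mean, here is a minimal sketch that computes precision, recall and F1 for one class by hand. The function name `precision_recall_f1` and the toy labels are made up for illustration; they are not the scores from this dataset.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything truly positive, how much was found?
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```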

2. SVC

from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Same split as before (same random_state)
x_train, x_test, y_train, y_test = train_test_split(wordmtx, df0.y,
                                                    test_size=0.4, random_state=1013)
clf = SVC(kernel='rbf', verbose=True)
clf.fit(x_train, y_train)
print(classification_report(y_test, clf.predict(x_test)))

Comparing the two algorithms:

SVC < logistic regression

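3. Model Optimization

The models above were trained on the d2m matrix that sklearn requires, but what CountVectorizer produces is only a term-frequency matrix, which ignores word order. word2vec is therefore used to try to improve the models.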
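
A tiny sketch of why a term-frequency representation loses word order: two token sequences with opposite meanings can end up with identical counts. The helper `bag_of_words` and the English tokens are assumptions for illustration.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count tokens, discarding the order in which they appeared."""
    return Counter(tokens)


a = ["good", "not", "bad"]
b = ["bad", "not", "good"]

print(a == b)                              # the sequences differ
print(bag_of_words(a) == bag_of_words(b))  # their counts do not
```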

word2vec expects its input as a list of lists (one token list per sentence), so the data first has to be converted into that format.

import jieba

df0['cut'] = df0[0].apply(jieba.lcut)
df0.head()

As before, split the dataset into a training set and a test set first.

# Split into training and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df0.cut, df0.y, test_size=0.3)

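Before training, set the vector dimensionality and the minimum number of occurrences a word needs to enter the vocabulary.
Once these are set, training can start; the number of training epochs is set to 1000.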

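from gensim.models.word2vec import Word2Vec

n_dim = 300  # dimensionality of the word vectors

# gensim >= 4.0 uses vector_size; older releases call this parameter size
w2vmodel = Word2Vec(vector_size=n_dim, min_count=10)
w2vmodel.build_vocab(x_train)         # build the vocabulary

w2vmodel.train(x_train,
               total_examples=w2vmodel.corpus_count, epochs=1000)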

Training took 13 minutes.
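
The sketch below illustrates two points from this step with plain Python: the list-of-lists input format, and the effect of a minimum-count threshold on the vocabulary. `build_vocab` here is a hypothetical stand-in, not gensim's method of the same name, which does considerably more; the toy sentences are made up.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Return the set of tokens that occur at least `min_count` times.

    Hypothetical helper; `sentences` is a list of token lists.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, c in counts.items() if c >= min_count}


# One inner list per sentence, one string per token
sentences = [
    ["quality", "good"],
    ["quality", "bad"],
    ["shipping", "slow", "quality"],
]
vocab = build_vocab(sentences, min_count=2)
print(vocab)
```

Words below the threshold get no vector at all, which is why any later lookup has to check whether a word is in the vocabulary.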

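One problem remains before the conversion:
each sentence now maps to many vectors, one 300-dimensional vector per word. The sentence vector is therefore built by simply averaging its word vectors.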

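import numpy as np

def m_avgvec(words, w2vmodel):
    # Average the in-vocabulary word vectors; fall back to zeros if none are found
    vecs = [w2vmodel.wv[w] for w in words if w in w2vmodel.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2vmodel.vector_size)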
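
The same averaging idea can be sketched without gensim, using a toy dictionary of 2-dimensional vectors. Everything here (`average_vector`, the `embeddings` dictionary, the zero-vector fallback for sentences with no known words) is an assumption made for illustration.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens found in `embeddings`.

    Tokens without a vector are skipped; if none are found, a zero
    vector is returned so every sentence still gets a vector of length `dim`.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    # zip(*vecs) groups the values dimension by dimension
    return [sum(column) / len(vecs) for column in zip(*vecs)]


embeddings = {"good": [1.0, 0.0], "cheap": [0.0, 1.0]}
sentence_vec = average_vector(["good", "cheap", "unknown"], embeddings, dim=2)
empty_vec = average_vector(["unknown"], embeddings, dim=2)
print(sentence_vec, empty_vec)
```

Averaging gives every sentence a fixed-length vector regardless of its length, which is what the classifiers need, at the cost of again discarding the order of the words.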

The training and test sets can now be converted.

# Convert the training set
train_vecs = pd.DataFrame([m_avgvec(s, w2vmodel) for s in x_train])

# Convert the test set
test_vecs = pd.DataFrame([m_avgvec(s, w2vmodel) for s in x_test])

With the conversion done, fit logistic regression and the SVM again.

Logistic regression

logitmodel.fit(train_vecs, y_train)
print(classification_report(y_test, logitmodel.predict(test_vecs)))

SVC

clf2 = SVC(kernel='rbf', verbose=True)  # assumption: same settings as the first SVC
clf2.fit(train_vecs, y_train)
print(classification_report(y_test, clf2.predict(test_vecs)))

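Compared with the results before optimization:

The precision of logistic regression dropped slightly, while the precision of the SVM improved markedly.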
