Machine Learning Notes: Naive Bayes Case Study, Sentiment Analysis of Product Reviews

會須一飲三百杯

Tags: machine learning, notes, artificial intelligence

Article link: https://blog.csdn.net/qq_58251587/article/details/133528079

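This post shows how to run text sentiment analysis with Python libraries such as pandas and numpy, together with sklearn's CountVectorizer and MultinomialNB: read the CSV data, preprocess it, handle stop words, extract features, then classify book reviews as positive or negative and evaluate the model.

Summary generated automatically by CSDN.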

import pandas as pd
import numpy as np
import jieba
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer  # text feature extraction
from sklearn.naive_bayes import MultinomialNB  # naive Bayes classifier

Load the data

data = pd.read_csv("./data/书籍评价.csv", encoding="gbk")
data
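The `encoding="gbk"` argument matters because the CSV holds Chinese text that was saved as GBK rather than UTF-8. A minimal sketch of that round trip, using a throwaway file with two made-up rows (the file name `demo_reviews.csv` and its contents are assumptions; only the column names `内容` and `评价` come from this post):

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for ./data/书籍评价.csv, written as GBK on purpose
    csv_path = Path(tmp) / "demo_reviews.csv"
    csv_path.write_text("内容,评价\n很好看,好评\n不推荐,差评\n", encoding="gbk")

    # Reading with the matching encoding recovers the Chinese headers and cells
    demo = pd.read_csv(csv_path, encoding="gbk")

print(demo)
```

If the encoding does not match the file, pandas will usually raise a decode error or return garbled text, so it is worth checking the column names right after loading.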

Basic data processing

Extract the content column ("内容") for later analysis

content=data["内容"]
content

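Convert the 好评 (positive) and 差评 (negative) labels to numbers

data.loc[:, "评价"]  # the colon selects all rows; "评价" selects the rating column

# Much like an SQL UPDATE ... WHERE: for rows whose "评价" is "好评", set "评论编号" to 1.
# The "评论编号" column does not exist yet, so this assignment creates it.
data.loc[data.loc[:, "评价"] == "好评", "评论编号"] = 1
# For rows whose "评价" is "差评", set "评论编号" to 0
data.loc[data.loc[:, "评价"] == "差评", "评论编号"] = 0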
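The two masked assignments can also be written as a single lookup. The sketch below is an alternative, not what the post does: it uses `Series.map` on three made-up rows (the `df` frame and `label_map` name are assumptions) and produces an integer column directly.

```python
import pandas as pd

# Hypothetical stand-in rows; the real file has the same "评价" column
df = pd.DataFrame({"评价": ["好评", "差评", "好评"]})

# Map each label string to an integer: 好评 (positive) -> 1, 差评 (negative) -> 0
label_map = {"好评": 1, "差评": 0}
df["评论编号"] = df["评价"].map(label_map)

print(df["评论编号"].tolist())
```

Any label missing from `label_map` would become NaN, which makes unexpected values in the rating column easy to spot.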

Load the stop words

stopwords = []  # empty list that will hold the stop words read from the file
with open("./data/stopwords.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()  # read every line into the list `lines`; each element is one line of the file
    print(lines)

stopwords = []
with open("./data/stopwords.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

    for tmp in lines:
        # strip() removes whitespace at both ends of the text (including \n and \t)
        line = tmp.strip()
        stopwords.append(line)
stopwords
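The read-strip-append loop can be folded into a small helper. This is a sketch under assumptions: `load_stopwords` is a hypothetical name, and unlike the loop above it also skips blank lines. The demo writes a temporary file so it runs without the real `./data` folder.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Return the non-empty, stripped lines of a UTF-8 text file as a list."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stop-word file: one blank line and one padded entry
    demo_path = Path(tmp) / "stopwords_demo.txt"
    demo_path.write_text("的\n了\n\n 是 \n", encoding="utf-8")
    demo_stopwords = load_stopwords(demo_path)

print(demo_stopwords)
```

With the real file the call would be `load_stopwords("./data/stopwords.txt")`.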

Convert the content into a standard format

comment_list = []

for tmp in content:
    # print(tmp)
    # split each sentence into individual words
    seg_list = jieba.cut(tmp, cut_all=False)
    # print(seg_list)
    seg_str = ",".join(seg_list)
    # print(seg_str)
    comment_list.append(seg_str)

# Count word occurrences
con = CountVectorizer(stop_words=stopwords)
X = con.fit_transform(comment_list)
X.toarray()

con.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
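One detail worth knowing about `CountVectorizer`: its default `token_pattern` only keeps tokens of two or more word characters, so single-character Chinese words in `seg_str` are dropped before counting. The sketch below uses two made-up, pre-segmented comments (an assumption, not rows from the dataset) to show the effect, and a looser pattern as one possible workaround.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-segmented comments, joined with commas like seg_str above
docs = ["这,本书,很,好看", "这,本书,不,好看"]

# Default token_pattern keeps only tokens with two or more word characters
default_vec = CountVectorizer()
default_vec.fit(docs)
default_vocab = sorted(default_vec.vocabulary_)

# A looser pattern that also keeps single-character tokens such as 不 ("not")
loose_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
loose_vec.fit(docs)
loose_vocab = sorted(loose_vec.vocabulary_)

print(default_vocab)
print(loose_vocab)
```

With the default pattern the two comments become identical count vectors, because the negation 不 is discarded. Whether that matters for the book-review data depends on how jieba segments it.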

Prepare the training and test sets

x_train = X.toarray()[:10, :]  # first ten rows as the training set
y_train = data["评价"][:10]

x_test = X.toarray()[10:, :]  # remaining three rows as the test set
y_test = data["评价"][10:]
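Slicing by position gives a fair split only if the rows are not ordered by label. As an alternative sketch, not the method used in this post, `train_test_split` can shuffle and stratify the same 10/3 split. The arrays below are made-up placeholders with 13 rows, and the 7/6 class balance is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (13 rows, 2 features) and labels
X_demo = np.arange(26).reshape(13, 2)
y_demo = np.array(["好评"] * 7 + ["差评"] * 6)

# test_size=3 keeps three rows for testing; stratify keeps both classes represented
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible between runs.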

Train the model

mb = MultinomialNB(alpha=1)

mb.fit(x_train, y_train)

y_pre = mb.predict(x_test)

print("Predicted:", y_pre)
print("Actual:", y_test)
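The `alpha=1` argument is Laplace (add-one) smoothing: each word count gets `alpha` added before the per-class word probabilities are computed, so a word never seen in a class does not force the probability to zero. A minimal sketch on a made-up count matrix (all numbers are assumptions) that checks the hand computation against what `MultinomialNB` stores in `feature_log_prob_`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 comments, 3 vocabulary words
X_demo = np.array([[2, 0, 1],
                   [1, 0, 0],
                   [0, 3, 0],
                   [0, 1, 1]])
y_demo = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

alpha = 1.0
model = MultinomialNB(alpha=alpha).fit(X_demo, y_demo)

# Manual smoothed estimate of P(word | class 1):
# (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
counts_pos = X_demo[y_demo == 1].sum(axis=0)
manual = (counts_pos + alpha) / (counts_pos.sum() + alpha * X_demo.shape[1])

# The fitted model stores the same quantities as logs, one row per class
pos_index = list(model.classes_).index(1)
sklearn_probs = np.exp(model.feature_log_prob_[pos_index])

print(manual)
print(sklearn_probs)
```

The second word never appears in a positive comment, yet its smoothed probability is 1/7 rather than 0.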

Evaluate the model

mb.score(x_test, y_test)  # accuracy; 1.0 (100%) on the three test rows in the author's run
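For a classifier, `score` returns accuracy, the fraction of test rows predicted correctly. With only three test rows a single mistake moves it by a third, so a confusion matrix can be a useful companion. The sketch below uses made-up labels and predictions (assumptions, not the post's results) to show both.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for three test rows
y_true_demo = np.array(["好评", "差评", "差评"])
y_pred_demo = np.array(["好评", "差评", "好评"])

# Accuracy by hand: share of positions where prediction equals truth
manual_acc = float(np.mean(y_true_demo == y_pred_demo))
sk_acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true labels, columns are predicted labels, in the order given
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["好评", "差评"])

print(manual_acc, sk_acc)
print(cm)
```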

The dataset and stop-word list used here are available at:

Link: https://pan.baidu.com/s/1BPMBUuo6V2N39Y5f9BBP-g?pwd=1111
Extraction code: 1111
