Classifying Spam Messages with Naive Bayes

Latest recommended article published on 2023-11-20 17:30:37

薛宝钗倒拔垂杨柳


Article link: https://blog.csdn.net/qq_51808547/article/details/128035876


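Split the given dataset message.csv into a training set and a test set, create a naive Bayes model with the sklearn.naive_bayes.MultinomialNB class, fit it on the training data, then use the fitted model to classify the test data and evaluate how well it classifies.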

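The message.csv dataset contains a large number of SMS messages. Each row has two fields: the message text and the message category (1 or 0). A category of 1 marks a spam message.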
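Before working on the real file, the workflow described above can be sketched end to end on a few made-up messages. This is only a minimal sketch: the `texts` and `labels` lists are hypothetical stand-ins for the two columns of the dataset, and chaining the vectorizer and classifier with `make_pipeline` is one convenient option, not something the task requires.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real file: text plus label (1 = spam)
texts = [
    "free prize claim your reward today",
    "winner urgent cash prize waiting",
    "claim free cash reward urgent",
    "free winner prize claim today",
    "meeting moved to friday afternoon",
    "please review the attached report",
    "lunch tomorrow with the project team",
    "report deadline moved to friday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratified split so both classes appear in the test set
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# TF-IDF features feed directly into multinomial naive Bayes;
# the sparse matrix does not need to be converted to a dense array
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(x_train, y_train)

predictions = model.predict(x_test)
print(predictions, model.score(x_test, y_test))
```

Because the pipeline fits the vectorizer on the training texts only, the test texts are transformed with the training vocabulary, which is the same separation the full script keeps by calling `fit_transform` on the training set and `transform` on the test set.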

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
sms = pd.read_csv(r'messages.csv')
sms_data = sms.iloc[:,0]
sms_label = sms.iloc[:,1]
# Replace every non-letter character with a space, then split into words
sms_data_clear = []
for line in sms_data:
    # Rebuild the line character by character so every non-letter becomes a space
    cleaned = ''.join(char if char.isalpha() else ' ' for char in line)
    tempList = cleaned.split(" ")
    # Append the tokenized line to the list of cleaned data
    sms_data_clear.append(tempList)
# Drop words of three letters or fewer and tokens that are not purely alphabetic
sms_data_clear2 = []
for line in sms_data_clear:
    tempList = []
    for word in line:
        if word != '' and len(word) > 3 and word.isalpha():
            tempList.append(word)
    tempString = ' '.join(tempList)
    sms_data_clear2.append(tempString)
sms_data_clear = sms_data_clear2
# Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(sms_data_clear2,sms_label,test_size=0.25,random_state=0,stratify=sms_label)
# Vectorize the text with TF-IDF
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(x_train)
X_test = tfidf.transform(x_test)
X_train = X_train.toarray()
X_test = X_test.toarray()
# X_train.shape
# # Print the nonzero entries
# for i in range(X_train.shape[0]):
#     for j in range(X_train.shape[1]):
#         if X_train[i][j] != 0:
#             print(i,j,X_train[i][j])
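# Build the model; the task calls for MultinomialNB, which suits non-negative TF-IDF features
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
model = mnb.fit(X_train, y_train)
y_predict = model.predict(X_test)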
# Print the classification metrics
from sklearn.metrics import classification_report
# classification_report expects the true labels first, then the predictions
cr = classification_report(y_test, y_predict, target_names=['normal', 'spam'], output_dict=True)
print('Accuracy:', cr['accuracy'])
print('Precision (normal):', cr['normal']['precision'])
print('Recall (normal):', cr['normal']['recall'])
print('F1 score (normal):', cr['normal']['f1-score'])
print('Precision (spam):', cr['spam']['precision'])
print('Recall (spam):', cr['spam']['recall'])
print('F1 score (spam):', cr['spam']['f1-score'])
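'''
The numeric examples below come from a three-class problem with 5 test samples.
support: the number of test samples whose true class is that of the current row; a support of 1 for class 0 means the test set contains one class-0 sample.
precision: TP / (TP + FP); of all samples predicted as a class, the fraction that truly belong to it.
recall: TP / (TP + FN); of all test samples that truly belong to a class, the fraction that were predicted correctly.
f1-score: F1 = 2 * precision * recall / (precision + recall)
micro avg: the metric computed over all samples pooled together; if 3 of the 5 samples are predicted correctly, micro avg is 3/5 = 0.6
macro avg: the unweighted mean of the per-class values; for precision, (0.50 + 0.00 + 1.00) / 3 = 0.5
weighted avg: the mean of the per-class values weighted by support, so classes with more test samples count for more; for precision, (0.50*1 + 0.0*1 + 1.0*3) / 5 = 0.70
'''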
# print(cr)
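The averaging rules in the notes above can be checked by hand. The sketch below assumes a small three-class example (the `y_true` and `y_pred` lists are illustrative, chosen so that the per-class precisions are 0.50, 0.00 and 1.00 with supports 1, 1 and 3) and recomputes the macro and weighted averages of precision directly from the per-class values.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative labels: 5 samples, 3 classes, 3 correct predictions
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

# Per-class precision, recall, F1 and support (one entry per class)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, zero_division=0
)

# Macro average: plain mean of the per-class precisions
macro_precision = precision.mean()
# Weighted average: per-class precisions weighted by class support
weighted_precision = np.average(precision, weights=support)
# Fraction of all samples predicted correctly
accuracy = accuracy_score(y_true, y_pred)

print('per-class precision:', precision)
print('support:', support)
print('macro:', macro_precision)
print('weighted:', weighted_precision)
print('accuracy:', accuracy)
```

For single-label problems like this one, the micro-averaged precision equals the overall accuracy, which is why both come out as 3/5 here.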