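Classifying Spam Messages with Naive Bayes
Task:
Split the given dataset message.csv into a training set and a test set, create a naive Bayes model with the sklearn.naive_bayes.MultinomialNB class, fit it on the training data, then use the fitted model to classify the test data and evaluate the classification performance.
The message.csv dataset contains a large number of SMS messages. Each row has two fields: the message text and the message label (1 or 0), where a label of 1 marks spam.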
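The workflow in the task statement (split, vectorize, fit MultinomialNB, predict, evaluate) can be sketched on a tiny made-up dataset before running it on the real file. This is only a minimal sketch: the texts, the labels, and the helper name run_toy_workflow are invented for illustration and are not taken from message.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, invented for illustration (not from message.csv)
toy_texts = [
    "free prize claim your reward today",
    "winner claim free cash prize today",
    "free cash reward claim prize winner",
    "claim your free prize reward cash",
    "meeting moved tomorrow morning office",
    "lunch tomorrow with project team",
    "project meeting notes from office",
    "team lunch moved from morning",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = not spam


def run_toy_workflow(texts, labels):
    """Split, vectorize, fit MultinomialNB, and return (true, predicted) test labels."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.25, random_state=0, stratify=labels
    )
    vec = TfidfVectorizer()
    X_tr = vec.fit_transform(x_tr)  # learn the vocabulary on the training split only
    X_te = vec.transform(x_te)      # reuse that vocabulary for the test split
    clf = MultinomialNB().fit(X_tr, y_tr)  # sparse TF-IDF input is accepted directly
    return y_te, clf.predict(X_te)


y_true, y_pred = run_toy_workflow(toy_texts, toy_labels)
# The true labels go first, the predictions second
print(classification_report(y_true, y_pred, zero_division=0))
```

With only eight samples the scores themselves mean nothing. The point is the order of the steps, and that the vectorizer is fitted on the training split alone.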
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
sms = pd.read_csv(r'C:/Users/Downloads/messages.csv')
sms_data = sms.iloc[:,0]
sms_label = sms.iloc[:,1]
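# Replace every non-letter character with a space
sms_data_clear = []
for line in sms_data:
    # Replace all non-letters in a single pass so that every replacement is kept,
    # then split the message on whitespace
    newString = ''.join(char if char.isalpha() else ' ' for char in str(line))
    tempList = newString.split()
    # Append the cleaned, tokenized message to the list of clean data
    sms_data_clear.append(tempList)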
# Drop words of three or fewer characters and tokens that are not purely alphabetic
sms_data_clear2 = []
for line in sms_data_clear:
    tempList = []
    for word in line:
        if word != '' and len(word) > 3 and word.isalpha():
            tempList.append(word)
    tempString = ' '.join(tempList)
    sms_data_clear2.append(tempString)
sms_data_clear = sms_data_clear2
# Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(sms_data_clear2,sms_label,test_size=0.25,random_state=0,stratify=sms_label)
# Vectorize the text with TF-IDF
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(x_train)
X_test = tfidf.transform(x_test)
X_train = X_train.toarray()
X_test = X_test.toarray()
X_train.shape
# # Print the non-zero entries
# for i in range(X_train.shape[0]):
#     for j in range(X_train.shape[1]):
#         if X_train[i][j] != 0:
#             print(i, j, X_train[i][j])
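# Build the model. The task calls for MultinomialNB, which suits non-negative TF-IDF features.
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
model = mnb.fit(X_train, y_train)
y_predict = model.predict(X_test)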
# Print the classification metrics (true labels first, predictions second)
from sklearn.metrics import classification_report
cr = classification_report(y_test, y_predict)
print(cr)
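Finally, the classification_report function gives a detailed evaluation of the model.
For each of the two classes, 0 and 1, it shows the precision, recall, F1 score, and support (the number of samples in that class), together with the overall averages.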
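To work with individual numbers rather than the printed table, classification_report can also return a dictionary via output_dict=True. The sketch below uses made-up labels, invented purely for illustration. It also shows why argument order matters: swapping the true and predicted labels exchanges precision and recall.

```python
from sklearn.metrics import classification_report

# Hypothetical labels, invented for illustration
y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1]

# Correct order: true labels first, predictions second
report = classification_report(y_true_demo, y_pred_demo, output_dict=True)
spam = report["1"]  # per-class entries are keyed by the label as a string
print(spam["precision"], spam["recall"], spam["f1-score"], spam["support"])
print(report["accuracy"])

# Swapped order: precision and recall of the spam class trade places
swapped = classification_report(y_pred_demo, y_true_demo, output_dict=True)
print(swapped["1"]["precision"], swapped["1"]["recall"])
```

Here the spam class has precision 0.75 and recall 1.0 with the correct order, and precision 1.0 and recall 0.75 when the arguments are swapped.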