Spam Classification with Multinomial Naive Bayes

_BOTAK_

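Columns: Pattern Recognition and Machine Learning; Study Notes; Machine Learning Application Cases. Tags: Bayes, spam classification, machine learning, pattern recognition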

Article link: https://blog.csdn.net/BOTAK_/article/details/102846133

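Abstract

Recently a friend at New York University, swamped with study and work, asked me to do one of his assignments. Working through it, I gradually began to see that there really is a gap between the American and the Chinese approach to education. I talked this over with my good friend Hanwang (寒王), and our conclusion was: if you get the chance, go abroad and see the gap for yourself.
OK, on to the main text.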

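Project steps

First, I finished this small project in a single afternoon, so it was a bit rushed. Please forgive anything I have overlooked; I am happy for readers to point out my mistakes so that we can improve together.

Getting familiar with CountVectorizer
Understanding the data and preprocessing
Building and testing the model

1. Getting familiar with CountVectorizer

In short, CountVectorizer is a word-counting class: it computes how many times each word occurs in each of the given documents. Let's go straight to an example.
Before the example, import the required packages: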

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

We first build a simple test dataset:

# 1. give a simple dataset
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

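The list above is a fairly simple dataset: it contains three sentences of 3, 4 and 4 words respectively. (If this is unclear, I suggest brushing up on Python.)
Next, we create a CountVectorizer object and feed it simple_train to see what it produces:

# 2. construct a CountVectorizer object and fit it on simple_train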
cv = CountVectorizer()
cv.fit(simple_train)

Print what it has learned:

# print the vocabulary learned from simple_train
print(cv.vocabulary_)

If all goes well, this prints the vocabulary learned from simple_train as a dictionary.

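This is easy to read: the output is a dictionary {key: value} in which each key is a word and each value is the column index of that word in the feature matrix (cab: 0, call: 1, me: 2, please: 3, tonight: 4, you: 5), not its frequency. The indices follow alphabetical order, and the one-letter word 'a' is dropped by the default tokenizer. The counts themselves only appear after transform(). The term frequency (TF) is the number of times a given term occurs in a document; it is often normalized so that it does not favor long documents.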
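As a quick check of what vocabulary_ holds, here is a minimal sketch using the same simple_train list. It only relies on the default CountVectorizer settings (lowercasing, tokens of at least two characters); the variable name vocab is introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
cv = CountVectorizer()
cv.fit(simple_train)

# vocabulary_ maps each term to its column index (alphabetical order), not to a count
vocab = dict(sorted(cv.vocabulary_.items()))
for term, column in vocab.items():
    print(term, column)

# 'a' is missing: the default token pattern only keeps tokens of two or more characters
print('a' in vocab)
```

Note that 'call' maps to 1 even though it occurs in all three sentences, which confirms the value is a position rather than a frequency.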
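Next we convert the training data into a DTM. What is a DTM (document-term matrix)? It records how many times each term occurs in each document. Not clear yet? No problem, the example below will make it clear.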

# 3. transform training data into a 'document-term matrix' (a sparse matrix) with transform()
train_data = cv.transform(simple_train)
# the feature names in column order (scikit-learn >= 1.0; older releases use get_feature_names())
print(cv.get_feature_names_out())
# each printed entry reads: (document index, column index)  count
print(train_data)

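These lines print the feature-name list followed by the sparse matrix, one entry per line.

How to read this? Each entry (i, j) v says that the word in column j of the feature-name list occurs v times in sentence i of our list. For example, in the first entry (0, 1) 1, the 0 refers to the first sentence 'call you tonight', the 1 is the index of 'call' in the feature-name list (its second position, since indices start at 0), and the trailing 1 means it occurs once.
Then we convert this into a dense matrix: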

train_data = train_data.toarray()
print(train_data)

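The output should be the 3 x 6 array [[0 1 0 0 1 1], [1 1 1 0 0 0], [0 1 1 2 0 0]].

The rows of the matrix correspond to the sentences in the list, and the columns to the feature-name list printed above.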
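To make the row and column meaning easier to see, the matrix can be labelled with the feature names. This is a small sketch, not part of the original assignment; the names dtm, columns and dtm_df are introduced here for illustration, and the columns are derived from vocabulary_ so that it runs on any scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
cv = CountVectorizer()
dtm = cv.fit_transform(simple_train)  # fit and transform in one step

# order the terms by their column index so they line up with the matrix columns
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
dtm_df = pd.DataFrame(dtm.toarray(), columns=columns)
print(dtm_df)
```

Each row is one sentence and each column one term; 'please' gets a count of 2 in the third sentence because CountVectorizer lowercases 'PLEASE!' before counting.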
Next, we can use the fitted CountVectorizer object on a simple test example:

# 7. transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me"]
test_data = cv.transform(simple_test).toarray()
# 8. examine the vocabulary and document-term matrix together
print(test_data)

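The output should be [[0 1 1 1 0 0]].

Here the row corresponds to the i-th sentence of simple_test, and the columns to the feature list built from simple_train, i.e. ['cab', 'call', 'me', 'please', 'tonight', 'you'].
Up to this point everything should be clear.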
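One detail worth spelling out is what happens to words the vectorizer has never seen. The sketch below, using the same toy sentences, suggests that such words are silently ignored by transform(); the names analyze, tokens, unknown and test_row are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
cv = CountVectorizer().fit(simple_train)

# build_analyzer() returns the function CountVectorizer uses to split text into tokens
analyze = cv.build_analyzer()
tokens = analyze("please don't call me")
print(tokens)

# tokens that are not in the fitted vocabulary get no column in the matrix
unknown = [t for t in tokens if t not in cv.vocabulary_]
print(unknown)

test_row = cv.transform(["please don't call me"]).toarray()
print(test_row)
```

Here "don't" is split into 'don' and 't'; 't' is too short to be a token and 'don' is not in the training vocabulary, so only please, call and me are counted.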

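2. Understanding the data and preprocessing

As the saying goes, a craftsman who wants to do good work must first sharpen his tools. This applies well to both machine learning and deep learning: understanding and preparing the data matters a great deal, and however powerful the model, poor data makes the effort worthless. Let us begin.
The data used here is the message dataset sms.tsv, provided by New York University and passed on to me by my friend.
First, we read the data and take a look at it: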

# Reading SMS data
# 9. read tab-separated file "sms.tsv"; name the columns ['label', 'message']; use head() to view part of the data
import pandas as pd
data = pd.read_csv('sms.tsv', sep='\t', names=['label', 'message'])
print(data.head())
# 10. convert label to a numeric variable: spam -> 1, ham -> 0
data['label'] = data['label'].map({'spam': 1, 'ham': 0})
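For readers without the sms.tsv file, the loading step can be tried on an in-memory stand-in. The two rows below are made up for illustration and only mimic the tab-separated label/message layout; sample_tsv is a hypothetical name.

```python
import io
import pandas as pd

# two made-up rows standing in for sms.tsv (hypothetical content, same layout)
sample_tsv = "ham\tsee you at lunch\nspam\tWIN a FREE prize now\n"
data = pd.read_csv(io.StringIO(sample_tsv), sep='\t', names=['label', 'message'])

# map the text labels to numbers: ham -> 0, spam -> 1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
print(data.head())

# class balance is worth checking before training
print(data['label'].value_counts())
```

On the real file, value_counts() shows how many ham and spam messages there are, which helps when interpreting the accuracy later.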

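Next comes preprocessing. At first I did not intend to preprocess at all: I plugged the data straight into the model and started training and testing, and you can guess the result, a dismal accuracy of about 60%. After preprocessing, the accuracy reached 97%. Quite moving, isn't it? So, dear students, do preprocess your data.

The first issue is case. An uppercase word and its lowercase form mean the same thing, but token matching treats them as two different words, so converting all letters to lowercase is necessary.
The second issue is stop words. They do not change the meaning of a sentence, yet the model counts them like any other word, which tends to lower accuracy.
Third, punctuation and other symbols carry little information here, so they are removed.
The last point is crucial: inflected forms of English words. For example, loves and love mean the same thing but are counted as two separate words, which is rather silly, so we reduce them to a common form (lemmatization).
The code for these steps: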

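from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# requires the NLTK 'stopwords' and 'wordnet' corpora (install them with nltk.download)
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer('[a-z]+')
lemmatizer = WordNetLemmatizer()

def text_process(text):
    # lowercase first, otherwise '[a-z]+' cuts capitals off ('Call' would become 'all')
    tokens = tokenizer.tokenize(text.lower())
    # lemmatize each token, then drop stop words
    lemmas = [lemmatizer.lemmatize(w) for w in tokens]
    return [w for w in lemmas if w not in stop_words]
data['message'] = data['message'].apply(text_process)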
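The case issue interacts with the '[a-z]+' pattern in a way that is easy to miss: the pattern only matches lowercase letters, so text has to be lowercased before tokenizing. The sketch below illustrates this with the standard re module instead of NLTK (an assumption made to keep it dependency-free; the helper tokenize is hypothetical).

```python
import re

def tokenize(text, lowercase=True):
    # keep runs of lowercase ASCII letters, mirroring the '[a-z]+' pattern above
    if lowercase:
        text = text.lower()
    return re.findall('[a-z]+', text)

print(tokenize('Call me a cab', lowercase=False))  # ['all', 'me', 'a', 'cab']
print(tokenize('Call me a cab'))                   # ['call', 'me', 'a', 'cab']
print(tokenize('please call me... PLEASE!'))       # ['please', 'call', 'me', 'please']
```

Without lowercasing, 'Call' loses its capital letter and turns into the unrelated token 'all', and a fully uppercase word such as 'PLEASE' disappears altogether.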

Next, define X and y:

from sklearn.model_selection import train_test_split

# 11. define X and y
X, y = data['message'], data['label']
# 12. split into training and testing sets by train_test_split(); and print the shape of training set and test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape)
# the training set
train = pd.concat([X_train, y_train], axis=1)
# the test set
test = pd.concat([X_test, y_test], axis=1)
# renumber the rows from 0; the row order is unchanged, so y_train and y_test still line up
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

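Next we use the CountVectorizer introduced above; the code below should be easy to follow. One note first: using every word in the full training data as the vocabulary would be rather large, so I randomly sample 10 ham and 10 spam messages and use all the words in them as the vocabulary.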

# the ham messages in the training set
ham_train = train[train['label'] == 0]
# the spam messages in the training set
spam_train = train[train['label'] == 1]
seed = 1000
# sample 10 ham messages from the training set
ham_train_part = ham_train['message'].sample(10, random_state=seed)
# sample 10 spam messages from the training set
spam_train_part = spam_train['message'].sample(10, random_state=seed)
# collect every word that occurs in the 20 sampled messages
part_words = []
for text in pd.concat([ham_train_part, spam_train_part]):
    part_words += text
part_words_set = set(part_words)
# CountVectorizer expects strings, so join each token list back into one sentence
# the 20 sampled messages, used to build the vocabulary
train_part_texts = [' '.join(text) for text in np.concatenate((spam_train_part.values, ham_train_part.values))]
# every message in the training set
train_all_texts = [' '.join(text) for text in train['message']]
# every message in the test set
test_all_texts = [' '.join(text) for text in test['message']]
# preview the first few joined training messages
print(train_all_texts[:5])

# Vectorizing SMS data
# 13. instantiate the vectorizer by CountVectorizer()
cv = CountVectorizer()
# build the vocabulary from the 20 sampled messages only
part_fit = cv.fit(train_part_texts)
# count the vocabulary words in every training message
train_all_count = cv.transform(train_all_texts)
# count the vocabulary words in every test message
test_all_count = cv.transform(test_all_texts)
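Sampling a few messages is one way to keep the vocabulary small. A possible alternative, shown here only as a sketch and not as the author's method, is CountVectorizer's max_features parameter, which keeps the terms with the highest total counts. The documents and the name cv_small are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents: 'call' and 'free' occur 3 times each, 'prize' twice, the rest once
docs = ['free prize call now', 'call me later', 'free free prize call']

# keep only the 2 most frequent terms across the whole corpus
cv_small = CountVectorizer(max_features=2)
counts = cv_small.fit_transform(docs)

print(sorted(cv_small.vocabulary_))
print(counts.toarray())
```

Unlike a random sample of messages, this uses the whole training set to decide which words to keep, so frequent words are less likely to be missed.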

Next I also applied TF-IDF weighting. See the relevant documentation for details on TF-IDF; I will not describe it further here.

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
# 14. learn the IDF weights from the training counts, then create the matrix "X_train_dtm"
X_train_dtm = tfidf.fit_transform(train_all_count)
# 15. transform the test counts with the weights fitted on the training data (no refitting on test data)
X_test_dtm = tfidf.transform(test_all_count)
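To give a rough idea of what TfidfTransformer computes, the sketch below checks its learned weights against the smoothed IDF formula used by scikit-learn's default settings, idf = ln((1 + n) / (1 + df)) + 1, on a tiny made-up count matrix. All array values and the names train_counts, test_counts and expected_idf are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# made-up counts: 2 training documents, 2 terms; term 0 occurs in both documents
train_counts = np.array([[1, 0], [1, 1]])
test_counts = np.array([[0, 3]])

tfidf = TfidfTransformer()  # defaults: smooth_idf=True, norm='l2'
train_tfidf = tfidf.fit_transform(train_counts)  # learns the IDF weights from training data
test_tfidf = tfidf.transform(test_counts)        # reuses them, nothing is learned from test data

# smoothed IDF computed by hand for comparison
n_docs = train_counts.shape[0]
doc_freq = (train_counts > 0).sum(axis=0)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(tfidf.idf_)
print(expected_idf)
print(train_tfidf.toarray())
```

The term that occurs in every document gets the smallest weight, and each row is scaled to unit length afterwards. Calling transform() rather than fit_transform() on the test counts keeps the test weights consistent with those the model was trained on.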

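3. Building and testing the model

Now that the data is ready, we build the model and predict. This part is fairly simple, so I will only outline it: I use a multinomial Naive Bayes classifier and then print the accuracy and the confusion matrix.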
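For reference, the decision rule of multinomial Naive Bayes with additive smoothing can be written as below. The notation is introduced here and is not the author's: N_{c,w} is the total weight of term w over the training messages of class c, N_c is the sum of N_{c,w} over all terms, |V| is the vocabulary size, x_w is the feature value of term w in the message being classified, and alpha is the smoothing parameter (1.0 by default in MultinomialNB).

```latex
\hat{\theta}_{c,w} = \frac{N_{c,w} + \alpha}{N_c + \alpha\,|V|},
\qquad
\hat{y}(x) = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in V} x_w \log \hat{\theta}_{c,w} \Big]
```

The smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.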

# Building a Naive Bayes model by using Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# 16. train a Naive Bayes model using the matrix "X_train_dtm"
mnb = MultinomialNB()  # defaults: alpha=1.0, class_prior=None, fit_prior=True
mnb.fit(X_train_dtm, y_train)

# 17. calculate accuracy of predictions
print(mnb.score(X_test_dtm, y_test))

# 18. give the confusion matrix
y_pred = mnb.predict(X_test_dtm)
print(y_pred)
cm = confusion_matrix(y_test, y_pred)
# print the confusion matrix (rows: true labels, columns: predicted labels)
print(cm)
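The three stages (counting, TF-IDF weighting, Naive Bayes) can also be chained in a scikit-learn Pipeline, which fits every stage on the training data only and merely transforms the test data. The sketch below is an alternative arrangement rather than the code used in this post, and the messages and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# made-up messages; 1 = spam, 0 = ham
train_texts = ['win free prize now', 'free cash prize', 'see you at lunch', 'call me tonight']
train_labels = [1, 1, 0, 0]
test_texts = ['free prize', 'lunch tonight']
test_labels = [1, 0]

model = Pipeline([
    ('counts', CountVectorizer()),   # text -> word counts
    ('tfidf', TfidfTransformer()),   # counts -> TF-IDF weights
    ('nb', MultinomialNB()),         # TF-IDF weights -> class
])
model.fit(train_texts, train_labels)

pred = model.predict(test_texts)
# with labels=[0, 1] the layout is [[TN, FP], [FN, TP]]
cm = confusion_matrix(test_labels, pred, labels=[0, 1])
print(pred)
print(cm)
```

In the confusion matrix the rows are the true labels and the columns the predicted ones, so the off-diagonal cells count ham flagged as spam (top right) and spam that slipped through (bottom left).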

the end
