Naive Bayes for Text Classification in Python: A Worked Example

程志伟

Tags: python, machine learning

Original link: https://www.bilibili.com/video/BV1vJ41187hk

Follow the WeChat official account: 小程在线

Follow the CSDN blog: 程志伟的博客

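1.1 A brief introduction to text encoding
1.1.1 Word count vectors
Before we can classify anything, we must encode the text as numbers. A common approach is the word count vector. With this technique a sample can be a sentence, a passage, or a whole article. If 10 distinct words occur in the samples, there are 10 features (n=10), one per word, and each feature's value is the number of times that word occurs in the sample: a discrete, non-negative integer count.
In sklearn, word count vectors are produced by the CountVectorizer class in the feature_extraction.text module. Here is a simple example: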

sample = ["Machine learning is fascinating, it is wonderful"
,"Machine learning is a sensational techonology"
,"Elsa is a popular character"]

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample)
X
Out[8]:
<3x11 sparse matrix of type '<class 'numpy.int64'>'
with 15 stored elements in Compressed Sparse Row format>

# get_feature_names() returns the name of each column
# (in scikit-learn 1.2 and later, use get_feature_names_out() instead)
vec.get_feature_names()
Out[9]:
['character',
'elsa',
'fascinating',
'is',
'it',
'learning',
'machine',
'popular',
'sensational',
'techonology',
'wonderful']

import pandas as pd
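# A sparse matrix cannot be passed to pd.DataFrame directly, so convert it with toarray()
# Each value is the number of times the word occurs in that sample (0 means absent)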
CVresult = pd.DataFrame(X.toarray(),columns = vec.get_feature_names())
CVresult
Out[11]:
   character  elsa  fascinating  ...  sensational  techonology  wonderful
0          0     0            1  ...            0            0          1
1          0     0            0  ...            1            1          0
2          1     1            0  ...            0            0          0

[3 rows x 11 columns]

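If we sum each column and divide by the sum of the whole feature matrix, we get the probability associated with each column. Because this is a sum over samples, a sample with non-zero values in many features contributes more to that estimate than other samples do, and a sample that is an especially long sentence has a very large influence on it. Complement naive Bayes divides each feature's weight by its L2 norm precisely to avoid this.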
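The effect of one long sample on the column shares can be sketched with a toy matrix. This is only an illustration under stated assumptions: the count matrix is made up, and the row-wise L2 normalization used here is the kind TfidfVectorizer applies by default, not a reproduction of ComplementNB's internals.

```python
import numpy as np

# Hypothetical count matrix: two short samples and one very long sample.
X = np.array([[1, 1, 0],
              [0, 1, 1],
              [20, 0, 20]], dtype=float)

# Share of each column in the total count: the long sample dominates.
raw_share = X.sum(axis=0) / X.sum()

# Divide every row by its own L2 norm so each sample has unit length.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_l2 = X / row_norms

# After normalization each sample contributes on the same scale.
norm_share = X_l2.sum(axis=0) / X_l2.sum()

print("raw shares:       ", raw_share.round(3))
print("normalized shares:", norm_share.round(3))
```

Before normalization the middle column, which the long sample never uses, holds under 5% of the total; after normalization the three columns are weighted equally.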

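The second problem: looking at our matrix, the word "is" appears four times, so its computed probability is the largest, yet it contributes almost nothing to the meaning (unless we want to tell whether a text describes past or present events). As one would expect, with word count vectors some common words (such as "的" in Chinese) appear all over the matrix and carry high weight, which plainly misleads the classifier. To address this, instead of raw counts we encode each word by its proportion within the sentence, scaled down for words that are common across samples. This is the well-known TF-IDF method.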

1.1.2 TF-IDF
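IDF is inversely related to how common a word is: the more samples a word appears in, the smaller the weight it tends to receive after encoding, which suppresses frequent words that carry little meaning.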

from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
vec = TFIDF()
X = vec.fit_transform(sample)
X
Out[12]:
<3x11 sparse matrix of type '<class 'numpy.float64'>'
with 15 stored elements in Compressed Sparse Row format>

# As before, get_feature_names() supplies the column names
TFIDFresult = pd.DataFrame(X.toarray(),columns=vec.get_feature_names())
TFIDFresult
Out[13]:
   character      elsa  fascinating  ...  sensational  techonology  wonderful
0   0.000000  0.000000     0.424396  ...     0.000000     0.000000   0.424396
1   0.000000  0.000000     0.000000  ...     0.534093     0.534093   0.000000
2   0.546454  0.546454     0.000000  ...     0.000000     0.000000   0.000000

[3 rows x 11 columns]
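To see where values such as 0.546454 for "elsa" come from, the TF-IDF weights can be rebuilt by hand. The sketch below assumes TfidfVectorizer's default settings (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sample = ["Machine learning is fascinating, it is wonderful",
          "Machine learning is a sensational techonology",
          "Elsa is a popular character"]

# Raw term counts, one row per sample.
counts = CountVectorizer().fit_transform(sample).toarray().astype(float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of samples containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF (the default setting)

weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # row-wise L2 norm

reference = TfidfVectorizer().fit_transform(sample).toarray()
print(np.allclose(manual, reference))
```

"is" occurs in all three samples, so its IDF is ln(4/4) + 1 = 1, the smallest possible value, while a word found in a single sample gets ln(4/2) + 1, about 1.69.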

CVresult.sum(axis=0).sum()
Out[14]: 16

# After TF-IDF encoding, are the weights of frequent words reduced?
CVresult.sum(axis=0)/CVresult.sum(axis=0).sum()
Out[15]:
character 0.0625
elsa 0.0625
fascinating 0.0625
is 0.2500
it 0.0625
learning 0.1250
machine 0.1250
popular 0.0625
sensational 0.0625
techonology 0.0625
wonderful 0.0625
dtype: float64

TFIDFresult.sum(axis=0) / TFIDFresult.sum(axis=0).sum()
Out[16]:
character 0.083071
elsa 0.083071
fascinating 0.064516
is 0.173225
it 0.064516
learning 0.110815
machine 0.110815
popular 0.083071
sensational 0.081192
techonology 0.081192
wonderful 0.064516
dtype: float64

The weight of "is" drops from 0.25 to about 0.17.

1.2 Exploring the text data

# The first time this dataset is used, it is downloaded when the function is called
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()

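# Normally we would just evaluate data to see what it contains, but fetch_20newsgroups
# returns a very large object whose structure is full of raw text, so it is hard to read
# The different newsgroup categories: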

data.target_names
Out[19]:
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']

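fetch_20newsgroups parameters:

subset : which subset of the data to load
"train" selects the training set, "test" selects the test set, and "all" loads all of the data.

categories : None or a list of category names
The newsgroup categories to load within the chosen subset. With the default, None, all categories are loaded.

download_if_missing : optional, default True
Whether to download the data automatically if it is not found locally.

shuffle : boolean, optional; whether to shuffle the order of the samples
This can matter for algorithms or models that assume the samples are independent and identically distributed, such as stochastic gradient descent.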

import numpy as np
import pandas as pd
categories = ["sci.space" # science and technology - space
,"rec.sport.hockey" # sports - hockey
,"talk.politics.guns" # politics - guns
,"talk.politics.mideast"] # politics - Middle East
train = fetch_20newsgroups(subset="train",categories = categories)
test = fetch_20newsgroups(subset="test",categories = categories)

train.target_names
Out[21]:
['rec.sport.hockey',
'sci.space',
'talk.politics.guns',
'talk.politics.mideast']

# How many articles are there in total?

len(train.data)
Out[22]: 2303

# Pull out one article and take a look
train.data[0]
Out[23]: "From: tvartiai@vipunen.hut.fi (Tommi Vartiainen)\nSubject: Re: Finland/Sweden vs.NHL teams (WAS:Helsinki/Stockholm & NHL expansion)\nNntp-Posting-Host: vipunen.hut.fi\nOrganization: Helsinki University of Technology, Finland\nLines: 51\n\nIn <1993Apr16.195754.5476@ousrvr.oulu.fi> mep@phoenix.oulu.fi (Marko Poutiainen) writes:\n\n>: FINLAND: \n>: \n>: D-Jyrki Lumme.......20\n>: D-Teppo Numminen....20\n>: D-Peter Ahola.......13\n>: \n>Well well, they don't like our defenders (mainly Lumme and Numminen)...\n\nAbout 25 is correct for Numminen and Lumme.\n\n\n>: R-Teemu Selanne.....27\n>: \n>Compared to Kurri, Selanne's points are too high, lets make it 25 or 26.\n\nNo, Kurri's points are too low. 27 for Kurri and 28 for Sel{nne.\n\n>: well in the Canada Cup and World Championships largely due to the efforts of\n>: Markus Ketterer (the goalie), 3-4 or the players listed above and luck. There's\n>: presumably a lot of decent players in Finland that wouldn't be superstars at\n>: the highest level but still valuable role players, however. My guess would be\n>: that the Finnish Canada Cup team would be a .500 team in the NHL.\n\n>Wow, now, it looks like you don't like our players? What about guys like:\n>Nieminen, Jutila, Riihijarvi, Varvio, Laukkanen, Makela, Keskinen and (even\n>if he is aging) Ruotsalainen? The main difference between finnish and North-\n>American players is, that our players tend to be better in the larger rink.\n>The Canadian defenders are usually slower that defenders in Europe. \n>And I think that there was more in our success than Ketterer and luck (though\n>they helped). I think that the main reason was, that the team worked well\n>together.\n\n\nThat's true. Game is so different here in Europe compared to NHL. North-ame-\nricans are better in small rinks and europeans in large rinks. An average\neuropean player from Sweden, Finland, Russian or Tsech/Slovakia is a better \nskater and puckhandler than his NHL colleague. 
Especially defenders in NHL\nare mainly slow and clumsy. Sel{nne has also said that in the Finnish Sm-league\ngame is more based on skill than in NHL. In Finland he couldn't get so many \nbreakaways because defenders here are an average much better skaters than in\nNHL. Also Alpo Suhonen said that in NHL Sel{nne's speed accentuates because\nof clumsy defensemen.\n\nI have to admit that the best players come from Canada, but those regulars\naren't as skilful as regulars in the best european leagues. Also top europeans\nare in the same level as the best north-americans.(except Lemieux is in the\nclass of his own). \n\nTommi\n"

# Look at our labels
np.unique(train.target)
Out[24]: array([0, 1, 2, 3], dtype=int64)

len(train.target)
Out[25]: 2303

# Is there a class imbalance problem? (class 0 is not printed; its share is the remainder)
for i in [1,2,3]:
    print(i,(train.target == i).sum()/len(train.target))
1 0.25749023013460703
2 0.23708206686930092
3 0.24489795918367346
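The same check can cover every class in one step with np.bincount. This is a minimal sketch on a made-up label array (the `target` values below are hypothetical, not the newsgroup labels); with the real data one would pass train.target instead.

```python
import numpy as np

# Hypothetical integer labels for four classes.
target = np.array([0, 1, 1, 2, 3, 3, 3, 0])

# Count how many samples fall in each class, then convert to shares.
counts = np.bincount(target, minlength=4)
shares = counts / counts.sum()

for label, share in enumerate(shares):
    print(label, round(float(share), 3))
```

Because every class is included, the shares sum to 1, which makes it easy to spot a class that is over- or under-represented.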

1.3 Encoding the text data with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF

Xtrain = train.data
Xtest = test.data
Ytrain = train.target
Ytest = test.target

tfidf = TFIDF().fit(Xtrain)
Xtrain_ = tfidf.transform(Xtrain)
Xtest_ = tfidf.transform(Xtest)
Xtrain_
Out[29]:
<2303x40725 sparse matrix of type '<class 'numpy.float64'>'
with 430306 stored elements in Compressed Sparse Row format>

tosee = pd.DataFrame(Xtrain_.toarray(),columns=tfidf.get_feature_names())
tosee.head()
Out[30]:
    00       000  0000  00000  000000  ...   zy  zyg   zz  zz_g9q3  zzzzzz
0  0.0  0.000000   0.0    0.0     0.0  ...  0.0  0.0  0.0      0.0     0.0
1  0.0  0.000000   0.0    0.0     0.0  ...  0.0  0.0  0.0      0.0     0.0
2  0.0  0.058046   0.0    0.0     0.0  ...  0.0  0.0  0.0      0.0     0.0
3  0.0  0.000000   0.0    0.0     0.0  ...  0.0  0.0  0.0      0.0     0.0
4  0.0  0.000000   0.0    0.0     0.0  ...  0.0  0.0  0.0      0.0     0.0

[5 rows x 40725 columns]

tosee.shape
Out[31]: (2303, 40725)
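The vectorizer above is fitted on the training texts only and then reused to transform the test texts. The sketch below, on four made-up sentences (all hypothetical), shows what that implies: the test matrix has the same columns as the training matrix, and words that never occurred in training are simply dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpora.
train_docs = ["the rocket reached orbit", "the goalie saved the puck"]
test_docs = ["the rocket hit the puck", "an unseen asteroid"]

# The vocabulary is learned from the training documents only.
tfidf = TfidfVectorizer().fit(train_docs)
Xtr = tfidf.transform(train_docs)
Xte = tfidf.transform(test_docs)

print(Xtr.shape, Xte.shape)              # same number of columns
print("asteroid" in tfidf.vocabulary_)   # not part of the learned vocabulary
print(Xte[1].nnz)                        # stored non-zero entries in the second test row
```

Fitting a second vectorizer on the test set would produce a different vocabulary and column order, and a model trained on Xtrain_ could not be applied to it.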

1.4 Fitting each naive Bayes model and checking the results
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.metrics import brier_score_loss as BS
names = ["Multinomial","Complement","Bernoulli"]
# Note: Gaussian naive Bayes does not accept sparse matrices
models = [MultinomialNB(),ComplementNB(),BernoulliNB()]
for name,clf in zip(names,models):
    clf.fit(Xtrain_,Ytrain)
    y_pred = clf.predict(Xtest_)
    proba = clf.predict_proba(Xtest_)
    score = clf.score(Xtest_,Ytest)
    print(name)
    # Brier score under each of the 4 label values
    # Caveat: Ytest==i is a boolean array, so pos_label=i lines up with True only when i is 1;
    # pos_label=True would score every class one-vs-rest, but would change the numbers printed below
    Bscore = []
    for i in range(len(np.unique(Ytrain))):
        bs = BS(Ytest==i,proba[:,i],pos_label=i)
        Bscore.append(bs)
        print("\tBrier under {}:{:.3f}".format(train.target_names[i],bs))
    print("\tAverage Brier:{:.3f}".format(np.mean(Bscore)))
    print("\tAccuracy:{:.3f}".format(score))
    print("\n")
Multinomial
Brier under rec.sport.hockey:0.857
Brier under sci.space:0.033
Brier under talk.politics.guns:0.169
Brier under talk.politics.mideast:0.178
Average Brier:0.309
Accuracy:0.975

Complement
Brier under rec.sport.hockey:0.804
Brier under sci.space:0.039
Brier under talk.politics.guns:0.137
Brier under talk.politics.mideast:0.160
Average Brier:0.285
Accuracy:0.986

Bernoulli
Brier under rec.sport.hockey:0.925
Brier under sci.space:0.025
Brier under talk.politics.guns:0.205
Brier under talk.politics.mideast:0.193
Average Brier:0.337
Accuracy:0.902
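The per-class Brier values above look uneven because the loop passes a boolean y_true together with pos_label=i. As far as we can tell, brier_score_loss binarizes y_true by comparing it with pos_label, and that comparison only matches True when i is 1, so only the sci.space figure is a genuine one-vs-rest score. The sketch below computes the one-vs-rest Brier score directly from its definition with NumPy. The labels and probabilities are made up, and `one_vs_rest_brier` is a hypothetical helper, not part of sklearn.

```python
import numpy as np

def one_vs_rest_brier(y_true, proba, k):
    """Mean squared gap between P(class k) and the 0/1 indicator of class k."""
    indicator = (np.asarray(y_true) == k).astype(float)
    return float(np.mean((indicator - proba[:, k]) ** 2))

# Hypothetical labels and predicted probabilities for 3 classes.
y_true = np.array([0, 1, 2, 2])
proba = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.3, 0.3, 0.4]])

scores = [one_vs_rest_brier(y_true, proba, k) for k in range(proba.shape[1])]
for k, s in enumerate(scores):
    print("Brier under class {}: {:.3f}".format(k, s))
print("Average Brier: {:.3f}".format(np.mean(scores)))
```

With the real data the same function could be called as one_vs_rest_brier(Ytest, proba, i). A perfectly confident, correct prediction scores 0, and lower is better.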

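Judging by accuracy, multinomial and complement naive Bayes both perform very well (0.975 and 0.986), while Bernoulli naive Bayes trails at 0.902. Complement naive Bayes has both the highest accuracy and the lowest average Brier score printed here (0.285, against 0.309 for multinomial). Next we can try probability calibration to see whether the models can be pushed any further.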

from sklearn.calibration import CalibratedClassifierCV
names = ["Multinomial"
,"Multinomial + Isotonic"
,"Multinomial + Sigmoid"
,"Complement"
,"Complement + Isotonic"
,"Complement + Sigmoid"
,"Bernoulli"
,"Bernoulli + Isotonic"
,"Bernoulli + Sigmoid"]
models = [MultinomialNB()
,CalibratedClassifierCV(MultinomialNB(), cv=2, method='isotonic')
,CalibratedClassifierCV(MultinomialNB(), cv=2, method='sigmoid')
,ComplementNB()
,CalibratedClassifierCV(ComplementNB(), cv=2, method='isotonic')
,CalibratedClassifierCV(ComplementNB(), cv=2, method='sigmoid')
,BernoulliNB()
,CalibratedClassifierCV(BernoulliNB(), cv=2, method='isotonic')
,CalibratedClassifierCV(BernoulliNB(), cv=2, method='sigmoid')
]
# Fit every model, raw and calibrated, and print the same metrics as before
for name,clf in zip(names,models):
    clf.fit(Xtrain_,Ytrain)
    y_pred = clf.predict(Xtest_)
    proba = clf.predict_proba(Xtest_)
    score = clf.score(Xtest_,Ytest)
    print(name)
    Bscore = []
    for i in range(len(np.unique(Ytrain))):
        bs = BS(Ytest==i,proba[:,i],pos_label=i)
        Bscore.append(bs)
        print("\tBrier under {}:{:.3f}".format(train.target_names[i],bs))
    print("\tAverage Brier:{:.3f}".format(np.mean(Bscore)))
    print("\tAccuracy:{:.3f}".format(score))
    print("\n")
Multinomial
Brier under rec.sport.hockey:0.857
Brier under sci.space:0.033
Brier under talk.politics.guns:0.169
Brier under talk.politics.mideast:0.178
Average Brier:0.309
Accuracy:0.975

Multinomial + Isotonic
Brier under rec.sport.hockey:0.980
Brier under sci.space:0.012
Brier under talk.politics.guns:0.226
Brier under talk.politics.mideast:0.228
Average Brier:0.362
Accuracy:0.973

Multinomial + Sigmoid
Brier under rec.sport.hockey:0.968
Brier under sci.space:0.012
Brier under talk.politics.guns:0.219
Brier under talk.politics.mideast:0.222
Average Brier:0.355
Accuracy:0.973

Complement
Brier under rec.sport.hockey:0.804
Brier under sci.space:0.039
Brier under talk.politics.guns:0.137
Brier under talk.politics.mideast:0.160
Average Brier:0.285
Accuracy:0.986

Complement + Isotonic
Brier under rec.sport.hockey:0.984
Brier under sci.space:0.007
Brier under talk.politics.guns:0.227
Brier under talk.politics.mideast:0.230
Average Brier:0.362
Accuracy:0.985

Complement + Sigmoid
Brier under rec.sport.hockey:0.970
Brier under sci.space:0.009
Brier under talk.politics.guns:0.217
Brier under talk.politics.mideast:0.221
Average Brier:0.354
Accuracy:0.986

Bernoulli
Brier under rec.sport.hockey:0.925
Brier under sci.space:0.025
Brier under talk.politics.guns:0.205
Brier under talk.politics.mideast:0.193
Average Brier:0.337
Accuracy:0.902

Bernoulli + Isotonic
Brier under rec.sport.hockey:0.957
Brier under sci.space:0.014
Brier under talk.politics.guns:0.164
Brier under talk.politics.mideast:0.181
Average Brier:0.329
Accuracy:0.952

Bernoulli + Sigmoid
Brier under rec.sport.hockey:0.825
Brier under sci.space:0.030
Brier under talk.politics.guns:0.153
Brier under talk.politics.mideast:0.160
Average Brier:0.292
Accuracy:0.879

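In these runs, multinomial naive Bayes never matches complement naive Bayes however it is adjusted: its accuracy stays between 0.973 and 0.975, against 0.985 to 0.986 for complement naive Bayes. Complement naive Bayes is therefore the one to choose for this classification task. With sigmoid calibration its accuracy remains 0.986, the highest in the table, while isotonic calibration gives 0.985. Note that log loss is not computed in this code, and the average Brier values printed above lie between 0.285 and 0.362 rather than below 0.1, so these outputs alone do not show that calibration improved the probability estimates.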
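Since log loss is mentioned but not computed above, here is a minimal sketch of how raw and sigmoid-calibrated complement naive Bayes could be compared on both accuracy and log loss. Everything about the data is an assumption: the counts are synthetic Poisson draws from made-up "word rate" profiles, not the newsgroup texts, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(0)

# Hypothetical word-rate profiles: each of 3 classes favours its own block of 4 features.
rates = np.full((3, 12), 0.5)
for k in range(3):
    rates[k, 4 * k:4 * k + 4] = 3.0

y = np.repeat(np.arange(3), 100)   # 100 samples per class
X = rng.poisson(rates[y])          # non-negative counts, shape (300, 12)

# Random split: 200 samples for training, 100 for testing.
order = rng.permutation(len(y))
train_idx, test_idx = order[:200], order[200:]

models = {
    "Complement": ComplementNB(),
    "Complement + Sigmoid": CalibratedClassifierCV(ComplementNB(), cv=2, method="sigmoid"),
}

results = {}
for name, clf in models.items():
    clf.fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])
    results[name] = {
        "proba": proba,
        "accuracy": clf.score(X[test_idx], y[test_idx]),
        "log_loss": log_loss(y[test_idx], proba, labels=[0, 1, 2]),
    }
    print("{}: accuracy={:.3f}, log loss={:.3f}".format(
        name, results[name]["accuracy"], results[name]["log_loss"]))
```

On the newsgroup data the same loop would use Xtrain_, Ytrain, Xtest_ and Ytest in place of the synthetic arrays. Lower log loss means the predicted probabilities are closer to the true labels.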

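In machine learning, naive Bayes may not be the most commonly used classification algorithm, but it is simple, fast, and built directly on probability theory, so it still comes up often. It also performs very well on text classification. As this example shows, given enough data and sensible use of high-dimensional features, naive Bayes can deliver surprisingly good results.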