Scikit-learn: Machine Learning in Python, Bayesian Learning

禾如月

Article link: https://blog.csdn.net/xiu_star/article/details/53729141

Chapter 2: Naive Bayes

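Naive Bayes is a simple yet powerful classifier: a probabilistic model based on Bayes' theorem. In essence, it uses the probability of each feature value to determine the probability that an instance belongs to a class, under the assumption that the features are independent of one another given the class. One of its most successful applications is natural language processing (NLP), where large amounts of labeled data (usually text files) are available to serve as the training set.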
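As a rough illustration of the idea above, the following sketch multiplies a class prior by per-word probabilities and normalizes the result. All numbers, class names, and the `posterior` helper are hypothetical and chosen only for the example; they do not come from the newsgroup data.

```python
import math

# Hypothetical numbers for illustration only: class priors and
# per-class word probabilities P(word | class) for a two-class toy problem.
priors = {"sports": 0.5, "science": 0.5}
word_probs = {
    "sports": {"game": 0.30, "team": 0.25, "orbit": 0.01},
    "science": {"game": 0.02, "team": 0.03, "orbit": 0.20},
}


def posterior(words):
    """Return P(class | words), treating words as independent given the class."""
    # Sum logs instead of multiplying probabilities to avoid underflow.
    log_scores = {
        label: math.log(priors[label])
        + sum(math.log(word_probs[label][word]) for word in words)
        for label in priors
    }
    # Normalize so that the class probabilities sum to 1.
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


probs = posterior(["game", "team"])
print(max(probs, key=probs.get))  # sports
```

The "naive" part is the inner sum: each word contributes its own factor, regardless of which other words appear in the document.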

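This chapter uses Naive Bayes for text classification. The dataset is a set of text documents labeled with their categories; we train a Naive Bayes classifier on it to predict the category of a new, unseen document. The dataset available through scikit-learn contains about 19,000 newsgroup posts on 20 different topics, ranging from politics and religion to sports and science.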

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')  # download the data (on first use) and bind it to a name

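Note that the data is stored as a list of text documents, not as a matrix. Also, the book uses Python 2 while I use Python 3, so the code here differs slightly from the book's.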

print(type(news.data), type(news.target), type(news.target_names))

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)

print (news.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

print(len(news.data))
print(len(news.target))

18846

print(news.data[0])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>

Subject: Pens fans reactions

Organization: Post Office, Carnegie Mellon, Pittsburgh, PA

Lines: 12

NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack

of any kind of posts about the recent Pens massacre of the Devils. Actually,

I am bit puzzled too and a bit relieved. However, I am going to put an end

to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they

are killing those Devils worse than I thought. Jagr just showed you why

he is much better than his regular season stats. He is also a lot

fo fun to watch in the playoffs. Bowman should let JAgr have a lot of

fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final

regular season game. PENS RULE!!!

print(news.target[0], news.target_names[news.target[0]])  # target holds the index into target_names

10 rec.sport.hockey  # indices start at 0

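Preprocessing the data:

The machine learning algorithms in the book work only on numeric data, so the text has to be converted into numeric features.

At this point there is a single feature, the text content, so we need functions that turn it into a meaningful set of numeric features. Intuitively, we look at which words (more precisely, tokens, which may include numbers or punctuation) occur in each text category, and then try to characterize each category by the frequency distribution of those tokens. sklearn.feature_extraction.text provides utilities for building numeric feature vectors from text documents.

Before transforming the data, split it into a training set and a test set. Because the instances are already in random order, the first 75% become the training set and the remaining 25% the test set.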

SPLIT_PERC = 0.75
split_size = int(len(news.data) * SPLIT_PERC)
x_train = news.data[:split_size]
x_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]
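The slicing above relies on the documents already being in random order. When that cannot be assumed, one option is to shuffle the indices explicitly before cutting. The sketch below does this on plain lists; the `shuffled_split` helper, its parameters, and the toy data are hypothetical and not part of the original post.

```python
import random


def shuffled_split(data, target, train_fraction=0.75, seed=0):
    """Shuffle indices with a fixed seed, then cut into train and test parts."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    # Index data and target with the same positions so pairs stay aligned.
    x_train = [data[i] for i in train_idx]
    x_test = [data[i] for i in test_idx]
    y_train = [target[i] for i in train_idx]
    y_test = [target[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


docs = ["doc %d" % i for i in range(8)]
labels = [i % 2 for i in range(8)]
x_tr, x_te, y_tr, y_te = shuffled_split(docs, labels)
print(len(x_tr), len(x_te))  # 6 2
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing classifiers on the same test set.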

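There are three ways to turn text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. They differ in how the numeric features are computed.

CountVectorizer builds a dictionary of words from the text corpus and then turns each instance into a numeric feature vector, where each element is the number of times a particular word occurs in the document.

HashingVectorizer applies a hashing function that maps tokens to feature indices, and then computes counts as CountVectorizer does.

TfidfVectorizer works like CountVectorizer but uses a more refined weighting, Term Frequency-Inverse Document Frequency (TF-IDF): a statistic that measures how important a word is to a document within a corpus. It looks for words that are frequent in the current document relative to their frequency in the whole corpus, which normalizes the result and keeps words that are frequent everywhere from dominating.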
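The three approaches can be sketched in plain Python on a toy corpus. This is only an illustration of the ideas, not what scikit-learn does internally: `zlib.crc32` stands in for the library's hash function, the TF-IDF weight is the textbook `count * log(n_docs / document_frequency)`, and scikit-learn additionally smooths and normalizes by default, so its numbers will differ. The corpus and all names are made up for the example.

```python
import math
import zlib
from collections import Counter

docs = [
    "the pens beat the devils",
    "the rocket reached orbit",
    "the devils lost",
]
tokenized = [doc.split() for doc in docs]
counters = [Counter(tokens) for tokens in tokenized]

# 1. Count vectors: build a vocabulary, then count each word per document.
vocab = sorted({word for tokens in tokenized for word in tokens})
count_vectors = [[counts[word] for word in vocab] for counts in counters]

# 2. Hashed counts: no vocabulary is stored; a hash picks the column.
N_BUCKETS = 8  # tiny on purpose, so different words may collide


def hashed_counts(tokens):
    vector = [0] * N_BUCKETS
    for word in tokens:
        vector[zlib.crc32(word.encode("utf-8")) % N_BUCKETS] += 1
    return vector


hash_vectors = [hashed_counts(tokens) for tokens in tokenized]

# 3. TF-IDF: scale each count by log(n_docs / number of docs with the word).
n_docs = len(docs)
doc_freq = {word: sum(word in counts for counts in counters) for word in vocab}
tfidf_vectors = [
    [counts[word] * math.log(n_docs / doc_freq[word]) for word in vocab]
    for counts in counters
]

print(vocab)
print(count_vectors[0])  # [1, 1, 0, 0, 1, 0, 0, 2]
print(tfidf_vectors[0][vocab.index("the")])  # 0.0
```

In this toy corpus "the" occurs in every document, so its TF-IDF weight drops to zero even though its raw count is the highest, which is the effect described above.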

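Training a Naive Bayes classifier:

We build a Naive Bayes classifier out of a feature vectorizer and the actual Bayes classifier, MultinomialNB from the sklearn.naive_bayes module. Pipeline from the sklearn.pipeline module combines the vectorizer and the classifier into one object. Here we build three classifiers by pairing MultinomialNB with each of the three text vectorizers above, and compare which performs better with default parameters.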

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer,HashingVectorizer, CountVectorizer

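clf_1 = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
# non_negative=True was removed from HashingVectorizer in later scikit-learn releases;
# alternate_sign=False keeps the hashed features non-negative, as MultinomialNB requires.
clf_2 = Pipeline([('vect', HashingVectorizer(alternate_sign=False)), ('clf', MultinomialNB())])
clf_3 = Pipeline([('vect', TfidfVectorizer()), ('clf', MultinomialNB())])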
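To make the role of the pipeline concrete, here is a minimal sketch of the chaining idea: every step but the last transforms the data, and the last step is fitted on the transformed result. `TinyPipeline`, `WordCounter`, and `OverlapClassifier` are hypothetical toy classes written for this illustration; scikit-learn's real Pipeline offers much more than this.

```python
from collections import Counter


class WordCounter:
    """Toy transformer: turns each text into a Counter of lowercase words."""

    def fit_transform(self, texts):
        return self.transform(texts)

    def transform(self, texts):
        return [Counter(text.lower().split()) for text in texts]


class OverlapClassifier:
    """Toy estimator: predicts the class whose training words overlap most."""

    def fit(self, x, y):
        self.words_ = {}
        for counts, label in zip(x, y):
            self.words_.setdefault(label, Counter()).update(counts)
        return self

    def _overlap(self, counts, label):
        return sum(min(n, self.words_[label][word]) for word, n in counts.items())

    def predict(self, x):
        return [
            max(self.words_, key=lambda label: self._overlap(counts, label))
            for counts in x
        ]


class TinyPipeline:
    """Chains (name, step) pairs: transformers first, one estimator last."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, x, y):
        for _, transformer in self.steps[:-1]:
            x = transformer.fit_transform(x)
        self.steps[-1][1].fit(x, y)
        return self

    def predict(self, x):
        for _, transformer in self.steps[:-1]:
            x = transformer.transform(x)
        return self.steps[-1][1].predict(x)


toy = TinyPipeline([("vect", WordCounter()), ("clf", OverlapClassifier())])
toy.fit(["the pens won the game", "the shuttle reached orbit"], ["hockey", "space"])
predictions = toy.predict(["pens game tonight", "orbit of the shuttle"])
print(predictions)  # ['hockey', 'space']
```

Because the vectorizer is fitted inside `fit`, raw text can be passed straight to the combined object, which is what lets the cross-validation code receive `news.data` directly.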

Define a function that takes a classifier and performs cross-validation on the given x and y values:

# sklearn.cross_validation was removed in later releases; model_selection replaces it
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
from scipy.stats import sem

def evaluate_cross_validation(clf, x, y, K):
    # create a K-fold cross-validation iterator (K=5 in the calls below)
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, x, y, cv=cv)
    print(scores)
    # report the mean score and its standard error
    print("Mean score:{0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))
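What the function does can be sketched without any library: shuffle the sample indices, cut them into K folds, use each fold once as the test set, and summarize the K scores by their mean and standard error (sample standard deviation divided by the square root of K). The helper names below are hypothetical. The five scores are the ones this post reports for the CountVectorizer pipeline, reused here only to check the arithmetic.

```python
import math
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def mean_and_sem(scores):
    """Return the mean and the standard error of the mean."""
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(len(scores))


folds = list(k_fold_indices(10, 5))
mean, std_err = mean_and_sem([0.85782493, 0.85725657, 0.84664367, 0.85911382, 0.8458477])
print("Mean score:{0:.3f} (+/-{1:.3f})".format(mean, std_err))  # Mean score:0.853 (+/-0.003)
```

Every sample lands in exactly one test fold, so each document is used for testing once and for training K-1 times.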

Then run 5-fold cross-validation on each classifier:

clfs = [clf_1, clf_2, clf_3]
for clf in clfs:
    evaluate_cross_validation(clf, news.data, news.target, 5)

The results:

[ 0.85782493 0.85725657 0.84664367 0.85911382 0.8458477 ]

Mean score:0.853 (+/-0.003)

[ 0.75543767 0.77659857 0.77049615 0.78508888 0.76200584]

Mean score:0.770 (+/-0.005)

[ 0.84482759 0.85990979 0.84558238 0.85990979 0.84213319]

Mean score:0.850 (+/-0.004)

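CountVectorizer and TfidfVectorizer clearly perform better than HashingVectorizer. We continue with TfidfVectorizer and try to improve the results by tokenizing the documents with a different regular expression.

The default regular expression, ur"\b\w\w+\b", matches alphanumeric characters and the underscore. Also allowing the hyphen and the dot may improve tokenization, so that strings such as Wi-Fi and site.com are kept as single tokens.

The new regular expression is ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b":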

clf_4 = Pipeline([('vect', TfidfVectorizer(token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")), ('clf', MultinomialNB())])  # Python 3 does not support the ur prefix

evaluate_cross_validation(clf_4,news.data,news.target,5)

The results:

[ 0.86100796 0.8718493 0.86203237 0.87291059 0.8588485 ]

Mean score:0.865 (+/-0.003)

The mean score rises from 0.850 to 0.865.
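To see what the new pattern changes, the two expressions can be compared directly with the standard `re` module on a lowercased sample sentence. This is a sketch that assumes tokenization amounts to collecting all non-overlapping matches of the pattern; the sample sentence is made up.

```python
import re

DEFAULT_PATTERN = r"\b\w\w+\b"
NEW_PATTERN = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"

# Lowercase first, since the new pattern only lists lowercase letters.
text = "My Wi-Fi works at site.com".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
new_tokens = re.findall(NEW_PATTERN, text)

print(default_tokens)  # ['my', 'wi', 'fi', 'works', 'at', 'site', 'com']
print(new_tokens)      # ['wi-fi', 'works', 'site.com']
```

Besides keeping "wi-fi" and "site.com" whole, the new pattern needs at least three characters per token, so two-letter words such as "my" and "at" are dropped as a side effect.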

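Another parameter, stop_words, lets us pass a list of words to exclude from the computation, such as words that are too frequent or words that we do not expect, a priori, to carry information about any particular topic.

Define a function that loads the stop words: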

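def get_stop_words():
    result = set()
    # read the file line by line, treating each stripped line as one stop word
    with open('stopwords_en.txt', 'r') as stop_word_file:
        for line in stop_word_file:
            result.add(line.strip())
    # return a list: every scikit-learn version accepts a list for stop_words
    return sorted(result)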

Then build a new classifier:

clf_5 = Pipeline([('vect', TfidfVectorizer(stop_words=get_stop_words(), token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")), ('clf', MultinomialNB())])
evaluate_cross_validation(clf_5, news.data, news.target, 5)

The results:

[ 0.88222812 0.89625895 0.88591138 0.89599363 0.88485009]

Mean score:0.889 (+/-0.003)

The mean score rises from 0.865 to 0.889.
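The effect of stop-word removal on a single document can be sketched as a simple filter over its tokens. The small stop-word set below is hypothetical and only for illustration; the post loads its own list from stopwords_en.txt.

```python
# A tiny hypothetical stop-word set, used only for this illustration.
stop_words = {"the", "a", "of", "is", "to"}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


tokens = "the pens are going to beat the pulp out of jersey".split()
kept = remove_stop_words(tokens, stop_words)
print(kept)  # ['pens', 'are', 'going', 'beat', 'pulp', 'out', 'jersey']
```

The removed words occur in nearly every newsgroup, so dropping them leaves the classifier with features that are more likely to distinguish one topic from another.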

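Next, look at the parameters of MultinomialNB. The most important one is alpha, the smoothing parameter. Its default value is 1.0; here we set it to 0.1:

clf_6 = Pipeline([('vect', TfidfVectorizer(stop_words=get_stop_words(), token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")), ('clf', MultinomialNB(alpha=0.1))])
evaluate_cross_validation(clf_6, news.data, news.target, 5)

The results: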

[ 0.91405836 0.91589281 0.91085168 0.91721942 0.91509684]

Mean score:0.915 (+/-0.001)

The mean score rises from 0.889 to 0.915. The next step would be to test how different alpha values affect the result and then pick the best one.
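Why alpha matters can be seen from additive smoothing, where the estimated probability of a word in a class is (count + alpha) / (total count + alpha * vocabulary size). The sketch below applies that formula to made-up counts; the vocabulary, the counts, and the `smoothed_likelihoods` helper are hypothetical.

```python
def smoothed_likelihoods(word_counts, vocabulary, alpha):
    """Estimate P(word | class) with additive smoothing."""
    total = sum(word_counts.get(word, 0) for word in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {
        word: (word_counts.get(word, 0) + alpha) / denominator
        for word in vocabulary
    }


vocabulary = ["puck", "goal", "orbit", "launch"]
# Hypothetical counts for one class; "orbit" and "launch" were never seen in it.
hockey_counts = {"puck": 6, "goal": 4}

results = {}
for alpha in (1.0, 0.1):
    results[alpha] = smoothed_likelihoods(hockey_counts, vocabulary, alpha)
    print(alpha, results[alpha]["puck"], results[alpha]["orbit"])
```

With alpha=1.0 the unseen word "orbit" gets 1/14 of the probability mass; with alpha=0.1 it gets about 0.0096, so the observed counts dominate more strongly. A smaller alpha trusts the training counts more, which may help or hurt depending on the data.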

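Model evaluation:

Define a function that trains the model on the whole training set and evaluates its accuracy on both the training set and the test set.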

from sklearn import metrics

def train_and_evaluate(clf, x_train, x_test, y_train, y_test):
    clf.fit(x_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(x_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(x_test, y_test))
    # predict on the test set; the report and the matrix compare
    # these predictions with the true labels
    y_pred = clf.predict(x_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

train_and_evaluate(clf_6,x_train,x_test,y_train,y_test)

The results:

Accuracy on training set:

0.98776001132

Accuracy on testing set:

0.909592529711

The results are reasonably good: accuracy on the test set also reaches about 0.91.
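For reference, accuracy and a confusion matrix can be computed by hand from true and predicted labels. The sketch below uses made-up labels and hypothetical helper names. In a real evaluation the predictions must come from the model, for example from clf.predict(x_test), because comparing the true labels with themselves always yields a perfect score.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(accuracy(y_true, y_pred))
print(matrix)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

The diagonal holds the correct predictions, and the off-diagonal cells show which classes get confused with which, which is often more informative than the single accuracy number.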
