Chapter 9: Naive Bayes (Document Classification)

sklearn.naive_bayes

This module contains several classic Naive Bayes algorithms, each built on a different probability distribution: GaussianNB, MultinomialNB, and BernoulliNB. They are widely used in natural language processing.
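
As a quick orientation, the three variants differ in the probability distribution they assume for the features; a minimal sketch (the class names are sklearn's own, the comments summarize the standard assumptions):

# The three Naive Bayes variants in sklearn.naive_bayes and the feature
# types each one assumes:
from sklearn.naive_bayes import GaussianNB     # continuous, real-valued features
from sklearn.naive_bayes import MultinomialNB  # word counts / TF-IDF frequencies
from sklearn.naive_bayes import BernoulliNB    # binary (word present/absent) features

This chapter uses MultinomialNB, which matches the TF-IDF features built below.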

load_files

import matplotlib.pyplot as plt
import numpy as np
from time import time
from sklearn.datasets import load_files

print("loading train dataset ...")
t = time()
news_train = load_files(r'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\mlcomp\379\train')
print("summary: {0} documents in {1} categories.".format(
    len(news_train.data), len(news_train.target_names)))
print("done in {0} seconds".format(time() - t))

loading train dataset …
summary: 13180 documents in 20 categories.
done in 3.9394569396972656 seconds

TF-IDF

How to represent a document as information that a computer can understand and process is an important topic in natural language processing; a complete treatment could fill volumes.

TF-IDF is a statistical method for evaluating how important a word is to a document. The product of TF and IDF is the word's importance within that document.

TF (Term Frequency) is the number of times a particular word occurs in a document, divided by the total number of words in the document.

IDF (Inverse Document Frequency) is obtained by dividing the total number of documents by the number of documents that contain the word, and then taking the logarithm of that quotient; it acts as a weight for the word.

A word's importance increases in proportion to the number of times it appears in a document, but decreases in proportion to how frequently it appears across the corpus (i.e., the dataset).

Collecting every word that occurs in the corpus yields the vocabulary; each document can then be converted into a vector with as many dimensions as the vocabulary has words.
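
To ground these definitions, here is a minimal hand-rolled sketch of the plain TF-IDF formula above, computed on a made-up three-document corpus. (Note that sklearn's TfidfVectorizer, used below, applies a smoothed IDF and L2 normalization by default, so its numbers differ slightly from this textbook version.)

import math

# A toy corpus (made up for illustration)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
docs = [d.split() for d in corpus]

def tf(term, doc):
    # term frequency: occurrences of the term / total words in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log(total docs / docs containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

# "cat" is rare in the corpus, so it gets a higher weight than "the",
# even though "the" occurs more often in the first document.
print(tf("cat", docs[0]) * idf("cat", docs))  # ~0.183
print(tf("the", docs[0]) * idf("the", docs))  # ~0.135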

sklearn.feature_extraction.text.TfidfVectorizer
Convert a collection of raw documents to a matrix of TF-IDF features.

from sklearn.feature_extraction.text import TfidfVectorizer

print("vectorizing train dataset ...")
t = time()
vectorizer = TfidfVectorizer(encoding='latin-1')
X_train = vectorizer.fit_transform((d for d in news_train.data))
print("n_samples: %d, n_features: %d" % X_train.shape)
print("number of non-zero features in sample [{0}]: {1}".format(
    news_train.filenames[0], X_train[0].getnnz()))
print("done in {0} seconds".format(time() - t))

vectorizing train dataset …
n_samples: 13180, n_features: 130274
number of non-zero features in sample [datasets/mlcomp/379/train/talk.politics.misc/17860-178992]: 108
done in 3.9518160820007324 seconds

Training with MultinomialNB

sklearn.naive_bayes.MultinomialNB

The smaller the smoothing parameter alpha, the more prone the model is to overfitting; if the value is too large, the model tends to underfit.
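
To see the trade-off concretely, here is a small sketch, added for illustration, that cross-validates a few alpha values; it assumes X_train and news_train from the cells above are still in scope:

# Probe the effect of the smoothing parameter alpha via cross-validation.
# Very small alpha barely smooths the per-class word distributions and can
# overfit; very large alpha flattens them toward uniform and can underfit.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

for alpha in (0.0001, 0.01, 1.0, 10.0):
    scores = cross_val_score(MultinomialNB(alpha=alpha),
                             X_train, news_train.target, cv=3)
    print("alpha={0}: mean CV accuracy {1:.4f}".format(alpha, scores.mean()))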

from sklearn.naive_bayes import MultinomialNB

print("traning models ...".format(time() - t))
t = time()
y_train = news_train.target
clf = MultinomialNB(alpha=0.0001)
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
print("train score: {0}".format(train_score))
print("done in {0} seconds".format(time() - t))

training models …
train score: 0.9978755690440061
done in 0.30047607421875 seconds

Testing

As before, load the data first.

print("loading test dataset ...")
t = time()
news_test = load_files(r'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\mlcomp\379\test')
print("summary: {0} documents in {1} categories.".format(
    len(news_test.data), len(news_test.target_names)))
print("done in {0} seconds".format(time() - t))

loading test dataset …
summary: 5648 documents in 20 categories.
done in 5.914935827255249 seconds

Then vectorize the document data.

This time only transform() is needed; there is no need to call fit() again, because the vocabulary analysis has already been done on the training corpus.
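
Why not fit_transform()? Fitting again on the test corpus would build a different vocabulary, so the feature columns would no longer line up with what the classifier learned. A minimal sketch with made-up strings:

from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
v.fit(["apple banana", "banana cherry"])      # vocabulary: apple, banana, cherry
print(v.transform(["cherry durian"]).shape)   # (1, 3): same columns, unseen words dropped
print(TfidfVectorizer().fit_transform(["cherry durian"]).shape)  # (1, 2): different columns!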

print("vectorizing test dataset ...")
t = time()
X_test = vectorizer.transform((d for d in news_test.data))
y_test = news_test.target
print("n_samples: %d, n_features: %d" % X_test.shape)
print("number of non-zero features in sample [{0}]: {1}".format(
    news_test.filenames[0], X_test[0].getnnz()))
print("done in %fs" % (time() - t))

vectorizing test dataset …
n_samples: 5648, n_features: 130274
number of non-zero features in sample [datasets/mlcomp/379/test/rec.autos/7429-103268]: 61
done in 1.704420s

A First Sanity-Check Prediction

Take the first document in the test set for a quick sanity check: can the model predict its category correctly?

pred = clf.predict(X_test[0])
print("predict: {0} is in category {1}".format(
    news_test.filenames[0], news_test.target_names[pred[0]]))
print("actually: {0} is in category {1}".format(
    news_test.filenames[0], news_test.target_names[news_test.target[0]]))

predict: C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\mlcomp\379\test\rec.autos\7429-103268 is in category rec.autos
actually: C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\mlcomp\379\test\rec.autos\7429-103268 is in category rec.autos

The predicted category matches the actual one, so the prediction is correct.

Predicting the Whole Test Set

print("predicting test dataset ...")
t = time()
pred = clf.predict(X_test)
print("done in %fs" % (time() - t))

predicting test dataset …
done in 0.059061s

sklearn.metrics.classification_report

Build a text report showing the main classification metrics
View the prediction accuracy for each category.

from sklearn.metrics import classification_report

print("classification report on test set for classifier:")
print(clf)
print(classification_report(y_test, pred,
                            target_names=news_test.target_names))

[Figure: classification_report output, listing precision, recall, and f1-score for each of the 20 categories]

sklearn.metrics.confusion_matrix

Compute confusion matrix to evaluate the accuracy of a classification
The confusion matrix shows, for each category, how its documents were misclassified.

confusion_matrix    predicted A         predicted B         predicted C
actual A            actual A, pred A    actual A, pred B    actual A, pred C
actual B            actual B, pred A    actual B, pred B    actual B, pred C
actual C            actual C, pred A    actual C, pred B    actual C, pred C
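
To make the convention concrete, here is a tiny hand-checkable sketch with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 2, 1, 1, 2, 0]
print(confusion_matrix(y_true, y_pred))
# [[1 0 1]   row 0: one class-0 sample correct, one predicted as class 2
#  [0 2 0]   row 1: both class-1 samples correct
#  [1 0 1]]  row 2: one class-2 sample correct, one predicted as class 0

Now compute it for the test set: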
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)
print("confusion matrix:")
print(cm)

[Figure: the printed 20×20 confusion matrix]
From the first row we can see that 13 documents of category 0 (alt.atheism) were misclassified into category 19 (talk.religion.misc).
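
The same reading can be done programmatically; a small added sketch, assuming cm and news_test from the cells above:

# Find the category that class 0 (alt.atheism) is most often confused with.
row = cm[0].copy()
row[0] = 0                      # mask the correct predictions on the diagonal
print("most confused with: {0} ({1} documents)".format(
    news_test.target_names[row.argmax()], row.max()))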

# Show confusion matrix
plt.figure(figsize=(8, 8), dpi=144)
plt.title('Confusion matrix of the classifier')
ax = plt.gca()
# Hide the axes frame and ticks so only the matrix cells are visible
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
ax.set_xticklabels([])
ax.set_yticklabels([])
# Draw the matrix as a grayscale image: the brighter a cell, the more
# documents fall into it; a bright diagonal means mostly correct predictions
plt.matshow(cm, fignum=1, cmap='gray')
plt.colorbar();

[Figure: grayscale plot of the confusion matrix with colorbar]
