Code accompanying the book, with some notes.
- Load the dataset
Download from http://mlcomp.org/datasets/379
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from time import time
from sklearn.datasets import load_files
print("loading train dataset ...")
t = time()
news_train = load_files('datasets/mlcomp/379/train')
print("summary: {0} documents in {1} categories.".format(len(news_train.data), len(news_train.target_names)))
print("done in {0} seconds".format(time() - t))
print()
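- Vectorize the documents
TF-IDF is a statistical weighting method. TF is term frequency, IDF is inverse document frequency, and the TF-IDF value indicates how important a term is within a document.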
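As a rough illustration of the weighting, the sketch below recomputes TF-IDF by hand on a toy corpus and compares the result with TfidfVectorizer. It assumes scikit-learn's default settings (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The corpus and all variable names are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not part of the 20 newsgroups data
docs = ["the cat sat", "the dog sat", "the dog barked"]

# Reference result from scikit-learn with default parameters
vectorizer = TfidfVectorizer()
X_sklearn = vectorizer.fit_transform(docs).toarray()

# Manual computation under the assumed default formula
tokens = [d.split() for d in docs]
terms = sorted(set(w for doc in tokens for w in doc))
tf = np.array([[doc.count(t) for t in terms] for doc in tokens], dtype=float)
df = (tf > 0).sum(axis=0)                     # number of documents containing each term
idf = np.log((1 + len(docs)) / (1 + df)) + 1  # smoothed inverse document frequency
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

print(terms)
print(np.allclose(manual, X_sklearn))
```

A term such as "the", which appears in every document, gets the lowest IDF, so it contributes less to each vector than a rarer term such as "barked".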
from sklearn.feature_extraction.text import TfidfVectorizer
print("vectorizing train dataset ...")
t = time()
vectorizer = TfidfVectorizer(encoding='latin-1')
X_train = vectorizer.fit_transform((d for d in news_train.data))
print("n_samples: %d, n_features: %d" % X_train.shape)
print("number of non-zero features in sample [{0}]: {1}".format(news_train.filenames[0], X_train[0].getnnz()))
print("done in {0} seconds".format(time() - t))
fit_transform() combines fit() and transform().
fit() analyzes the corpus and builds the vocabulary, among other things.
transform() converts each document into a vector.
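The split between fit() and transform() can be seen on a tiny example. This is only a sketch: the two "training" documents and the one "test" document are invented, and the variable names carry a _toy suffix to avoid clashing with the notebook's own variables.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents for illustration only
train_docs_toy = ["apple banana", "banana cherry"]
test_docs_toy = ["apple durian"]

vec_toy = TfidfVectorizer()
vec_toy.fit(train_docs_toy)                       # learns the vocabulary and IDF weights
X_train_toy = vec_toy.transform(train_docs_toy)   # maps documents onto that vocabulary
X_test_toy = vec_toy.transform(test_docs_toy)     # reuses the same vocabulary

# fit_transform() in one call gives the same matrix as fit() followed by transform()
X_one_step = TfidfVectorizer().fit_transform(train_docs_toy)

print(sorted(vec_toy.vocabulary_))   # "durian" is absent: it was never seen by fit()
print(X_train_toy.shape, X_test_toy.shape)
print(np.allclose(X_one_step.toarray(), X_train_toy.toarray()))
```

Words that only occur in the test documents are silently dropped, which is why the test matrix keeps the same number of columns as the training matrix.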
- Train the model
from sklearn.naive_bayes import MultinomialNB
print("training models ...")
t = time()
y_train = news_train.target
clf = MultinomialNB(alpha=0.0001)
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
print("train score: {0}".format(train_score))
print("done in {0} seconds".format(time() - t))
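This uses the multinomial model. The smaller alpha is, the more likely the model is to overfit; the larger it is, the more likely it is to underfit.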
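One way to see what alpha does is to compare predicted probabilities on a tiny count matrix. The sketch below is an assumption-laden toy, not the newsgroups data: three hypothetical words, two classes, and a new sample that mixes words from both classes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: columns are three made-up words
X_toy = np.array([[3, 0, 1],
                  [2, 0, 0],
                  [0, 3, 1],
                  [0, 2, 0]])
y_toy = np.array([0, 0, 1, 1])

# A new document containing words typical of both classes
x_new = np.array([[2, 1, 0]])

proba = {}
for alpha in (0.0001, 1.0):
    clf_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
    proba[alpha] = clf_toy.predict_proba(x_new)[0]
    print("alpha={0}: P(class 0)={1:.5f}".format(alpha, proba[alpha][0]))
```

With the tiny alpha, the second word is treated as almost impossible in class 0 and vice versa, so the model trusts the training counts nearly completely and its probabilities become extreme. With alpha=1.0 the estimate is smoothed and noticeably less confident.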
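- Load the test set, vectorize it, and test
When vectorizing the test set, only transform() is needed, because the vocabulary was already built from the training set.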
print("loading test dataset ...")
t = time()
news_test = load_files('datasets/mlcomp/379/test')
print("summary: {0} documents in {1} categories.".format(
    len(news_test.data), len(news_test.target_names)))
print("done in {0} seconds".format(time() - t))
# Vectorize
print("vectorizing test dataset ...")
t = time()
X_test = vectorizer.transform((d for d in news_test.data))
y_test = news_test.target
print("n_samples: %d, n_features: %d" % X_test.shape)
print("number of non-zero features in sample [{0}]: {1}".format(
    news_test.filenames[0], X_test[0].getnnz()))
print("done in %fs" % (time() - t))
# Quick test with a single document
pred = clf.predict(X_test[0])
print("predict: {0} is in category {1}".format(
    news_test.filenames[0], news_test.target_names[pred[0]]))
print("actually: {0} is in category {1}".format(
    news_test.filenames[0], news_test.target_names[news_test.target[0]]))
- Evaluate the model
print("predicting test dataset ...")
t = time()
pred = clf.predict(X_test)
print("done in %fs" % (time() - t))
from sklearn.metrics import classification_report
print("classification report on test set for classifier:")
print(clf)
print(classification_report(y_test, pred,
                            target_names=news_test.target_names))
# Build the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, pred)
print("confusion matrix:")
print(cm)
# confusion matrix:
# [[224 0 0 0 0 0 0 0 0 0 0 0 0 0 2 5 0 0 1 13]
# [ 1 267 5 5 2 8 1 1 0 0 0 2 3 2 1 0 0 0 0 0]
# [ 1 13 230 24 4 10 5 0 0 0 0 1 2 1 0 0 0 0 1 0]
# [ 0 9 21 242 7 2 10 1 0 0 1 1 7 0 0 0 0 0 0 0]
# [ 0 1 5 5 233 2 2 2 1 0 0 3 1 0 1 0 0 0 0 0]
# [ 0 20 6 3 1 260 0 0 0 2 0 1 0 0 2 0 2 0 0 0]
# [ 0 2 5 12 3 1 235 10 2 3 1 0 7 0 2 0 2 1 4 0]
# [ 0 1 0 0 1 0 8 300 4 1 0 0 1 2 3 0 2 0 1 0]
# [ 0 1 0 0 0 2 2 3 283 0 0 0 1 0 0 0 0 0 1 1]
# [ 0 1 1 0 1 2 1 2 0 297 8 1 0 1 0 0 0 0 0 0]
# [ 0 0 0 0 0 0 0 0 2 2 298 0 0 0 0 0 0 0 0 0]
# [ 0 1 2 0 0 1 1 0 0 0 0 284 2 1 0 0 2 1 2 0]
# [ 0 11 3 5 4 2 4 5 1 1 0 4 266 1 4 0 1 0 1 0]
# [ 1 1 0 1 0 2 1 0 0 0 0 0 1 266 2 1 0 0 1 0]
# [ 0 3 0 0 1 1 0 0 0 0 0 1 0 1 296 0 1 0 1 0]
# [ 3 1 0 1 0 0 0 0 0 0 1 0 0 2 1 280 0 1 1 2]
# [ 1 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 236 1 4 1]
# [ 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 3 0 290 1 0]
# [ 2 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 10 7 212 0]
# [ 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12 4 1 4 134]]
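In the matrix above, rows are true categories and columns are predicted categories, so the diagonal holds the correctly classified documents. The following sketch shows how per-class recall and precision can be derived from such a matrix. The labels are a small invented example, and the result is checked against scikit-learn's own recall_score and precision_score.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for a 3-class problem
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2, 2, 0, 2]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# Recall: correct predictions divided by the row sum (all documents truly in the class)
recall_toy = np.diag(cm_toy) / cm_toy.sum(axis=1)
# Precision: correct predictions divided by the column sum (all documents predicted as the class)
precision_toy = np.diag(cm_toy) / cm_toy.sum(axis=0)

print(cm_toy)
print("recall:", recall_toy)
print("precision:", precision_toy)
```

Read this way, the last row of the matrix above shows that many documents of the final category were predicted as category 0 (16 of them) or category 15 (12 of them), which lowers that category's recall.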
# Visualization
# Show confusion matrix
plt.figure(figsize=(8, 8), dpi=144)
plt.title('Confusion matrix of the classifier')
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
ax.set_xticklabels([])
ax.set_yticklabels([])
plt.matshow(cm, fignum=1, cmap='gray')
plt.colorbar();
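The plot makes the brightest off-diagonal cells stand out, and the same information can be listed numerically. Below is a minimal sketch that ranks the most frequent confusions; cm_small is an invented 3x3 matrix, and with the real data the indices could be mapped to names through news_test.target_names.

```python
import numpy as np

# Hypothetical confusion matrix (rows: true class, columns: predicted class)
cm_small = np.array([[50, 2, 8],
                     [1, 45, 4],
                     [9, 0, 41]])

# Zero the diagonal so only misclassifications remain
off_diag = cm_small.copy()
np.fill_diagonal(off_diag, 0)

# Indices of the three largest off-diagonal counts, largest first
top = np.argsort(off_diag, axis=None)[::-1][:3]
rows, cols = np.unravel_index(top, off_diag.shape)
confused_pairs = [(int(i), int(j), int(off_diag[i, j])) for i, j in zip(rows, cols)]

for true_cls, pred_cls, count in confused_pairs:
    print("true {0} predicted as {1}: {2} times".format(true_cls, pred_cls, count))
```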
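Extension:
Naive Bayes classifiers commonly come in three variants:
(1) GaussianNB, the Gaussian model
sklearn.naive_bayes.GaussianNB(priors=None)
(2) BernoulliNB, the Bernoulli model
sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
Only considers whether a word occurs, so each feature value is 0 or 1.
(3) MultinomialNB, the multinomial model
sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
Takes word frequency into account, so each feature value is a number (a float here, since TF-IDF weights are used).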
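To make the difference between the three variants concrete, here is a small sketch that fits all of them on the same invented count matrix. The data is hypothetical, and the example only aims to show how each model consumes its input, not which one performs best.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical word counts for four documents and three made-up words
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 3, 1],
                     [0, 2, 0]])
y_small = np.array([0, 0, 1, 1])

# MultinomialNB uses the counts (or TF-IDF weights) directly
mnb = MultinomialNB().fit(X_counts, y_small)

# BernoulliNB with binarize=0.0 turns every value > 0 into 1 before fitting
bnb = BernoulliNB(binarize=0.0).fit(X_counts, y_small)
# Equivalent: binarize by hand and switch off the built-in thresholding
bnb_manual = BernoulliNB(binarize=None).fit((X_counts > 0).astype(int), y_small)

# GaussianNB models each feature as a continuous value and needs a dense array
gnb = GaussianNB().fit(X_counts.astype(float), y_small)

for name, model in (("multinomial", mnb), ("bernoulli", bnb), ("gaussian", gnb)):
    print(name, model.predict(X_counts))
```

Because GaussianNB expects dense input, a sparse TF-IDF matrix like X_train would have to be converted with .toarray() first, which can be memory hungry for a large vocabulary.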