Text Classification with K-Nearest Neighbors and Naive Bayes in Python

qq_37353305

Last modified: 2022-02-02 11:07:00

Category: ML with Python. Tags: python, classification, machine learning, naive Bayes algorithm

First published: 2022-02-01 08:44:36

Article link: https://blog.csdn.net/qq_37353305/article/details/122759064

Text Classification with KNN and Naive Bayes Algorithm in Python

Introduction
Models
- knn
- naive bayes
Data
Implementation with Python

Introduction

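KNN is one of the most common and simplest nonparametric machine learning methods. It makes no assumptions about the data generating process (DGP), so it can be applied in most settings. In my experience, though, KNN does not do well on high-dimensional data and overfits easily. Naive Bayes is a parametric method that comes with model assumptions. I put the two together because both are among the most classic and simplest methods in machine learning.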

Models

Suppose we have data $(x_i,y_i)_{i=1}^n$, where $x_i$ is the explanatory variable and $y_i$ is the true discrete label, taking one of $p$ classes.

knn

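$pr(y=r|x=x_0)=\frac{\#\{i:y_i=r, x_i\in \operatorname{knn}(x_0,d)\} }{k}.$ Here $\operatorname{knn}(x_0,d)$ is the set of the $k$ points among $(x_i)_{i=1}^n$ that are closest to $x_0$ under the distance $d(x_0,x_i)$. The algorithm is simple: for any query point, compute the distance from every training point to it, pick the $k$ nearest points, and take the share of each class among them.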
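
To make the formula concrete, here is a minimal sketch of the KNN class-probability estimate written directly in NumPy. The function name `knn_posterior`, the toy data, and the choice of Euclidean distance for $d$ are my own assumptions for illustration; this is not how sklearn implements it internally.

```python
import numpy as np

def knn_posterior(x_train, y_train, x0, k, classes):
    """Estimate pr(y = r | x = x0) as the share of class r among the k nearest points."""
    # Euclidean distance from x0 to every training point (an assumed choice of d)
    dists = np.linalg.norm(x_train - x0, axis=1)
    # indices of the k smallest distances; the stable sort breaks ties by index
    nearest = np.argsort(dists, kind="stable")[:k]
    neighbor_labels = y_train[nearest]
    return {r: float(np.mean(neighbor_labels == r)) for r in classes}

# toy one-dimensional data (hypothetical values)
x_toy = np.array([[0.0], [0.2], [0.4], [1.0], [1.3], [1.6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

post = knn_posterior(x_toy, y_toy, x0=np.array([0.8]), k=3, classes=[0, 1])
print(post)
```

For the query point 0.8, the three nearest training points are 1.0, 0.4 and 1.3, so two of the three neighbors belong to class 1.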

naive bayes

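$pr(y=r|x=x_0)=\frac{\pi_rf_r(x_0)}{\sum_{k=1}^p\pi_kf_k(x_0)}.$ This is Bayes' formula, where $\pi_k=pr(y=k)$ is the prior probability of class $k$ and can be estimated by $\widehat{\pi}_k=\frac{\#\{i: y_i=k\}}{n}$. $f_k(x)$ is the density of $x \mid y = k$, and its parametric form has to be specified. If we take $f_k(x)=f_k(x,\mu_k,\sigma_k)=\frac{1}{\sqrt{2\pi\sigma_k^2}}\exp(-\frac{(x-\mu_k)^2}{2\sigma_k^2})$, the Gaussian density, the model is called Gaussian naive Bayes. When $x$ has several features, this density is applied to each feature separately and the results are multiplied, which is the "naive" conditional-independence assumption. If we take $f_k(x)$ to be multinomial, i.e. $f_k(x)=f_k(x,\theta_k)=\frac{(\sum_{i=1}^Kx_{i})!}{\prod_{i=1}^Kx_i!}\prod_{i=1}^K(\theta_{ki})^{x_i}$ with $K=\operatorname{dim}(x)$, the model is called multinomial naive Bayes. Once the parametric distribution is specified, we can train the model on the training set.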
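
As a rough illustration of Bayes' formula with Gaussian class densities, the sketch below estimates $\pi_k$, $\mu_k$ and $\sigma_k^2$ from a one-dimensional toy sample and computes the posterior at a query point. The function name `gaussian_nb_posterior` and the data are hypothetical, and I use plain maximum-likelihood estimates; sklearn's GaussianNB additionally adds a small variance-smoothing term, so its numbers can differ slightly.

```python
import numpy as np

def gaussian_nb_posterior(x_train, y_train, x0):
    """Posterior pr(y = k | x = x0) for one-dimensional x with Gaussian f_k."""
    classes = np.unique(y_train)
    joint = []
    for k in classes:
        xk = x_train[y_train == k]
        prior = len(xk) / len(x_train)       # pi_k-hat = #{i: y_i = k} / n
        mu, var = xk.mean(), xk.var()        # ML estimates of mu_k and sigma_k^2
        density = np.exp(-(x0 - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        joint.append(prior * density)        # numerator pi_k * f_k(x0)
    joint = np.array(joint)
    return classes, joint / joint.sum()      # divide by sum_k pi_k * f_k(x0)

# toy data (hypothetical values): class 0 centered near 1, class 1 near 3
x_toy = np.array([1.0, 1.2, 0.8, 3.0, 3.3, 2.7, 3.1])
y_toy = np.array([0, 0, 0, 1, 1, 1, 1])

classes, post = gaussian_nb_posterior(x_toy, y_toy, x0=1.1)
print(dict(zip(classes.tolist(), post.tolist())))
```

The query point 1.1 sits inside the class-0 cluster, so almost all of the posterior mass goes to class 0 even though class 1 has the larger prior.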

Data

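We study a text dataset, the smoker’s helpline dataset. It comes from the smokers' helpline center at the University of Waterloo. The center collected information from callers in Canada who intended to quit smoking, and called them back six months later to ask “What helped you the most in trying to quit [smoking]?”. Each response (the text data) was recorded and assigned to one of more than 20 categories. What we want to do here is classify the text of these responses.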

Implementation with Python

Import the required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB
import itertools

Load the data. Here we only use the observations from 2006 to train the classifiers and evaluate them on a test set; there are 1175 observations in total:

# load data
df_smoker = pd.read_csv("smokerhelpline.csv",  sep=',', engine='python')
df_2006 = df_smoker[df_smoker.year_start==2006]
df_2006.head(10)

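Of course, we first have to process the text and turn it into features and values that a model can use. The features are n-grams, and the value of each can be the frequency of that gram in the text or a tf-idf weight. For custom processing you can use the nltk library, which has many text-processing functions: tokenization, stemming, stopword removal, lemmatization, punctuation removal, and so on. Here we directly use CountVectorizer and TfidfVectorizer from sklearn.feature_extraction.text to convert the text into the form we want: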

docs_2006 = df_2006["text"]

# vectorization
vectorizer_count = CountVectorizer(stop_words = "english", max_features = 600)
vectorizer_tfidf = TfidfVectorizer(stop_words = "english", max_features = 600)

# fit text
vectorizer_count.fit(docs_2006)
vectorizer_tfidf.fit(docs_2006)

# encode document
x = vectorizer_count.transform(docs_2006).toarray()
y = df_2006["code"]
# x = vectorizer_tfidf.transform(docs_2006)
# summarize encoded vector
print('shape: ', x.shape)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
results = pd.DataFrame(x, columns=vectorizer_count.get_feature_names_out())
results.head(10)
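
To see what the two vectorizers produce, here is a small sketch on three made-up sentences (the `toy_docs` are hypothetical, not taken from the dataset). It reads the column order from the fitted `vocabulary_` attribute so that it does not depend on which feature-name method your scikit-learn version provides.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# three made-up responses (hypothetical, not from the helpline data)
toy_docs = [
    "nicotine patch helped me quit",
    "support from family helped",
    "the patch and the gum",
]

# raw term counts per document
count_vec = CountVectorizer(stop_words="english")
x_count = count_vec.fit_transform(toy_docs).toarray()

# tf-idf weights; each row is L2-normalized by default
tfidf_vec = TfidfVectorizer(stop_words="english")
x_tfidf = tfidf_vec.fit_transform(toy_docs).toarray()

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(x_count)
print(np.round(x_tfidf, 2))
```

Both matrices have the same columns and the same zero pattern; only the values differ. Counts are integers, while tf-idf downweights terms that appear in many documents and rescales each row to unit length.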

CountVectorizer has many parameters: for example, you can supply your own tokenizer and stemmer, and you can set max_features, which I set to 600 here. Now that we have both $x$ and $y$, we split them into train data and test data:

# train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

We consider three models: 3-NN, GaussianNB, and MultinomialNB:

n_neighbors = 3

# knn
model_knn = neighbors.KNeighborsClassifier(n_neighbors, weights = "distance", metric = 'hamming')
# naive Bayes
model_gaussnb = GaussianNB()
model_multinb = MultinomialNB(alpha = 0.2)

# fit model
model_knn.fit(x_train, y_train)
model_gaussnb.fit(x_train, y_train)
model_multinb.fit(x_train, y_train)
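
One thing worth keeping in mind about weights = "distance": when a query point coincides with a training point, sklearn gives that training point all of the weight. So on the training set each point essentially votes for its own label, and training accuracy says little about generalization. The sketch below shows this on purely random data where there is nothing to learn; the data and variable names are synthetic assumptions, and it uses the default Euclidean metric rather than the hamming metric above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# random features and random labels: no real signal (synthetic data)
x_noise = rng.normal(size=(200, 5))
y_noise = rng.integers(0, 2, size=200)

train_acc = {}
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=3, weights=weights)
    knn.fit(x_noise, y_noise)
    # score on the same data the model was fitted on
    train_acc[weights] = knn.score(x_noise, y_noise)

print(train_acc)
```

With distance weighting the training accuracy reaches 100% here even though the labels are noise. On real count data, documents with identical feature vectors but different labels could keep it slightly below 100%.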

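MultinomialNB estimates the $j$-th element $\theta_{kj}$ of $\theta_k$ by $\hat{\theta}_{kj}(\alpha)=\frac{\sum_{i:y_i=k}x_{ji}+\alpha}{\sum_{j'=1}^K\sum_{i:y_i=k}x_{j'i}+K\alpha}$, where $x_{ji}$ is the $j$-th feature of observation $i$ and $K=\operatorname{dim}(x)$, so the smoothing parameter $\alpha$ has to be chosen. After training the models on the train data, we evaluate their performance on the test data: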

# model evaluation
print("Accuracy of knn on train data:", np.round(model_knn.score(x_train, y_train)*100,2),"%")
print("Accuracy of knn on test data:", np.round(model_knn.score(x_test, y_test)*100,2),"%")
print("Accuracy of gaussiannb on train data:", np.round(model_gaussnb.score(x_train, y_train)*100,2),"%")
print("Accuracy of gaussiannb on test data:", np.round(model_gaussnb.score(x_test, y_test)*100,2),"%")
print("Accuracy of multinomialnb on train data:", np.round(model_multinb.score(x_train, y_train)*100,2),"%")
print("Accuracy of multinomialnb on test data:", np.round(model_multinb.score(x_test, y_test)*100,2),"%")

Accuracy of knn on train data: 98.09 %
Accuracy of knn on test data: 62.55 %
Accuracy of gaussiannb on train data: 56.91 %
Accuracy of gaussiannb on test data: 34.89 %
Accuracy of multinomialnb on train data: 87.55 %
Accuracy of multinomialnb on test data: 70.21 %
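
The additive-smoothing estimate used by MultinomialNB can be checked numerically. The sketch below assumes the denominator adds $\alpha$ once per feature (that is, $K\alpha$ with $K=\operatorname{dim}(x)$) and compares a hand-computed $\hat{\theta}_{kj}$ with the fitted `feature_log_prob_` attribute. The toy count matrix is hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy count matrix: 4 documents, K = 3 features (hypothetical values)
x_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 1, 2]])
y_toy = np.array([0, 0, 1, 1])
alpha = 0.2

model = MultinomialNB(alpha=alpha).fit(x_toy, y_toy)

K = x_toy.shape[1]
theta_manual = []
for k in model.classes_:
    # per-feature counts summed over the documents in class k
    counts = x_toy[y_toy == k].sum(axis=0)
    theta_manual.append((counts + alpha) / (counts.sum() + K * alpha))
theta_manual = np.array(theta_manual)

# feature_log_prob_ stores log(theta_kj), so exponentiate before comparing
matches = np.allclose(np.exp(model.feature_log_prob_), theta_manual)
print(matches)
```

Each row of `theta_manual` sums to one, and a feature with zero count in a class (such as the third feature in class 0) still gets a small positive probability, which is the point of the smoothing.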

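With the parameters we chose, knn overfits severely and its performance is mediocre. GaussianNB performs very poorly, and MultinomialNB does best, reaching about 70% accuracy on the test data. We can plot the confusion matrix: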

# predict y_test and calculate confusion matrix
y_pred_knn = model_knn.predict(x_test)
cm_knn = confusion_matrix(y_test, y_pred_knn)
print("confusion matrix for knn prediction:")
print(cm_knn)

# multinomial nb
y_pred_multinb = model_multinb.predict(x_test)
cm_multinb = confusion_matrix(y_test, y_pred_multinb)

# define cm-heatmap function
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Oranges, font_size = 5):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=font_size)
    plt.colorbar(fraction=0.046, pad=0.04)
    tick_marks = np.arange(cm.shape[1])
    plt.xticks(tick_marks, fontsize=font_size)
    # ax = plt.gca()
    # ax.set_xticklabels((ax.get_xticks() +1).astype(str))
    plt.yticks(tick_marks, fontsize=font_size)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j]),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black", fontsize=font_size)

    plt.tight_layout()
    plt.ylabel('True label', fontsize=font_size)
    plt.xlabel('Predicted label', fontsize=font_size)

plt.subplots(figsize = (10,10))
print('Confusion matrix for multinomialnb:')
plot_confusion_matrix(cm_multinb, font_size = 15)

confusion matrix for knn prediction:
[[ 2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  1  0  0  3  0  0  0  0  5  0  0  0  0  0  0  0  0  0]
 [ 1  0  1  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  5  0  0  0  0  0  0  1  0  0  2  0  0  0  0  0  0  0  0  0]
 [ 0  0  1  0  0  9  0  0  0  0  0  0  0  0  4  0  0  0  0  0  0  0  2  0]
 [ 0  0  1  0  0  1  9  0  0  0  0  0  0  0  4  0  0  0  0  0  0  1  0  0]
 [ 0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  2  0  0  0  0  0  6  0  0  0  0  1  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  4  1  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  1  0  0  0  2  1  0  0  0  0  0  0  1  0]
 [ 0  0  0  0  0  0  0  0  0  0  0 11  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0 23  0  0  0  0  0  0  0  0  0  0]
 [ 1  0  0  0  0  1  0  0  0  0  0  0  0  0 15  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  1  1  0  0  0  0  0  0  0  0  5  5  0  0  0  0  0  2  2  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  1  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  1  0  0  1  0  0  0  0  0  1  0  0  2  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  3  1  0  0  0  6  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  3  0  0  3  0  0  0 11  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  6  1  0  0  0  0  0  0 32  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0  0  0  0  0  0  8]]

Confusion matrix for multinomialnb:
[Figure: heatmap of the MultinomialNB confusion matrix]
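
With more than 20 classes of very different sizes, a single accuracy number hides a lot. classification_report, imported at the top, gives precision, recall and F1 per class. Here is a minimal sketch on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical); in this article's setting it would presumably be called with y_test and y_pred_multinb instead.

```python
from sklearn.metrics import classification_report

# hypothetical true and predicted labels for three unbalanced classes
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]

# dictionary form, convenient for picking out individual numbers
report = classification_report(y_true_toy, y_pred_toy, zero_division=0, output_dict=True)

# text form, convenient for reading
print(classification_report(y_true_toy, y_pred_toy, zero_division=0))
print("recall of class 1:", report["1"]["recall"])
```

In this toy case overall accuracy is 80%, but the two small classes are each recovered only half of the time, which is the kind of pattern the confusion matrices above also show.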
Of course, none of these methods performs particularly well. Later we will try SVM, tree-based methods, and deep learning on this dataset.
