Text Classification with KNN and Naive Bayes Algorithm in Python
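Introduction
Knn is one of the most common and simplest nonparametric machine learning methods. It makes no assumptions about the data generating process (DGP), so it applies to most settings. In my experience, though, knn does not perform well on high-dimensional data and overfits easily. Naive Bayes is a parametric method with explicit model assumptions. I put the two together because both are among the most classic and simplest methods in machine learning.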
Models
Suppose we have data $(x_i,y_i)_{i=1}^n$, where $x_i$ is the explanatory variable and $y_i$ is the true discrete label, taking one of $p$ classes.
knn
$$pr(y=r|x=x_0)=\frac{\#\{i:y_i=r,\ x_i\in \operatorname{knn}(x_0,d)\}}{k},$$
where $\operatorname{knn}(x_0,d)$ is the set of the $k$ points $x_i$ among $(x_i)_{i=1}^n$ that are closest to $x_0$ under the distance $d(x_0,x_i)$. The algorithm is simple: for any query point, compute the distance from every training point to it, select the $k$ nearest points, and take the class proportions among them.
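The knn estimate above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `knn_posterior`, the toy points, and the choice of Euclidean distance for $d$ are assumptions, not part of the original analysis.

```python
import numpy as np

def knn_posterior(x_train, y_train, x0, k, labels):
    """Estimate pr(y = r | x = x0) as the share of the k nearest
    training points that carry label r (hypothetical helper)."""
    # Euclidean distance is an assumption; any distance d could be used.
    dists = np.linalg.norm(x_train - x0, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]  # indices of the k closest points
    return {r: float(np.mean(y_train[nearest] == r)) for r in labels}

# Made-up toy data: two classes in the plane.
x_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1, 1])

post = knn_posterior(x_train, y_train, np.array([0.2, 0.2]), k=3, labels=[0, 1])
print(post)
```

The three nearest neighbours of the query point carry labels 0, 0 and 1, so the estimated probabilities are 2/3 and 1/3.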
naive bayes
$$pr(y=r|x=x_0)=\frac{\pi_r f_r(x_0)}{\sum_{k=1}^p \pi_k f_k(x_0)}.$$
This is Bayes' formula, so $\pi_k=pr(y=k)$, which can be estimated by $\widehat{\pi}_k=\frac{\#\{i: y_i=k\}}{n}$. $f_k(x)$ is the density of $x|y=k$, and its parametric form has to be specified in advance. If we take the Gaussian density
$$f_k(x)=f_k(x,\mu_k,\sigma_k)=\frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right),$$
the model is called Gaussian naive Bayes (for vector-valued $x$, this density is applied to each coordinate and the results are multiplied, which is the "naive" independence assumption). If we let $f_k(x)$ be the multinomial distribution, i.e.
$$f_k(x)=f_k(x,\theta_k)=\frac{(\sum_{i=1}^K x_i)!}{\prod_{i=1}^K x_i!}\prod_{i=1}^K(\theta_{ki})^{x_i},$$
where $K=\operatorname{dim}(x)$, the model is called multinomial naive Bayes. Once the parametric distribution is specified, we can train the model on the training set.
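As a rough illustration of Bayes' formula with Gaussian class densities, the sketch below estimates $\pi_k$, $\mu_k$ and $\sigma_k^2$ per class and per feature, then normalises $\pi_k f_k(x_0)$ over the classes. The function name `gaussian_nb_posterior` and the toy data are assumptions made for this example, and the code assumes every feature has non-zero variance within each class.

```python
import numpy as np

def gaussian_nb_posterior(x_train, y_train, x0):
    """Posterior pr(y = k | x = x0) under Gaussian naive Bayes
    (hypothetical helper; assumes non-zero within-class variances)."""
    classes = np.unique(y_train)
    log_joint = []
    for k in classes:
        xk = x_train[y_train == k]
        prior = len(xk) / len(x_train)   # pi_k = #{i: y_i = k} / n
        mu = xk.mean(axis=0)             # per-feature mean
        var = xk.var(axis=0)             # per-feature (MLE) variance
        # Sum of log Gaussian densities over features (naive independence).
        log_density = -0.5 * np.sum(np.log(2 * np.pi * var) + (x0 - mu) ** 2 / var)
        log_joint.append(np.log(prior) + log_density)
    log_joint = np.array(log_joint)
    # Normalise in log space for numerical stability.
    post = np.exp(log_joint - log_joint.max())
    return classes, post / post.sum()

# Made-up toy data: two well-separated classes.
x_train = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5],
                    [6.0, 7.0], [7.0, 6.0], [6.5, 6.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

classes, post = gaussian_nb_posterior(x_train, y_train, np.array([1.2, 1.8]))
print(dict(zip(classes.tolist(), post.tolist())))
```

Working with log densities avoids underflow when many features are multiplied together, which matters for text data with hundreds of columns.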
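Data
We study a text dataset, the smoker's helpline dataset, which comes from the smokers' helpline centre at the University of Waterloo. The centre collected information from callers across Canada who intended to quit smoking, and called them back six months later to ask "What helped you the most in trying to quit [smoking]?". Each respondent's response (the text data) was recorded and assigned to one of more than 20 categories. Our goal here is to classify the text of these responses.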
Implementation with Python
Import the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB
import itertools
Load the data. Here we use only the observations from 2006 to train the classifiers and evaluate them on a test set; there are 1175 observations in total:
# load data
df_smoker = pd.read_csv("smokerhelpline.csv", sep=',', engine='python')
df_2006 = df_smoker[df_smoker.year_start==2006]
df_2006.head(10)
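Of course, we first have to process the text and turn it into variables and values that a model can use. The features are n-grams, and the value of each feature can be the frequency of that gram in the text or a tf-idf weight. For more flexible preprocessing you can use the nltk library, which contains many text-processing functions for tokenizing, stemming, removing stopwords, lemmatization, removing punctuation, etc. Here, we simply use CountVectorizer and TfidfVectorizer from sklearn.feature_extraction.text to convert the text into the form we want: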
docs_2006 = df_2006["text"]
# vectorization
vectorizer_count = CountVectorizer(stop_words = "english", max_features = 600)
vectorizer_tfidf = TfidfVectorizer(stop_words = "english", max_features = 600)
# fit text
vectorizer_count.fit(docs_2006)
vectorizer_tfidf.fit(docs_2006)
# encode document
x = vectorizer_count.transform(docs_2006).toarray()
y = df_2006["code"]
# x = vectorizer_tfidf.transform(docs_2006)
# summarize encoded vector
print('shape: ', x.shape)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
results = pd.DataFrame(x, columns=vectorizer_count.get_feature_names_out())
results.head(10)
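To make the difference between the two vectorizers concrete, here is a small sketch on three made-up documents (they are not from the smoker's helpline data). It assumes a scikit-learn version that provides `get_feature_names_out()`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up toy documents, for illustration only.
docs = ["nicotine patch helped", "patch and willpower", "willpower willpower gum"]

count_vec = CountVectorizer(stop_words="english")
tfidf_vec = TfidfVectorizer(stop_words="english")

counts = count_vec.fit_transform(docs).toarray()  # raw term counts per document
tfidf = tfidf_vec.fit_transform(docs).toarray()   # tf-idf weights per document
vocab = list(count_vec.get_feature_names_out())   # column names, sorted alphabetically

print(vocab)
print(counts)
print(np.round(tfidf, 3))
```

Both matrices have the same columns and the same zero pattern; the stopword "and" is dropped. With the default settings, each row of `tfidf` is scaled to unit Euclidean length, so terms that occur in fewer documents receive relatively larger weights.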
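CountVectorizer has many parameters. For example, you can supply your own tokenizer (which can include a stemmer), and you can set max_features, which I set to 600 here. Now that we have both $x$ and $y$, we split them into train data and test data: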
# train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
We consider three models: 3-nn, GaussianNB and MultinomialNB:
n_neighbors = 3
# knn
model_knn = neighbors.KNeighborsClassifier(n_neighbors, weights = "distance", metric = 'hamming')
# naive Bayes
model_gaussnb = GaussianNB()
model_multinb = MultinomialNB(alpha = 0.2)
# fit model
model_knn.fit(x_train, y_train)
model_gaussnb.fit(x_train, y_train)
model_multinb.fit(x_train, y_train)
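The `alpha = 0.2` passed to `MultinomialNB` above is an additive smoothing constant. The sketch below, on made-up toy counts, assumes the usual smoothed estimate (the class-wise count of a feature plus alpha, divided by the class total plus alpha times the number of features) and compares it with what scikit-learn stores in `feature_log_prob_`. `smoothed_theta` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def smoothed_theta(x_train, y_train, alpha):
    """Additively smoothed multinomial parameters, one row per class
    (hypothetical helper for illustration)."""
    classes = np.unique(y_train)
    # Total count of each feature within each class.
    counts = np.array([x_train[y_train == k].sum(axis=0) for k in classes])
    n_features = x_train.shape[1]
    theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + n_features * alpha)
    return classes, theta

# Made-up toy count data: 3 documents, 3 features, 2 classes.
x_train = np.array([[2, 1, 0], [1, 0, 0], [0, 1, 3]])
y_train = np.array([0, 0, 1])

classes, theta = smoothed_theta(x_train, y_train, alpha=0.2)

model = MultinomialNB(alpha=0.2).fit(x_train, y_train)
matches_sklearn = np.allclose(theta, np.exp(model.feature_log_prob_))

print(np.round(theta, 4))
print("matches scikit-learn:", matches_sklearn)
```

Without smoothing, a feature that never appears in a class would get probability zero and wipe out the whole product for that class; a positive alpha keeps every estimate strictly positive.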
MultinomialNB estimates $\theta_{kj}$, the $j$-th element of $\theta_k$, through the formula
$$\hat{\theta}_{kj}(\alpha)=\frac{\sum_{i:y_i=k}x_{ji}+\alpha}{\sum_{j'=1}^K\sum_{i:y_i=k}x_{j'i}+K\alpha},$$
where $x_{ji}$ is the $j$-th feature of observation $i$ and $K=\operatorname{dim}(x)$ is the number of features, so $\alpha$ has to be specified by the user. After training the models on the train data, we evaluate their performance on the test data:
# model evaluation
print("Accuracy of knn on train data:", np.round(model_knn.score(x_train, y_train)*100,2),"%")
print("Accuracy of knn on test data:", np.round(model_knn.score(x_test, y_test)*100,2),"%")
print("Accuracy of gaussiannb on train data:", np.round(model_gaussnb.score(x_train, y_train)*100,2),"%")
print("Accuracy of gaussiannb on test data:", np.round(model_gaussnb.score(x_test, y_test)*100,2),"%")
print("Accuracy of multinomialnb on train data:", np.round(model_multinb.score(x_train, y_train)*100,2),"%")
print("Accuracy of multinomialnb on test data:", np.round(model_multinb.score(x_test, y_test)*100,2),"%")
Accuracy of knn on train data: 98.09 %
Accuracy of knn on test data: 62.55 %
Accuracy of gaussiannb on train data: 56.91 %
Accuracy of gaussiannb on test data: 34.89 %
Accuracy of multinomialnb on train data: 87.55 %
Accuracy of multinomialnb on test data: 70.21 %
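We can see that, with the parameters we chose, knn overfits severely (98.09% accuracy on train data versus 62.55% on test data) and its performance is mediocre. GaussianNB performs very poorly, and MultinomialNB performs best, reaching about 70% accuracy on test data. We can plot the confusion matrix: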
# predict y_test and calculate confusion matrix
y_pred_knn = model_knn.predict(x_test)
cm_knn = confusion_matrix(y_test, y_pred_knn)
print("confusion matrix for knn prediction:")
print(cm_knn)
# multinomial nb
y_pred_multinb = model_multinb.predict(x_test)
cm_multinb = confusion_matrix(y_test, y_pred_multinb)
# define cm-heatmap function
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Oranges, font_size = 5):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=font_size)
    plt.colorbar(fraction=0.046, pad=0.04)
    tick_marks = np.arange(cm.shape[1])
    plt.xticks(tick_marks, fontsize=font_size)
    # ax = plt.gca()
    # ax.set_xticklabels((ax.get_xticks() +1).astype(str))
    plt.yticks(tick_marks, fontsize=font_size)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j]),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black", fontsize=font_size)
    plt.tight_layout()
    plt.ylabel('True label', fontsize=font_size)
    plt.xlabel('Predicted label', fontsize=font_size)
plt.subplots(figsize = (10,10))
print('Confusion matrix for multinomialnb:')
plot_confusion_matrix(cm_multinb, font_size = 15)
confusion matrix for knn prediction:
[[ 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 1 0 0 3 0 0 0 0 5 0 0 0 0 0 0 0 0 0]
[ 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0]
[ 0 0 0 0 5 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0]
[ 0 0 1 0 0 9 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 2 0]
[ 0 0 1 0 0 1 9 0 0 0 0 0 0 0 4 0 0 0 0 0 0 1 0 0]
[ 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 2 0 0 0 0 0 6 0 0 0 0 1 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 4 1 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0 1 0]
[ 0 0 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 23 0 0 0 0 0 0 0 0 0 0]
[ 1 0 0 0 0 1 0 0 0 0 0 0 0 0 15 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 1 1 0 0 0 0 0 0 0 0 5 5 0 0 0 0 0 2 2 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 1 0 0 1 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 1 0 0 0 6 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 3 0 0 0 11 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 1 0 0 0 0 0 0 32 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 8]]
Confusion matrix for multinomialnb:
Of course, all of these methods perform only moderately. Later we will try svm, tree-based methods, and deep learning on this dataset.