Naive Bayes
The idea behind naive Bayes is very simple: by Bayes' theorem, the posterior probability is computed from the prior. The derivation is covered in many places online, so it is not repeated here.
One result worth mentioning comes from:
The Optimality of Naive Bayes
Harry Zhang
Faculty of Computer Science, University of New Brunswick, Fredericton, New Brunswick, Canada E3B 5A3. Email: hzhang@unb.ca
Although naive Bayes assumes the features are conditionally independent (which rarely holds in reality), it still achieves good results in practice. The paper above derives mathematically how local dependencies between features affect the overall classification; take a look if you are interested.
Another point worth noting: although naive Bayes (NB) is a good classifier, it is a poor tool for regression, and the probabilities it outputs are not very meaningful. NB may well assign a sample to a class with 60% probability when the true confidence is closer to 90%; the classification is still correct, but the probability itself should not be over-interpreted.
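If better-calibrated probabilities are needed, one common remedy (not part of the original text) is to wrap NB in sklearn's `CalibratedClassifierCV`. A minimal sketch on synthetic data; the dataset and variable names are illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
# Sigmoid (Platt) scaling learned via internal cross-validation
cal = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X_tr, y_tr)

raw_proba = raw.predict_proba(X_te)[:, 1]  # often over-confident
cal_proba = cal.predict_proba(X_te)[:, 1]  # rescaled probabilities
```

The wrapped model still classifies the same way; only its probability estimates are rescaled.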
Below we train naive Bayes models using sklearn.
```python
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB, ComplementNB
from sklearn import metrics
from sklearn.model_selection import train_test_split
```
The sklearn package provides four naive Bayes variants, each corresponding to a different assumption about the conditional distribution $P(x_i \mid y)$:
- GaussianNB: Gaussian distribution
- BernoulliNB: Bernoulli (0-1) distribution
- MultinomialNB: multinomial distribution
- ComplementNB: a correction to the multinomial variant's overly strong assumptions, suited to imbalanced samples
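To make the four assumptions concrete, here is a small sketch (toy data, not the salary dataset used below) fitting each variant on the kind of features it expects:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB, ComplementNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

X_real = rng.normal(loc=y[:, None], scale=1.0, size=(200, 3))  # continuous -> GaussianNB
X_bin = (X_real > 0.5).astype(int)                             # binary -> BernoulliNB
X_count = rng.poisson(lam=y[:, None] + 1, size=(200, 3))       # counts -> MultinomialNB / ComplementNB

models = [GaussianNB().fit(X_real, y),
          BernoulliNB().fit(X_bin, y),
          MultinomialNB().fit(X_count, y),
          ComplementNB().fit(X_count, y)]
for model, features in zip(models, [X_real, X_bin, X_count, X_count]):
    print(type(model).__name__, round(model.score(features, y), 3))
```

All four accept any non-negative numeric matrix (BernoulliNB binarizes internally), but their accuracy depends on how well the data match the assumed distribution.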
```python
X = pd.read_csv('american_salary_feture.csv')
y = pd.read_csv('american_salary_label.csv', header=None)
y = np.array(y).ravel()
```
Training without a prior
```python
gnb = GaussianNB()
bnb = BernoulliNB()
mnb = MultinomialNB()
cnb = ComplementNB()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gnb = gnb.fit(X_train, y_train)
bnb = bnb.fit(X_train, y_train)
mnb = mnb.fit(X_train, y_train)
cnb = cnb.fit(X_train, y_train)

score_gnb = gnb.score(X_test, y_test)
score_bnb = bnb.score(X_test, y_test)
score_mnb = mnb.score(X_test, y_test)
score_cnb = cnb.score(X_test, y_test)

print("gnb:", score_gnb)
print("bnb:", score_bnb)
print("mnb:", score_mnb)
print("cnb:", score_cnb)
```
```
gnb: 0.7970765262252795
bnb: 0.7626827171109201
mnb: 0.7870040535560742
cnb: 0.7870040535560742
```
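A caveat worth adding (not in the original text): with an imbalanced test set, accuracy should be read against the majority-class baseline, which here is 6211/8141 ≈ 0.763 and which BernoulliNB essentially ties. A sketch of that comparison using sklearn's `DummyClassifier` on synthetic ~3:1 data (the real dataset is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic ~3:1 imbalanced data standing in for the salary dataset
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = GaussianNB().fit(X_tr, y_tr)

print("baseline:", baseline.score(X_te, y_te))  # roughly the majority-class share
print("gnb:", model.score(X_te, y_te))
```

A model that cannot beat the baseline has learned nothing useful about the minority class, however respectable its raw accuracy looks.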
Training with a prior
```python
# First, look at the class distribution itself
print("#0:", (y_train == 0).sum())
print("#1:", (y_train == 1).sum())
```
```
#0: 18509
#1: 5911
```
```python
print("#0:", (y_test == 0).sum())
print("#1:", (y_test == 1).sum())
```
```
#0: 6211
#1: 1930
```
The class ratio here is roughly 3:1. Below we rebalance the training set to 1:1, while the test set stays at about 3:1.
```python
X_train_1 = X_train[y_train == 1]
X_train_0 = X_train[y_train == 0]
y_train_1 = y_train[y_train == 1]
y_train_0 = y_train[y_train == 0]

# Take the first 5911 class-0 samples, matching the number of class-1 samples
X_train_0 = X_train_0[:5911]
y_train_0 = y_train_0[:5911]

X_train_new = pd.concat([X_train_0, X_train_1], axis=0)
y_train_new = np.concatenate((y_train_0, y_train_1), axis=0)
```
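Taking the first 5911 rows implicitly assumes the class-0 rows are in random order. A safer alternative (my suggestion, not the author's code) draws a random subsample with `sklearn.utils.resample`; the toy `X_demo`/`y_demo` below stand in for the real training data:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy stand-ins for the real X_train / y_train
X_demo = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])
y_demo = np.array([0] * 75 + [1] * 25)

X0, y0 = X_demo[y_demo == 0], y_demo[y_demo == 0]
X1, y1 = X_demo[y_demo == 1], y_demo[y_demo == 1]

# Randomly undersample the majority class down to the minority-class size
X0_down, y0_down = resample(X0, y0, replace=False,
                            n_samples=len(y1), random_state=0)

X_bal = pd.concat([X0_down, X1], axis=0)
y_bal = np.concatenate((y0_down, y1))
```

Fixing `random_state` keeps the subsample reproducible while avoiding any ordering bias in the file.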
```python
gnb = GaussianNB()
bnb = BernoulliNB()
mnb = MultinomialNB()
cnb = ComplementNB()

gnb = gnb.fit(X_train_new, y_train_new)
bnb = bnb.fit(X_train_new, y_train_new)
mnb = mnb.fit(X_train_new, y_train_new)
cnb = cnb.fit(X_train_new, y_train_new)

score_gnb = gnb.score(X_test, y_test)
score_bnb = bnb.score(X_test, y_test)
score_mnb = mnb.score(X_test, y_test)
score_cnb = cnb.score(X_test, y_test)

print("gnb:", score_gnb)
print("bnb:", score_bnb)
print("mnb:", score_mnb)
print("cnb:", score_cnb)
```
```
gnb: 0.7967080211276256
bnb: 0.7463456577815993
mnb: 0.7868812185235229
cnb: 0.7868812185235229
```
Next we tell the classifier that the class ratio is 3:1. Among the four variants, GaussianNB exposes this as the `priors` parameter (BernoulliNB, MultinomialNB, and ComplementNB take a `class_prior` argument instead).
```python
gnb = GaussianNB(priors=[0.75, 0.25])
gnb = gnb.fit(X_train_new, y_train_new)
score_gnb = gnb.score(X_test, y_test)
print("gnb:", score_gnb)
```
```
gnb: 0.7990418867461
```
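What `priors` does: it fixes the class prior $P(y)$ instead of estimating it from the (now artificially balanced) training data, which shifts the decision boundary toward the class declared more common. A small sketch on synthetic data; all names here are illustrative:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Balanced training data: 100 samples per class, class 1 shifted by +1
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

default_gnb = GaussianNB().fit(X, y)                     # prior estimated from data: [0.5, 0.5]
skewed_gnb = GaussianNB(priors=[0.75, 0.25]).fit(X, y)   # told class 0 is 3x more common

print(default_gnb.class_prior_)  # learned prior
print(skewed_gnb.class_prior_)   # fixed prior
# The skewed prior raises the posterior of class 0 at any given point
x_point = np.array([[0.5, 0.5]])
print(default_gnb.predict_proba(x_point), skewed_gnb.predict_proba(x_point))
```

The likelihoods are identical in both models; only the prior term of Bayes' theorem changes.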
With the prior supplied, accuracy improves somewhat; a prior ratio is especially helpful when training samples are scarce.
Below we gradually reduce the number of training samples and observe the benefit the prior brings: we give the model a prior that does not match the underlying class distribution, and compare the scores with and without it at each training-set size.
```python
j = 0
score_without_prior = np.zeros(99)
score_with_prior = np.zeros(99)
for i in range(50, 5000, 50):
    X_train_1 = X_train[y_train == 1]
    X_train_0 = X_train[y_train == 0]
    y_train_1 = y_train[y_train == 1]
    y_train_0 = y_train[y_train == 0]
    # Take the first i samples of each class
    X_train_1 = X_train_1[:i]
    X_train_0 = X_train_0[:i]
    y_train_1 = y_train_1[:i]
    y_train_0 = y_train_0[:i]
    X_train_new = pd.concat([X_train_0, X_train_1], axis=0)
    y_train_new = np.concatenate((y_train_0, y_train_1), axis=0)
    gnb = GaussianNB()
    gnb = gnb.fit(X_train_new, y_train_new)
    score_without_prior[j] = gnb.score(X_test, y_test)
    gnb = GaussianNB(priors=[0.5, 0.5])
    gnb = gnb.fit(X_train_new, y_train_new)
    score_with_prior[j] = gnb.score(X_test, y_test)
    j = j + 1
```
```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
data = pd.DataFrame({"score_without_prior": score_without_prior,
                     "score_with_prior": score_with_prior},
                    index=range(50, 5000, 50))
sns.lineplot(data=data)
plt.xlabel("training samples per class")
plt.ylabel("score")
plt.title("score vs. number of training samples per class")
```