I.
- Gaussian naive Bayes, multinomial naive Bayes, Bernoulli naive Bayes, complement naive Bayes
import datetime
from time import time
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import brier_score_loss
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
class_1 = 50000
class_2 = 500
centers = [[0.0, 0.0], [5.0, 5.0]]
clusters_std = [3, 1]
X, y = make_blobs(n_samples=[class_1, class_2],
                  centers=centers,
                  cluster_std=clusters_std,
                  random_state=0, shuffle=False)
print(X.shape)
print((y == 0).sum())
data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
print(data.head())
names = ["Gaussian", "Multinomial", "Bernoulli", "Complement"]
models = [GaussianNB(), MultinomialNB(), BernoulliNB(binarize=0.5), ComplementNB(norm=True)]
for name, clf in zip(names, models):
    time_start = time()
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=420)
    # MultinomialNB and ComplementNB reject negative feature values, so every
    # model except GaussianNB is trained on one-hot encoded bins instead.
    if name != "Gaussian":
        kbs = KBinsDiscretizer(n_bins=10, encode="onehot")
        kbs.fit(Xtrain)
        Xtrain = kbs.transform(Xtrain)
        Xtest = kbs.transform(Xtest)
    clf.fit(Xtrain, ytrain)
    y_pred = clf.predict(Xtest)
    # Predicted probability of the minority class (label 1)
    proba = clf.predict_proba(Xtest)[:, 1]
    print(name)
    print("\tBrier loss: {:.3f}".format(brier_score_loss(ytest, proba, pos_label=1)))
    print("\tAccuracy:{:.3f}".format(clf.score(Xtest, ytest)))
    print("\tRecall:{:.3f}".format(recall_score(ytest, y_pred)))
    print("\tAUC:{:.3f}".format(roc_auc_score(ytest, proba)))
    # Elapsed time for this model, printed as minutes:seconds:microseconds
    print(datetime.datetime.fromtimestamp(time()-time_start).strftime("%M:%S:%f"))
"""
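1. Multinomial naive Bayes classifies every majority-class sample correctly but misses every minority-class sample, so it is the model hit hardest by class imbalance.
2. Gaussian naive Bayes does somewhat better than multinomial on the minority class, reaching a recall of 43.8%.
3. Bernoulli naive Bayes has worse overall accuracy and Brier score than multinomial and Gaussian naive Bayes, but it still captures 77.1% of the minority class.
4. Complement naive Bayes corrects many shortcomings of traditional naive Bayes, including its inability to handle class imbalance, and captures 98.7% of the minority class.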
"""
"""
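The complement naive Bayes (CNB) algorithm is an improvement on the standard multinomial naive Bayes algorithm:
The group that invented CNB set out to solve the problems caused by the "naive" assumption in naive Bayes. They wanted a mathematical approach that sidesteps that assumption, so that the algorithm need not care whether the features are conditionally independent;
On that basis they created complement naive Bayes, which handles class imbalance and can, to some extent, disregard the naive assumption.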
"""