Naive Bayes
A classification method based on Bayes' theorem and the assumption of conditional independence among features. It is a generative model: it first learns the joint probability distribution $P(X,Y)$, then computes the posterior distribution $P(Y \mid X)$.
Model
Let the input space $\mathcal{X} \subseteq \mathbf{R}^n$ be a set of $n$-dimensional vectors, and the output space be the set of class labels $\mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$. $X$ is a random vector defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X,Y)$ is the joint probability distribution of $X$ and $Y$. The training dataset

$$T = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$$

is generated independently and identically distributed according to the joint distribution $P(X,Y)$.
Naive Bayes learns the prior probability distribution

$$P(Y = c_k), \quad k = 1,2,\ldots,K$$
and the conditional probability distribution

$$P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)}, \ldots, X^{(n)} = x^{(n)} \mid Y = c_k), \quad k = 1,2,\ldots,K$$

The conditional-independence assumption factorizes this distribution over the individual features:

$$P(X = x \mid Y = c_k) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
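To make the factorization concrete, here is a minimal sketch (with made-up discrete data, purely for illustration) that estimates a class-conditional probability by counting each feature separately instead of counting joint feature combinations:

```python
def cond_prob(feature_column, labels, value, c):
    """Frequency estimate of P(X^(j) = value | Y = c)."""
    in_class = [x for x, y in zip(feature_column, labels) if y == c]
    return in_class.count(value) / len(in_class)

# toy training data: two discrete features, labels in {0, 1}
x1 = [1, 1, 2, 2, 1, 2]
x2 = ['S', 'M', 'M', 'S', 'S', 'M']
y = [0, 0, 1, 1, 0, 1]

# factorized estimate of P(X = (1, 'S') | Y = 0):
# each feature is counted on its own, then the estimates are multiplied
p = cond_prob(x1, y, 1, 0) * cond_prob(x2, y, 'S', 0)  # 1.0 * (2/3)
```

Without the independence assumption we would have to count every combination of feature values per class, which is infeasible when $n$ is large; with it, each feature is estimated on its own.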
By Bayes' theorem together with the conditional-independence assumption, the posterior probability distribution is

$$P(Y = c_k \mid X = x) = \frac{P(Y = c_k)\prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_k P(Y = c_k)\prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}, \quad j = 1,2,\ldots,n, \; k = 1,2,\ldots,K$$
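A quick numeric sketch of this formula, using hypothetical priors and per-feature likelihoods (all numbers invented for illustration):

```python
import math

priors = {0: 0.6, 1: 0.4}                     # hypothetical P(Y = c_k)
likelihoods = {0: [0.5, 0.2], 1: [0.3, 0.6]}  # hypothetical P(X^(j) = x^(j) | Y = c_k), one per feature j

# numerator of the posterior for each class
numerators = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}
# the denominator sums the numerators over all classes
denominator = sum(numerators.values())
posterior = {c: n / denominator for c, n in numerators.items()}
# the posterior is a proper distribution: its values sum to 1
```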
Strategy
Posterior probability maximization
Assume the 0-1 loss function is chosen.
By the criterion of expected-risk minimization, minimizing the expected risk is equivalent to maximizing the posterior probability:
$$f(x) = \arg\max_{c_k} P(Y = c_k \mid X = x) = \arg\max_{c_k} \frac{P(Y = c_k)\prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_k P(Y = c_k)\prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}$$
Since the denominator is the same for every $c_k$, the classifier simplifies to
$$f(x) = \arg\max_{c_k} P(Y = c_k)\prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
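Dropping the denominator is safe because it is a positive constant shared by every class, so it cannot change which class attains the maximum. A small check on invented numbers:

```python
import math

priors = {0: 0.6, 1: 0.4}                     # hypothetical P(Y = c_k)
likelihoods = {0: [0.5, 0.2], 1: [0.3, 0.6]}  # hypothetical per-feature likelihoods

scores = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}   # unnormalized
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}                 # normalized

# both versions pick the same class
assert max(scores, key=scores.get) == max(posterior, key=posterior.get)
```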
Algorithm
Maximum likelihood estimation: estimate the prior $P(Y = c_k)$ and the conditional probabilities $P(X^{(j)} = x^{(j)} \mid Y = c_k)$ by frequency counts over the training data.
Bayesian estimation: add a smoothing constant $\lambda > 0$ to the counts ($\lambda = 1$ gives Laplace smoothing), which prevents zero-probability estimates for feature values unseen in a class.
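The difference between the two estimators can be sketched on toy label counts (the data and $\lambda$ here are made up for illustration):

```python
from collections import Counter

y = [0, 0, 0, 1, 1, 0]   # toy labels drawn from K = 3 possible classes
K = 3
counts = Counter(y)
N = len(y)

# maximum likelihood estimate: plain frequencies; unseen class 2 gets probability 0
mle = {c: counts[c] / N for c in range(K)}

# Bayesian estimate with smoothing constant lam (lam = 1: Laplace smoothing)
lam = 1.0
bayes = {c: (counts[c] + lam) / (N + K * lam) for c in range(K)}
# class 2 was never observed, yet its smoothed probability is nonzero
```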
Application
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter
import math
# data
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    # keep only the first 100 rows, i.e. classes 0 and 1
    data = np.array(df.iloc[:100, :])
    return data[:, :-1], data[:, -1]
X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_test[0], y_test[0]
(array([5.4, 3.4, 1.7, 0.2]), 0.0)
class NaiveBayes:
    def __init__(self):
        self.model = None

    # sample mean
    @staticmethod
    def mean(X):
        return sum(X) / float(len(X))

    # standard deviation
    def stdev(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

    # Gaussian probability density function
    def gaussian_probability(self, x, mean, stdev):
        exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

    # summarize X_train into per-feature (mean, stdev) pairs
    def summarize(self, train_data):
        # zip(*train_data) transposes the samples into per-feature columns
        summaries = [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]
        return summaries

    # compute the mean and standard deviation separately for each class
    def fit(self, X, y):
        labels = list(set(y))
        data = {label: [] for label in labels}
        for f, label in zip(X, y):
            data[label].append(f)
        self.model = {label: self.summarize(value) for label, value in data.items()}
        return 'gaussianNB train done!'

    # compute the (unnormalized) likelihood of each class
    def calculate_probabilities(self, input_data):
        # self.model: {0.0: [(5.0, 0.37), (3.42, 0.40)], 1.0: [(5.8, 0.449), (2.7, 0.27)]}
        # input_data: [1.1, 2.2]
        probabilities = {}
        for label, value in self.model.items():
            probabilities[label] = 1
            for i in range(len(value)):
                mean, stdev = value[i]
                probabilities[label] *= self.gaussian_probability(input_data[i], mean, stdev)
        return probabilities

    # predicted class: the label with the largest score
    def predict(self, X_test):
        # e.g. {0.0: 2.9680340789325763e-27, 1.0: 3.5749783019849535e-26}
        label = sorted(self.calculate_probabilities(X_test).items(), key=lambda x: x[-1])[-1][0]
        return label

    def score(self, X_test, y_test):
        right = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right += 1
        return right / float(len(X_test))
model = NaiveBayes()
model.fit(X_train, y_train)
'gaussianNB train done!'
print(model.predict([4.4, 3.2, 1.3, 0.2]))
0.0
model.score(X_test, y_test)
1.0
sklearn.naive_bayes
The three Naive Bayes models in sklearn:
- Gaussian model (GaussianNB): for continuous features; assumes each feature follows a normal distribution. Used for classification with continuous variables.
- Multinomial model (MultinomialNB): features are counts, e.g. word occurrence frequencies. Commonly used for text and topic classification.
- Bernoulli model (BernoulliNB): each feature is boolean (true/false, or 1/0). In text classification, 1 means a word occurs in a document and 0 means it does not. Often used for sentiment analysis.
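As a sketch of how the latter two models differ in practice (the word-count matrix below is invented for illustration): MultinomialNB consumes the raw counts, while BernoulliNB only looks at presence/absence.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# hypothetical word counts for four tiny "documents" over a 3-word vocabulary
X_counts = np.array([[2, 1, 0],
                     [3, 0, 0],
                     [0, 2, 3],
                     [0, 1, 2]])
y_doc = np.array([0, 0, 1, 1])

mnb = MultinomialNB().fit(X_counts, y_doc)                   # uses the counts
bnb = BernoulliNB().fit((X_counts > 0).astype(int), y_doc)   # uses 0/1 presence

mnb.predict([[1, 0, 0]])  # word 0 dominates class 0 in this toy data
```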
GaussianNB (Gaussian Naive Bayes)
The likelihood of each feature is assumed to be Gaussian, with probability density function:

$$P(x_i \mid y_k) = \frac{1}{\sqrt{2\pi\sigma_{y_k}^2}} \exp\left(-\frac{(x_i - \mu_{y_k})^2}{2\sigma_{y_k}^2}\right)$$

Mean: $\mu$; variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$.
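A quick numeric sanity check of this density (a minimal sketch with $\mu = 0$, $\sigma = 1$): the function mirrors the formula above, peaks at $x = \mu$ with height $1/(\sqrt{2\pi}\,\sigma)$, and integrates to roughly 1 over a wide interval.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # N(mu, sigma^2) density evaluated at x, same formula as above
    return (1 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

mu, sigma = 0.0, 1.0
peak = gaussian_pdf(mu, mu, sigma)  # maximum height: 1 / (sqrt(2*pi) * sigma)

# crude Riemann sum over [-10, 10]: a valid density integrates to about 1
step = 0.01
area = sum(gaussian_pdf(-10 + i * step, mu, sigma) for i in range(2000)) * step
```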
MultinomialNB (Multinomial Naive Bayes)
BernoulliNB (Bernoulli Naive Bayes)
Reference: https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)
GaussianNB(priors=None, var_smoothing=1e-09)
clf.score(X_test, y_test)
1.0
clf.predict([[4.4, 3.2, 1.3, 0.2]])
array([0.])
# store the class predictions in y_predict
y_predict = clf.predict(X_test)
# import classification_report from sklearn.metrics
from sklearn.metrics import classification_report
# print more detailed classification-performance metrics
# (note the argument order: y_true first, then y_pred)
print(classification_report(y_test, y_predict))
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        16
         1.0       1.00      1.00      1.00        14

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
References:
[1] Li Hang. Statistical Learning Methods (统计学习方法)
[2] GitHub: https://github.com/wzyonggege/statistical-learning-method/tree/master/NaiveBayes