前言
本文介绍Gaussian Naive Bayes(GNB)算法的基本思想和代码实现。手搓一个GNB模型并与sklearn中的GNB进行了对比。
贝叶斯算法是基于训练样本的概率分布,对待测样本属于各个类别的概率进行计算,从而预测类别标签的算法,因此需要掌握一些基本的概率论知识:条件概率、独立事件的概率计算、贝叶斯公式。
理论部分
算法思想和基本原理
对于给定待分类项,求解在此项条件下各个类别出现的概率,概率最大的为预测的分类结果。
-
算法的巧妙之处在于通过贝叶斯公式,将难以计算的后验概率转化为方便计算的先验概率和似然函数。贝叶斯公式:
P ( A ∣ B ) = P ( A ) P ( B ∣ A ) P ( B ) P(A|B)=\frac{P(A)P(B|A)}{P(B)} P(A∣B)=P(B)P(A)P(B∣A)
我们也可以表示成:
后验概率 = 先验概率 ⋅ 似然函数 证据 后验概率=\frac{先验概率·似然函数}{证据} 后验概率=证据先验概率⋅似然函数 -
朴素贝叶斯分类假设特征向量的分量之间相互独立,从而简化了概率的求解(这也是称为“朴素”的原因)。我们知道独立事件的概率计算,遵从如下规律:
P ( A , B ) = P ( A ) ⋅ P ( B ) P(A,B) = P(A)·P(B) P(A,B)=P(A)⋅P(B)
解释:A与B相互独立,则A与B同时发生的概率 = A发生的概率 * B发生的概率
算法定义
-
有类别集合 C = c 1 , c 2 , . . . , c m C={c_{1},c_{2},...,c_{m}} C=c1,c2,...,cm
-
设 x = { x 1 , x 2 , . . . , x n } x=\{x_{1},x_{2},...,x_{n}\} x={x1,x2,...,xn}为一个待分类项,每个xi为x的一个特征属性,y是样本的实际类别标签,y∈C
-
分别计算x属于ci(i=1,2,…,m)的概率 P ( c i ∣ x ) P(c_{i}|x) P(ci∣x)
-
找出k,使得 P ( c k ∣ x ) = m a x { P ( c 1 ∣ x ) , P ( c 2 ∣ x ) , . . . , P ( c m ∣ x ) } P(c_{k}|x)=max\{ P(c_{1}|x), P(c_{2}|x), ...,P(c_{m}|x) \} P(ck∣x)=max{P(c1∣x),P(c2∣x),...,P(cm∣x)},则x属于类别ck
概率求解步骤
-
应用贝叶斯公式
根据贝叶斯公式,x属于类别ci(i=1,2,…,m)的概率:
P ( y = c i ∣ x ) = P ( y = c i ) P ( x ∣ y = c i ) P ( x ) P(y=c_{i} |x)=\frac{P(y=c_{i})P(x|y=c_{i})}{P(x)} P(y=ci∣x)=P(x)P(y=ci)P(x∣y=ci) -
其中,分母 P ( x ) P(x) P(x)对于确定样本x的情况下,对所有i、j的结果是相同的,不影响概率的大小,因此只需考虑分子部分。
-
分子部分中, P ( y = c i ) P(y=c_{i}) P(y=ci)为先验概率,可以直接根据训练集中各类别样本的占比计算。
-
分子部分中, P ( x ∣ y = c i ) P(x|y=c_{i}) P(x∣y=ci)为似然函数,由于各分量之间相互独立,因此:
P ( x ∣ y = c i ) = P ( x 1 , x 2 , . . . , x n ∣ y = c i ) = P ( x 1 ∣ y = c i ) ⋅ P ( x 2 ∣ y = c i ) ⋅ ⋅ ⋅ P ( x m ∣ y = c i ) = ∏ j = 1 n P ( x j ∣ y = c i ) \begin{aligned} P(x|y=c_{i})&=P(x_{1},x_{2},...,x_{n}|y=c_{i})\\ &=P(x_{1}|y=c_{i})·P(x_{2}|y=c_{i})···P(x_{m}|y=c_{i})\\ &=\prod_{j=1}^{n}P(x_{j}|y=c_{i}) \end{aligned} P(x∣y=ci)=P(x1,x2,...,xn∣y=ci)=P(x1∣y=ci)⋅P(x2∣y=ci)⋅⋅⋅P(xm∣y=ci)=j=1∏nP(xj∣y=ci)
- 计算 P ( x j ∣ y = c i ) P(x_{j}|y=c_{i}) P(xj∣y=ci)
-
当样本特征为离散值时,只需用训练样本中各特征x划分到各类别c中的频率进行估计
-
当样本特征为连续值时,通常假设其服从高斯分布(正态分布),即
g ( x , η , σ ) = 1 2 Π σ e − ( x − η ) 2 2 σ 2 g(x,\eta ,\sigma )=\frac{1}{\sqrt{2\Pi}\sigma}e^{-\frac{(x-\eta)^{2}}{2\sigma ^{2}}} g(x,η,σ)=2Πσ1e−2σ2(x−η)2
可计算 P ( x j ∣ y = c i ) = g ( x j , η c i , σ c i ) P(x_{j}|y=c_{i})=g(x_{j},\eta_{c_{i}},\sigma_{c_{i}}) P(xj∣y=ci)=g(xj,ηci,σci)
其中 η c i \eta_{c_{i}} ηci 为类别 c i c_{i} ci 中特征 x j x_{j} xj 的均值, σ c i \sigma_{c_{i}} σci 为类别 c i c_{i} ci 中特征 x j x_{j} xj 的标准差。
离散情况下的算法称为朴素贝叶斯NB。本文要实现的算法是高斯朴素贝叶斯GNB,因此是针对特征为连续值的情况(后者)。
实践部分
问题描述
编程实现朴素贝叶斯分类器。利用皮马印第安人糖尿病数据集测试朴素贝叶斯分类器效果,并对比sklearn工具包的分类结果。
数据集下载地址:点击前往
GNB代码实现分解
我先根据前面的理论推导,给出每一步代码的实现详解,最后再给出完整代码示例。
训练代码分解
首先,根据之前的推导,我们知道,采用高斯分布计算似然函数 P ( x j ∣ y = c i ) P(x_{j}|y=c_{i}) P(xj∣y=ci)时,需要知道每个特征的均值、标准差;而先验概率 P ( y = c i ) P(y=c_{i}) P(y=ci),只需要从训练样本中统计而来。
因此,训练步骤的目标,就是从一组训练样本中计算得到各类别中的各特征均值、各特征标准差、样本占比。
-
_separate_by_class
首先,将训练数据特征X按照不同的类别y进行分组def _separate_by_class(self, X, y): separated = {} for i in range(len(X)): if y[i] not in separated: separated[y[i]] = [] separated[y[i]].append(X[i]) return separated
-
_summarize
计算一批样本的特征均值和标准差def _summarize(self, X): mean = np.mean(X, axis=0) std = np.std(X, axis=0) return mean, std
mean和std均为np数组,长度为特征的数量
-
fit
调用前面的函数,分组计算各类别的特征均值mean
、特征标准差std
和样本占比值ratio
,结果记录在成员变量summaries_by_class
中def fit(self, X_train, y_train): separated = self._separate_by_class(X_train, y_train) self.summaries_by_class = {} total = len(X_train) for class_value, instances in separated.items(): mean, std = self._summarize(instances) ratio = len(instances) / total self.summaries_by_class[class_value] = mean, std, ratio
预测代码分解
根据已经计算得到的参数,对待测样本计算属于各类别的概率
- _gaussian_probability
计算高斯概率 g ( x , η , σ ) = 1 2 Π σ e − ( x − η ) 2 2 σ 2 g(x,\eta ,\sigma )=\frac{1}{\sqrt{2\Pi}\sigma}e^{-\frac{(x-\eta)^{2}}{2\sigma ^{2}}} g(x,η,σ)=2Πσ1e−2σ2(x−η)2def _gaussian_probability(self, x, mean, stdev): exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2)))) return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
其中exponent
是指数部分,x
, mean
, stdev
为某个具体特征值和对应均值、方差
- _calculate_probabilities
计算某个样本属于各类别的概率def _calculate_probabilities(self, summaries, inputVector): probabilities = {} for classValue, classSummaries in summaries.items(): mean_array, std_array, ratio = classSummaries probabilities[classValue] = ratio for i in range(len(inputVector)): x = inputVector[i] mean = mean_array[i] stdev = std_array[i] probabilities[classValue] *= self._gaussian_probability( x, mean, stdev) return probabilities
inputVector
为样本特征向量,summaries
是训练阶段的结果,probabilities
是一个字典,为类别->概率值的映射。 - predict_one
找到概率最大的类别,预测一条待测样本的类别。def predict_one(self, X): probabilities = self._calculate_probabilities(self.summaries, X) bestLabel, bestProb = None, -1 for classValue, probability in probabilities.items(): if bestLabel is None or probability > bestProb: bestProb = probability bestLabel = classValue return bestLabel
- predict
预测一组待测样本的类别结果def predict(self, X_test): if self.summaries is None: raise ValueError('Model not fitted') pred = [] for x in X_test: pred.append(self.predict_one(x)) return pred
完整代码
数据
import numpy as np
import pandas as pd
import math
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
class GNB(object):
def __init__(self):
self.summaries = None
def _separate_by_class(self, X, y):
separated = {}
for i in range(len(X)):
if y[i] not in separated:
separated[y[i]] = []
separated[y[i]].append(X[i])
return separated
def _summarize(self, X):
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
return mean, std
def fit(self, X_train, y_train):
separated = self._separate_by_class(X_train, y_train)
self.summaries = {}
total = len(X_train)
for class_value, instances in separated.items():
mean, std = self._summarize(instances)
ratio = len(instances) / total
self.summaries[class_value] = mean, std, ratio
def _gaussian_probability(self, x, mean, stdev):
exponent = math.exp(-(math.pow(x - mean, 2) /
(2 * math.pow(stdev, 2))))
return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
def _calculate_probabilities(self, summaries, inputVector):
probabilities = {}
for classValue, classSummaries in summaries.items():
mean_array, std_array, ratio = classSummaries
probabilities[classValue] = ratio
for i in range(len(inputVector)):
x = inputVector[i]
mean = mean_array[i]
stdev = std_array[i]
probabilities[classValue] *= self._gaussian_probability(
x, mean, stdev)
return probabilities
def predict_one(self, X):
probabilities = self._calculate_probabilities(self.summaries, X)
bestLabel, bestProb = None, -1
for classValue, probability in probabilities.items():
if bestLabel is None or probability > bestProb:
bestProb = probability
bestLabel = classValue
return bestLabel
def predict(self, X_test):
if self.summaries is None:
raise ValueError('Model not fitted')
pred = []
for x in X_test:
pred.append(self.predict_one(x))
return pred
# 计算预测准确率
def get_accracy(ground_true, predictions):
correct = 0
for i in range(len(ground_true)):
if ground_true[i] == predictions[i]:
correct += 1
return (correct / float(len(ground_true))) * 100.0
def pred_and_accuracy(model, X_train, X_test, y_train, y_test):
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accracy = get_accracy(y_test, predictions)
return predictions, accracy
def main():
# load dataset
dataset = pd.read_csv('pima-indians-diabetes.data.csv', header=None)
dataset = dataset.values # to numpy dataset
# split dataset
X_train, X_test, y_train, y_test = train_test_split(dataset[:, :-1],
dataset[:, -1],
test_size=0.2,
random_state=13)
# my gnb
my_gnb = GNB()
pred, acc = pred_and_accuracy(my_gnb, X_train, X_test, y_train, y_test)
print('my gnb accuracy:', acc)
# sklearn gnb
sk_gnb = GaussianNB()
pred, acc = pred_and_accuracy(sk_gnb, X_train, X_test, y_train, y_test)
print(f'sk gnb accuracy: {acc}')
if __name__ == '__main__':
main()
案例测试结果
如图所示,我们实现的GNB模型同sklearn包下的GNB模型在同样的训练集和验证集上展现出一样的准确率。