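Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent. Given a training set, it first learns the joint probability distribution of input and output under the feature conditional independence assumption; then, for a given input $x$, it uses Bayes' theorem to find the output $y$ with the largest posterior probability. Naive Bayes is simple to implement and efficient in both learning and prediction, which makes it a commonly used method.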
4.1 Learning and Classification with Naive Bayes
4.1.1 Basic Method
Naive Bayes learns the joint probability distribution $P(X,Y)$ from the training set. Specifically, it learns the following prior and conditional probability distributions. The prior probability distribution is
$$P(Y=c_k),\quad k=1,2,\dots,K$$
The conditional probability distribution is
$$P\left(X=x \mid Y=c_{k}\right)=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K$$
Together these determine the joint probability distribution $P(X,Y)$.
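The conditional probability distribution $P(X=x \mid Y=c_k)$ has an exponential number of parameters, so estimating it directly is infeasible in practice. Indeed, suppose $x^{(j)}$ can take $S_j$ values, $j=1,2,\cdots,n$, and $Y$ can take $K$ values; then the number of parameters is $K \prod_{j=1}^{n} S_{j}$.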
Naive Bayes makes a conditional independence assumption on the conditional probability distribution. Because this is a rather strong assumption, the method takes the name "naive" from it. Specifically, the conditional independence assumption is
$$\begin{aligned}P\left(X=x \mid Y=c_{k}\right) &=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right) \\&=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)\end{aligned}$$
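Naive Bayes in effect learns the mechanism that generates the data, so it is a generative model. The conditional independence assumption says that the features used for classification are conditionally independent given the class. This assumption makes naive Bayes simple, but it sometimes costs some classification accuracy.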
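To make the saving concrete, the following minimal sketch compares the number of table entries needed with and without the conditional independence assumption. The function names and the setting ($K=3$, ten features with four values each) are hypothetical, chosen only for illustration; the counts are table entries, not free parameters.

```python
from math import prod


def joint_param_count(num_classes, feature_cardinalities):
    """Entries in P(X = x | Y = c_k) with no independence assumption: K * prod(S_j)."""
    return num_classes * prod(feature_cardinalities)


def naive_param_count(num_classes, feature_cardinalities):
    """Entries in the per-feature tables P(X^(j) = a | Y = c_k): K * sum(S_j)."""
    return num_classes * sum(feature_cardinalities)


# Hypothetical setting: K = 3 classes, n = 10 features, S_j = 4 values each.
K = 3
S = [4] * 10
print(joint_param_count(K, S))  # 3145728
print(naive_param_count(K, S))  # 120
```

The first count grows exponentially in the number of features, while the second grows only linearly.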
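To classify a given input $x$, naive Bayes uses the learned model to compute the posterior probability distribution $P(Y=c_k \mid X=x)$ and outputs the class with the largest posterior probability as the class of $x$. The posterior probability is computed with Bayes' theorem: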
$$P\left(Y=c_{k} \mid X=x\right)=\frac{P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}$$
Substituting the conditional independence assumption into this expression gives
$$P\left(Y=c_{k} \mid X=x\right)=\frac{P\left(Y=c_{k}\right)\prod_j P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) }{\sum_{k}P\left(Y=c_{k}\right)\prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) },\quad k=1,2, \cdots, K$$
This is the basic formula of naive Bayes classification. The naive Bayes classifier can therefore be written as
$$y=f(x)=\arg \max_{c_k}\frac{P\left(Y=c_{k}\right)\prod_j P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) }{\sum_{k}P\left(Y=c_{k}\right)\prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) }$$
Note that the denominator in this expression is the same for every $c_k$, so
$$y=\arg \max_{c_k}P\left(Y=c_{k}\right)\prod_j P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$$
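The point that the denominator does not affect the decision can be checked numerically. The sketch below is an illustration only: the class names, feature values and probability tables are made up, and `unnormalized_scores` and `normalize` are hypothetical helper names.

```python
from math import prod


def unnormalized_scores(prior, cond, x):
    """Numerator P(Y=c) * prod_j P(X^(j)=x^(j) | Y=c) for every class c.

    prior: dict mapping class -> P(Y=c)
    cond:  dict mapping class -> list of dicts, cond[c][j][v] = P(X^(j)=v | Y=c)
    """
    return {c: prior[c] * prod(cond[c][j][v] for j, v in enumerate(x))
            for c in prior}


def normalize(scores):
    """Divide by the shared denominator to obtain posterior probabilities."""
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}


# Made-up probability tables for two classes and two categorical features.
prior = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"short": 0.3, "long": 0.7}],
    "ham": [{"yes": 0.1, "no": 0.9}, {"short": 0.6, "long": 0.4}],
}

scores = unnormalized_scores(prior, cond, ("yes", "short"))
post = normalize(scores)

# Both rankings pick the same class, because normalization is a positive rescaling.
print(max(scores, key=scores.get), max(post, key=post.get))
```

Dividing every score by the same positive constant cannot change which class attains the maximum, which is why the classifier can skip the denominator.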
4.1.2 The Meaning of Posterior Probability Maximization
Naive Bayes assigns an instance to the class with the largest posterior probability. This is equivalent to minimizing the expected risk. Suppose we choose the 0-1 loss function:
$$L(Y, f(X))= \begin{cases}1, & Y \neq f(X) \\ 0, & Y=f(X)\end{cases}$$
where $f(X)$ is the classification decision function. The expected risk is then
$$R_{exp}(f)=E[L(Y,f(X))]$$
The expectation is taken with respect to the joint distribution $P(X,Y)$. Conditioning on $X$ gives
$$R_{exp}(f)=E_X\sum^K_{k=1}\left[L(c_k,f(X))\right]P(c_k \mid X)$$
To minimize the expected risk, it suffices to minimize it separately for each $X=x$, which gives
$$\begin{aligned}f(x) &=\arg \min _{y \in \mathcal{Y}} \sum_{k=1}^{K} L\left(c_{k}, y\right) P\left(c_{k} \mid X=x\right) \\&=\arg \min _{y \in \mathcal{Y}} \sum_{k=1}^{K} P\left(y \neq c_{k} \mid X=x\right) \\&=\arg \min _{y \in \mathcal{Y}}\left(1-P\left(y=c_{k} \mid X=x\right)\right) \\&=\arg \max _{y \in \mathcal{Y}} P\left(y=c_{k} \mid X=x\right)\end{aligned}$$
In this way, the expected risk minimization criterion yields the posterior probability maximization criterion, which is the principle adopted by naive Bayes.
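The equivalence can be checked on a small example. In the sketch below the posterior values are made up, and `zero_one_loss` and `conditional_risk` are hypothetical helper names; the code evaluates $\sum_k L(c_k, y) P(c_k \mid X=x)$ for each candidate decision $y$.

```python
def zero_one_loss(true_class, decision):
    """0-1 loss: 1 for a wrong decision, 0 for a correct one."""
    return 0 if true_class == decision else 1


def conditional_risk(posterior, decision):
    """Expected 0-1 loss of `decision` given the posterior P(c_k | X = x)."""
    return sum(zero_one_loss(c, decision) * p for c, p in posterior.items())


# Made-up posterior distribution over three classes for one input x.
posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}

risks = {c: conditional_risk(posterior, c) for c in posterior}
best_by_risk = min(risks, key=risks.get)
best_by_posterior = max(posterior, key=posterior.get)
print(risks)
print(best_by_risk, best_by_posterior)
```

Each conditional risk equals one minus the posterior of the chosen class, so the class with the smallest risk is the class with the largest posterior.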
4.2 Parameter Estimation for Naive Bayes
4.2.1 Maximum Likelihood Estimation
In naive Bayes, learning means estimating $P\left(Y=c_{k}\right)$ and $P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$. Maximum likelihood estimation can be used to estimate these probabilities. The maximum likelihood estimate of the prior probability $P\left(Y=c_{k}\right)$ is
$$P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$. The maximum likelihood estimate of the conditional probability $P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)$ is
$$P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}$$
$$j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K$$
where $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{j l}$ is the $l$-th value that the $j$-th feature can take, and $I$ is the indicator function.
4.2.2 Learning and Classification Algorithm
Algorithm (naive Bayes algorithm)
(1) Compute the prior probabilities and the conditional probabilities
$$P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K$$
$$P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}$$
$$j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K$$
(2) For the given instance $x=\left(x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right)^{\mathrm{T}}$, compute
$$P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K$$
(3) Determine the class of the instance $x$
$$y=\arg \max_{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$$
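Steps (1) to (3) can be sketched for categorical features with plain counting. This is a minimal illustration under stated assumptions, not a reference implementation: the dataset below is made up, and `fit` and `predict` are hypothetical names.

```python
from collections import Counter, defaultdict


def fit(X, y):
    """Step (1): maximum likelihood estimates of the prior and conditionals."""
    class_counts = Counter(y)
    prior = {c: n / len(y) for c, n in class_counts.items()}
    # value_counts[(j, c)][v] = number of class-c samples whose j-th feature is v
    value_counts = defaultdict(Counter)
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            value_counts[(j, yi)][v] += 1
    cond = {key: {v: n / class_counts[key[1]] for v, n in counts.items()}
            for key, counts in value_counts.items()}
    return prior, cond


def predict(prior, cond, x):
    """Steps (2) and (3): score every class, then return the arg max."""
    scores = {}
    for c, p in prior.items():
        score = p
        for j, v in enumerate(x):
            # A value never seen with class c has maximum likelihood estimate 0.
            score *= cond[(j, c)].get(v, 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores


# Made-up training set: two categorical features, two classes.
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
     ("rainy", "hot"), ("sunny", "mild"), ("rainy", "mild")]
y = ["no", "no", "yes", "no", "yes", "yes"]

prior, cond = fit(X, y)
label, scores = predict(prior, cond, ("rainy", "mild"))
print(label, scores)
```

In this toy data the value "hot" never occurs with class "yes", so any input containing "hot" receives a score of exactly zero for "yes". That is a direct consequence of using raw maximum likelihood counts.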
import math
import random
import numpy as np
import pandas as pd

# Load the data; the class label is expected in the last column.
data_df = pd.read_csv('iris.csv')


def splitData(data_list, ratio):
    """Shuffle the samples in place and split them into train and test sets."""
    train_size = int(len(data_list) * ratio)
    random.shuffle(data_list)
    train_set = data_list[:train_size]
    test_set = data_list[train_size:]
    return train_set, test_set


data_list = np.array(data_df).tolist()
train_set, test_set = splitData(data_list, ratio=0.7)
print('Split {0} samples into {1} train and {2} test samples'.format(len(data_df), len(train_set), len(test_set)))

# We first need to know how many samples belong to each class,
# so this function groups the data by class label.
# It returns a dict of samples per class and a dict of sample counts per class.
def separateByClass(dataset):
    separate_dict = {}
    info_dict = {}
    for vector in dataset:
        if vector[-1] not in separate_dict:
            separate_dict[vector[-1]] = []
            info_dict[vector[-1]] = 0
        separate_dict[vector[-1]].append(vector)
        info_dict[vector[-1]] += 1
    return separate_dict, info_dict


train_separated, train_info = separateByClass(train_set)
print(train_info)

# Compute the prior probability of each class.
def calculateClassPriorProb(dataset, dataset_info):
    dataset_prior_prob = {}
    sample_sum = len(dataset)
    for class_value, sample_nums in dataset_info.items():
        dataset_prior_prob[class_value] = sample_nums / float(sample_sum)
    return dataset_prior_prob


prior_prob = calculateClassPriorProb(train_set, train_info)
print(prior_prob)

# Next, the class-conditional probability of each feature.
# The features here are continuous, so each one is modeled with a
# Gaussian density described by its mean and variance.

# Mean
def mean(values):
    values = [float(x) for x in values]  # convert strings to numbers
    return sum(values) / float(len(values))


# Sample variance (divides by n - 1)
def var(values):
    values = [float(x) for x in values]
    avg = mean(values)
    return sum([math.pow((x - avg), 2) for x in values]) / float(len(values) - 1)


# Gaussian probability density function
def calculateProb(x, mean, var):
    exponent = math.exp(math.pow((x - mean), 2) / (-2 * var))
    p = (1 / math.sqrt(2 * math.pi * var)) * exponent
    return p


# Compute the mean and variance of every attribute.
def summarizeAttribute(dataset):
    dataset = np.delete(dataset, -1, axis=1)  # drop the label column
    summaries = [(mean(attr), var(attr)) for attr in zip(*dataset)]
    return summaries


summary = summarizeAttribute(train_set)
print(summary)

# Summarize the attributes per class. This yields
# (number of classes) * (number of attributes) pairs of (mean, variance).
def summarizeByClass(dataset):
    dataset_separated, dataset_info = separateByClass(dataset)
    summarize_by_class = {}
    for classValue, vector in dataset_separated.items():
        summarize_by_class[classValue] = summarizeAttribute(vector)
    return summarize_by_class


# train_Summary_by_class = summarizeByClass(train_set)
# print(train_Summary_by_class)

# The training set has already been summarized per class.
# Given a test sample and the per-class, per-attribute summaries
# (a dict with classes * attributes entries), compute the
# class-conditional probability of the sample for each class.
def calculateClassProb(input_data, train_Summary_by_class):
    prob = {}
    for class_value, summary in train_Summary_by_class.items():
        prob[class_value] = 1
        for i in range(len(summary)):
            attr_mean, attr_var = summary[i]
            x = input_data[i]
            p = calculateProb(x, attr_mean, attr_var)
            prob[class_value] *= p
    return prob


input_vector = test_set[1]
input_data = input_vector[:-1]
train_Summary_by_class = summarizeByClass(train_set)
class_prob = calculateClassProb(input_data, train_Summary_by_class)
print(class_prob)

# Naive Bayes classifier: multiply each class-conditional probability
# by the class prior and return the class with the largest product.
def bayesianPredictOneSample(input_data):
    prior_prob = calculateClassPriorProb(train_set, train_info)
    train_Summary_by_class = summarizeByClass(train_set)
    classprob_dict = calculateClassProb(input_data, train_Summary_by_class)
    result = {}
    for class_value, class_prob in classprob_dict.items():
        p = class_prob * prior_prob[class_value]
        result[class_value] = p
    return max(result, key=result.get)


input_vector = test_set[1]
input_data = input_vector[:-1]
result = bayesianPredictOneSample(input_data)
print("the sample is predicted to class: {0}.".format(result))

# Accuracy of the classifier on a dataset.
def calculateAccByBayesian(dataset):
    correct = 0
    for vector in dataset:
        input_data = vector[:-1]
        label = vector[-1]
        result = bayesianPredictOneSample(input_data)
        if result == label:
            correct += 1
    return correct / len(dataset)


acc = calculateAccByBayesian(test_set)
print(acc)
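
The script above multiplies one density value per attribute, and with many attributes such a product can underflow to zero. A common alternative, sketched here as an assumption rather than as part of the original script, is to sum logarithms instead: the logarithm is monotonic, so the arg max is unchanged. The function names and the per-class (mean, variance) values are hypothetical; the data structures mirror `prior_prob` and `train_Summary_by_class`.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Logarithm of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def log_score(x, prior, summaries):
    """log P(Y=c) + sum_j log p(x^(j) | Y=c) for a single class.

    summaries: list of (mean, variance) pairs, one per attribute.
    """
    return math.log(prior) + sum(
        gaussian_log_pdf(xj, m, v) for xj, (m, v) in zip(x, summaries))


def predict_log_space(x, priors, summaries_by_class):
    """Return the class with the largest log score."""
    scores = {c: log_score(x, priors[c], summaries_by_class[c]) for c in priors}
    return max(scores, key=scores.get)


# Made-up priors and per-class (mean, variance) summaries for two attributes.
priors = {"A": 0.5, "B": 0.5}
summaries_by_class = {
    "A": [(1.0, 0.25), (5.0, 1.0)],
    "B": [(3.0, 0.25), (2.0, 1.0)],
}
print(predict_log_space([1.2, 4.5], priors, summaries_by_class))
```

A variance of zero would still break both versions, since the density divides by the variance.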