朴素贝叶斯分类法是基于贝叶斯定理与特征条件独立假设的分类方法。其主要思想为:对于给定的训练数据集 D \mathcal D D ,首先基于特征条件独立假设学习输入 x \mathrm x x 与输出 y y y 的联合概率分布 P ( x , y ) P(\mathrm x, y) P(x,y) ; 然后通过先验概率 P ( y ) P(y) P(y) ,利用贝叶斯定理求出后验概率 P ( y ∣ x ) P(y\mid\mathrm x) P(y∣x) 最大对应的输出 y y y 。
一个例子
由于朴素贝叶斯分类比较简单,这里直接先给出一个例子来演示如何进行分类。相信大多数有概率论基础的同学都能依据这个例子来实现一个朴素贝叶斯分类器。而对于其理论部感兴趣的同学,可以阅读本文后续理论部分。
如下表所示,我们假设有 15 15 15个训练样例 x i , i = 1 , 2 , ⋯ , 15 \mathrm x_i, i=1,2,\cdots,15 xi,i=1,2,⋯,15, 2 2 2个特征 x ( 1 ) , x ( 2 ) x^{(1)},x^{(2)} x(1),x(2),每个特征分别有 3 3 3个取值 x ( 1 ) ∈ { 1 , 2 , 3 } x^{(1)}\in\{1,2,3\} x(1)∈{1,2,3} , x ( 2 ) ∈ { S , M , L } x^{(2)}\in\{\mathrm{S,M,L}\} x(2)∈{S,M,L},两个类别 y ∈ { − 1 , 1 } y\in\{-1,1\} y∈{−1,1}。对于一个新的样例 x = { 2 , S } \mathrm x=\{2,\mathrm S\} x={2,S},我们需要确定其分类。
x 1 \mathrm x_1 x1 | x 2 \mathrm x_2 x2 | x 3 \mathrm x_3 x3 | x 4 \mathrm x_4 x4 | x 5 \mathrm x_5 x5 | x 6 \mathrm x_6 x6 | x 7 \mathrm x_7 x7 | x 8 \mathrm x_8 x8 | x 9 \mathrm x_9 x9 | x 10 \mathrm x_{10} x10 | x 11 \mathrm x_{11} x11 | x 12 \mathrm x_{12} x12 | x 13 \mathrm x_{13} x13 | x 14 \mathrm x_{14} x14 | x 15 \mathrm x_{15} x15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x ( 1 ) x^{(1)} x(1) | 1 1 1 | 1 1 1 | 1 1 1 | 1 1 1 | 1 1 1 | 2 2 2 | 2 2 2 | 2 2 2 | 2 2 2 | 2 2 2 | 3 3 3 | 3 3 3 | 3 3 3 | 3 3 3 | 3 3 3 |
x ( 2 ) x^{(2)} x(2) | S \mathrm S S | M \mathrm M M | M \mathrm M M | S \mathrm S S | S \mathrm S S | S \mathrm S S | M \mathrm M M | M \mathrm M M | L \mathrm L L | L \mathrm L L | L \mathrm L L | M \mathrm M M | M \mathrm M M | L \mathrm L L | L \mathrm L L |
y y y | − 1 -1 −1 | − 1 -1 −1 | 1 1 1 | 1 1 1 | − 1 -1 −1 | − 1 -1 −1 | − 1 -1 −1 | 1 1 1 | 1 1 1 | 1 1 1 | 1 1 1 | 1 1 1 | 1 1 1 | 1 1 1 | − 1 -1 −1 |
根据上表,我们很容易求出各种概率:
P
(
y
=
1
)
=
9
15
,
P
(
y
=
−
1
)
=
6
15
(1)
P(y=1)=\frac{9}{15}, P(y=-1)=\frac{6}{15}\tag{1}
P(y=1)=159,P(y=−1)=156(1)
P ( x ( 1 ) = 1 ∣ y = 1 ) = 2 9 , P ( x ( 1 ) = 2 ∣ y = 1 ) = 3 9 , P ( x ( 1 ) = 3 ∣ y = 1 ) = 4 9 (2) P(x^{(1)}=1\mid y=1)=\frac{2}{9},P(x^{(1)}=2\mid y=1)=\frac{3}{9},P(x^{(1)}=3\mid y=1)=\frac{4}{9}\tag{2} P(x(1)=1∣y=1)=92,P(x(1)=2∣y=1)=93,P(x(1)=3∣y=1)=94(2)
P ( x ( 2 ) = S ∣ y = 1 ) = 1 9 , P ( x ( 2 ) = M ∣ y = 1 ) = 4 9 , P ( x ( 2 ) = L ∣ y = 1 ) = 4 9 (3) P(x^{(2)}=\mathrm S\mid y=1)=\frac{1}{9},P(x^{(2)}=\mathrm M\mid y=1)=\frac{4}{9},P(x^{(2)}=\mathrm L\mid y=1)=\frac{4}{9}\tag{3} P(x(2)=S∣y=1)=91,P(x(2)=M∣y=1)=94,P(x(2)=L∣y=1)=94(3)
P ( x ( 1 ) = 1 ∣ y = − 1 ) = 3 6 , P ( x ( 1 ) = 2 ∣ y = − 1 ) = 2 6 , P ( x ( 1 ) = 3 ∣ y = − 1 ) = 1 6 (4) P(x^{(1)}=1\mid y=-1)=\frac{3}{6},P(x^{(1)}=2\mid y=-1)=\frac{2}{6},P(x^{(1)}=3\mid y=-1)=\frac{1}{6}\tag{4} P(x(1)=1∣y=−1)=63,P(x(1)=2∣y=−1)=62,P(x(1)=3∣y=−1)=61(4)
P ( x ( 2 ) = S ∣ y = − 1 ) = 3 6 , P ( x ( 2 ) = M ∣ y = − 1 ) = 2 6 , P ( x ( 2 ) = L ∣ y = − 1 ) = 1 6 (5) P(x^{(2)}=\mathrm S\mid y=-1)=\frac{3}{6},P(x^{(2)}=\mathrm M\mid y=-1)=\frac{2}{6},P(x^{(2)}=\mathrm L\mid y=-1)=\frac{1}{6}\tag{5} P(x(2)=S∣y=−1)=63,P(x(2)=M∣y=−1)=62,P(x(2)=L∣y=−1)=61(5)
对于新的样例
x
=
{
2
,
S
}
\mathrm x=\{2,\mathrm S\}
x={2,S} ,我们有
P
(
y
=
1
)
P
(
x
(
1
)
=
2
∣
y
=
1
)
P
(
x
(
2
)
=
S
∣
y
=
1
)
=
9
15
×
3
9
×
1
9
=
1
45
(6)
P(y=1)P(x^{(1)}=2\mid y=1)P(x^{(2)}=\mathrm S\mid y=1)=\frac{9}{15}\times\frac{3}{9}\times\frac{1}{9}=\frac{1}{45}\tag{6}
P(y=1)P(x(1)=2∣y=1)P(x(2)=S∣y=1)=159×93×91=451(6)
P ( y = − 1 ) P ( x ( 1 ) = 2 ∣ y = − 1 ) P ( x ( 2 ) = S ∣ y = − 1 ) = 6 15 × 2 6 × 3 6 = 1 15 (7) P(y=-1)P(x^{(1)}=2\mid y=-1)P(x^{(2)}=\mathrm S\mid y=-1)=\frac{6}{15}\times\frac{2}{6}\times\frac{3}{6}=\frac{1}{15}\tag{7} P(y=−1)P(x(1)=2∣y=−1)P(x(2)=S∣y=−1)=156×62×63=151(7)
比较公式(6)和(7),我们判断新的样例 x = { 2 , S } \mathrm x=\{2,\mathrm S\} x={2,S}的输出类别为 y = − 1 y=-1 y=−1 。
朴素贝叶斯分类模型
生成模型与判别模型
与前面文章介绍过的判别模型 (KNN, 线性回归,逻辑斯特回归,决策树,感知机,SVM) 不同, 朴素贝叶斯分类模型属于生成模型。在这些监督学习(训练样例同时有输入
x
\mathrm x
x 和输出
y
y
y)中,目标都是学习一个模型,应用这一个模型,对给定的输入预测相应的输出。这个模型的一般形式为决策函数:
y
=
f
(
x
)
(8)
y=f(\mathrm x)\tag{8}
y=f(x)(8)
或者,从概率的角度上(8)可以写成条件概率:
P
(
y
∣
x
)
(9)
P(y\mid\mathrm x)\tag{9}
P(y∣x)(9)
根据求解(9)的方式,我们可以将模型分为判别模型(discriminative model)和生成模型(generation model):
-
判别模型:直接学习 P ( y ∣ x ) P(y\mid\mathrm x) P(y∣x)。例如,在线性回归中,直接将模型设为 y = f ( x ) = P ( y ∣ x ) = ω 0 + w T x y=f(\mathrm x)=P(y\mid\mathrm x)=\omega_0+\mathrm w^{\mathrm T}\mathrm x y=f(x)=P(y∣x)=ω0+wTx。然后通过正规方程或梯度下降求解最优的参数 w ˉ = { ω 0 , w } \bar{\mathrm w}=\{\omega_0,\mathrm w\} wˉ={ω0,w}。判别方式是直接学习条件概率 P ( y ∣ x ) P(y\mid\mathrm x) P(y∣x),因此学习的准确率更高。
-
生成模型:间接学习 P ( y ∣ x ) P(y\mid\mathrm x) P(y∣x)。通过贝叶斯定理:
P ( y ∣ x ) = P ( x , y ) P ( x ) (10) P(y\mid\mathrm x)=\frac{P(\mathrm x,y)}{P(\mathrm x)}\tag{10} P(y∣x)=P(x)P(x,y)(10)
间接求解我们的目标 P ( y ∣ x ) P(y\mid\mathrm x) P(y∣x)。生产模型需要估计其联合概率分布 P ( x , y ) P(\mathrm x,y) P(x,y),其学习收敛速度较快,但样本数量增加时,学习的模型可以更快地收敛到真实模型。当存在隐变量时,仍然可以用生成模型,此时判别模型不再适用,具体可以参考隐马尔科夫模型。
损失函数
当给定模型(8)或(9)时,我们的目标是通过损失函数最小来求解模型(8),(9)中的优化参数。例如在线性回归中,模型为
y
^
=
ω
0
+
w
T
x
\hat y=\omega_0+\mathrm w^{\mathrm T}\mathrm x
y^=ω0+wTx,我们通过损失函数
∑
i
(
y
^
i
−
y
i
)
2
\sum\nolimits_{i}(\hat y_i-y_i)^2
∑i(y^i−yi)2来寻求最优的参数
w
ˉ
=
{
ω
0
,
w
}
\bar{\mathrm w}=\{\omega_0,\mathrm w\}
wˉ={ω0,w}。对于朴素贝叶斯来说,我们首先定义对于一个样本
x
\mathrm x
x 被判别为类别
k
∈
K
k\in\mathcal K
k∈K 时的损失函数为:
E
(
y
^
=
k
∣
x
)
=
∑
k
′
∈
K
γ
k
k
′
P
(
y
=
k
′
∣
x
)
(11)
E(\hat y=k\mid\mathrm x)=\sum\limits_{k^\prime\in\mathcal K}\gamma_{kk^\prime}P(y=k^\prime\mid\mathrm x)\tag{11}
E(y^=k∣x)=k′∈K∑γkk′P(y=k′∣x)(11)
这里
γ
k
k
′
\gamma_{kk^\prime}
γkk′表示的是真实类别为
y
=
k
′
y=k^\prime
y=k′的样例点
x
\mathrm x
x被分类为
y
^
=
k
\hat y=k
y^=k时所产生的损失,一般来说,我们有:
γ
k
k
′
=
{
1
,
k
≠
k
′
0
,
k
=
k
′
(12)
\begin{aligned} \gamma_{kk^\prime}=\begin{cases} 1,\quad k\neq k^\prime\\ 0, \quad k=k^\prime \end{cases} \end{aligned}\tag{12}
γkk′={1,k=k′0,k=k′(12)
那么对于整个训练数据集,我们可以定义损失函数为:
E
(
D
)
=
E
x
[
E
(
y
^
∣
x
)
]
(13)
E(\mathcal D)=\mathbb E_{\mathrm x}[E(\hat y\mid\mathrm x)]\tag{13}
E(D)=Ex[E(y^∣x)](13)
为了最小化损失函数(13),我们直观地可以使得对于每一个样例,使其损失函数(11)最小,这样就产生了贝叶斯判定准则:
k
⋆
=
a
r
g
m
i
n
k
∈
K
E
(
y
^
=
k
∣
x
)
=
a
r
g
m
i
n
k
∈
K
P
(
y
^
=
k
∣
x
)
(14)
k^\star=argmin_{k\in\mathcal K}\quad E(\hat y=k\mid\mathrm x)=argmin_{k\in\mathcal K}\quad P(\hat y=k\mid\mathrm x)\tag{14}
k⋆=argmink∈KE(y^=k∣x)=argmink∈KP(y^=k∣x)(14)
求解(14)的难度在于计算
P
(
y
∣
x
)
P(y\mid\mathrm x)
P(y∣x)。在朴素贝叶斯中,我们一般通过贝叶斯定理求解:
P
(
y
∣
x
)
=
P
(
x
,
y
)
P
(
x
)
=
P
(
x
∣
y
)
P
(
y
)
P
(
x
)
(15)
P(y\mid\mathrm x)=\frac{P(\mathrm x,y)}{P(\mathrm x)}=\frac{P(\mathrm x\mid y)P(y)}{P(\mathrm x)}\tag{15}
P(y∣x)=P(x)P(x,y)=P(x)P(x∣y)P(y)(15)
对于分类问题来说,我们的目的是求解得到输出
y
y
y。为此,对于
y
y
y来说,我们可以定义
P
(
y
)
P(y)
P(y)为
y
y
y的先验概率,
P
(
y
∣
x
)
P(y\mid\mathrm x)
P(y∣x)为已知
x
\mathrm x
x的情况下
y
y
y的后验概率,
P
(
x
,
y
)
P(\mathrm x,y)
P(x,y)为联合概率分布,
P
(
x
∣
y
)
P(\mathrm x\mid y)
P(x∣y) 为条件概率。对于一个给定的样例
x
\mathrm x
x,概率
P
(
x
)
P(\mathrm x)
P(x)不变,为此判定准则(14)可以变为:
k
⋆
=
a
r
g
m
i
n
k
∈
K
P
(
x
∣
y
=
k
)
P
(
y
=
k
)
(16)
k^\star=argmin_{k\in\mathcal K}\quad P(\mathrm x\mid y=k)P(y=k)\tag{16}
k⋆=argmink∈KP(x∣y=k)P(y=k)(16)
为此,我们的目的变为如何根据训练数据集
D
\mathcal D
D来计算
P
(
x
∣
y
=
k
)
P(\mathrm x\mid y=k)
P(x∣y=k) 和
P
(
y
=
k
)
P(y=k)
P(y=k)。下面我们介绍如何通过极大似然估计和贝叶斯估计来计算公式(16)。
参数估计:MLE 和 MAP
极大似然估计 (MLE)
在朴素贝叶斯分类中,学习后验概率
P
(
y
∣
x
)
P(y\mid\mathrm x)
P(y∣x)意味着估计先验概率
P
(
y
)
P(y)
P(y) 和 条件概率
P
(
x
∣
y
)
P(\mathrm x\mid y)
P(x∣y)。利用极大似然估计法,先验概率
P
(
y
)
P(y)
P(y)可以计算如下:
P
(
y
=
k
)
=
∣
D
k
∣
∣
D
∣
(17)
P(y=k)=\frac{\lvert\mathcal D_k\rvert}{\lvert\mathcal D\rvert}\tag{17}
P(y=k)=∣D∣∣Dk∣(17)
其中,
D
k
\mathcal D_k
Dk 为训练数据集中所有类别为
k
k
k的样例集合。同样地,条件概率
P
(
x
∣
y
)
P(\mathrm x\mid y)
P(x∣y) 可以展开如下:
P
(
x
∣
y
=
k
)
=
P
(
x
(
1
)
,
x
(
2
)
,
⋯
,
x
(
M
)
∣
y
=
k
)
(18)
P(\mathrm x\mid y=k)=P(x^{(1)},x^{(2)},\cdots,x^{(M)}\mid y=k)\tag{18}
P(x∣y=k)=P(x(1),x(2),⋯,x(M)∣y=k)(18)
由于在朴素贝叶斯中,各个特征条件独立,公式(18)可以转化为:
P
(
x
∣
y
=
k
)
=
P
(
x
(
1
)
,
x
(
2
)
,
⋯
,
x
(
M
)
∣
y
=
k
)
=
∏
j
=
1
M
P
(
x
(
j
)
∣
y
=
k
)
(19)
P(\mathrm x\mid y=k)=P(x^{(1)},x^{(2)},\cdots,x^{(M)}\mid y=k)=\prod\limits_{j=1}^{M}P(x^{(j)}\mid y=k)\tag{19}
P(x∣y=k)=P(x(1),x(2),⋯,x(M)∣y=k)=j=1∏MP(x(j)∣y=k)(19)
我们假定在
D
k
\mathcal D_k
Dk中,第
j
j
j个特征取值为
x
(
j
)
x^{(j)}
x(j)的样例集合为
D
k
(
j
)
\mathcal D_k^{(j)}
Dk(j),那么由极大似然估计可得
P
(
x
(
j
)
∣
y
=
k
)
P(x^{(j)}\mid y=k)
P(x(j)∣y=k):
P
(
x
(
j
)
∣
y
)
=
∣
D
k
(
j
)
∣
∣
D
k
∣
(20)
P(x^{(j)}\mid y)=\frac{\lvert D_k^{(j)}\rvert}{\lvert\mathcal D_k\rvert}\tag{20}
P(x(j)∣y)=∣Dk∣∣Dk(j)∣(20)
将公式(20)带入公式(19)得:
P
(
x
∣
y
=
k
)
=
∏
j
=
1
M
P
(
x
(
j
)
∣
y
=
k
)
=
∏
j
=
1
M
∣
D
k
(
j
)
∣
∣
D
k
∣
(21)
P(\mathrm x\mid y=k)=\prod\limits_{j=1}^{M}P(x^{(j)}\mid y=k)=\prod\limits_{j=1}^{M}\frac{\lvert D_k^{(j)}\rvert}{\lvert\mathcal D_k\rvert}\tag{21}
P(x∣y=k)=j=1∏MP(x(j)∣y=k)=j=1∏M∣Dk∣∣Dk(j)∣(21)
至此,最基本的朴素贝叶斯分类方法已经介绍完毕。总结:首先根据训练数据集 D \mathcal D D计算公式(17)和(21),然后根据得到的 P ( y ) , P ( x ∣ y ) P(y),P(\mathrm x\mid y) P(y),P(x∣y)得到 P ( y ∣ x ) P(y\mid\mathcal x) P(y∣x),最后根据(14)进行分类。回到本文最前面的例子:其中公式(1)计算先验概率 P ( y ) P(y) P(y),公式(2)-(5)计算条件概率 P ( x ∣ y ) P(\mathrm x\mid y) P(x∣y),得到(6)-(7)中的后验概率 P ( y ∣ x ) P(y\mid\mathcal x) P(y∣x), 最后比较后验概率 P ( y = 1 ∣ x ) P(y=1\mid\mathrm x) P(y=1∣x)与 P ( y = − 1 ∣ x ) P(y=-1\mid\mathrm x) P(y=−1∣x),样例 x \mathrm x x被判别后验概率最大所对应的 y y y,即 y = − 1 y=-1 y=−1。
最大后验概率估计 (MAP)
上述极大似然估计可能会对训练集中未出现的特征出现估计的概率值为0的情况,即条件概率
P
(
x
∣
y
)
=
0
P(\mathrm x\mid y)=0
P(x∣y)=0。为解决这一问题,我们可以使用贝叶斯估计。那么,先验概率
P
(
y
)
P(y)
P(y)可以表示为
P
(
y
=
k
)
=
∣
D
k
∣
+
α
∣
D
∣
+
α
K
(22)
P(y=k)=\frac{\lvert\mathcal D_k\rvert+\alpha}{\lvert\mathcal D\rvert+\alpha K}\tag{22}
P(y=k)=∣D∣+αK∣Dk∣+α(22)
其中
K
K
K为输出
y
y
y可能的取值个数,
α
\alpha
α为一个常数,当
α
=
0
\alpha=0
α=0 时,即为前面所说的极大似然估计;当
α
=
1
\alpha=1
α=1时,这时称为拉普拉斯平滑。同样地,条件概率的贝叶斯估计为:
P
(
x
(
j
)
∣
y
=
k
)
=
∣
D
k
(
j
)
∣
+
α
∣
D
k
∣
+
α
K
j
(23)
P(x^{(j)}\mid y=k)=\frac{\lvert D_k^{(j)}\rvert+\alpha}{\lvert\mathcal D_k\rvert+\alpha K_j}\tag{23}
P(x(j)∣y=k)=∣Dk∣+αKj∣Dk(j)∣+α(23)
其中
K
j
K_j
Kj为第
j
j
j个特征属性可能的取值个数,根据公式(22)和(23),我们很容易将公式(1)-(7)改写为:
P
(
y
=
1
)
=
10
17
,
P
(
y
=
−
1
)
=
7
17
(
1
′
)
P(y=1)=\frac{10}{17}, P(y=-1)=\frac{7}{17}\tag{$1^\prime$}
P(y=1)=1710,P(y=−1)=177(1′)
P ( x ( 1 ) = 1 ∣ y = 1 ) = 3 12 , P ( x ( 1 ) = 2 ∣ y = 1 ) = 4 12 , P ( x ( 1 ) = 3 ∣ y = 1 ) = 5 12 ( 2 ′ ) P(x^{(1)}=1\mid y=1)=\frac{3}{12},P(x^{(1)}=2\mid y=1)=\frac{4}{12},P(x^{(1)}=3\mid y=1)=\frac{5}{12}\tag{$2^\prime$} P(x(1)=1∣y=1)=123,P(x(1)=2∣y=1)=124,P(x(1)=3∣y=1)=125(2′)
P ( x ( 2 ) = S ∣ y = 1 ) = 2 12 , P ( x ( 2 ) = M ∣ y = 1 ) = 5 12 , P ( x ( 2 ) = L ∣ y = 1 ) = 5 12 ( 3 ′ ) P(x^{(2)}=\mathrm S\mid y=1)=\frac{2}{12},P(x^{(2)}=\mathrm M\mid y=1)=\frac{5}{12},P(x^{(2)}=\mathrm L\mid y=1)=\frac{5}{12}\tag{$3^\prime$} P(x(2)=S∣y=1)=122,P(x(2)=M∣y=1)=125,P(x(2)=L∣y=1)=125(3′)
P ( x ( 1 ) = 1 ∣ y = − 1 ) = 4 9 , P ( x ( 1 ) = 2 ∣ y = − 1 ) = 3 9 , P ( x ( 1 ) = 3 ∣ y = − 1 ) = 2 9 ( 4 ′ ) P(x^{(1)}=1\mid y=-1)=\frac{4}{9},P(x^{(1)}=2\mid y=-1)=\frac{3}{9},P(x^{(1)}=3\mid y=-1)=\frac{2}{9}\tag{$4^\prime$} P(x(1)=1∣y=−1)=94,P(x(1)=2∣y=−1)=93,P(x(1)=3∣y=−1)=92(4′)
P ( x ( 2 ) = S ∣ y = − 1 ) = 4 9 , P ( x ( 2 ) = M ∣ y = − 1 ) = 3 9 , P ( x ( 2 ) = L ∣ y = − 1 ) = 2 9 ( 5 ′ ) P(x^{(2)}=\mathrm S\mid y=-1)=\frac{4}{9},P(x^{(2)}=\mathrm M\mid y=-1)=\frac{3}{9},P(x^{(2)}=\mathrm L\mid y=-1)=\frac{2}{9}\tag{$5^\prime$} P(x(2)=S∣y=−1)=94,P(x(2)=M∣y=−1)=93,P(x(2)=L∣y=−1)=92(5′)
对于新的样例
x
=
{
2
,
S
}
\mathrm x=\{2,\mathrm S\}
x={2,S} ,我们有
P
(
y
=
1
)
P
(
x
(
1
)
=
2
∣
y
=
1
)
P
(
x
(
2
)
=
S
∣
y
=
1
)
=
10
17
×
4
12
×
1
12
=
5
153
=
0.032
(
6
′
)
P(y=1)P(x^{(1)}=2\mid y=1)P(x^{(2)}=\mathrm S\mid y=1)=\frac{10}{17}\times\frac{4}{12}\times\frac{1}{12}=\frac{5}{153}=0.032\tag{$6^\prime$}
P(y=1)P(x(1)=2∣y=1)P(x(2)=S∣y=1)=1710×124×121=1535=0.032(6′)
P ( y = − 1 ) P ( x ( 1 ) = 2 ∣ y = − 1 ) P ( x ( 2 ) = S ∣ y = − 1 ) = 7 17 × 3 9 × 4 9 = 28 459 = 0.061 ( 7 ′ ) P(y=-1)P(x^{(1)}=2\mid y=-1)P(x^{(2)}=\mathrm S\mid y=-1)=\frac{7}{17}\times\frac{3}{9}\times\frac{4}{9}=\frac{28}{459}=0.061\tag{$7^\prime$} P(y=−1)P(x(1)=2∣y=−1)P(x(2)=S∣y=−1)=177×93×94=45928=0.061(7′)
比较公式( 6 ′ 6^\prime 6′)和( 7 ′ 7^\prime 7′),我们判断新的样例 x = { 2 , S } \mathrm x=\{2,\mathrm S\} x={2,S}的输出类别为 y = − 1 y=-1 y=−1 。
连续特征的朴素贝叶斯分类
在前面的例子中,我们都假设特征取值是离散的,这样我们可以直接通过公式(20)和(23)直接计算条件概率。但是,在实际生活中,很多特征属性的取值是连续的。这时,一般有两种方法:1)最简单直观地就是将连续数据离散化;2)假定该连续特征变量服从一种概率分布(正态分布,二项分布等),根据训练数据集计算该特征的概率密度函数。对于第一种方法,比较简单,不在赘述。下面主要介绍第二种方法:
对于连续特征
x
(
j
)
x^{(j)}
x(j),我们假定其服从正态分布,即
P
(
x
(
j
)
∣
y
=
k
)
∼
N
(
μ
k
j
,
σ
k
j
)
P(x^{(j)}\mid y=k)\sim\mathcal N(\mu_{kj},\sigma_{kj})
P(x(j)∣y=k)∼N(μkj,σkj) 。对于正态分布,其极大似然估计为:
μ
k
j
=
1
∣
D
k
∣
∑
i
∈
D
k
x
i
(
j
)
(24)
\mu_{kj}=\frac{1}{\lvert\mathcal D_{k}\rvert}\sum\limits_{i\in\mathcal D_{k}}{x_{i}^{(j)}}\tag{24}
μkj=∣Dk∣1i∈Dk∑xi(j)(24)
σ k j 2 = 1 ∣ D k ∣ ∑ i ∈ D k ( x i ( j ) − μ k j ) 2 (25) \sigma_{kj}^2=\frac{1}{\lvert\mathcal D_{k}\rvert}\sum\limits_{i\in\mathcal D_{k}}{(x_{i}^{(j)}-\mu_{kj})^2}\tag{25} σkj2=∣Dk∣1i∈Dk∑(xi(j)−μkj)2(25)
根据公式(24)和(25),就可以用正态分布函数的概率密度表达式来计算条件概率:
P
(
x
(
j
)
∣
y
=
k
)
=
1
2
π
σ
k
j
exp
(
−
(
x
(
j
)
−
μ
k
j
)
2
2
σ
k
j
2
)
(26)
P(x^{(j)}\mid y=k)=\frac{1}{\sqrt{2\pi}\sigma_{kj}}\exp(-\frac{(x^{(j)}-\mu_{kj})^2}{2\sigma_{kj}^2})\tag{26}
P(x(j)∣y=k)=2πσkj1exp(−2σkj2(x(j)−μkj)2)(26)
将公式(26)替换前面公式(20)或(23),就可以利用上述方式得到贝叶斯分类器了。注意:这里我们假定其服从正态分布,当然也可以假定其服从其他分布。这里概率分布的选取对最后的分类效果影响较大。
连续朴素贝叶斯分类实例
我们这里使用iris数据集(sklearn库中自带),这里面有150个训练样例,4个feature, 总共分3类。我们只考虑了前2个feature,这么做是为了在下面二维图中展示分类结果。在这150个样例中,我们取出第25,75,125个样例作为测试样例(其label分别为0,1,2),其他147个作为训练样例。下图为测试结果:
|
图1给出了这3三个测试样例的预测结果,其输出的后验概率矩阵就是这3个测试样例属于类别 0 , 1 , 2 0,1,2 0,1,2 的后验概率。由于,贝叶斯判别准则为判别为后验概率最大的类,于是可得这3个样例分别判别为类别 0 , 2 , 2 0, 2, 2 0,2,2。图2更加直观地显示了图1的判别结果。其中,填充颜色为我们样例点真正的类别,其中那3个点的轮廓颜色表示的是我们判别的结果,当两种颜色相同时,样例点被正确判别。由图2可知,属于蓝色的那个样例点被错误分类为绿色类别。
|
更进一步,我们可以利用上面得到的朴素贝叶斯分类器对所有的可能的样例点 (即,图3中的每一个坐标点)进行分类,由此我们可以得到分类超平面(二维空间中为线条)如下:
|
附录:
所有图片的python源代码如下:
图1-图2:
# -*- encoding: utf-8 -*-
"""
@File : Naive_Bayes_fig001.py
@Time : 2020/5/28 11:50
@Author : tengweitw
@Email : tengweitw@foxmail.com
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Set the format of labels
def LabelFormat(plt):
ax = plt.gca()
plt.tick_params(labelsize=14)
labels = ax.get_xticklabels() + ax.get_yticklabels()
[label.set_fontname('Times New Roman') for label in labels]
font = {'family': 'Times New Roman',
'weight': 'normal',
'size': 16,
}
return font
# Plot the training points:
def PlotTrainPoint(train_data, train_target):
for i in range(0, len(train_target)):
if train_target[i] == 0:
plt.plot(train_data[i][0], train_data[i][1], 'rs', markersize=6, markerfacecolor="r")
elif train_target[i] == 1:
plt.plot(train_data[i][0], train_data[i][1], 'bo', markersize=6, markerfacecolor="b")
elif train_target[i] == 2:
plt.plot(train_data[i][0], train_data[i][1], 'gd', markersize=6, markerfacecolor="g")
# Plot the testing points and prediction results
def PlotTestPoint(test_data, test_target, y_predict_test):
# Plot the testing points
for i in range(0, len(test_target)):
if test_target[i] == 0:
plt.plot(test_data[i][0], test_data[i][1], 'rs', markersize=6)
elif test_target[i] == 1:
plt.plot(test_data[i][0], test_data[i][1], 'bo', markersize=6)
else:
plt.plot(test_data[i][0], test_data[i][1], 'gs', markersize=6)
# Plot the prediction results
for i in range(0, len(y_predict_test)):
if y_predict_test[i] == 0:
plt.plot(test_data[i][0], test_data[i][1], 'rs', markerfacecolor='none', markersize=10)
elif y_predict_test[i] == 1:
plt.plot(test_data[i][0], test_data[i][1], 'bs', markersize=10, markerfacecolor="none")
else:
plt.plot(test_data[i][0], test_data[i][1], 'gs', markersize=10, markerfacecolor="none")
# Compute the conditional probability P(x|y) for the test point x
def Calculate_P_x_y(Mean_x, Std_x_square, test_point, M, K):
P_x_y = np.zeros((M, K))
for j in range(0, M):
for k in range(0, K):
P_x_y[j][k] = 1 / np.sqrt(2 * np.pi * Std_x_square[k][j]) * np.exp(
-np.square(test_point[j] - Mean_x[k][j]) / 2 / Std_x_square[k][j])
return P_x_y
def Naive_Bayes_classfier(train_data, train_target, test_data, test_target):
M = np.size(train_data, 1) # Dimension of features: 2
N = np.size(train_data, 0) # Number of instances
K = np.size(np.unique(train_target)) # Number of classes
cnt = np.zeros((K, M))
cnt_y = np.zeros((K, 1))
sum_x = np.zeros((K, M))
x = np.array((K, 1))
for k in range(0, K): # for each class
for j in range(0, M): # for each feature
cnt_y[k] = 0
for i in range(0, N): # for each instance
if train_target[i] == k:
cnt_y[k] += 1
cnt[k][j] += 1 # The number of feature j in class i
sum_x[k][j] += train_data[i][j] # The sum of x_{kj}
# Compute the mean of x^j in the training points classified as class k
Mean_x = np.zeros((K, M))
for k in range(0, K): # for each class
for j in range(0, M): # for each feature
Mean_x[k][j] = sum_x[k][j] / cnt[k][j]
# Compute the variances of x^j in the training points classified as class k
Std_x_square = np.zeros((K, M))
for k in range(0, K): # for each class
for j in range(0, M): # for each feature
for i in range(0, N): # for each instance
if train_target[i] == k:
Std_x_square[k][j] += np.square(Mean_x[k][j] - train_data[i][j]) / cnt[k][j]
# Compute the prior probability P(y)
P_y = np.zeros((K, 1))
for k in range(0, K):
P_y[k] = cnt_y[k] / sum(cnt_y)
# ------Prediction---------#
L = len(test_target) # number of test_points
P_y_x = np.ones((K, L)) # The posteriori probability
for l in range(L):
temp = Calculate_P_x_y(Mean_x, Std_x_square, test_data[l], M, K)
for k in range(K):
for j in range(M):
P_y_x[k][l] = P_y_x[k][l] * temp[j][k]
P_y_x[k][l] = P_y_x[k][l] * P_y[k]
print("\n The posterior prob. for the instances are:")
print(P_y_x)
y_predict_test = np.argmax(P_y_x, axis=0) # find the max according to the column
print("The instances are classified as")
print(y_predict_test)
return y_predict_test
# import dataset of iris
iris = datasets.load_iris()
# The first two-dim feature for simplicity
data = iris.data[:, :2]
# The labels
label = iris.target
# Choose the 25,75,125th instance as testing points
test_data = [data[25, :], data[75, :], data[125, :]]
test_target = label[[25, 75, 125]]
print('The testing instances are:')
print(test_data)
print("The true classes for each instance are:")
print(test_target)
data = np.delete(data, [25, 75, 125], axis=0)
label = np.delete(label, [25, 75, 125], axis=0)
train_data = data
train_target = label
y_predict_test = Naive_Bayes_classfier(train_data, train_target, test_data, test_target)
plt.figure()
PlotTrainPoint(train_data, train_target)
PlotTestPoint(test_data, test_target, y_predict_test)
font = LabelFormat(plt)
plt.xlabel('$x^{(1)}$', font)
plt.ylabel('$x^{(2)}$', font)
plt.xlim(4,8)
plt.ylim(1.8,4.5)
plt.show()
图3:
# -*- encoding: utf-8 -*-
"""
@File : Naive_Bayes_fig002.py
@Time : 2020/5/28 17:18
@Author : tengweitw
@Email : tengweitw@foxmail.com
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
# Set the format of labels
def LabelFormat(plt):
ax = plt.gca()
plt.tick_params(labelsize=14)
labels = ax.get_xticklabels() + ax.get_yticklabels()
[label.set_fontname('Times New Roman') for label in labels]
font = {'family': 'Times New Roman',
'weight': 'normal',
'size': 16,
}
return font
# Plot the training points:
def PlotTrainPoint(train_data, train_target):
for i in range(0, len(train_target)):
if train_target[i] == 0:
plt.plot(train_data[i][0], train_data[i][1], 'rs', markersize=6, markerfacecolor="r")
elif train_target[i] == 1:
plt.plot(train_data[i][0], train_data[i][1], 'bo', markersize=6, markerfacecolor="b")
elif train_target[i] == 2:
plt.plot(train_data[i][0], train_data[i][1], 'gd', markersize=6, markerfacecolor="g")
cmap_light = ListedColormap(['tomato', 'limegreen', 'cornflowerblue'])
# Plot the contour of classes
def PlotTestPoint(test_data, test_target, y_predict_test):
for i in range(0, len(y_predict_test)):
if y_predict_test[i] == 0:
plt.plot(test_data[i][0], test_data[i][1], color='cornflowerblue',marker="s", markersize=6)
elif y_predict_test[i] == 1:
plt.plot(test_data[i][0], test_data[i][1], color='limegreen', marker="s",markersize=6)
else:
plt.plot(test_data[i][0], test_data[i][1], color='tomato',marker="s", markersize=6)
# Compute the conditional probability P(x|y) for the test point x
def Calculate_P_x_y(Mean_x, Std_x_square, test_point, M, K):
P_x_y = np.zeros((M, K))
for j in range(0, M):
for k in range(0, K):
P_x_y[j][k] = 1 / np.sqrt(2 * np.pi * Std_x_square[k][j]) * np.exp(
-np.square(test_point[j] - Mean_x[k][j]) / 2 / Std_x_square[k][j])
return P_x_y
def Naive_Bayes_classfier(train_data, train_target, test_data, test_target):
M = np.size(train_data, 1) # Dimension of features: 2
N = np.size(train_data, 0) # Number of instances
K = np.size(np.unique(train_target)) # Number of classes
cnt = np.zeros((K, M))
cnt_y = np.zeros((K, 1))
sum_x = np.zeros((K, M))
x = np.array((K, 1))
for k in range(0, K): # for each class
for j in range(0, M): # for each feature
cnt_y[k] = 0
for i in range(0, N): # for each instance
if train_target[i] == k:
cnt_y[k] += 1
cnt[k][j] += 1 # The number of feature j in class i
sum_x[k][j] += train_data[i][j] # The sum of x_{kj}
# Compute the mean of x^j in the training points classified as class k
Mean_x = np.zeros((K, M))
for k in range(0, K): # for each class
for j in range(0, M): # for each feature
Mean_x[k][j] = sum_x[k][j] / cnt[k][j]
# Compute the variances of x^j in the training points classified as class k
Std_x_square = np.zeros((K, M))
for k in range(0, K): # for each class
for j in range(0, M): # for each feature
for i in range(0, N): # for each instance
if train_target[i] == k:
Std_x_square[k][j] += np.square(Mean_x[k][j] - train_data[i][j]) / cnt[k][j]
# Compute the prior probability P(y)
P_y = np.zeros((K, 1))
for k in range(0, K):
P_y[k] = cnt_y[k] / sum(cnt_y)
# ------Prediction---------#
L = len(test_target) # number of test_points
P_y_x = np.ones((K, L)) # The posteriori probability
for l in range(L):
temp = Calculate_P_x_y(Mean_x, Std_x_square, test_data[l], M, K)
for k in range(K):
for j in range(M):
P_y_x[k][l] = P_y_x[k][l] * temp[j][k]
P_y_x[k][l] = P_y_x[k][l] * P_y[k]
print("\n The posterior prob. for the instances are:")
print(P_y_x)
y_predict_test = np.argmax(P_y_x, axis=0) # find the max according to the column
print("The instances are classified as")
print(y_predict_test)
return y_predict_test
# import dataset of iris
iris = datasets.load_iris()
# The first two-dim feature for simplicity
data = iris.data[:, :2]
# The labels
label = iris.target
# Delete the three instances to keep the same with fig 1-2
# You can also remove the following two sentences
data = np.delete(data, [25, 75, 125], axis=0)
label = np.delete(label, [25, 75, 125], axis=0)
train_data = data
train_target = label
# Here use all the points in the considered area as testing points
test_data1=[]
test_target1=[]
x=np.linspace(4,8,100)
y=np.linspace(1.8,4.5,100)
for i in x:
for j in y:
test_data1.append([i,j])
test_target1.append([1])
y_predict_test = Naive_Bayes_classfier(train_data, train_target, test_data1, test_target1)
plt.figure()
PlotTestPoint(test_data1, test_target1, y_predict_test)
PlotTrainPoint(train_data, train_target)
font = LabelFormat(plt)
plt.xlabel('$x^{(1)}$', font)
plt.ylabel('$x^{(2)}$', font)
plt.xlim(4,8)
plt.ylim(1.8,4.5)
plt.show()