Naive Bayes Study Log
Chapter 4: Learning and Classification with Naive Bayes
Naive Bayes learns the joint probability distribution $P(X,Y)$; that is, it learns the following prior probability distribution and conditional probability distribution.
Prior probability distribution:
$$P\left(Y=c_{k}\right), \quad k=1,2, \cdots, K$$
Conditional probability distribution:
$$P\left(X=x \mid Y=c_{k}\right)=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K$$
Learning these yields the joint distribution $P(X,Y)$. However, the conditional distribution $P\left(X=x \mid Y=c_{k}\right)$ has an exponential number of parameters, so estimating it directly is infeasible: if $x^{(j)}$ can take $S_j$ values, $j=1,2, \cdots, n$, and $Y$ can take $K$ values, the number of parameters is $K \prod_{j=1}^{n} S_{j}$.
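The parameter-count gap can be checked with a quick calculation. The numbers below are an assumed toy configuration (ten binary features, three classes), not anything from the text:

```python
# Hypothetical configuration: n = 10 binary features, K = 3 classes.
from math import prod

n, K = 10, 3
S = [2] * n  # S_j: number of values feature j can take

# Modeling P(X = x | Y = c_k) directly needs K * prod(S_j) parameters...
full_joint = K * prod(S)

# ...while the conditional independence assumption (next paragraph) needs
# only K * sum(S_j) conditional probabilities, plus K priors.
naive = K * sum(S)

print(full_joint)  # 3 * 2**10 = 3072
print(naive)       # 3 * 20 = 60
```

Even at this small scale the full conditional table is fifty times larger; with a few dozen features it becomes astronomically so.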
Naive Bayes therefore makes a conditional independence assumption on the conditional distribution. Because this is a strong assumption, the method takes its "naive" name from it. The assumption is
$$P\left(X=x \mid Y=c_{k}\right)=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right)=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) \quad (1)$$
Naive Bayes in effect learns the mechanism that generates the data, so it is a generative model. The conditional independence assumption says that, once the class is fixed, the features used for classification are conditionally independent of one another.
To classify a given input $x$, naive Bayes uses the learned model to compute the posterior distribution $P\left(Y=c_{k} \mid X=x\right)$ and outputs the class with the largest posterior probability. The posterior is computed via Bayes' theorem:
$$P\left(Y=c_{k} \mid X=x\right)=\frac{P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)} \quad (2)$$
Substituting equation (1) into equation (2) gives
$$P\left(Y=c_{k} \mid X=x\right)=\frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}, \quad k=1,2, \cdots, K \quad (3)$$
This is the basic formula for naive Bayes classification. The naive Bayes classifier can thus be written as
$$y=f(x)=\arg \max _{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)} \quad (4)$$
Note that in equation (3) the denominator is the same for every class, so equivalently
$$y=\arg \max _{c_k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$$
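A small numeric sketch of why dropping the shared denominator is safe. The class scores below are made-up numbers, standing in for $P(Y=c_k)\prod_j P(x^{(j)} \mid c_k)$:

```python
# Toy scores (assumed, not from the text): P(Y=c_k) * prod_j P(x^(j) | c_k).
scores = {"c1": 0.06, "c2": 0.24, "c3": 0.10}

# Dividing by the common denominator of (3) rescales every class equally...
denom = sum(scores.values())
posteriors = {c: s / denom for c, s in scores.items()}

# ...so the arg max is unchanged.
best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
assert best_by_score == best_by_posterior == "c2"
```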
Interpretation: posterior probability maximization
Naive Bayes assigns an instance to the class with the largest posterior probability. This is equivalent to minimizing the expected risk. Suppose we choose the 0-1 loss function:
$$L(Y, f(X))=\left\{\begin{array}{ll}{1,} & {Y \neq f(X)} \\ {0,} & {Y=f(X)}\end{array}\right.$$
where $f(X)$ is the classification decision function. The expected risk is
$$R_{\exp}(f)=E[L(Y, f(X))]$$
The expectation is taken over the joint distribution $P(X,Y)$, so taking the conditional expectation,
$$R_{\exp}(f)=E_{X} \sum_{k=1}^{K}\left[L\left(c_{k}, f(X)\right)\right] P\left(c_{k} \mid X\right)$$
To minimize the expected risk it suffices to minimize pointwise for each $X=x$, which gives
$$\begin{aligned} f(x) &=\arg \min _{y \in \mathcal{Y}} \sum_{k=1}^{K} L\left(c_{k}, y\right) P\left(c_{k} \mid X=x\right) \\ &=\arg \min _{y \in \mathcal{Y}} \sum_{k=1}^{K} P\left(y \neq c_{k} \mid X=x\right) \\ &=\arg \min _{y \in \mathcal{Y}}\left(1-P\left(y=c_{k} \mid X=x\right)\right) \\ &=\arg \max _{y \in \mathcal{Y}} P\left(y=c_{k} \mid X=x\right) \end{aligned}$$
Thus the expected-risk-minimization criterion yields the posterior-probability-maximization criterion
$$f(x)=\arg \max _{c_{k}} P\left(c_{k} \mid X=x\right)$$
which is exactly the principle naive Bayes adopts.
Parameter estimation for naive Bayes
Maximum likelihood estimation
In naive Bayes, learning means estimating $P\left(Y=c_{k}\right)$ and $P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$. Maximum likelihood estimation can be applied to estimate these probabilities.
The maximum likelihood estimate of the prior probability $P\left(Y=c_{k}\right)$ is
$$P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K$$
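The prior MLE is just label counting. A minimal sketch with hypothetical labels:

```python
# Prior MLE: count each label and divide by N. Labels are made up.
from collections import Counter

y = ["a", "b", "a", "a", "c"]
N = len(y)
prior = {c: cnt / N for c, cnt in Counter(y).items()}
print(prior["a"])  # 0.6
```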
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$. The maximum likelihood estimate of the conditional probability $P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)$ is
$$P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}$$
$$j=1,2, \cdots, n; \quad l=1,2, \cdots, S_{j}; \quad k=1,2, \cdots, K$$
where $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{j l}$ is the $l$-th value the $j$-th feature can take, and $I$ is the indicator function.
The naive Bayes algorithm
Input: training data $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i}=\left(x_{i}^{(1)}, x_{i}^{(2)}, \cdots, x_{i}^{(n)}\right)^{\mathrm{T}}$, $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_{i}^{(j)} \in\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$, $a_{j l}$ is the $l$-th value the $j$-th feature can take, $j=1,2, \cdots, n$, $l=1,2, \cdots, S_{j}$, $y_{i} \in\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$; an instance $x$.
Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities:
$$P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K$$
$$P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}$$
$$j=1,2, \cdots, n; \quad l=1,2, \cdots, S_{j}; \quad k=1,2, \cdots, K$$
(2) For the given instance $x=\left(x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right)^{\mathrm{T}}$, compute
$$P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K$$
(3) Determine the class of instance $x$:
$$y=\arg \max _{c_k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$$
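Steps (1)-(3) can be sketched directly in Python. The dataset below (two discrete features, labels in $\{-1, 1\}$) is an illustrative example, not data given in the text above:

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """Step (1): MLE of P(Y=c_k) and P(X^(j)=a_jl | Y=c_k)."""
    N = len(y)
    class_count = Counter(y)                 # sum_i I(y_i = c_k)
    prior = {c: cnt / N for c, cnt in class_count.items()}
    cond = defaultdict(Counter)              # (class, feature j) -> value counts
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond[(yi, j)][v] += 1
    def cond_prob(j, v, c):
        return cond[(c, j)][v] / class_count[c]
    return prior, cond_prob

def predict(prior, cond_prob, x):
    """Steps (2)-(3): score each class, return the arg max."""
    def score(c):
        s = prior[c]
        for j, v in enumerate(x):
            s *= cond_prob(j, v, c)
        return s
    return max(prior, key=score)

# Illustrative dataset: feature 1 in {1,2,3}, feature 2 in {"S","M","L"}.
X = [(1, "S"), (1, "M"), (1, "M"), (1, "S"), (1, "S"),
     (2, "S"), (2, "M"), (2, "M"), (2, "L"), (2, "L"),
     (3, "L"), (3, "M"), (3, "M"), (3, "L"), (3, "L")]
y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]

prior, cond_prob = fit_naive_bayes(X, y)
print(predict(prior, cond_prob, (2, "S")))  # -1
```

For the query $(2, \text{S})$: $P(Y=1)\cdot\frac{3}{9}\cdot\frac{1}{9}=\frac{9}{15}\cdot\frac{3}{9}\cdot\frac{1}{9}\approx 0.022$ versus $\frac{6}{15}\cdot\frac{2}{6}\cdot\frac{3}{6}\approx 0.067$, so the classifier outputs $-1$.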
Bayesian estimation
Maximum likelihood estimation can yield probability estimates equal to $0$, which distorts the computed posterior probabilities and biases the classification. The remedy is Bayesian estimation. Specifically, the Bayesian estimate of the conditional probability is
$$P_{\lambda}\left(X^{(j)}=a_{jl} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+S_{j} \lambda}$$
where $\lambda \geqslant 0$. This is equivalent to adding a positive count $\lambda>0$ to the frequency of each value of the random variable. When $\lambda=0$ it reduces to maximum likelihood estimation; the common choice $\lambda=1$ is called Laplace smoothing.
Clearly, for any $l=1,2, \cdots, S_{j}$ and $k=1,2, \cdots, K$,
$$P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)>0$$
$$\sum_{l=1}^{S_{j}} P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=1$$
so the smoothed estimates remain a valid probability distribution.
Similarly, the Bayesian estimate of the prior probability is
$$P_{\lambda}\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+\lambda}{N+K \lambda}$$
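The smoothed estimates can be sketched as follows. The tiny dataset is made up; `S_j` is taken to be the number of distinct values of feature $j$ seen in the data, which is an assumption of this sketch (the true value sets could be larger):

```python
from collections import Counter

def bayes_estimates(X, y, lam=1.0):
    """Laplace-smoothed (lam = lambda) estimates P_lambda(Y=c_k) and
    P_lambda(X^(j)=a_jl | Y=c_k). Assumes discrete features and takes
    S_j from the values observed in the data."""
    N, K = len(y), len(set(y))
    class_count = Counter(y)
    prior = {c: (cnt + lam) / (N + K * lam) for c, cnt in class_count.items()}

    n = len(X[0])
    # Observed value set of each feature; its size plays the role of S_j.
    values = [sorted({xi[j] for xi in X}) for j in range(n)]
    cond = {}
    for c in class_count:
        for j in range(n):
            counts = Counter(xi[j] for xi, yi in zip(X, y) if yi == c)
            denom = class_count[c] + len(values[j]) * lam
            cond[(c, j)] = {v: (counts[v] + lam) / denom for v in values[j]}
    return prior, cond

X = [(1, "S"), (1, "M"), (2, "S")]
y = [-1, 1, -1]
prior, cond = bayes_estimates(X, y, lam=1.0)

# The two properties stated above: strictly positive, rows sum to 1.
assert all(p > 0 for row in cond.values() for p in row.values())
assert all(abs(sum(row.values()) - 1) < 1e-9 for row in cond.values())
```

Even though class $1$ never co-occurs with feature value $2$, its smoothed conditional probability is $\frac{0+1}{1+2}=\frac{1}{3}>0$ rather than $0$.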
Reference: 《统计学习方法》 (Statistical Learning Methods), 李航 (Hang Li).