Parameter Estimation for the Naive Bayes Method
1. Maximum Likelihood Estimation
In the naive Bayes method, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)}\mid Y=c_k)$. Both can be estimated by maximum likelihood. The maximum likelihood estimate of the prior probability $P(Y=c_k)$ is
$$P(Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},\dots,a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl}\mid Y=c_k)$ is
$$P(X^{(j)}=a_{jl}\mid Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^N I(y_i=c_k)}$$
$$j=1,2,\dots,n;\quad l=1,2,\dots,S_j;\quad k=1,2,\dots,K$$
where $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, and $I$ is the indicator function.
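Both estimates are relative frequencies, so they can be computed by counting. The following is a minimal Python sketch of that counting, not a reference implementation. The function name `mle_estimates`, the 0-based feature index, and the four-sample data set are assumptions made for illustration only.

```python
from collections import Counter
from fractions import Fraction


def mle_estimates(samples, labels):
    """Estimate P(Y=c) and P(X^(j)=a | Y=c) by relative frequencies.

    samples: list of feature tuples; labels: list of class labels.
    Returns (prior, cond) where prior[c] is P(Y=c) and
    cond[(j, a, c)] is P(X^(j)=a | Y=c), with j a 0-based feature index.
    """
    n_total = len(labels)
    class_counts = Counter(labels)   # sum_i I(y_i = c_k)
    joint_counts = Counter()         # sum_i I(x_i^(j) = a_jl, y_i = c_k)
    for x, y in zip(samples, labels):
        for j, value in enumerate(x):
            joint_counts[(j, value, y)] += 1

    prior = {c: Fraction(n_c, n_total) for c, n_c in class_counts.items()}
    cond = {
        (j, a, c): Fraction(n_jac, class_counts[c])
        for (j, a, c), n_jac in joint_counts.items()
    }
    return prior, cond


# Hypothetical four-sample data set, for illustration only.
samples = [(1, "S"), (1, "M"), (2, "S"), (2, "M")]
labels = [-1, -1, 1, -1]
prior, cond = mle_estimates(samples, labels)
print(prior[-1])          # 3/4
print(cond[(0, 1, -1)])   # 2/3
```

A feature value that never occurs together with a class gets no entry in `cond`, which corresponds to a maximum likelihood estimate of 0.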
2. Learning and Classification Algorithm
Algorithm: naive Bayes algorithm
Input: training data $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)}\in\{a_{j1},a_{j2},\dots,a_{jS_j}\}$, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $j=1,2,\dots,n$, $l=1,2,\dots,S_j$, and $y_i\in\{c_1,c_2,\dots,c_K\}$; an instance $x$.
Output: the class of the instance $x$.
(1) Compute the prior probabilities and the conditional probabilities
$$P(Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$
$$P(X^{(j)}=a_{jl}\mid Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^N I(y_i=c_k)}$$
$$j=1,2,\dots,n;\quad l=1,2,\dots,S_j;\quad k=1,2,\dots,K$$
(2) For the given instance $x=(x^{(1)},x^{(2)},\dots,x^{(n)})^T$, compute
$$P(Y=c_k)\prod_{j=1}^n P(X^{(j)}=x^{(j)}\mid Y=c_k),\quad k=1,2,\dots,K$$
(3) Determine the class of the instance $x$
$$y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^n P(X^{(j)}=x^{(j)}\mid Y=c_k)$$
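Steps (2) and (3) can be sketched in Python as follows, assuming the probabilities from step (1) are already available as dictionaries. The function name `classify`, the dictionary layout, and the probability tables are hypothetical and chosen only for illustration.

```python
from math import prod


def classify(x, prior, cond):
    """Return (best_class, scores) for instance x.

    prior[c] is P(Y=c); cond[(j, a, c)] is P(X^(j)=a | Y=c) with a
    0-based feature index j. Missing entries are treated as probability 0.
    """
    scores = {}
    for c, p_c in prior.items():
        # Step (2): P(Y=c) * prod_j P(X^(j)=x^(j) | Y=c)
        scores[c] = p_c * prod(cond.get((j, a, c), 0) for j, a in enumerate(x))
    # Step (3): pick the class with the largest score
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical probability tables for two binary features, illustration only.
prior = {"a": 0.5, "b": 0.5}
cond = {
    (0, 0, "a"): 0.75, (0, 1, "a"): 0.25,
    (1, 0, "a"): 0.5,  (1, 1, "a"): 0.5,
    (0, 0, "b"): 0.25, (0, 1, "b"): 0.75,
    (1, 0, "b"): 0.25, (1, 1, "b"): 0.75,
}
label, scores = classify((1, 1), prior, cond)
print(label, scores)
```

If two classes tie, `max` returns the one that appears first in `prior`; the algorithm as stated does not specify a tie-breaking rule.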
Example:
Learn a naive Bayes classifier from the training data in the table below and determine the class label $y$ of $x=(2,S)^T$. In the table, $X^{(1)}$ and $X^{(2)}$ are the features, with value sets $A_1=\{1,2,3\}$ and $A_2=\{S,M,L\}$ respectively, and $Y$ is the class label, $Y\in C=\{1,-1\}$.
By the naive Bayes algorithm:
$$P(Y=1)=\frac{9}{15},\quad P(Y=-1)=\frac{6}{15}$$
$$P(X^{(1)}=1\mid Y=1)=\frac{2}{9},\quad P(X^{(1)}=2\mid Y=1)=\frac{3}{9},\quad P(X^{(1)}=3\mid Y=1)=\frac{4}{9}$$
$$P(X^{(2)}=S\mid Y=1)=\frac{1}{9},\quad P(X^{(2)}=M\mid Y=1)=\frac{4}{9},\quad P(X^{(2)}=L\mid Y=1)=\frac{4}{9}$$
$$P(X^{(1)}=1\mid Y=-1)=\frac{3}{6},\quad P(X^{(1)}=2\mid Y=-1)=\frac{2}{6},\quad P(X^{(1)}=3\mid Y=-1)=\frac{1}{6}$$
$$P(X^{(2)}=S\mid Y=-1)=\frac{3}{6},\quad P(X^{(2)}=M\mid Y=-1)=\frac{2}{6},\quad P(X^{(2)}=L\mid Y=-1)=\frac{1}{6}$$
For the given $x=(2,S)^T$, compute:
$$P(Y=1)\,P(X^{(1)}=2\mid Y=1)\,P(X^{(2)}=S\mid Y=1)=\frac{9}{15}\cdot\frac{3}{9}\cdot\frac{1}{9}=\frac{1}{45}$$
$$P(Y=-1)\,P(X^{(1)}=2\mid Y=-1)\,P(X^{(2)}=S\mid Y=-1)=\frac{6}{15}\cdot\frac{2}{6}\cdot\frac{3}{6}=\frac{1}{15}$$
Since $P(Y=-1)\,P(X^{(1)}=2\mid Y=-1)\,P(X^{(2)}=S\mid Y=-1)$ is the larger of the two values, $y=-1$.
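The arithmetic of this example can be checked with exact fractions. The short Python sketch below uses only the estimates listed above; the variable names are illustrative. Normalising the two scores is an extra step not required by the algorithm, shown here only to make the comparison concrete.

```python
from fractions import Fraction as F

# Estimates taken from the worked example above.
score_pos = F(9, 15) * F(3, 9) * F(1, 9)   # Y = 1
score_neg = F(6, 15) * F(2, 6) * F(3, 6)   # Y = -1
print(score_pos, score_neg)                # 1/45 1/15

# Normalising the two scores gives the posterior under the naive Bayes model.
total = score_pos + score_neg
print(score_pos / total, score_neg / total)  # 1/4 3/4
```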
3. Bayesian Estimation
Maximum likelihood estimation can produce a probability estimate of 0, for example when a feature value never occurs together with some class in the training data. A single zero factor makes the whole product for that class zero, which distorts the posterior probability and biases the classification. The remedy is Bayesian estimation. Specifically, the Bayesian estimate of the conditional probability is
$$P_{\lambda}(X^{(j)}=a_{jl}\mid Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\,y_i=c_k)+\lambda}{\sum_{i=1}^N I(y_i=c_k)+S_j\lambda} \tag{1}$$
where $\lambda\geq 0$. For $\lambda>0$ this is equivalent to adding a positive number $\lambda$ to the frequency count of every value of the random variable. When $\lambda=0$ it reduces to the maximum likelihood estimate. A common choice is $\lambda=1$, which is called Laplace smoothing. Clearly, for $\lambda>0$ and any $l=1,2,\dots,S_j$, $k=1,2,\dots,K$, we have
$$P_\lambda(X^{(j)}=a_{jl}\mid Y=c_k)>0$$
$$\sum_{l=1}^{S_j}P_\lambda(X^{(j)}=a_{jl}\mid Y=c_k)=1$$
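This shows that Eq. (1) is indeed a probability distribution: summing its numerator over $l$ gives $\sum_{i=1}^N I(y_i=c_k)+S_j\lambda$, which is exactly its denominator. Here $S_j$ is the number of possible values of the $j$-th feature. Similarly, the Bayesian estimate of the prior probability is
$$P_{\lambda}(Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)+\lambda}{N+K\lambda}$$
where $K$ is the number of possible values of the class label $Y$.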
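The two smoothed estimators translate directly into code. The sketch below is one possible Python rendering; the function names and the counts `[0, 2, 4]` are hypothetical. It illustrates that a zero count no longer yields a zero probability and that the smoothed estimates still sum to 1.

```python
from fractions import Fraction


def smoothed_conditional(joint_count, class_count, n_values, lam=1):
    """Eq. (1): (joint_count + lam) / (class_count + n_values * lam).

    lam must be an int or Fraction here so the result stays exact.
    """
    return Fraction(joint_count + lam, class_count + n_values * lam)


def smoothed_prior(class_count, n_samples, n_classes, lam=1):
    """Smoothed prior: (class_count + lam) / (n_samples + n_classes * lam)."""
    return Fraction(class_count + lam, n_samples + n_classes * lam)


# Hypothetical counts: a feature with 3 possible values, observed
# 0, 2 and 4 times among the 6 samples of some class.
counts = [0, 2, 4]
probs = [smoothed_conditional(c, 6, 3) for c in counts]
print(probs)        # [Fraction(1, 9), Fraction(1, 3), Fraction(5, 9)]
print(sum(probs))   # 1
```

Passing `lam=0` recovers the maximum likelihood estimates, so the value that was never observed gets probability 0 again.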
Example:
Learn a naive Bayes classifier from the training data in the table below and determine the class label $y$ of $x=(2,S)^T$. In the table, $X^{(1)}$ and $X^{(2)}$ are the features, with value sets $A_1=\{1,2,3\}$ and $A_2=\{S,M,L\}$ respectively, and $Y$ is the class label, $Y\in C=\{1,-1\}$. Estimate the probabilities with Laplace smoothing, that is, take $\lambda=1$.
Solution: by Bayesian estimation,
$$P(Y=1)=\frac{10}{17},\quad P(Y=-1)=\frac{7}{17}$$
$$P(X^{(1)}=1\mid Y=1)=\frac{3}{12},\quad P(X^{(1)}=2\mid Y=1)=\frac{4}{12},\quad P(X^{(1)}=3\mid Y=1)=\frac{5}{12}$$
$$P(X^{(2)}=S\mid Y=1)=\frac{2}{12},\quad P(X^{(2)}=M\mid Y=1)=\frac{5}{12},\quad P(X^{(2)}=L\mid Y=1)=\frac{5}{12}$$
$$P(X^{(1)}=1\mid Y=-1)=\frac{4}{9},\quad P(X^{(1)}=2\mid Y=-1)=\frac{3}{9},\quad P(X^{(1)}=3\mid Y=-1)=\frac{2}{9}$$
$$P(X^{(2)}=S\mid Y=-1)=\frac{4}{9},\quad P(X^{(2)}=M\mid Y=-1)=\frac{3}{9},\quad P(X^{(2)}=L\mid Y=-1)=\frac{2}{9}$$
For the given $x=(2,S)^T$, compute:
$$P(Y=1)\,P(X^{(1)}=2\mid Y=1)\,P(X^{(2)}=S\mid Y=1)=\frac{10}{17}\cdot\frac{4}{12}\cdot\frac{2}{12}=\frac{5}{153}\approx 0.0327$$
$$P(Y=-1)\,P(X^{(1)}=2\mid Y=-1)\,P(X^{(2)}=S\mid Y=-1)=\frac{7}{17}\cdot\frac{3}{9}\cdot\frac{4}{9}=\frac{28}{459}\approx 0.0610$$
Since $P(Y=-1)\,P(X^{(1)}=2\mid Y=-1)\,P(X^{(2)}=S\mid Y=-1)$ is the larger of the two values, $y=-1$.
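The smoothed scores can be reproduced from raw counts. In the Python sketch below, the counts are read off the numerators and denominators of the maximum likelihood example in Section 2 (9 and 6 samples per class out of 15, and so on); the variable names and dictionary layout are assumptions for illustration.

```python
from fractions import Fraction as F

lam = 1
N, K = 15, 2     # number of samples, number of classes
S1, S2 = 3, 3    # sizes of the value sets A_1 and A_2
# class -> (class count, count of X1=2 in class, count of X2=S in class)
counts = {1: (9, 3, 1), -1: (6, 2, 3)}

scores = {}
for c, (n_c, n_x1, n_x2) in counts.items():
    prior = F(n_c + lam, N + K * lam)       # smoothed P(Y=c)
    p_x1 = F(n_x1 + lam, n_c + S1 * lam)    # smoothed P(X1=2 | Y=c)
    p_x2 = F(n_x2 + lam, n_c + S2 * lam)    # smoothed P(X2=S | Y=c)
    scores[c] = prior * p_x1 * p_x2

print(scores[1], scores[-1])            # 5/153 28/459
print(max(scores, key=scores.get))      # -1
```

With $\lambda=1$ the predicted class is the same as under maximum likelihood estimation; smoothing only changes the size of the margin between the two scores.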