1. Method Principle
The input space is the set of $n$-dimensional vectors: $\mathcal{X} \subseteq \mathbb{R}^n$.
The output space is the set of class labels: $\mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$.
The training set $T = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$ is drawn i.i.d. from the joint distribution $P(X,Y)$.
The Naive Bayes method learns the joint distribution $P(X,Y)$ from the training set, specifically by learning:
- the prior probabilities $P(Y=c_k),\ k=1,2,\ldots,K$;
- the conditional probabilities $P(X=x \mid Y=c_k) = P(X^{(1)}=x^{(1)}, X^{(2)}=x^{(2)}, \ldots, X^{(n)}=x^{(n)} \mid Y=c_k),\ k=1,2,\ldots,K$, where $X^{(i)}$ is the $i$-th component of $X$.
Naive Bayes classification rests on one premise, the conditional independence assumption:
$$P(X = x \mid Y = c_k) = P(X^{(1)}=x^{(1)}, X^{(2)}=x^{(2)}, \ldots, X^{(n)}=x^{(n)} \mid Y = c_k) = \prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y=c_k)$$
So when classifying with Naive Bayes, for a given input $x$ the learned model computes the posterior probability $P(Y=c_k \mid X=x)$:
$$P(Y = c_k \mid X=x) = \frac{P(Y=c_k)\,P(X=x \mid Y=c_k)}{\sum_k P(Y=c_k)\,P(X=x \mid Y=c_k)} = \frac{P(Y=c_k)\prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k)\prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y=c_k)}$$
The class with the largest posterior probability is taken as the output for the given input, so the Naive Bayes classifier can be written as:
$$y = f(x) = \arg\max_{c_k} \frac{P(Y=c_k)\prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k)\prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y=c_k)}$$
Since the denominator above is the same for every class, this is equivalent to:
$$y = f(x) = \arg\max_{c_k} P(Y=c_k)\prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y=c_k)$$
Maximizing the posterior probability is in fact equivalent to minimizing the expected risk. Suppose we choose the 0-1 loss:
$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$
Then the expected risk is:
$$R_{\mathrm{exp}}(f) = E[L(Y,f(X))] = \sum_x \sum_{c_k} L(c_k, f(x))\,P(x, c_k) = \sum_x \sum_{c_k} L(c_k, f(x))\,P(x)\,P(c_k \mid x) = \sum_x P(x) \sum_{c_k} L(c_k, f(x))\,P(c_k \mid x)$$
So minimizing the expected risk reduces to minimizing $\sum_{c_k} L(c_k, f(x))\,P(c_k \mid x)$ for each $x$ individually, that is:
$$f(x) = \arg\min_{y} \sum_{k=1}^K L(c_k, y)\,P(c_k \mid x) = \arg\min_{y} \sum_{c_k \neq y} P(c_k \mid x) = \arg\min_{y} \bigl(1 - P(y \mid x)\bigr) = \arg\max_{y} P(y \mid x)$$
which is exactly maximization of the posterior probability.
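The decision rule derived above can be sketched in a few lines. The priors and conditional probabilities below are hypothetical numbers chosen only for illustration; in practice they come from the parameter estimation described in the next section.

```python
# Sketch of the Naive Bayes decision rule:
#   y = argmax_k  P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k)
def nb_decide(prior, conditional, x):
    """prior: {class: P(Y=c_k)};
    conditional: {class: list over features j of {value: P(X^(j)=value | Y=c_k)}}."""
    def score(c):
        s = prior[c]
        for j, v in enumerate(x):
            s *= conditional[c][j][v]  # multiply in each feature's class-conditional probability
        return s
    return max(prior, key=score)

# Hypothetical two-class, two-feature model.
prior = {1: 0.6, -1: 0.4}
conditional = {
    1:  [{"a": 0.3, "b": 0.7}, {"s": 0.2, "m": 0.8}],
    -1: [{"a": 0.5, "b": 0.5}, {"s": 0.6, "m": 0.4}],
}
print(nb_decide(prior, conditional, ("a", "s")))  # -1  (0.4*0.5*0.6 beats 0.6*0.3*0.2)
```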
2. Parameter Estimation
For the Naive Bayes method we only need to estimate $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)} \mid Y=c_k)$ in order to classify; both can be estimated by maximum likelihood.
The maximum likelihood estimate of the prior probability is:
$$P(Y=c_k) = \frac{\sum_{i=1}^N I(y_i = c_k)}{N}, \quad k = 1,2,\ldots,K$$
that is, the fraction of all samples that belong to class $c_k$.
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability is:
$$P(X^{(j)} = a_{jl} \mid Y=c_k) = \frac{\sum_{i=1}^N I(X_i^{(j)} = a_{jl},\ y_i = c_k)}{\sum_{i=1}^N I(y_i=c_k)}, \quad j = 1,2,\ldots,n;\ l = 1,2,\ldots,S_j;\ k = 1,2,\ldots,K$$
This follows from the definition of conditional probability, with each probability estimated by its empirical frequency.
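The two maximum likelihood estimates above are just counting. A minimal sketch (function name and toy data are my own, for illustration only):

```python
from collections import Counter

def mle_estimates(X, y):
    """P(Y=c_k) as a class frequency; P(X^(j)=a_jl | Y=c_k) as a within-class frequency."""
    N = len(y)
    class_count = Counter(y)
    prior = {c: n / N for c, n in class_count.items()}
    cond = {}  # cond[(j, a, c)] = P(X^(j)=a | Y=c)
    for j in range(len(X[0])):
        joint = Counter((xi[j], yi) for xi, yi in zip(X, y))
        for (a, c), n in joint.items():
            cond[(j, a, c)] = n / class_count[c]
    return prior, cond

# Tiny hypothetical dataset: two features, two classes.
X = [(1, "S"), (1, "M"), (2, "S"), (2, "M")]
y = [-1, 1, -1, 1]
prior, cond = mle_estimates(X, y)
print(prior[-1])           # 0.5   (2 of 4 samples)
print(cond[(1, "S", -1)])  # 1.0   (both class -1 samples have X^(2) = S)
```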
3. Learning and Classification Algorithm
算法4.1
Input: training data $T = \{(x_i, y_i),\ i=1,2,\ldots,N\}$, where $x_i = (x_i^{(1)}, \ldots, x_i^{(n)})^T$ and $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample; $x_i^{(j)}$ has $S_j$ possible values, i.e. $x_i^{(j)} \in \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$, and $y_i \in \{c_1, c_2, \ldots, c_K\}$; an instance $x$.
Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities:
$$P(Y=c_k) = \frac{\sum_{i=1}^N I(y_i = c_k)}{N}, \quad k = 1,2,\ldots,K$$
$$P(X^{(j)} = a_{jl} \mid Y=c_k) = \frac{\sum_{i=1}^N I(X_i^{(j)} = a_{jl},\ y_i = c_k)}{\sum_{i=1}^N I(y_i=c_k)}, \quad j = 1,2,\ldots,n;\ l = 1,2,\ldots,S_j;\ k = 1,2,\ldots,K$$
(2) For the given instance $x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})^T$, compute
$$P(Y = c_k)\prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y=c_k), \quad k = 1,2,\ldots,K$$
(3) Determine the class of instance $x$:
$$y = \arg\max_{c_k} P(Y = c_k)\prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y=c_k)$$
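Algorithm 4.1 can be sketched end to end as a small class; the class and method names are my own choice, not from the text. `fit` performs step (1) by counting, and `predict` performs steps (2) and (3):

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Sketch of Algorithm 4.1 with maximum likelihood estimates (no smoothing)."""

    def fit(self, X, y):
        # Step (1): prior P(Y=c_k) and conditionals P(X^(j)=a_jl | Y=c_k) as frequencies.
        N = len(y)
        self.class_count = Counter(y)
        self.prior = {c: n / N for c, n in self.class_count.items()}
        self.cond = defaultdict(float)  # (j, value, class) -> probability; unseen pairs stay 0
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.cond[(j, v, yi)] += 1 / self.class_count[yi]
        return self

    def predict(self, x):
        # Steps (2)-(3): score every class and return the argmax.
        def score(c):
            s = self.prior[c]
            for j, v in enumerate(x):
                s *= self.cond[(j, v, c)]
            return s
        return max(self.prior, key=score)

# Hypothetical three-sample demo.
clf = NaiveBayes().fit([(1, "S"), (1, "M"), (2, "M")], [-1, -1, 1])
print(clf.predict((2, "M")))  # 1  (class -1 has never seen X^(1)=2, so its score is 0)
```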
4. Example 4.1
Learn a Naive Bayes classifier from the training data below and determine the class label $y$ of $x = (2, S)^T$.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $X^{(1)}$ | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |
| $X^{(2)}$ | S | M | M | S | S | S | M | M | L | L | L | M | M | L | L |
| $Y$ | -1 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 |
(1) Estimate the probabilities:
$$P(Y=1) = \frac{9}{15}; \quad P(Y=-1) = \frac{6}{15}$$
$$P(X^{(1)}=1 \mid Y=1) = \tfrac{2}{9};\quad P(X^{(1)}=2 \mid Y=1) = \tfrac{3}{9};\quad P(X^{(1)}=3 \mid Y=1) = \tfrac{4}{9}\\ P(X^{(1)}=1 \mid Y=-1) = \tfrac{3}{6};\quad P(X^{(1)}=2 \mid Y=-1) = \tfrac{2}{6};\quad P(X^{(1)}=3 \mid Y=-1) = \tfrac{1}{6}\\ P(X^{(2)}=S \mid Y=1) = \tfrac{1}{9};\quad P(X^{(2)}=M \mid Y=1) = \tfrac{4}{9};\quad P(X^{(2)}=L \mid Y=1) = \tfrac{4}{9}\\ P(X^{(2)}=S \mid Y=-1) = \tfrac{3}{6};\quad P(X^{(2)}=M \mid Y=-1) = \tfrac{2}{6};\quad P(X^{(2)}=L \mid Y=-1) = \tfrac{1}{6}$$
(2) For the instance $x = (2, S)^T$,
$$P(Y=1)\,P(X^{(1)}=2 \mid Y=1)\,P(X^{(2)}=S \mid Y=1) = \frac{9}{15}\cdot\frac{3}{9}\cdot\frac{1}{9} = \frac{1}{45}\\ P(Y=-1)\,P(X^{(1)}=2 \mid Y=-1)\,P(X^{(2)}=S \mid Y=-1) = \frac{6}{15}\cdot\frac{2}{6}\cdot\frac{3}{6} = \frac{1}{15}$$
(3) Since $\frac{1}{15} > \frac{1}{45}$, the classification rule assigns instance $x$ the class label $y = -1$.
5. Bayesian Estimation
Maximum likelihood estimation may produce probability estimates equal to 0, which distorts the computed posterior probabilities and biases the classification. The remedy is Bayesian estimation: add a positive number $\lambda$ to each count before computing the frequencies:
$$P_\lambda(Y = c_k) = \frac{\sum_{i=1}^N I(y_i = c_k) + \lambda}{N + K\lambda}$$
$$P_\lambda(X^{(j)} = a_{jl} \mid Y=c_k) = \frac{\sum_{i=1}^N I(X^{(j)}_i = a_{jl},\ y_i=c_k) + \lambda}{\sum_{i=1}^N I(y_i = c_k) + S_j \lambda}$$
A common choice is $\lambda = 1$, which is known as Laplace smoothing.
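The smoothed estimates can be sketched on Example 4.1's data ($K=2$ classes; feature $X^{(1)}$ has $S_1 = 3$ values). The function names are my own; note that with $\lambda = 1$ no estimate can be exactly 0:

```python
from fractions import Fraction

lam = 1  # Laplace smoothing

# The training data of Example 4.1.
X1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
X2 = ["S", "M", "M", "S", "S", "S", "M", "M", "L", "L", "L", "M", "M", "L", "L"]
Y  = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]

def smoothed_prior(c, K=2):
    """P_lambda(Y=c) = (count(c) + lambda) / (N + K*lambda)."""
    return Fraction(sum(1 for yi in Y if yi == c) + lam, len(Y) + K * lam)

def smoothed_cond(feature, value, c, S_j):
    """P_lambda(X^(j)=value | Y=c) = (joint count + lambda) / (class count + S_j*lambda)."""
    hit = sum(1 for f, yi in zip(feature, Y) if f == value and yi == c)
    n_c = sum(1 for yi in Y if yi == c)
    return Fraction(hit + lam, n_c + S_j * lam)

print(smoothed_prior(1))           # 10/17  instead of 9/15
print(smoothed_cond(X1, 2, 1, 3))  # 1/3    i.e. (3+1)/(9+3), instead of 3/9
```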