李航《统计学习方法》 Chapter 4: Naive Bayes

1. Principle of the Method

The input space is a set of $n$-dimensional vectors: $\mathcal{X} \subseteq \mathbb{R}^n$.

The output space is the set of class labels: $\mathcal{Y} = \{c_1, c_2, \dots, c_K\}$.

The training data set $T = \{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$ is generated i.i.d. from the joint distribution $P(X,Y)$.

Naive Bayes learns the joint distribution $P(X,Y)$ from the training data, specifically by learning:

- the prior probability $P(Y=c_k),\ k = 1,2,\dots,K$;

- the conditional probability $P(X=x \mid Y=c_k) = P(X^{(1)}=x^{(1)}, X^{(2)}=x^{(2)}, \dots, X^{(n)}=x^{(n)} \mid Y=c_k),\ k = 1,2,\dots,K$, where $X^{(i)}$ denotes the $i$-th component of $X$.

Naive Bayes classification rests on one key premise, the conditional independence assumption:
$$P(X=x \mid Y=c_k) = P(X^{(1)}=x^{(1)}, X^{(2)}=x^{(2)}, \dots, X^{(n)}=x^{(n)} \mid Y=c_k) = \prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
Thus, when classifying a given input $x$ with Naive Bayes, the learned model is used to compute the posterior probability $P(Y=c_k \mid X=x)$:
$$P(Y=c_k \mid X=x) = \frac{P(Y=c_k)\,P(X=x \mid Y=c_k)}{\sum_k P(Y=c_k)\,P(X=x \mid Y=c_k)} = \frac{P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)}$$
The class with the largest posterior probability is taken as the output, so the Naive Bayes classifier can be written as:
$$y = f(x) = \arg\max_{c_k} \frac{P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)}$$
Since the denominator above is the same for every class, this is equivalent to:
$$y = f(x) = \arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)$$

Maximizing the posterior probability is in fact equivalent to minimizing the expected risk:

Suppose the 0-1 loss is chosen as the loss function:
$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$

Then the expected risk can be written as:
$$\begin{aligned} R_{\mathrm{exp}}(f) &= E[L(Y, f(X))] \\ &= \sum_x \sum_{c_k} L(c_k, f(x))\, P(x, c_k) \\ &= \sum_x \sum_{c_k} L(c_k, f(x))\, P(x)\, P(c_k \mid x) \\ &= \sum_x P(x) \sum_{c_k} L(c_k, f(x))\, P(c_k \mid x) \end{aligned}$$
Since $P(x) \ge 0$, minimizing the expected risk is equivalent to minimizing $\sum_{k} L(c_k, f(x)) P(c_k \mid x)$ pointwise for each $x$, that is:
$$\begin{aligned} f(x) &= \arg\min_{y} \sum_{k=1}^{K} L(c_k, y)\, P(c_k \mid x) \\ &= \arg\min_{y} \sum_{c_k \neq y} P(c_k \mid x) \\ &= \arg\min_{y} \bigl(1 - P(Y = y \mid x)\bigr) \\ &= \arg\max_{y} P(Y = y \mid x) \end{aligned}$$
which is exactly maximization of the posterior probability.

2. Parameter Estimation

For the Naive Bayes method, we only need to estimate $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)} \mid Y=c_k)$ in order to classify; both can be estimated by maximum likelihood.

The maximum likelihood estimate of the prior probability is:
$$P(Y=c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1,2,\dots,K$$
that is, the proportion of training samples that belong to class $c_k$.

Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1}, a_{j2}, \dots, a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability is:
$$P(X^{(j)} = a_{jl} \mid Y=c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}, \quad j = 1,2,\dots,n; \quad k = 1,2,\dots,K$$

This follows from the definition of conditional probability, with each probability estimated by the corresponding sample proportion.
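Both estimates are plain frequency counts over the training set. A minimal counting sketch in Python (the toy data and variable names below are my own, not from the book):

```python
from collections import Counter

# Hypothetical toy data: each sample is (feature tuple, label).
samples = [((1, 'S'), -1), ((1, 'M'), -1), ((1, 'M'), 1), ((2, 'S'), -1)]

N = len(samples)
class_count = Counter(y for _, y in samples)

# Prior: P(Y = c_k) = #{y_i = c_k} / N
prior = {c: cnt / N for c, cnt in class_count.items()}

# Conditional: P(X^(j) = a_jl | Y = c_k) = #{x_i^(j) = a_jl, y_i = c_k} / #{y_i = c_k}
joint_count = Counter()
for x, y in samples:
    for j, v in enumerate(x):
        joint_count[(j, v, y)] += 1
conditional = {(j, v, c): cnt / class_count[c] for (j, v, c), cnt in joint_count.items()}
```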

3. Learning and Classification Algorithm

Algorithm 4.1 (Naive Bayes)

Input: training data $T = \{(x_i, y_i),\ i = 1,2,\dots,N\}$, where $x_i = (x_i^{(1)}, \dots, x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)}$ takes one of $S_j$ possible values, i.e. $x_i^{(j)} \in \{a_{j1}, a_{j2}, \dots, a_{jS_j}\}$, and $y_i \in \{c_1, c_2, \dots, c_K\}$; and an instance $x$.

Output: the class of instance $x$.

(1) Compute the prior and conditional probabilities:
$$P(Y=c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1,2,\dots,K$$

$$P(X^{(j)} = a_{jl} \mid Y=c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}, \quad j = 1,\dots,n; \quad l = 1,\dots,S_j; \quad k = 1,\dots,K$$

(2) For the given instance $x = (x^{(1)}, x^{(2)}, \dots, x^{(n)})^T$, compute
$$P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y=c_k)$$
(3) Determine the class of instance $x$:
$$y = \arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y=c_k)$$
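The three steps above map directly onto a small estimator/classifier. Below is a minimal sketch of Algorithm 4.1 for discrete features (class and method names are my own, not from the book); `lam=0` gives the maximum likelihood estimates used here, while a positive `lam` gives the Bayesian (smoothed) estimates discussed in Section 5.

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal sketch of Algorithm 4.1 for discrete features."""

    def __init__(self, lam=0.0):
        self.lam = lam  # 0 -> maximum likelihood, >0 -> Bayesian estimation

    def fit(self, X, y):
        self.N = len(y)
        self.classes = sorted(set(y))
        self.n_features = len(X[0])
        # Possible values of each feature, read off the training data
        # (stands in for the sets {a_j1, ..., a_jSj}).
        self.values = [sorted({xi[j] for xi in X}) for j in range(self.n_features)]
        self.class_count = Counter(y)            # counts of I(y_i = c_k)
        self.joint_count = Counter()              # counts of I(x_i^(j) = a_jl, y_i = c_k)
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.joint_count[(j, v, yi)] += 1
        return self

    def prior(self, c):
        # P(Y = c_k), optionally smoothed by lam
        K = len(self.classes)
        return (self.class_count[c] + self.lam) / (self.N + K * self.lam)

    def conditional(self, j, v, c):
        # P(X^(j) = v | Y = c_k), optionally smoothed by lam
        Sj = len(self.values[j])
        return (self.joint_count[(j, v, c)] + self.lam) / (self.class_count[c] + Sj * self.lam)

    def scores(self, x):
        # Step (2): P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k) for every class
        return {c: self.prior(c) * math.prod(self.conditional(j, v, c)
                                             for j, v in enumerate(x))
                for c in self.classes}

    def predict(self, x):
        # Step (3): return the class with the largest score
        s = self.scores(x)
        return max(s, key=s.get)
```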

4. Example 4.1

Learn a Naive Bayes classifier from the training data below and determine the class label $y$ of $x = (2, S)^T$.

Table 4.1 Training data

| $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $X^{(1)}$ | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |
| $X^{(2)}$ | S | M | M | S | S | S | M | M | L | L | L | M | M | L | L |
| $Y$ | -1 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 |

(1)
$$P(Y=1) = \frac{9}{15}, \quad P(Y=-1) = \frac{6}{15}$$

$$\begin{aligned} &P(X^{(1)}=1 \mid Y=1) = \tfrac{2}{9}, \quad P(X^{(1)}=2 \mid Y=1) = \tfrac{3}{9}, \quad P(X^{(1)}=3 \mid Y=1) = \tfrac{4}{9} \\ &P(X^{(1)}=1 \mid Y=-1) = \tfrac{3}{6}, \quad P(X^{(1)}=2 \mid Y=-1) = \tfrac{2}{6}, \quad P(X^{(1)}=3 \mid Y=-1) = \tfrac{1}{6} \\ &P(X^{(2)}=S \mid Y=1) = \tfrac{1}{9}, \quad P(X^{(2)}=M \mid Y=1) = \tfrac{4}{9}, \quad P(X^{(2)}=L \mid Y=1) = \tfrac{4}{9} \\ &P(X^{(2)}=S \mid Y=-1) = \tfrac{3}{6}, \quad P(X^{(2)}=M \mid Y=-1) = \tfrac{2}{6}, \quad P(X^{(2)}=L \mid Y=-1) = \tfrac{1}{6} \end{aligned}$$

(2) For the instance $x = (2, S)^T$:
$$\begin{aligned} P(Y=1)\,P(X^{(1)}=2 \mid Y=1)\,P(X^{(2)}=S \mid Y=1) &= \frac{9}{15}\cdot\frac{3}{9}\cdot\frac{1}{9} = \frac{1}{45} \\ P(Y=-1)\,P(X^{(1)}=2 \mid Y=-1)\,P(X^{(2)}=S \mid Y=-1) &= \frac{6}{15}\cdot\frac{2}{6}\cdot\frac{3}{6} = \frac{1}{15} \end{aligned}$$
(3) By the classification rule, since $\frac{1}{15} > \frac{1}{45}$, the class label of instance $x$ is $y = -1$.
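As a sanity check, the sketch from Section 3 reproduces these numbers on the Table 4.1 data (again, an illustrative script rather than code from the book):

```python
# Data from Table 4.1: X^(1) in {1,2,3}, X^(2) in {S,M,L}, Y in {1,-1}.
X = [(1, 'S'), (1, 'M'), (1, 'M'), (1, 'S'), (1, 'S'),
     (2, 'S'), (2, 'M'), (2, 'M'), (2, 'L'), (2, 'L'),
     (3, 'L'), (3, 'M'), (3, 'M'), (3, 'L'), (3, 'L')]
y = [-1, -1, 1, 1, -1,  -1, -1, 1, 1, 1,  1, 1, 1, 1, -1]

clf = NaiveBayes(lam=0.0).fit(X, y)
print(clf.scores((2, 'S')))   # {-1: 1/15 ≈ 0.0667, 1: 1/45 ≈ 0.0222}
print(clf.predict((2, 'S')))  # -1
```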

5. Bayesian Estimation

Maximum likelihood estimation may produce probability estimates equal to 0, which distorts the computation of the posterior probabilities and biases the classification. The remedy is Bayesian estimation: add a positive constant $\lambda$ to the count of each value of the random variable before computing the probabilities:
$$P_\lambda(Y=c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K\lambda}$$

$$P_\lambda(X^{(j)} = a_{jl} \mid Y=c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k) + \lambda}{\sum_{i=1}^{N} I(y_i = c_k) + S_j\lambda}$$

A common choice is $\lambda = 1$, which is known as Laplace smoothing.
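In the sketch from Section 3 this corresponds to passing `lam=1.0`, which adds $\lambda$ to every count exactly as in the two formulas above; on the Table 4.1 data the prediction for $x = (2, S)^T$ is still $-1$:

```python
# Same Table 4.1 data as in the example above, now with lambda = 1 (Laplace smoothing).
clf = NaiveBayes(lam=1.0).fit(X, y)
print(clf.scores((2, 'S')))   # {-1: 7/17 * 3/9 * 4/9 ≈ 0.0610, 1: 10/17 * 4/12 * 2/12 ≈ 0.0327}
print(clf.predict((2, 'S')))  # still -1
```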
