Naive Bayes, the Meaning of Posterior Probability Maximization, and Parameter Estimation



Naive Bayes


Output values: $y \in \{c_1, c_2, \dots, c_K\}$

Input values: assume the $j$-th feature $x^{(j)}$ can take $S_j$ distinct values, where $j = 1, 2, \dots, n$.

The conditional independence assumption:

$$P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)}, \dots, X^{(n)} = x^{(n)} \mid Y = c_k) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k) \quad (1.1)$$

This strong assumption is what gives naive Bayes its name.

By Bayes' theorem, the posterior probability is:

$$P(Y = c_k \mid X = x) = \frac{P(X = x \mid Y = c_k) \cdot P(Y = c_k)}{\sum_{k} P(X = x \mid Y = c_k) \cdot P(Y = c_k)} \quad (1.2)$$

Substituting (1.1) into (1.2) gives:

$$P(Y = c_k \mid X = x) = \frac{P(Y = c_k) \cdot \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_{k} P(Y = c_k) \cdot \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)} \quad (1.3)$$

This is the basic formula of naive Bayes classification. The naive Bayes classifier can therefore be written as:

$$y = f(x) = \arg\max_{c_k} \frac{P(Y = c_k) \cdot \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_{k} P(Y = c_k) \cdot \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)} \quad (1.4)$$

Since the denominator is the same for every $c_k$, this simplifies to:

$$y = f(x) = \arg\max_{c_k} P(Y = c_k) \cdot \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k) \quad (1.5)$$
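The decision rule (1.5) translates directly into code. Below is a minimal Python sketch with hypothetical parameters (two classes, two binary features; all probability values are made up for illustration), using sums of log probabilities rather than products so the score does not underflow when there are many features:

```python
import math

# Hypothetical parameters for illustration (not from the text):
# two classes "c1"/"c2", two binary features.
priors = {"c1": 0.6, "c2": 0.4}
# cond[c][j][v] = P(X^(j) = v | Y = c)
cond = {
    "c1": [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}],
    "c2": [{0: 0.4, 1: 0.6}, {0: 0.5, 1: 0.5}],
}

def classify(x):
    """arg max_{c_k} P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k), as in (1.5).

    Works in log space: the arg max is unchanged because log is monotone.
    """
    best_c, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior) + sum(math.log(cond[c][j][v]) for j, v in enumerate(x))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(classify((1, 0)))  # "c2": 0.4*0.6*0.5 = 0.12 beats 0.6*0.3*0.2 = 0.036
```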

The Meaning of Posterior Probability Maximization


To minimize the expected risk, choose the 0-1 loss function:

$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases} \quad (2.1)$$

Here $f(X)$ is the classification decision function.

The expected risk is:

$$R_{\mathrm{exp}}(f) = E[L(Y, f(X))] \quad (2.2)$$

The expectation above is taken over the joint distribution $P(X, Y)$. Conditioning on $X$ gives:

$$\begin{aligned} R_{\mathrm{exp}}(f) &= E[L(Y, f(X))] \\ &= \sum_{x} \sum_{k} L(c_k, f(x)) \cdot P(Y = c_k, X = x) \\ &= \sum_{x} \sum_{k} L(c_k, f(x)) \cdot P(Y = c_k \mid X = x) \cdot P(X = x) \\ &= \sum_{x} P(X = x) \sum_{k} L(c_k, f(x)) \cdot P(Y = c_k \mid X = x) \\ &= E_X \Big[ \sum_{k} L(c_k, f(x)) \cdot P(Y = c_k \mid X = x) \Big] \end{aligned} \quad (2.3)$$

To minimize the expected risk, it suffices to minimize the inner sum pointwise for each $X = x$. Writing $y$ for the predicted class, and noting that under 0-1 loss $\sum_k L(c_k, y) \cdot P(Y = c_k \mid X = x) = \sum_{c_k \neq y} P(Y = c_k \mid X = x)$:

$$\begin{aligned} f(x) &= \arg\min_{y} \sum_{k} L(c_k, y) \cdot P(Y = c_k \mid X = x) \\ &= \arg\min_{y} P(Y \neq y \mid X = x) \\ &= \arg\min_{y} \big( 1 - P(Y = y \mid X = x) \big) \\ &= \arg\max_{c_k} P(Y = c_k \mid X = x) \end{aligned} \quad (2.4)$$
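A tiny numeric check of (2.4): at a fixed $x$, the class that minimizes the expected 0-1 loss is exactly the class that maximizes the posterior. A sketch with hypothetical posterior values:

```python
# Hypothetical posteriors P(Y = c_k | X = x) at a fixed x, for illustration.
posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}

# Predicting class c incurs expected 0-1 loss 1 - P(Y = c | X = x), per (2.4).
risk = {c: 1.0 - p for c, p in posterior.items()}

# Minimizing the risk and maximizing the posterior pick the same class.
assert min(risk, key=risk.get) == max(posterior, key=posterior.get)
print(min(risk, key=risk.get))  # "c2"
```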
Thus the expected-risk-minimization criterion yields the posterior-probability-maximization criterion, which is exactly the principle naive Bayes adopts.

In summary, maximizing the posterior probability is equivalent to minimizing the expected risk under 0-1 loss.


Parameter Estimation


In naive Bayes, learning means estimating the prior probabilities $P(Y = c_k)$ and the conditional probabilities $P(X^{(j)} = x^{(j)} \mid Y = c_k)$.

Maximum Likelihood Estimation

The maximum likelihood estimate of the prior $P(Y = c_k)$ is:

$$P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \dots, K \quad (3.1)$$

Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1}, a_{j2}, \dots, a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)} = a_{jl} \mid Y = c_k)$ is:

$$P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\ y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}, \quad j = 1, \dots, n;\ l = 1, \dots, S_j;\ k = 1, \dots, K \quad (3.2)$$
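The counts in (3.1) and (3.2) are easy to compute from a training set. Below is a minimal sketch with a hypothetical dataset (the six rows are made up for illustration), using exact fractions so the estimates match the count ratios literally:

```python
from fractions import Fraction

# Hypothetical training set of (x^(1), x^(2), y) rows, for illustration.
data = [
    (1, "S", -1), (1, "M", -1), (1, "M", 1),
    (2, "S", -1), (2, "S", 1), (2, "L", 1),
]
N = len(data)

def prior_mle(c):
    """Eq. (3.1): P(Y = c) = sum_i I(y_i = c) / N."""
    return Fraction(sum(1 for *_, y in data if y == c), N)

def cond_mle(j, a, c):
    """Eq. (3.2): ratio of joint counts to class counts."""
    n_c = sum(1 for *_, y in data if y == c)
    n_jc = sum(1 for row in data if row[j] == a and row[-1] == c)
    return Fraction(n_jc, n_c)

print(prior_mle(1))        # 1/2 (3 of the 6 rows have y = 1)
print(cond_mle(0, 1, -1))  # 2/3 (2 of the 3 rows with y = -1 have x^(1) = 1)
```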

Bayesian Estimation

The Bayesian estimate of the prior probability is:

$$P_\lambda(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K\lambda} \quad (3.3)$$

where the denominator adds $K\lambda$, one $\lambda$ per class.

The Bayesian estimate of the conditional probability is:

$$P_\lambda(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\ y_i = c_k) + \lambda}{\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda} \quad (3.4)$$

where $\lambda \geq 0$.

When $\lambda = 0$ this reduces to the maximum likelihood estimate; the common choice $\lambda = 1$ is known as Laplace smoothing.

Clearly:

$$P_\lambda(X^{(j)} = a_{jl} \mid Y = c_k) > 0, \quad j = 1, \dots, n;\ k = 1, \dots, K;\ l = 1, \dots, S_j$$

$$\sum_{l=1}^{S_j} P_\lambda(X^{(j)} = a_{jl} \mid Y = c_k) = 1, \quad j = 1, \dots, n;\ k = 1, \dots, K$$

so (3.4) is indeed a valid probability distribution for each feature.
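The effect of (3.4) can be sketched with hypothetical counts: with $\lambda = 1$, a feature value never observed with a class still gets positive probability, and the smoothed estimates over the $S_j$ values still sum to 1. (The class count and value counts below are made up for illustration.)

```python
from fractions import Fraction

# Hypothetical counts for illustration: one feature with S_j = 3 values.
n_c = 5                                # sum_i I(y_i = c_k)
counts = {"a1": 3, "a2": 2, "a3": 0}   # sum_i I(x_i^(j) = a_jl, y_i = c_k)
S_j = len(counts)

def cond_bayes(a, lam=1):
    """Eq. (3.4); lam = 1 is Laplace smoothing, lam = 0 recovers the MLE."""
    return Fraction(counts[a] + lam, n_c + S_j * lam)

print(cond_bayes("a3"))  # 1/8 instead of the MLE's 0
# The smoothed estimates still form a probability distribution:
assert sum(cond_bayes(a) for a in counts) == 1
```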

References

Li Hang, Statistical Learning Methods (统计学习方法), 1st edition, Chapter 4
