Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent given the class.
- For a given training set, it first learns the joint probability distribution of input and output under the conditional-independence assumption.
- Then, for a given input x, it uses Bayes' theorem to output the class y with the largest posterior probability.
Learning and Classification in Naive Bayes
Basic method
The joint distribution P(X, Y) is learned from the training set. Concretely, we learn the prior distribution and the class-conditional distribution:
- Prior distribution: $P(Y=c_k),\ k=1,2,\dots,K$
- Conditional distribution: $P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k),\ k=1,2,\dots,K$
Together these give the joint distribution P(X, Y).
However, the conditional distribution has an exponential number of parameters, so estimating it directly is infeasible in practice.
Naive Bayes therefore makes the conditional-independence assumption on the conditional distribution:
- $P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$
The assumption says that the features used for classification are conditionally independent once the class is fixed.
At classification time, for a given input x, the learned model computes the posterior distribution $P(Y=c_k \mid X=x)$ and outputs the class with the largest posterior probability as the class of x.
- Posterior probability: $P(Y=c_k \mid X=x)=\dfrac{P(X=x \mid Y=c_k)P(Y=c_k)}{\sum_{k}P(X=x \mid Y=c_k)P(Y=c_k)}$
Substituting the conditional-independence assumption: $P(Y=c_k \mid X=x)=\dfrac{P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_{k}P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)},\ k=1,2,\dots,K$
The naive Bayes classifier can therefore be written as:
- $y=f(x)=\arg\max_{c_k}\dfrac{P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_{k}P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)}$
The denominator is the same for every $c_k$, so
- $y=\arg\max_{c_k} P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)$
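Dropping the shared denominator can be checked numerically. A minimal sketch with made-up probability values (the numbers below are hypothetical, not from the text):

```python
# Hypothetical priors P(Y=c_k) and likelihood products prod_j P(X^(j)=x^(j)|Y=c_k)
priors = {"c1": 0.6, "c2": 0.4}
likelihoods = {"c1": 0.02, "c2": 0.09}

# Numerator only: P(Y=c_k) * prod_j P(X^(j)=x^(j)|Y=c_k)
joint = {c: priors[c] * likelihoods[c] for c in priors}
# Full posterior: divide by the evidence, which is the same for every class
evidence = sum(joint.values())
posterior = {c: joint[c] / evidence for c in priors}

# Both criteria pick the same class, since the denominator is constant in c_k.
assert max(joint, key=joint.get) == max(posterior, key=posterior.get)
```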
Meaning of posterior maximization
Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk under the 0-1 loss.
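The equivalence follows from choosing the 0-1 loss $L(Y,f(X))=I(Y\ne f(X))$. A short derivation:

```latex
\begin{align*}
R_{\mathrm{exp}}(f) &= E\,[L(Y, f(X))]
  = E_X \sum_{k=1}^{K} L(c_k, f(X))\, P(c_k \mid X) \\
\intertext{Minimizing pointwise for each $x$:}
f(x) &= \arg\min_{y} \sum_{k=1}^{K} I(y \ne c_k)\, P(c_k \mid X = x) \\
     &= \arg\min_{y} \bigl(1 - P(Y = y \mid X = x)\bigr)
      = \arg\max_{y} P(Y = y \mid X = x)
\end{align*}
```

So minimizing the expected risk pointwise is exactly maximizing the posterior probability.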
Parameter Estimation in Naive Bayes
Maximum likelihood estimation
The maximum likelihood estimate of the prior $P(Y=c_k)$ is:
- $P(Y=c_k)=\dfrac{\sum_{i=1}^{N}I(y_i=c_k)}{N},\ k=1,2,\dots,K$
Let the set of possible values of the j-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},\dots,a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl} \mid Y=c_k)$ is:
- $P(X^{(j)}=a_{jl} \mid Y=c_k)=\dfrac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^{N}I(y_i=c_k)}$
- $j=1,2,\dots,n;\ l=1,2,\dots,S_j;\ k=1,2,\dots,K$
Learning and classification algorithm
- Input: training data $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(n)})^T$, $x_i^{(j)}$ is the j-th feature of the i-th sample, $x_i^{(j)} \in \{a_{j1},a_{j2},\dots,a_{jS_j}\}$, $a_{jl}$ is the l-th possible value of the j-th feature, $j=1,2,\dots,n$, $l=1,2,\dots,S_j$, $y_i \in \{c_1,c_2,\dots,c_K\}$; an instance x
- Output: the class of instance x
Procedure:
1. Compute the prior and conditional probabilities:
- $P(Y=c_k)=\dfrac{\sum_{i=1}^{N}I(y_i=c_k)}{N},\ k=1,2,\dots,K$
- $P(X^{(j)}=a_{jl} \mid Y=c_k)=\dfrac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^{N}I(y_i=c_k)}$
$j=1,2,\dots,n;\ l=1,2,\dots,S_j;\ k=1,2,\dots,K$
2. For a given instance $x=(x^{(1)},x^{(2)},\dots,x^{(n)})^T$, compute
- $P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k),\ k=1,2,\dots,K$
3. Determine the class of x:
- $y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$
Example:
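The worked example itself is not reproduced here, but the two-step algorithm above can be sketched as a count-based implementation. The toy dataset below (two features, labels in {-1, +1}) is hypothetical, chosen only to exercise the code:

```python
from collections import Counter

def fit(X, y):
    """Step 1: MLE of priors P(Y=c_k) and conditionals P(X^(j)=a_jl | Y=c_k)."""
    N = len(y)
    class_counts = Counter(y)
    priors = {c: class_counts[c] / N for c in class_counts}
    # cond[(j, a, c)] = number of samples whose j-th feature equals a with label c
    cond = Counter()
    for xi, yi in zip(X, y):
        for j, a in enumerate(xi):
            cond[(j, a, yi)] += 1
    conditionals = {k: v / class_counts[k[2]] for k, v in cond.items()}
    return priors, conditionals

def predict(x, priors, conditionals):
    """Steps 2-3: return argmax_c P(Y=c) * prod_j P(X^(j)=x^(j) | Y=c)."""
    scores = {}
    for c, p in priors.items():
        s = p
        for j, a in enumerate(x):
            s *= conditionals.get((j, a, c), 0.0)  # unseen value -> probability 0
        scores[c] = s
    return max(scores, key=scores.get)

# Hypothetical toy data for illustration only.
X = [(1, "S"), (1, "M"), (1, "M"), (2, "S"), (2, "L"), (2, "L")]
y = [-1, -1, 1, -1, 1, 1]
priors, conditionals = fit(X, y)
print(predict((2, "L"), priors, conditionals))  # -> 1
```

Note that an unseen feature value drives the whole product to zero; this is exactly the weakness the next section addresses.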
Bayesian estimation
Maximum likelihood estimation may yield probability estimates of zero, which distorts the posterior computation and biases the classification. The remedy is Bayesian estimation.
The Bayesian estimate of the conditional probability is:
- $P_\lambda(X^{(j)}=a_{jl} \mid Y=c_k)=\dfrac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)+\lambda}{\sum_{i=1}^{N}I(y_i=c_k)+S_j\lambda}$
where $\lambda \geq 0$.
- This is equivalent to adding a positive number $\lambda > 0$ to the count of each value of the random variable.
- When $\lambda = 0$, it reduces to maximum likelihood estimation.
- A common choice is $\lambda = 1$, known as Laplace smoothing.
- For any $l=1,2,\dots,S_j$ and $k=1,2,\dots,K$:
  - $P_\lambda(X^{(j)}=a_{jl} \mid Y=c_k) > 0$
  - $\sum_{l=1}^{S_j} P_\lambda(X^{(j)}=a_{jl} \mid Y=c_k) = 1$
This shows that the Bayesian estimate of the conditional probability is indeed a probability distribution. Similarly, the Bayesian estimate of the prior probability is:
- $P_\lambda(Y=c_k)=\dfrac{\sum_{i=1}^{N}I(y_i=c_k)+\lambda}{N+K\lambda}$
Continuing the previous example:
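The continued example is not reproduced here, but the smoothed estimates can be sketched on the same hypothetical toy data as before, with $\lambda=1$ (Laplace smoothing):

```python
from collections import Counter

def fit_smoothed(X, y, lam=1.0):
    """Bayesian estimates with smoothing parameter lam (lam=1 -> Laplace smoothing)."""
    N = len(y)
    classes = sorted(set(y))
    K = len(classes)
    class_counts = Counter(y)
    # P_lambda(Y=c_k) = (count(c_k) + lam) / (N + K*lam)
    priors = {c: (class_counts[c] + lam) / (N + K * lam) for c in classes}
    n = len(X[0])
    # S_j: number of possible values of feature j (here read off from the data)
    values = [sorted({xi[j] for xi in X}, key=str) for j in range(n)]
    cond = {}
    for c in classes:
        rows = [xi for xi, yi in zip(X, y) if yi == c]
        for j in range(n):
            Sj = len(values[j])
            for a in values[j]:
                cnt = sum(1 for xi in rows if xi[j] == a)
                # P_lambda(X^(j)=a | Y=c) = (count + lam) / (count(c) + S_j*lam)
                cond[(j, a, c)] = (cnt + lam) / (class_counts[c] + Sj * lam)
    return priors, cond

# Hypothetical toy data, as in the earlier sketch.
X = [(1, "S"), (1, "M"), (1, "M"), (2, "S"), (2, "L"), (2, "L")]
y = [-1, -1, 1, -1, 1, 1]
priors, cond = fit_smoothed(X, y)
assert all(p > 0 for p in cond.values())  # no zero probabilities remain
```

Every estimate is now strictly positive, and each conditional distribution still sums to 1 over the $S_j$ values of a feature.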