Study Notes: Statistical Learning Methods (《统计学习方法》), Chapter 4: Naive Bayes

4 Naive Bayes

4.1.1 Learning and Classification with Naive Bayes

Let the input space $\mathcal{X} \subseteq \mathbf{R}^n$ be the set of $n$-dimensional vectors, and let the output space be the set of class labels $\mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$. The input is a feature vector $x \in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. $X$ is a random variable defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X, Y)$ is the joint probability distribution of $X$ and $Y$. The training data set
$$T=\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$
is generated i.i.d. from $P(X, Y)$.

The naive Bayes method learns the joint probability distribution $P(X, Y)$ from the training data set. Specifically, it learns the prior probability distribution and the conditional probability distribution.

The prior probability distribution is
$$P(Y=c_k), \quad k=1,2,\ldots,K$$
and the conditional probability distribution is
$$P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)}, X^{(2)}=x^{(2)}, \ldots, X^{(n)}=x^{(n)} \mid Y=c_k), \quad k=1,2,\ldots,K$$
Together they determine the joint probability distribution.

The conditional probability distribution $P(X=x \mid Y=c_k)$ has an exponential number of parameters, so estimating it directly is infeasible in practice. Suppose $x^{(j)}$ can take $S_j$ distinct values, $j=1,2,\ldots,n$, and $Y$ can take $K$ values; then the number of parameters is $K \prod_{j=1}^{n} S_j$.
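To get a feel for the scale (the numbers here are purely illustrative): with $K=2$ classes and $n=30$ binary features ($S_j = 2$), the full conditional model has $K \prod_{j=1}^{n} S_j = 2 \cdot 2^{30} \approx 2 \times 10^9$ parameters, whereas under the conditional independence assumption introduced next only $K \sum_{j=1}^{n} S_j = 2 \cdot 60 = 120$ conditional probabilities need to be estimated.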

To make estimation tractable, the naive Bayes method makes the conditional independence assumption; since this is a rather strong assumption, the method takes its name ("naive") from it:
$$P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)}, X^{(2)}=x^{(2)}, \ldots, X^{(n)}=x^{(n)} \mid Y=c_k) = \prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
Based on this, by Bayes' theorem the posterior probability is
$$P(Y=c_k \mid X=x)=\frac{P(X=x \mid Y=c_k)P(Y=c_k)}{\sum_k P(X=x \mid Y=c_k)P(Y=c_k)}$$
Substituting the independence assumption into the expression above gives
$$P(Y=c_k \mid X=x)=\frac{P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k) \prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)}$$
and hence the classifier
$$y=f(x)=\underset{c_k}{\arg\max} \frac{P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k) \prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)}$$
Since the denominator is identical for every $c_k$, this simplifies to
$$y=f(x)=\underset{c_k}{\arg\max}\, P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
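A minimal sketch of this decision rule in Python (the probability tables below are hypothetical, written as if already estimated from data); note that in practice one sums the logs of the probabilities instead of multiplying them, to avoid floating-point underflow:

```python
import math

# Hypothetical estimated parameters: K = 2 classes, n = 2 discrete features.
prior = {"c1": 0.6, "c2": 0.4}                     # P(Y = c_k)
cond = {                                           # cond[c][j][v] = P(X^(j) = v | Y = c)
    "c1": [{"a": 0.5, "b": 0.5}, {"S": 0.2, "M": 0.8}],
    "c2": [{"a": 0.9, "b": 0.1}, {"S": 0.7, "M": 0.3}],
}

def classify(x):
    """Return argmax_k P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k), computed in log space."""
    def log_score(c):
        return math.log(prior[c]) + sum(math.log(cond[c][j][v]) for j, v in enumerate(x))
    return max(prior, key=log_score)

# 0.4 * 0.9 * 0.7 = 0.252 beats 0.6 * 0.5 * 0.2 = 0.06, so "c2" is returned.
print(classify(("a", "S")))
```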

4.1.2 The Meaning of Posterior Probability Maximization

Naive Bayes assigns an instance to the class with the largest posterior probability, and this is equivalent to minimizing the expected risk. Suppose we choose the 0-1 loss function
$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$
The expected risk is then
$$R_{exp}(f)=E[L(Y,f(X))] = E_X \sum_{k=1}^{K} L(c_k, f(X))\, P(c_k \mid X)$$
To minimize the expected risk it suffices to minimize it pointwise for each $X=x$, which yields
$$\begin{aligned} f(x) &= \underset{y\in\mathcal{Y}}{\arg\min} \sum_{k=1}^{K} L(c_k, y)\, P(c_k \mid X=x) \\ &= \underset{y\in\mathcal{Y}}{\arg\min} \sum_{k=1}^{K} P(y \neq c_k \mid X=x) \\ &= \underset{y\in\mathcal{Y}}{\arg\min} \big(1 - P(y = c_k \mid X=x)\big) \\ &= \underset{y\in\mathcal{Y}}{\arg\max}\, P(y = c_k \mid X=x) \end{aligned}$$
(the second step holds because, for a fixed candidate $y$, the posteriors of all classes other than $y$ sum to $1 - P(Y=y \mid X=x)$). Thus the expected-risk-minimization criterion leads to the posterior-probability-maximization criterion, which is exactly the criterion adopted by the naive Bayes method.
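A quick numeric illustration (posterior values made up): if $P(c_1 \mid x)=0.5$, $P(c_2 \mid x)=0.3$, $P(c_3 \mid x)=0.2$, then predicting $c_1$, $c_2$ or $c_3$ incurs an expected 0-1 loss of $0.5$, $0.7$ and $0.8$ respectively, so the class with the largest posterior, $c_1$, is indeed the risk-minimizing choice.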

4.2 Parameter Estimation for the Naive Bayes Method

4.2.1 Maximum Likelihood Estimation

The maximum likelihood estimate of the prior probability is
$$P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k=1,2,\ldots,K$$
Proof:

First, be clear about what the parameters are: they are $p(y=c_k)$ and $p(x^{(j)}=a_{jl} \mid y=c_k)$; let $\psi$ stand for both groups of parameters.
$$
\begin{aligned}
L(\psi) &= \log \prod_{i=1}^N p(x_i, y_i; \psi) \\
&= \log \prod_{i=1}^N p(x_i \mid y_i; \psi)\, p(y_i; \psi) \\
&= \log \prod_{i=1}^N \Big(\prod_{j=1}^n p(x_i^{(j)} \mid y_i; \psi)\Big) p(y_i; \psi) \\
&= \sum_{i=1}^N \Big[\log p(y_i; \psi) + \sum_{j=1}^n \log p(x_i^{(j)} \mid y_i; \psi)\Big]
\end{aligned}
$$
Writing the parameters in explicitly, with indicator functions selecting the observed values,
$$
\begin{aligned}
L(\psi) &= \sum_{i=1}^N \Big[\sum_{k=1}^K \log p(y = c_k)^{I(y_i=c_k)} + \sum_{k=1}^K \sum_{j=1}^n \sum_{l=1}^{S_j} \log p(x^{(j)} = a_{jl} \mid y = c_k)^{I(x_i^{(j)}=a_{jl},\, y_i=c_k)}\Big] \\
&= \sum_{i=1}^N \Big[\sum_{k=1}^K I(y_i=c_k) \log p(y = c_k) + \sum_{k=1}^K \sum_{j=1}^n \sum_{l=1}^{S_j} I(x_i^{(j)}=a_{jl},\, y_i=c_k) \log p(x^{(j)} = a_{jl} \mid y = c_k)\Big]
\end{aligned}
$$
However, the $p(y=c_k)$ are subject to a normalization constraint, and maximizing under constraints can be handled with Lagrange multipliers.

Only the first half of the expression above contains $p(y = c_k)$, so when estimating the prior we only need to consider that part.

Estimating the prior probability

$$F = \sum_{i=1}^N \Big[\sum_{k=1}^K I(y_i=c_k) \log p(y = c_k) + \lambda \Big(1 - \sum_{k=1}^K p(y = c_k)\Big)\Big]$$

Note that the constraint term added here is not a single copy of $1 - \sum_{k=1}^K p(y = c_k)$ but rather $\sum_{i=1}^N \big(1 - \sum_{k=1}^K p(y = c_k)\big)$. The difference is immaterial, since the term is zero either way, but adding $N$ copies makes the equations below easier to solve.
$$\begin{cases} \dfrac{\partial F}{\partial p(y = c_1)} = \sum_{i=1}^N \Big[\dfrac{I(y_i = c_1)}{p(y = c_1)} - \lambda\Big] = 0 \\ \dfrac{\partial F}{\partial p(y = c_2)} = \sum_{i=1}^N \Big[\dfrac{I(y_i = c_2)}{p(y = c_2)} - \lambda\Big] = 0 \\ \ldots \\ \dfrac{\partial F}{\partial p(y = c_K)} = \sum_{i=1}^N \Big[\dfrac{I(y_i = c_K)}{p(y = c_K)} - \lambda\Big] = 0 \\ \dfrac{\partial F}{\partial \lambda} = \sum_{i=1}^N \Big(1 - \sum_{k=1}^K p(y = c_k)\Big) = 0 \end{cases}$$
Rearranging the first $K$ equations gives
$$\begin{cases} p(y = c_1) = \dfrac{\sum_{i=1}^N I(y_i = c_1)}{N \lambda} \\ p(y = c_2) = \dfrac{\sum_{i=1}^N I(y_i = c_2)}{N \lambda} \\ \ldots \\ p(y = c_K) = \dfrac{\sum_{i=1}^N I(y_i = c_K)}{N \lambda} \end{cases} \tag{2}$$
Since $\sum_{k=1}^K p(y = c_k) = 1$,
$$1 = \frac{\sum_{i=1}^N \sum_{k=1}^K I(y_i = c_k)}{N \lambda} = \frac{N}{N \lambda} \implies \lambda = 1$$
Substituting back into (2) gives
$$p(y = c_k) = \frac{\sum_{i=1}^N I(y_i = c_k)}{N}, \quad k = 1,2,\ldots,K$$

Estimating the conditional probability

$$G = \sum_{i=1}^N \sum_{k=1}^K \sum_{j=1}^n \Big[\sum_{l=1}^{S_j} I(x_i^{(j)}=a_{jl},\, y_i=c_k) \log p(x^{(j)} = a_{jl} \mid y = c_k) + \lambda_{kj} \Big(1 - \sum_{l=1}^{S_j} p(x^{(j)} = a_{jl} \mid y = c_k)\Big)\Big]$$

As before, each pair $(k, j)$ carries one constraint $\sum_{l=1}^{S_j} p(x^{(j)} = a_{jl} \mid y = c_k) = 1$, so there are $K \cdot n$ constraints in total, each with its own multiplier $\lambda_{kj}$. Taking derivatives,
$$\begin{cases} \dfrac{\partial G}{\partial p(x^{(j)} = a_{jl} \mid y = c_k)} = \sum_{i=1}^N \Big[\dfrac{I(x_i^{(j)}=a_{jl},\, y_i=c_k)}{p(x^{(j)} = a_{jl} \mid y = c_k)} - \lambda_{kj}\Big] = 0 \\ \dfrac{\partial G}{\partial \lambda_{kj}} = \sum_{i=1}^N \Big(1 - \sum_{l=1}^{S_j} p(x^{(j)} = a_{jl} \mid y = c_k)\Big) = 0 \end{cases} \tag{3}$$
From the first equation,
$$p(x^{(j)} = a_{jl} \mid y = c_k) = \frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\, y_i=c_k)}{N \lambda_{kj}} \tag{4}$$
and from the second,
$$\sum_{l=1}^{S_j} p(x^{(j)} = a_{jl} \mid y = c_k) = 1 \tag{5}$$
Combining the two, and using the fact that summing the indicator over all $l$ collapses it to $I(y_i = c_k)$,
$$1 = \sum_{l=1}^{S_j} \frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\, y_i=c_k)}{N \lambda_{kj}} = \frac{\sum_{i=1}^N I(y_i = c_k)}{N \lambda_{kj}} \implies N \lambda_{kj} = \sum_{i=1}^N I(y_i = c_k)$$
Substituting into (4) gives
$$p(x^{(j)} = a_{jl} \mid y = c_k) = \frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\, y_i=c_k)}{\sum_{i=1}^N I(y_i = c_k)}$$
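Both estimates are simply relative frequencies. For instance (with made-up counts): if 3 out of $N = 5$ training samples have $y = c_1$, then $P(Y = c_1) = 3/5$; if 2 of those 3 samples have $x^{(2)} = M$, then $P(X^{(2)} = M \mid Y = c_1) = 2/3$.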

This completes the proof.

4.2.2 The Learning and Classification Algorithm

Input: training data $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)} \in \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $j = 1,2,\ldots,n$, $l = 1,2,\ldots,S_j$, $y_i \in \{c_1, c_2, \ldots, c_K\}$; and an instance $x$.

Output: the class of the instance $x$.

(1) Compute the prior and conditional probabilities:
$$P(Y = c_k) = \frac{\sum_{i=1}^N I(y_i = c_k)}{N}, \quad k = 1,2,\ldots,K$$
$$P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\, y_i=c_k)}{\sum_{i=1}^N I(y_i = c_k)}, \quad j = 1,2,\ldots,n;\ \ l = 1,2,\ldots,S_j;\ \ k = 1,2,\ldots,K$$
(2) For the given instance $x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})^T$, compute
$$P(Y = c_k) \prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y = c_k), \quad k=1,2,\ldots,K$$
(3) Determine the class of $x$:
$$y = \underset{c_k}{\arg\max}\, P(Y = c_k) \prod_{j=1}^n P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
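Putting the three steps together, here is a compact, runnable sketch of this algorithm (the function names and toy training pairs are my own, purely illustrative):

```python
from collections import Counter

def fit(samples):
    """Step (1): estimate P(Y=c_k) and P(X^(j)=a_jl | Y=c_k) as relative frequencies."""
    N = len(samples)
    class_count = Counter(y for _, y in samples)                 # sum_i I(y_i = c_k)
    pair_count = Counter((j, v, y) for x, y in samples for j, v in enumerate(x))
    prior = {c: n / N for c, n in class_count.items()}
    cond = {(j, v, c): n / class_count[c] for (j, v, c), n in pair_count.items()}
    return prior, cond

def predict(prior, cond, x):
    """Steps (2)-(3): score every class and return the argmax."""
    def score(c):
        s = prior[c]
        for j, v in enumerate(x):
            s *= cond.get((j, v, c), 0.0)  # a (value, class) pair never seen gets probability 0
        return s
    return max(prior, key=score)

samples = [(("a", "S"), "c1"), (("a", "M"), "c1"), (("b", "M"), "c1"),
           (("b", "S"), "c2"), (("a", "S"), "c2")]
prior, cond = fit(samples)
print(predict(prior, cond, ("a", "S")))  # c1 scores 0.6*(2/3)*(1/3); c2 scores 0.4*(1/2)*1 -> "c2"
```

Note the `cond.get(..., 0.0)`: a feature value that never co-occurs with a class zeroes out the entire product, which is exactly the failure mode the Bayesian estimate in the next section is designed to fix.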

4.2.3 Bayesian Estimation

Maximum likelihood estimation can yield probability estimates equal to zero, which distorts the computation of the posterior and biases the classification. The remedy is to use the Bayesian estimate
$$P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\, y_i=c_k) + \lambda}{\sum_{i=1}^N I(y_i = c_k) + S_j \lambda}$$
where $\lambda \geq 0$. This is equivalent to adding a positive count $\lambda > 0$ to the frequency of each possible value of the random variable; when $\lambda = 0$ it reduces to the maximum likelihood estimate. A common choice is $\lambda = 1$, in which case it is called Laplace smoothing (Laplacian smoothing). Since for any $l = 1,2,\ldots,S_j$ and $k = 1,2,\ldots,K$ we have
$$P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) > 0, \qquad \sum_{l=1}^{S_j} P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) = 1$$
the Bayesian estimate is indeed a probability distribution. Similarly, the Bayesian estimate of the prior probability is
$$P_{\lambda}(Y = c_k) = \frac{\sum_{i=1}^N I(y_i = c_k) + \lambda}{N + K \lambda}, \quad k = 1,2,\ldots,K$$
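A minimal sketch of the smoothed conditional estimate ($\lambda = 1$; the counts passed in are made up):

```python
def smoothed_cond(count_jl_k, count_k, S_j, lam=1.0):
    """Bayesian estimate: (count of (a_jl, c_k) + lam) / (count of c_k + S_j * lam)."""
    return (count_jl_k + lam) / (count_k + S_j * lam)

# Feature j has S_j = 2 possible values; class c_k was seen 3 times,
# with its value split 0 / 3 across the two values.
print(smoothed_cond(0, 3, 2))  # MLE would give 0/3 = 0; smoothing gives 1/5 = 0.2
print(smoothed_cond(3, 3, 2))  # (3+1)/(3+2) = 0.8, and 0.2 + 0.8 still sums to 1
```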
Summary

  1. Naive Bayes is a typical generative learning method. A generative method learns the joint probability distribution $P(X,Y)$ from the training data and then derives the posterior probability distribution $P(Y \mid X)$.

  2. The basic assumption of naive Bayes is conditional independence. It eliminates a large number of parameters and greatly simplifies learning and prediction, so the method's strengths are efficiency and ease of implementation. Its weakness is that classification performance is not necessarily high.

Exercises

A second look at Bayesian estimation

Idea: assume the probability $P_{\lambda}(Y=c_k)$ follows a Dirichlet distribution; use Bayes' theorem to show that the posterior is also a Dirichlet distribution, and then take the expectation of the parameter.

Proof steps:

  1. Assumptions

Following the basic setup of the naive Bayes method, with training data set $T=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$, assume:
(1) the random variable $Y$ takes the value $y=c_k$ a total of $m_k$ times, i.e. $m_k=\sum_{i=1}^N I(y_i=c_k)$, so that $\sum_{k=1}^K m_k = N$ (there are $N$ labels in total);
(2) $P_\lambda(Y=c_k)=u_k$, where the random variables $u_k$ follow a Dirichlet distribution with parameter $\lambda$.

Supplementary notes:

  1. The Dirichlet distribution
    See section 2.2.1 of PRML (Pattern Recognition and Machine Learning): multiplying the likelihood function (2.34) by the prior (2.38) gives the posterior distribution of the parameters $u_k$, of the form
    $$p(u \mid D, \alpha) \propto p(D \mid u)\, p(u \mid \alpha) \propto \prod_{k=1}^K u_k^{\alpha_k+m_k-1}$$
    From section B.4 of the same book: the Dirichlet distribution is a multivariate distribution over $K$ random variables $u_k$, $k=1,2,\ldots,K$, subject to the constraints
    $$0 \leqslant u_k \leqslant 1, \quad \sum_{k=1}^K u_k = 1$$
    With $u=(u_1,\ldots,u_K)^T$ and $\alpha=(\alpha_1,\ldots,\alpha_K)^T$,
    $$\mathrm{Dir}(u \mid \alpha) = C(\alpha) \prod_{k=1}^K u_k^{\alpha_k - 1}, \qquad E(u_k) = \frac{\alpha_k}{\sum_{k=1}^K \alpha_k}$$

  2. Why assume that the probability of $Y=c_k$ follows a Dirichlet distribution?
    Answer:
    (1) By PRML section B.4, the Dirichlet distribution is the multivariate generalization of the Beta distribution.
    (2) The Beta distribution is the conjugate prior of the binomial distribution, and the Dirichlet distribution is the conjugate prior of the multinomial distribution; a Dirichlet distribution can be viewed as a "distribution over distributions".
    (3) Conjugacy means the prior and posterior belong to the same family of distributions, so with a Beta or Dirichlet prior, accumulating observations simply updates the distribution, and the computed probability gets closer and closer to the true value.
    (4) Therefore, for an event whose probability is unknown, a Beta or Dirichlet distribution is a natural choice for the distribution of that probability.

Much respect to the experts; thanks to the explanations found online.

  1. The prior:
    $$P(u)=P(u_1,u_2,\ldots,u_K) = C(\lambda) \prod_{k=1}^K u_k^{\lambda - 1}$$

  2. The likelihood function
    Writing $m=(m_1, m_2, \ldots, m_K)^T$, the likelihood is
    $$P(m \mid u) = u_1^{m_1} \cdot u_2^{m_2} \cdots u_K^{m_K} = \prod_{k=1}^K u_k^{m_k}$$

  3. The posterior distribution
    By Bayes' theorem, the posterior distribution of $u$ is
    $$P(u \mid m) = \frac{P(m \mid u)\,P(u)}{P(m)}$$

  4. By the Dirichlet prior in assumption (2),
    $$P(u \mid m,\lambda) \propto P(m \mid u)\,P(u \mid \lambda) \propto \prod_{k=1}^K u_k^{\lambda+m_k-1}$$
    which shows that the posterior distribution $P(u \mid m,\lambda)$ is also a Dirichlet distribution.

  5. The expectation of the random variable $u_k$
    From the posterior distribution $P(u \mid m,\lambda)$ and the Dirichlet mean formula above,
    $$E(u_k) = \frac{\alpha_k}{\sum_{k=1}^K \alpha_k}$$
    with $\alpha_k = \lambda+m_k$, so
    $$\begin{aligned} E(u_k) &= \frac{\alpha_k}{\sum_{k=1}^K \alpha_k} = \frac{\lambda+m_k}{\sum_{k=1}^K (\lambda + m_k)} = \frac{\lambda+m_k}{\sum_{k=1}^K \lambda + \sum_{k=1}^K m_k} \\ &= \frac{\lambda+m_k}{K \lambda + N} = \frac{\sum_{i=1}^N I(y_i=c_k) + \lambda}{N+K \lambda} \end{aligned}$$

This proves equation (4.11) of the book.

The proof of equation (4.10) is similar.
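As a quick numerical sanity check of the posterior mean formula, here is a small simulation I would run (the counts $m_k$ are made up; it uses `numpy.random.Generator.dirichlet`):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0
m = np.array([5, 3, 2])                  # hypothetical counts m_k, so N = 10, K = 3
N, K = m.sum(), len(m)

# Draw from the posterior Dir(lam + m) and compare the empirical mean of u_k
# with the closed form (m_k + lam) / (N + K * lam).
samples = rng.dirichlet(lam + m, size=200_000)
print(samples.mean(axis=0))              # approximately [0.4615, 0.3077, 0.2308]
print((m + lam) / (N + K * lam))         # exactly [6/13, 4/13, 3/13]
```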
