1. A Brief Introduction to Naive Bayes
Naive Bayes rests on the conditional independence assumption: given the class, the features $x_i^{(j)}$ of a sample are mutually independent. In symbols:
$$
\begin{aligned}
P(X=x_i|Y=c_k) &= P(X^{(1)}=x_i^{(1)},X^{(2)}=x_i^{(2)},\cdots,X^{(n)}=x_i^{(n)}|Y=c_k) \\
&= \prod_{j=1}^{n}P(X^{(j)}=x_i^{(j)}|Y=c_k)
\end{aligned}
$$
where $c_k$ is a class label (there are $K$ classes in total), $n$ is the feature dimension, and $x_i$ is an input sample.
The Naive Bayes classifier is then:
$$ y=\arg\max_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k) $$
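The decision rule above is just counting followed by an argmax; a minimal sketch on a toy discrete dataset (the data and names below are illustrative, not from the text):

```python
from collections import Counter, defaultdict

# Toy discrete dataset (illustrative): each sample is a tuple of feature values.
X = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "cool"), ("rainy", "hot")]
y = ["no", "no", "yes", "no"]

N = len(y)
prior = {c: cnt / N for c, cnt in Counter(y).items()}  # P(Y = c_k)

# cond[(c, j)][v] = number of class-c samples whose j-th feature equals v
cond = defaultdict(Counter)
for xi, c in zip(X, y):
    for j, v in enumerate(xi):
        cond[(c, j)][v] += 1

def predict(x):
    """Return argmax_c P(Y=c) * prod_j P(X^(j) = x_j | Y=c)."""
    best, best_score = None, -1.0
    for c, p in prior.items():
        n_c = sum(1 for label in y if label == c)
        score = p
        for j, v in enumerate(x):
            score *= cond[(c, j)][v] / n_c  # empirical P(X^(j)=v | Y=c)
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(("rainy", "cool")))  # prints "yes": that class has the larger score
```

The unsmoothed counts can produce zero probabilities for unseen feature/class pairs; Section 3's Bayesian estimation addresses exactly that.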
2. Bayesian Decision Theory
This section explains where the maximum-a-posteriori rule in Naive Bayes comes from.
Naive Bayes uses the 0-1 loss function as its evaluation criterion:
$$
L(Y,f(X))=
\begin{cases}
0, & Y = f(X) \\
1, & Y \neq f(X)
\end{cases}
$$
where $f(X)$ is the classification decision function.
The expected loss is $R_{exp}(f)=E[L(Y,f(X))]$. Minimizing the conditional risk at every sample $x$ minimizes the expected loss, so it suffices to show that minimizing the conditional risk is equivalent to maximizing the posterior probability:
$$
\begin{aligned}
f(x) &=\arg\min_{y \in \mathcal Y}\sum_{k=1}^{K}L(c_k,y)P(Y=c_k|X=x) \\
&=\arg\min_{y \in \mathcal Y}\sum_{k:\,c_k \neq y}P(Y=c_k|X=x) \\
&=\arg\min_{y \in \mathcal Y}\bigl(1-P(Y=y|X=x)\bigr) \\
&=\arg\max_{y \in \mathcal Y}P(Y=y|X=x)
\end{aligned}
$$
(The second step uses the fact that $L(c_k,y)=1$ exactly when $c_k \neq y$; the third uses that the posterior probabilities sum to 1.)
This yields the maximum a posteriori criterion:
$$ f(x)=\arg\max_{c_k \in \mathcal Y}P(Y=c_k|X=x) $$
where $\mathcal Y=\{c_1,c_2,\cdots,c_K\}$ and $K$ is the number of classes.
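The equivalence can be checked numerically: for any fixed posterior, the class minimizing the expected 0-1 loss is the class maximizing the posterior. A small sketch with a made-up posterior over $K=3$ classes:

```python
# Hypothetical posterior P(Y=c_k | X=x) for K=3 classes (values are made up).
posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}

def expected_01_loss(y_hat):
    # E[L(Y, y_hat) | X=x] = sum over c_k != y_hat of P(Y=c_k | X=x)
    #                      = 1 - P(Y=y_hat | X=x)
    return sum(p for c, p in posterior.items() if c != y_hat)

risk = {c: expected_01_loss(c) for c in posterior}
min_risk_class = min(risk, key=risk.get)       # argmin of the conditional risk
map_class = max(posterior, key=posterior.get)  # argmax of the posterior
print(min_risk_class, map_class)  # prints "c2 c2": the two rules agree
```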
3. Parameter Estimation
Method 1: Maximum Likelihood Estimation
We first state the results, then prove them.
Prior probability estimate:
$$ P(Y=c_k)=\frac{\sum_{i=1}^{N}I(y_i=c_k)}{N} $$
Conditional probability estimate:
$$ P(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^{N}I(y_i=c_k)} $$
where $k=1,2,\cdots,K$; $j=1,2,\cdots,n$; $l=1,2,\cdots,S_j$; $N$ is the number of training samples; and the $j$-th feature takes values $x^{(j)} \in \{a_{j1},a_{j2},\cdots,a_{jS_j}\}$.
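Both estimates are plain counting; a minimal sketch on made-up data with a single discrete feature:

```python
from collections import Counter

# Illustrative data: labels y_i and one discrete feature x_i^(1).
y = ["c1", "c1", "c2", "c1", "c2"]
x1 = ["a", "b", "a", "a", "a"]
N = len(y)

# Prior: P(Y=c_k) = sum_i I(y_i = c_k) / N
class_count = Counter(y)
prior = {c: cnt / N for c, cnt in class_count.items()}

# Conditional: P(X^(1)=a_1l | Y=c_k) = count(x=a_1l, y=c_k) / count(y=c_k)
joint = Counter(zip(x1, y))
cond = {(v, c): joint[(v, c)] / class_count[c] for (v, c) in joint}

print(prior)              # prints {'c1': 0.6, 'c2': 0.4}
print(cond[("a", "c1")])  # prints 2/3 = 0.666...
```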
Proof.
1. Estimating the prior $P(Y=c_k)$
Let $P(Y=c_k)=\theta_k$, $k \in \{1,2,\cdots,K\}$.
Then $P(Y)=\prod_{k=1}^{K} \theta_k^{I(Y=c_k)}$.
The log-likelihood over the $N$ training labels is:
$$
\begin{aligned}
L(\theta) &=\log\Bigl(\prod_{i=1}^{N}P(Y=y_i)\Bigr) \\
&=\log\Bigl(\prod_{i=1}^{N}\prod_{k=1}^{K}\theta_k^{I(y_i=c_k)}\Bigr) \\
&=\log\Bigl(\prod_{k=1}^{K}\theta_k^{N_k}\Bigr) \\
&=\sum_{k=1}^{K}N_k \log \theta_k
\end{aligned}
$$
where $N_k$ is the number of samples whose class is $c_k$.
Since $\sum_{k=1}^{K}\theta_k=1$, the Lagrangian is:
$$ L(\theta,\lambda)=\sum_{k=1}^{K}N_k \log \theta_k+ \lambda \Bigl(\sum_{k=1}^{K}\theta_k-1\Bigr) $$
Setting the partial derivative with respect to $\theta_k$ to zero:
$$ \frac{\partial L(\theta,\lambda)}{\partial \theta_k}=\frac{N_k}{\theta_k}+\lambda=0 \Rightarrow N_k=-\lambda \theta_k $$
Summing over $k$:
$$ \sum_{k=1}^{K}N_k=-\lambda\sum_{k=1}^{K}\theta_k \Rightarrow N=-\lambda \Rightarrow \theta_k=\frac{N_k}{N} $$
which proves the claim.
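The result can also be sanity-checked numerically: for fixed counts $N_k$, no other distribution attains a higher log-likelihood than $\theta_k = N_k/N$ (the counts below are illustrative):

```python
import math
import random

# Illustrative class counts N_k for K = 3 classes.
Nk = [5, 3, 2]
N = sum(Nk)

def log_lik(theta):
    # L(theta) = sum_k N_k * log(theta_k)
    return sum(n * math.log(t) for n, t in zip(Nk, theta))

mle = [n / N for n in Nk]  # theta_k = N_k / N from the derivation above

# The MLE should beat every other valid distribution over the K classes.
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in Nk]
    other = [r / sum(raw) for r in raw]  # random point on the simplex
    assert log_lik(mle) >= log_lik(other)
print("theta_k = N_k / N maximizes the log-likelihood on this example")
```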
2. Estimating the conditional probability $P(X^{(j)}=a_{jl}|Y=c_k)$
Let $P(X^{(j)}=a_{jl}|Y=c_k)=\theta_{kjl}$.
By the conditional independence assumption, a sample $x$ with label $c_k$ then has conditional probability $P(X=x|Y=c_k)=\prod_{j=1}^{n}\prod_{l=1}^{S_j}\theta_{kjl}^{I(x^{(j)}=a_{jl})}$.
The likelihood over all $N$ training samples is:
$$
\begin{aligned}
l(\theta) &=\prod_{i=1}^{N}\prod_{k=1}^{K}\prod_{j=1}^{n}\prod_{l=1}^{S_j}\theta_{kjl}^{I(x^{(j)}_i=a_{jl},\,y_i=c_k)} \\
&=\prod_{k=1}^{K}\prod_{j=1}^{n}\prod_{l=1}^{S_j}\theta_{kjl}^{N_{kjl}}
\end{aligned}
$$
where $N_{kjl}$ is the number of samples that belong to class $c_k$ and whose $j$-th feature equals $a_{jl}$.
The log-likelihood is therefore:
$$ L(\theta)=\sum_{k=1}^{K}\sum_{j=1}^{n}\sum_{l=1}^{S_j}N_{kjl} \log \theta_{kjl} $$
Since $\sum_{l=1}^{S_j}\theta_{kjl}=1$ for every pair $(k,j)$, and the terms decouple across such pairs, it suffices to maximize over $l$ for fixed $k$ and $j$. The Lagrangian is:
$$ L(\theta,\lambda)=\sum_{l=1}^{S_j}N_{kjl} \log \theta_{kjl}+\lambda\Bigl(\sum_{l=1}^{S_j}\theta_{kjl}-1\Bigr) $$
$$ \Rightarrow \frac {\partial L(\theta,\lambda)} {\partial \theta_{kjl}} =\frac {N_{kjl}}{\theta_{kjl}}+\lambda=0 \Rightarrow N_{kjl}=-\lambda\theta_{kjl} $$
$$ \Rightarrow \sum_{l=1}^{S_j}N_{kjl}=-\lambda \sum_{l=1}^{S_j}\theta_{kjl}=-\lambda \Rightarrow -\lambda=N_k $$
$$ \Rightarrow \theta_{kjl}=\frac {N_{kjl}}{N_k} $$
which proves the claim. (Here $\sum_{l=1}^{S_j}N_{kjl}=N_k$, since every class-$c_k$ sample takes exactly one value for its $j$-th feature.)
Method 2: Bayesian Estimation
Bayesian estimation addresses a failure mode of maximum likelihood: an estimated probability can be exactly zero when a feature value never co-occurs with a class in the training data, and a single zero factor wipes out the whole product in the decision rule.
Prior probability estimate:
$$ P(Y=c_k)=\frac{\sum_{i=1}^{N}I(y_i=c_k)+\lambda}{N+K\lambda} $$
Conditional probability estimate:
$$ P(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)+\lambda}{\sum_{i=1}^{N}I(y_i=c_k)+S_j\lambda} $$
where $\lambda \geq 0$; $k=1,2,\cdots,K$; $j=1,2,\cdots,n$; $l=1,2,\cdots,S_j$; and $x^{(j)} \in \{a_{j1},a_{j2},\cdots,a_{jS_j}\}$. Taking $\lambda=1$ gives the common Laplace smoothing, and $\lambda=0$ recovers maximum likelihood.
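A sketch of the smoothed estimates on made-up data in which one feature value never co-occurs with a class; with $\lambda = 1$ the estimate is no longer zero:

```python
from collections import Counter

# Illustrative data: value "c" never occurs with class "c1",
# so the unsmoothed conditional estimate would be 0.
y = ["c1", "c1", "c2"]
x1 = ["a", "b", "c"]
values = ["a", "b", "c"]  # S_j = 3 possible values of this feature
lam = 1.0                 # lambda = 1: Laplace smoothing

N = len(y)
K = len(set(y))
Nc = Counter(y)
joint = Counter(zip(x1, y))

# Smoothed prior: (count(c_k) + lambda) / (N + K*lambda)
prior = {c: (Nc[c] + lam) / (N + K * lam) for c in Nc}

# Smoothed conditional: (count(a_jl, c_k) + lambda) / (count(c_k) + S_j*lambda)
Sj = len(values)
cond = {(v, c): (joint[(v, c)] + lam) / (Nc[c] + Sj * lam)
        for v in values for c in Nc}

print(cond[("c", "c1")])  # prints 0.2, i.e. (0+1)/(2+3), instead of 0
```

Note that for each fixed class the smoothed conditionals still sum to 1 over the $S_j$ feature values, so they remain a valid distribution.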