Introduction

- I plan to follow one lecture of the Bilibili whiteboard-derivation series each week, writing up its formulas along with some extensions, taking it slowly.
- Bilibili: https://www.bilibili.com/video/BV1aE411o7qd
- Excellent notes have already been compiled by others and are used as a reference here.
- GitHub: https://github.com/tsyw/MachineLearningNotes
1 Frequentists vs. Bayesians
1.1 Data and Parameters
- Parameters (parameter) are denoted $\theta$.
- Assume a probabilistic model in which the random variable $x$ follows the distribution $x \sim P(x|\theta)$.
- Data (data): the dataset is denoted $X$, and each $x_i$ is called a sample. If the data has $p$ features, each sample is a vector of length $p$:

$$x_{i}=(x_{i1},x_{i2},\cdots,x_{ip})^{T},\quad i = 1,2,\dots,N$$

For $N$ samples with $p$ features, the dataset $X$ is

$$X =(x_1,x_2,\cdots,x_N)^{T}_{N\times p} =\left[\begin{array}{cccc} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{array}\right]_{N\times p}$$

Each entry $x_{ij}$ is generated by the distribution $x \sim P(x|\theta)$.
- Labels: $Y=\{y_1,y_2,\dots,y_N\}$.
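As a concrete illustration (the sample values here are made up), the $N\times p$ design matrix can be held as a NumPy array whose rows are the samples $x_i$:

```python
import numpy as np

# Hypothetical dataset: N = 4 samples, p = 3 features.
# Row i is the sample x_i = (x_i1, x_i2, x_i3)^T.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [0.5, 1.5, 2.5],
])

N, p = X.shape   # N samples, p features
x_2 = X[1]       # the second sample, a length-p vector
```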
1.2 The Frequentist View
- The frequentist school regards the parameter $\theta$ as an unknown constant: the $\theta$ in the distribution $p(x|\theta)$ is a fixed quantity.
- For the observed dataset $X$ of $N$ samples with $x \overset{iid}{\sim} p(x|\theta)$, we estimate $\theta$ by maximum likelihood estimation (MLE), maximizing the log-likelihood:

$$\begin{aligned} \theta_{MLE}&=\mathop{argmax}\limits_{\theta} L(\theta)\\ &=\mathop{argmax}\limits_{\theta}\log p(X|\theta)\\ &=\mathop{argmax}\limits_{\theta}\log \left(\prod\limits_{i=1}^{N}p(x_{i}|\theta)\right)\\ &=\mathop{argmax}\limits_{\theta}\sum\limits_{i=1}^{N}\log p(x_{i}|\theta) \end{aligned}$$
1.3 The Bayesian View
- Bayesians assume the parameter $\theta$ follows a preset prior distribution $\theta \sim p(\theta)$.
- Bayes' theorem links the prior and the posterior:

$$\begin{aligned} \theta_{MAP}&=\mathop{argmax}\limits_{\theta}p(\theta|X)\\ &=\mathop{argmax}\limits_{\theta}\frac{p(X|\theta)\cdot p(\theta)}{p(X)} \\ &\propto \mathop{argmax}\limits_{\theta}\,p(X|\theta)\cdot p(\theta) \end{aligned}$$

where
- $p(\theta|X)$ is the posterior probability (posterior)
- $p(X|\theta)$ is the likelihood
- $p(\theta)$ is the prior probability (prior)
- $p(X) = \int_{\theta}p(X|\theta)\cdot p(\theta)\, d\theta$ is a definite integral, so it evaluates to a fixed number
- The last line above holds because the denominator does not depend on $\theta$ (it has been integrated out). After solving for this $\theta$, evaluating $\frac{p(X|\theta)\cdot p(\theta)}{\int_{\theta}p(X|\theta)\cdot p(\theta)\,d\theta}$ gives the posterior distribution of the parameter. Here $p(X|\theta)$ is the likelihood, i.e. our model distribution. With the posterior in hand, we can use it for Bayesian prediction:

$$p(x_{new}|X)=\int_{\theta}p(x_{new}|\theta)\cdot p(\theta|X)\,d\theta$$
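A minimal numerical sketch of this predictive integral, assuming a Bernoulli model and a flat Beta(1, 1) prior, with the posterior approximated on a grid (the data below are hypothetical):

```python
import numpy as np

# Hypothetical data: 7 heads in 10 Bernoulli trials.
x = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
heads, n = int(x.sum()), len(x)

# Discretize theta on a grid and use a flat prior p(theta) = 1,
# so the posterior is Beta(heads + 1, n - heads + 1).
theta = np.linspace(1e-6, 1 - 1e-6, 10_001)
likelihood = theta**heads * (1 - theta)**(n - heads)
posterior = likelihood / likelihood.sum()   # normalized grid weights

# Bayesian prediction: p(x_new = 1 | X) = integral of
# p(x_new = 1 | theta) * p(theta | X) d(theta), as a weighted sum.
p_next_head = float((theta * posterior).sum())
```

With this flat prior the posterior is Beta(8, 4), whose mean $8/12 \approx 0.667$ is exactly the predictive probability of heads.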
1.4 MLE and MAP for the Coin-Toss Problem
- Maximum likelihood estimation

In the coin-toss experiment, let 1 denote heads and 0 denote tails:

$$x_i= \begin{cases} 1, & \text{heads} \\ 0, & \text{tails} \end{cases}$$

Let the probability of heads be $\theta$ and the probability of tails be $1-\theta$, so $x_i \sim B(1,\theta)$ with probability mass function

$$P(X=x) = \theta^x(1-\theta)^{1-x} = \begin{cases} P(x=0) = 1-\theta \\ P(x=1) = \theta \end{cases}$$

Likelihood:

$$\begin{aligned} L(\theta) & = P(X_1=x_1|\theta)\cdots P(X_n=x_n|\theta) \\ & = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} \end{aligned}$$

Log-likelihood:

$$\begin{aligned} \ln L(\theta) & = \ln \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} \\ & = \sum_{i=1}^n\left[ \ln\theta^{x_i} + \ln (1- \theta)^{1-x_i} \right]\\ & = \sum_{i=1}^n x_i\ln \theta + \sum_{i=1}^n(1-x_i)\ln (1- \theta) \\ & = \sum_{i=1}^n x_i\ln \theta + \left(n-\sum_{i=1}^n x_i\right)\ln (1- \theta) \end{aligned}$$

Objective: maximize $\ln L(\theta)$. Taking the derivative with respect to $\theta$:

$$\frac{\partial\ln L(\theta)}{\partial \theta}=\frac{\sum\limits_{i=1}^n x_i}{\theta}-\frac{n-\sum\limits_{i=1}^n x_i}{1-\theta}$$

Setting the derivative to zero,

$$\frac{\sum\limits_{i=1}^n x_i}{\theta}=\frac{n-\sum\limits_{i=1}^n x_i}{1-\theta}$$

which gives

$$\hat{\theta} = \frac{1}{n}\sum\limits_{i=1}^n x_i$$
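A quick simulation check of $\hat{\theta}=\frac{1}{n}\sum_i x_i$ (the true probability 0.7 below is made up): the closed-form estimate should agree with a brute-force maximization of the log-likelihood on a grid.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7                                  # hypothetical heads probability
x = rng.binomial(1, true_theta, size=100_000)     # 1 = heads, 0 = tails

# The closed-form MLE for a Bernoulli parameter is the sample mean.
theta_mle = x.mean()

# Cross-check: maximize sum(x)*ln(theta) + (n - sum(x))*ln(1 - theta)
# on a grid; the argmax should land next to the closed-form estimate.
grid = np.linspace(0.01, 0.99, 99)
loglik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
theta_grid = grid[np.argmax(loglik)]
```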
- Maximum a posteriori (MAP) estimation

Assume the prior is a Beta distribution:

$$\pi(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1}$$

Compute the posterior $P(\theta | x_1,x_2,\dots,x_n)$:

$$\begin{aligned} P(\theta | x_1,x_2,\dots,x_n) & = \frac{P(\theta,x_1,x_2,\dots,x_n)}{P(x_1,x_2,\dots,x_n)} \\ & = \frac{\pi(\theta)\, p(x_1|\theta)\cdots p(x_n|\theta)}{\int P(\theta,x_1,x_2,\dots,x_n)\, d\theta} \\ &\propto \pi(\theta)\, p(x_1|\theta)\cdots p(x_n|\theta) \\ & \propto \theta^{\alpha-1}(1-\theta)^{\beta-1} \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} \\ & = \theta^{\sum x_i + \alpha -1}\, (1-\theta)^{n-\sum x_i +\beta -1} \end{aligned}$$

Remarks:
- $\int P(\theta,x_1,x_2,\dots,x_n)\, d\theta$ has integrated $\theta$ out, so it does not depend on $\theta$ and is a constant;
- $\propto$ means "proportional to";
- $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)}$ is also a constant and can be dropped;
- $\theta^{\sum x_i + \alpha -1}(1-\theta)^{n-\sum x_i +\beta -1}$ is, up to normalization, a Beta distribution with parameters $\sum x_i + \alpha$ and $n-\sum x_i +\beta$.

The objective $L(\theta)$ is then:

$$L(\theta) = \theta^{\sum x_i + \alpha -1}\, (1-\theta)^{n-\sum x_i +\beta -1}$$

Log-likelihood:

$$\ln L(\theta) = \left(\sum\limits_{i=1}^n x_i + \alpha -1\right)\ln \theta +\left(n-\sum \limits_{i=1}^n x_i +\beta -1\right)\ln(1-\theta)$$

Taking the derivative with respect to $\theta$:

$$\frac{\partial\ln L(\theta)}{\partial \theta} = \frac{\sum\limits_{i=1}^n x_i + \alpha -1}{\theta} - \frac{n-\sum \limits_{i=1}^n x_i +\beta -1}{1-\theta}$$

Setting the derivative to zero,

$$\frac{\sum\limits_{i=1}^n x_i + \alpha -1}{\theta} = \frac{n-\sum \limits_{i=1}^n x_i +\beta -1}{1-\theta}$$

which gives:

$$\hat{\theta} = \frac{\sum\limits_{i=1}^n x_i + \alpha -1}{n+\alpha +\beta -2}$$
- Summary: MLE vs. MAP
- Comparing the MLE $\hat{\theta} = \frac{1}{n}\sum\limits_{i=1}^n x_i$ with the MAP estimate $\hat{\theta} = \frac{\sum\limits_{i=1}^n x_i + \alpha -1}{n+\alpha +\beta -2}$: as the sample size $n$ tends to infinity, the two estimates converge.
- The MAP estimate incorporates prior information about the parameter. When $n$ is large enough, the prior information becomes negligible relative to the sample information, so the result approaches the estimate of $\theta$ obtained from the samples alone.
- Consider the extreme case $n=1$: the MLE is either 0 or 1, while the MAP estimate is $\frac{\alpha}{\alpha +\beta -1}$ or $\frac{\alpha-1}{\alpha +\beta -1}$. This is the advantage of MAP when the sample size is small.
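The two closed forms above can be compared directly on simulated tosses; the Beta(5, 5) prior and the true probability below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, beta = 5.0, 5.0   # hypothetical Beta(5, 5) prior, centred on 0.5

def theta_mle(x):
    """MLE: the sample mean."""
    return x.mean()

def theta_map(x, alpha, beta):
    """MAP under a Beta(alpha, beta) prior."""
    return (x.sum() + alpha - 1) / (len(x) + alpha + beta - 2)

# Extreme case n = 1 with a single head: MLE says theta = 1,
# while MAP is pulled toward the prior: alpha / (alpha + beta - 1) = 5/9.
single = np.array([1])
mle_1, map_1 = theta_mle(single), theta_map(single, alpha, beta)

# Large n: the two estimates converge.
x = rng.binomial(1, 0.3, size=50_000)
mle_big, map_big = theta_mle(x), theta_map(x, alpha, beta)
```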
1.5 MLE and MAP for the Mean of a Normal Distribution
- Problem: derive the MLE and MAP estimates of the mean of a normal distribution. The data $x_1,x_2,\dots,x_n$ come from $\mathcal{N}(\mu,\sigma^2)$, where $\sigma^2$ is known:
  - Derive the MLE of $\mu$ from the samples $x_1,x_2,\dots,x_n$.
  - Assuming the prior on $\mu$ is $\mathcal{N}(0,\tau^2)$, derive the MAP estimate of $\mu$ from the samples $x_1,x_2,\dots,x_n$.
- MLE of $\mu$ from the samples $x_1,x_2,\dots,x_n$

The density of each sample is

$$f(x_i)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x_i-\mu)^{2}}{2 \sigma^{2}}\right), \quad i=1,2,\dots,n$$

Likelihood:

$$\begin{aligned} L(x_i;\mu) & = \prod_{i=1}^n f(x_i;\mu)\\ & = ({\sqrt{2 \pi} \sigma})^{-n} \exp \left( -\frac{1}{2 \sigma^{2}} \sum\limits_{i=1}^n(x_i-\mu)^{2} \right) \end{aligned}$$

Log-likelihood and derivative:

$$\begin{aligned} \ln L(x_i;\mu) & = -n \ln ({\sqrt{2 \pi} \sigma}) - \frac{1}{2 \sigma^{2}} \sum\limits_{i=1}^n(x_i-\mu)^{2} \\ \Rightarrow \quad \frac{\partial\ln L(x_i;\mu)}{\partial \mu} & = \frac{1}{\sigma^{2}} \sum\limits_{i=1}^n(x_i-\mu) \\ & = \frac{1}{\sigma^{2}} \left(\sum\limits_{i=1}^n x_i-n\mu\right) \end{aligned}$$

Setting $\frac{1}{\sigma^{2}} \left(\sum\limits_{i=1}^n x_i-n\mu\right) = 0$ gives

$$\hat{\mu} = \frac{1}{n}\sum\limits_{i=1}^n x_i$$
- MAP estimate of $\mu$ from the samples $x_1,x_2,\dots,x_n$, assuming the prior $\mu \sim \mathcal{N}(0,\tau^2)$

The prior density is

$$f(\mu)=\frac{1}{\sqrt{2 \pi} \tau} \exp \left(-\frac{\mu^{2}}{2 \tau^{2}}\right)$$

The posterior is

$$\begin{aligned} P(\mu | x_1,x_2,\dots,x_n) & = \frac{P(\mu,x_1,x_2,\dots,x_n)}{P(x_1,x_2,\dots,x_n)} \\ & = \frac{f(\mu)\, p(x_1|\mu)\cdots p(x_n|\mu)}{\int P(\mu,x_1,x_2,\dots,x_n)\, d\mu} \\ &\propto f(\mu)\, p(x_1|\mu)\cdots p(x_n|\mu) \\ & = \frac{1}{\sqrt{2 \pi} \tau} \exp \left(-\frac{\mu^{2}}{2 \tau^{2}}\right)\prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x_i-\mu)^{2}}{2 \sigma^{2}}\right) \end{aligned}$$

The objective $L(\mu)$ is then:

$$L(\mu) =\frac{1}{\sqrt{2 \pi} \tau} \exp \left(-\frac{\mu^{2}}{2 \tau^{2}}\right)\prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x_i-\mu)^{2}}{2 \sigma^{2}}\right)$$

Taking logs and differentiating:

$$\begin{aligned} \ln P(\mu | x_1,x_2,\dots,x_n) & = -\ln (\sqrt{2 \pi} \tau)-\frac{\mu^{2}}{2 \tau^{2}} -n \ln ({\sqrt{2 \pi} \sigma})-\frac{1}{2 \sigma^{2}} \sum\limits_{i=1}^n(x_i-\mu)^{2} \\ \Rightarrow \quad \frac{\partial\ln P(\mu | x_1,x_2,\dots,x_n)}{\partial \mu}& = -\frac{\mu}{\tau^2} + \frac{1}{\sigma ^ 2}\sum\limits_{i=1}^n(x_i-\mu) \\ & = \frac{1}{\sigma^{2}} \left(\sum\limits_{i=1}^n x_i-n\mu\right) - \frac{\mu}{\tau^2} \end{aligned}$$

Setting the derivative to zero:

$$\begin{aligned} \frac{1}{\sigma^{2}}\left(\sum\limits_{i=1}^n x_i-n\mu\right) & = \frac{\mu}{\tau^2}\\ \Rightarrow \quad \hat{\mu} & = \frac{\tau^2\sum\limits_{i=1}^n x_i}{\sigma^2+n\tau^2} = \frac{\sum\limits_{i=1}^n x_i}{n+\frac{\sigma^2}{\tau^2}} \end{aligned}$$
When $n$ is small, the Bayesian (MAP) estimate tends to be more accurate than the MLE.
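Both estimators can be checked in simulation; the known noise scale $\sigma$, the prior scale $\tau$, and the true mean below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, tau = 2.0, 1.0   # known noise std and prior std (hypothetical)
true_mu = 1.5

def mu_mle(x):
    """MLE: the sample mean."""
    return x.mean()

def mu_map(x, sigma, tau):
    """MAP under the N(0, tau^2) prior: sum(x) / (n + sigma^2 / tau^2)."""
    return x.sum() / (len(x) + sigma**2 / tau**2)

x_small = rng.normal(true_mu, sigma, size=5)
x_large = rng.normal(true_mu, sigma, size=100_000)

# With few samples the MAP estimate is shrunk toward the prior mean 0;
# with many samples the MLE and MAP estimates nearly coincide.
mle_small, map_small = mu_mle(x_small), mu_map(x_small, sigma, tau)
mle_large, map_large = mu_mle(x_large), mu_map(x_large, sigma, tau)
```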
1.6 The Naive Bayes Algorithm
- A classification method based on Bayes' theorem and the assumption of conditional independence among features.
- Naive Bayes (the method) and Bayesian estimation are different concepts.
- Generative vs. discriminative models:

$$\left\{ \begin{aligned} &\text{generative model}: P(Y|X) = \frac{P(X,Y)}{P(X)}, \text{ with } X, Y \text{ random variables} \\ &\text{discriminative model}: Y=f(X) \text{ or } P(Y|X) \end{aligned} \right.$$
1.6.1 Learning and Classification with Naive Bayes
- Input: a feature vector $x \in \mathcal{X} \subseteq \mathrm{R}^{n}$ representing an instance.
- Output: a class label $y_{i} \in \mathcal{Y}=\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$.

$X$ is a random vector on the input space $\mathcal{X}$, and $Y$ is a random variable on the output space $\mathcal{Y}$. $P(X, Y)$ is the joint distribution of $X$ and $Y$. The training set

$$\begin{aligned} T& = \left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\} \\ & = \left\{ (x_i,y_i)\right\}_{i=1}^N \end{aligned}$$

is generated i.i.d. from $P(X, Y)$.
- Prior distribution:

$$P(Y=c_k),\quad k =1,2,\dots,K$$

- Conditional distribution:

$$P\left(X=x \mid Y=c_{k}\right)=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K$$

- Joint distribution: by the product rule $P(AB) = P(B)P(A|B)$, combining the prior distribution and the conditional distribution above gives the joint distribution $P(X,Y)$ (equivalently written $P(Y,X)$):

$$\begin{aligned} P(Y,X) &= P(Y=c_k,X=x) \\ &= P(Y=c_k)P(X=x\mid Y=c_k) \end{aligned}$$
- Generative model (posterior probability): by the law of total probability and Bayes' theorem,

$$\begin{aligned} P(B|A) &= \frac{P(AB)}{P(A)} =\frac{P(B)P(A|B)}{P(A)} =\frac{P(B)P(A|B)}{\sum P(B)P(A|B)}\\ \Rightarrow P\left(Y=c_{k} \mid X=x\right)&= \frac{P\left(Y=c_{k} ,X=x\right)}{P(X=x)} \\ &=\frac{P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}{P(X=x)}\\ &=\frac{P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)} \end{aligned}$$
- Model assumption: conditional independence

The conditional distribution $P\left(X=x \mid Y=c_{k}\right)$ has exponentially many parameters, so estimating it without assuming conditional independence of the features is infeasible in practice.

$$\begin{aligned} P\left(X=x \mid Y=c_{k}\right) &=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right) \\ &= P(X^{(1)}=x^{(1)} \mid Y=c_k) \, P(X^{(2)}=x^{(2)} \mid Y=c_k) \cdots \\ &=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) \end{aligned}$$

Indeed, if $x^{(j)}$ can take $S_{j}$ values, $j=1,2, \cdots, n$, and $Y$ can take $K$ values, the number of parameters is $K \prod\limits_{j=1}^{n} S_{j}$. Naive Bayes assumes conditional independence of the features; since this is a rather strong assumption, the method is called "naive".
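A tiny counting example of this blow-up (the class count $K$ and feature cardinalities $S_j$ below are made up): without independence the class-conditional table needs $K\prod_j S_j$ entries, while the factorized model needs only $K\sum_j S_j$.

```python
import math

K = 3                  # number of classes (hypothetical)
S = [4, 5, 2, 10, 3]   # S_j: number of values of each of the n = 5 features

full_joint_params = K * math.prod(S)   # without conditional independence
naive_bayes_params = K * sum(S)        # with conditional independence
```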
- Prediction rule: maximize the posterior probability

(Proved below.) Combining the posterior with conditional independence,

$$\begin{aligned} P\left(Y=c_{k} \mid X=x\right) &= \frac{P\left(Y=c_{k} ,X=x\right)}{P(X=x)} \\ &=\frac{P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)} \\ &=\frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}, \quad k=1,2, \cdots, K \end{aligned}$$

This is the basic formula for naive Bayes classification. The naive Bayes classifier can therefore be written as

$$y=f(x)=\arg \max _{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}$$
1.6.2 Maximizing the Posterior Probability
- Naive Bayes assigns an instance to the class with the largest posterior probability. This is equivalent to minimizing the expected risk. Suppose we choose the 0-1 loss function:

$$L(Y, f(X))= \begin{cases}1, & Y \neq f(X) \\ 0, & Y=f(X)\end{cases}$$

where $f(X)$ is the classification decision function. The expected risk is then

$$R_{\exp }(f)=E[L(Y, f(X))]$$

Minimizing the expected loss is equivalent to maximizing the posterior probability:

$$y =\arg\max_{c_k} P\left(Y=c_{k} \mid X=x\right)$$
- Minimizing the expected risk

$$\begin{aligned} \arg\min R_{\exp }(f) & = \arg\min E[L(Y, f(X))] \\ & = \arg\min \sum_Y\sum_X L(Y, f(X))P(X,Y) \\ & = \arg\min \sum_Y\sum_X L(Y, f(X))P(Y|X)P(X) \\ & = \arg\min \sum_X \left\{\sum_Y L(Y, f(X))P(Y|X)\right\}P(X) \end{aligned}$$

The expectation is taken with respect to the joint distribution $P(X, Y)$. Taking the conditional expectation,

$$R_{\exp }(f)=E_{X} \sum_{k=1}^{K}\left[L\left(Y=c_{k}, f(X)\right)\right] P\left(Y=c_{k} \mid X\right)$$

To minimize the expected risk, it suffices to minimize pointwise for each $X=x$. Since the 0-1 loss satisfies $L(Y=c_k, f(X)) = I[f(x)\neq c_k]$, this gives:

$$\begin{aligned} f(x) &= \arg\min \sum_{k=1}^{K}\left[L\left(Y=c_{k}, f(X)\right)\right] P\left(Y=c_{k} \mid X\right) \\ &= \arg\min \sum_{k=1}^{K} I[f(x)\neq c_k]\,P\left(Y=c_{k} \mid X\right)\\ &= \arg\min \sum_{k=1}^{K} \left[1-I[f(x)= c_k]\right]P\left(Y=c_{k} \mid X\right)\\ &= \arg\min \left\{\sum_{k=1}^{K}P\left(Y=c_{k} \mid X\right)-\sum_{k=1}^{K}I[f(x)= c_k]\,P\left(Y=c_{k} \mid X\right)\right\}\\ &= \arg\min \left\{1-\sum_{k=1}^{K}I[f(x)= c_k]\,P\left(Y=c_{k} \mid X\right)\right\} \qquad \left(\text{since } \sum_{k=1}^{K}P(Y=c_{k} \mid X) = 1\right)\\ &= \arg\max \sum_{k=1}^{K}I[f(x)= c_k]\,P\left(Y=c_{k} \mid X\right) \end{aligned}$$
Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk:

$$\begin{aligned} f(x) &= \arg\min R_{\exp }(f) \\ & = \arg\max \sum_{k=1}^{K}I[f(x)= c_k]\,P\left(Y=c_{k} \mid X\right) \end{aligned}$$

Since we predict the class with the largest posterior, we must choose the $c_k$ for which the indicator $I[f(x)= c_k]$ picks out the largest $P(Y=c_{k}\mid X)$. In this way, the expected-risk-minimization criterion yields the posterior-maximization criterion:

$$f(x) = \arg\max_{c_k}P(Y=c_k\mid X=x)$$
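The equivalence can be sanity-checked numerically: for any fixed posterior over the classes (the values below are hypothetical), the class that minimizes the expected 0-1 loss is exactly the class with the largest posterior.

```python
# Hypothetical posterior P(Y = c_k | X = x) over K = 3 classes.
posterior = [0.2, 0.5, 0.3]

# The expected 0-1 risk of predicting class k is the probability of
# being wrong, i.e. 1 - P(Y = c_k | X = x).
risks = [1 - p for p in posterior]

best_by_risk = min(range(3), key=lambda k: risks[k])
best_by_posterior = max(range(3), key=lambda k: posterior[k])
```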
1.6.3 Parameter Estimation for Naive Bayes
- From the derivation above,

$$\begin{aligned} y&=\arg\max f(x)\\ &=\arg \max _{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}\\ &\propto \arg \max _{c_{k}}P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) \end{aligned}$$

In naive Bayes, learning means estimating $P\left(Y=c_{k}\right)$ and $P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$.

- Maximum likelihood estimation:
  - The MLE of the prior probability $P\left(Y=c_{k}\right)$ is

$$P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K$$

  - The MLE of the conditional probability $P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)$ is (letting $\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$ be the set of possible values of the $j$-th feature $x^{(j)}$):

$$\begin{aligned} &P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)} \\ &j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K \end{aligned}$$

Here $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{j l}$ is the $l$-th possible value of the $j$-th feature, and $I$ is the indicator function.
- Bayesian estimation

Maximum likelihood estimation may yield probability estimates equal to 0, which then distorts the posterior computation and biases the classification. The remedy is Bayesian estimation. Specifically:

- The Bayesian estimate of the conditional probability is

$$P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+S_{j} \lambda}$$

where $\lambda \geqslant 0$. This is equivalent to adding a count $\lambda > 0$ to the frequency of each possible value of the random variable. With $\lambda=0$ it reduces to the MLE; the common choice $\lambda=1$ is called Laplacian smoothing. Clearly, for any $l=1,2, \cdots, S_{j}$ and $k=1,2, \cdots, K$,

$$\begin{aligned} &P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)>0 \\ &\sum_{l=1}^{S_{j}} P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=1 \end{aligned}$$

- Similarly, the Bayesian estimate of the prior probability is

$$P_{\lambda}\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+\lambda}{N+K \lambda}$$
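A minimal sketch of the smoothed estimates with $\lambda = 1$ (the labels, feature values, and cardinalities below are made up): an unseen feature value no longer receives probability 0, and each smoothed distribution still sums to 1.

```python
from collections import Counter

# Hypothetical training labels and one feature's values within class c1.
y = ["c1", "c1", "c1", "c2", "c2"]
feature_in_c1 = ["a", "a", "b"]   # values of x^(j) among samples with y = c1
S_j = 3                           # the feature can take values {a, b, c}
K = 2                             # number of classes
lam = 1.0                         # Laplace smoothing

counts = Counter(feature_in_c1)
n_c1 = len(feature_in_c1)

def p_cond(value):
    """Smoothed P(x^(j) = value | Y = c1)."""
    return (counts[value] + lam) / (n_c1 + S_j * lam)

def p_prior(c):
    """Smoothed P(Y = c)."""
    return (sum(1 for yi in y if yi == c) + lam) / (len(y) + K * lam)
```

The value `"c"` never appears in class `c1`, yet `p_cond("c")` is $\frac{0+1}{3+3} = \frac{1}{6} > 0$.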
1.6.4 The Naive Bayes Algorithm Procedure

- Algorithm 4.1 (naïve Bayes algorithm)

Input: training data $T=\left\{ (x_i,y_i)\right\}_{i=1}^N$, where $x_{i}=\left(x_{i}^{(1)}, x_{i}^{(2)}, \cdots, x_{i}^{(n)}\right)^{\mathrm{T}}$, $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_{i}^{(j)} \in\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$, $a_{j l}$ is the $l$-th possible value of the $j$-th feature, $j=1,2, \cdots, n$, $l=1,2, \cdots, S_{j}$, $y_{i} \in\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$; an instance $x$.

Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities:

- Prior probability

$$P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K$$

- Conditional probability

$$P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}\\ j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K$$

(2) For the given instance $x=\left(x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right)^{\mathrm{T}}$, compute

$$P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K$$

(3) Determine the class of instance $x$:

$$y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$$
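Steps (1)-(3) can be sketched end to end for discrete features; the toy dataset below is invented for illustration (it is not the textbook's example), and Laplace smoothing with $\lambda = 1$ is used for the estimates.

```python
from collections import Counter, defaultdict

def train_nb(X, y, lam=1.0):
    """Step (1): estimate smoothed priors and conditionals.
    X: list of feature tuples; y: list of labels; lam: smoothing."""
    N = len(y)
    classes = sorted(set(y))
    n_features = len(X[0])
    # Distinct values per feature (S_j), used in the smoothing denominator.
    S = [len(set(row[j] for row in X)) for j in range(n_features)]

    prior = {c: (y.count(c) + lam) / (N + len(classes) * lam) for c in classes}

    cond = defaultdict(dict)  # cond[(c, j)][value] -> P(x^(j)=value | Y=c)
    for c in classes:
        rows = [row for row, yi in zip(X, y) if yi == c]
        for j in range(n_features):
            cnt = Counter(row[j] for row in rows)
            for v in set(row[j] for row in X):
                cond[(c, j)][v] = (cnt[v] + lam) / (len(rows) + S[j] * lam)
    return prior, cond, classes

def predict_nb(x, prior, cond, classes):
    """Steps (2)-(3): score each class and return the argmax."""
    def score(c):
        s = prior[c]
        for j, v in enumerate(x):
            s *= cond[(c, j)][v]
        return s
    return max(classes, key=score)

# Toy dataset: feature 1 in {1, 2, 3}, feature 2 in {"S", "M", "L"}.
X = [(1, "S"), (1, "M"), (1, "M"), (2, "S"), (2, "L"),
     (2, "L"), (3, "L"), (3, "M"), (3, "M"), (3, "L")]
y = [-1, -1, 1, -1, 1, 1, 1, 1, 1, -1]

prior, cond, classes = train_nb(X, y, lam=1.0)
label = predict_nb((2, "S"), prior, cond, classes)
```

Because the denominator in step (3) is the same for every class, the classifier only needs to compare the unnormalized scores $P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k)$.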