Part One: Frequentist vs. Bayesian

Introduction

  • I plan to follow the Bilibili whiteboard-derivation series, listening to at least one lecture per week, writing up its formulas, and adding some extensions along the way. Taking it slowly.

  • The whiteboard-derivation course

    Bilibili: https://www.bilibili.com/video/BV1aE411o7qd

  • Excellent write-ups of these formulas already exist, and I will draw on them.

    GitHub: https://github.com/tsyw/MachineLearningNotes

1 Frequentist vs. Bayesian

1.1 Data and Parameters

  • Parameter: denoted $\theta$.

  • Suppose we have a probabilistic model in which the random variable $x$ follows the distribution $x \sim P(x|\theta)$.

  • Data: the data set is denoted $X$. Each $x_i$ is called a sample; if the data has $p$ features, each sample is a vector of length $p$:
    $$x_{i}=(x_{i1},x_{i2},\cdots,x_{ip})^{T},\quad i = 1,2,\dots,N$$
    For $N$ samples with $p$ features, the data set $X$ is the $N \times p$ matrix
    $$X=(x_1,x_2,\cdots,x_N)^T=\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}_{N\times p}$$
    Each entry $x_{ij}$ is generated by the distribution $x \sim P(x|\theta)$ above.

  • Labels: $Y = \{y_1,y_2,\dots,y_N\}$


1.2 The Frequentist School

  • The frequentist school regards the parameter $\theta$ as an unknown constant; that is, $\theta$ in the distribution $p(x|\theta)$ is a fixed quantity.

  • For the observed data set $X$ of $N$ samples with $x \overset{iid}{\sim} p(x|\theta)$, the probability of the data is $p(X|\theta)=\prod_{i=1}^{N}p(x_i|\theta)$. To estimate $\theta$, we use maximum likelihood estimation (MLE):
    $$\begin{aligned} \theta_{MLE}&=\mathop{argmax}\limits_{\theta} L(\theta)\\ &=\mathop{argmax}\limits_{\theta}\log p(X|\theta)\\ &=\mathop{argmax}\limits_{\theta}\log\prod_{i=1}^{N}p(x_{i}|\theta)\\ &=\mathop{argmax}\limits_{\theta}\sum_{i=1}^{N}\log p(x_{i}|\theta) \end{aligned}$$
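    As a quick numerical check (a minimal sketch using only the standard library; the Bernoulli coin-flip data below is made up for illustration), maximizing the log-likelihood over a grid of $\theta$ values recovers the sample mean, matching the analytic result derived in §1.4 below:

    ```python
    import math

    # Hypothetical observations: 1 = heads, 0 = tails
    x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
    n = len(x)

    def log_likelihood(theta):
        """Sum of log p(x_i | theta) for a Bernoulli(theta) model."""
        return sum(xi * math.log(theta) + (1 - xi) * math.log(1 - theta) for xi in x)

    # Brute-force argmax over a grid, standing in for the analytic derivation
    grid = [i / 1000 for i in range(1, 1000)]
    theta_mle = max(grid, key=log_likelihood)

    print(theta_mle)   # 0.7, equal to the sample mean sum(x)/n
    ```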

1.3 The Bayesian School

  • The Bayesian school assumes the parameter $\theta$ follows a preset prior distribution $\theta \sim p(\theta)$.

  • Bayes' theorem links the prior and posterior distributions (maximum a posteriori, MAP):
    $$\begin{aligned} \theta_{MAP}&=\mathop{argmax}\limits_{\theta}p(\theta|X)\\ &=\mathop{argmax}\limits_{\theta}\frac{p(X|\theta)\cdot p(\theta)}{p(X)} \\ &\propto \mathop{argmax}\limits_{\theta}\,p(X|\theta)\cdot p(\theta) \end{aligned}$$
    where

    1. $p(\theta|X)$ is the posterior probability;
    2. $p(X|\theta)$ is the likelihood;
    3. $p(\theta)$ is the prior probability;
    4. $p(X) = \int_{\theta}p(X|\theta)\cdot p(\theta)\,d\theta$ is a definite integral, hence a fixed number.
  • The proportionality in the last step holds because the denominator does not depend on $\theta$ (it has been integrated out). After solving for $\theta$, evaluating $\frac{p(X|\theta)\cdot p(\theta)}{\int_{\theta}p(X|\theta)\cdot p(\theta)\,d\theta}$ gives the posterior distribution of the parameter; here $p(X|\theta)$ is the likelihood, i.e. our model distribution. With the posterior in hand, we can make Bayesian predictions:
    $$p(x_{new}|X)=\int_{\theta}p(x_{new}|\theta)\cdot p(\theta|X)\,d\theta$$
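    For the Bernoulli model with a Beta prior (the conjugate case treated in §1.4 below), this predictive integral has a closed form: the posterior is $Beta(\alpha+\sum x_i,\ \beta+n-\sum x_i)$ and $p(x_{new}=1|X) = \frac{\alpha+\sum x_i}{\alpha+\beta+n}$. A small sketch (the data and prior hyperparameters are illustrative):

    ```python
    # Conjugate Beta-Bernoulli: prior Beta(alpha, beta), data x_i in {0, 1}
    alpha, beta = 2.0, 2.0
    x = [1, 1, 0, 1, 0, 1, 1, 1]
    n, heads = len(x), sum(x)

    # Posterior is Beta(alpha + heads, beta + n - heads)
    post_a = alpha + heads
    post_b = beta + n - heads

    # Bayesian predictive p(x_new = 1 | X) = posterior mean of theta
    p_heads_next = post_a / (post_a + post_b)
    print(p_heads_next)  # (2 + 6) / (2 + 2 + 8) = 8/12 ≈ 0.667
    ```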


1.4 MLE and MAP for the Coin-Flipping Problem

  • Maximum likelihood estimation

    In the coin-flipping experiment, let 1 denote heads and 0 denote tails:
    $$x_i=\begin{cases} 1, & \text{heads} \\ 0, & \text{tails} \end{cases}$$

    Let the probability of heads be $\theta$, so the probability of tails is $1-\theta$, i.e. $x_i \sim B(1,\theta)$, with probability mass function
    $$P(X=x) = \theta^x(1-\theta)^{1-x} = \begin{cases} P(x=0) = 1-\theta \\ P(x=1) = \theta \end{cases}$$

    Likelihood function:
    $$\begin{aligned} L(\theta) & = P(X_1=x_1|\theta)\cdots P(X_n=x_n|\theta) \\ & = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} \end{aligned}$$
    Log-likelihood:
    $$\begin{aligned} \ln L(\theta) & = \ln \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} \\ & = \sum_{i=1}^n\left[ \ln\theta^{x_i} + \ln(1-\theta)^{1-x_i} \right]\\ & = \sum_{i=1}^n x_i\ln\theta + \sum_{i=1}^n(1-x_i)\ln(1-\theta) \\ & = \sum_{i=1}^n x_i\ln\theta + \left(n-\sum_{i=1}^n x_i\right)\ln(1-\theta) \end{aligned}$$
    The goal is $\max \ln L(\theta)$.

    Taking the derivative with respect to $\theta$:
    $$\frac{\partial\ln L(\theta)}{\partial \theta}=\frac{\sum_{i=1}^n x_i}{\theta}-\frac{n-\sum_{i=1}^n x_i}{1-\theta}$$
    Setting the derivative to zero:
    $$\frac{\sum_{i=1}^n x_i}{\theta}=\frac{n-\sum_{i=1}^n x_i}{1-\theta}$$
    which yields
    $$\hat{\theta} = \frac{1}{n}\sum_{i=1}^n x_i$$

  • Maximum a posteriori estimation

    Assume the prior is a Beta distribution:
    $$\pi(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$
    We want the posterior $P(\theta|x_1,x_2,\dots,x_n)$:
    $$\begin{aligned} P(\theta|x_1,x_2,\dots,x_n) & = \frac{P(\theta,x_1,x_2,\dots,x_n)}{P(x_1,x_2,\dots,x_n)} \\ & = \frac{\pi(\theta)\,p(x_1|\theta)\cdots p(x_n|\theta)}{\int P(\theta,x_1,x_2,\dots,x_n)\,d\theta} \\ &\propto \pi(\theta)\,p(x_1|\theta)\cdots p(x_n|\theta) \\ &\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} \\ & = \theta^{\sum x_i+\alpha-1}(1-\theta)^{n-\sum x_i+\beta-1} \end{aligned}$$
    Notes:

    1. $\int P(\theta,x_1,x_2,\dots,x_n)\,d\theta$ has integrated $\theta$ out, so it does not depend on $\theta$ and is a constant;
    2. $\propto$ means "proportional to";
    3. the constant $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ can likewise be dropped;
    4. $\theta^{\sum x_i+\alpha-1}(1-\theta)^{n-\sum x_i+\beta-1}$ is, up to normalization, a Beta distribution with parameters $\sum x_i+\alpha$ and $n-\sum x_i+\beta$.

    Now take
    $$L(\theta) = \theta^{\sum x_i+\alpha-1}(1-\theta)^{n-\sum x_i+\beta-1}$$
    with log-likelihood
    $$\ln L(\theta) = \left(\sum_{i=1}^n x_i+\alpha-1\right)\ln\theta + \left(n-\sum_{i=1}^n x_i+\beta-1\right)\ln(1-\theta)$$
    Taking the derivative with respect to $\theta$:
    $$\frac{\partial\ln L(\theta)}{\partial \theta} = \frac{\sum_{i=1}^n x_i+\alpha-1}{\theta} - \frac{n-\sum_{i=1}^n x_i+\beta-1}{1-\theta}$$
    Setting the derivative to zero:
    $$\frac{\sum_{i=1}^n x_i+\alpha-1}{\theta} = \frac{n-\sum_{i=1}^n x_i+\beta-1}{1-\theta}$$
    which yields
    $$\hat{\theta} = \frac{\sum_{i=1}^n x_i+\alpha-1}{n+\alpha+\beta-2}$$

  • Summary: MLE vs. MAP

    1. Comparing the MLE $\hat{\theta} = \frac{1}{n}\sum_{i=1}^n x_i$ with the MAP estimate $\hat{\theta} = \frac{\sum_{i=1}^n x_i+\alpha-1}{n+\alpha+\beta-2}$: as the sample size $n$ tends to infinity, the two estimates converge to the same value;
    2. the MAP estimate incorporates prior information about the parameter, but once $n$ is large enough, the prior information becomes negligible relative to the sample, so the result is approximately what estimating $\theta$ from the samples alone would give;
    3. in the extreme case $n=1$, MLE gives 0 or 1, whereas MAP gives $\frac{\alpha}{\alpha+\beta-1}$ or $\frac{\alpha-1}{\alpha+\beta-1}$. This is the advantage of MAP when the sample size is small.
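    The first point above can be checked numerically: with a fixed Beta prior, the gap between $\hat{\theta}_{MLE} = \frac{1}{n}\sum x_i$ and $\hat{\theta}_{MAP} = \frac{\sum x_i+\alpha-1}{n+\alpha+\beta-2}$ shrinks as $n$ grows. A deterministic sketch (the prior hyperparameters and the 30% heads rate are illustrative):

    ```python
    alpha, beta = 8.0, 4.0  # illustrative Beta prior hyperparameters
    gaps = {}
    for n in (10, 100, 10_000):
        s = round(0.3 * n)                    # suppose 30% of the n flips are heads
        theta_mle = s / n                     # sample mean
        theta_map = (s + alpha - 1) / (n + alpha + beta - 2)
        gaps[n] = abs(theta_map - theta_mle)

    print(gaps)  # {10: 0.2, 100: ~0.036, 10000: ~0.0004}
    ```

    With only 10 flips the prior pulls the MAP estimate well away from the sample mean; by 10,000 flips the two estimates agree to three decimal places.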

1.5 MLE and MAP for the Mean of a Normal Distribution

  • Problem

    Derive the MLE and MAP estimates of the mean of a normal distribution: the data $x_1,x_2,\dots,x_n$ come from $\mathcal{N}(\mu,\sigma^2)$ with $\sigma^2$ known.

    1. From the samples $x_1,x_2,\dots,x_n$, derive the MLE of $\mu$.
    2. Assuming the prior of $\mu$ is $\mathcal{N}(0,\tau^2)$, derive the MAP estimate of $\mu$ from the samples $x_1,x_2,\dots,x_n$.
  • MLE of $\mu$ from the samples $x_1,x_2,\dots,x_n$

    The probability density of each sample is
    $$f(x_i)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right),\quad i=1,2,\dots,n$$

    Likelihood function:
    $$\begin{aligned} L(x_i;\mu) & = \prod_{i=1}^n f(x_i;\mu)\\ & = (\sqrt{2\pi}\,\sigma)^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right) \end{aligned}$$
    Log-likelihood:
    $$\begin{aligned} \ln L(x_i;\mu) & = -n\ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2 \\ \Rightarrow \quad \frac{\partial\ln L(x_i;\mu)}{\partial \mu} & = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) \\ & = \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n\mu\right) \\ \text{Setting}\quad \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n\mu\right) & = 0 \\ \Rightarrow \quad \hat{\mu} & = \frac{1}{n}\sum_{i=1}^n x_i \end{aligned}$$

  • MAP estimate of $\mu$ from the samples $x_1,x_2,\dots,x_n$, assuming the prior of $\mu$ is $\mathcal{N}(0,\tau^2)$

    Prior density:
    $$f(\mu)=\frac{1}{\sqrt{2\pi}\,\tau}\exp\left(-\frac{\mu^2}{2\tau^2}\right)$$
    Posterior:
    $$\begin{aligned} P(\mu|x_1,x_2,\dots,x_n) & = \frac{P(\mu,x_1,x_2,\dots,x_n)}{P(x_1,x_2,\dots,x_n)} \\ & = \frac{f(\mu)\,p(x_1|\mu)\cdots p(x_n|\mu)}{\int P(\mu,x_1,x_2,\dots,x_n)\,d\mu} \\ &\propto f(\mu)\,p(x_1|\mu)\cdots p(x_n|\mu) \\ & = \frac{1}{\sqrt{2\pi}\,\tau}\exp\left(-\frac{\mu^2}{2\tau^2}\right)\prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) \end{aligned}$$
    Now take
    $$L(\mu) =\frac{1}{\sqrt{2\pi}\,\tau}\exp\left(-\frac{\mu^2}{2\tau^2}\right)\prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
    with log-posterior
    $$\begin{aligned} \ln P(\mu|x_1,x_2,\dots,x_n) & = -\ln(\sqrt{2\pi}\,\tau)-\frac{\mu^2}{2\tau^2} - n\ln(\sqrt{2\pi}\,\sigma)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2 \\ \Rightarrow \quad \frac{\partial\ln P(\mu|x_1,x_2,\dots,x_n)}{\partial \mu}& = -\frac{\mu}{\tau^2} + \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) \\ & = \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n\mu\right) - \frac{\mu}{\tau^2} \\ \text{Setting}\quad \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n\mu\right) - \frac{\mu}{\tau^2} & = 0 \\ \Rightarrow \quad \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n\mu\right) & = \frac{\mu}{\tau^2} \\ \Rightarrow \quad \hat{\mu} & = \frac{\tau^2\sum_{i=1}^n x_i}{\sigma^2+n\tau^2} = \frac{\sum_{i=1}^n x_i}{n+\frac{\sigma^2}{\tau^2}} \end{aligned}$$
    When $n$ is small, the Bayesian (MAP) estimate tends to be more accurate than the MLE.
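    The closed form $\hat{\mu}_{MAP} = \frac{\sum x_i}{n+\sigma^2/\tau^2}$ is the MLE mean shrunk toward the prior mean 0; the ratio $\sigma^2/\tau^2$ acts like extra "phantom" samples at 0. A small deterministic sketch ($\sigma^2$, $\tau^2$, and the data are illustrative):

    ```python
    sigma2 = 4.0   # known noise variance sigma^2
    tau2 = 1.0     # prior variance tau^2 for mu ~ N(0, tau^2)
    x = [2.0, 3.0, 2.5, 3.5, 2.0, 3.0]

    n = len(x)
    mu_mle = sum(x) / n                     # 16/6 ≈ 2.667
    mu_map = sum(x) / (n + sigma2 / tau2)   # 16/10 = 1.6, shrunk toward 0

    print(mu_mle, mu_map)  # the MAP estimate lies between 0 and the MLE
    ```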


1.6 The Naive Bayes Algorithm

  • A classification method based on Bayes' theorem and the conditional independence assumption on features.

  • The naive Bayes method and Bayesian estimation are different concepts.

  • Generative vs. discriminative models:
    $$\begin{cases} \text{generative model:}\ P(Y|X) = \frac{P(X,Y)}{P(X)},\ \text{with } X, Y \text{ random variables} \\ \text{discriminative model:}\ Y=f(X)\ \text{or}\ P(Y|X) \end{cases}$$

1.6.1 Learning and Classification with Naive Bayes

  • Input: a feature vector $x \in \mathcal{X} \subseteq \mathrm{R}^{n}$ representing an instance.

    Output: a class label $y_i \in \mathcal{Y}=\{c_1, c_2, \cdots, c_K\}$.

    $X$ is a random vector on the input space $\mathcal{X}$, and $Y$ is a random variable on the output space $\mathcal{Y}$; $P(X,Y)$ is the joint distribution of $X$ and $Y$. The training data set
    $$T = \{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\} = \{(x_i,y_i)\}_{i=1}^N$$
    is generated i.i.d. from $P(X,Y)$.

  • Prior distribution:
    $$P(Y=c_k),\quad k=1,2,\dots,K$$

  • Conditional distribution:
    $$P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)} \mid Y=c_k),\quad k=1,2,\cdots,K$$

  • Joint distribution:

    By the product rule $P(AB) = P(B)P(A|B)$, combining the prior and conditional distributions above gives the joint distribution $P(X,Y)$ (equivalently written $P(Y,X)$):
    $$\begin{aligned} P(Y,X) &= P(Y=c_k, X=x) \\ &= P(Y=c_k)P(X=x\mid Y=c_k) \end{aligned}$$

  • Generative model (the posterior probability):

    By the law of total probability and Bayes' theorem:
    $$\begin{aligned} P(B|A) &= \frac{P(AB)}{P(A)} = \frac{P(B)P(A|B)}{P(A)} = \frac{P(B)P(A|B)}{\sum P(B)P(A|B)} \\ \Rightarrow P(Y=c_k \mid X=x) &= \frac{P(Y=c_k, X=x)}{P(X=x)} \\ &= \frac{P(X=x \mid Y=c_k)\,P(Y=c_k)}{P(X=x)} \\ &= \frac{P(X=x \mid Y=c_k)\,P(Y=c_k)}{\sum_k P(X=x \mid Y=c_k)\,P(Y=c_k)} \end{aligned}$$

  • Model assumption: conditional independence

    The conditional distribution $P(X=x \mid Y=c_k)$ has an exponential number of parameters, so estimating it without assuming the features are conditionally independent is infeasible in practice.
    $$\begin{aligned} P(X=x \mid Y=c_k) &= P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)} \mid Y=c_k) \\ &= P(X^{(1)}=x^{(1)} \mid Y=c_k)\,P(X^{(2)}=x^{(2)} \mid Y=c_k)\cdots \\ &= \prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k) \end{aligned}$$
    Indeed, if $x^{(j)}$ can take $S_j$ values, $j=1,2,\cdots,n$, and $Y$ can take $K$ values, then the number of parameters is $K\prod_{j=1}^{n} S_j$.

    Naive Bayes imposes the conditional independence assumption on the conditional distribution. Since this is a fairly strong assumption, the method takes its name ("naive") from it.

  • Prediction criterion: maximize the posterior probability (proved below)

    Combining with conditional independence, the posterior gives
    $$\begin{aligned} y&=\arg\max_{c_k} P(Y=c_k \mid X=x),\quad\text{where}\\ P(Y=c_k \mid X=x) &= \frac{P(Y=c_k, X=x)}{P(X=x)} = \frac{P(X=x \mid Y=c_k)\,P(Y=c_k)}{\sum_k P(X=x \mid Y=c_k)\,P(Y=c_k)} \\ &=\frac{P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k)},\quad k=1,2,\cdots,K \end{aligned}$$
    This is the basic formula of naive Bayes classification. The naive Bayes classifier can thus be written as
    $$y=f(x)=\arg\max_{c_k} \frac{P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k)}$$
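    A minimal sketch of this classification rule on a tiny made-up categorical data set (the features, values, and labels are all illustrative), with probabilities estimated by the counting formulas of §1.6.3:

    ```python
    from collections import Counter, defaultdict

    # Hypothetical training data: each sample is (x^(1), x^(2)) with class label y
    X = [("S", "A"), ("S", "A"), ("M", "B"), ("L", "B"), ("M", "A"), ("L", "B")]
    y = [0, 0, 1, 1, 0, 1]
    N = len(y)

    n_class = Counter(y)                          # counts of each class c_k
    prior = {c: n_class[c] / N for c in n_class}  # P(Y = c_k)

    # Counts for the conditionals P(X^(j) = a | Y = c_k)
    cond = defaultdict(lambda: defaultdict(Counter))
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond[yi][j][v] += 1

    def posterior_scores(x_new):
        """Unnormalized P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k)."""
        scores = {}
        for c in prior:
            s = prior[c]
            for j, v in enumerate(x_new):
                s *= cond[c][j][v] / n_class[c]
            scores[c] = s
        return scores

    scores = posterior_scores(("M", "B"))
    y_hat = max(scores, key=scores.get)
    print(scores, y_hat)  # class 1 wins
    ```

    Note that $P(X^{(2)}=B \mid Y=0)$ is estimated as 0 here, which zeroes out class 0's entire score; this is exactly the failure mode that the Bayesian (Laplace-smoothed) estimates of §1.6.3 address.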


1.6.2 Maximizing the Posterior Probability

  • Naive Bayes assigns an instance to the class with the largest posterior probability. This is equivalent to minimizing the expected risk. Suppose we choose the 0-1 loss:
    $$L(Y, f(X))= \begin{cases}1, & Y \neq f(X) \\ 0, & Y=f(X)\end{cases}$$
    where $f(X)$ is the classification decision function. The expected risk is
    $$R_{\exp}(f)=E[L(Y, f(X))]$$
    Minimizing this expected loss is equivalent to maximizing the posterior:
    $$y =\arg\max_{c_k} P(Y=c_k \mid X=x)$$

  • Minimizing the expected risk
    $$\begin{aligned} \arg\min R_{\exp}(f) & = \arg\min E[L(Y, f(X))] \\ & = \arg\min \sum_Y\sum_X L(Y, f(X))\,P(X,Y) \\ & = \arg\min \sum_Y\sum_X L(Y, f(X))\,P(Y|X)\,P(X) \\ & = \arg\min \sum_X \Big\{\sum_Y L(Y, f(X))\,P(Y|X)\Big\}P(X) \end{aligned}$$
    The expectation is taken over the joint distribution $P(X,Y)$. Taking the conditional expectation,
    $$R_{\exp}(f)=E_X \sum_{k=1}^{K} L(Y=c_k, f(X))\,P(Y=c_k \mid X)$$
    To minimize the expected risk, it suffices to minimize pointwise for each $X=x$, which gives:
    $$\begin{aligned} f(x) &= \arg\min \sum_{k=1}^{K} L(Y=c_k, f(X))\,P(Y=c_k \mid X) \\ &= \arg\min \sum_{k=1}^{K} I[f(x)\neq c_k]\,P(Y=c_k \mid X) \qquad \big(\text{0-1 loss: } L(Y=c_k, f(X)) = I[f(x)\neq c_k]\big)\\ &= \arg\min \sum_{k=1}^{K} \big(1-I[f(x)= c_k]\big)\,P(Y=c_k \mid X) \\ &= \arg\min \Big\{\sum_{k=1}^{K}P(Y=c_k \mid X)-\sum_{k=1}^{K}I[f(x)= c_k]\,P(Y=c_k \mid X)\Big\} \\ &= \arg\min \Big\{1-\sum_{k=1}^{K}I[f(x)= c_k]\,P(Y=c_k \mid X)\Big\} \qquad \Big(\text{since } \textstyle\sum_{k=1}^{K}P(Y=c_k \mid X) = 1\Big)\\ &= \arg\max \sum_{k=1}^{K}I[f(x)= c_k]\,P(Y=c_k \mid X) \end{aligned}$$
    So assigning each instance to the class with the largest posterior probability is equivalent to minimizing the expected risk:
    $$f(x) = \arg\min R_{\exp}(f) = \arg\max \sum_{k=1}^{K}I[f(x)= c_k]\,P(Y=c_k \mid X)$$
    Since the sum picks out the single term with $f(x)=c_k$, maximizing it means choosing the $c_k$ with the largest posterior probability; the expected-risk-minimization criterion thus yields the posterior-maximization criterion:
    $$f(x) = \arg\max_{c_k}P(Y=c_k\mid X=x)$$


1.6.3 Parameter Estimation for Naive Bayes

  • From the derivation above,
    $$\begin{aligned} y&=\arg\max f(x)\\ &=\arg\max_{c_k} \frac{P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k)}\\ &\propto \arg\max_{c_k} P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k) \end{aligned}$$
    since the denominator is the same for every $c_k$. In naive Bayes, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)} \mid Y=c_k)$.

  • Maximum likelihood estimates:

    1. The MLE of the prior probability $P(Y=c_k)$ is
      $$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N},\quad k=1,2,\cdots,K$$

    2. The MLE of the conditional probability $P(X^{(j)}=a_{jl} \mid Y=c_k)$ is (where $\{a_{j1}, a_{j2}, \cdots, a_{jS_j}\}$ is the set of values the $j$-th feature $x^{(j)}$ can take):
      $$\begin{aligned} &P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\, y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)} \\ &j=1,2,\cdots,n;\quad l=1,2,\cdots,S_j;\quad k=1,2,\cdots,K \end{aligned}$$
      Here $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, and $I$ is the indicator function.

  • Bayesian estimation

    Maximum likelihood may produce probability estimates equal to 0, which distorts the posterior computation and biases the classification. The remedy is Bayesian estimation. Concretely:

    1. Bayesian estimate of the conditional probability:
      $$P_{\lambda}(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\, y_i=c_k)+\lambda}{\sum_{i=1}^{N} I(y_i=c_k)+S_j\lambda}$$
      where $\lambda \geqslant 0$. This is equivalent to adding a positive count $\lambda>0$ to each possible value of the random variable; $\lambda=0$ recovers the MLE. A common choice is $\lambda=1$, called Laplace smoothing (Laplacian smoothing). Clearly, for any $l=1,2,\cdots,S_j$ and $k=1,2,\cdots,K$,
      $$\begin{aligned} &P_{\lambda}(X^{(j)}=a_{jl} \mid Y=c_k)>0 \\ &\sum_{l=1}^{S_j} P_{\lambda}(X^{(j)}=a_{jl} \mid Y=c_k)=1 \end{aligned}$$
      Similarly:

    2. Bayesian estimate of the prior probability:
      $$P_{\lambda}(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}$$
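    A sketch of the smoothed estimates with $\lambda=1$ (Laplace smoothing) on hypothetical counts, where `S_j` is the number of values feature $j$ can take and `K` the number of classes:

    ```python
    # Hypothetical counts of I(x_i^(j) = a_jl, y_i = c_k) within one class c_k,
    # for a feature j with S_j = 3 possible values
    counts = {"S": 0, "M": 2, "L": 3}
    n_ck = sum(counts.values())          # sum of I(y_i = c_k) = 5
    S_j, lam = len(counts), 1.0

    smoothed = {a: (cnt + lam) / (n_ck + S_j * lam) for a, cnt in counts.items()}
    print(smoothed)  # {'S': 0.125, 'M': 0.375, 'L': 0.5}: positive and summing to 1

    # Smoothed class prior with K classes, N samples, and n_k samples in class c_k
    N, K, n_k = 12, 2, 5
    prior_k = (n_k + lam) / (N + K * lam)
    print(prior_k)  # 6/14 ≈ 0.4286
    ```

    The value "S" never appears in this class, yet its smoothed probability is 0.125 rather than 0, so it can no longer zero out a product of conditionals.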


1.6.4 The Naive Bayes Algorithm Procedure

  • Algorithm 4.1 (naïve Bayes algorithm)
    Input: training data $T=\{(x_i,y_i)\}_{i=1}^N$, where $x_i=(x_i^{(1)}, x_i^{(2)}, \cdots, x_i^{(n)})^{\mathrm{T}}$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)} \in \{a_{j1}, a_{j2}, \cdots, a_{jS_j}\}$, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $j=1,2,\cdots,n$, $l=1,2,\cdots,S_j$, $y_i \in \{c_1, c_2, \cdots, c_K\}$; an instance $x$;

    Output: the class of instance $x$.
    (1) Compute the prior and conditional probabilities:

    1. Prior probability

    $$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N},\quad k=1,2,\cdots,K$$

    2. Conditional probability

    $$P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\, y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)}\\ j=1,2,\cdots,n;\quad l=1,2,\cdots,S_j;\quad k=1,2,\cdots,K$$

    (2) For a given instance $x=(x^{(1)}, x^{(2)}, \cdots, x^{(n)})^{\mathrm{T}}$, compute
    $$P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k),\quad k=1,2,\cdots,K$$
    (3) Determine the class of instance $x$:
    $$y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)$$

