Machine Learning Notes: Bayesian Linear Regression (2) — Derivation of the Inference Task


Introduction

The previous note introduced the task that Bayesian methods address in linear regression; this note derives the inference task of Bayesian linear regression.

Review: Bayesian Linear Regression — The Inference Task

The inference task in Bayesian linear regression is, at its core, solving for the posterior distribution $\mathcal P(\mathcal W \mid Data)$ of the model parameters $\mathcal W$.
Here $Data$ denotes the data set, consisting of the sample collection $\mathcal X$ and the corresponding label collection $\mathcal Y$.
$$\begin{aligned} \mathcal P(\mathcal W \mid Data) & = \frac{\mathcal P(\mathcal Y \mid \mathcal W,\mathcal X) \cdot \mathcal P(\mathcal W)}{\int_{\mathcal W} \mathcal P(\mathcal Y \mid \mathcal W,\mathcal X) \cdot \mathcal P(\mathcal W) \, d\mathcal W} \\ & \propto \mathcal P(\mathcal Y \mid \mathcal W,\mathcal X) \cdot \mathcal P(\mathcal W) \end{aligned}$$
Here $\mathcal P(\mathcal Y \mid \mathcal W,\mathcal X)$ is the likelihood. By the definition of the linear regression model, $\mathcal P(\mathcal Y \mid \mathcal W,\mathcal X)$ follows a Gaussian distribution.
The samples are assumed to be independent and identically distributed.
$$\begin{aligned} \mathcal Y & = \mathcal W^T\mathcal X + \epsilon, \quad \epsilon \sim \mathcal N(0,\sigma^2) \\ \mathcal P(\mathcal Y \mid \mathcal W,\mathcal X) & \sim \mathcal N(\mathcal W^T \mathcal X,\sigma^2) = \prod_{i=1}^N \mathcal N(\mathcal W^Tx^{(i)},\sigma^2) \end{aligned}$$
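As a concrete sketch of this observation model (all dimensions, the seed, and the parameter values below are illustrative assumptions, not part of the derivation):

```python
import numpy as np

# Simulate the linear-Gaussian model Y = X W + eps, eps ~ N(0, sigma^2).
rng = np.random.default_rng(0)
N, p = 100, 3
sigma = 0.5
W_true = np.array([1.0, -2.0, 0.5])          # hypothetical ground-truth weights
X = rng.normal(size=(N, p))                  # design matrix, rows are samples x^(i)
Y = X @ W_true + rng.normal(0.0, sigma, N)   # noisy observations y^(i)

# Each y^(i) | W, x^(i) is a one-dimensional Gaussian N(W^T x^(i), sigma^2),
# so the residuals against the true weights should scatter with std near sigma.
residuals = Y - X @ W_true
print(residuals.std())
```

The printed value should land close to `sigma = 0.5`, up to sampling noise.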
And $\mathcal P(\mathcal W)$ is the prior distribution, i.e., the distribution assumed before inference. Here $\mathcal P(\mathcal W)$ is likewise taken to be Gaussian.
Strictly, the prior would be written $\mathcal P(\mathcal W \mid \mathcal X)$, but $\mathcal W$ is independent of the samples $\mathcal X$, so the conditioning is omitted.
$$\mathcal P(\mathcal W) \sim \mathcal N(0,\Sigma_{prior})$$
By the conjugacy of exponential-family distributions, and in particular the self-conjugacy of the Gaussian, the posterior $\mathcal P(\mathcal W \mid Data)$ is also Gaussian. Denote it $\mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W})$; concretely:
$$\begin{aligned} \mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W}) & \propto \mathcal N(\mathcal W^T\mathcal X,\sigma^2) \cdot \mathcal N(0,\Sigma_{prior}) \\ & = \left[\prod_{i=1}^N \mathcal N(y^{(i)} \mid \mathcal W^Tx^{(i)},\sigma^2)\right] \cdot \mathcal N(0,\Sigma_{prior}) \end{aligned}$$

The goal of the inference task is thus to determine the form of $\mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W})$, i.e., to solve for the distribution parameters $\mu_{\mathcal W},\Sigma_{\mathcal W}$.

Derivation

First, examine the likelihood and expand it.
Note that each $\mathcal N(y^{(i)} \mid \mathcal W^Tx^{(i)},\sigma^2)\;(i=1,2,\cdots,N)$ is a one-dimensional Gaussian.

$$\begin{aligned} \mathcal P(\mathcal Y \mid \mathcal W,\mathcal X) & \sim \prod_{i=1}^N \mathcal N(y^{(i)} \mid \mathcal W^Tx^{(i)},\sigma^2) \\ & = \prod_{i=1}^N \frac{1}{\sigma \sqrt{2\pi}} \exp\left[-\frac{1}{2 \sigma^2} \left(y^{(i)} - \mathcal W^T x^{(i)}\right)^2\right] \end{aligned}$$
Moving the product $\prod$ into the $\exp$, the expression can be rewritten using matrix multiplication.
The key step is transforming $\sum_{i=1}^N \left(y^{(i)} - \mathcal W^Tx^{(i)}\right)^2$ as follows:
$$\begin{aligned} \sum_{i=1}^N \left(y^{(i)} - \mathcal W^Tx^{(i)}\right)^2 & = \left(y^{(1)} - \mathcal W^Tx^{(1)},\cdots,y^{(N)} - \mathcal W^Tx^{(N)}\right) \begin{pmatrix}y^{(1)} - \mathcal W^Tx^{(1)} \\ \vdots \\ y^{(N)} - \mathcal W^Tx^{(N)}\end{pmatrix} \\ & = (\mathcal Y^T - \mathcal W^T\mathcal X^T)(\mathcal Y - \mathcal X\mathcal W) \\ & = (\mathcal Y - \mathcal X \mathcal W)^T(\mathcal Y -\mathcal X \mathcal W) \end{aligned}$$
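This identity is easy to confirm numerically; a minimal check with arbitrary (assumed) shapes and random values:

```python
import numpy as np

# Check that the scalar sum of squared residuals equals the matrix form
# (Y - X W)^T (Y - X W) for arbitrary X, Y, W.
rng = np.random.default_rng(1)
N, p = 50, 4
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)
W = rng.normal(size=p)

lhs = sum((Y[i] - W @ X[i]) ** 2 for i in range(N))  # elementwise sum over samples
rhs = (Y - X @ W) @ (Y - X @ W)                      # matrix form
print(np.isclose(lhs, rhs))  # → True
```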
Since $\frac{1}{2\sigma^2}$ does not depend on $i$, it can be pulled outside the sum; $\mathcal I$ denotes the identity matrix.
$$\begin{aligned} & = \frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N} \exp \left[-\frac{1}{2\sigma^2} \sum_{i=1}^N \left(y^{(i)} - \mathcal W^Tx^{(i)}\right)^2\right] \\ & = \frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N} \exp \left[- \frac{1}{2} (\mathcal Y - \mathcal X \mathcal W)^T \sigma^{-2} \mathcal I(\mathcal Y - \mathcal X \mathcal W)\right] \end{aligned}$$
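The equivalence between the product of $N$ one-dimensional Gaussian densities and this single multivariate density with covariance $\sigma^2\mathcal I$ can also be verified numerically (in log space for stability; all values below are illustrative):

```python
import numpy as np

# Compare: sum of 1-D Gaussian log-densities N(y^(i) | W^T x^(i), sigma^2)
# versus the multivariate Gaussian log-density N(Y | X W, sigma^2 I).
rng = np.random.default_rng(2)
N, p, sigma = 20, 3, 0.7
X = rng.normal(size=(N, p))
W = rng.normal(size=p)
Y = X @ W + rng.normal(0.0, sigma, N)

mu = X @ W
# sum over i of log N(y^(i) | mu_i, sigma^2)
log_prod = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (Y - mu) ** 2 / (2 * sigma**2))
# log N(Y | mu, sigma^2 I): normalizer (2 pi)^(N/2) sigma^N, quadratic form with sigma^-2 I
log_mvn = (-0.5 * N * np.log(2 * np.pi) - N * np.log(sigma)
           - 0.5 * (Y - mu) @ (Y - mu) / sigma**2)
print(np.isclose(log_prod, log_mvn))  # → True
```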
Observe that the result is again in the form of a Gaussian density, which confirms that the likelihood $\mathcal P(\mathcal Y \mid \mathcal W,\mathcal X)$ is indeed Gaussian. It can be written compactly as:
Note that the middle term $\sigma^{-2} \mathcal I$ is the precision matrix.
$$\mathcal P(\mathcal Y \mid \mathcal W,\mathcal X) \sim \mathcal N(\mathcal X\mathcal W,\sigma^2 \mathcal I)$$
The posterior $\mathcal P(\mathcal W \mid Data)$ can therefore be written as:
$$\mathcal P(\mathcal W \mid Data) \propto \mathcal N(\mathcal X \mathcal W,\sigma^2 \mathcal I) \cdot \mathcal N(0,\Sigma_{prior})$$
Back to the main question: how do we solve for $\mu_{\mathcal W},\Sigma_{\mathcal W}$?
Transform the expression as follows.
Only the terms involving $\mathcal W$ matter here; everything else is treated as a constant.
$$\begin{aligned} \mathcal P(\mathcal W \mid Data) & \propto \left\{ \frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N} \exp \left[- \frac{1}{2} (\mathcal Y - \mathcal X \mathcal W)^T \sigma^{-2} \mathcal I(\mathcal Y - \mathcal X \mathcal W)\right] \right\} \cdot \left\{\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma_{prior}|^{\frac{1}{2}}} \exp \left[ - \frac{1}{2} \mathcal W^T \Sigma_{prior}^{-1}\mathcal W \right]\right\} \\ & \propto \exp \left[- \frac{1}{2} (\mathcal Y - \mathcal X \mathcal W)^T \sigma^{-2} \mathcal I(\mathcal Y - \mathcal X \mathcal W)\right] \cdot \exp \left[- \frac{1}{2} \mathcal W^T \Sigma_{prior}^{-1}\mathcal W\right] \\ & = \exp \left\{-\frac{1}{2\sigma^2}(\mathcal Y^T - \mathcal W^T\mathcal X^T)(\mathcal Y - \mathcal X\mathcal W) - \frac{1}{2} \mathcal W^T\Sigma_{prior}^{-1} \mathcal W\right\} \end{aligned}$$
Strategy: complete the square, bringing the exponent into the form $-\frac{1}{2}(\mathcal W - \mu_{\mathcal W})^T\Sigma_{\mathcal W}^{-1}(\mathcal W - \mu_{\mathcal W})$, from which $\mu_{\mathcal W}$ and $\Sigma_{\mathcal W}^{-1}$ can be read off.
First expand $-\frac{1}{2}(\mathcal W - \mu_{\mathcal W})^T\Sigma_{\mathcal W}^{-1}(\mathcal W - \mu_{\mathcal W})$; denote it by $\Delta$.
Since $\mu_{\mathcal W}^T \Sigma_{\mathcal W}^{-1} \mathcal W$ and $\mathcal W^T\Sigma_{\mathcal W}^{-1}\mu_{\mathcal W}$ are transposes of each other and both scalars, we have $\mu_{\mathcal W}^T \Sigma_{\mathcal W}^{-1} \mathcal W = \mathcal W^T\Sigma_{\mathcal W}^{-1}\mu_{\mathcal W}$.
$$\begin{aligned} \Delta & = -\frac{1}{2} \left[\mathcal W^T\Sigma_{\mathcal W}^{-1} \mathcal W - \mu_{\mathcal W}^T \Sigma_{\mathcal W}^{-1} \mathcal W - \mathcal W^T\Sigma_{\mathcal W}^{-1}\mu_{\mathcal W} + \mu_{\mathcal W}^T\Sigma_{\mathcal W}^{-1} \mu_{\mathcal W}\right] \\ & = -\frac{1}{2} \left[\mathcal W^T\Sigma_{\mathcal W}^{-1} \mathcal W - 2 \mu_{\mathcal W}^T \Sigma_{\mathcal W}^{-1} \mathcal W + \mu_{\mathcal W}^T\Sigma_{\mathcal W}^{-1} \mu_{\mathcal W}\right] \end{aligned}$$
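This expansion relies on the symmetry of $\Sigma_{\mathcal W}^{-1}$; a quick numerical sanity check with an arbitrary symmetric positive-definite precision matrix (all values assumed for illustration):

```python
import numpy as np

# Verify: -(1/2)(W - mu)^T S^{-1} (W - mu)
#       = -(1/2)[W^T S^{-1} W - 2 mu^T S^{-1} W + mu^T S^{-1} mu]
rng = np.random.default_rng(3)
p = 4
M = rng.normal(size=(p, p))
Sinv = M @ M.T + p * np.eye(p)   # symmetric positive-definite precision matrix
W = rng.normal(size=p)
mu = rng.normal(size=p)

lhs = -0.5 * (W - mu) @ Sinv @ (W - mu)
rhs = -0.5 * (W @ Sinv @ W - 2 * mu @ Sinv @ W + mu @ Sinv @ mu)
print(np.isclose(lhs, rhs))  # → True
```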
The quadratic term is $-\frac{1}{2}\mathcal W^T\Sigma_{\mathcal W}^{-1} \mathcal W$, the linear term is $\mu_{\mathcal W}^T \Sigma_{\mathcal W}^{-1} \mathcal W$, and the constant term is $-\frac{1}{2}\mu_{\mathcal W}^T\Sigma_{\mathcal W}^{-1} \mu_{\mathcal W}$. We now match these three terms against the corresponding terms of the target expression.
Fully expanding the earlier expression:
Since $\mathcal Y^T\mathcal X\mathcal W$ and $\mathcal W^T\mathcal X^T\mathcal Y$ are transposes of each other and both scalars, we have $\mathcal Y^T\mathcal X\mathcal W = \mathcal W^T\mathcal X^T\mathcal Y$.
$$\begin{aligned} \mathcal P(\mathcal W \mid Data) & \propto \exp \left\{- \frac{1}{2\sigma^2} (\mathcal Y^T\mathcal Y - \mathcal Y^T\mathcal X\mathcal W - \mathcal W^T\mathcal X^T\mathcal Y + \mathcal W^T\mathcal X^T\mathcal X\mathcal W) - \frac{1}{2} \mathcal W^T\Sigma_{prior}^{-1}\mathcal W\right\} \\ & = \exp\left\{- \frac{1}{2\sigma^2} \left(\mathcal Y^T\mathcal Y - 2\mathcal Y^T\mathcal X\mathcal W + \mathcal W^T\mathcal X^T\mathcal X\mathcal W\right)- \frac{1}{2} \mathcal W^T\Sigma_{prior}^{-1}\mathcal W\right\} \end{aligned}$$

  • Observe: the quadratic terms in this expression are
    $$- \frac{1}{2\sigma^2} \mathcal W^T\mathcal X^T\mathcal X\mathcal W - \frac{1}{2} \mathcal W^T\Sigma_{prior}^{-1}\mathcal W = - \frac{1}{2} \left[\mathcal W^T \left(\sigma^{-2} \mathcal X^T\mathcal X + \Sigma_{prior}^{-1}\right) \mathcal W\right]$$
    Comparing with $\Delta$, we find $\Sigma_{\mathcal W}^{-1} = \sigma^{-2} \mathcal X^T\mathcal X + \Sigma_{prior}^{-1}$.
    Let $\mathcal A = \Sigma_{\mathcal W}^{-1}$.
    $$\begin{cases} -\frac{1}{2}\left[\mathcal W^T \left(\sigma^{-2} \mathcal X^T\mathcal X + \Sigma_{prior}^{-1}\right) \mathcal W\right] \\ -\frac{1}{2}\mathcal W^T\Sigma_{\mathcal W}^{-1} \mathcal W \end{cases}$$
  • Likewise, there is exactly one linear term:
    $$- \frac{1}{2\sigma^2} \cdot (-2)\mathcal Y^T\mathcal X\mathcal W = \frac{\mathcal Y^T\mathcal X}{\sigma^2}\mathcal W$$
    Comparing with $\Delta$, we find $\mu_{\mathcal W}^T\Sigma_{\mathcal W}^{-1} = \mu_{\mathcal W}^T \mathcal A = \frac{\mathcal Y^T\mathcal X}{\sigma^2}$.
    $$\begin{cases} \frac{\mathcal Y^T\mathcal X}{\sigma^2}\mathcal W \\ \mu_{\mathcal W}^T \Sigma_{\mathcal W}^{-1} \mathcal W \end{cases}$$

There is no need to examine the constant term, since only $\mu_{\mathcal W}$ and $\Sigma_{\mathcal W}$ are required, and we already have two equations:
$$\begin{cases} \mu_{\mathcal W}^T \Sigma_{\mathcal W}^{-1} = \frac{\mathcal Y^T\mathcal X} {\sigma^2} \\ \Sigma_{\mathcal W}^{-1} = \mathcal A \end{cases}$$
Solving them (transposing the first equation and using the symmetry of $\mathcal A$) gives:
$$\begin{cases} \mu_{\mathcal W} = \frac{\mathcal A^{-1}\mathcal X^T\mathcal Y}{\sigma^2} \\ \Sigma_{\mathcal W}^{-1} = \mathcal A \end{cases}$$

With $\mu_{\mathcal W}$ and $\Sigma_{\mathcal W}^{-1}$ both determined, the posterior distribution $\mathcal P(\mathcal W \mid Data)$ is:
$$\mathcal P(\mathcal W \mid Data) \sim \mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W}) \quad \begin{cases} \mu_{\mathcal W} = \frac{\mathcal A^{-1}\mathcal X^T\mathcal Y}{\sigma^2} \\ \Sigma_{\mathcal W} = \mathcal A^{-1} \\ \mathcal A = \frac{\mathcal X^T\mathcal X}{\sigma^2} + \Sigma_{prior}^{-1} \end{cases}$$
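The final result can be sketched end to end in a few lines of NumPy. Everything below (dimensions, seed, true weights, the choice $\Sigma_{prior} = \mathcal I$) is an illustrative assumption:

```python
import numpy as np

# Posterior of W: A = X^T X / sigma^2 + Sigma_prior^{-1},
# Sigma_W = A^{-1}, mu_W = A^{-1} X^T Y / sigma^2.
rng = np.random.default_rng(4)
N, p, sigma = 200, 3, 0.3
W_true = np.array([2.0, -1.0, 0.5])          # hypothetical ground truth
X = rng.normal(size=(N, p))
Y = X @ W_true + rng.normal(0.0, sigma, N)
Sigma_prior = np.eye(p)                      # prior W ~ N(0, I)

A = X.T @ X / sigma**2 + np.linalg.inv(Sigma_prior)
Sigma_W = np.linalg.inv(A)
mu_W = Sigma_W @ X.T @ Y / sigma**2

# With plenty of data the posterior mean should sit near W_true,
# and the posterior covariance should shrink.
print(mu_W)
```

With $N = 200$ samples the printed posterior mean lands close to `W_true`; as $\Sigma_{prior}^{-1} \to 0$ (a flat prior) the formula reduces to the ordinary least-squares solution.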

The next note will cover the prediction task.

Reference:
Machine Learning — Bayesian Linear Regression (3) — Inference Derivation
