Multi-Variable Linear Regression Derivation

The general form of multi-variable linear regression is defined as:

$$
\begin{cases}
\hat y_i = f(x_i) = \theta_0 + \sum\limits_{j=1}^n \theta_j x_i^j \\
y_i = \hat y_i + \epsilon_i
\end{cases}
$$

where $\theta_0$ is the bias term, $x_i^j$ is the $j$-th feature of the $i$-th sample, $\theta_j$ is the weight of the $j$-th feature, $\hat y_i$ is the predicted value for the $i$-th sample $x_i$, $y_i$ is the actual value for $x_i$, and $\epsilon_i$ is the residual between the prediction $\hat y_i$ and the actual value $y_i$.

Letting $x_i^0 = 1$, the equation $\hat y_i = \theta_0 + \sum\limits_{j=1}^n \theta_j x_i^j$ can be rewritten as $\hat y_i = \theta^T x_i$, where $\theta = \begin{bmatrix}\theta_0 & \theta_1 & \dots & \theta_n\end{bmatrix}^T$ and $x_i = \begin{bmatrix}x_i^0 & x_i^1 & \dots & x_i^n\end{bmatrix}^T$.
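As a quick illustration (not part of the original derivation), the vectorized form can be checked with NumPy; the data and the names `X_raw` and `theta` below are made up for the example:

```python
import numpy as np

# Hypothetical data: m = 4 samples, n = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
m, n = X_raw.shape

# Prepend x_i^0 = 1 so the bias theta_0 folds into the dot product.
X = np.hstack([np.ones((m, 1)), X_raw])   # shape (m, n + 1)

theta = np.array([0.5, 1.0, -0.25])       # [theta_0, theta_1, ..., theta_n]

# y_hat_i = theta^T x_i for every sample, done as one matrix product.
y_hat = X @ theta
print(y_hat)
```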

Assume the samples $x_i$ are independent and identically distributed, and that each residual $\epsilon_i$ follows a normal distribution with mean $0$ and variance $\sigma^2$. The probability density of the residual is then:

$$
p(\epsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\epsilon_i^2}{2\sigma^2}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)
$$
[Figure: probability density of the normal distribution, peaked at $\mu = 0$]
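For concreteness, the residual density above can be evaluated directly in NumPy (a small sketch; the residual values and the default $\sigma$ are made up):

```python
import numpy as np

def residual_density(eps, sigma=1.0):
    """p(eps) = 1/(sqrt(2*pi)*sigma) * exp(-eps^2 / (2*sigma^2))."""
    return np.exp(-eps**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# The density is largest when the residual is near 0 and decays symmetrically.
print(residual_density(np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))
```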

Our goal is to make each $\vert \epsilon_i \vert$ as small as possible, and $p(\epsilon_i)$ is largest when $\epsilon_i$ is close to $\mu = 0$, as the figure above shows. Therefore, given the $x_i$ and $y_i$, we can use maximum likelihood estimation to find $\theta$:

$$
L(\theta) = \prod\limits_{i=1}^m p(y_i \mid x_i;\theta) = \prod\limits_{i=1}^m \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)
$$

L ( θ ) L(\theta) L(θ)取对数,得到 log ⁡ L ( θ ) \log L(\theta) logL(θ),连乘变成了累加,消掉了指数幂,减少了函数的复杂度。并且, L ( θ ) L(\theta) L(θ) log ⁡ L ( θ ) \log L(\theta) logL(θ)具有相同的变化趋势。因此,求 L ( θ ) L(\theta) L(θ)取得最大值时 θ \theta θ的取值等同于求 log ⁡ L ( θ ) \log L(\theta) logL(θ)取得最大值时 θ \theta θ的取值。
log ⁡ L ( θ ) = ∑ i = 1 m log ⁡ ( 1 2 π σ exp ⁡ ( − ( y i − θ T x i ) 2 2 σ 2 ) ) = ∑ i = 1 m ( log ⁡ 1 2 π σ − ( y i − θ T x i ) 2 2 σ 2 ) = ∑ i = 1 m log ⁡ 1 2 π σ − ∑ i = 1 m ( y i − θ T x i ) 2 2 σ 2 \begin{aligned} & \log L(\theta) = \sum\limits_{i=1}^m \log (\frac {1}{\sqrt{2\pi}\sigma} \exp({-\frac {(y_i-\theta^Tx_i)^2}{2 \sigma^2}}))\\ & = \sum\limits_{i=1}^m (\log \frac {1}{\sqrt{2\pi}\sigma} -\frac {(y_i-\theta^Tx_i)^2}{2 \sigma^2}) \\ & = \sum\limits_{i=1}^m \log \frac {1}{\sqrt{2\pi}\sigma} - \sum\limits_{i=1}^m \frac {(y_i-\theta^Tx_i)^2}{2 \sigma^2} \end{aligned} logL(θ)=i=1mlog(2π σ1exp(2σ2(yiθTxi)2))=i=1m(log2π σ12σ2(yiθTxi)2)=i=1mlog2π σ1i=1m2σ2(yiθTxi)2

Since $\sum\limits_{i=1}^m \log\frac{1}{\sqrt{2\pi}\,\sigma}$ and $2\sigma^2$ are constants, $\log L(\theta)$ attains its maximum exactly when $\sum\limits_{i=1}^m (y_i - \theta^T x_i)^2$ attains its minimum.
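A small numerical sanity check (not part of the original derivation; the synthetic data and $\sigma$ below are assumptions) confirms that the direct log-likelihood and the constant-minus-sum-of-squares form agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m = 50 samples, n = 2 features plus a bias column of ones.
m, n = 50, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
theta_true = np.array([1.0, 2.0, -3.0])
sigma = 0.5
y = X @ theta_true + rng.normal(scale=sigma, size=m)

theta = np.array([0.8, 1.9, -2.7])          # some candidate parameters
residuals = y - X @ theta

# Direct log-likelihood: sum of log Gaussian densities of the residuals.
log_L_direct = np.sum(np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
                      - residuals**2 / (2 * sigma**2))

# Decomposed form: m * log(1/(sqrt(2*pi)*sigma)) - SSE / (2*sigma^2).
log_L_split = (m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
               - np.sum(residuals**2) / (2 * sigma**2))

print(np.isclose(log_L_direct, log_L_split))   # True
```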

This gives us the least-squares cost function:

$$
J(\theta) = \sum\limits_{i=1}^m (y_i - \theta^T x_i)^2
$$
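A minimal NumPy version of this cost, with illustrative variable names (the design matrix `X` is assumed to carry the leading column of ones from above):

```python
import numpy as np

def least_squares_cost(theta, X, y):
    """J(theta) = sum_i (y_i - theta^T x_i)^2, with X the (m, n+1) design matrix."""
    residuals = y - X @ theta
    return float(residuals @ residuals)
```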

Since $J(\theta)$ is a convex quadratic in $\theta$, it attains its minimum where all of its partial derivatives are zero. We therefore differentiate $J(\theta)$ with respect to each component of $\theta$ and solve for the $\theta$ at which the gradient (written here as a row vector) vanishes:

$$
J_\theta = \frac{\partial J}{\partial \theta} = \begin{bmatrix} \frac{\partial J}{\partial \theta_0} & \frac{\partial J}{\partial \theta_1} & \dots & \frac{\partial J}{\partial \theta_n} \end{bmatrix} = \mathbf 0
$$

Take $\frac{\partial J}{\partial \theta_0}$ as an example (recall $\theta^T x_i = \sum\limits_{j=0}^n \theta_j x_i^j$ with $x_i^0 = 1$):

$$
\begin{aligned}
\frac{\partial J}{\partial \theta_0} &= \frac{\partial}{\partial \theta_0} \sum\limits_{i=1}^m (y_i - \theta^T x_i)^2 \\
&= \frac{\partial}{\partial \theta_0}\left[(y_1 - \theta^T x_1)^2 + (y_2 - \theta^T x_2)^2 + \dots + (y_m - \theta^T x_m)^2\right] \\
&= \frac{\partial}{\partial \theta_0}\left[\Big(y_1 - \sum\limits_{j=0}^n \theta_j x_1^j\Big)^2 + \Big(y_2 - \sum\limits_{j=0}^n \theta_j x_2^j\Big)^2 + \dots + \Big(y_m - \sum\limits_{j=0}^n \theta_j x_m^j\Big)^2\right] \\
&= 2\Big(y_1 - \sum\limits_{j=0}^n \theta_j x_1^j\Big)(-x_1^0) + 2\Big(y_2 - \sum\limits_{j=0}^n \theta_j x_2^j\Big)(-x_2^0) + \dots + 2\Big(y_m - \sum\limits_{j=0}^n \theta_j x_m^j\Big)(-x_m^0) \\
&= -2 \begin{bmatrix} y_1 - \sum\limits_{j=0}^n\theta_j x_1^j & y_2 - \sum\limits_{j=0}^n\theta_j x_2^j & \dots & y_m - \sum\limits_{j=0}^n\theta_j x_m^j \end{bmatrix} \begin{bmatrix} x_1^0 & x_2^0 & \dots & x_m^0 \end{bmatrix}^T \\
&= -2 \begin{bmatrix} y_1 - \theta^T x_1 & y_2 - \theta^T x_2 & \dots & y_m - \theta^T x_m \end{bmatrix} \begin{bmatrix} x_1^0 & x_2^0 & \dots & x_m^0 \end{bmatrix}^T
\end{aligned}
$$
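This expression can be checked against a central-difference approximation of $\frac{\partial J}{\partial \theta_0}$ on random data; the sketch below uses made-up data and names:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # column 0 holds x_i^0 = 1
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

def J(t):
    r = y - X @ t
    return r @ r

# Analytic form derived above: dJ/dtheta_0 = -2 * sum_i (y_i - theta^T x_i) * x_i^0
analytic = -2.0 * (y - X @ theta) @ X[:, 0]

# Central finite difference in the theta_0 direction.
eps = 1e-6
e0 = np.zeros(n + 1)
e0[0] = 1.0
numeric = (J(theta + eps * e0) - J(theta - eps * e0)) / (2 * eps)

print(np.isclose(analytic, numeric))   # True, up to floating-point error
```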

In the same way, we obtain the partial derivatives $\frac{\partial J}{\partial \theta_1}$ through $\frac{\partial J}{\partial \theta_n}$:

$$
\begin{aligned}
\frac{\partial J}{\partial \theta_0} &= -2 \begin{bmatrix} y_1 - \theta^T x_1 & y_2 - \theta^T x_2 & \dots & y_m - \theta^T x_m \end{bmatrix} \begin{bmatrix} x_1^0 & x_2^0 & \dots & x_m^0 \end{bmatrix}^T \\
\frac{\partial J}{\partial \theta_1} &= -2 \begin{bmatrix} y_1 - \theta^T x_1 & y_2 - \theta^T x_2 & \dots & y_m - \theta^T x_m \end{bmatrix} \begin{bmatrix} x_1^1 & x_2^1 & \dots & x_m^1 \end{bmatrix}^T \\
&\;\;\vdots \\
\frac{\partial J}{\partial \theta_n} &= -2 \begin{bmatrix} y_1 - \theta^T x_1 & y_2 - \theta^T x_2 & \dots & y_m - \theta^T x_m \end{bmatrix} \begin{bmatrix} x_1^n & x_2^n & \dots & x_m^n \end{bmatrix}^T
\end{aligned}
$$

Writing these in matrix form, with $\mathbf X = \begin{bmatrix} x_1 & x_2 & \dots & x_m \end{bmatrix}^T$ the $m \times (n+1)$ design matrix and $\mathbf Y = \begin{bmatrix} y_1 & y_2 & \dots & y_m \end{bmatrix}^T$:

$$
\begin{aligned}
J_\theta = \frac{\partial J}{\partial \theta} &= \begin{bmatrix} \frac{\partial J}{\partial \theta_0} & \frac{\partial J}{\partial \theta_1} & \dots & \frac{\partial J}{\partial \theta_n} \end{bmatrix} \\
&= -2 \begin{bmatrix} y_1 - \theta^T x_1 & y_2 - \theta^T x_2 & \dots & y_m - \theta^T x_m \end{bmatrix} \begin{bmatrix} x_1^0 & x_1^1 & \dots & x_1^n \\ x_2^0 & x_2^1 & \dots & x_2^n \\ \vdots & & & \vdots \\ x_m^0 & x_m^1 & \dots & x_m^n \end{bmatrix} \\
&= -2 \left( \begin{bmatrix} y_1 & y_2 & \dots & y_m \end{bmatrix} - \begin{bmatrix} \theta^T x_1 & \theta^T x_2 & \dots & \theta^T x_m \end{bmatrix} \right) \mathbf X \\
&= -2 \left( \mathbf Y^T - \theta^T \begin{bmatrix} x_1 & x_2 & \dots & x_m \end{bmatrix} \right) \mathbf X \\
&= -2 (\mathbf Y^T - \theta^T \mathbf X^T) \mathbf X
\end{aligned}
$$

Setting $J_\theta = -2(\mathbf Y^T - \theta^T \mathbf X^T)\mathbf X = \mathbf 0$ and assuming $\mathbf X^T \mathbf X$ is invertible, we obtain:

$$
\begin{aligned}
\mathbf Y^T \mathbf X &= \theta^T \mathbf X^T \mathbf X \\
\theta^T &= \mathbf Y^T \mathbf X (\mathbf X^T \mathbf X)^{-1} \\
\theta &= (\theta^T)^T = \left((\mathbf X^T \mathbf X)^{-1}\right)^T \mathbf X^T \mathbf Y = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf Y
\end{aligned}
$$

where the last step uses the symmetry of $(\mathbf X^T \mathbf X)^{-1}$. Note that $\mathbf X$ itself is in general not square, so it cannot simply be inverted on its own.

Therefore, when $\theta = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf Y$, the model $\hat y_i = \theta^T x_i$ best describes, in the least-squares sense, the linear relationship between $\mathbf X$ and $\mathbf Y$.
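In code, this closed-form solution is usually obtained by solving the normal equations $\mathbf X^T \mathbf X\,\theta = \mathbf X^T \mathbf Y$ rather than by forming the inverse explicitly; a sketch on synthetic data (all names and values below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
theta_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ theta_true + rng.normal(scale=0.1, size=m)

# theta = (X^T X)^{-1} X^T Y, computed by solving X^T X theta = X^T y
# (numerically more stable than inverting X^T X directly).
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem and also handles
# the case where X^T X is singular.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_hat, theta_lstsq))   # True
print(theta_hat)                             # close to theta_true
```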

However, not every objective encountered in practice can be minimized by setting its partial derivatives to zero. For example, for $z = y^2 - x^2$ (see the figure below), the point $x = 0,\ y = 0$ where $\frac{\partial z}{\partial x} = 0$ and $\frac{\partial z}{\partial y} = 0$ is a saddle point rather than an extremum. For this reason, gradient descent or Newton's method is usually used in practice to find an approximate minimizer of the objective function. Once the regression parameters have been obtained, the coefficient of determination $R^2$ can be computed to evaluate the goodness of fit of the regression, as sketched after the figure below.
[Figure: saddle point of $z = y^2 - x^2$ at the origin]
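A hedged sketch of gradient descent on $J(\theta)$ together with the $R^2$ computation; the learning rate, iteration count, and synthetic data are all assumptions rather than values from the original post:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 200, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.2, size=m)

# Batch gradient descent on J(theta) = sum_i (y_i - theta^T x_i)^2.
theta = np.zeros(n + 1)
lr = 1e-3                       # step size (assumed; must be small enough to converge)
for _ in range(5000):
    grad = -2.0 * (y - X @ theta) @ X   # the row-vector gradient derived above
    theta -= lr * grad

# Coefficient of determination R^2 = 1 - SS_res / SS_tot.
y_hat = X @ theta
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(theta, r2)                # theta close to theta_true, R^2 close to 1
```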
