Multivariate Linear Regression: Formula Derivation (Watermelon Book)

Contents

1. Overview of Linear Regression:

2. Mathematical Preliminaries:

3. Proof that the loss function E_{\hat{w}} is a convex function of \hat{w} :

4. Solving for \hat{w} :


 

 

1. Overview of Linear Regression:

In the previous post, https://blog.csdn.net/qq_42185999/article/details/102941535 , we derived the simplest case of linear regression, in which the input has only a single attribute. We now derive the more general case, in which each sample is described by d attributes.

 

Given a dataset D=\left \{ \left ( x_{1} , y_{1}\right ) , \left ( x_{2}, y_{2} \right ) , \ldots , \left ( x_{m}, y_{m} \right ) \right \} , where x_{i}=\left ( x_{i1},x_{i2}, \ldots ,x_{id} \right ) and y_{i}\in R , linear regression tries to learn f\left ( x_{i} \right ) = w^{T}*x_{i} + b such that f \left ( x_{i} \right ) \simeq y_{i} . This is called "multivariate linear regression".

 

For ease of discussion, we absorb b into the weight vector, writing \hat{w}=\left ( w;b \right ) , and correspondingly augment each sample as \hat{x_{i}}=\left ( x_{i};1 \right ) . Then:

f\left ( x_{i} \right )     =      w^{T}*x_{i} + b             

              =      \begin{pmatrix} w_{1} & w_{2} & \cdots & w_{d} \end{pmatrix} \begin{pmatrix} x_{i1}\\ x_{i2}\\ \vdots \\ x_{id} \end{pmatrix} + b

              =      w_{1}*x_{i1} + w_{2}*x_{i2} + \cdots + w_{d} * x_{id} + b

              =      w_{1}*x_{i1} + w_{2}*x_{i2} + \cdots + w_{d} * x_{id} + w_{d+1}*1

              =       \begin{pmatrix} w_{1} & w_{2} & \cdots & w_{d} & w_{d+1} \end{pmatrix} \begin{pmatrix} x_{i1}\\ x_{i2}\\ \vdots \\ x_{id} \\ 1 \end{pmatrix}

              =       \hat{w}^{T}* \hat{ x_{i} }

 

Correspondingly, represent the dataset D as an m*(d+1) matrix X, in which each row corresponds to one sample: the first d elements of a row hold that sample's d attribute values, and the last element is fixed to 1, i.e.:

                            X= \begin{pmatrix} x_{11} &x_{12} & \cdots & x_{1d} & 1 \\ x_{21} & x_{22} & \cdots & x_{2d} & 1 \\ \vdots & \vdots & \ddots & \vdots &\vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} & 1 \end{pmatrix} = \begin{pmatrix} x_{1}^{T} & 1 \\ x_{2}^{T} & 1 \\ \vdots & \vdots \\ x_{m}^{T} & 1 \end{pmatrix}  = \begin{pmatrix} \hat{x}_{1}^{T} \\ \hat{x}_{2}^{T} \\ \vdots \\ \hat{x}_{m}^{T} \end{pmatrix}

Also write the labels in vector form: y=\left ( y_{1},y_{2}, \ldots ,y_{m} \right )^{T}

Then the loss function is: E_{\hat{w}}    =     \sum_{i=1}^{m} \left ( y_{i} - \hat{w}^{T}*\hat{x_{i}} \right )^{2}

                              =     \left ( y_{1} - \hat{w}^{T}*\hat{x_{1}} \right )^{2} + \left ( y_{2} - \hat{w}^{T}*\hat{x_{2}} \right )^{2} + \cdots + \left ( y_{m} - \hat{w}^{T}*\hat{x_{m}} \right )^{2}

                              =     \begin{pmatrix} y_{1} - \hat{w}^{T}*\hat{x_{1}} & y_{2} - \hat{w}^{T}*\hat{x_{2}} & \cdots & y_{m} - \hat{w}^{T}*\hat{x_{m}} \end{pmatrix} \begin{pmatrix} y_{1} - \hat{w}^{T}*\hat{x_{1}}\\ y_{2} - \hat{w}^{T}*\hat{x_{2}} \\ \vdots \\ y_{m} - \hat{w}^{T}*\hat{x_{m}} \end{pmatrix}

Now simplify this expression:

        \begin{pmatrix} y_{1} - \hat{w}^{T}*\hat{x_{1}}\\ y_{2} - \hat{w}^{T}*\hat{x_{ 2}} \\ \vdots \\ y_{m} - \hat{w}^{T}*\hat{x_{m}} \end{pmatrix} = \begin{pmatrix} y_{1}\\ y_{2}\\ \vdots \\ y_{m} \end{pmatrix} - \begin{pmatrix} \hat{w}^{T}*\hat{x_{1}} \\ \hat{w}^{T}*\hat{x_{2}} \\ \vdots \\ \hat{w}^{T}*\hat{x_{m }} \end{pmatrix}  = \begin{pmatrix} y_{1}\\ y_{2}\\ \vdots \\ y_{m} \end{pmatrix} - \begin{pmatrix} \hat{x_{1}} ^{T}*\hat{w}\\ \hat{x_{2}} ^{T}*\hat{w} \\ \vdots \\ \hat{x_{m}} ^{T}*\hat{w} \end{pmatrix}

Furthermore:

           \begin{pmatrix} \hat{x_{1}} ^{T}*\hat{w}\\ \hat{x_{2}} ^{T}*\hat{w} \\ \vdots \\ \hat{x_{m}} ^{T}*\hat{w} \end{pmatrix} = \begin{pmatrix} \hat{x_{1}} ^{T} \\ \hat{x_{2}} ^{T} \\ \vdots \\ \hat{x_{m}} ^{T} \end{pmatrix} * \hat{w} = X * \hat{w}    (if this step is unclear, scroll back up to the definition of X)

Therefore:

          \begin{pmatrix} y_{1} - \hat{w}^{T}*\hat{x_{1}}\\ y_{2} - \hat{w}^{T}*\hat{x_{ 2}} \\ \vdots \\ y_{m} - \hat{w}^{T}*\hat{x_{m}} \end{pmatrix} = \begin{pmatrix} y_{1}\\ y_{2}\\ \vdots \\ y_{m} \end{pmatrix} - \begin{pmatrix} \hat{w}^{T}*\hat{x_{1}} \\ \hat{w}^{T}*\hat{x_{2}} \\ \vdots \\ \hat{w}^{T}*\hat{x_{m }} \end{pmatrix} = y-X * \hat{w}

 

E_{\hat{w}}   =      \begin{pmatrix} y_{1} - \hat{w}^{T}*\hat{x_{1}} & y_{2} - \hat{w}^{T}*\hat{x_{2}} & \cdots & y_{m} - \hat{w}^{T}*\hat{x_{m}} \end{pmatrix} \begin{pmatrix} y_{1} - \hat{w}^{T}*\hat{x_{1}}\\ y_{2} - \hat{w}^{T}*\hat{x_{2}} \\ \vdots \\ y_{m} - \hat{w}^{T}*\hat{x_{m}} \end{pmatrix}

        =       \left ( y-X * \hat{w} \right ) ^{T} \left ( y-X * \hat{w} \right )                       

This is the expression being minimized in Eq. (3.9) of the Watermelon Book.
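
As a quick numerical sanity check of this matrix form, here is a minimal NumPy sketch (the data and dimensions are made up for illustration; none of the variable names come from the book):

import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3                        # m samples, d attributes
X_raw = rng.normal(size=(m, d))    # raw attribute matrix (m x d)
y = rng.normal(size=m)             # label vector
w_hat = rng.normal(size=d + 1)     # a candidate \hat{w} = (w; b)

# Append a column of ones to get the m x (d+1) design matrix X
X = np.hstack([X_raw, np.ones((m, 1))])

# Element-wise form: sum_i (y_i - \hat{w}^T \hat{x}_i)^2
loss_sum = sum((y[i] - w_hat @ X[i]) ** 2 for i in range(m))

# Matrix form: (y - X \hat{w})^T (y - X \hat{w})
r = y - X @ w_hat
loss_mat = r @ r

assert np.isclose(loss_sum, loss_mat)  # the two forms agree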

 

 

 

2. Mathematical Preliminaries:

 

Definition (convex set):

Let D \subseteq R^{n} . If for any x,y \in D and any a \in \left [ 0,1 \right ] we have a*x + \left ( 1-a\right )*y \in D , then D is called a convex set.

The geometric meaning of a convex set: if two points belong to the set, then every point on the line segment joining them also belongs to the set.

         

 

 

Definition (gradient):

Let f(x) be an n-variable function of x=\left ( x_{1}, x_{2}, \ldots , x_{n} \right ) ^ {T} . If the partial derivatives \frac{ \partial f\left ( x \right ) }{ \partial x_{i} } (i = 1,2, \ldots , n) with respect to each component x_{i} all exist, then f(x) is said to be first-order differentiable at x , and the vector

                                                  \triangledown f(x) = \begin{pmatrix} \frac{ \partial f\left ( x \right ) }{ \partial x_{1} } \\ \frac{ \partial f\left ( x \right ) }{ \partial x_{2} } \\ \vdots \\ \frac{ \partial f\left ( x \right ) }{ \partial x_{n} }\end{pmatrix}

is called the first derivative, or gradient, of f(x) at x , written \triangledown f(x) (a column vector).
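
A small numeric illustration of this definition via central finite differences (the function f and the evaluation point below are made up for the example):

import numpy as np

f = lambda x: x[0] ** 2 + 3 * x[1]   # f(x) = x1^2 + 3*x2, so the gradient is (2*x1, 3)^T
x0 = np.array([1.0, 2.0])
eps = 1e-6

# Approximate each partial derivative with a central difference
grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(2)])

assert np.allclose(grad, [2.0, 3.0])  # matches the analytic gradient at (1, 2)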

 

 

 

Definition (Hessian matrix):

Let f(x) be an n-variable function of x= \left ( x_{1}, x_{2}, \ldots , x_{n} \right ) ^{T} . If the second-order partial derivatives \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{i} \partial x_{j} } (i = 1,2, \ldots , n ; j = 1,2, \ldots , n) all exist, then f(x) is said to be twice differentiable at x , and the matrix

                           \triangledown^{2} f(x) = \begin{pmatrix} \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{1}^{2} } & \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{1} \partial x_{2} } & \cdots & \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{1} \partial x_{n} } \\ \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{2} \partial x_{1} } & \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{2}^{2} } & \cdots & \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{2} \partial x_{n} } \\ \vdots & \vdots & \ddots & \vdots \\ \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{n} \partial x_{1} } & \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{n} \partial x_{2} } & \cdots & \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{n}^{2} } \end{pmatrix}

is called the second derivative, or Hessian matrix, of f(x) at x , written \triangledown^{2} f(x) . If all second-order partial derivatives of f(x) are continuous, then \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{i} \partial x_{j} } = \frac{ \partial ^{2} f\left ( x \right ) }{ \partial x_{j} \partial x_{i} } , and in that case \triangledown^{2} f(x) is a symmetric matrix.
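
The symmetry claim can likewise be checked numerically (again a made-up function and point, approximating each second partial with central differences):

import numpy as np

f = lambda x: x[0] ** 2 * x[1] + x[1] ** 3   # analytic Hessian: [[2*x2, 2*x1], [2*x1, 6*x2]]
x0 = np.array([1.0, 2.0])
eps = 1e-4
n = len(x0)
I = np.eye(n)

H = np.empty((n, n))
for i in range(n):
    for j in range(n):
        # central-difference estimate of d^2 f / (dx_i dx_j)
        H[i, j] = (f(x0 + eps * I[i] + eps * I[j]) - f(x0 + eps * I[i] - eps * I[j])
                   - f(x0 - eps * I[i] + eps * I[j]) + f(x0 - eps * I[i] - eps * I[j])) / (4 * eps ** 2)

assert np.allclose(H, H.T, atol=1e-3)                        # symmetric, as claimed
assert np.allclose(H, [[4.0, 2.0], [2.0, 12.0]], atol=1e-3)  # analytic Hessian at (1, 2)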

 

 

Theorem (convexity criterion for multivariate real-valued functions):

Let D \subseteq R^{n} be a nonempty open convex set, and let f : D \rightarrow R be twice continuously differentiable on D. If the Hessian matrix \triangledown^{2} f(x) of f(x) is positive definite on D, then f(x) is a strictly convex function on D.

 

Theorem (first-order condition for convex functions):

If f : R^{n} \rightarrow R is a convex function and f(x) is continuously differentiable, then x^{*} is a global minimizer if and only if \triangledown f(x^{*})=0 , where \triangledown f(x) is the first derivative (gradient) of f with respect to x .

 

 

3. Proof that the loss function E_{\hat{w}} is a convex function of \hat{w} :

 

\frac{ \partial E_{ \hat{ w } } }{\partial \hat{ w } }    =      \frac{ \partial }{\partial \hat{ w } } \left [ \left ( y-X * \hat{w} \right ) ^{T} \left ( y-X * \hat{w} \right ) \right ] 

           =      \frac{ \partial }{\partial \hat{ w } } \left [ \left ( y^{T} - \hat{w}^{T} * X^{ T } \right ) \left ( y-X * \hat{w} \right ) \right ]

           =     \frac{ \partial }{\partial \hat{ w } } \left [{\color{Red} y^{T} *y} - y^{T} *X * \hat{w} - \hat{w}^{T} * X^{ T }*y + \hat{w}^{T} * X^{ T } *X * \hat{w} \right ]

           =     \frac{ \partial }{\partial \hat{ w } } \left [ -y^{T} *X * \hat{w} - \hat{w}^{T} * X^{ T }*y + \hat{w}^{T} * X^{ T } *X * \hat{w} \right ]

(the term y^{T}*y , highlighted in red above, does not depend on \hat{w} , so its derivative is zero and it is dropped)

 

 

The 【scalar-by-vector】 matrix differentiation formulas come in two layouts, where x=\left ( x_{1},x_{2}, \ldots ,x_{n} \right )^{T} is an n-dimensional column vector and y is a scalar function of x:

(1)

\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y}{\partial x_{1}} \\ \frac{\partial y}{\partial x_{2}}\\ \vdots \\ \frac{\partial y}{\partial x_{n}} \end{pmatrix}       (denominator layout) 【used by default here】

(2)

\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y}{\partial x_{1}}&\frac{\partial y}{\partial x_{2}} & \cdots & \frac{\partial y}{\partial x_{n}} \end{pmatrix}   (numerator layout)

 

 

 

From the 【scalar-by-vector】 differentiation formula we can derive:

\frac{ \partial \left ( x^{T}*a \right ) }{ \partial x} = \frac{ \partial \left ( a^{T}* x \right ) }{ \partial x} = \begin{pmatrix} \frac{\partial \left ( a_{1}*x_{1} + a_{2}*x_{2} + \cdots + a_{n}*x_{n} \right ) }{\partial x_{1}} \\ \frac{\partial \left ( a_{1}*x_{1} + a_{2}*x_{2} + \cdots + a_{n}*x_{n} \right ) }{\partial x_{2}}\\ \vdots \\ \frac{\partial \left ( a_{1}*x_{1} + a_{2}*x_{2} + \cdots + a_{n}*x_{n} \right ) }{\partial x_{n}} \end{pmatrix}  = \begin{pmatrix} a_{1} \\ a_{2}\\ \vdots \\ a_{n} \end{pmatrix} = a
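
This identity is easy to confirm with a finite-difference check (an illustrative sketch; a, x, and eps are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
n = 4
a = rng.normal(size=n)
x = rng.normal(size=n)
eps = 1e-6

f = lambda v: v @ a  # f(x) = x^T a

# Central-difference gradient, one coordinate at a time
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

assert np.allclose(grad, a)  # matches d(x^T a)/dx = a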

 

 

Similarly, one can derive: \frac{ \partial \left ( x^{T}*B*x \right ) }{ \partial x} = \left ( B+B^{T} \right )*x

Here is a quick derivation, using the 3×3 case B = \left ( \theta _{ij} \right ) for concreteness:

\frac{ \partial \left ( x^{T}*B*x \right ) }{ \partial x}       =         \frac{ \partial \left ( \begin{pmatrix} x_{1} & x_{2} & x_{3} \end{pmatrix} * \begin{pmatrix} \theta _{11} & \theta _{12} & \theta _{13} \\ \theta _{21} & \theta _{22} & \theta _{23} \\ \theta _{31} & \theta _{32} & \theta _{33} \end{pmatrix} * \begin{pmatrix} x_{1}\\ x_{2}\\ x_{3} \end{pmatrix} \right ) } { \partial \begin{pmatrix} x_{1}\\ x_{2}\\ x_{3} \end{pmatrix} }

                              =        \begin{pmatrix} \frac{\partial \left [ \left ( \theta _{11}*x_{1} + \theta _{21}*x_{2} + \theta _{31}*x_{3} \right )*x_{1} + \left ( \theta _{12}*x_{1} + \theta _{22}*x_{2} + \theta _{32}*x_{3} \right )*x_{2} + \left ( \theta _{13}*x_{1} + \theta _{23}*x_{2} + \theta _{33}*x_{3} \right )*x_{3} \right ] }{\partial x_{1}} \\ \frac{\partial \left [ \left ( \theta _{11}*x_{1} + \theta _{21}*x_{2} + \theta _{31}*x_{3} \right )*x_{1} + \left ( \theta _{12}*x_{1} + \theta _{22}*x_{2} + \theta _{32}*x_{3} \right )*x_{2} + \left ( \theta _{13}*x_{1} + \theta _{23}*x_{2} + \theta _{33}*x_{3} \right )*x_{3} \right ] }{\partial x_{2}}\\ \frac{\partial \left [ \left ( \theta _{11}*x_{1} + \theta _{21}*x_{2} + \theta _{31}*x_{3} \right )*x_{1} + \left ( \theta _{12}*x_{1} + \theta _{22}*x_{2} + \theta _{32}*x_{3} \right )*x_{2} + \left ( \theta _{13}*x_{1} + \theta _{23}*x_{2} + \theta _{33}*x_{3} \right )*x_{3} \right ]}{\partial x_{3}} \end{pmatrix}

                              =         \begin{pmatrix} 2* \theta _{11}*x_{1}+ \left ( \theta _{12}+\theta _{21} \right )*x_{2}+ \left ( \theta _{13}+\theta _{31} \right )*x_{3} \\ \left ( \theta _{21}+\theta _{12} \right )*x_{1}+ 2*\theta _{22}*x_{2}+ \left ( \theta _{23}+\theta _{32} \right )*x_{3}\\ \left ( \theta _{31}+\theta _{13} \right )*x_{1}+ \left ( \theta _{32}+\theta _{23} \right )*x_{2}+ 2*\theta _{33}*x_{3} \end{pmatrix}

                              =         \begin{pmatrix} 2*\theta _{11} & \theta _{12}+\theta _{21} & \theta _{13}+\theta _{31} \\ \theta _{21}+\theta _{12} & 2*\theta _{22} & \theta _{23}+\theta _{32} \\ \theta _{31}+\theta _{13} & \theta _{32}+\theta _{23} & 2*\theta _{33} \end{pmatrix} * \begin{pmatrix} x_{1}\\ x_{2}\\ x_{3} \end{pmatrix}

                              =         \left ( B+B^{T} \right )*x
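
The same finite-difference check confirms this identity for a general (non-symmetric) B (again an illustrative sketch with arbitrary values):

import numpy as np

rng = np.random.default_rng(2)
n = 3
B = rng.normal(size=(n, n))  # deliberately not symmetric
x = rng.normal(size=n)
eps = 1e-6

f = lambda v: v @ B @ v  # f(x) = x^T B x

grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

assert np.allclose(grad, (B + B.T) @ x, atol=1e-5)

Note that when B is symmetric, \left ( B+B^{T} \right )*x reduces to 2*B*x, which is the form used below with B = X^{T}*X.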

 

 

\frac{ \partial E_{ \hat{ w } } }{\partial \hat{ w } }   =     -\frac{ \partial }{\partial \hat{ w } }\left ( y^{T} *X * \hat{w} \right ) - \frac{ \partial }{\partial \hat{ w } }\left ( \hat{w}^{T} * X^{ T }*y \right ) +\frac{ \partial }{\partial \hat{ w } }\left ( \hat{w}^{T} * X^{ T } *X * \hat{w} \right )

By the matrix differentiation formulas \frac{ \partial \left ( x^{T}*a \right ) }{ \partial x} = a and \frac{ \partial \left ( x^{T}*B*x \right ) }{ \partial x} = \left ( B+B^{T} \right )*x , we get:

\frac{ \partial E_{ \hat{ w } } }{\partial \hat{ w } }    =     -X^{T}*y-X^{T}*y+\left ( X^{T}*X + X^{T}*X \right )\hat{w}

            =     2*X^{T}\left ( X*\hat{w} - y \right )                (this is Eq. (3.10) in the Watermelon Book)
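
Since E_{\hat{w}} is quadratic in \hat{w} , a central-difference gradient should match this closed form essentially exactly; a minimal sketch with made-up data:

import numpy as np

rng = np.random.default_rng(3)
m, d = 6, 3
X = np.hstack([rng.normal(size=(m, d)), np.ones((m, 1))])  # design matrix with bias column
y = rng.normal(size=m)
w_hat = rng.normal(size=d + 1)
eps = 1e-6

E = lambda w: (y - X @ w) @ (y - X @ w)  # the loss E_w

grad_numeric = np.array([(E(w_hat + eps * e) - E(w_hat - eps * e)) / (2 * eps)
                         for e in np.eye(d + 1)])
grad_closed = 2 * X.T @ (X @ w_hat - y)  # Eq. (3.10)

assert np.allclose(grad_numeric, grad_closed, atol=1e-4)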

 

 

\frac{ \partial^{2} E_{ \hat{ w } } }{ \partial \hat{ w } \partial \hat{ w }^{T}}   =     \frac{ \partial }{\partial \hat{ w } } \left ( \frac{ \partial E_{ \hat{ w } } }{\partial \hat{ w } } \right )

               =    \frac{ \partial }{\partial \hat{ w } }\left [ 2*X^{T}\left ( X*\hat{w} - y \right ) \right ]  

               =    \frac{ \partial }{\partial \hat{ w } } \left [ 2*X^{T} X*\hat{w} - 2*X^{T}*y \right ]

                 =    2*X^{T} X              (this is the Hessian matrix)

 

The Hessian 2*X^{T}*X is always positive semi-definite; it is positive definite exactly when X has full column rank, which is the assumption the Watermelon Book makes here. Under that assumption, by the convexity criterion above, the loss function E_{\hat{w}} is a (strictly) convex function of \hat{w} .
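
To see this numerically, one can inspect the eigenvalues of 2*X^{T}*X (a sketch with random data, which has full column rank almost surely; with rank-deficient X some eigenvalues would be zero):

import numpy as np

rng = np.random.default_rng(4)
m, d = 6, 3
X = np.hstack([rng.normal(size=(m, d)), np.ones((m, 1))])

H = 2 * X.T @ X                  # the Hessian of E_w
eigvals = np.linalg.eigvalsh(H)  # H is symmetric, so eigvalsh is appropriate

assert np.all(eigvals > 0)       # positive definite: X has full column rank here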

 

 

4. Solving for \hat{w} :

 

Set the first derivative equal to 0 and solve for \hat{w} :

\frac{ \partial E_{ \hat{ w } } }{\partial \hat{ w } }    =     2*X^{T}\left ( X*\hat{w} - y \right )     =    0

   2*X^{T} * X*\hat{w} - 2*X^{T}*y     =    0

                              X^{T} * X*\hat{w}     =    X^{T}*y

                                              \hat{w}     =     \left ( X^{T} * X \right )^{-1}X^{T}*y             (this is Eq. (3.11) in the Watermelon Book; the inverse exists because X^{T}*X is positive definite under the full-rank assumption)
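
A closing numeric check of Eq. (3.11) against NumPy's least-squares solver (made-up data; np.linalg.solve is used instead of explicitly forming the inverse, which is the numerically preferable way to apply this formula):

import numpy as np

rng = np.random.default_rng(5)
m, d = 50, 3
X = np.hstack([rng.normal(size=(m, d)), np.ones((m, 1))])
y = rng.normal(size=m)

# Eq. (3.11): \hat{w} = (X^T X)^{-1} X^T y, via the normal equations
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the library least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_normal, w_lstsq)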

 

 

 

 
