Multivariate Linear Regression Analysis

Multivariate linear regression analysis, with a custom implementation of the algorithm.

I. Algorithm derivation:

1. First, define the multivariate linear regression function:
$$f\left( x_1,x_2,\dots,x_n \right) =\alpha _0+\alpha _1x_1+\alpha _2x_2+\dots+\alpha _nx_n=\alpha _0+\sum_{i=1}^n{\alpha _ix_i}$$
The residual for the i-th sample can then be written as:
$$\varepsilon _i=y_i-\alpha _0-\sum_{j=1}^n{\alpha _jx_{ij}} \qquad ①$$
where $x_{ij}$ is the j-th feature of the i-th sample, and $\alpha_j$ is the j-th entry of the coefficient vector. Concretely:
The m×n data matrix:
$$\left[ \begin{matrix} x_{11}& x_{12}& \dots& x_{1n}\\ x_{21}& x_{22}& \dots& x_{2n}\\ \vdots& \vdots& \ddots& \vdots\\ x_{m1}& x_{m2}& \dots& x_{mn}\\ \end{matrix} \right]$$
The coefficient vector:
$$\left[ \begin{array}{c} \alpha _1\\ \alpha _2\\ \vdots\\ \alpha _n\\ \end{array} \right]$$
Vectorizing equation ① gives:
$$\vec{\varepsilon}=\vec{y}-\left[ \begin{matrix} 1& x_{11}& \dots& x_{1n}\\ 1& x_{21}& \dots& x_{2n}\\ \vdots& \vdots& \ddots& \vdots\\ 1& x_{m1}& \dots& x_{mn}\\ \end{matrix} \right] \cdot \left[ \begin{array}{c} \alpha _0\\ \alpha _1\\ \vdots\\ \alpha _n\\ \end{array} \right]$$
The matrix on the left (with the prepended column of ones) is called the design matrix, denoted $A$; the vector on the right is the coefficient vector, denoted $\vec{\alpha}$. So:

$$\vec{\varepsilon}=\vec{y}-A\vec{\alpha}$$
As in the simple case, the coefficients of the multivariate model are determined by least squares, i.e., by minimizing the squared norm of the residual vector:
$$\vec{\varepsilon}\cdot \vec{\varepsilon}=\lVert \vec{\varepsilon} \rVert ^2=\left< A\vec{\alpha}-\vec{y},A\vec{\alpha}-\vec{y} \right> =\left< A\vec{\alpha},A\vec{\alpha} \right> -2\left< \vec{y},A\vec{\alpha} \right> +\left< \vec{y},\vec{y} \right>$$
Differentiating with respect to $\vec{\alpha}$:
$$\frac{\partial \left( \vec{\varepsilon}\cdot \vec{\varepsilon} \right)}{\partial \vec{\alpha}}=-2A^T\vec{y}+\left( A^TA \right) \vec{\alpha}+\left( A^TA \right) ^T\vec{\alpha}=-2A^T\vec{y}+2A^TA\vec{\alpha}$$
At an extremum the derivative must vanish, so setting it to zero:
$$-2A^T\vec{y}+2A^TA\vec{\alpha}=0$$
$$A^T\vec{y}=A^TA\vec{\alpha}$$
$$\left( A^TA \right) ^{-1}A^TA\vec{\alpha}=\left( A^TA \right) ^{-1}A^T\vec{y}$$

Key formula:

$$\vec{\alpha}=\left( A^TA \right) ^{-1}A^T\vec{y}$$
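One step the derivation glosses over: the stationary point is in fact a minimum, because the Hessian of the objective is positive semidefinite:
$$\frac{\partial ^2\left( \vec{\varepsilon}\cdot \vec{\varepsilon} \right)}{\partial \vec{\alpha}\,\partial \vec{\alpha}^T}=2A^TA, \qquad \vec{v}^{\,T}\left( 2A^TA \right) \vec{v}=2\lVert A\vec{v} \rVert ^2\ge 0 \ \text{for all } \vec{v}$$
so the objective is convex and the solution is a global minimum. The inverse $\left( A^TA \right)^{-1}$ exists exactly when the columns of $A$ are linearly independent.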

Helper formulas:

Symmetry of the inner product:
$$\left< A\vec{\alpha},\vec{y} \right> =\left< \vec{y},A\vec{\alpha} \right>$$
Derivatives of inner-product expressions:
$$\frac{\partial \left< A\vec{\alpha},\vec{y} \right>}{\partial \vec{\alpha}}=A^T\vec{y}$$
$$\frac{\partial \left( A\vec{\alpha} \right)}{\partial \vec{\alpha}}=A$$
$$\frac{\partial \left< A\vec{\alpha},\vec{\alpha} \right>}{\partial \vec{\alpha}}=A^T\vec{\alpha}+A\vec{\alpha}$$
$$\left< A\vec{\alpha},A\vec{\alpha} \right> =\left< \vec{\alpha},A^TA\vec{\alpha} \right>$$
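Chaining the last two helper formulas reproduces the derivative used above, since $A^TA$ is symmetric:
$$\frac{\partial \left< A\vec{\alpha},A\vec{\alpha} \right>}{\partial \vec{\alpha}}=\frac{\partial \left< \vec{\alpha},A^TA\vec{\alpha} \right>}{\partial \vec{\alpha}}=\left( A^TA \right) ^T\vec{\alpha}+A^TA\vec{\alpha}=2A^TA\vec{\alpha}, \qquad \frac{\partial \left( -2\left< \vec{y},A\vec{\alpha} \right> \right)}{\partial \vec{\alpha}}=-2A^T\vec{y}$$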

2. Implement the algorithm with a custom function:

#1. Imports
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from scipy import linalg
from sklearn.datasets import make_regression
#2. Custom function that solves for the regression coefficients
def OLS_function(X, y):  # X is the feature matrix, y is the target vector
    data_matrix = np.column_stack((np.ones(len(X)), X))  # prepend a column of ones to form the design matrix
    if np.abs(linalg.det(data_matrix.T @ data_matrix)) < 10**(-14):  # (A^T A)^-1 exists only if det(A^T A) != 0
        return 'No solution'
    tita = linalg.inv(data_matrix.T @ data_matrix) @ (data_matrix.T @ y)  # alpha = (A^T A)^-1 A^T y
    return tita
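A note on this design choice: forming $\left( A^TA \right)^{-1}$ explicitly squares the condition number and can be numerically fragile when features are nearly collinear. A minimal alternative sketch, reusing the imports above (the name OLS_lstsq is mine, not from the original post):

def OLS_lstsq(X, y):
    A = np.column_stack((np.ones(len(X)), X))  # same design matrix as in OLS_function
    coef, _, _, _ = linalg.lstsq(A, y)  # least-squares solution without explicitly inverting A^T A
    return coef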
#3. Generate a regression dataset and split it into training and test sets
X, y = make_regression(n_samples=6000, n_features=2, n_informative=2, n_targets=1, bias=0.12, noise=0.6, random_state=121)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=121)  # the original split call was lost; seed assumed
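As a quick sanity check, with train_test_split's default test_size of 0.25 the 6000 samples split into 4500 training and 1500 test rows:

X_train.shape, X_test.shape
#((4500, 2), (1500, 2))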
#4. Call the function to solve the regression
OLS_function(X_train, y_train)

#array([ 0.12809328, 85.82056618, 35.33386918])
The first entry is the intercept; the other two are the slopes.

#5. Use the regression coefficients to predict on the test set
coef = OLS_function(X_train, y_train)  # regression coefficients
data_matrix1 = np.column_stack((np.ones(len(X_test)), X_test))  # turn the test set into a design matrix
y_pre1 = data_matrix1 @ coef
y_pre1
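To avoid rebuilding the design matrix by hand for every prediction, the step can be wrapped in a small helper (OLS_predict is a hypothetical name, not part of the original code):

def OLS_predict(X, coef):
    A = np.column_stack((np.ones(len(X)), X))  # design matrix for the new samples
    return A @ coef  # predicted targets

y_pre1 = OLS_predict(X_test, coef)  # equivalent to the two lines above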
#6. Fit sklearn's linear model to generate regression coefficients for comparison
lin = linear_model.LinearRegression()
lin.fit(X_train, y_train)
lin.coef_, lin.intercept_
y_pre = lin.predict(X_test)  # predictions on the test set

#(array([85.82056618, 35.33386918]), 0.1280932773964853)


#7. Model evaluation
r2_score(y_test, y_pre1)  # predictions from the custom implementation
r2_score(y_test, y_pre)  # predictions from sklearn

#0.9999570694281368
#0.9999553900491037
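The mean_squared_error imported in step 1 is never used above; a sketch of the same comparison with MSE (outputs not recorded in the original run):

mean_squared_error(y_test, y_pre1)  # MSE of the custom implementation
mean_squared_error(y_test, y_pre)  # MSE of sklearn's model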
Comparing the regression coefficients and the evaluation scores shows that the custom implementation performs the same as sklearn's linear regression model.
