Multivariate Linear Regression Analysis and a Custom Code Implementation of the Algorithm
I. Algorithm Derivation:
1. First, define the multivariate linear regression function:
$$f\left( x_1,x_2,\dots ,x_n \right) =\alpha _0+\alpha _1x_1+\alpha _2x_2+\dots +\alpha _nx_n=\alpha _0+\sum_{i=1}^n{\alpha _ix_i}$$
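As a quick numerical illustration (toy values assumed here, not from the article), the model is just an intercept plus a dot product:

import numpy as np
alpha0 = 0.5
alpha = np.array([1.0, 2.0, 3.0])  # α1...αn
x = np.array([4.0, 5.0, 6.0])      # x1...xn
f = alpha0 + alpha @ x             # α0 + Σ αi·xi
# f == 32.5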
The residual of the i-th observation can then be written as:
$$\varepsilon _i=\left( y_i-\alpha _0-\sum_{j=1}^n{\alpha _jx_{ij}} \right) \quad ①$$
where $x_{ij}$ is the $j$-th feature of the $i$-th sample (row $i$, column $j$ of the data matrix) and $\alpha_j$ is the $j$-th regression coefficient. Concretely:
The $m\times n$ data matrix:
$$\left[ \begin{matrix} x_{11}& x_{12}& \dots& x_{1n}\\ x_{21}& x_{22}& \dots& x_{2n}\\ \vdots& \vdots& \ddots& \vdots\\ x_{m1}& x_{m2}& \dots& x_{mn}\\ \end{matrix} \right]$$
The coefficient vector:
$$\left[ \begin{array}{c} \alpha _1\\ \alpha _2\\ \vdots\\ \alpha _n\\ \end{array} \right]$$
Vectorizing equation ① gives:
$$\vec{\varepsilon}=\vec{y}-\left[ \begin{matrix} 1& x_{11}& \dots& x_{1n}\\ 1& x_{21}& \dots& x_{2n}\\ \vdots& \vdots& \ddots& \vdots\\ 1& x_{m1}& \dots& x_{mn}\\ \end{matrix} \right] \cdot \left[ \begin{array}{c} \alpha _0\\ \alpha _1\\ \vdots\\ \alpha _n\\ \end{array} \right]$$
The matrix on the left is called the design matrix, denoted $A$; the vector on the right is the coefficient vector, denoted $\vec{\alpha}$:
$$\vec{\varepsilon}=\vec{y}-A\vec{\alpha}$$
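A minimal numpy sketch of this vectorized form (toy numbers assumed): prepend a column of ones to get the design matrix $A$, and the whole residual vector comes out in one expression:

import numpy as np
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # m=3 samples, n=2 features
y = np.array([3.0, 7.0, 11.0])
alpha = np.array([0.5, 1.0, 1.0])                   # [α0, α1, α2]
A = np.column_stack((np.ones(len(X)), X))           # design matrix with a ones column
eps = y - A @ alpha                                 # ε = y - Aα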
As in the simple case, the coefficients of the multivariate model are determined by least squares:
$$\vec{\varepsilon}\cdot \vec{\varepsilon}=\lVert \vec{\varepsilon} \rVert ^2=\left< A\vec{\alpha}-\vec{y},A\vec{\alpha}-\vec{y} \right> =\left< A\vec{\alpha},A\vec{\alpha} \right> -2\left< \vec{y},A\vec{\alpha} \right> +\left< \vec{y},\vec{y} \right>$$
Differentiating with respect to $\vec{\alpha}$ (using the helper identities listed at the end of this section):
$$\frac{\partial \left( \vec{\varepsilon}\cdot \vec{\varepsilon} \right)}{\partial \vec{\alpha}}=-2A^T\vec{y}+\left( A^TA \right) \vec{\alpha}+\left( A^TA \right) ^T\vec{\alpha}=-2A^T\vec{y}+2A^TA\vec{\alpha}$$
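This gradient formula can be sanity-checked with finite differences (a sketch using randomly generated toy data, not part of the derivation itself):

import numpy as np
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
y = rng.normal(size=5)
a = rng.normal(size=3)               # point at which to check the gradient

def sse(a):                          # ||Aα - y||²
    r = A @ a - y
    return r @ r

analytic = -2 * A.T @ y + 2 * A.T @ A @ a
h = 1e-6
numeric = np.array([(sse(a + h * e) - sse(a - h * e)) / (2 * h) for e in np.eye(3)])
np.allclose(analytic, numeric, atol=1e-5)  # expected: True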
The minimum occurs where the derivative is zero:
$$-2A^T\vec{y}+2A^TA\vec{\alpha}=0$$
$$A^T\vec{y}=A^TA\vec{\alpha}$$
$$\left( A^TA \right) ^{-1}A^TA\vec{\alpha}=\left( A^TA \right) ^{-1}A^T\vec{y}$$
Key formula:
$$\vec{\alpha}=\left( A^TA \right) ^{-1}A^T\vec{y}$$
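A note on practice: the closed form is exact, but computing the explicit inverse is numerically less stable than using a least-squares solver. A small check (random toy data assumed) that numpy's solver agrees with the formula:

import numpy as np
rng = np.random.default_rng(1)
A = np.column_stack((np.ones(20), rng.normal(size=(20, 2))))  # design matrix
y = rng.normal(size=20)
closed_form = np.linalg.inv(A.T @ A) @ A.T @ y   # (AᵀA)⁻¹Aᵀy
solver, *_ = np.linalg.lstsq(A, y, rcond=None)   # SVD-based least squares
np.allclose(closed_form, solver)                 # expected: True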
Helper identities:
Symmetry of the inner product:
$$\left< A\vec{\alpha},\vec{y} \right> =\left< \vec{y},A\vec{\alpha} \right>$$
Derivatives of inner-product expressions:
$$\frac{\partial \left< A\vec{\alpha},\vec{y} \right>}{\partial \vec{\alpha}}=A^T\vec{y}$$
$$\frac{\partial \left( A\vec{\alpha} \right)}{\partial \vec{\alpha}}=A$$
$$\frac{\partial \left< A\vec{\alpha},\vec{\alpha} \right>}{\partial \vec{\alpha}}=A^T\vec{\alpha}+A\vec{\alpha}$$
$$\left< A\vec{\alpha},A\vec{\alpha} \right> =\left< \vec{\alpha},A^TA\vec{\alpha} \right>$$
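These identities are easy to spot-check numerically (random toy matrices assumed):

import numpy as np
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
a = rng.normal(size=3)
y = rng.normal(size=4)
np.isclose((A @ a) @ y, y @ (A @ a))                # ⟨Aα, y⟩ = ⟨y, Aα⟩ → True
np.isclose((A @ a) @ (A @ a), a @ (A.T @ A @ a))    # ⟨Aα, Aα⟩ = ⟨α, AᵀAα⟩ → True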
II. Implementing the Algorithm with a Custom Function:
#1. Imports
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from scipy import linalg
from sklearn.datasets import make_regression
#2. Custom function to solve for the regression coefficients
def OLS_function(X,y):# X: feature matrix, y: target vector
    data_matrix = np.column_stack((np.ones(len(X)),X))# prepend a column of ones to form the design matrix
    if np.abs(linalg.det(data_matrix.T@data_matrix))<10**(-14):# (A^T A) is invertible only if its determinant is nonzero
        return 'No solution'
    alpha = linalg.inv(data_matrix.T@data_matrix)@(data_matrix.T@y)# α = (A^T A)^{-1} A^T y
    return alpha
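As a design note, explicitly inverting AᵀA works here but is slower and less numerically stable than solving the normal equations directly. A hypothetical variant of the function above (same inputs assumed):

def OLS_solve(X, y):
    # same design matrix as OLS_function, but solve AᵀA·α = Aᵀy directly
    A = np.column_stack((np.ones(len(X)), X))
    return linalg.solve(A.T @ A, A.T @ y)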
#3. Generate a regression dataset and split it into train and test sets
X, y = make_regression(n_samples=6000, n_features=2, n_informative=2, n_targets=1, bias=0.12, noise=0.6, random_state=121)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=121)# hold out a test set
#4. Call the custom function to solve the regression
OLS_function(X_train,y_train)
#array([ 0.12809328, 85.82056618, 35.33386918])
The first entry is the intercept; the remaining two are the slopes.
#5. Predict on the test set using the regression coefficients
coef=OLS_function(X_train,y_train)# regression coefficients
data_matrix1 = np.column_stack((np.ones(len(X_test)),X_test))# build the design matrix for the test set
y_pre1=data_matrix1@coef
y_pre1
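Since mean_squared_error is already imported above, the test MSE of these predictions can also be reported (a natural extension of the original run):

mean_squared_error(y_test, y_pre1)  # test MSE of the custom predictions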
#6. Generate regression coefficients with sklearn's linear model for comparison
lin=linear_model.LinearRegression()
lin.fit(X_train,y_train)
lin.coef_,lin.intercept_
y_pre=lin.predict(X_test)# predictions for the test set
#(array([85.82056618, 35.33386918]), 0.1280932773964853)
#7. Model evaluation
r2_score(y_test,y_pre1)# R² of the custom algorithm's predictions
r2_score(y_test,y_pre)# R² of the sklearn model's predictions
#0.9999570694281368
#0.9999553900491037
Comparing the regression coefficients and the evaluation scores shows that the custom implementation performs the same as the linear regression model in the sklearn library.