线性回归
一元线性回归模型
函数形式:
f
(
x
)
=
ω
x
i
+
b
(1.1)
f(x)=\omega{x_i+b}\tag{1.1}
f(x)=ωxi+b(1.1)
求解 w w w和 b b b:
- method 1 最小二乘法公式法
ω = ∑ i = 1 n y i ( x i − x ‾ ) ∑ i = 1 n x i 2 − 1 n ( ∑ i = 1 n ) 2 (1.2) \omega=\frac{\sum\limits_{i=1}^ny_i(x_i-\overline{x})}{\sum\limits^n_{i=1}x_i^2-\frac{1}{n}(\sum^n_{i=1})^2}\tag{1.2} ω=i=1∑nxi2−n1(∑i=1n)2i=1∑nyi(xi−x)(1.2)
b
=
1
n
∑
i
=
1
n
(
y
i
−
ω
x
i
)
(1.3)
b=\frac{1}{n}\sum_{i=1}^n(y_i-\omega x_i)\tag{1.3}
b=n1i=1∑n(yi−ωxi)(1.3)
其中
x
‾
=
1
n
∑
i
=
1
n
x
i
\overline{x}=\frac{1}{n}\sum \limits_{i=1}^n x_i
x=n1i=1∑nxi为
x
x
x的均值.
- method 2 公式法
w = ∑ ( x i − x ‾ ) ( y i − y ‾ ) ∑ ( x i − x ‾ ) 2 w = \frac{\sum(x_i - \overline{x})(y_i-\overline{y})}{\sum(x_i -\overline{x})^2} w=∑(xi−x)2∑(xi−x)(yi−y)
b = y ‾ − w x ‾ b = \overline{y} - w\overline{x} b=y−wx
-
method 3 调用statmodels统计建模模块中的ols函数**
适用于自变量X和因变量Y为线性关系,具体来说,画出散点图可以用一条直线来近似拟合。一般线性模型要求观测值之间相互独立、残差(因变量)服从正态分布、残差(因变量)方差齐性
统计模型被假定为Y=Xβ+μ, μ∼N(0,Σ)
ols : 最简单的最小二乘法模型
Methods desc fit() Full fit of the model from_formula(formula, data) Create a Model from a formula and dataframe predict(params) Return linear predicted values from a design matrix. score(params) Evaluate the score function at a given point 一般,直接.summary()即可获取所有信息。
-
method 4 调用sklearn子模块中的linear_model中的LinearRegression方法**
最小二乘法线性回归:
sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False,copy_X=True, n_jobs=1)
主要参数说明:
fit_intercept 布尔型 默认为True,是否对训练数据进行中心化。如果该变量为false,则表明输入的数据已经进行了中心化,在下面的过程里不进行中心化处理;否则,对输入的训练数据进行中心化处理 normalize 布尔型 默认为False,是否需要对数据进行标准化处理 copy_X 布尔型 默认为True,是否对X复制,如果选择false,则直接对原数据进行覆盖。(即经过中心化,标准化后,是否把新数据覆盖到原数据上) n_jobs 整型 默认为1,计算时设置的任务个数(number of jobs)。如果选择-1,则代表使用所有的CPU。这一参数的对于目标个数>1(n_targets>1)且足够大规模的问题有加速作用。 Attributes(属性):
coef_ array, shape (n_features, )
or (n_targets, n_features)返回模型的系数 intercept_ array 返回模型的截距 方法:
fit
(X, y[, sample_weight])Fit linear model. get_params
([deep])Get parameters for this estimator. predict
(X)Predict using the linear model. score
(X, y[, sample_weight])Return the coefficient of determination of the prediction. R 2 R^2 R2 set_params
(**params)Set the parameters of this estimator.
import matplotlib.pyplot as plt
import numpy as np
X=[165,165,157,170,175,165,155,170]
Y=[48,57,50,54,64,61,43,59]
plt.xlabel("height/cm")
plt.ylabel("weight/kg")
plt.plot(X,Y,"og")
plt.show()
'''
method 1
'''
x=[165,165,157,170,175,165,155,170]
y=[48,57,50,54,64,61,43,59]
sumx=0
sumx2=0
sumy=0
xy=0
n=len(x)
for i in range(n):
sumx+=x[i]
sumy+=y[i]
sumx2+=x[i]**2
xy+=x[i]*y[i]
k=(xy-1/n*sumy*sumx)/(sumx2-1/n*(sumx**2))
b=1/n*(sumy-k*sumx)
print("k=:",k,"b=:",b )
x=172
print("predict weight=:",k*x+b)
k=: 0.8484848484848485 b=: -85.71212121212122
predict weight=: 60.22727272727272
'''
method 2
'''
import numpy as np
def fit(x,y):
sfz = 0
sfm = 0
for i in range(len(x)):
sfz += (x[i] - np.mean(x))*(y[i] - np.mean(y))
sfm += (x[i] - np.mean(x))**2
w = sfz/sfm
b = np.mean(y) - w*np.mean(x)
return w,b
def predict(x,w,b):
y = w*x + b
return y
if __name__ == "__main__":
x_train = [165,165,157,170,175,165,155,170]
y_train = [48,57,50,54,64,61,43,59]
w,b = fit(x_train,y_train)
print("w = ",w,"b = ",b)
x_test = 172
print(predict(x_test,w,b))
k=: 0.8484848484848485 b=: -85.71212121212122
predict weight=: 60.22727272727272
'''
method 3
'''
import statsmodels.api as sm
import pandas as pd
import numpy as np
x = [165,165,157,170,175,165,155,170]
y = [48,57,50,54,64,61,43,59]
data = np.vstack((x,y)) #按垂直方向(行顺序)堆叠数组构成一个新的数组
dat = pd.DataFrame(data.T,columns = ['x','y'])
res = sm.formula.ols('y ~ x',data = dat).fit()
print(res.params)
print(res.summary())
Intercept -85.712121
x 0.848485
dtype: float64
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.638
Model: OLS Adj. R-squared: 0.577
Method: Least Squares F-statistic: 10.55
Date: Wed, 03 Feb 2021 Prob (F-statistic): 0.0175
Time: 22:06:21 Log-Likelihood: -22.451
No. Observations: 8 AIC: 48.90
Df Residuals: 6 BIC: 49.06
Df Model: 1
Covariance Type: nonrobust
==============================================================================
'''
method 4
'''
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
x = np.array([165,165,157,170,175,165,155,170])
y = np.array([48,57,50,54,64,61,43,59])
model = LinearRegression(fit_intercept = True)
model.fit(x[:,np.newaxis],y)
'''
newaxis:用于索引数组
x = np.arange(3)
array([0, 1, 2])
x[:, newaxis]
array([[0],
[1],
[2]])
'''
print("w: ",model.coef_[0])
print("b",model.intercept_)
w: 0.8484848484848485
b -85.71212121212122
import matplotlib.pyplot as plt
import numpy as np
x_t=[165,165,157,170,175,165,155,170]
y_t=[48,57,50,54,64,61,43,59]
x_p=[165,165,157,170,175,165,155,170,172]
y_p=[]
w= 0.8484848484848485
b= -85.71212121212122
for i in range(len(x_p)):
y_p.append(w*x_p[i]+b)
plt.xlabel("height/cm")
plt.ylabel("weight/kg")
plt.plot(x_t,y_t,"ko",label="true")
plt.plot(x_p,y_p,linewidth='1',label="predict",color='royalblue')
plt.legend(loc='upper left')
plt.show()
残差平方和
R
2
R^2
R2:
L
(
w
)
=
∑
i
=
1
n
∣
∣
w
T
x
i
−
y
i
∣
∣
2
(1.4)
L(w)=\sum_{i=1}^n||w^Tx_i-y_i||^2\tag{1.4}
L(w)=i=1∑n∣∣wTxi−yi∣∣2(1.4)
二、多元线性回归
线性组合函数:
f
(
x
)
=
w
1
x
1
+
w
2
x
2
+
.
.
.
+
w
d
x
d
+
b
(2.1)
f(x)=w_1x_1+w_2x_2+...+w_dx_d+b\tag{2.1}
f(x)=w1x1+w2x2+...+wdxd+b(2.1)
向量形式:
f
(
x
)
=
w
T
x
+
b
(2.2)
f(x)=w^Tx+b\tag{2.2}
f(x)=wTx+b(2.2)
【1】矩阵表达
数据集 D = ( x 1 , y 1 ) , ( x 2 , y 2 ) , … ( x i , y i ) , x i ∈ R ∗ , y i ∈ R , i = 1 , 2 , … , n D={(x_1,y_1),(x_2,y_2),\dots(x_i,y_i),x_i\in R^*,y_i \in R,i=1,2,\dots,n} D=(x1,y1),(x2,y2),…(xi,yi),xi∈R∗,yi∈R,i=1,2,…,n
X = ( x 1 x 2 … x n ) T X=\left(\begin{matrix}x_1&x_2&\dots&x_n\end{matrix}\right)^T X=(x1x2…xn)T= ( x 1 T x 2 T ⋮ x n T ) \left(\begin{matrix}x_1^T\\x_2^T\\\vdots\\x_n^T\end{matrix}\right) ⎝⎜⎜⎜⎛x1Tx2T⋮xnT⎠⎟⎟⎟⎞= X = ( x 11 x 12 … x 1 p x 21 x 22 … x 2 p ⋮ ⋮ ⋮ ⋮ x n 1 x n 2 … x n p ) n × p X=\left(\begin{matrix}x_{11}&x_{12}&\dots&x_{1p}\\x_{21}&x_{22}&\dots&x_{2p}\\\vdots&\vdots&\vdots&\vdots\\x_{n1}&x_{n2}&\dots&x_{np}\end{matrix}\right)_{n\times p} X=⎝⎜⎜⎜⎛x11x21⋮xn1x12x22⋮xn2……⋮…x1px2p⋮xnp⎠⎟⎟⎟⎞n×p
一般 n > > p n>>p n>>p
Y = ( y 1 y 2 ⋮ y n ) n × 1 Y=\left(\begin{matrix}y_1\\y_2\\\vdots\\y_n\end{matrix}\right)_{n\times 1} Y=⎝⎜⎜⎜⎛y1y2⋮yn⎠⎟⎟⎟⎞n×1
【2】最小二乘估计
法一(loss function):
L ( w ) = ∑ i = 1 n ∣ ∣ w T x i − y i ∣ ∣ 2 (2.3) L(w)=\sum_{i=1}^n||w^Tx_i-y_i||^2\tag{2.3} L(w)=i=1∑n∣∣wTxi−yi∣∣2(2.3)
= ( W T X T − Y T ) ⋅ ( X W − Y ) =(W^TX^T-Y^T)\cdot(XW-Y) =(WTXT−YT)⋅(XW−Y)
= W T X T W − 2 W T X T Y + Y T Y =W^TX^TW-2W^TX^TY+Y^TY =WTXTW−2WTXTY+YTY
w ^ = a r g m i n L ( w ) \hat{w}=argminL(w) w^=argminL(w)
∂ L ( w ) ∂ w = 2 X X T W − 2 X T Y = 0 \frac{\partial L(w)}{\partial w}=2XX^TW-2X^TY=0 ∂w∂L(w)=2XXTW−2XTY=0
w = ( X T X ) − 1 X T Y (2.4) w=(X^TX)^{-1}X^TY\tag{2.4} w=(XTX)−1XTY(2.4)
法二:
w
w
w也可从几何意义上求解,
X
X
X看成
p
p
p维列向量,令
f
(
x
)
=
w
T
=
x
T
β
,
x
T
为
系
数
f(x)=w^T=x^T\beta,x^T为系数
f(x)=wT=xTβ,xT为系数,要求的
β
\beta
β为垂直于这个
p
p
p维子空间的垂直投影,即
X
T
(
Y
−
X
β
)
=
0
(
p
×
1
零
矩
阵
)
(2.5)
X^T(Y-X\beta)=0(p\times1零矩阵)\tag{2.5}
XT(Y−Xβ)=0(p×1零矩阵)(2.5)
解得
β
=
(
X
T
X
)
−
1
X
T
Y
\beta=(X^TX)^{-1}X^TY
β=(XTX)−1XTY
方法一把总误差平均的分散在每个样本上面,法二把误差分散在 p p p个维度上。
法三:概率角度 略
【3】过拟合
方法:
1.加数据
2.特征选取/PCA特征提取(降维)
3.正则化
正则化框架
a r g m i n [ L ( W ) + λ P ( W ) ] (2.6) argmin[L(W)+\lambda P(W)]\tag{2.6} argmin[L(W)+λP(W)](2.6)
其中 L ( w ) 为 l o s s f u n c t i o n , P ( w ) = p e n a l t y f u n c t i o n L(w)为loss function,P(w)=penalty function L(w)为lossfunction,P(w)=penaltyfunction
两种具体的正则化:
L 1 : L a c c s o , p ( w ) = ∣ ∣ w ∣ ∣ 1 L1: Laccso, p(w)=||w||_1 L1:Laccso,p(w)=∣∣w∣∣1
L
2
(
权
值
衰
减
)
:
R
i
d
g
e
(
岭
回
归
)
,
p
(
w
)
=
∣
∣
w
∣
∣
2
2
=
W
T
W
L2(权值衰减): Ridge(岭回归),p(w)=||w||_2^2=W^TW
L2(权值衰减):Ridge(岭回归),p(w)=∣∣w∣∣22=WTW
J
(
w
)
=
∑
i
=
1
n
∣
∣
w
T
x
i
−
y
i
∣
∣
2
+
λ
w
T
w
(2.7)
J(w)=\sum\limits_{i=1}^n||w^Tx_i-y_i||^2+\lambda w^Tw\tag{2.7}
J(w)=i=1∑n∣∣wTxi−yi∣∣2+λwTw(2.7)
w
^
=
a
r
g
m
i
n
J
(
w
)
,
λ
>
0
\hat{w}=argminJ(w),\lambda>0
w^=argminJ(w),λ>0
解得:
w
^
=
(
X
T
X
+
λ
I
)
−
1
X
T
Y
\hat{w}=(X^TX+\lambda I)^{-1}X^TY
w^=(XTX+λI)−1XTY
Python实现多元回归和预测
分别使用理论法、统计学习中的statmodels库和机器学习sklearn库中的回归实现;
数据集 :一个选取了11组数据,其中将8个作为训练集train和3个测试集test.
- 理论法
import numpy as np
from sklearn import model_selection
x = np.matrix([[1,1,1,1,1,1,1,1,1,1,1],[104, 101.8, 105.8, 111.5, 115.97, 120.03, 113.3, 116.4, 105.1, 83.4, 73.3],
[135.6, 140.2, 140.1, 146.9, 144, 143, 133.3, 135.7, 125.8, 98.5, 99.8],
[131.6, 135.5, 142.6, 143.2, 142.2, 138.4, 138.4, 135, 122.5, 87.2, 96.5],
[54.2, 54.9, 54.8, 56.3, 54.5, 54.6, 54.9, 54.8, 49.3, 41.5, 48.9],])
x = np.transpose(x)
# x_train = x[:9,:]
# x_test = x[9:,:]
x_train,x_test = model_selection.train_test_split(x,test_size = 0.2)
"""
逗号前是取行,逗号后是取列。列要全部取,行只取前9
"""
y = np.matrix([560823, 542386, 604834, 591248, 583031, 640636, 575688, 689637, 570790, 519574, 614677])
y = np.transpose(y)
# y_train =y[:9]
# y_test = y[9:]
y_train,y_test = model_selection.train_test_split(y,test_size = 0.2)
"""
y = y.reshape([-1, 1])
"""
tx = np.transpose(x_train) #矩阵转置
nx = tx*x_train
inv = np.linalg.pinv(nx)
w = inv*tx*y_train
print(w)
f = (np.transpose(w))*(np.transpose(x_test))
print("pre",f)
print("ture",y_test)
Intercept and coef:
#the first row is Intercept, others are coef
[[656767.09979853]
[ -1037.71064526]
[ -4176.44093565]
[ 1250.05721125]
[ 7924.32921807]]
pre [[566108.88010529 672024.32377186 570019.56312871]]
ture [[689637][614677][640636]]
- statmodels库
from sklearn import model_selection
import statsmodels.api as sm
import pandas as pd
import numpy as np
x = np.matrix([[104, 101.8, 105.8, 111.5, 115.97, 120.03, 113.3, 116.4, 105.1, 83.4, 73.3],
[135.6, 140.2, 140.1, 146.9, 144, 143, 133.3, 135.7, 125.8, 98.5, 99.8],
[131.6, 135.5, 142.6, 143.2, 142.2, 138.4, 138.4, 135, 122.5, 87.2, 96.5],
[54.2, 54.9, 54.8, 56.3, 54.5, 54.6, 54.9, 54.8, 49.3, 41.5, 48.9],])
x = np.transpose(x)
# x_wb = sm.add_constant(x)
x_train,x_test = model_selection.train_test_split(x,test_size = 0.2)
# x_train = x[:9,:]
# x_test = x[9:,:]
"""
逗号前是取行,逗号后是取列。列要全部取,行只取前9
"""
y = np.matrix([560823, 542386, 604834, 591248, 583031, 640636, 575688, 689637, 570790, 519574, 614677])
y = np.transpose(y)
y_train,y_test = model_selection.train_test_split(y,test_size = 0.2)
'''
第一种使用方式
'''
model = sm.OLS(y_train,x_train)
results = model.fit()
print(results.params)
print(results.summary())
'''
第二种使用方式
'''
data = pd.DataFrame(x_train,columns = ['x1','x2','x3','x4'])
testdata = pd.DataFrame(x_test,columns = ['x1','x2','x3','x4'])
data['y'] = y_train
print(data.head())
res = sm.formula.ols('y~x1+x2+x3+x4',data = data).fit()
print(res.params)
print(res.tvalues)
print(x_test)
print("pre:",res.predict(testdata))
print("true:",y_test)
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): -5.323
Model: OLS Adj. R-squared (uncentered): -11.647
Method: Least Squares F-statistic: -0.8419
Date: Wed, 03 Feb 2021 Prob (F-statistic): 1.00
Time: 21:45:05 Log-Likelihood: -90.425
No. Observations: 8 AIC: 188.9
Df Residuals: 4 BIC: 189.2
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 6456.6335 1592.525 4.054 0.015 2035.075 1.09e+04
x2 4056.7051 3908.981 1.038 0.358 -6796.367 1.49e+04
x3 -1.383e+04 3084.779 -4.482 0.011 -2.24e+04 -5262.265
x4 2.168e+04 2834.596 7.649 0.002 1.38e+04 2.96e+04
==============================================================================
Omnibus: 1.122 Durbin-Watson: 1.996
Prob(Omnibus): 0.571 Jarque-Bera (JB): 0.781
Skew: -0.516 Prob(JB): 0.677
Kurtosis: 1.869 Cond. No. 113.
==============================================================================
Intercept 477698.682835
x1 4442.644279
x2 -1414.453576
x3 -4615.177872
x4 7913.426734
dtype: float64
Intercept 1.923370
x1 2.749555
x2 -0.340907
x3 -0.862686
x4 1.057069
dtype: float64
[[105.8 140.1 142.6 54.8]
[104. 135.6 131.6 54.2]
[113.3 133.3 138.4 54.9]]
pre:
0 525096.922084
1 569484.104019
2 588210.128225
dtype: float64
true:
0 542386
1 560823
2 689637
- sklearn库 Regression
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
x = np.matrix([[104, 101.8, 105.8, 111.5, 115.97, 120.03, 113.3, 116.4, 105.1, 83.4, 73.3],
[135.6, 140.2, 140.1, 146.9, 144, 143, 133.3, 135.7, 125.8, 98.5, 99.8],
[131.6, 135.5, 142.6, 143.2, 142.2, 138.4, 138.4, 135, 122.5, 87.2, 96.5],
[54.2, 54.9, 54.8, 56.3, 54.5, 54.6, 54.9, 54.8, 49.3, 41.5, 48.9],])
x = np.transpose(x)
x_train,x_test = model_selection.train_test_split(x,test_size = 0.2)
# x_train = x[:9,:]
# x_test = x[9:,:]
"""
逗号前是取行,逗号后是取列。列要全部取,行只取前9
"""
y = np.matrix([560823, 542386, 604834, 591248, 583031, 640636, 575688, 689637, 570790, 519574, 614677])
y = np.transpose(y)
y_train,y_test = model_selection.train_test_split(y,test_size = 0.2)
# y_train =y[:9]
# y_test = y[9:]
model = LinearRegression(fit_intercept = True)
model.fit(x_train,y_train)
print("Intercept:",model.intercept_)
print("coef:",model.coef_)
pred = model.predict(x_test)
print("predition:\n",pred)
print("real:\n",y_test)
Intercept: [117471.80714502]
coef: [[ -574.45939702 8092.27590779 -7413.31955575 8143.91966418]]
predition:
[[552183.69901895]
[624369.48768329]
[605809.06246475]]
real:
[[519574]
[570790]
[640636]]