Linear Regression Study Notes

1. Linear Regression

Simple Linear Regression Model

Functional form:

$$f(x_i)=\omega x_i+b\tag{1.1}$$

Solving for $\omega$ and $b$:
  • Method 1: least squares, via the closed-form formulas

$$\omega=\frac{\sum\limits_{i=1}^n y_i(x_i-\overline{x})}{\sum\limits_{i=1}^n x_i^2-\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2}\tag{1.2}$$

$$b=\frac{1}{n}\sum_{i=1}^n(y_i-\omega x_i)\tag{1.3}$$

where $\overline{x}=\frac{1}{n}\sum\limits_{i=1}^n x_i$ is the mean of $x$.

  • Method 2: closed form via deviations from the means

$$w = \frac{\sum_i(x_i - \overline{x})(y_i-\overline{y})}{\sum_i(x_i -\overline{x})^2}$$

$$b = \overline{y} - w\overline{x}$$
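The two closed forms agree. Since $\sum_{i=1}^n(x_i-\overline{x})=0$, constant shifts inside the sums drop out; a short check, added here for completeness:

$$\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})=\sum_{i=1}^n y_i(x_i-\overline{x})-\overline{y}\sum_{i=1}^n(x_i-\overline{x})=\sum_{i=1}^n y_i(x_i-\overline{x})$$

and likewise $\sum_{i=1}^n(x_i-\overline{x})^2=\sum_{i=1}^n x_i^2-\frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2$, so (1.2) and the formula above give the same $w$.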

  • Method 3: the ols function from the statsmodels statistical-modeling package

    Applicable when the independent variable X and the dependent variable Y are linearly related; concretely, when the scatter plot can be approximated by a straight line. The general linear model assumes that the observations are mutually independent, that the residuals (of the dependent variable) are normally distributed, and that the residual variance is homoscedastic.
    The statistical model is assumed to be $Y=X\beta+\mu,\ \mu\sim N(0,\Sigma)$.

    ols: the simplest least squares model

    Methods:

    • fit(): Full fit of the model.
    • from_formula(formula, data): Create a Model from a formula and dataframe.
    • predict(params): Return linear predicted values from a design matrix.
    • score(params): Evaluate the score function at a given point.

    In general, calling .summary() directly returns all of this information.
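    A minimal sketch of these calls (the toy data and variable names here are illustrative, not from the notes):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# illustrative toy data
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.1, 3.9, 6.2, 7.8]})

model = sm.OLS.from_formula("y ~ x", data=df)   # build the model from a formula and dataframe
results = model.fit()                           # full fit of the model
print(results.params)                           # intercept and slope
print(results.summary())                        # the full diagnostic table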

  • Method 4: the LinearRegression class from sklearn.linear_model

    Least squares linear regression:

    sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

    Main parameters:

    • fit_intercept: boolean, default True. Whether to center the training data (fit an intercept). If False, the input data is assumed to be already centered and no centering is performed; otherwise the training data is centered.
    • normalize: boolean, default False. Whether to normalize the data.
    • copy_X: boolean, default True. Whether to copy X. If False, the original data is overwritten (i.e., after centering and normalization, the new data replaces the original).
    • n_jobs: integer, default 1. The number of jobs to use for the computation; -1 means use all CPUs. This only speeds up sufficiently large problems with more than one target (n_targets > 1).

    Attributes:

    • coef_: array of shape (n_features,) or (n_targets, n_features). The model coefficients.
    • intercept_: array. The model intercept.

    Methods:

    • fit(X, y[, sample_weight]): Fit linear model.
    • get_params([deep]): Get parameters for this estimator.
    • predict(X): Predict using the linear model.
    • score(X, y[, sample_weight]): Return the coefficient of determination $R^2$ of the prediction.
    • set_params(**params): Set the parameters of this estimator.
  1. Plot the scatter diagram
import matplotlib.pyplot as plt
X=[165,165,157,170,175,165,155,170]
Y=[48,57,50,54,64,61,43,59]
plt.xlabel("height/cm")
plt.ylabel("weight/kg")
plt.plot(X,Y,"og")   # "og": green circles, points only
plt.show()

(Figure: scatter plot of the height/weight data)

  2. Solve the regression equation and predict
'''
method 1
'''
x=[165,165,157,170,175,165,155,170]
y=[48,57,50,54,64,61,43,59]
sumx=0    # sum of x_i
sumx2=0   # sum of x_i^2
sumy=0    # sum of y_i
xy=0      # sum of x_i*y_i
n=len(x)
for i in range(n):
    sumx+=x[i]
    sumy+=y[i]
    sumx2+=x[i]**2
    xy+=x[i]*y[i]
# slope and intercept from equations (1.2) and (1.3)
k=(xy-1/n*sumy*sumx)/(sumx2-1/n*(sumx**2))
b=1/n*(sumy-k*sumx)
print("k=:",k,"b=:",b)
x=172
print("predict weight=:",k*x+b)
k=: 0.8484848484848485 b=: -85.71212121212122
predict weight=: 60.22727272727272
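As a quick cross-check (not part of the original notes), NumPy's polyfit with degree 1 computes the same least squares line:

import numpy as np

x = np.array([165,165,157,170,175,165,155,170])
y = np.array([48,57,50,54,64,61,43,59])

k, b = np.polyfit(x, y, 1)   # a degree-1 fit returns [slope, intercept]
print("k=:", k, "b=:", b)    # matches the hand-rolled result above
print("predict weight=:", k*172 + b)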
'''
method 2
'''
import numpy as np

def fit(x,y):
    x_mean = np.mean(x)   # compute the means once, outside the loop
    y_mean = np.mean(y)
    sfz = 0   # numerator:   sum of (x_i - x_mean)(y_i - y_mean)
    sfm = 0   # denominator: sum of (x_i - x_mean)^2
    for i in range(len(x)):
        sfz += (x[i] - x_mean)*(y[i] - y_mean)
        sfm += (x[i] - x_mean)**2
    w = sfz/sfm
    b = y_mean - w*x_mean
    return w,b

def predict(x,w,b):
    y = w*x + b
    return y

if __name__ == "__main__":
    x_train = [165,165,157,170,175,165,155,170]
    y_train = [48,57,50,54,64,61,43,59]
    w,b = fit(x_train,y_train)
    print("w = ",w,"b = ",b)
    x_test = 172
    print(predict(x_test,w,b))
w =  0.8484848484848485 b =  -85.71212121212122
60.22727272727272
'''
method 3
'''
import statsmodels.api as sm
import pandas as pd
import numpy as np
x = [165,165,157,170,175,165,155,170]
y = [48,57,50,54,64,61,43,59]
data = np.vstack((x,y))   # stack the arrays vertically (row-wise) into a new array
dat = pd.DataFrame(data.T,columns = ['x','y'])
res = sm.formula.ols('y ~ x',data = dat).fit()
print(res.params)
print(res.summary())
Intercept   -85.712121
x             0.848485
dtype: float64
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.638
Model:                            OLS   Adj. R-squared:                  0.577
Method:                 Least Squares   F-statistic:                     10.55
Date:                Wed, 03 Feb 2021   Prob (F-statistic):             0.0175
Time:                        22:06:21   Log-Likelihood:                -22.451
No. Observations:                   8   AIC:                             48.90
Df Residuals:                       6   BIC:                             49.06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
'''
method 4
'''
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
x = np.array([165,165,157,170,175,165,155,170])
y = np.array([48,57,50,54,64,61,43,59])
model = LinearRegression(fit_intercept = True)
model.fit(x[:,np.newaxis],y)   # reshape x to a column of shape (n_samples, 1)
'''
np.newaxis adds an axis when indexing an array:
x = np.arange(3)
array([0, 1, 2])
x[:, np.newaxis]
array([[0],
       [1],
       [2]])
'''
print("w: ",model.coef_[0])
print("b",model.intercept_)
w:  0.8484848484848485
b -85.71212121212122
  3. Plot the fitted line
import matplotlib.pyplot as plt
x_t=[165,165,157,170,175,165,155,170]         # observed data
y_t=[48,57,50,54,64,61,43,59]
x_p=[165,165,157,170,175,165,155,170,172]     # points at which to evaluate the fitted line
y_p=[]
w= 0.8484848484848485
b= -85.71212121212122
for i in range(len(x_p)):
    y_p.append(w*x_p[i]+b)
plt.xlabel("height/cm")
plt.ylabel("weight/kg")
plt.plot(x_t,y_t,"ko",label="true")
plt.plot(x_p,y_p,linewidth=1,label="predict",color='royalblue')
plt.legend(loc='upper left')
plt.show()

(Figure: observed points and the fitted regression line)
Residual sum of squares (the least squares loss):

$$L(w)=\sum_{i=1}^n\|w^Tx_i-y_i\|^2\tag{1.4}$$

2. Multiple Linear Regression

Linear combination form:

$$f(x)=w_1x_1+w_2x_2+\dots+w_dx_d+b\tag{2.1}$$

Vector form:

$$f(x)=w^Tx+b\tag{2.2}$$

[1] Matrix formulation

Dataset $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\},\ x_i\in\mathbb{R}^p,\ y_i\in\mathbb{R},\ i=1,2,\dots,n$.

$$X=\left(\begin{matrix}x_1&x_2&\dots&x_n\end{matrix}\right)^T=\left(\begin{matrix}x_1^T\\x_2^T\\\vdots\\x_n^T\end{matrix}\right)=\left(\begin{matrix}x_{11}&x_{12}&\dots&x_{1p}\\x_{21}&x_{22}&\dots&x_{2p}\\\vdots&\vdots&&\vdots\\x_{n1}&x_{n2}&\dots&x_{np}\end{matrix}\right)_{n\times p}$$

In general $n\gg p$.

$$Y=\left(\begin{matrix}y_1\\y_2\\\vdots\\y_n\end{matrix}\right)_{n\times 1}$$

[2] Least squares estimation

Approach 1 (loss function):

$$L(w)=\sum_{i=1}^n\|w^Tx_i-y_i\|^2\tag{2.3}$$

$$=(W^TX^T-Y^T)\cdot(XW-Y)$$

$$=W^TX^TXW-2W^TX^TY+Y^TY$$

$$\hat{w}=\mathop{\arg\min}_{w}L(w)$$

Setting the derivative to zero:

$$\frac{\partial L(w)}{\partial w}=2X^TXW-2X^TY=0$$

$$w=(X^TX)^{-1}X^TY\tag{2.4}$$
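A minimal NumPy sketch of equation (2.4) on synthetic data (the data and names are illustrative only); np.linalg.lstsq serves as the reference solver:

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.5, -2.0, 0.5])
Y = X @ w_true + 0.01*rng.normal(size=n)

# normal equation (2.4): w = (X^T X)^{-1} X^T Y
w_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# reference: the built-in least squares solver
w_ref, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w_hat)
print(w_ref)   # agrees with w_hat up to numerical precision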

Approach 2 (geometric view):

$w$ can also be obtained geometrically: view the columns of $X$ as spanning a $p$-dimensional subspace and write $f(x)=w^Tx=x^T\beta$. The desired $\beta$ makes the residual $Y-X\beta$ perpendicular to that $p$-dimensional subspace, i.e.

$$X^T(Y-X\beta)=0\quad(p\times 1\text{ zero vector})\tag{2.5}$$

Solving gives $\beta=(X^TX)^{-1}X^TY$.

Approach 1 spreads the total error evenly over the individual samples; Approach 2 spreads the error over the $p$ dimensions.

Approach 3 (probabilistic view), sketched briefly below:
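(Sketch added for completeness: assume $y_i=w^Tx_i+\varepsilon_i$ with Gaussian noise $\varepsilon_i\sim N(0,\sigma^2)$. The log-likelihood is

$$\log\prod_{i=1}^n\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y_i-w^Tx_i)^2}{2\sigma^2}\right)=\text{const}-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-w^Tx_i)^2,$$

so maximizing the likelihood is the same as minimizing the least squares loss: MLE under Gaussian noise recovers least squares.)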
[3] Overfitting

Remedies:

1. Add more data

2. Feature selection / PCA feature extraction (dimensionality reduction)

3. Regularization

Regularization framework:

$$\mathop{\arg\min}_{w}\,[L(w)+\lambda P(w)]\tag{2.6}$$

where $L(w)$ is the loss function and $P(w)$ is the penalty function.

Two concrete regularizers:

L1 (Lasso): $p(w)=\|w\|_1$

L2 (weight decay, i.e. Ridge regression): $p(w)=\|w\|_2^2=w^Tw$

$$J(w)=\sum\limits_{i=1}^n\|w^Tx_i-y_i\|^2+\lambda w^Tw\tag{2.7}$$

$$\hat{w}=\mathop{\arg\min}_{w}J(w),\quad\lambda>0$$

Solving gives:

$$\hat{w}=(X^TX+\lambda I)^{-1}X^TY$$
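A minimal sketch (illustrative data) checking this closed form against scikit-learn's Ridge; fit_intercept=False so both solve exactly the penalized problem (2.7):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Y = X @ np.array([2.0, -1.0, 0.0, 3.0]) + 0.1*rng.normal(size=30)
lam = 0.5

# closed form: w = (X^T X + lambda*I)^{-1} X^T Y
w_closed = np.linalg.solve(X.T @ X + lam*np.eye(X.shape[1]), X.T @ Y)

# sklearn's Ridge minimizes ||Y - Xw||^2 + alpha*||w||^2
model = Ridge(alpha=lam, fit_intercept=False).fit(X, Y)
print(w_closed)
print(model.coef_)   # agrees with the closed form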

Multiple regression and prediction in Python

Implemented three ways: directly from the normal equations, with the statsmodels statistical library, and with the scikit-learn machine-learning library.

Dataset: 11 samples, of which 8 are used as the training set and 3 as the test set.

  • Normal equations
import numpy as np
from sklearn import model_selection
# the first row of ones absorbs the intercept into w
x = np.matrix([[1,1,1,1,1,1,1,1,1,1,1],[104, 101.8, 105.8, 111.5, 115.97, 120.03, 113.3, 116.4, 105.1, 83.4, 73.3],
     [135.6, 140.2, 140.1, 146.9, 144, 143, 133.3, 135.7, 125.8, 98.5, 99.8],
     [131.6, 135.5, 142.6, 143.2, 142.2, 138.4, 138.4, 135, 122.5, 87.2, 96.5],
     [54.2, 54.9, 54.8, 56.3, 54.5, 54.6, 54.9, 54.8, 49.3, 41.5, 48.9],])
x = np.transpose(x)
y = np.matrix([560823, 542386, 604834, 591248, 583031, 640636, 575688, 689637, 570790, 519574, 614677])
y = np.transpose(y)
# split x and y together so their rows stay paired
# (splitting them in two separate calls would shuffle x and y independently)
x_train,x_test,y_train,y_test = model_selection.train_test_split(x,y,test_size = 0.2)
# alternatively, a fixed split: x_train = x[:8,:]; x_test = x[8:,:]
# (before the comma selects rows, after the comma selects columns)
tx = np.transpose(x_train)    # matrix transpose
nx = tx*x_train               # X^T X
inv = np.linalg.pinv(nx)      # pseudo-inverse of X^T X
w = inv*tx*y_train            # w = (X^T X)^+ X^T Y, cf. equation (2.4)
print(w)
f = (np.transpose(w))*(np.transpose(x_test))
print("pre",f)
print("true",y_test)
Intercept and coef:
#the first row is Intercept, others are coef
[[656767.09979853]
 [ -1037.71064526]
 [ -4176.44093565]
 [  1250.05721125]
 [  7924.32921807]]
pre [[566108.88010529 672024.32377186 570019.56312871]]
true [[689637]
 [614677]
 [640636]]
  • statsmodels
from sklearn import model_selection
import statsmodels.api as sm
import pandas as pd
import numpy as np

x = np.matrix([[104, 101.8, 105.8, 111.5, 115.97, 120.03, 113.3, 116.4, 105.1, 83.4, 73.3],
     [135.6, 140.2, 140.1, 146.9, 144, 143, 133.3, 135.7, 125.8, 98.5, 99.8],
     [131.6, 135.5, 142.6, 143.2, 142.2, 138.4, 138.4, 135, 122.5, 87.2, 96.5],
     [54.2, 54.9, 54.8, 56.3, 54.5, 54.6, 54.9, 54.8, 49.3, 41.5, 48.9],])
x = np.transpose(x)
# x_wb = sm.add_constant(x)   # add a constant column if an intercept is wanted in sm.OLS
y = np.matrix([560823, 542386, 604834, 591248, 583031, 640636, 575688, 689637, 570790, 519574, 614677])
y = np.transpose(y)
# split x and y together so their rows stay paired
x_train,x_test,y_train,y_test = model_selection.train_test_split(x,y,test_size = 0.2)
'''
First usage style: sm.OLS (no intercept unless a constant column is added)
'''
model = sm.OLS(y_train,x_train)
results = model.fit()
print(results.params)
print(results.summary())
'''
Second usage style: the formula interface (an intercept is added automatically)
'''
data = pd.DataFrame(x_train,columns = ['x1','x2','x3','x4'])
testdata = pd.DataFrame(x_test,columns = ['x1','x2','x3','x4'])
data['y'] = y_train
print(data.head())
res = sm.formula.ols('y~x1+x2+x3+x4',data = data).fit()
print(res.params)
print(res.tvalues)
print(x_test)
print("pre:",res.predict(testdata))
print("true:",y_test)
  OLS Regression Results                                
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                  -5.323
Model:                            OLS   Adj. R-squared (uncentered):            -11.647
Method:                 Least Squares   F-statistic:                            -0.8419
Date:                Wed, 03 Feb 2021   Prob (F-statistic):                        1.00
Time:                        21:45:05   Log-Likelihood:                         -90.425
No. Observations:                   8   AIC:                                      188.9
Df Residuals:                       4   BIC:                                      189.2
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1          6456.6335   1592.525      4.054      0.015    2035.075    1.09e+04
x2          4056.7051   3908.981      1.038      0.358   -6796.367    1.49e+04
x3         -1.383e+04   3084.779     -4.482      0.011   -2.24e+04   -5262.265
x4          2.168e+04   2834.596      7.649      0.002    1.38e+04    2.96e+04
==============================================================================
Omnibus:                        1.122   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.571   Jarque-Bera (JB):                0.781
Skew:                          -0.516   Prob(JB):                        0.677
Kurtosis:                       1.869   Cond. No.                         113.
==============================================================================
Intercept    477698.682835
x1             4442.644279
x2            -1414.453576
x3            -4615.177872
x4             7913.426734
dtype: float64
Intercept    1.923370
x1           2.749555
x2          -0.340907
x3          -0.862686
x4           1.057069
dtype: float64
[[105.8 140.1 142.6  54.8]
 [104.  135.6 131.6  54.2]
 [113.3 133.3 138.4  54.9]]
pre: 
0    525096.922084
1    569484.104019
2    588210.128225
dtype: float64
true: 
0    542386 
1    560823
2    689637
  • scikit-learn LinearRegression
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
import numpy as np

x = np.matrix([[104, 101.8, 105.8, 111.5, 115.97, 120.03, 113.3, 116.4, 105.1, 83.4, 73.3],
     [135.6, 140.2, 140.1, 146.9, 144, 143, 133.3, 135.7, 125.8, 98.5, 99.8],
     [131.6, 135.5, 142.6, 143.2, 142.2, 138.4, 138.4, 135, 122.5, 87.2, 96.5],
     [54.2, 54.9, 54.8, 56.3, 54.5, 54.6, 54.9, 54.8, 49.3, 41.5, 48.9],])
x = np.transpose(x)
y = np.matrix([560823, 542386, 604834, 591248, 583031, 640636, 575688, 689637, 570790, 519574, 614677])
y = np.transpose(y)
# split x and y together so their rows stay paired
x_train,x_test,y_train,y_test = model_selection.train_test_split(x,y,test_size = 0.2)
model = LinearRegression(fit_intercept = True)
model.fit(x_train,y_train)
print("Intercept:",model.intercept_)
print("coef:",model.coef_)

pred = model.predict(x_test)
print("prediction:\n",pred)
print("real:\n",y_test)
Intercept: [117471.80714502]
coef: [[ -574.45939702  8092.27590779 -7413.31955575  8143.91966418]]
prediction:
 [[552183.69901895]
 [624369.48768329]
 [605809.06246475]]
real:
 [[519574]
 [570790]
 [640636]]