线性回归学习笔记

最新推荐文章于 2024-07-17 09:15:39 发布

Amiyai

最新推荐文章于 2024-07-17 09:15:39 发布

阅读量406

点赞数 4

分类专栏：线性回归文章标签： python 机器学习

本文链接：https://blog.csdn.net/weixin_45812669/article/details/113620764

版权

线性回归专栏收录该内容

1 篇文章 0 订阅

订阅专栏

线性回归

一元线性回归模型

函数形式：
$f(x)=\omega{x_i+b}\tag{1.1}$

求解 $w$ 和 $b$ ：

method 1 最小二乘法公式法

$\omega=\frac{\sum\limits_{i=1}^ny_i(x_i-\overline{x})}{\sum\limits^n_{i=1}x_i^2-\frac{1}{n}(\sum^n_{i=1})^2}\tag{1.2}$

$b=\frac{1}{n}\sum_{i=1}^n(y_i-\omega x_i)\tag{1.3}$
其中 $\overline{x}=\frac{1}{n}\sum \limits_{i=1}^n x_i$ 为 $x$ 的均值.

method 2 公式法

$\frac{\sum(x_i - \overline{x})(y_i-\overline{y})}{\sum(x_i -\overline{x})^2}$

$\overline{y} - w\overline{x}$

method 3 调用statmodels统计建模模块中的ols函数**

适用于自变量X和因变量Y为线性关系，具体来说，画出散点图可以用一条直线来近似拟合。一般线性模型要求观测值之间相互独立、残差(因变量)服从正态分布、残差(因变量)方差齐性
统计模型被假定为 Y=Xβ+μ, μ∼N(0,Σ)

ols : 最简单的最小二乘法模型

Methods	desc
fit()	Full fit of the model
from_formula(formula, data)	Create a Model from a formula and dataframe
predict(params)	Return linear predicted values from a design matrix.
score(params)	Evaluate the score function at a given point

一般，直接.summary()即可获取所有信息。

method 4 调用sklearn子模块中的linear_model中的LinearRegression方法**

最小二乘法线性回归：

sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False,copy_X=True, n_jobs=1)

主要参数说明：

fit_intercept	布尔型	默认为True，是否对训练数据进行中心化。如果该变量为false，则表明输入的数据已经进行了中心化，在下面的过程里不进行中心化处理；否则，对输入的训练数据进行中心化处理
normalize	布尔型	默认为False，是否需要对数据进行标准化处理
copy_X	布尔型	默认为True，是否对X复制，如果选择false，则直接对原数据进行覆盖。（即经过中心化，标准化后，是否把新数据覆盖到原数据上）
n_jobs	整型	默认为1，计算时设置的任务个数(number of jobs)。如果选择-1，则代表使用所有的CPU。这一参数的对于目标个数>1（n_targets>1）且足够大规模的问题有加速作用。

Attributes（属性）:

coef_	array, shape (n_features, ) or (n_targets, n_features)	返回模型的系数
intercept_	array	返回模型的截距

方法：

`fit`(X, y[, sample_weight])	Fit linear model.
`get_params`([deep])	Get parameters for this estimator.
`predict`(X)	Predict using the linear model.
`score`(X, y[, sample_weight])	Return the coefficient of determination of the prediction. $R^2$
`set_params`(**params)	Set the parameters of this estimator.

画散点图

import matplotlib.pyplot as plt
import numpy as np
X=[165,165,157,170,175,165,155,170]
Y=[48,57,50,54,64,61,43,59]
plt.xlabel("height/cm")
plt.ylabel("weight/kg")
plt.plot(X,Y,"og")
plt.show()

在这里插入图片描述

求解回归方程和预测

'''
method 1
'''
x=[165,165,157,170,175,165,155,170]
y=[48,57,50,54,64,61,43,59]
sumx=0
sumx2=0
sumy=0
xy=0
n=len(x)
for i in range(n):
    sumx+=x[i]
    sumy+=y[i]
    sumx2+=x[i]**2
    xy+=x[i]*y[i]
k=(xy-1/n*sumy*sumx)/(sumx2-1/n*(sumx**2))
b=1/n*(sumy-k*sumx)
print("k=:",k,"b=:",b )
x=172
print("predict weight=:",k*x+b)

k=: 0.8484848484848485 b=: -85.71212121212122
predict weight=: 60.22727272727272

'''
method 2
'''
import numpy as np

def fit(x,y):
    sfz = 0
    sfm = 0
    for i in range(len(x)):
        sfz += (x[i] - np.mean(x))*(y[i] - np.mean(y))
        sfm += (x[i] - np.mean(x))**2
    w = sfz/sfm
    b = np.mean(y) - w*np.mean(x)
    return w,b

def predict(x,w,b):
    y = w*x + b
    return y

if __name__ == "__main__":
    x_train = [165,165,157,170,175,165,155,170]
    y_train = [48,57,50,54,64,61,43,59]
    w,b = fit(x_train,y_train)
    print("w = ",w,"b = ",b)
    x_test = 172
    print(predict(x_test,w,b))

k=: 0.8484848484848485 b=: -85.71212121212122
predict weight=: 60.22727272727272

'''
method 3
'''
import statsmodels.api as sm
import pandas as pd
import numpy as np
x = [165,165,157,170,175,165,155,170]
y = [48,57,50,54,64,61,43,59]
data = np.vstack((x,y))   #按垂直方向（行顺序）堆叠数组构成一个新的数组
dat = pd.DataFrame(data.T,columns = ['x','y'])
res = sm.formula.ols('y ~ x',data = dat).fit()
print(res.params)
print(res.summary())

Intercept   -85.712121
x             0.848485
dtype: float64
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.638
Model:                            OLS   Adj. R-squared:                  0.577
Method:                 Least Squares   F-statistic:                     10.55
Date:                Wed, 03 Feb 2021   Prob (F-statistic):             0.0175
Time:                        22:06:21   Log-Likelihood:                -22.451
No. Observations:                   8   AIC:                             48.90
Df Residuals:                       6   BIC:                             49.06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================

'''
method 4
'''
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
x = np.array([165,165,157,170,175,165,155,170])
y = np.array([48,57,50,54,64,61,43,59])
model = LinearRegression(fit_intercept = True)
model.fit(x[:,np.newaxis],y)
'''
newaxis:用于索引数组
x = np.arange(3)
array([0, 1, 2])
x[:, newaxis]
array([[0],
[1],
[2]])
'''
print("w: ",model.coef_[0])
print("b",model.intercept_)

w:  0.8484848484848485
b -85.71212121212122

画函数图像

import matplotlib.pyplot as plt
import numpy as np
x_t=[165,165,157,170,175,165,155,170]
y_t=[48,57,50,54,64,61,43,59]
x_p=[165,165,157,170,175,165,155,170,172]
y_p=[]
w= 0.8484848484848485
b= -85.71212121212122
for i in range(len(x_p)):
    y_p.append(w*x_p[i]+b)
plt.xlabel("height/cm")
plt.ylabel("weight/kg")
plt.plot(x_t,y_t,"ko",label="true")
plt.plot(x_p,y_p,linewidth='1',label="predict",color='royalblue')
plt.legend(loc='upper left')
plt.show()

在这里插入图片描述
残差平方和 $R^2$ :
$L(w)=\sum_{i=1}^n||w^Tx_i-y_i||^2\tag{1.4}$

二、多元线性回归

线性组合函数：
$f(x)=w_1x_1+w_2x_2+...+w_dx_d+b\tag{2.1}$
向量形式:
$f(x)=w^Tx+b\tag{2.2}$

【1】矩阵表达

数据集 $D={(x_1,y_1),(x_2,y_2),\dots(x_i,y_i),x_i\in R^*,y_i \in R,i=1,2,\dots,n}$

$X=\left(\begin{matrix}x_1&x_2&\dots&x_n\end{matrix}\right)^T$ = $\left(\begin{matrix}x_1^T\\x_2^T\\\vdots\\x_n^T\end{matrix}\right)$ = $X=\left(\begin{matrix}x_{11}&x_{12}&\dots&x_{1p}\\x_{21}&x_{22}&\dots&x_{2p}\\\vdots&\vdots&\vdots&\vdots\\x_{n1}&x_{n2}&\dots&x_{np}\end{matrix}\right)_{n\times p}$

一般 $n > > p$

$Y=\left(\begin{matrix}y_1\\y_2\\\vdots\\y_n\end{matrix}\right)_{n\times 1}$

【2】最小二乘估计

法一（loss function）：

$L(w)=\sum_{i=1}^n||w^Tx_i-y_i||^2\tag{2.3}$

$=(W^TX^T-Y^T)\cdot(XW-Y)$

$W^TX^TW-2W^TX^TY+Y^TY$

$\hat{w}=argminL(w)$

$\frac{\partial L(w)}{\partial w}=2XX^TW-2X^TY=0$

$w=(X^TX)^{-1}X^TY\tag{2.4}$

法二：

$w$ 也可从几何意义上求解， $X$ 看成 $p$ 维列向量，令 $f(x)=w^T=x^T\beta,x^T为系数$ ，要求的 $\beta$ 为垂直于这个 $p$ 维子空间的垂直投影，即
$X^T(Y-X\beta)=0(p\times1零矩阵)\tag{2.5}$
解得 $\beta=(X^TX)^{-1}X^TY$

方法一把总误差平均的分散在每个样本上面，法二把误差分散在 $p$ 个维度上。

法三：概率角度略

【3】过拟合

方法：

1.加数据

2.特征选取/PCA特征提取(降维)

3.正则化

正则化框架

$argmin[L(W)+\lambda P(W)]\tag{2.6}$

其中 $L (w) 为 l o s s f u n c t i o n ， P （ w) = p e n a l t y f u n c t i o n$

两种具体的正则化：

$L1: Laccso, p(w)=||w||_1$

$L2（权值衰减）: Ridge（岭回归），p(w)=||w||_2^2=W^TW$
$J(w)=\sum\limits_{i=1}^n||w^Tx_i-y_i||^2+\lambda w^Tw\tag{2.7}$
$\hat{w}=argminJ(w),\lambda>0$

解得：
$\hat{w}=(X^TX+\lambda I)^{-1}X^TY$

Python实现多元回归和预测

分别使用理论法、统计学习中的statmodels库和机器学习sklearn库中的回归实现;

数据集 :一个选取了11组数据，其中将8个作为训练集train和3个测试集test.

理论法

import numpy as np
from sklearn import model_selection
x = np.matrix([[1,1,1,1,1,1,1,1,1,1,1],[104, 101.8, 105.8, 111.5, 115.97, 120.03, 113.3, 116.4, 105.1, 83.4, 73.3],
     [135.6, 140.2, 140.1, 146.9, 144, 143, 133.3, 135.7, 125.8, 98.5, 99.8],
     [131.6, 135.5, 142.6, 143.2, 142.2, 138.4, 138.4, 135, 122.5, 87.2, 96.5],
     [54.2, 54.9, 54.8, 56.3, 54.5, 54.6, 54.9, 54.8, 49.3, 41.5, 48.9],])
x = np.transpose(x)
# x_train = x[:9,:]
# x_test = x[9:,:]
x_train,x_test = model_selection.train_test_split(x,test_size = 0.2)
"""
逗号前是取行，逗号后是取列。列要全部取，行只取前9
"""
y = np.matrix([560823, 542386, 604834, 591248, 583031, 640636, 575688, 689637, 570790, 519574, 614677])
y = np.transpose(y)
# y_train =y[:9]
# y_test = y[9:]
y_train,y_test = model_selection.train_test_split(y,test_size = 0.2)
"""
y = y.reshape([-1, 1])
"""
tx = np.transpose(x_train) #矩阵转置
nx = tx*x_train
inv = np.linalg.pinv(nx)
w = inv*tx*y_train
print(w)
f = (np.transpose(w))*(np.transpose(x_test))
print("pre",f)
print("ture",y_test)

Intercept and coef：
#the first row is Intercept, others are coef
[[656767.09979853]
 [ -1037.71064526]
 [ -4176.44093565]
 [  1250.05721125]
 [  7924.32921807]]
pre [[566108.88010529 672024.32377186 570019.56312871]]
ture [[689637][614677][640636]]

statmodels库

from sklearn import model_selection
import statsmodels.api as sm
import pandas as pd
import numpy as np

x = np.matrix([[104, 101.8, 105.8, 111.5, 115.97, 120.03, 113.3, 116.4, 105.1, 83.4, 73.3],
     [135.6, 140.2, 140.1, 146.9, 144, 143, 133.3, 135.7, 125.8, 98.5, 99.8],
     [131.6, 135.5, 142.6, 143.2, 142.2, 138.4, 138.4, 135, 122.5, 87.2, 96.5],
     [54.2, 54.9, 54.8, 56.3, 54.5, 54.6, 54.9, 54.8, 49.3, 41.5, 48.9],])
x = np.transpose(x)
# x_wb = sm.add_constant(x)
x_train,x_test = model_selection.train_test_split(x,test_size = 0.2)
# x_train = x[:9,:]
# x_test = x[9:,:]
"""
逗号前是取行，逗号后是取列。列要全部取，行只取前9
"""
y = np.matrix([560823, 542386, 604834, 591248, 583031, 640636, 575688, 689637, 570790, 519574, 614677])
y = np.transpose(y)
y_train,y_test = model_selection.train_test_split(y,test_size = 0.2)
'''
第一种使用方式
'''
model = sm.OLS(y_train,x_train)
results = model.fit()
print(results.params)
print(results.summary())
'''
第二种使用方式
'''
data = pd.DataFrame(x_train,columns = ['x1','x2','x3','x4'])
testdata = pd.DataFrame(x_test,columns = ['x1','x2','x3','x4'])
data['y'] = y_train
print(data.head())
res = sm.formula.ols('y~x1+x2+x3+x4',data = data).fit()
print(res.params)
print(res.tvalues)
print(x_test)
print("pre:",res.predict(testdata))
print("true:",y_test)

  OLS Regression Results                                
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                  -5.323
Model:                            OLS   Adj. R-squared (uncentered):            -11.647
Method:                 Least Squares   F-statistic:                            -0.8419
Date:                Wed, 03 Feb 2021   Prob (F-statistic):                        1.00
Time:                        21:45:05   Log-Likelihood:                         -90.425
No. Observations:                   8   AIC:                                      188.9
Df Residuals:                       4   BIC:                                      189.2
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1          6456.6335   1592.525      4.054      0.015    2035.075    1.09e+04
x2          4056.7051   3908.981      1.038      0.358   -6796.367    1.49e+04
x3         -1.383e+04   3084.779     -4.482      0.011   -2.24e+04   -5262.265
x4          2.168e+04   2834.596      7.649      0.002    1.38e+04    2.96e+04
==============================================================================
Omnibus:                        1.122   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.571   Jarque-Bera (JB):                0.781
Skew:                          -0.516   Prob(JB):                        0.677
Kurtosis:                       1.869   Cond. No.                         113.
==============================================================================
Intercept    477698.682835
x1             4442.644279
x2            -1414.453576
x3            -4615.177872
x4             7913.426734
dtype: float64
Intercept    1.923370
x1           2.749555
x2          -0.340907
x3          -0.862686
x4           1.057069
dtype: float64
[[105.8 140.1 142.6  54.8]
 [104.  135.6 131.6  54.2]
 [113.3 133.3 138.4  54.9]]
pre: 
0    525096.922084
1    569484.104019
2    588210.128225
dtype: float64
true: 
0    542386 
1    560823
2    689637

sklearn库 Regression

from sklearn import preprocessing
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

x = np.matrix([[104, 101.8, 105.8, 111.5, 115.97, 120.03, 113.3, 116.4, 105.1, 83.4, 73.3],
     [135.6, 140.2, 140.1, 146.9, 144, 143, 133.3, 135.7, 125.8, 98.5, 99.8],
     [131.6, 135.5, 142.6, 143.2, 142.2, 138.4, 138.4, 135, 122.5, 87.2, 96.5],
     [54.2, 54.9, 54.8, 56.3, 54.5, 54.6, 54.9, 54.8, 49.3, 41.5, 48.9],])
x = np.transpose(x)
x_train,x_test = model_selection.train_test_split(x,test_size = 0.2)
# x_train = x[:9,:]
# x_test = x[9:,:]
"""
逗号前是取行，逗号后是取列。列要全部取，行只取前9
"""
y = np.matrix([560823, 542386, 604834, 591248, 583031, 640636, 575688, 689637, 570790, 519574, 614677])
y = np.transpose(y)
y_train,y_test = model_selection.train_test_split(y,test_size = 0.2)
# y_train =y[:9]
# y_test = y[9:]
model = LinearRegression(fit_intercept = True)
model.fit(x_train,y_train)
print("Intercept:",model.intercept_)
print("coef:",model.coef_)

pred = model.predict(x_test)
print("predition:\n",pred)
print("real:\n",y_test)

Intercept: [117471.80714502]
coef: [[ -574.45939702  8092.27590779 -7413.31955575  8143.91966418]]
predition:
 [[552183.69901895]
 [624369.48768329]
 [605809.06246475]]
real:
 [[519574]
 [570790]
 [640636]]

Amiyai

关注

4
点赞
踩
0

收藏

觉得还不错? 一键收藏
2
评论
线性回归学习笔记

线性回归一元线性回归模型函数形式：f(x)=ωxi+b(1.1)f(x)=\omega{x_i+b}\tag{1.1}f(x)=ωxi+b(1.1)求解www和bbb：method 1 最小二乘法公式法ω=∑i=1nyi(xi−x‾)∑i=1nxi2−1n(∑i=1n)2(1.2)\omega=\frac{\sum\limits_{i=1}^ny_i(x_i-\overline{x})}{\sum\limits^n_{i=1}x_i^2-\frac{1}{n}(\sum^n_{i=1
复制链接

扫一扫