Linear Regression with Sklearn

Luke Liu

Category: Machine Learning. Tags: machine learning, linear regression

Article link: https://blog.csdn.net/magicboom/article/details/88937878

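When Linear Regression comes up, many people first think of the regression line equation we learned in middle school. In fact, that equation is just the special case with a single feature. As with many machine learning methods, Linear Regression takes three steps:

STEP1: CHOOSE A MODEL (function set)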

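For example, with a single feature: $\widehat{y} = b + w x$

For samples with several features, we sum the products of each feature value $x_{j}$ and its weight $w_{j}$.

So our Linear Model is:

$\widehat{y} = b + \sum_{j} w_{j} x_{j}$

We write $x_{j}^{i}$, where the superscript i indexes the sample and the subscript j indexes that sample's features.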
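As a minimal sketch of the model above, the prediction for every sample can be written as one matrix product. The toy values of `X`, `w`, and `b` and the helper name `predict` are made up for illustration; they do not come from the Boston data used later.

```python
import numpy as np

def predict(X, w, b):
    """Return b + sum_j w_j * x_j for every row (sample) of X."""
    return X @ w + b

# Hypothetical toy data: 3 samples (rows), 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])  # one weight per feature
b = 2.0                    # bias

y_hat = predict(X, w, b)
print(y_hat)
```

Each row of `X` is one sample $x^{i}$, and each column is one feature j.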

STEP2: Loss Function

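The loss function measures how bad a particular function in the model is.

L(f) = L(w,b)

We can measure this with the squared error:

$L(w,b) = \sum_{i=1}^{N}\left(y^{i}-\left(b+\sum_{j} w_{j} x_{j}^{i}\right)\right)^{2}$

where $y^{i}$ is the true target of sample i and N is the number of samples.

Our goal is to find the values of w and b that minimize L(f):

$w^{*}, b^{*} = \arg\min_{w,b} L(w,b)$

The arg min operator returns the values of w and b at which L reaches its minimum.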
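To make the loss and the minimization concrete, here is a small sketch on synthetic data. The data, the helper name `squared_error_loss`, and the use of `np.linalg.lstsq` to find the minimizer are assumptions chosen for illustration, not part of the original walkthrough.

```python
import numpy as np

def squared_error_loss(X, y, w, b):
    """Sum over all samples of (y_i - (b + w . x_i))^2."""
    residual = y - (X @ w + b)
    return float(np.sum(residual ** 2))

# Synthetic data with known parameters (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=50)

# Append a column of ones so the bias is estimated together with the weights,
# then solve the least-squares problem directly.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_star, b_star = solution[:-1], solution[-1]

loss_star = squared_error_loss(X, y, w_star, b_star)
loss_zero = squared_error_loss(X, y, np.zeros(3), 0.0)
print("loss at the minimizer:", loss_star)
print("loss at w = 0, b = 0: ", loss_zero)
```

Any other choice of w and b, such as all zeros, gives a loss at least as large as the one at the minimizer.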

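STEP3: FIND and Pick the Best W and B!

We can use the GD (Gradient Descent) algorithm to find the best W and B.

Review: gradient descent

Here is a blog post based on the lectures of Professor Hung-yi Lee (李宏毅) of National Taiwan University:

https://blog.csdn.net/zyq522376829/article/details/66632699

It explains the GD and SGD algorithms in depth.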
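For readers who want to see the idea in code before following the link, below is a rough sketch of batch gradient descent for linear regression. It is an illustration under assumptions: the data is synthetic, the learning rate and step count are arbitrary, and the loss is averaged over samples (mean squared error) rather than summed so that the step size does not depend on the number of samples.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_steps=500):
    """Minimize the mean squared error of b + X @ w by batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        residual = (X @ w + b) - y                # prediction minus target
        grad_w = 2.0 * X.T @ residual / n_samples  # d(MSE)/dw
        grad_b = 2.0 * residual.mean()             # d(MSE)/db
        w -= lr * grad_w                           # step against the gradient
        b -= lr * grad_b
    return w, b

# Synthetic data with known parameters (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([3.0, -2.0])
true_b = 1.0
y = X @ true_w + true_b + rng.normal(scale=0.05, size=200)

w_gd, b_gd = gradient_descent(X, y)
print("estimated w:", np.round(w_gd, 3))
print("estimated b:", round(float(b_gd), 3))
```

SGD follows the same update rule but estimates the gradient from one sample or a small batch at each step instead of the whole dataset.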

OK, now let's use the Sklearn interface in Python to make linear predictions.

First, with Sklearn's LinearRegression:

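# -*- coding: utf-8 -*-
__author__ = "Luke Liu"
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn import metrics
from sklearn import datasets

# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older version.
boston = datasets.load_boston()  # load the Boston housing dataset
print(dir(boston),"\n",boston.data.shape,"\n",boston.target.shape)
# inspect the dataset attributes, the shape of the feature matrix, and the shape of the target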

from sklearn import linear_model
linereg01 = linear_model.LinearRegression()  # create a linear regression instance

# split the data into a training set and a test set (9:1)

X_train,X_test,y_train,y_test= model_selection.train_test_split(
    boston.data,boston.target,test_size=0.1,random_state=42
)

# fit on the training set to find w and b (LinearRegression solves ordinary least squares directly, not by gradient descent)

linereg01.fit(X_train,y_train)
y_predict_in_train= linereg01.predict(X_train)
y_predict_in_test = linereg01.predict(X_test)
w = linereg01.coef_  # weight vector
b = linereg01.intercept_  # bias
print(len(w))  # number of weights
print([round(i,5) for i in w])  # weights rounded to 5 decimal places
print(b)                       # bias
error_in_train  = metrics.mean_squared_error(y_train,y_predict_in_train)  # loss on the training set (mean squared error)
error_in_test   = metrics.mean_squared_error(y_test,y_predict_in_test)    # loss on the test set (mean squared error)
R_value = linereg01.score(X_train,y_train)  # R^2 (coefficient of determination) on the training set
print("error in train:{}".format(error_in_train))
print("error in test:{}".format(error_in_test))
print("the R value is {}".format(R_value))

# plot how well the predictions fit the true prices
plt.figure(figsize=[10,6])
plt.subplot(211)
plt.plot(y_test,linewidth=3,label='the truth price of samples of boston')
plt.plot(y_predict_in_test,linewidth=3,label="This is the predict price from ML")
plt.legend()
plt.title("How do predict fit (1)")

plt.ylabel("housing price")

# a more direct view: is y_predict == y_test? compare the points with the diagonal
plt.subplot(212)
plt.plot(y_test,y_predict_in_test,'o')

plt.title("how do predict fit(2)")
plt.xlabel("y_test")
plt.ylabel("y_predict_in_test")
plt.plot([-10,60],[-10,60],'k--')
plt.axis([-10,60,-10,60])  # both axes run from -10 to 60; called after subplot() so it applies to this panel
plt.show()

Output:

C:\Users\asus\AppData\Local\Programs\Python\Python35-32\python.exe "D:/BaiduYunDownload/python_exe/daily exercise/OpenCV and MachineLearning/Linear_regression.py"
['DESCR', 'data', 'feature_names', 'filename', 'target'] 
 (506, 13) 
 (506,)
13
[-0.11989, 0.03991, 0.02129, 2.77565, -18.5855, 3.75579, 0.00457, -1.47065, 0.31188, -0.01181, -0.94756, 0.01033, -0.5501]
36.73146277462432
error in train:22.7375901544866
error in test:14.995852876582545
the R value is 0.7375152736886281

Process finished with exit code 0

We can see that the fit is fairly good. The weights w and the bias b of the fitted model are:

w = [-0.11989, 0.03991, 0.02129, 2.77565, -18.5855, 3.75579, 0.00457, -1.47065, 0.31188, -0.01181, -0.94756, 0.01033, -0.5501]
b = 36.73146277462432
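These two values fully describe the fitted model: a prediction is just the bias plus the weighted sum of the features. The sketch below checks this against `predict` on synthetic data; the data is an assumption made so that the snippet runs without the Boston dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with 13 features, like the Boston feature matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 13))
y = X @ rng.normal(size=13) + 5.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Rebuild the predictions by hand from coef_ (w) and intercept_ (b).
manual = X @ model.coef_ + model.intercept_
matches = np.allclose(manual, model.predict(X))
print("manual prediction matches predict():", matches)
```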

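Whether this is the best result also depends on the value of random_state, which determines how the data is split. Here we use 42, which gives: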

error in train:22.7375901544866
error in test:14.995852876582545

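The model does relatively well on the test set. However, a good score on a single small test split (10% of the 506 samples) does not rule out a model with relatively high variance and relatively low bias, so overfitting may still be a problem.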
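One common way to reduce the dependence on a single random_state is k-fold cross-validation, which averages the error over several splits. The sketch below is an illustration only: it uses synthetic data from `make_regression` as a stand-in (not the Boston data), and the choice of 10 folds is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with the same shape as the Boston feature matrix.
X, y = make_regression(n_samples=506, n_features=13, noise=5.0, random_state=0)

# 10 folds: each sample lands in the test fold exactly once.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
neg_mse = cross_val_score(LinearRegression(), X, y, cv=cv,
                          scoring="neg_mean_squared_error")
mse_per_fold = -neg_mse  # sklearn reports the negated MSE so that higher is better

print("MSE per fold:", mse_per_fold.round(2))
print("mean MSE: {:.2f}, std: {:.2f}".format(mse_per_fold.mean(), mse_per_fold.std()))
```

The spread across folds gives a rough idea of how much a single train/test split can move the reported error.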

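There are two ways to address overfitting:

1. Enlarge the dataset so that the model is more broadly representative.

2. Regularize the model (Regularization).

There are two kinds of regularization. (1) L1 regularization: add to the loss function a term proportional to the sum of the absolute values of all weights (Manhattan distance). This approach is also called Lasso regression.

$L(w,b) = \sum_{i=1}^{N}\left(y^{i}-\left(b+\sum_{j} w_{j} x_{j}^{i}\right)\right)^{2}+\lambda\sum_{j}\left|w_{j}\right|$

where $y^{i}$ is the true target of sample i and λ > 0 controls the strength of the penalty.

Although this increases the bias, it makes the fitted function smoother, which can help it do better on test sets in general.

In sklearn, you can use Lasso regression for L1 regularization when defining the model.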

Linear = linear_model.Lasso()

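(2) L2 regularization: add to the loss function a term proportional to the sum of the squares of all weights (Euclidean distance). This approach is also called ridge regression.

$L(w,b) = \sum_{i=1}^{N}\left(y^{i}-\left(b+\sum_{j} w_{j} x_{j}^{i}\right)\right)^{2}+\lambda\sum_{j} w_{j}^{2}$

(as before, $y^{i}$ is the true target of sample i and λ > 0 is the penalty strength)

In sklearn, you can use Ridge regression for L2 regularization when defining the model.

Linear = linear_model.Ridge()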
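To compare the two penalties side by side, here is a small sketch on synthetic data in which only the first two features matter. The data and the `alpha` values are assumptions picked for illustration; in sklearn, `alpha` plays the role of the penalty strength, and the L2 estimator class is `linear_model.Ridge`.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 6 features, but only the first two influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + 1.0 + rng.normal(scale=0.5, size=200)

models = {
    "ols": LinearRegression(),     # no penalty
    "lasso": Lasso(alpha=0.5),     # L1 penalty
    "ridge": Ridge(alpha=10.0),    # L2 penalty
}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))
```

In this setup Lasso drives the weights of the irrelevant features exactly to zero, while Ridge shrinks all weights toward zero without removing any of them.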

Let's test this in code:

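# -*- coding: utf-8 -*-
__author__ = "Luke Liu"
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn import metrics
from sklearn import datasets

# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older version.
boston = datasets.load_boston()  # load the Boston housing dataset
print(dir(boston),"\n",boston.data.shape,"\n",boston.target.shape)
# inspect the dataset attributes, the shape of the feature matrix, and the shape of the target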

from sklearn import linear_model
linereg01 = linear_model.Lasso()  # create a linear regression instance with L1 regularization

# split the data into a training set and a test set (9:1)

X_train,X_test,y_train,y_test= model_selection.train_test_split(
    boston.data,boston.target,test_size=0.1,random_state=42
)

# fit on the training set to find w and b (sklearn fits Lasso by coordinate descent, not gradient descent)

linereg01.fit(X_train,y_train)
y_predict_in_train= linereg01.predict(X_train)
y_predict_in_test = linereg01.predict(X_test)
w = linereg01.coef_  # weight vector
b = linereg01.intercept_  # bias
print(len(w))  # number of weights
print([round(i,5) for i in w])  # weights rounded to 5 decimal places
print(b)                       # bias

error_in_train  = metrics.mean_squared_error(y_train,y_predict_in_train)  # loss on the training set (mean squared error)
error_in_test   = metrics.mean_squared_error(y_test,y_predict_in_test)    # loss on the test set (mean squared error)
R_value = linereg01.score(X_train,y_train)  # R^2 (coefficient of determination) on the training set
print("error in train:{}".format(error_in_train))
print("error in test:{}".format(error_in_test))
print("the R value is {}".format(R_value))

# plot how well the predictions fit the true prices
plt.figure(figsize=[10,6])
plt.subplot(211)
plt.plot(y_test,linewidth=3,label='the truth price of samples of boston')
plt.plot(y_predict_in_test,linewidth=3,label="This is the predict price from ML")
plt.legend()
plt.title("How do predict fit (1)")

plt.ylabel("housing price")

# a more direct view: is y_predict == y_test? compare the points with the diagonal
plt.subplot(212)
plt.plot(y_test,y_predict_in_test,'o')

plt.title("how do predict fit(2)")
plt.xlabel("y_test")
plt.ylabel("y_predict_in_test")
plt.plot([-10,60],[-10,60],'k--')
plt.axis([-10,60,-10,60])  # both axes run from -10 to 60; called after subplot() so it applies to this panel
plt.show()

Here is the output:

C:\Users\asus\AppData\Local\Programs\Python\Python35-32\python.exe "D:/BaiduYunDownload/python_exe/daily exercise/OpenCV and MachineLearning/Linear_regression.py"
['DESCR', 'data', 'feature_names', 'filename', 'target'] 
 (506, 13) 
 (506,)
13
[-0.07681, 0.03881, -0.0, 0.0, -0.0, 1.01045, 0.02415, -0.64204, 0.27318, -0.01488, -0.73972, 0.00923, -0.77997]
40.54859458744227
error in train:27.60650506200869
error in test:18.64532694611624
the R value is 0.6813080948165031

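Although error in train and error in test both increased on this split, the Lasso model is simpler (three of the 13 weights are now exactly zero), which may help it do better on data in general.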