Machine Learning Study Notes 1: Linear Regression (Part 1)

This post practices building models with four regressors: LinearRegression, ElasticNetCV, LassoCV, and RidgeCV, covering the basic modeling workflow for each. It also uses polynomial feature expansion to compare the fit and the change in R² from degree 1 to degree 5, and uses a Pipeline to shorten the code. After working through it you should know the basic sklearn usage of all four models, plus how to use Pipeline.

First, import the modules we will use (you can also import them as you go):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, ElasticNetCV, LassoCV, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Next, configure matplotlib so plots can display Chinese labels. On Windows this is simple; on macOS I usually use the approach below (other approaches exist online, but they involve more steps):

import matplotlib as mpl
from matplotlib.font_manager import FontProperties

#Chinese font, macOS version; this function is used later
def getChineseFont():
    return FontProperties(fname='/System/Library/Fonts/STHeiti Medium.ttc')

#Chinese font, Windows version
mpl.rcParams['font.sans-serif'] = [u'simHei']
mpl.rcParams['axes.unicode_minus'] = False

Then load the data and take a look at it:

path = r'./datas/household_power_consumption_1000.txt'
datas = pd.read_csv(path, sep=';')
datas.head(10)
    Date        Time      Global_active_power  Global_reactive_power  Voltage  Global_intensity  Sub_metering_1  Sub_metering_2
0   16/12/2006  17:24:00  4.216                0.418                  234.84   18.4              0.0             1.0
1   16/12/2006  17:25:00  5.360                0.436                  233.63   23.0              0.0             1.0
2   16/12/2006  17:26:00  5.374                0.498                  233.29   23.0              0.0             2.0
3   16/12/2006  17:27:00  5.388                0.502                  233.74   23.0              0.0             1.0
4   16/12/2006  17:28:00  3.666                0.528                  235.68   15.8              0.0             1.0
5   16/12/2006  17:29:00  3.520                0.522                  235.02   15.0              0.0             2.0
6   16/12/2006  17:30:00  3.702                0.520                  235.09   15.8              0.0             1.0
7   16/12/2006  17:31:00  3.700                0.520                  235.22   15.8              0.0             1.0
8   16/12/2006  17:32:00  3.668                0.510                  233.99   15.8              0.0             1.0
9   16/12/2006  17:33:00  3.662                0.510                  233.86   15.8              0.0             2.0
datas.describe()
       Global_active_power  Global_reactive_power  Voltage     Global_intensity  Sub_metering_1  Sub_metering_2
count  1000.000000          1000.000000            1000.00000  1000.000000       1000.0          1000.000000
mean   2.418772             0.089232               240.03579   10.351000         0.0             2.749000
std    1.239979             0.088088               4.08442     5.122214          0.0             8.104053
min    0.206000             0.000000               230.98000   0.800000          0.0             0.000000
25%    1.806000             0.000000               236.94000   8.400000          0.0             0.000000
50%    2.414000             0.072000               240.65000   10.000000         0.0             0.000000
75%    3.308000             0.126000               243.29500   14.000000         0.0             1.000000
max    7.706000             0.528000               249.37000   33.200000         0.0             38.000000

Specify X (Voltage and Global_intensity) and Y (Global_active_power), then split into training and test sets:

names = ['Date', 'Time', 'Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']
X = datas[names[4:6]]
Y = datas[names[2]]
#split into training and test sets; test set = 20%
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state=3)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
lr = LinearRegression()
lr.fit(x_train, y_train)
print('Train R^2: ', lr.score(x_train, y_train))
print('Intercept: ', lr.intercept_)
print('Coefficients: ', lr.coef_)
print('Test R^2: ', lr.score(x_test, y_test))
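The value that score() returns is the coefficient of determination, R² = 1 − SS_res / SS_tot. As a sanity check of what that number means, here is the formula computed by hand on toy values (made-up numbers, not the power data):

```python
# R^2 (coefficient of determination), the metric returned by
# LinearRegression.score: R^2 = 1 - SS_res / SS_tot.
# Toy numbers, just to verify the formula.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)
```

An R² close to 1 means the model explains almost all the variance of Y, which is what we see on this dataset.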

from matplotlib.font_manager import FontProperties

y_predict = lr.predict(x_test)
#plot
plt.figure(figsize=(12, 6), facecolor='w')
plt.plot(range(len(x_test)), y_test, 'r-', lw=1, label='test', zorder=10)
plt.plot(range(len(x_test)), y_predict, 'b-', lw=1, label='predict', zorder=10)
plt.title(u'功率与电流、电压的关系', fontproperties=getChineseFont())  #'power vs. current and voltage'
plt.legend(loc='upper left')
plt.show()

Train R^2:  0.9914990458818783
Intercept:  2.4425775
Coefficients:  [0.02165243 1.2555645 ]
Test R^2:  0.9901973293430661
(figure: actual vs. predicted Global_active_power on the test set)
That was a quick first pass with plain LinearRegression. Next we compare the four models, using polynomial feature expansion and a Pipeline.

#use a Pipeline to cut down on repeated code; once defined, each model can be reused directly
models = [
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', LinearRegression())
    ]),
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', LassoCV())
    ]),
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', RidgeCV())
    ]),
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', ElasticNetCV())
    ])
]
#split into training and test sets; test set = 30%
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
model_name = ['lr', 'lasso', 'ridge', 'ela']
colors = ['r', 'r', 'y', 'y', 'b', 'b']
#standardization
ss = StandardScaler()
X_train_ = ss.fit_transform(X_train)  #fit and transform on the training set
X_test_ = ss.transform(X_test)  #the scaler was already fit above; train and test must share one set of statistics, so only transform here
#(note: the loop below fits on the raw X_train/X_test; these standardized copies are prepared but not used there)
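The fit-once/transform-twice pattern matters: the test split must be standardized with the mean and standard deviation learned from the training split, never its own. A pure-Python sketch of what StandardScaler does (toy numbers, using sklearn's population-variance convention):

```python
# StandardScaler semantics by hand: the test split is scaled
# with the TRAIN mean and std, not its own statistics.
train = [1.0, 2.0, 3.0, 4.0]   # one toy feature
test = [2.0, 6.0]

mean = sum(train) / len(train)                            # learned in fit()
var = sum((v - mean) ** 2 for v in train) / len(train)    # population variance, as sklearn uses
std = var ** 0.5

train_scaled = [(v - mean) / std for v in train]  # fit_transform
test_scaled = [(v - mean) / std for v in test]    # transform: train statistics reused
print(test_scaled)
```

If the test set were fit separately, its scaled values would no longer be comparable to the training features the model learned from.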

#build the models and plot
plt.figure(figsize=(21, 24), facecolor='w')
for i in range(len(models)):
    plt.subplot(4, 1, i+1)
    model = models[i]  #take one pipeline at a time
    for j in range(1, 6, 2):
        #set the degree; 'poly' is the alias of the PolynomialFeatures step, joined to its parameter name 'degree' by a double underscore
        model.set_params(poly__degree=j)
        model.fit(X_train, Y_train)  #train the model
        poly = model.named_steps['poly']  #get the PolynomialFeatures step by its alias
        feature = poly.get_feature_names()  #names of the expanded features, so each coefficient can be matched to a term (get_feature_names_out() on newer sklearn)
        lin = model.named_steps['lr']  #get the regressor step
        output = 'degree %d, %s model, score: %.3f, coefficients:' % (j, model_name[i], model.score(X_test, Y_test))
        print(output, lin.coef_)
        print('feature:', feature)
        y_predict = model.predict(X_test)
        label = 'degree %d, score: %.3f' % (j, model.score(X_test, Y_test))
        plt.plot(range(len(X_test)), y_predict, color=colors[j-1], lw=1, label=label)
    plt.plot(range(len(X_test)), Y_test, 'g-', lw=1)
    plt.legend(loc='upper left')  #legend position
    plt.title(model_name[i], fontsize=16)

plt.show()  
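A useful cross-check on the output below: with n features and degree d, PolynomialFeatures (bias term included) generates C(n + d, d) terms, so for our 2 features we expect coefficient arrays of length 3, 10, and 21 at degrees 1, 3, and 5:

```python
# Number of terms produced by a degree-d polynomial expansion of n
# features, bias included: C(n + d, d).
from math import comb

n = 2  # our two features: Voltage and Global_intensity
for d in (1, 3, 5):
    print('degree %d -> %d terms' % (d, comb(n + d, d)))
```

These counts match the lengths of the coefficient and feature lists printed in the results.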

The results:
degree 1, lr model, score: 0.992, coefficients: [0. 0.00611712 0.24437297]
feature: ['1', 'x0', 'x1']
degree 3, lr model, score: 0.993, coefficients: [ 0.00000000e+00 -3.70951434e+01 -6.40167154e+00 1.53651601e-01
5.22340225e-02 1.79832277e-02 -2.12195366e-04 -1.02794156e-04
-6.16294133e-05 -8.12414760e-05]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3']
degree 5, lr model, score: 0.995, coefficients: [ 0.00000000e+00 -6.69622090e-01 1.45306572e+00 -1.13216456e+01
-3.90041968e+01 2.69703377e+02 8.95263141e-02 4.74095642e-01
-3.14655737e+00 -1.19288298e+00 -2.65128414e-04 -1.92118023e-03
1.22165319e-02 9.45563983e-03 1.96964251e-03 2.78816740e-07
2.59505716e-06 -1.57775191e-05 -1.88099529e-05 -7.25206107e-06
-2.76570191e-06]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3', 'x0^4', 'x0^3 x1', 'x0^2 x1^2', 'x0 x1^3', 'x1^4', 'x0^5', 'x0^4 x1', 'x0^3 x1^2', 'x0^2 x1^3', 'x0 x1^4', 'x1^5']
degree 1, lasso model, score: 0.992, coefficients: [0. 0.00488852 0.24340701]
feature: ['1', 'x0', 'x1']
degree 3, lasso model, score: 0.992, coefficients: [ 0.00000000e+00 -0.00000000e+00 0.00000000e+00 -0.00000000e+00
0.00000000e+00 0.00000000e+00 -9.96040425e-08 4.24340539e-06
0.00000000e+00 0.00000000e+00]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3']
degree 5, lasso model, score: 0.990, coefficients: [ 0.00000000e+00 -0.00000000e+00 0.00000000e+00 -0.00000000e+00
0.00000000e+00 0.00000000e+00 -0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 -0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 -2.46402283e-12
7.32871262e-11 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3', 'x0^4', 'x0^3 x1', 'x0^2 x1^2', 'x0 x1^3', 'x1^4', 'x0^5', 'x0^4 x1', 'x0^3 x1^2', 'x0^2 x1^3', 'x0 x1^4', 'x1^5']
(results for the remaining two models omitted)
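Notice how LassoCV drives almost every polynomial coefficient exactly to zero at degrees 3 and 5: that sparsity is the signature of the L1 penalty, which performs implicit feature selection. A minimal illustration with plain Lasso on synthetic data (hypothetical toy setup, unrelated to the power dataset):

```python
# L1 regularization zeroes out irrelevant coefficients exactly,
# unlike L2, which only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]  # only the first two features matter

lasso = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))
print(lasso.coef_.round(3))
print('zeroed coefficients:', n_zero)
```

The eight irrelevant features get coefficients of exactly 0, which is why the lasso coefficient arrays above are mostly zeros.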
(figures: test-set predictions vs. actual values for the lr, lasso, ridge, and ela pipelines at degrees 1, 3, and 5)

Conclusion: Ridge shows overfitting at degree 5, while the other models fit well up to degree 5. In this experiment, that suggests the L2 penalty can still overfit when the polynomial degree is high, which is worth keeping in mind in practice.
Summary: the point of this post is to master the basic usage of the four linear regression models, and to learn to use polynomial expansion and Pipeline to improve both the model and the code.
