Machine Learning Study Notes 1: Linear Regression (Part 1)

This post practices building models with four regressors: LinearRegression, ElasticNetCV, LassoCV, and RidgeCV, covering the basic modeling workflow for each. It also uses polynomial feature expansion to compare the fit and the change in R² from degree 1 to degree 5, and uses a Pipeline to shorten the code. After working through it you should know the basic sklearn usage of all four models, plus how to use Pipeline.

First, import the modules we will use (you can also import them as you go):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, ElasticNetCV, LassoCV, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Next, configure matplotlib so plots can display Chinese labels. On Windows this is simple; on macOS I usually use the approach below (other approaches exist online, but they involve more steps):

import matplotlib as mpl
from matplotlib.font_manager import FontProperties

#Chinese font, macOS version; this function is used later
def getChineseFont():
    return FontProperties(fname='/System/Library/Fonts/STHeiti Medium.ttc')

#Chinese font, Windows version
mpl.rcParams['font.sans-serif'] = [u'simHei']
mpl.rcParams['axes.unicode_minus'] = False

Then load the data and take a look at it:

path = r'./datas/household_power_consumption_1000.txt'
datas = pd.read_csv(path, sep=';')
datas.head(10)
    Date        Time      Global_active_power  Global_reactive_power  Voltage  Global_intensity  Sub_metering_1  Sub_metering_2
0   16/12/2006  17:24:00  4.216                0.418                  234.84   18.4              0.0             1.0
1   16/12/2006  17:25:00  5.360                0.436                  233.63   23.0              0.0             1.0
2   16/12/2006  17:26:00  5.374                0.498                  233.29   23.0              0.0             2.0
3   16/12/2006  17:27:00  5.388                0.502                  233.74   23.0              0.0             1.0
4   16/12/2006  17:28:00  3.666                0.528                  235.68   15.8              0.0             1.0
5   16/12/2006  17:29:00  3.520                0.522                  235.02   15.0              0.0             2.0
6   16/12/2006  17:30:00  3.702                0.520                  235.09   15.8              0.0             1.0
7   16/12/2006  17:31:00  3.700                0.520                  235.22   15.8              0.0             1.0
8   16/12/2006  17:32:00  3.668                0.510                  233.99   15.8              0.0             1.0
9   16/12/2006  17:33:00  3.662                0.510                  233.86   15.8              0.0             2.0
datas.describe()
       Global_active_power  Global_reactive_power  Voltage     Global_intensity  Sub_metering_1  Sub_metering_2
count  1000.000000          1000.000000            1000.00000  1000.000000       1000.0          1000.000000
mean   2.418772             0.089232               240.03579   10.351000         0.0             2.749000
std    1.239979             0.088088               4.08442     5.122214          0.0             8.104053
min    0.206000             0.000000               230.98000   0.800000          0.0             0.000000
25%    1.806000             0.000000               236.94000   8.400000          0.0             0.000000
50%    2.414000             0.072000               240.65000   10.000000         0.0             0.000000
75%    3.308000             0.126000               243.29500   14.000000         0.0             1.000000
max    7.706000             0.528000               249.37000   33.200000         0.0             38.000000

Specify X (Voltage and Global_intensity) and Y (Global_active_power), then split into training and test sets:

names = ['Date', 'Time', 'Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']
X = datas[names[4:6]]
Y = datas[names[2]]
#split into training and test sets; test set = 20%
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state=3)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
lr = LinearRegression()
lr.fit(x_train, y_train)
print('Train R^2: ', lr.score(x_train, y_train))
print('Intercept: ', lr.intercept_)
print('Coefficients: ', lr.coef_)
print('Test R^2: ', lr.score(x_test, y_test))
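The value that score() returns is the coefficient of determination, R² = 1 − SS_res / SS_tot. As a sanity check of what that number means, here is the formula computed by hand on toy values (made-up numbers, not the power data):

```python
# R^2 (coefficient of determination), the metric returned by
# LinearRegression.score: R^2 = 1 - SS_res / SS_tot.
# Toy numbers, just to verify the formula.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)
```

An R² close to 1 means the model explains almost all the variance of Y, which is what we see on this dataset.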

from matplotlib.font_manager import FontProperties

y_predict = lr.predict(x_test)
#plot
plt.figure(figsize=(12, 6), facecolor='w')
plt.plot(range(len(x_test)), y_test, 'r-', lw=1, label='test', zorder=10)
plt.plot(range(len(x_test)), y_predict, 'b-', lw=1, label='predict', zorder=10)
plt.title(u'功率与电流、电压的关系', fontproperties=getChineseFont())  #'power vs. current and voltage'
plt.legend(loc='upper left')
plt.show()

Train R^2:  0.9914990458818783
Intercept:  2.4425775
Coefficients:  [0.02165243 1.2555645 ]
Test R^2:  0.9901973293430661
(figure: actual vs. predicted Global_active_power on the test set)
That was a quick first pass with plain LinearRegression. Next we compare the four models, using polynomial feature expansion and a Pipeline.

#use a Pipeline to cut down on repeated code; once defined, each model can be reused directly
models = [
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', LinearRegression())
    ]),
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', LassoCV())
    ]),
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', RidgeCV())
    ]),
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', ElasticNetCV())
    ])
]
#split into training and test sets; test set = 30%
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
model_name = ['lr', 'lasso', 'ridge', 'ela']
colors = ['r', 'r', 'y', 'y', 'b', 'b']
#standardization
ss = StandardScaler()
X_train_ = ss.fit_transform(X_train)  #fit and transform on the training set
X_test_ = ss.transform(X_test)  #the scaler was already fit above; train and test must share one set of statistics, so only transform here
#(note: the loop below fits on the raw X_train/X_test; these standardized copies are prepared but not used there)
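The fit-once/transform-twice pattern matters: the test split must be standardized with the mean and standard deviation learned from the training split, never its own. A pure-Python sketch of what StandardScaler does (toy numbers, using sklearn's population-variance convention):

```python
# StandardScaler semantics by hand: the test split is scaled
# with the TRAIN mean and std, not its own statistics.
train = [1.0, 2.0, 3.0, 4.0]   # one toy feature
test = [2.0, 6.0]

mean = sum(train) / len(train)                            # learned in fit()
var = sum((v - mean) ** 2 for v in train) / len(train)    # population variance, as sklearn uses
std = var ** 0.5

train_scaled = [(v - mean) / std for v in train]  # fit_transform
test_scaled = [(v - mean) / std for v in test]    # transform: train statistics reused
print(test_scaled)
```

If the test set were fit separately, its scaled values would no longer be comparable to the training features the model learned from.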

#build the models and plot
plt.figure(figsize=(21, 24), facecolor='w')
for i in range(len(models)):
    plt.subplot(4, 1, i+1)
    model = models[i]  #take one pipeline at a time
    for j in range(1, 6, 2):
        #set the degree; 'poly' is the alias of the PolynomialFeatures step, joined to its parameter name 'degree' by a double underscore
        model.set_params(poly__degree=j)
        model.fit(X_train, Y_train)  #train the model
        poly = model.named_steps['poly']  #get the PolynomialFeatures step by its alias
        feature = poly.get_feature_names()  #names of the expanded features, so each coefficient can be matched to a term (get_feature_names_out() on newer sklearn)
        lin = model.named_steps['lr']  #get the regressor step
        output = 'degree %d, %s model, score: %.3f, coefficients:' % (j, model_name[i], model.score(X_test, Y_test))
        print(output, lin.coef_)
        print('feature:', feature)
        y_predict = model.predict(X_test)
        label = 'degree %d, score: %.3f' % (j, model.score(X_test, Y_test))
        plt.plot(range(len(X_test)), y_predict, color=colors[j-1], lw=1, label=label)
    plt.plot(range(len(X_test)), Y_test, 'g-', lw=1)
    plt.legend(loc='upper left')  #legend position
    plt.title(model_name[i], fontsize=16)

plt.show()  
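A useful cross-check on the output below: with n features and degree d, PolynomialFeatures (bias term included) generates C(n + d, d) terms, so for our 2 features we expect coefficient arrays of length 3, 10, and 21 at degrees 1, 3, and 5:

```python
# Number of terms produced by a degree-d polynomial expansion of n
# features, bias included: C(n + d, d).
from math import comb

n = 2  # our two features: Voltage and Global_intensity
for d in (1, 3, 5):
    print('degree %d -> %d terms' % (d, comb(n + d, d)))
```

These counts match the lengths of the coefficient and feature lists printed in the results.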

The results:
degree 1, lr model, score: 0.992, coefficients: [0. 0.00611712 0.24437297]
feature: ['1', 'x0', 'x1']
degree 3, lr model, score: 0.993, coefficients: [ 0.00000000e+00 -3.70951434e+01 -6.40167154e+00 1.53651601e-01
5.22340225e-02 1.79832277e-02 -2.12195366e-04 -1.02794156e-04
-6.16294133e-05 -8.12414760e-05]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3']
degree 5, lr model, score: 0.995, coefficients: [ 0.00000000e+00 -6.69622090e-01 1.45306572e+00 -1.13216456e+01
-3.90041968e+01 2.69703377e+02 8.95263141e-02 4.74095642e-01
-3.14655737e+00 -1.19288298e+00 -2.65128414e-04 -1.92118023e-03
1.22165319e-02 9.45563983e-03 1.96964251e-03 2.78816740e-07
2.59505716e-06 -1.57775191e-05 -1.88099529e-05 -7.25206107e-06
-2.76570191e-06]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3', 'x0^4', 'x0^3 x1', 'x0^2 x1^2', 'x0 x1^3', 'x1^4', 'x0^5', 'x0^4 x1', 'x0^3 x1^2', 'x0^2 x1^3', 'x0 x1^4', 'x1^5']
degree 1, lasso model, score: 0.992, coefficients: [0. 0.00488852 0.24340701]
feature: ['1', 'x0', 'x1']
degree 3, lasso model, score: 0.992, coefficients: [ 0.00000000e+00 -0.00000000e+00 0.00000000e+00 -0.00000000e+00
0.00000000e+00 0.00000000e+00 -9.96040425e-08 4.24340539e-06
0.00000000e+00 0.00000000e+00]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3']
degree 5, lasso model, score: 0.990, coefficients: [ 0.00000000e+00 -0.00000000e+00 0.00000000e+00 -0.00000000e+00
0.00000000e+00 0.00000000e+00 -0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 -0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 -2.46402283e-12
7.32871262e-11 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3', 'x0^4', 'x0^3 x1', 'x0^2 x1^2', 'x0 x1^3', 'x1^4', 'x0^5', 'x0^4 x1', 'x0^3 x1^2', 'x0^2 x1^3', 'x0 x1^4', 'x1^5']
(results for the remaining two models omitted)
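Notice how LassoCV drives almost every polynomial coefficient exactly to zero at degrees 3 and 5: that sparsity is the signature of the L1 penalty, which performs implicit feature selection. A minimal illustration with plain Lasso on synthetic data (hypothetical toy setup, unrelated to the power dataset):

```python
# L1 regularization zeroes out irrelevant coefficients exactly,
# unlike L2, which only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]  # only the first two features matter

lasso = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))
print(lasso.coef_.round(3))
print('zeroed coefficients:', n_zero)
```

The eight irrelevant features get coefficients of exactly 0, which is why the lasso coefficient arrays above are mostly zeros.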
(figures: test-set predictions vs. actual values for the lr, lasso, ridge, and ela pipelines at degrees 1, 3, and 5)

Conclusion: Ridge shows overfitting at degree 5, while the other models fit well up to degree 5. In this experiment, that suggests the L2 penalty can still overfit when the polynomial degree is high, which is worth keeping in mind in practice.
Summary: the point of this post is to master the basic usage of the four linear regression models, and to learn to use polynomial expansion and Pipeline to improve both the model and the code.
