scikit-learn: Linear Regression

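# Background on linear regression

We begin regression analysis with linear regression, the earliest and most basic regression model: it fits a straight line to the data.

# The dataset

We use the boston dataset that ships with scikit-learn, which is well suited to demonstrating linear regression. It contains the median house prices for the Boston area, together with factors that may affect house prices, such as the crime rate.

## Loading the data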
from sklearn import datasets
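# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this call needs an older release
boston = datasets.load_boston()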
## Data visualization
import pandas as pd
import warnings  # used to suppress warnings raised by the seaborn plotting library
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)
%matplotlib inline
## Converting scikit-learn data to a pandas DataFrame
def skdata2df(skdata):
    dfdata = pd.DataFrame(skdata.data,columns=skdata.feature_names)
    dfdata["target"] = skdata.target
    return dfdata
bs = skdata2df(boston)
bs.head()
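| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |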
bs.describe()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.593761 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.596783 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.647423 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
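Because the conversion helper only relies on the `data`, `feature_names` and `target` attributes, it should work on other bundled datasets as well. The sketch below tries it on `load_diabetes`, which is my own choice of substitute (it is not used in the original post) because it ships with scikit-learn and needs no download.

```python
import pandas as pd
from sklearn import datasets


def skdata2df(skdata):
    """Turn a scikit-learn Bunch (data, feature_names, target) into a DataFrame."""
    dfdata = pd.DataFrame(skdata.data, columns=skdata.feature_names)
    dfdata["target"] = skdata.target
    return dfdata


# Assumption: load_diabetes stands in for the boston dataset used in the post.
diabetes = datasets.load_diabetes()
db = skdata2df(diabetes)

print(db.shape)             # (rows, feature columns + target)
print(db.columns.tolist())  # feature names followed by "target"
print(db.describe().loc[["mean", "std"], "target"])
```

The same `head()` and `describe()` calls shown above can then be applied to `db`.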
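# jointplot creates its own figure, so no plt.figure() call is needed
for f in boston.feature_names:
    # `height` is the name used since seaborn 0.9 for the former `size` argument
    sns.jointplot(x=f, y="target", data=bs, kind='reg', height=6)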

[Figures: joint regression plots of each feature against target]


# The linear regression model

Using linear regression in scikit-learn is very simple.
First, import the LinearRegression class and create an object:

from sklearn.linear_model import LinearRegression
lr = LinearRegression()

Next, pass the independent variables and the dependent variable to the fit method of LinearRegression:

lr.fit(boston.data, boston.target)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now we can make predictions:

predictions = lr.predict(boston.data)
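To see the same fit and predict steps on data where the right answer is known, here is a minimal sketch on synthetic data. The coefficients, intercept, noise level and random seed are made up for illustration. Note that, as in the post, the predictions are made on the same rows used for fitting, so they describe goodness of fit rather than performance on unseen data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, known coefficients and intercept.
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y = X @ true_coef + true_intercept + rng.normal(scale=0.1, size=200)

model = LinearRegression()
model.fit(X, y)            # estimate coefficients and intercept
y_hat = model.predict(X)   # predictions for the same rows used in fitting

print("estimated coef:", model.coef_)
print("estimated intercept:", model.intercept_)
print("R^2 on the training data:", model.score(X, y))
```

With this little noise, the estimated coefficients should land close to the values used to generate the data.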

A histogram of the residuals (the differences between the actual and predicted values) gives a visual summary of the prediction results:

%matplotlib inline
f, ax = plt.subplots(figsize=(7, 5))
f.tight_layout()
ax.hist(boston.target-predictions,bins=40, label='Residuals Linear', color='b', alpha=.5);
ax.set_title("Histogram of Residuals")
ax.legend(loc='best');

[Figure: histogram of residuals]
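The residuals behind such a histogram can also be inspected numerically. The following sketch uses made-up synthetic data and NumPy's least-squares routine rather than the boston data. It illustrates one property worth knowing: when the model includes an intercept, the least-squares residuals average to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 3.0 + rng.normal(scale=0.5, size=300)

# Add a column of ones so the fit includes an intercept.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

residuals = y - X1 @ beta

# With an intercept in the model, least-squares residuals average to zero.
print("mean residual:", residuals.mean())
print("residual std:", residuals.std())

# Bin counts comparable to what a histogram plot with 10 bins would show.
counts, edges = np.histogram(residuals, bins=10)
print(counts)
```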

Inspect the regression coefficients:

lr.coef_
array([ -1.07170557e-01,   4.63952195e-02,   2.08602395e-02,
         2.68856140e+00,  -1.77957587e+01,   3.80475246e+00,
         7.51061703e-04,  -1.47575880e+00,   3.05655038e-01,
        -1.23293463e-02,  -9.53463555e-01,   9.39251272e-03,
        -5.25466633e-01])
list(zip(boston.feature_names, lr.coef_))
[('CRIM', -0.1071705565603549),
 ('ZN', 0.046395219529801912),
 ('INDUS', 0.020860239532175279),
 ('CHAS', 2.6885613993180009),
 ('NOX', -17.79575866030935),
 ('RM', 3.8047524602580065),
 ('AGE', 0.00075106170332261968),
 ('DIS', -1.4757587965198196),
 ('RAD', 0.3056550383391009),
 ('TAX', -0.012329346305275379),
 ('PTRATIO', -0.95346355469056254),
 ('B', 0.0093925127221887728),
 ('LSTAT', -0.52546663290078754)]
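When reading these numbers, keep in mind that a coefficient's magnitude depends on the units of its feature. In the summary table NOX ranges only from 0.385 to 0.871, while TAX ranges from 187 to 711, so a large coefficient on NOX does not by itself mean NOX matters more. The sketch below uses made-up data with two features on similar ranges (the generating values are illustrative, not estimates from the boston data) and shows that re-expressing a feature in larger units scales its coefficient by the same factor.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x_small = rng.uniform(0.4, 0.9, size=n)   # a feature with a narrow range
x_large = rng.uniform(200, 700, size=n)   # a feature with a wide range
y = -18.0 * x_small - 0.012 * x_large + 30.0 + rng.normal(scale=0.1, size=n)


def ols(features, target):
    """Least-squares fit with an intercept; returns [intercept, coef...]."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta


beta_raw = ols([x_small, x_large], y)
# Express the wide-range feature in units 100 times larger.
beta_rescaled = ols([x_small, x_large / 100.0], y)

print("raw coefficients:     ", beta_raw[1:])
print("rescaled coefficients:", beta_rescaled[1:])
```

The coefficient of the rescaled feature is 100 times the original one, while the other coefficient and the fitted values are unchanged.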

A bar chart shows the coefficients at a glance:

def plotCofBar(x_feature, y_cof):
    """Draw a bar chart with one coefficient per feature."""
    x_value = list(range(len(x_feature)))
    plt.bar(x_value, y_cof, alpha=1, color='r', align="center")
    plt.autoscale(tight=True)
    # Pass the angle as a number rather than the string "90"
    plt.xticks(x_value, x_feature, rotation=90)
    plt.xlabel("feature names")
    plt.ylabel("coefficient")
    plt.title("Coefficients of the linear regression")
    plt.show()
plotCofBar(boston.feature_names, lr.coef_)

[Figure: bar chart of the regression coefficients by feature]


# How linear regression works

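The basic idea of linear regression is to find a set of coefficients $\beta$ that satisfies $y = X\beta$, where $X$ is the data matrix of the independent variables (the features) and $y$ is the dependent variable. A set of coefficients that satisfies this equation exactly is rarely available, so an error term is usually added to capture imprecision or measurement error. The equation then becomes $y = X\beta + \epsilon$, where $\epsilon$ is assumed to be a normally distributed random variable with zero mean that is independent of $X$. Geometrically, this means the error term is orthogonal (perpendicular) to $X$; under these assumptions one can show that $E(X^T \epsilon) = 0$.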

To find the coefficients $\beta$, we minimize the error term, which turns into the problem of minimizing the residual sum of squares.
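For readers who want the intermediate step, the minimization can be sketched as follows, writing the residual sum of squares as $\mathrm{RSS}(\beta)$ (a name introduced here for convenience) and setting its gradient to zero. The last line is the system of normal equations, which has a unique solution when $X^T X$ is invertible.

```latex
\begin{aligned}
\mathrm{RSS}(\beta) &= (y - X\beta)^T (y - X\beta) \\
\nabla_\beta \, \mathrm{RSS}(\beta) &= -2 X^T (y - X\beta) = 0 \\
X^T X \beta &= X^T y
\end{aligned}
```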

This problem can be solved analytically, and the solution is:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
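The closed-form solution can be checked numerically. The sketch below uses made-up synthetic data and plain NumPy: it solves the normal equations directly, compares the result with NumPy's least-squares routine, and confirms the orthogonality claim by checking that the residuals are perpendicular to every column of the design matrix. Using `np.linalg.solve` instead of an explicit inverse is a design choice of this example, made for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.7 + rng.normal(scale=0.2, size=100)

# Prepend a column of ones so the first entry of beta is the intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# Closed form: solve (X^T X) beta = X^T y instead of inverting explicitly.
beta_normal = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)

residuals = y - X1 @ beta_normal

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
# The residual vector is orthogonal to every column of X1.
print("X^T residuals:   ", X1.T @ residuals)
```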

# Linear regression can automatically normalize (scale) the input data

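The LinearRegression estimator can normalize the input data automatically. For ordinary least squares with an intercept this does not change the predictions, whereas for a model such as KNN performance can differ greatly before and after normalization. See my other article on that topic.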

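# normalize=True centers each feature and divides it by its L2 norm; the parameter was deprecated in scikit-learn 1.0 and removed in 1.2
lr2 = LinearRegression(normalize=True)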
lr2.fit(boston.data, boston.target)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
predictions2 = lr2.predict(boston.data)
%matplotlib inline
from matplotlib import pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
f.tight_layout()
ax.hist(boston.target-predictions2,bins=40, label='Residuals Linear', color='b', alpha=.5);
ax.set_title("Histogram of Residuals")
ax.legend(loc='best');

[Figure: histogram of residuals with normalize=True]

import numpy as np
print("after normalize:", np.percentile(boston.target - predictions2, 75))
print("before normalize:", np.percentile(boston.target - predictions, 75))
after normalize: 1.78311579433
before normalize: 1.78311579433

Judging by the 75th percentile of the residuals above, normalization makes no difference at all.
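The contrast with KNN mentioned earlier can be illustrated with a small sketch on made-up data. As assumptions of this example, `StandardScaler` stands in for the `normalize=True` option (the two scale features differently, but both are per-feature linear rescalings), and `KNeighborsRegressor` with 5 neighbours stands in for "a model like KNN".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 200
# Two features on very different scales; only the first one drives y.
X = np.column_stack([rng.normal(size=n), rng.normal(scale=1000.0, size=n)])
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

X_scaled = StandardScaler().fit_transform(X)

# Ordinary least squares: predictions are the same with or without scaling.
lin_raw = LinearRegression().fit(X, y).predict(X)
lin_scaled = LinearRegression().fit(X_scaled, y).predict(X_scaled)

# KNN: distances change with scaling, so the neighbours and predictions change.
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X, y).predict(X)
knn_scaled = KNeighborsRegressor(n_neighbors=5).fit(X_scaled, y).predict(X_scaled)

print("max OLS difference:", np.abs(lin_raw - lin_scaled).max())
print("max KNN difference:", np.abs(knn_raw - knn_scaled).max())
```

The linear model absorbs the rescaling into its coefficients, so its predictions agree up to floating-point error. KNN depends on distances between points, and on the raw data those distances are dominated by the large-scale feature.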

© 2020 CSDN