Machine Learning [7]: Learning Multivariate Linear Regression

arris1992

Category: Machine Learning

Article link: https://blog.csdn.net/arris1992/article/details/104650564

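Machine learning with multivariate linear regression works essentially the same way as the univariate case. The main difference is that the relationships among the variables can no longer be shown in a single two-dimensional plot. Common alternatives are histograms, box plots, correlation-coefficient heat maps, and scatter-plot matrices.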
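
To make concrete what a correlation-coefficient heat map encodes, the following is a minimal sketch that computes a Pearson correlation matrix by hand with NumPy. The helper name `pearson_matrix` and the toy data are assumptions made for illustration and are not part of the original post; pandas' `DataFrame.corr()` computes the same Pearson coefficients by default.

```python
import numpy as np


def pearson_matrix(data):
    """Pearson correlation matrix of the columns of a 2-D array."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)      # remove each column's mean
    cov = centered.T @ centered              # unnormalised covariance matrix
    scale = np.sqrt(np.diag(cov))            # unnormalised standard deviations
    return cov / np.outer(scale, scale)      # r_ij = cov_ij / (s_i * s_j)


# Hypothetical toy data: column 1 rises with column 0, column 2 falls with it.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.5, size=200),
    -a + rng.normal(scale=0.5, size=200),
])

corr = pearson_matrix(toy)
print(np.round(corr, 2))
```

Every entry lies in [-1, 1] and the diagonal is 1. Negative entries are as informative as positive ones, so the colour scale of a heat map generally needs to cover the whole range.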

The Boston housing dataset is used again for the demonstration:

# -*- coding: utf-8 -*-
# Multivariate linear regression demonstrated on the Boston housing prices

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2; this script needs an older release
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics     # evaluation metrics
from pandas.plotting import scatter_matrix  # scatter-plot matrix


def main():
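    boston = load_boston()              # load the dataset
    print(boston.keys())                # fields contained in the dataset
    print(boston.feature_names)         # names of the feature columns in data

    bos = pd.DataFrame(boston.data)     # convert the data array to a DataFrame for display
    x = bos.iloc[:, 0:3]                # use the first three columns (CRIM, ZN, INDUS)
    bos_target = pd.DataFrame(boston.target)    # MEDV (house price, the target)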
    y = bos_target
    df = pd.concat([y, x], axis=1)
    df.columns = ['MEDV', 'CRIM', 'ZN', 'INDUS']
    print(df)

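    # Histograms
    # xlabelsize=12: font size of the x-axis tick labels; ylabelsize=12: font size of the y-axis tick labels
    # figsize=(12, 7): size of the whole figure
    df.hist(xlabelsize=12, ylabelsize=12, figsize=(12, 7))
    plt.show()

    # Density plots
    # kind='density': density plot; subplots=True: one subplot per column; layout=(2, 2): 2x2 grid of subplots
    # sharex=False: do not share the x axis; fontsize=8: tick font size; figsize=(12, 7): size of the whole figure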
    df.plot(kind='density', subplots=True, layout=(2, 2), sharex=False, fontsize=8, figsize=(12, 7))
    plt.show()

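    # Box plots
    # Same parameters as the density plots; subplots=True is needed for layout to take effect
    df.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False, fontsize=8, figsize=(12, 7))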
    plt.show()

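    # Heat map of the correlation coefficients between the variables
    names = ['MEDV', 'CRIM', 'ZN', 'INDUS']
    correlations = df.corr()     # correlation-coefficient matrix of the variables
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Correlations range from -1 to 1, so the colour scale covers the full range
    cax = ax.matshow(correlations, vmin=-1, vmax=1)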
    fig.colorbar(cax)
    ticks = np.arange(0, 4, 1)
    ax.set_xticks(ticks)
    ax.set_yticks(ticks)
    ax.set_xticklabels(names)
    ax.set_yticklabels(names)
    plt.show()

    # Scatter-plot matrix
    scatter_matrix(df, figsize=(8, 8), c='b')
    plt.show()

    x = df.iloc[:, 1:4]
    y = df['MEDV']
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    print(x_train.shape)
    print(x_test.shape)
    print(y_train.shape)
    print(y_test.shape)

    # Train the model, then print the intercept and coefficients
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    print(lr.intercept_)
    print(lr.coef_)

    # Predict with the trained model
    y_pred = lr.predict(x_test)

    # Model evaluation: plot actual against predicted values
    t = np.arange(len(x_test))
    plt.plot(t, y_test, color='red', linewidth=1.0, linestyle='-', label='y_test')
    plt.plot(t, y_pred, color='green', linewidth=1.0, linestyle='-', label='y_pred')
    plt.legend()
    plt.grid(True)
    plt.show()

    # Model evaluation: compute the metrics
    r2 = lr.score(x_test, y_test)
    mae = metrics.mean_absolute_error(y_test, y_pred)
    mse = metrics.mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print('r2 = ', r2)
    print('mae = ', mae)
    print('mse = ', mse)
    print('rmse = ', rmse)


if __name__ == '__main__':
    main()
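
As a rough sketch of what the fitting and evaluation steps above compute, the block below solves the least-squares problem directly with NumPy and evaluates R², MAE, MSE, and RMSE from their definitions. The function names `fit_ols` and `regression_metrics` and the synthetic data are hypothetical stand-ins, not the author's code; scikit-learn's `LinearRegression` also performs an ordinary least-squares fit, though its internals may differ in detail.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit of y ~ intercept + X @ coef; returns (intercept, coef)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])   # column of ones models the intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]


def regression_metrics(y_true, y_pred):
    """R^2, MAE, MSE and RMSE computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mse = np.mean(residual ** 2)
    total = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - np.sum(residual ** 2) / total,
        "mae": np.mean(np.abs(residual)),
        "mse": mse,
        "rmse": np.sqrt(mse),
    }


# Hypothetical synthetic data with three features and known coefficients.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_coef + rng.normal(scale=0.1, size=300)

# 75% / 25% split, mirroring test_size=0.25 (rows are already in random order).
X_train, X_test = X[:225], X[225:]
y_train, y_test = y[:225], y[225:]

intercept, coef = fit_ols(X_train, y_train)
y_pred = intercept + X_test @ coef
scores = regression_metrics(y_test, y_pred)
print(intercept, coef)
print(scores)
```

Because the synthetic data is generated from a known linear rule with little noise, the recovered intercept and coefficients should land close to the values used to generate it. Real data such as the housing prices will generally give a much lower R².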

The resulting charts are shown below: