Linear Regression: Boston Housing Prices

车前猛跑

Last modified: 2024-05-30 22:19:00


First published: 2024-05-30 17:48:16

Article link: https://blog.csdn.net/cin_ie/article/details/139329926


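Overview of the Boston Housing Problem

The Boston housing problem is a classic machine learning task: predicting the median value of homes in the Boston area. The dataset contains 506 samples, each with 13 features that describe socioeconomic and geographic characteristics of the town. The features are: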

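CRIM: per-capita crime rate of the town.
ZN: proportion of residential land zoned for lots over 25,000 square feet.
INDUS: proportion of non-retail business land per town.
CHAS: Charles River dummy variable (1 if the area borders the Charles River, 0 otherwise).
NOX: nitric oxides concentration (parts per 10 million).
RM: average number of rooms per dwelling.
AGE: proportion of owner-occupied units built before 1940.
DIS: weighted distance to five Boston employment centres.
RAD: index of accessibility to radial highways.
TAX: full-value property-tax rate per $10,000.
PTRATIO: pupil-teacher ratio of the town.
B: computed as 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents in the town.
LSTAT: percentage of the population with lower socioeconomic status.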
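These features range from public safety (crime rate) through economic and demographic factors (property-tax rate, B) to location (distance to employment centres), which gives a rich basis for predicting house prices in the Boston area. With them, various machine learning algorithms such as linear regression or neural networks can be used to build a model that predicts the median home value for a given set of feature values.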

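Implementation Approach

Some posts and videos load the 506 samples directly with sklearn.datasets.load_boston, but that loader has been removed from recent scikit-learn releases (1.2 and later), so this post provides the 506 housing records as a txt file instead.
The data has 506 rows and 14 columns: the first 13 columns are the features that influence the price, and the last column is the target.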
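As a quick check of that layout, a whitespace-separated text file can also be parsed with np.loadtxt. The sketch below is only an illustration: it reads two made-up placeholder rows from an in-memory buffer instead of the real file, and the values are not actual records from the dataset.

```python
import io
import numpy as np

# Two placeholder rows in the 14-column layout (illustrative values only,
# not real records). Extra spaces between fields are deliberate.
sample = io.StringIO(
    "0.1 0.0 8.0 0 0.5 6.0 60.0 4.0 4 300.0 18.0 390.0 10.0 22.0\n"
    "0.2   12.5 7.0 1 0.4 6.5 45.0 5.0 5 280.0 17.0 395.0  8.0 25.0\n"
)

# np.loadtxt splits each line on runs of whitespace by default.
data = np.loadtxt(sample)

X = data[:, :13]  # first 13 columns: features
y = data[:, 13]   # last column: target

print(data.shape, X.shape, y.shape)
```

With the real file, the same slicing should give a 506 x 13 feature matrix and a target vector of length 506.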

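Load the data into an array.
Split it into a 506 × 13 feature matrix X and a 506 × 1 target vector y.
Use 80% of the samples as training data to fit the model.
Feed the remaining 20% of X into the model to produce y_predict.
Compare y_predict with the held-out 20% of targets, y_test.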
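The steps above can also be sketched with scikit-learn's train_test_split, which shuffles and splits in one call. This is an alternative to manual shuffling and slicing, not the method used in this post's code, and it runs on synthetic stand-in data (an assumption made so the example is self-contained) rather than the real housing file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the housing data
# (assumption: NOT the real dataset, just 506 samples x 13 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
true_w = rng.normal(size=13)
y = X @ true_w + 5.0 + rng.normal(scale=0.1, size=506)

# Shuffle and hold out 20% for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training part, predict on the held-out part.
model = LinearRegression().fit(X_train, y_train)
y_predict = model.predict(X_test)

print(X_train.shape, X_test.shape)
print(model.score(X_test, y_test))
```

Fixing random_state means the split, and therefore the coefficients and score, stay the same from run to run.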

import numpy as np
from sklearn.linear_model import LinearRegression

# Print floats without scientific notation
np.set_printoptions(suppress=True)

# Load the data
def load_data(file_path):
    data = []
    # The with-block closes the file even if parsing fails
    with open(file_path) as f:
        for line in f:
            # split() with no arguments splits on any run of whitespace and
            # ignores leading/trailing whitespace, including the newline
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Convert the fields of one row to np.float64
            data.append(np.array(fields).astype(np.float64))
    return np.array(data)

# Build the training and test sets
def genTrainData(data):
    # Shuffle the rows (in place, so the caller's array is reordered too)
    np.random.shuffle(data)
    # Separate X and y
    # Features: the 13 columns that enter the model as w * x
    X = data[:, 0:13]
    # Target
    y = data[:, 13]

    # Of the 506 rows, about 80% (406) fit the linear regression;
    # the rest validate the fitted model (w, b)
    X_train = X[:406]
    y_train = y[:406]

    # The remaining rows (about 20%) are the test set
    X_test = X[406:]
    y_test = y[406:]

    return X_train, y_train, X_test, y_test

data = load_data("/Users/bmo/pks/boston.txt")
X_train, y_train, X_test, y_test = genTrainData(data)

model = LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)
# Get the coefficients (weights) and the intercept
print(f"Model coefficients: {model.coef_}, intercept: {model.intercept_}")

# The fitted coefficients vary in size and sign. What do they mean?
# Positive: the price rises with the feature, e.g. rooms per dwelling (RM)
# Negative: the price falls as the feature rises, e.g. crime rate (CRIM), pollution (NOX)
y_predict = model.predict(X_test).round(2)

pairs = [f"{a} <-> {b}" for a, b in zip(y_predict, y_test)]
print("[Predicted y for the 20% test X] vs. [actual y for the 20% test set]:")
for pair in pairs:
    print(pair)

Model coefficients: [ -0.11225805   0.05053865   0.02966496   1.9157657  -17.10023384
   3.61262367  -0.00209593  -1.4725001    0.3153099   -0.01313788
  -0.95864297   0.00968798  -0.51735275], intercept: 37.56398728469011
[Predicted y for the 20% test X] vs. [actual y for the 20% test set]:
 ['3.72 <-> 8.8', '18.44 <-> 10.9', '28.43 <-> 28.0', '34.98 <-> 30.1', '12.85 <-> 10.5', '9.91 <-> 8.3', '18.54 <-> 19.5', '34.38 <-> 31.0', '28.46 <-> 33.4', '24.32 <-> 23.4', '20.76 <-> 21.0', '32.14 <-> 27.9', '30.78 <-> 32.9', '16.52 <-> 13.8', '34.59 <-> 37.3', '30.62 <-> 28.7', '8.14 <-> 5.0', '16.49 <-> 17.6', '18.96 <-> 17.8', '22.63 <-> 25.0', '33.23 <-> 36.1', '35.45 <-> 32.4', '23.17 <-> 22.9', '18.03 <-> 14.5', '15.22 <-> 15.4', '17.37 <-> 15.1', '19.28 <-> 18.2', '30.37 <-> 34.7', '39.27 <-> 50.0', '40.58 <-> 50.0', '14.37 <-> 13.1', '28.61 <-> 22.8', '25.59 <-> 24.1', '21.36 <-> 20.1', '6.5 <-> 10.5', '23.09 <-> 24.7', '36.38 <-> 36.0', '25.56 <-> 23.1', '24.6 <-> 21.7', '5.84 <-> 8.8', '15.53 <-> 16.6', '13.31 <-> 12.8', '15.06 <-> 15.7', '18.2 <-> 14.2', '20.79 <-> 18.8', '17.83 <-> 7.2', '17.3 <-> 18.1', '23.89 <-> 20.1', '31.28 <-> 30.7', '28.34 <-> 25.0', '37.93 <-> 44.8', '19.41 <-> 19.6', '31.64 <-> 30.8', '24.29 <-> 21.9', '20.8 <-> 21.1', '26.59 <-> 22.6', '18.92 <-> 16.1', '11.14 <-> 23.1', '29.06 <-> 24.3', '19.42 <-> 19.9', '28.08 <-> 26.6', '19.57 <-> 18.5', '23.38 <-> 20.8', '31.75 <-> 29.1', '19.02 <-> 19.9', '16.33 <-> 10.2', '13.49 <-> 13.9', '31.83 <-> 32.2', '18.45 <-> 14.1', '22.58 <-> 21.1', '25.35 <-> 24.0', '22.45 <-> 23.2', '18.19 <-> 12.6', '31.94 <-> 33.2', '14.28 <-> 11.0', '20.39 <-> 20.4', '25.91 <-> 22.2', '6.53 <-> 13.8', '16.54 <-> 19.3', '22.86 <-> 24.4', '27.21 <-> 20.6', '8.8 <-> 8.7', '17.42 <-> 19.4', '36.52 <-> 50.0', '25.27 <-> 50.0', '43.27 <-> 50.0', '30.55 <-> 29.1', '21.21 <-> 19.3', '34.77 <-> 43.8', '24.03 <-> 22.2', '18.04 <-> 22.5', '29.36 <-> 25.0', '34.55 <-> 35.2', '33.86 <-> 32.0', '34.35 <-> 35.4', '26.96 <-> 22.1', '14.72 <-> 14.8', '22.3 <-> 22.0', '21.31 <-> 19.6', '28.57 <-> 24.4']
Model score: 0.7799984355156543
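To make the printed coefficients easier to read, they can be paired with the feature names in column order. The sketch below copies the coefficients from the run shown in this post (another shuffle would give different values) and assumes the 13 columns follow the order of the feature list given earlier.

```python
# Feature names in column order (assumed to match the feature list in this
# post) and the coefficients copied from the output printed above.
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
         "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
coefs = [-0.11225805, 0.05053865, 0.02966496, 1.9157657, -17.10023384,
         3.61262367, -0.00209593, -1.4725001, 0.3153099, -0.01313788,
         -0.95864297, 0.00968798, -0.51735275]

# Sort by absolute size so the largest per-unit effects come first.
ranked = sorted(zip(names, coefs), key=lambda p: abs(p[1]), reverse=True)

for name, w in ranked:
    sign = "+" if w > 0 else "-"
    print(f"{name:8s} {sign} {abs(w):.4f}")
```

The features are measured on very different scales, so the size of a raw coefficient is the change in price per unit of that feature, not a ranking of importance. Comparing importance would require standardizing the features first.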

Scoring the Model

score = model.score(X_test, y_test)
print(f"Model score: {score}")

Model score: 0.7801469895898483

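The score is at most 1 and can be negative.
The closer the score is to 1, the better the model fits the data.
Because genTrainData shuffles the rows without a fixed random seed, the score changes slightly from run to run.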

How score is computed (from the scikit-learn documentation):
The coefficient of determination $R^2$ is defined as
$1 - \frac{u}{v}$, where
$u$ is the residual sum of squares ((y_true - y_pred) ** 2).sum() and
$v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
The best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse). A constant model that always predicts
the expected value of y, disregarding the input features, would get
an $R^2$ score of 0.0.
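The definition can be checked by hand on a few numbers. The sketch below uses small made-up arrays (placeholders, not values from the housing data) and compares the manual formula with sklearn.metrics.r2_score, including a constant baseline and a deliberately bad prediction.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - u / v

# A constant model that always predicts the mean of y_true.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

# A prediction worse than the constant model: the targets in reverse order.
r2_bad = r2_score(y_true, y_true[::-1])

print(r2_manual, r2_score(y_true, y_pred), r2_baseline, r2_bad)
```

Here u = 0.75 and v = 20, so the manual value is 1 - 0.75 / 20 = 0.9625. The mean-only baseline scores 0, and the reversed prediction scores below 0.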

The mean squared error (MSE) gives another measure of the model's error

from sklearn.metrics import mean_squared_error
# Mean squared error between the true targets and the (rounded) predictions
mse = mean_squared_error(y_test, y_predict)
print(f"Model mean squared error: {mse}")

Model mean squared error: 14.743696000000005
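For reference, the MSE is simply the average of the squared differences between true and predicted values. The sketch below reproduces it by hand on a few placeholder numbers (not taken from the run above) and also derives the root mean squared error, which is in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Placeholder targets and predictions, for illustration only.
y_test = np.array([24.0, 21.6, 34.7, 33.4])
y_predict = np.array([25.0, 20.6, 32.7, 35.4])

# MSE by hand: mean of squared residuals.
mse_manual = np.mean((y_test - y_predict) ** 2)

# The same quantity from scikit-learn.
mse_sklearn = mean_squared_error(y_test, y_predict)

# RMSE is in the same units as y, which makes it easier to interpret.
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```

Applied to the result above, an MSE of about 14.74 corresponds to an RMSE of roughly 3.84 in the units of the target column.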