Machine Learning Example: Predicting Median House Prices (with Code)

Latest recommended article published on 2024-10-12 16:29:50

苹果树上有橘子

Category: Summaries. Tags: machine learning, python, sklearn, data analysis, regression

Article link: https://blog.csdn.net/qq_51153436/article/details/121527662

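Prerequisites:

1. Some Python programming experience.
2. Familiarity with Python's main scientific libraries, especially numpy, pandas, and matplotlib.
3. Preferably work in Jupyter. (If you do not have it, installing Anaconda is recommended, since it ships with Jupyter.)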

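I. Download the data:

1. You only need to download one compressed file, housing.tgz, which contains housing.csv (with all the data). Running tar xzf housing.tgz extracts the CSV file. The function below performs the download and the extraction from Python.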

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # Download the archive, then extract housing.csv next to it
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

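After that, just call the function. It is best to run Jupyter in Google Chrome; otherwise you may get an error (no permission to access the site).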

fetch_housing_data()
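
If you want to see what the extraction step does without touching the network, here is a minimal sketch that builds a tiny .tgz archive in a temporary directory and unpacks it the same way fetch_housing_data() does. The file name demo.csv and its contents are made up for illustration; they are not the real housing data.

```python
import os
import tarfile
import tempfile

# Work in a throwaway directory so the demo does not touch your project files.
workdir = tempfile.mkdtemp()

# Create a small CSV file (hypothetical contents, not the real housing data).
csv_path = os.path.join(workdir, "demo.csv")
with open(csv_path, "w") as f:
    f.write("longitude,latitude\n-122.23,37.88\n")

# Pack it into a gzip-compressed tar archive, similar in kind to housing.tgz.
tgz_path = os.path.join(workdir, "demo.tgz")
with tarfile.open(tgz_path, "w:gz") as tgz:
    tgz.add(csv_path, arcname="demo.csv")

# Extract into a separate folder: roughly what `tar xzf demo.tgz` does.
out_dir = os.path.join(workdir, "extracted")
os.makedirs(out_dir, exist_ok=True)
with tarfile.open(tgz_path) as tgz:
    tgz.extractall(path=out_dir)

extracted_files = os.listdir(out_dir)
print(extracted_files)
```

Opening the archive in a with block closes it automatically, even if the extraction raises an error.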

2. Load the data with pandas. The function returns a DataFrame object containing all the data.

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

load_housing_data(HOUSING_PATH)  # inspect the loaded data
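
To get a feel for what pd.read_csv returns before downloading anything, the sketch below parses a few rows from an in-memory string. The rows are invented and only imitate the layout of housing.csv; the column selection is an assumption made for the demo.

```python
import io

import pandas as pd

# Hypothetical rows that only mimic the layout of housing.csv.
csv_text = """longitude,latitude,median_income,median_house_value,ocean_proximity
-122.23,37.88,8.3252,452600.0,NEAR BAY
-122.22,37.86,8.3014,358500.0,NEAR BAY
-121.97,37.64,,180000.0,INLAND
"""

# read_csv accepts any file-like object, not only a path on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)   # the class name of the returned object
print(sample.dtypes)           # numeric columns are parsed as floats, text as object
print(sample.isna().sum())     # the empty median_income cell becomes a missing value
```

Empty cells are read as NaN, which is why checking the missing-value counts right after loading is worthwhile.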

Inspect the data structure:

# Housing data
housing = load_housing_data()
housing.head()
housing.info()
# Summary statistics
housing.describe()
# Histogram of every numerical attribute
%matplotlib inline
import matplotlib.pyplot as plt

housing.hist(bins=50,figsize=(20,15))
plt.show()
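
As a small illustration of what these inspection calls report, here is a sketch on a five-row toy table. The values are made up, and the use of value_counts() on the text column is my own addition rather than something the article does.

```python
import pandas as pd

# Toy data (invented values) with one numerical and one categorical column.
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# describe() summarises numerical columns only: count, mean, std, min, quartiles, max.
stats = toy.describe()
print(stats)

# value_counts() is the counterpart for a categorical column.
counts = toy["ocean_proximity"].value_counts()
print(counts)
```

The text column does not appear in the describe() output, so it is easy to overlook unless you inspect it separately.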

3. Create a test set (typically 20% of the dataset; the larger the dataset, the smaller this proportion can be).

# to make this notebook's output identical at every run
import numpy as np
np.random.seed(42)
# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(housing, 0.2)
len(train_set)
len(test_set)
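
To see why the fixed seed matters, the sketch below repeats the same kind of shuffle-based split on a toy table. It swaps in a local np.random.default_rng generator instead of np.random.seed; that choice, the helper name split_with_seed, and the toy data are all assumptions made for the demo.

```python
import numpy as np
import pandas as pd

def split_with_seed(data, test_ratio, seed):
    # A local generator makes the shuffle reproducible without changing global state.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    # Return (train, test), in the same order as split_train_test.
    return data.iloc[shuffled[test_size:]], data.iloc[shuffled[:test_size]]

toy = pd.DataFrame({"value": np.arange(1000)})

train_a, test_a = split_with_seed(toy, 0.2, seed=42)
train_b, test_b = split_with_seed(toy, 0.2, seed=42)
train_c, test_c = split_with_seed(toy, 0.2, seed=7)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))  # same seed, same test rows
print(test_a.index.equals(test_c.index))  # different seed, different test rows
```

A fixed seed only helps while the dataset stays the same. Once rows are added, the shuffle changes, which is the problem the ID-based split is meant to address.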


from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
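# Alternative implementation that accepts any hash function (MD5 by default).
# Note: each definition below replaces the previous test_set_check, so keep only one.
import hashlib

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

# Same idea; wrapping the digest in bytearray makes indexing return an integer on Python 2 as well.
def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio

# Use the row index as the ID
housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

# Alternatively, build an ID from longitude and latitude
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

test_set.head()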
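
The point of splitting by a hashed ID is that a row's assignment depends only on its own identifier, so adding new rows later never moves an old row between the two sets. Here is a minimal sketch of that property on toy IDs. The helper name in_test_set is hypothetical, and it hashes the ID's raw bytes via tobytes(), which I assume matches what the crc32 version above sees.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def in_test_set(identifier, test_ratio):
    # Hash the 8 raw bytes of the ID and compare the result with a fixed threshold.
    return crc32(np.int64(identifier).tobytes()) & 0xffffffff < test_ratio * 2**32

old = pd.DataFrame({"index": np.arange(1000)})
new = pd.DataFrame({"index": np.arange(1200)})  # the same rows plus 200 new ones

old_mask = old["index"].apply(lambda i: in_test_set(i, 0.2))
new_mask = new["index"].apply(lambda i: in_test_set(i, 0.2))

# Every original row keeps its train/test assignment after the dataset grows.
unchanged = bool((old_mask.values == new_mask.values[:1000]).all())
print(unchanged)

# Share of rows that land in the test set (approximately the requested ratio).
print(old_mask.mean())
```

This only works if the ID itself is stable: with the row index as ID, new data must be appended at the end and no row may ever be deleted.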

4. Create the test set with Scikit-Learn, using random splitting and stratified sampling:

# Random split:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
# Check the result:
test_set.head()
housing["median_income"]
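
The stratified-sampling part of this step can be sketched as follows. This is only an illustration under stated assumptions: the incomes are synthetic, and the bin edges, the income_cat column name, and the use of the stratify argument of train_test_split are my own choices, not code from the article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic incomes (made up; the real column is housing["median_income"]).
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.uniform(0.5, 8.0, size=1000)})

# Bucket the continuous income into 5 categories; the bin edges are an assumption.
toy["income_cat"] = pd.cut(
    toy["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# stratify= keeps each category's share (nearly) the same in both subsets.
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["income_cat"]
)

# Compare category proportions in the full data and in the stratified test set.
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "stratified test": in_test}))
```

With a purely random split the proportions can drift, especially for small categories; stratifying on an attribute you believe is important for the prediction keeps the test set representative of it.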
