A Complete Machine Learning Project

明镜应缺

Article link: https://blog.csdn.net/weixin_42662126/article/details/98804018

The ipynb file is available at: https://github.com/824024445/Machine-learning-notes/blob/master/一个完整的机器学习项目.ipynb
Study notes on Hands-On Machine Learning with Scikit-Learn and TensorFlow (《Sklearn与TensorFlow机器学习实用指南》)

1. Download the Data

import os
import tarfile  # for creating and extracting tar archives
import urllib.request  # the request submodule must be imported explicitly

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"


# Download and extract the data
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # urlretrieve() downloads the remote file straight to a local path
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        # Extract into housing_path; without `path` it extracts into the current directory
        housing_tgz.extractall(path=housing_path)


fetch_housing_data()

2. Load the Data

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

3. Inspect the Data Structure

3.1 info()

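The info() method gives a quick description of the data, in particular the total number of rows, the type of each attribute, and its number of non-null values.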

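housing.info()
# Analysis: the dataset has 20,640 instances, which is small by machine learning standards but well suited to getting started.
# Note that total_bedrooms has only 20,433 non-null values, so 207 districts are missing this value. We will deal with it later.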

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
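
The gap that info() reveals can also be counted directly. The sketch below uses a small made-up frame (the NaN positions are illustrative and not taken from the dataset) to show isna().sum(), which reports the number of missing values per column. On the real data the same call should report 207 for total_bedrooms (20640 - 20433).

```python
import numpy as np
import pandas as pd

# Toy frame: the values are illustrative, not real rows from housing.csv
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Arithmetic behind the figure quoted from info()
missing_bedrooms = 20640 - 20433
print(missing_bedrooms)
```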

3.2 value_counts()

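The first few rows shown earlier also suggest that the values in the ocean_proximity column repeat, so it is probably a categorical attribute. In that case, use value_counts() to see which categories exist and how many districts belong to each.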

housing["ocean_proximity"].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
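
To see how unbalanced the categories are, the counts can be converted to proportions. This is a minimal sketch that rebuilds the counts printed above as a Series (the variable names are chosen here for illustration). On the real DataFrame, value_counts(normalize=True) is the direct way to get the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "<1H OCEAN": 9136,
    "INLAND": 6551,
    "NEAR OCEAN": 2658,
    "NEAR BAY": 2290,
    "ISLAND": 5,
})

# Share of districts in each category
shares = counts / counts.sum()
print(shares.round(4))
```

The ISLAND category covers only 5 of the 20,640 districts, which is worth keeping in mind when splitting the data.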

3.3 describe()

The describe() method shows a summary of the numerical attributes.

housing.describe()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	20640.000000	20640.000000	20640.000000	20640.000000	20433.000000	20640.000000	20640.000000	20640.000000	20640.000000
mean	-119.569704	35.631861	28.639486	2635.763081	537.870553	1425.476744	499.539680	3.870671	206855.816909
std	2.003532	2.135952	12.585558	2181.615252	421.385070	1132.462122	382.329753	1.899822	115395.615874
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000
25%	-121.800000	33.930000	18.000000	1447.750000	296.000000	787.000000	280.000000	2.563400	119600.000000
50%	-118.490000	34.260000	29.000000	2127.000000	435.000000	1166.000000	409.000000	3.534800	179700.000000
75%	-118.010000	37.710000	37.000000	3148.000000	647.000000	1725.000000	605.000000	4.743250	264725.000000
max	-114.310000	41.950000	52.000000	39320.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000

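3.4 Plotting the Data

Calling hist() on the DataFrame (which draws with matplotlib) plots each numerical attribute as a histogram, which is easier to read than the summary table.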

%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(10,10))
plt.show()  # optional in a Jupyter notebook

[Figure: histograms of each numerical attribute in the housing dataset]

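The housing median age and the median house value are capped, which is why their histograms end in a tall bar at the maximum value. There are two ways to deal with this:
- Collect the correct values again for the districts whose values were capped.
- Remove those districts from the training set.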
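
The second option can be sketched as a boolean filter. The example below runs on a toy frame and assumes the caps equal the maxima reported by describe() (52 for housing_median_age and 500001 for median_house_value). That is an assumption made here, so check it against the data before relying on it.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 30.0],
    "median_house_value": [452600.0, 358500.0, 500001.0, 341300.0, 500001.0],
})

# Assumed caps, taken from the max row of describe()
AGE_CAP = 52.0
VALUE_CAP = 500001.0

# A row counts as capped if either attribute sits at its cap
capped = (toy["housing_median_age"] >= AGE_CAP) | (toy["median_house_value"] >= VALUE_CAP)
uncapped = toy.loc[~capped]
print(int(capped.sum()), "capped rows,", len(uncapped), "rows kept")
```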
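Some histograms are tail-heavy: they extend much farther from the median on one side than on the other. This makes patterns harder to detect, so later we will try transforming these attributes to bring them closer to a normal distribution.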
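
One common transformation for a long right tail is the logarithm. The text does not say which transformation is used later, so treat the following as an illustrative sketch on synthetic data (a log-normal sample standing in for an attribute such as population), not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Synthetic tail-heavy attribute; the distribution parameters are arbitrary
rng = np.random.RandomState(0)
population = pd.Series(rng.lognormal(mean=7.0, sigma=0.7, size=5000))

# log1p computes log(1 + x), which stays defined even when x is 0
log_population = np.log1p(population)

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = population.skew()
skew_after = log_population.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

After the transform the skewness should be close to zero, which is what a bell-shaped histogram looks like in numbers.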

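4. Create a Test Set

**Split the data at this stage.** If you look at the test set, you may unintentionally pick a particular machine learning model based on patterns you notice in it. When you then use the test set to estimate the error, the estimate will be too optimistic and the deployed system will perform worse than expected. This is called data snooping bias.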

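4.1 An Imperfect Splitting Method

With the method below, running the program again produces a different test set.
One fix is to save the test set from the first run and load it in later runs. Another is to set the random number generator's seed (for example np.random.seed(42)) before calling np.random.permutation(), so that it always produces the same shuffled indices.
Both fixes are still imperfect, because they break the next time the dataset is updated.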

import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))  # permutation(n) returns the integers 0..n-1 in random order
    test_set_size = int(len(data)*test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
  
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")

16512 train + 4128 test
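
The limits of the seeded approach can be checked on plain indices. In this sketch the helper split_indices and the row counts are hypothetical. It shows that a fixed seed reproduces the split while the data is unchanged, but reshuffles everything once rows are appended, so instances that used to be in the test set end up in the training set.

```python
import numpy as np

def split_indices(n, test_ratio, seed):
    """Return (test, train) index sets from a seeded permutation of range(n)."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n)
    test_size = int(n * test_ratio)
    return set(shuffled[:test_size].tolist()), set(shuffled[test_size:].tolist())

# Same data size, same seed: identical test sets
test_a, _ = split_indices(1000, 0.2, seed=42)
test_b, _ = split_indices(1000, 0.2, seed=42)
same_split = test_a == test_b

# 100 rows appended, same seed: the permutation is different
test_c, train_c = split_indices(1100, 0.2, seed=42)
leaked = test_a & train_c  # old test instances now in the training set
print(same_split, len(leaked))
```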

4.2 Splitting by a Hash of Each Instance's Identifier

import hashlib

def test_set_check(identifier, test_ratio, hash):
    # An instance goes to the test set when the last byte of its ID's hash
    # (a value in 0..255) is below 256 * test_ratio.
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio


def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index() # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "inde
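
The point of hashing the identifier is that each instance's assignment depends only on its own ID, not on how many rows exist. The following self-contained sketch illustrates this on a toy frame. The helper names and the row counts are made up, and it hashes the ID's bytes explicitly with tobytes(), which is a small variation on the functions above.

```python
import hashlib

import numpy as np
import pandas as pd

def in_test_set(identifier, test_ratio):
    # Last byte of the MD5 digest is an integer in 0..255
    digest = hashlib.md5(np.int64(identifier).tobytes()).digest()
    return digest[-1] < 256 * test_ratio

def split_by_id(data, test_ratio, id_column):
    mask = data[id_column].apply(lambda id_: in_test_set(id_, test_ratio))
    return data.loc[~mask], data.loc[mask]

# Toy data: the "new" frame is the old one plus 200 appended rows
old = pd.DataFrame({"index": range(1000), "value": np.arange(1000) * 2.0})
new = pd.DataFrame({"index": range(1200), "value": np.arange(1200) * 2.0})

_, old_test = split_by_id(old, 0.2, "index")
new_train, new_test = split_by_id(new, 0.2, "index")

# Every instance that was in the old test set is still in the new one
still_in_test = set(old_test["index"]) <= set(new_test["index"])
print(still_in_test, len(old_test), len(new_test))
```

This only holds if identifiers are stable: using the row index as the ID requires that new data is appended at the end and that no row is ever deleted.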
