ipynb文件见:https://github.com/824024445/Machine-learning-notes/blob/master/一个完整的机器学习项目.ipynb
《Sklearn与TensorFlow机器学习实用指南》学习笔记
一、下载数据
import os
import tarfile # 用于压缩和解压文件
import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"
# 下载数据
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
if not os.path.isdir(housing_path):
os.makedirs(housing_path)
tgz_path = os.path.join(housing_path, "housing.tgz")
# urlretrieve()方法直接将远程数据下载到本地
urllib.request.urlretrieve(housing_url, tgz_path)
housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=housing_path) # 解压文件到指定路径,不指定就是解压到当前路径
housing_tgz.close()
fetch_housing_data()
二、加载数据
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH + "/"):
csv_path = os.path.join(housing_path, "housing.csv")
return pd.read_csv(csv_path)
housing = load_housing_data()
housing.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
三、查看数据结构
3.1 info()
info()方法可以快速查看数据的描述,特别是总行数、每个属性的类型和非空值的数量
housing.info()
# 分析:数据集中共有 20640 个实例,按照机器学习的标准这个数据量很小,但是非常适合入门。
# 我们注意到总卧室数只有 20433 个非空值,这意味着有 207 个街区缺少这个值。我们将在后面对它进行处理。
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
3.2 value_counts()
另外,由上面查看前几行数据可以看出,距离大海一栏是重复的,意味着可能是表示某一类别的属性。这种情况使用value_counts()方法查看类别
housing["ocean_proximity"].value_counts()
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
3.3 describe()
describe()方法展示了数值属性的概括
housing.describe()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
3.4图形描述
使用matplotlib的hist()将属性值画成柱状图,更直观
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(10,10))
plt.show() # 不是必要的
- 房屋年龄中位数和房屋价值中位数也被设了上限,因此图中末尾为一条直线。这种情况解决办法有两种
- 1是对于被设置了上线的数据重新收集
- 2是将这些数据从训练集中移除
- 有些柱状图尾巴很长,离中位数过远。这会使得检测规律变难,因此会后面后尝试变换属性使其变为正太分布。
四、创建测试集
**在这个阶段就要分割数据。**如果你查看了测试集,就会不经意地按照测试集中的规律来选择某个特定的机器学习模型。再当你使用测试集来评估误差率时,就会导致评估过于乐观,而实际部署的系统表现就会差。这称为数据透视偏差。
4.1 不完美的切分方法
下面的方法,再次运行程序,就会产生一个不同的测试集。
解决的办法之一是保存第一次运行得到的测试集,并在随后的过程加载。另一种方法是在调用np.random.permutation()
之前,设置随机数生成器的种子(比如np.random.seed(42)
),以产生总是相同的洗牌指数(shuffled indices)
但是仍旧不完美
import numpy as np
def split_train_test(data, test_ratio):
shuffled_indices = np.random.permutation(len(data)) # permutation中文排列,输入数字x,将x以内的数字随机打散
test_set_size = int(len(data)*test_ratio)
test_indices = shuffled_indices[:test_set_size]
train_indices = shuffled_indices[test_set_size:]
return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")
16512 train + 4128 test
4.2 通过实例的哈希值切分
import hashlib
def test_set_check(identifier, test_ratio, hash):
return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
ids = data[id_column]
in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
return data.loc[~in_test_set], data.loc[in_test_set]
housing_with_id = housing.reset_index() # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "inde