Hands-On Machine Learning, Chapter 2: End-to-End Machine Learning Project (Part 1)
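An end-to-end project goes through the following main steps:
- Look at the big picture
- Get the data
- Discover and visualize the data to gain insights
- Prepare the data for Machine Learning algorithms
- Select and train a model
- Fine-tune the model
- Present the solution
- Launch, monitor, and maintain the system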
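Selecting a Performance Measure
Equation 2-1: Root Mean Square Error (RMSE)
$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right)^{2}}$$
Equation 2-2: Mean Absolute Error (MAE)
$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(\mathbf{x}^{(i)}) - y^{(i)} \right|$$
Here m is the number of instances, x^(i) is the feature vector of the i-th instance, y^(i) is its label, and h is the prediction function.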
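To make the two measures concrete, here is a minimal NumPy sketch. The labels and predictions are made-up numbers, not values from the housing dataset.

```python
import numpy as np

# Made-up labels and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))   # root mean square error
mae = np.mean(np.abs(errors))          # mean absolute error

print(rmse, mae)
```

Because the errors are squared before averaging, RMSE gives more weight to large errors than MAE does, so it is never smaller than MAE on the same data.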
Taking a Quick Look at the Data
import os
import pandas as pd
HOUSING_PATH = os.path.join("datasets", "housing")
def load_housing_data(housing_path=HOUSING_PATH):
    # Read housing.csv from housing_path into a DataFrame
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
housing = load_housing_data()
housing.head()
head() shows the first five rows of the DataFrame. Pass an integer to choose how many rows to show, for example housing.head(10).
housing.info()
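info() gives a quick description of the dataset, in particular the total number of rows, each attribute's type, and its number of non-null values.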
housing["ocean_proximity"].value_counts()
Out[7]:<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
value_counts() shows which categories exist and how many districts belong to each category.
housing.describe()
describe() shows a summary of the numerical attributes: count, mean, standard deviation, minimum, quartiles, and maximum.
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
save_fig("attribute_histogram_plots")  # save_fig() is a helper defined in the book's notebook, not shown in these notes
plt.show()
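hist() plots a histogram for each numerical attribute. The bins parameter took me a while to understand.
bins: each bar is one bin. If bins is an integer, it sets the number of equal-width bins over the range of the data (the default is 10). If bins is a sequence, it gives the bin edges, from the left edge of the first bin to the right edge of the last bin. Every bin is half-open (left edge included, right edge excluded) except the last one, which is closed on both sides. For example, with bins = [1, 2, 3, 4] the first bin is [1, 2), the second is [2, 3), and the last is [3, 4]. A sequence also allows bins of unequal width.
figsize: the size of the whole figure, given as (width, height) in inches.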
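The bin-edge rule can be checked directly with numpy.histogram, which matplotlib's histogram plotting relies on. This is a small sketch with made-up values.

```python
import numpy as np

values = [1, 2, 2, 3, 4]   # made-up data

# Explicit bin edges: [1, 2), [2, 3), [3, 4]
counts, edges = np.histogram(values, bins=[1, 2, 3, 4])
print(counts)   # the value 4 is counted in the last bin, which is closed on the right

# An integer asks for that many equal-width bins over the data range
counts10, edges10 = np.histogram(values, bins=10)
print(len(counts10), len(edges10))
```

Note that n bins always come with n + 1 edges.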
Creating a Test Set
Purely Random Split
import numpy as np
# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")
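numpy.random.permutation()
numpy.random.permutation(x)
Parameters: x: an integer or an array. If x is an integer, a random permutation of np.arange(x) is returned. If x is an array, a copy is made and its elements are shuffled (a multi-dimensional array is shuffled only along its first axis).
Returns: the permuted sequence or array.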
np.random.permutation(10)
Output:
array([1, 7, 4, 3, 0, 9, 2, 5, 8, 6])
np.random.permutation([1, 4, 9, 12, 15])
Output:
array([15, 1, 9, 4, 12])
arr = np.arange(9).reshape((3, 3))
np.random.permutation(arr)
Output:
array([[6, 7, 8],
       [0, 1, 2],
       [3, 4, 5]])
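Seeding the random number generator: random.seed()
seed() sets the seed of the random number generator. Call it before calling any other function of the random module.
Each call to random.random() normally returns a different random number. If we first call random.seed(x), where x can be any number such as 10, the generator restarts from the same state, so the numbers that random() produces afterwards are the same on every run.
NumPy keeps its own generator, so for np.random.permutation() the call to make is np.random.seed(x). Seeding it before permutation() guarantees that the same permutation is returned every time.
With a fixed seed, the split into training set and test set is therefore the same on every run.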
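A minimal sketch of that behaviour with NumPy's global generator; the seed value 42 is arbitrary.

```python
import numpy as np

np.random.seed(42)
first = np.random.permutation(10)

np.random.seed(42)                    # reset to the same seed
second = np.random.permutation(10)    # identical to `first`

third = np.random.permutation(10)     # no reseed: the generator simply continues

print(first)
print(second)
print(third)
```

The seed has to be set again before every call that should reproduce the same result; it is not enough to set it once at the top if other random calls happen in between.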
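hashlib
We can compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if that value is lower than or equal to 51 (about 20% of 256).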
import hashlib
def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]
housing_with_id = housing.reset_index() # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
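The point of hashing the identifier is that an instance keeps its assignment when new data is appended. The sketch below checks this on made-up integer ids; the helper name in_test_set and the id ranges are assumptions for illustration, and the id is converted to bytes explicitly before hashing.

```python
import hashlib

import numpy as np
import pandas as pd

def in_test_set(identifier, test_ratio=0.2):
    # The last byte of the MD5 digest is an integer between 0 and 255
    digest = hashlib.md5(np.int64(identifier).tobytes()).digest()
    return digest[-1] < 256 * test_ratio

old_ids = pd.Series(range(1000))   # ids of the original dataset
new_ids = pd.Series(range(1200))   # same ids plus 200 new instances

old_test = set(old_ids[old_ids.apply(in_test_set)])
new_test = set(new_ids[new_ids.apply(in_test_set)])

print(old_test <= new_test)             # earlier test instances stay in the test set
print(len(old_test) / len(old_ids))     # roughly 0.2
```

A purely random split with a fixed seed does not have this property: once rows are added, the permutation changes and instances can move between the two sets.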
The function provided by Scikit-Learn
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
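Stratified Sampling
# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Label those above 5 as 5 (assign the result back instead of using inplace=True on a column)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
np.ceil() rounds each value up to the nearest integer.
Series.where(cond, other): the first argument is the condition. Where the condition is False, the value is replaced by other; where it is True, the original value is kept.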
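The two steps can be traced on a handful of made-up income values (these are not rows from the housing data):

```python
import numpy as np
import pandas as pd

median_income = pd.Series([0.5, 1.6, 4.2, 9.0, 15.0])   # made-up values

income_cat = np.ceil(median_income / 1.5)        # round each ratio up
print(income_cat.tolist())

capped = income_cat.where(income_cat < 5, 5.0)   # keep values below 5, replace the rest by 5.0
print(capped.tolist())
```

Everything from category 5 upwards is merged into a single category, so no stratum ends up with too few instances.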
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
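The main parameters of StratifiedShuffleSplit() are:
n_splits is the number of train/test pairs to generate; set it as needed, the default is 10.
test_size and train_size set the proportions of the test part and the training part in each train/test pair. For example:
1. Take num = 10 samples to split into a training set and a test set.
2. Set train_size=0.8 and test_size=0.2.
3. Then train_num = num * train_size = 8 and test_num = num * test_size = 2.
4. So of the 10 samples, 8 go to the training set and 2 to the test set.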
>>> import numpy as np
>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
>>> sss.get_n_splits(X, y)
5
>>> print(sss)
StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
>>> for train_index, test_index in sss.split(X, y):
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]
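The second argument of split.split() is the column that defines the strata.
DataFrame.loc selects rows by index label. split() returns positional indices, so using loc works here only because housing still has its default integer index, where labels and positions coincide.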
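A quick way to see what stratification buys is to compare the share of each stratum in the full data and in the test set. The sketch below uses a made-up DataFrame with two strata, and iloc because the indices returned by split() are positional.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 80 rows in stratum "a", 20 rows in stratum "b"
df = pd.DataFrame({"value": np.arange(100),
                   "stratum": ["a"] * 80 + ["b"] * 20})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(splitter.split(df, df["stratum"]))
train, test = df.iloc[train_index], df.iloc[test_index]

overall_share_b = (df["stratum"] == "b").mean()
test_share_b = (test["stratum"] == "b").mean()
print(overall_share_b, test_share_b)   # the two shares should be close
```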
Now the income_cat attribute can be removed:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
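Discovering and Visualizing the Data to Gain Insights
The corr() method computes the standard correlation coefficient (Pearson's r) between every pair of attributes.
corr_matrix = housing.corr(numeric_only=True)  # numeric_only (pandas >= 1.5) skips the text column ocean_proximity
corr_matrix["median_house_value"].sort_values(ascending=False)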
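To get a feel for how the sorted correlation column reads, here is a sketch on synthetic data. The column names and the way the data is generated are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)

toy = pd.DataFrame({
    "target": 3 * x + rng.normal(scale=0.5, size=500),   # strongly tied to x
    "linear_feature": x,
    "noise_feature": rng.normal(size=500),                # unrelated to the target
})

corr_matrix = toy.corr()
print(corr_matrix["target"].sort_values(ascending=False))
```

Values close to 1 or -1 indicate a strong linear relationship and values near 0 indicate none. The coefficient only captures linear relationships, so a value near 0 does not rule out a nonlinear dependency.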
scatter_matrix() plots every numerical attribute against every other one, which shows how the attributes depend on each other.
# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")  # save_fig() is a helper defined in the book's notebook, not shown in these notes
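On the main diagonal each attribute would be plotted against itself, which would only give a straight line, so by default a histogram of each attribute is shown there instead.
The plots show a very strong correlation between the median income and the median house value.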
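Preparing the Data for Machine Learning Algorithms
Data Cleaning
We will not necessarily apply the same transformations to the predictors and to the labels, so we start again from a clean copy of strat_train_set and separate the predictors from the labels.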
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()
drop()
Returns a copy of the data without the dropped column; strat_train_set itself is not modified.
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows
sample_incomplete_rows.dropna(subset=["total_bedrooms"]) # option 1
DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
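Parameters: axis: with 0, rows that contain missing values are dropped; with 1, columns (attributes) that contain missing values are dropped.
subset: labels along the other axis to consider. For example, when dropping rows, subset is the list of columns to check for missing values.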
sample_incomplete_rows.drop("total_bedrooms", axis=1) # option 2
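Option 2 simply drops the total_bedrooms attribute altogether.
median = housing["total_bedrooms"].median()
# option 3: fill the missing values with the median (reassign instead of fillna(inplace=True) on a column)
sample_incomplete_rows = sample_incomplete_rows.fillna({"total_bedrooms": median})
sample_incomplete_rows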
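The three options are easier to compare side by side on a tiny made-up DataFrame (the values below are invented, only the column names follow the housing data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"total_rooms": [10.0, 20.0, 30.0, 40.0],
                    "total_bedrooms": [2.0, np.nan, 6.0, 10.0]})

option1 = toy.dropna(subset=["total_bedrooms"])     # drop the incomplete rows
option2 = toy.drop("total_bedrooms", axis=1)        # drop the whole attribute
median = toy["total_bedrooms"].median()             # median of the non-missing values
option3 = toy.fillna({"total_bedrooms": median})    # fill the gap with the median

print(option1.shape, option2.shape)
print(option3["total_bedrooms"].tolist())
```

With option 3 the median has to be kept, because the same value is needed later to fill missing values in the test set and in new data.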
Scikit-Learn provides a handy class to take care of missing values, SimpleImputer:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
Since the median can only be computed on numerical attributes, we first drop the text attribute ocean_proximity.
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)
# fit() makes the imputer learn from the dataset
imputer.statistics_
# the imputer has computed the median of each attribute and stored it in statistics_
out:
array([-118.51 , 34.26 , 29. , 2119.5 , 433. , 1164. ,
408. , 3.5409])
Use the fitted imputer to replace the missing values with the learned medians:
X = imputer.transform(housing_num)  # returns a NumPy array
# Put the NumPy array back into a pandas DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)
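The fit / statistics_ / transform sequence can be followed on a small made-up DataFrame where the medians are easy to verify by hand:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 100.0],
                    "b": [10.0, 20.0, np.nan, 40.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy)
print(imputer.statistics_)    # one median per column, learned by fit()

# transform() returns a NumPy array, so wrap it to get the labels back
filled = pd.DataFrame(imputer.transform(toy),
                      columns=toy.columns, index=toy.index)
print(filled)
```

The median is a reasonable choice here because it is not pulled towards the outlier 100.0 the way the mean would be.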
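Handling Text and Categorical Attributes
Integer encoding (pandas factorize(), used here in place of LabelEncoder)
housing_cat = housing["ocean_proximity"]
housing_cat_encoded, housing_categories = housing_cat.factorize()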
housing_cat_encoded[:10]
out:
array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)
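factorize() maps each category to its position in housing_categories: '<1H OCEAN' becomes 0, 'NEAR OCEAN' becomes 1, 'INLAND' becomes 2, and so on.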
housing_categories
out:
Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')
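A small sketch of how the two return values of factorize() fit together, using a made-up Series of category labels:

```python
import pandas as pd

cats = pd.Series(["INLAND", "NEAR BAY", "INLAND", "ISLAND", "NEAR BAY"])
codes, categories = cats.factorize()

print(codes)        # one integer code per row, numbered in order of first appearance
print(categories)   # categories[i] is the label that code i stands for

decoded = categories[codes]   # map the codes back to the original labels
print(list(decoded))
```

The integer codes carry no meaningful order or distance, which is why a one-hot encoding is usually preferred as model input.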
OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# The OneHotEncoder returns a sparse array by default, but we can convert it to a dense array if needed
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot
housing_cat_1hot.toarray()
out:
array([[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
...,
[0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.]])
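In Scikit-Learn 0.20 and later, OneHotEncoder also accepts string categories directly, so the integer-encoding step can be skipped. The sketch below assumes such a version and uses a made-up column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

encoder = OneHotEncoder()                               # sparse output by default
one_hot = encoder.fit_transform(toy[["ocean_proximity"]])   # input must be 2-D

print(encoder.categories_)    # sorted categories, one per output column
print(one_hot.toarray())      # dense view of the sparse matrix
```

Each row contains exactly one 1, in the column of its category. The sparse format only stores the positions of those ones, which matters when there are thousands of categories.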
Custom Transformers
from sklearn.base import BaseEstimator, TransformerMixin
# column indices of the attributes in the NumPy array
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
This class adds new combined features to the original dataset.
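The same pattern in its smallest form: the sketch below defines a hypothetical RatioAdder transformer (the name and its two parameters are made up for this example) to show what the two base classes contribute.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):   # hypothetical example class
    def __init__(self, numerator_ix=0, denominator_ix=1):
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix

    def fit(self, X, y=None):
        return self               # nothing to learn

    def transform(self, X):
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]    # append the ratio as a new last column

X = np.array([[6.0, 2.0], [9.0, 3.0], [10.0, 4.0]])
X_extra = RatioAdder().fit_transform(X)   # fit_transform() comes from TransformerMixin
params = RatioAdder().get_params()        # get_params() comes from BaseEstimator
print(X_extra)
print(params)
```

BaseEstimator can only provide get_params() and set_params() when the constructor lists its hyperparameters explicitly, which is why *args and **kwargs are avoided in __init__.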
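Feature Scaling
The two most common approaches are min-max scaling and standardization.
Scikit-Learn provides a transformer called MinMaxScaler for the former and StandardScaler for the latter.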
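A short sketch comparing the two scalers on one made-up feature that contains an outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # made-up feature with one outlier

min_max = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min), values in 0..1
standard = StandardScaler().fit_transform(X)   # (x - mean) / std, zero mean and unit variance

print(min_max.ravel())
print(standard.ravel())
```

With min-max scaling the outlier squeezes all other values into a narrow band near 0. Standardization has no fixed output range but is less affected by outliers.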
Transformation Pipelines
Scikit-Learn provides the Pipeline class to chain transformation steps.
# A pipeline for the numerical attributes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
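The Pipeline constructor takes a list of name/estimator pairs that define the sequence of steps. All steps except the last must be transformers (that is, they must have a fit_transform() method); the last step can be any estimator.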
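A stripped-down pipeline on a made-up array shows how the steps are chained and how a fitted step can be inspected afterwards. The step names are arbitrary labels.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform() calls fit_transform() on each step in turn,
# feeding the output of one step to the next
X_prepared = toy_pipeline.fit_transform(X)

print(toy_pipeline.named_steps["imputer"].statistics_)   # medians learned by the first step
print(X_prepared)
```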
from sklearn.pipeline import FeatureUnion
# DataFrameSelector is the custom transformer defined further below; run that definition first
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    # OneHotEncoder replaces CategoricalEncoder(encoding="onehot-dense"), which is not part of released Scikit-Learn.
    # sparse_output=False needs scikit-learn >= 1.2; older versions use sparse=False.
    ('cat_encoder', OneHotEncoder(sparse_output=False)),
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
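FeatureUnion: when its transform() method is called, it runs the transform() method of each transformer in parallel, waits for their outputs, then concatenates them and returns the result.
Each sub-pipeline starts with a selector: it picks the desired attributes, drops the rest, and converts the resulting DataFrame to a NumPy array, which completes that step. Since Scikit-Learn provides nothing here to handle a pandas DataFrame directly, we need a simple custom transformer.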
from sklearn.base import BaseEstimator, TransformerMixin
# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
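As an alternative to the selector plus FeatureUnion approach used in these notes, Scikit-Learn 0.20 and later ship ColumnTransformer, which applies different transformers to different DataFrame columns. The sketch below is not the method from the notes; it assumes such a version and uses a made-up two-column DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, np.nan, 40.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry: (name, transformer, list of columns it applies to)
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["total_rooms"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

prepared = full_pipeline.fit_transform(toy)
print(prepared.shape)   # 1 scaled numerical column + 3 one-hot columns
```

Columns are selected by name, so no custom DataFrameSelector is needed in this variant.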