Hands-On Machine Learning, Chapter 2: End-to-End Machine Learning Project (Part 1)

zuckzhao95

Category column: Hands-On Machine Learning with Scikit-Learn and TensorFlow. Tags: machine learning

Article link: https://blog.csdn.net/qq_35059338/article/details/107248247

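An end-to-end project goes through the following main steps:

1. Look at the big picture
2. Get the data
3. Gain insights from data exploration and visualization
4. Prepare the data for machine learning algorithms
5. Select and train a model
6. Fine-tune the model
7. Present the solution
8. Launch, monitor, and maintain the system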

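Selecting a performance measure

Equation 2-1: Root Mean Square Error (RMSE)

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}}$$

Equation 2-2: Mean Absolute Error (MAE)

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$$

Here m is the number of instances, x^(i) is the feature vector of the i-th instance, y^(i) is its label, and h is the prediction function.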
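To make the two measures concrete, here is a minimal NumPy sketch. The helper names rmse and mae and the sample numbers are my own, chosen only for illustration. Because RMSE squares the errors, the single large error (40) pulls it above the MAE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared difference."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))

# Made-up labels and predictions; the errors are 10, -10, 0 and 40
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 300.0, 440.0]

rmse_value = rmse(y_true, y_pred)  # sqrt((100 + 100 + 0 + 1600) / 4) = sqrt(450)
mae_value = mae(y_true, y_pred)    # (10 + 10 + 0 + 40) / 4 = 15
print(rmse_value, mae_value)
```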

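Taking a quick look at the data

import os
import pandas as pd

# Folder that holds housing.csv, relative to the working directory
HOUSING_PATH = os.path.join("datasets", "housing")

def load_housing_data(housing_path=HOUSING_PATH):
    # Read housing.csv into a pandas DataFrame
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()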

[Figure omitted: output of housing.head()]

head() returns the first five rows of the DataFrame by default. Pass an integer to choose how many rows to show, for example housing.head(10).

housing.info()

[Figure omitted: output of housing.info()]

The info() method gives a quick description of the dataset, in particular the total number of rows, the type of each attribute, and the number of non-null values.

housing["ocean_proximity"].value_counts()

Out[7]:<1H OCEAN     9136
        INLAND        6551
        NEAR OCEAN    2658
        NEAR BAY      2290
        ISLAND           5
        Name: ocean_proximity, dtype: int64

The value_counts() method shows which categories exist and how many districts fall into each of them.

housing.describe()

[Figure omitted: output of housing.describe()]

describe() shows a summary of the numerical attributes: count, mean, standard deviation, minimum, the 25th, 50th and 75th percentiles, and maximum.

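# Jupyter-only magic: render figures inside the notebook
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
# save_fig() is a helper from the book's notebook and is not defined in this post
# save_fig("attribute_histogram_plots")
plt.show()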

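[Figure omitted: histogram of each numerical attribute]

The hist() method plots a histogram for each numerical attribute. The bins parameter took me a while to understand.

bins: each bar corresponds to one bin. If bins is an integer, it sets the number of equal-width bins over the range of the data (the default is 10). If bins is a sequence, it gives the bin edges, including the left edge of the first bin and the right edge of the last bin. Every bin is half-open (closed on the left, open on the right) except the last one, which is closed on both sides. With bins = [1, 2, 3, 4], the first bin is [1, 2), the second is [2, 3), and the last is [3, 4]. Passing a sequence also allows bins of unequal width.

figsize: the size of the whole figure, given as (width, height) in inches.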
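The bin-edge rule is easier to see with numbers. The sketch below uses np.histogram, which follows the same binning convention as the histogram plots; the sample values are made up. Note how the value 4 lands in the last bin because that bin is closed on the right.

```python
import numpy as np

values = [1, 1, 2, 2, 3, 4]

# Explicit edges: bins are [1, 2), [2, 3) and [3, 4]
counts_seq, edges_seq = np.histogram(values, bins=[1, 2, 3, 4])

# Integer: 3 equal-width bins between min(values) and max(values)
counts_int, edges_int = np.histogram(values, bins=3)

print(counts_seq, edges_seq)
print(counts_int, edges_int)
```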

Creating a test set

Purely random split

import numpy as np

# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")

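numpy.random.permutation()

numpy.random.permutation(x)

Parameters: x: an integer or an array. If x is an integer, randomly permute np.arange(x). If x is an array, make a copy and shuffle its elements (a multi-dimensional array is shuffled along its first axis only).

Returns: the permuted sequence or array.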

np.random.permutation(10)
Output:
array([1, 7, 4, 3, 0, 9, 2, 5, 8, 6])

np.random.permutation([1, 4, 9, 12, 15])
Output:
array([15,  1,  9,  4, 12])

arr = np.arange(9).reshape((3, 3))
np.random.permutation(arr)
Output:
array([[6, 7, 8],
       [0, 1, 2],
       [3, 4, 5]])

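Seeding the random number generator: random.seed() and np.random.seed()

seed() sets the seed of the random number generator. Call it before calling the other functions of the same random module.

Each call to random.random() normally returns a different random number. But if random.seed(x) is called first, where x can be any number such as 10, the sequence of numbers that random() then produces is the same on every run.

Note that Python's random.seed() only affects the standard random module. np.random.permutation() draws from NumPy's own generator, so call np.random.seed() before permutation to get the same shuffled sequence on every run.
This makes the split into training set and test set reproducible.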
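A quick check of that claim, as a minimal sketch (the seed value 42 is arbitrary): re-seeding NumPy's generator with the same value makes permutation return the same order again.

```python
import numpy as np

np.random.seed(42)                 # seed NumPy's global generator
first = np.random.permutation(10)

np.random.seed(42)                 # same seed again
second = np.random.permutation(10)

# Both calls produce the same shuffled order of 0..9
print(first)
print(second)
print(np.array_equal(first, second))
```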

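hashlib

Compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if that byte is less than or equal to 51 (about 20% of 256). Because the decision depends only on the identifier, an instance stays on the same side of the split across runs, even after new data is added.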

import hashlib

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
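The point of the hash-based split is stability when the dataset grows. The sketch below is my own toy check, not code from the book: the helper name in_test_set and the two made-up DataFrames (1000 rows, then 1200 rows) are assumptions for illustration. It hashes the raw bytes of each id with tobytes() and confirms that the first 1000 rows get the same assignment in both versions.

```python
import hashlib

import numpy as np
import pandas as pd

def in_test_set(identifier, test_ratio=0.2):
    """True if the last byte of the id's MD5 hash falls in the lowest test_ratio share."""
    last_byte = hashlib.md5(np.int64(identifier).tobytes()).digest()[-1]
    return last_byte < 256 * test_ratio

# Hypothetical "old" dataset and a later version with 200 extra rows appended
old_data = pd.DataFrame({"index": range(1000)})
new_data = pd.DataFrame({"index": range(1200)})

old_mask = old_data["index"].apply(in_test_set)
new_mask = new_data["index"].apply(in_test_set)

# Rows present in both versions land on the same side of the split
stable = bool((old_mask.values == new_mask.values[:1000]).all())
print(stable, old_mask.mean())
```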

The function provided by Scikit-Learn

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
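train_test_split also accepts several arrays at once and splits them on the same indices, which is handy when features and labels are stored separately. A minimal sketch on made-up arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, and each label equals half of the first feature
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 8 training rows and 2 test rows, with features and labels still aligned
print(len(X_train), len(X_test))
print(np.array_equal(X_train[:, 0], 2 * y_train))
```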

Stratified sampling split

# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Label those above 5 as 5 (assign the result back instead of a chained inplace=True)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

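np.ceil() rounds each value up to the nearest integer.

Series.where(cond, other) takes the condition as its first argument: where the condition is True the original value is kept, and where it is False the value is replaced by other.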
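A small worked example of these two calls, using made-up income values rather than the real column:

```python
import numpy as np
import pandas as pd

# Made-up median incomes
median_income = pd.Series([0.5, 1.5, 3.2, 7.6, 15.0])

# Divide by 1.5 and round up: 0.33 -> 1, 1.0 -> 1, 2.13 -> 3, 5.07 -> 6, 10.0 -> 10
income_cat = np.ceil(median_income / 1.5)

# Keep values below 5, replace everything else with 5.0
income_cat = income_cat.where(income_cat < 5, 5.0)

print(income_cat.tolist())
```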

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

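The parameters of StratifiedShuffleSplit() deserve a closer look.
n_splits is the number of train/test pairs to generate. Set it as needed; the default is 10.

test_size and train_size set the share of each train/test pair that goes to the test set and to the training set. For example:
1. Suppose num = 10 samples are to be split into a training set and a test set.
2. Set train_size=0.8 and test_size=0.2.
3. Then train_num = num * train_size = 8 and test_num = num * test_size = 2.
4. That is, after the split, 8 of the 10 samples are training data and 2 are test data.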

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
>>> sss.get_n_splits(X, y)
5
>>> print(sss)
StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
>>> for train_index, test_index in sss.split(X, y):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]

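The second argument of split.split() is the column that the stratification is based on.

housing.loc[...] (pandas DataFrame.loc) selects rows by index label. The positional indices returned by split() match the labels here because housing still has its default integer index.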
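To see what stratification buys, one can compare the category proportions in the test set with those in the full dataset. The sketch below uses a synthetic DataFrame (the columns and the category probabilities are invented), and it indexes with iloc because split() returns positions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data: 1000 rows with an unevenly distributed category column
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "value": rng.rand(1000),
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1000,
                             p=[0.05, 0.3, 0.35, 0.2, 0.1]),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_index]

# Proportion of each category overall and in the stratified test set
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test["income_cat"].value_counts(normalize=True).sort_index()
max_gap = (overall - in_test).abs().max()

print(pd.DataFrame({"overall": overall, "stratified test": in_test}))
print("largest gap:", max_gap)
```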

Now the income_cat attribute can be removed:

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

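Gaining insights from data exploration and visualization

Using corr() to compute the standard correlation coefficient between every pair of attributes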

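# numeric_only=True skips the text column ocean_proximity (needed on pandas >= 2.0, available since pandas 1.5)
corr_matrix = housing.corr(numeric_only=True)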
corr_matrix["median_house_value"].sort_values(ascending=False)

[Figure omitted: correlation of each attribute with median_house_value, in descending order]

Using scatter_matrix() to show how the attributes depend on each other

# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
# save_fig() is a helper from the book's notebook and is not defined in this post
# save_fig("scatter_matrix_plot")

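[Figure omitted: scatter matrix of the four selected attributes]

On the main diagonal, each attribute would be plotted against itself, which would only give a straight line. So by default a histogram of the attribute is shown there instead.

The plots show that median income and median house value are strongly correlated.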
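The same ranking pattern can be reproduced on synthetic data. In this sketch the column names mirror the housing dataset, but the values are generated: the target is a noisy linear function of median_income, and noise is an unrelated column.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
income = rng.rand(500)

toy = pd.DataFrame({
    "median_income": income,
    # target depends linearly on income, plus some Gaussian noise
    "median_house_value": 3.0 * income + rng.normal(scale=0.2, size=500),
    # unrelated column
    "noise": rng.rand(500),
})

corr_matrix = toy.corr()
ranking = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranking)
```

The target itself comes first with a coefficient of 1, followed by median_income with a coefficient close to 1, while the unrelated column stays near 0. Keep in mind that this coefficient only measures linear relationships.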

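Preparing the data for machine learning algorithms

Data cleaning

The predictors and the labels will not necessarily get the same transformations, so start again from a clean copy of the training set strat_train_set and separate the predictors from the labels.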

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

drop() returns a copy of the data and leaves strat_train_set unchanged.

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows

[Figure omitted: sample rows with a missing total_bedrooms value]

sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1

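Option 1 drops the rows (districts) whose total_bedrooms value is missing.

DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameters: axis: if 0, drop the rows that contain missing values; if 1, drop the columns (attributes) that contain missing values.

subset: labels along the other axis to consider. For example, when dropping rows, this is the list of columns to check for missing values.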

sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2

Option 2 simply drops the total_bedrooms attribute altogether.

median = housing["total_bedrooms"].median()
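# option 3: fill the missing values with the median (assign back instead of a chained inplace=True)
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)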
sample_incomplete_rows
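The three options side by side, on a four-row toy DataFrame with made-up numbers and one missing value:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 6.0, 10.0],
})

# Option 1: drop the rows where total_bedrooms is missing
option1 = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute
option2 = toy.drop("total_bedrooms", axis=1)

# Option 3: fill the gaps with the median (NaN is ignored when computing it)
median = toy["total_bedrooms"].median()
option3 = toy.copy()
option3["total_bedrooms"] = option3["total_bedrooms"].fillna(median)

print(len(option1), list(option2.columns), option3["total_bedrooms"].tolist())
```

None of the three calls modifies toy itself. With option 3, remember to keep the computed median so the same value can later fill gaps in the test set.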

Scikit-Learn provides an easy-to-use class for handling missing values, SimpleImputer:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

The median can only be computed on numerical attributes, so first drop the ocean_proximity attribute, whose values are text.

housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])

imputer.fit(housing_num)
# fit() computes the median of each attribute in housing_num
imputer.statistics_
# the medians are stored in the statistics_ attribute

Out:

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

Use the fitted imputer to replace the missing values with the learned medians:

X = imputer.transform(housing_num) # returns a plain NumPy array

# put the NumPy array back into a pandas DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)
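The same fit / statistics_ / transform sequence on a toy DataFrame, where the medians are easy to check by hand. The column names a and b and their values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_num = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],      # median of 1, 3, 5 is 3
    "b": [10.0, 20.0, np.nan, 40.0],   # median of 10, 20, 40 is 20
})

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(toy_num)     # fit, then transform in one call

# Wrap the NumPy result back into a DataFrame with the original labels
toy_tr = pd.DataFrame(X, columns=toy_num.columns, index=toy_num.index)

print(imputer.statistics_)
print(toy_tr)
```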

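Handling text and categorical attributes

Integer encoding (LabelEncoder-style) with pandas factorize()

housing_cat = housing["ocean_proximity"]  # the only text attribute
housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]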

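Out:

array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

factorize() maps each category to an integer in order of first appearance: '<1H OCEAN' becomes 0, 'NEAR OCEAN' becomes 1, 'INLAND' becomes 2, and so on.

housing_categories

Out: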

Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')

OneHotEncoder

from sklearn.preprocessing import OneHotEncoder
# The OneHotEncoder returns a sparse array by default, but we can convert it to a dense array if needed
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot

housing_cat_1hot.toarray()

Out:

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
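Both encoding steps on a five-value toy Series (the values are picked by hand): factorize() assigns integer codes in order of first appearance, and OneHotEncoder turns each code into a row with a single 1.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cats = pd.Series(["<1H OCEAN", "<1H OCEAN", "INLAND", "NEAR BAY", "<1H OCEAN"])

# Integer codes plus the list of categories they refer to
codes, categories = cats.factorize()

# OneHotEncoder expects a 2D array, hence the reshape; toarray() makes the
# sparse result dense so it can be printed
one_hot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

print(codes)
print(list(categories))
print(one_hot)
```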

自定义转换器

from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

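This class adds new combined features to the original dataset: rooms per household, population per household, and optionally bedrooms per room.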
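What the two base classes contribute is easier to see on a stripped-down transformer. RatioAdder below is a hypothetical class of my own, not part of the book or of Scikit-Learn: TransformerMixin supplies fit_transform(), and BaseEstimator supplies get_params(), which is what makes the constructor arguments tunable as hyperparameters.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends the ratio of two columns as a new column."""
    def __init__(self, numerator_ix=0, denominator_ix=1):
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]

X = np.array([[6.0, 2.0], [9.0, 3.0], [10.0, 4.0]])
adder = RatioAdder()

X_extra = adder.fit_transform(X)   # method inherited from TransformerMixin
params = adder.get_params()        # method inherited from BaseEstimator

print(X_extra)
print(params)
```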

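Feature scaling

The two most common approaches are min-max scaling and standardization.

Scikit-Learn provides a transformer called MinMaxScaler for the former and a transformer called StandardScaler for the latter.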
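A side-by-side sketch on a single made-up column. Min-max scaling computes (x - min) / (max - min), so the values end up in [0, 1]. Standardization computes (x - mean) / std, so the result has zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # 0, 0.25, 0.5, 0.75, 1
X_std = StandardScaler().fit_transform(X)    # centered on 0, unit variance

print(X_minmax.ravel())
print(X_std.ravel())
```

Standardization does not bound the values to a fixed range, but it is much less affected by outliers than min-max scaling.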

Transformation pipelines

Scikit-Learn provides the Pipeline class to chain transformations in sequence.

# A pipeline for the numerical attributes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

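The Pipeline constructor takes a list of name/estimator pairs that defines the sequence of steps. Every step except the last must be a transformer, that is, it must have a fit_transform() method; the last step can be any estimator.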
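A minimal sketch of what the pipeline does under the hood, on a made-up 3x2 array: calling fit_transform() on the pipeline gives the same result as calling fit_transform() on each step by hand and feeding the output to the next step.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
X_tr = toy_pipeline.fit_transform(X)

# The same two steps applied manually, one after the other
manual = StandardScaler().fit_transform(
    SimpleImputer(strategy="median").fit_transform(X))

print(np.allclose(X_tr, manual))
# Fitted steps stay accessible by name
print(toy_pipeline.named_steps["imputer"].statistics_)
```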

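# DataFrameSelector is the custom transformer defined at the end of this section; define it before running this code
from sklearn.pipeline import FeatureUnion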

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

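cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        # CategoricalEncoder only existed in a development version of Scikit-Learn;
        # OneHotEncoder with dense output is the released equivalent
        # (sparse_output needs scikit-learn >= 1.2; older versions use sparse=False)
        ('cat_encoder', OneHotEncoder(sparse_output=False)),
    ])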
    
full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
    
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

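The FeatureUnion class: when its transform() method is called, it runs the transform() method of each transformer in parallel, waits for their outputs, then concatenates them and returns the result.

Each sub-pipeline starts with a selector: it picks the desired attributes, drops the rest, and converts the resulting DataFrame to a NumPy array. When these notes were written, Scikit-Learn had nothing built in to handle pandas DataFrames, so a simple custom transformer is needed.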

from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
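In Scikit-Learn 0.20 and later, ColumnTransformer can take over the job of DataFrameSelector plus FeatureUnion. This is a swapped-in alternative, not the approach used above, and the sketch runs on a made-up four-row DataFrame rather than the real housing data. sparse_threshold=0 is set so that the combined result always comes back as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing DataFrame
toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, np.nan, 40.0],
    "population": [100.0, 200.0, 300.0, 400.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})
num_attribs = ["total_rooms", "population"]
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry: (name, transformer, columns the transformer is applied to)
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
], sparse_threshold=0)

toy_prepared = full_pipeline.fit_transform(toy)
print(toy_prepared.shape)   # 2 scaled numeric columns + 3 one-hot columns
print(toy_prepared)
```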
