Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2)

Pipeline

A sequence of data processing components is called a data pipeline.
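As a quick illustration of the idea (a minimal sketch, not the book's code at this point; the steps and estimator are placeholders), Scikit-Learn's Pipeline class chains processing components so the output of each step feeds the next:

# minimal sketch of a data pipeline with Scikit-Learn (placeholder steps)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

example_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill in missing values
    ("scaler", StandardScaler()),                   # standardize the features
    ("regressor", LinearRegression()),              # final estimator
])
# example_pipeline.fit(X_train, y_train) would run each component in sequence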

Root Mean Square Error (RMSE)

RMSE(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2}

  • x^{(i)} is a vector of all the feature values of the i-th instance, and y^{(i)} is its label.
  • X is a matrix containing all the feature values; its i-th row is equal to the transpose of x^{(i)}, noted (x^{(i)})^T.
  • h is the system's prediction function, also called a hypothesis.
  • RMSE(X,h) is the cost function measured on the set of examples using the hypothesis h.
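To make the formula concrete, here is a tiny NumPy sketch with made-up predictions and labels:

# illustrative RMSE computation (the values below are made up)
import numpy as np

predictions = np.array([2.5, 0.0, 2.0, 8.0])   # h(x^(i)) for each instance
labels = np.array([3.0, -0.5, 2.0, 7.0])       # y^(i)
rmse = np.sqrt(np.mean((predictions - labels) ** 2))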

Mean Absolute Error (MAE)

MAE(X,h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(x^{(i)}) - y^{(i)}\right|

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values.

  • RMSE corresponds to the ℓ2 norm, also called the Euclidean norm.
  • MAE corresponds to the ℓ1 norm, also called the Manhattan norm.
  • The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE.
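The correspondence with vector norms can be checked directly; this is an illustrative sketch with made-up errors, not code from the book:

# RMSE and MAE as scaled ℓ2 and ℓ1 norms of the error vector
import numpy as np

errors = np.array([2.0, -1.0, 0.5, -0.5])           # h(x^(i)) - y^(i), made up
m = len(errors)
rmse = np.linalg.norm(errors, ord=2) / np.sqrt(m)   # ℓ2 norm scaled by sqrt(m)
mae = np.linalg.norm(errors, ord=1) / m             # ℓ1 norm scaled by m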

Creating an isolated environment

# install virtualenv
$ python3 -m pip install --user -U virtualenv
# create an isolated environment
$ python3 -m virtualenv my_env
# activate this environment
$ source my_env/bin/activate # on Linux or macOS
$ .\my_env\Scripts\activate  # on Windows

# register virtualenv to Jupyter and give it a name
$ python3 -m ipykernel install --user --name=python3

Download the Data

# fetch the data
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
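
# call the download helper once so the dataset is available locally
fetch_housing_data()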

# load the data using pandas
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

Take a Quick Look at the Data Structure

housing = load_housing_data()
# show the total number of rows, each attribute’s type, and the number of nonnull values
housing.info()
# how many districts belong to each category 
housing["ocean_proximity"].value_counts()
# show a summary of the numerical attributes
housing.describe()

# plot a histogram (the magic command below works only in a Jupyter notebook)
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
  • The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile indicates the value below which a given percentage of observations in a group of observations fall (see the snippet after this list).
  • Tail-heavy: many of the histograms extend much farther to the right of the median than to the left.
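For example, a percentile can also be queried directly with pandas (an illustrative call, not from the book):

# 25% of districts have a housing_median_age below this value
housing["housing_median_age"].quantile(0.25)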

Create a Test Set

import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# set the random number generator’s seed so that it always generates the same shuffled indices
np.random.seed(42)

# To have a stable train/test split after updating the dataset, 
# a solution is to use each instance’s identifier.
from zlib import crc32

def test_set_check(identifier, test_ratio):
    # an instance goes into the test set if the hash of its identifier falls in
    # the lowest test_ratio portion of the 32-bit hash space
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Use the row index as the identifier column
housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

# Combine a district’s latitude and longitude into an ID
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

# Scikit-Learn's functions
from sklearn.model_selection import train_test_split
# the random_state parameter lets you set the random generator seed
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# create an income category attribute with five categories (labeled from 1 to 5)
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()

# use Scikit-Learn’s StratifiedShuffleSplit class to do stratified sampling
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# looking at the income category proportions
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

# remove the income_cat attribute so the data is back to its original state
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
  • Stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population.
  • The test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is skewed; the sketch below compares them.
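A minimal sketch of that comparison (in the spirit of the book's analysis), assuming income_cat has not yet been dropped and a purely random split is made on the same data:

# compare income_cat proportions: overall vs. stratified vs. random test set
# (illustrative sketch; run it before dropping income_cat)
from sklearn.model_selection import train_test_split

random_train_set, random_test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": housing["income_cat"].value_counts() / len(housing),
    "Stratified": strat_test_set["income_cat"].value_counts() / len(strat_test_set),
    "Random": random_test_set["income_cat"].value_counts() / len(random_test_set),
}).sort_index()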

Discover and Visualize the Data to Gain Insights

# create a copy 
housing = strat_train_set.copy()

# create a scatterplot of all districts
# Setting the alpha option to 0.1 makes it easier to visualize the places where there is a high density of data points
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

# The radius of each circle represents the district’s population (option s),
# and the color represents the price (option c).
# Use a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices)
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()

# compute the standard correlation coefficient (also called Pearson’s r)
# (with recent pandas versions, exclude non-numeric columns first or pass numeric_only=True)
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

# plot every numerical attribute against every other numerical attribute
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
  • The correlation coefficient ranges from –1 to 1. When it is close to 1, there is a strong positive correlation; when it is close to –1, there is a strong negative correlation; coefficients close to 0 mean that there is no linear correlation.
  • The correlation coefficient only measures linear correlations and may completely miss nonlinear relationships, as the sketch below illustrates.
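As a small illustrative check (not from the book's code), a perfectly deterministic but nonlinear relationship can still have a Pearson coefficient near zero:

# y is fully determined by x, yet the linear (Pearson) correlation is ~0
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 1000)
nonlinear_df = pd.DataFrame({"x": x, "y": x ** 2})  # symmetric, nonlinear relationship
nonlinear_df.corr()["x"]["y"]                       # approximately 0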

Prepare the Data for Machine Learning Algorithms

# drop() creates a copy of the data and does not affect strat_train_set
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

# Option 1: get rid of the corresponding districts
housing.dropna(subset=["total_bedrooms"])
# Option 2: get rid of the whole attribute
housing.drop("total_bedrooms", axis=1)
# Option 3: set the missing values to some value (zero, the mean, the median, ...)
median = housing["total_bedrooms"].median() 
housing["total_bedrooms"].fillna(median, inplace=True)

from sklearn.impute import SimpleImputer
# create a SimpleImputer instance, specifying that you want to replace each attribute’s missing values with the median of that attribute
imputer = SimpleImputer(strategy="median")
# create a copy of the data without the text attribute ocean_proximity
housing_num = housing.drop("ocean_proximity", axis=1)
# fit the imputer instance to the training data
imputer.fit(housing_num)
# The imputer has computed the median of each attribute and stored the result in its statistics_ instance variable. 
imputer.statistics_
housing_num.median().values
# use this “trained” imputer to transform the training set by replacing missing values with the learned medians
X = imputer.transform(housing_num)
# put the result back into a pandas DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)