[Machine Learning Workflow]

1. Basic Data Exploration

The first step of every machine learning project is to load the data and get a thorough understanding of it.

1.1 Getting to Know the Data

Python's pandas library offers powerful data-processing features; an earlier note, Data Analysis with Pandas, covers it in detail.

Here we work with a dataset of house prices in Melbourne, Australia. First load the data and get familiar with it; every dataset calls for its own analysis.

import pandas as pd
melbourne_data = pd.read_csv("./data/melb_data.csv") 
melbourne_data.describe()  # descriptive statistics

[Figure: output of melbourne_data.describe()]

The descriptive statistics cover eight values: count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.

The numerical variables, the categorical variables, the target variable, and the overall data distribution should all be examined at this stage.

melbourne_data.columns

Look at which variables the data contains, which one is the target, and which can serve as features. In this example the target is y = melbourne_data.Price. A few other quick first-pass checks are sketched below.
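A minimal sketch of those checks (the hist() call assumes matplotlib is available):

melbourne_data.head()            # first few rows, to eyeball the variables
melbourne_data.dtypes            # numeric vs. object (categorical) columns
melbourne_data['Price'].hist()   # rough distribution of the target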

1.2 Data Preprocessing

1.2.1 Handling Outliers

See the earlier note, Data Analysis with Pandas, for details; a minimal sketch of one common approach follows.
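One common rule of thumb is to filter by the interquartile range; a minimal sketch (the 1.5 multiplier is the conventional choice, not something from the original note):

# Keep rows whose Price lies within 1.5 IQR of the quartiles
q1, q3 = melbourne_data['Price'].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = melbourne_data['Price'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
melbourne_data_no_outliers = melbourne_data[in_range]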

1.2.2 Handling Missing Values

First, count the missing values:

# Number of missing values in each column of training data
missing_column = (X_train.isnull().sum())
print(missing_column[missing_column > 0])
  • 1. Drop missing values directly ("na" = "not available")
    Either drop the affected rows, or drop whole columns, as the code below does.
melbourne_data = melbourne_data.dropna(axis=0)
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print(f"MAE from Approach 1 (Drop columns with missing values):%.4f" % score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

This approach is rarely used in practice.

  • 2. Mean imputation
    Use scikit-learn's SimpleImputer to handle missing values; by default it fills with the column mean.
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):%.4f"% score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
  • 3. Median imputation
  • 4. Mode (or other constant) imputation
    These reuse the same SimpleImputer API with a different strategy, as sketched below.
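A minimal sketch of the strategy parameter (same workflow as the mean-imputation example above):

# strategy='median' fills with the column median; 'most_frequent' with the mode
median_imputer = SimpleImputer(strategy='median')
mode_imputer = SimpleImputer(strategy='most_frequent')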

1.2.3 Categorical Variables

Common ways to handle them:

  • 1. Drop the categorical variables
    Rarely used in practice.
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):%.4f"% score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
  • 2. Ordinal Encoding
    [Figure: ordinal encoding example]
    Generally suited to ordinal variables, whose values have a meaningful order.
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
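# Note: with the default handle_unknown='error', the transform below fails if
# X_valid contains categories that never appeared in X_train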
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):%.4f"% score_dataset(label_X_train, label_X_valid, y_train, y_valid)) 
  • 3. One-Hot Encoding
    Also called 0/1 encoding, it is generally used for categorical variables whose values have no inherent order, and it is widely applied. It performs poorly when a variable takes too many distinct values; as a rule of thumb, apply it only to variables with no more than about 15 distinct values.
    [Figure: one-hot encoding example]
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('./data/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()] 
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

Get the list of categorical variables:

# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

In Python, OneHotEncoder performs the 0/1 encoding. The parameter handle_unknown='ignore' avoids errors when the validation set contains categories that never appeared in the training set; sparse=False ensures the encoded output is a NumPy array rather than a sparse matrix (in scikit-learn 1.2 and later this parameter is named sparse_output).

from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all column names are strings (required by newer scikit-learn versions,
# since the one-hot columns otherwise get integer names)
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE from Approach 3 (One-Hot Encoding):%.4f"% score_dataset(OH_X_train, OH_X_valid, y_train, y_valid)) 

2. Building the Model

Before fitting a model, you must decide on the target variable and the features.

  • Choose variables by intuition
  • Select variables automatically with statistical methods

A model with intuitively chosen features:

y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()

[Figure: X.describe() output for the selected features]

The usual steps for building a model:

  • Define: what type of model? Regression or classification? Which specific parameters?
  • Fit: usually after splitting the data into training and test sets;
  • Predict: predict the target variable;
  • Evaluate: how well does the model perform? Which evaluation metric?

2.1 A Simple Model: the Decision Tree

2.1.1 Introduction to Decision Trees

[Figure: a one-level decision tree]
The figure above shows a simple one-level decision tree that splits the houses into just two groups. Fitting (training) the model means using the data to decide how to split the houses into these groups and then predicting the price for each group. The dataset used for fitting is called the training set.

[Figure: two candidate one-level decision trees]
Comparing the two decision trees above, the first looks more reasonable: in practice, houses with more bedrooms usually sell for more. The biggest weakness of both models is that they ignore other factors that affect price, such as bathrooms, lot size, distance to the subway, and distance to schools.

The decision tree below has more splits: the values at the bottom of the tree are the predicted prices, and the points at the bottom are called leaf nodes.

[Figure: a deeper decision tree with leaf nodes]

Fit a decision tree to the housing data:

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

random_state ensures the same results on every run.

Predict for the first 5 rows of the data:

print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

2.2 Underfitting and Overfitting

In a decision tree model, the most important choice is the depth of the tree.

  • If the tree is too deep
    Overfitting may occur. When the tree has many leaves, each leaf holds only a few houses. Predictions then track the training prices closely, but the model generalizes poorly to new data, because each prediction rests on very few houses. Typically, an overfit model fits the training set very well but performs poorly on the validation and test sets.
  • If the tree is too shallow
    Underfitting may occur. The splits barely distinguish the data. In the extreme one-level case, the data falls into just two groups whose houses differ widely, so predictions are poor even on the training set. Typically, an underfit model performs poorly on the training set and on the validation and test sets alike.

Both overfitting and underfitting degrade predictive performance, so both must be avoided.


Below, define a function to compare the MAE of models with different tree sizes:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

melbourne_data = pd.read_csv("./data/melb_data.csv") 
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

2.3 Random Forests

A random forest consists of many decision trees and improves accuracy by averaging the predictions of the individual trees. It generally predicts better than a single decision tree and performs well even with default parameters. Use RandomForestRegressor to fit one.

melbourne_data = pd.read_csv("./data/melb_data.csv") 
# Filter rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

2.4 XGBoost

The random forest above is an ensemble model; XGBoost is another kind of ensemble model.

2.4.1 Gradient Boosting

Gradient boosting builds the model through repeated iterations, starting from a single naively initialized model. A cycle generally goes as follows (a minimal sketch follows the list):

  • Use the current ensemble to make predictions;
  • Use those predictions to compute a loss;
  • Use the loss to fit a new model and add it to the ensemble;
  • Repeat.
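To make the loop concrete, here is a minimal sketch for squared-error loss, where the negative gradient is simply the residual (this illustrates the idea only; it is not XGBoost's actual implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_rounds=100, learning_rate=0.1):
    pred = np.full(len(y), float(y.mean()))      # naive initial model: the mean
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                     # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)                   # fit a new model to the loss
        pred += learning_rate * tree.predict(X)  # add it to the ensemble
        trees.append(tree)
    # (prediction would sum the initial mean and the trees' contributions)
    return trees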

In Python this is done with xgboost.XGBRegressor:

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('./data/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: %.4f" % mean_absolute_error(predictions, y_valid))

2.4.2 Parameter Tuning

    1. n_estimators
      The number of models in the ensemble. Too small a value causes underfitting, with poor predictions on both the training and test sets; too large a value causes overfitting, with good training performance but poor test performance. Typical values range from 100 to 1000.
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
    2. early_stopping_rounds
      Helps find the ideal n_estimators: set n_estimators high and let early_stopping_rounds decide when to stop iterating, namely once the validation score has stopped improving for that many rounds; early_stopping_rounds=5 is a reasonable choice. (Note: in recent XGBoost releases this argument is passed to the XGBRegressor constructor rather than to fit().)
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)],
             verbose=False)
    3. learning_rate
      In general, a smaller learning_rate combined with a larger n_estimators yields a more accurate XGBoost model. The default is learning_rate=0.1.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
    4. n_jobs
      Controls parallelism. It matters little on small datasets but can cut training time on large ones; a common choice is the number of CPU cores on the machine.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)

2.5 Pipelines

Pipelines bundle preprocessing and modeling steps; for complex models they keep the code cleaner and more concise.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('./data/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

Building a complete pipeline generally takes three steps:

  • 1. Preprocess the data: categorical & numerical columns
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
  • 2. Choose a model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)
  • 3. Fit, predict, and evaluate the model
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:%.4f' % score)

3. Model Validation

Model validation asks how well the model actually fits. For most models, the single most important quality is predictive accuracy.

3.1 MAE: Mean Absolute Error

The error is the actual value minus the predicted value; take the absolute value of each error and average them to get the MAE: MAE = (1/n) * Σ |y_i − ŷ_i|. In Python, compute it with mean_absolute_error from sklearn.metrics.
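To make the definition concrete, a tiny check with made-up numbers (hypothetical values, purely illustrative):

from sklearn.metrics import mean_absolute_error

y_true = [3, 5, 2]
y_pred = [2, 5, 4]
# errors: 1, 0, -2 -> absolute errors: 1, 0, 2 -> mean: 1.0
print(mean_absolute_error(y_true, y_pred))  # 1.0

Applied to the housing data: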

#melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv("./data/melb_data.csv") 
# Filter out rows with any missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.tree import DecisionTreeRegressor
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)

from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

Note: even if the model performs well on this dataset, it will not necessarily perform well on new data (the score above is computed on the same data the model was trained on). That is why the data must be split; below, train_test_split divides it into a training set and a validation set.

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

The helper function score_dataset, used in the comparisons earlier, computes the evaluation metric MAE:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

3.2 Cross-Validation

Because the train/validation split is random, a model that scores well on one 20% slice of the data may not score well on a different 20%. In general, the more data, the smaller the random error or "noise" in the score and the more reliable the estimate; but the larger the validation set, the smaller the training set.

For example, 5-fold cross-validation splits the dataset into 5 folds; each fold takes one turn as the validation set while the model is fitted on the remaining folds (a mechanical sketch follows the figure).

[Figure: 5-fold cross-validation scheme]
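Mechanically, the folds can be produced with KFold; a minimal sketch, assuming the X and y defined above:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # each fold serves exactly once as the validation set
    print("Fold %d: %d training rows, %d validation rows"
          % (fold, len(train_idx), len(val_idx)))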

Two rules of thumb:

  • Small datasets
    Use cross-validation.
  • Large datasets
    A single validation set is usually large enough to score models reliably, so cross-validation is often unnecessary.

In Python, use cross_val_score() for cross-validation:

import pandas as pd

# Read the data
data = pd.read_csv('./data/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50,
                                                              random_state=0))
                             ])
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)#\n 换行符
print("Average MAE score (across experiments):%.4f"% scores.mean())

4. Model Improvement

  • Rebuild with a different type of model
  • Engineer new features
  • Remove irrelevant features (one way to spot them is sketched below)
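For the last point, scikit-learn's permutation_importance can flag features that contribute little; a minimal sketch, reusing forest_model, val_X, and val_y from section 2.3 (that pairing is an assumption, not from the original notes):

from sklearn.inspection import permutation_importance

# A feature whose shuffling barely changes the score contributes little
result = permutation_importance(forest_model, val_X, val_y,
                                n_repeats=10, random_state=0)
for name, importance in zip(val_X.columns, result.importances_mean):
    print("%s: %.4f" % (name, importance))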