General Machine Learning Workflow
1. Basic Data Exploration
The first step of any machine learning project is to import the data and get a thorough understanding of it.
1.1 Getting Familiar with the Data
Python's pandas library offers powerful data-handling tools; the earlier note on Pandas data analysis covers them in detail.
Here we use a dataset of house prices in Melbourne, Australia. First import the data and get familiar with it; each dataset calls for its own analysis.
import pandas as pd
melbourne_data = pd.read_csv("./data/melb_data.csv")
melbourne_data.describe()  # descriptive statistics
The descriptive statistics consist of eight values: count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.
This is also the stage to examine the numerical variables, the categorical variables, the target variable, and the data distributions.
melbourne_data.columns
Check which variables the data contains, which one is the target, and which can serve as features. In this example the target is y = melbourne_data.Price.
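A minimal sketch of this exploration stage (my own illustration; Price is the target column in this dataset):
melbourne_data.head()               # first few rows
melbourne_data.dtypes               # numerical vs. categorical (object) columns
melbourne_data['Price'].describe()  # distribution of the target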
1.2 Data Preprocessing
1.2.1 Outlier Handling
See the Pandas data analysis note for details.
1.2.2 Missing Value Handling
Count the missing values in each column:
# Number of missing values in each column of training data
missing_column = X_train.isnull().sum()
print(missing_column[missing_column > 0])
- 1. Drop directly
"na" stands for "not available".
melbourne_data = melbourne_data.dropna(axis=0)
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print(f"MAE from Approach 1 (Drop columns with missing values):%.4f" % score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
This approach is rarely used in practice.
- 2. Mean imputation
Use scikit-learn's SimpleImputer to handle missing values; its default strategy is mean imputation.
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE from Approach 2 (Imputation):%.4f"% score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
- 3. Median imputation
- 4. Mode imputation, etc. (see the sketch just below)
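Neither variant appears in the code above; both are one-line changes to SimpleImputer's strategy parameter. A minimal sketch:
median_imputer = SimpleImputer(strategy='median')        # median imputation
mode_imputer = SimpleImputer(strategy='most_frequent')   # mode imputation
imputed_X_train = pd.DataFrame(median_imputer.fit_transform(X_train),
                               columns=X_train.columns, index=X_train.index)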
1.2.3 Categorical Variables
Common ways to handle them:
- 1. Drop the categorical variables;
This is rarely used in practice.
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop categorical variables):%.4f"% score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
- 2. Ordinal Encoding;
Generally suited to ordinal variables, whose values have a natural order. (Here object_cols is the list of categorical column names; it is constructed in the one-hot encoding section below.)
from sklearn.preprocessing import OrdinalEncoder
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
print("MAE from Approach 2 (Ordinal Encoding):%.4f"% score_dataset(label_X_train, label_X_valid, y_train, y_valid))
- 3. One-Hot Encoding
Also called 0-1 encoding; it is widely used for categorical variables whose values have no intrinsic order. It performs poorly when a variable takes too many distinct values; as a rule of thumb, apply it only to categorical variables with no more than about 15 distinct values.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('./data/melb_data.csv')
# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                        X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
Get the list of categorical variables:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
In Python, use OneHotEncoder for one-hot encoding. The parameter handle_unknown='ignore' avoids errors when the validation set contains categories that never appear in the training set; sparse=False ensures the encoded output is a dense numpy array rather than a sparse matrix (in recent scikit-learn versions this parameter is named sparse_output).
from sklearn.preprocessing import OneHotEncoder
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
print("MAE from Approach 3 (One-Hot Encoding):%.4f"% score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
2. Building a Model
Before fitting a model, decide on the target variable and the features. Variables can be chosen:
- by intuition;
- automatically, with statistical methods.
A model whose features were chosen by intuition:
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()
The usual steps for building a model:
- Define: what type of model? Regression or classification? Which parameters?
- Fit: this usually requires splitting the data into training and test sets;
- Predict: predict the target variable;
- Evaluate: how well does the model perform, and by which metrics?
2.1 A Simple Model: Decision Trees
2.1.1 Introduction to Decision Trees
Above is a simple one-level decision tree that splits the houses into just two groups. Fitting, or training, the model means using the data to decide how to split the houses into those groups and then predicting the price within each group. The dataset used for fitting is called the training set.
Comparing the two decision trees above, the first seems more reasonable: in practice, more bedrooms usually mean a higher price. The biggest weakness of both models is that they ignore other factors affecting house prices, such as bathrooms, lot size, distance to the subway, and distance to schools.
The decision tree below has more branches: the bottom of the tree gives the predicted prices, and the nodes at the bottom are the leaf nodes.
Fit a decision tree to the house-price data of this example:
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit model
melbourne_model.fit(X, y)
random_state ensures the same results on every run.
Predict the first 5 rows of the data:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
2.2 Underfitting and Overfitting
In a decision tree model, the most important choice is the depth of the tree.
- If the tree is too deep: overfitting can occur. When the tree has many leaves, each leaf holds fewer houses; with few houses per leaf the predictions track the training prices closely, but new data become hard to predict because each prediction rests on very little data. A typical sign of overfitting: the model fits the training set well but performs poorly on the validation and test sets.
- If the tree is too shallow: underfitting can occur, and the splits barely distinguish the data. In the extreme case of a single level, the data are split into only 2 groups with large variation inside each, so predictions are poor even on the training set. A typical sign of underfitting: the model performs poorly on the training set as well as on the validation and test sets.
Both overfitting and underfitting degrade predictive performance, so both should be avoided.
Below, define a function to compare the MAE across models.
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
melbourne_data = pd.read_csv("./data/melb_data.csv")
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
2.3 Random Forests
A random forest builds many decision trees and improves accuracy by averaging the predictions of the individual trees. It generally predicts better than a single decision tree and performs well even with default parameters. Use RandomForestRegressor to fit a random forest model.
melbourne_data = pd.read_csv("./data/melb_data.csv")
# Filter rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
2.4 XGBoost
The random forest above is an ensemble model; XGBoost is another kind of ensemble model.
2.4.1 Gradient Boosting
Gradient boosting improves the model through repeated iteration, starting from a single naive initial model. The usual cycle (sketched in code after this list) is:
- use the current ensemble to make predictions;
- compute the loss from those predictions;
- fit a new model to that loss and add it to the ensemble;
- repeat.
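To make the cycle concrete, here is a toy from-scratch sketch of gradient boosting for squared loss using shallow trees (my own illustration, not how XGBoost is actually implemented):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_rounds=100, learning_rate=0.1):
    pred = np.full(len(y), y.mean())       # naive initial model: predict the mean
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred               # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += learning_rate * tree.predict(X)   # add the new model to the ensemble
        trees.append(tree)
    return trees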
In Python this is available via xgboost.XGBRegressor:
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('./data/melb_data.csv')
# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
# Select target
y = data.Price
# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
from xgboost import XGBRegressor
my_model = XGBRegressor()
my_model.fit(X_train, y_train)
from sklearn.metrics import mean_absolute_error
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: %.4f" % mean_absolute_error(predictions, y_valid))
2.4.2 Parameter Tuning
- n_estimators
This is equivalent to the number of models in the ensemble (the number of boosting rounds). Too small a value leads to underfitting, hurting predictions on both the training and test sets; too large a value leads to overfitting, so the model looks good on the training set but often does poorly on the test set. Typical values range from 100 to 1000.
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
- early_stopping_rounds
This helps find the ideal n_estimators automatically. Set n_estimators to a high value, then let early_stopping_rounds decide when to stop iterating: training stops once the validation score has stopped improving for that many consecutive rounds. early_stopping_rounds=5 is a reasonable choice.
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
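Note: newer XGBoost releases (2.0 and later) no longer accept early_stopping_rounds in fit(); as far as I know it is passed to the constructor instead, roughly as follows:
my_model = XGBRegressor(n_estimators=500, early_stopping_rounds=5)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)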
- learning_rate
In general, a smaller learning_rate paired with a larger n_estimators yields a more accurate XGBoost model. The default is learning_rate=0.1.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
- n_jobs
Parallelism does little for runtime on small datasets but helps noticeably on large ones; it is common to set n_jobs to the number of CPU cores on the machine.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
2.5 Pipelines
Pipelines are mainly a way to organize code: for complex models they bundle the preprocessing and modeling steps together, keeping the code clearer and more concise.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('./data/melb_data.csv')
# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
Building a complete pipeline generally takes three steps:
- 1. Preprocess the data (categorical and numerical variables);
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
- 2. Choose a model;
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)
- 3. Predict and evaluate the model.
from sklearn.metrics import mean_absolute_error
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE: %.4f' % score)
3. Model Validation
Model validation asks how well the model actually fits. For most models, what matters most is predictive accuracy.
3.1 MAE (Mean Absolute Error)
The error is the actual value minus the predicted value; take the absolute value of each error and average them to get the MAE. In Python, compute it with mean_absolute_error from the sklearn.metrics module.
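Written as a formula (with $y_i$ the actual values and $\hat{y}_i$ the predictions):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$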
#melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv("./data/melb_data.csv")
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]
from sklearn.tree import DecisionTreeRegressor
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)
from sklearn.metrics import mean_absolute_error
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
Note: even if the model performs well on this dataset, it will not necessarily perform well on new data. That is why the dataset must be split; below, train_test_split divides it into a training set and a validation set:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)
# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
The function score_dataset computes the MAE evaluation metric for a model:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
3.2 Cross-Validation
A train/validation split is random: a model that scores well on one 20% holdout may not score well on a different 20%. In general, the larger the validation set, the less random error, or "noise", in the quality measure and the more reliable it is; but a larger validation set leaves a smaller training set.
For example, 5-fold cross-validation splits the dataset into 5 folds, and each fold in turn serves as the validation set while the model is fitted on the rest.
Two rules of thumb:
- small datasets: use cross-validation;
- large datasets: a single validation split already gives a reliable measure, so cross-validation is usually unnecessary.
In Python, run cross-validation with cross_val_score():
import pandas as pd
# Read the data
data = pd.read_csv('./data/melb_data.csv')
# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
# Select target
y = data.Price
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50,
                                                              random_state=0))])
from sklearn.model_selection import cross_val_score
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)#\n 换行符
print("Average MAE score (across experiments):%.4f"% scores.mean())
4. Model Improvement
- Rebuild the model or switch to a different type of model;
- Re-engineer the features;
- Drop irrelevant features, etc. (one way to rank them is sketched below).
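As a starting point for dropping irrelevant features, a fitted random forest can rank them by importance. A minimal sketch, assuming forest_model and train_X from section 2.3 are still in scope:
importances = pd.Series(forest_model.feature_importances_, index=train_X.columns)
print(importances.sort_values(ascending=False))  # low-importance features are candidates to drop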