General Machine Learning Workflow
1. Basic Data Exploration
The first step of any machine learning project is to import the data and get a thorough understanding of it.
1.1 Getting Familiar with the Data
Python's pandas library offers powerful data-handling tools; the earlier note on Pandas data analysis covers them in detail.
Here we use a dataset of house prices in Melbourne, Australia. First import the data and get familiar with it; each dataset calls for its own analysis.
import pandas as pd
melbourne_data = pd.read_csv("./data/melb_data.csv")
melbourne_data.describe()  # descriptive statistics
The descriptive statistics consist of eight values: count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.
This is also the stage to examine the numerical variables, the categorical variables, the target variable, and the data distributions.
melbourne_data.columns
Check which variables the data contains, which one is the target, and which can serve as features. In this example the target is y = melbourne_data.Price.
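A minimal sketch of this exploration stage (my own illustration; Price is the target column in this dataset):
melbourne_data.head()               # first few rows
melbourne_data.dtypes               # numerical vs. categorical (object) columns
melbourne_data['Price'].describe()  # distribution of the target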
1.2 Data Preprocessing
1.2.1 Outlier Handling
See the Pandas data analysis note for details.
1.2.2 Missing Value Handling
Count the missing values in each column:
# Number of missing values in each column of training data
missing_column = X_train.isnull().sum()
print(missing_column[missing_column > 0])
- 1. Drop directly
"na" stands for "not available".
melbourne_data = melbourne_data.dropna(axis=0)
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print(f"MAE from Approach 1 (Drop columns with missing values):%.4f" % score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
This approach is rarely used in practice.
- 2. Mean imputation
Use scikit-learn's SimpleImputer to handle missing values; its default strategy is mean imputation.
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE from Approach 2 (Imputation):%.4f"% score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
- 3. Median imputation
- 4. Mode imputation, etc. (see the sketch just below)
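Neither variant appears in the code above; both are one-line changes to SimpleImputer's strategy parameter. A minimal sketch:
median_imputer = SimpleImputer(strategy='median')        # median imputation
mode_imputer = SimpleImputer(strategy='most_frequent')   # mode imputation
imputed_X_train = pd.DataFrame(median_imputer.fit_transform(X_train),
                               columns=X_train.columns, index=X_train.index)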
1.2.3 Categorical Variables
Common ways to handle them:
- 1. Drop the categorical variables;
This is rarely used in practice.
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop categorical variables):%.4f"% score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
- 2. Ordinal Encoding;
Generally suited to ordinal variables, whose values have a natural order. (Here object_cols is the list of categorical column names; it is constructed in the one-hot encoding section below.)
from sklearn.preprocessing import OrdinalEncoder
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
print("MAE from Approach 2 (Ordinal Encoding):%.4f"% score_dataset(label_X_train, label_X_valid, y_train, y_valid))
- 3. One-Hot Encoding
Also called 0-1 encoding; it is widely used for categorical variables whose values have no intrinsic order. It performs poorly when a variable takes too many distinct values; as a rule of thumb, apply it only to categorical variables with no more than about 15 distinct values.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('./data/melb_data.csv')
# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                        X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
Get the list of categorical variables:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
In Python, use OneHotEncoder for one-hot encoding. The parameter handle_unknown='ignore' avoids errors when the validation set contains categories that never appear in the training set; sparse=False ensures the encoded output is a dense numpy array rather than a sparse matrix (in recent scikit-learn versions this parameter is named sparse_output).
from sklearn.preprocessing import OneHotEncoder
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
print("MAE from Approach 3 (One-Hot Encoding):%.4f"% score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
2. Building a Model
Before fitting a model, decide on the target variable and the features. Variables can be chosen:
- by intuition;
- automatically, with statistical methods.
A model whose features were chosen by intuition:
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()
The usual steps for building a model:
- Define: what type of model? Regression or classification? Which parameters?
- Fit: this usually requires splitting the data into training and test sets;
- Predict: predict the target variable;
- Evaluate: how well does the model perform, and by which metrics?
2.1 A Simple Model: Decision Trees
2.1.1 Introduction to Decision Trees
Above is a simple one-level decision tree that splits the houses into just two groups. Fitting, or training, the model means using the data to decide how to split the houses into those groups and then predicting the price within each group. The dataset used for fitting is called the training set.
Comparing the two decision trees above, the first seems more reasonable: in practice, more bedrooms usually mean a higher price. The biggest weakness of both models is that they ignore other factors affecting house prices, such as bathrooms, lot size, distance to the subway, and distance to schools.
The decision tree below has more branches: the bottom of the tree gives the predicted prices, and the nodes at the bottom are the leaf nodes.
Fit a decision tree to the house-price data of this example:
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit model
melbourne_model.fit(X, y)
random_state ensures the same results on every run.
Predict the first 5 rows of the data:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
2.2 Underfitting and Overfitting
In a decision tree model, the most important choice is the depth of the tree.
- If the tree is too deep: overfitting can occur. When the tree has many leaves, each leaf holds fewer houses; with few houses per leaf the predictions track the training prices closely, but new data become hard to predict because each prediction rests on very little data. A typical sign of overfitting: the model fits the training set well but performs poorly on the validation and test sets.
- If the tree is too shallow: underfitting can occur, and the splits barely distinguish the data. In the extreme case of a single level, the data are split into only 2 groups with large variation inside each, so predictions are poor even on the training set. A typical sign of underfitting: the model performs poorly on the training set as well as on the validation and test sets.
Both overfitting and underfitting degrade predictive performance, so both should be avoided.
Below, define a function to compare the MAE across models.
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
melbourne_data = pd.read_csv("./data/melb_data.csv")
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
2.3 Random Forests
A random forest builds many decision trees and improves accuracy by averaging the predictions of the individual trees. It generally predicts better than a single decision tree and performs well even with default parameters. Use RandomForestRegressor to fit a random forest model.
melbourne_data = pd.read_csv("./data/melb_data.csv")
# Filter rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
2.4 XGBoost
The random forest above is an ensemble model; XGBoost is another kind of ensemble model.
2.4.1 Gradient Boosting
Gradient boosting improves the model through repeated iteration, starting from a single naive initial model. The usual cycle (sketched in code after this list) is:
- use the current ensemble to make predictions;
- compute the loss from those predictions;
- fit a new model to that loss and add it to the ensemble;
- repeat.
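To make the cycle concrete, here is a toy from-scratch sketch of gradient boosting for squared loss using shallow trees (my own illustration, not how XGBoost is actually implemented):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_rounds=100, learning_rate=0.1):
    pred = np.full(len(y), y.mean())       # naive initial model: predict the mean
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred               # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += learning_rate * tree.predict(X)   # add the new model to the ensemble
        trees.append(tree)
    return trees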
In Python this is available via xgboost.XGBRegressor:
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('./data/melb_data.csv')
# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
# Select target
y = data.Price
# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
from xgboost import XGBRegressor
my_model = XGBRegressor()
my_model.fit(X_train, y_train)
from sklearn.metrics import mean_absolute_error
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: %.4f" % mean_absolute_error(predictions, y_valid))
2.4.2 Parameter Tuning
- n_estimators
This is equivalent to the number of models in the ensemble (the number of boosting rounds). Too small a value leads to underfitting, hurting predictions on both the training and test sets; too large a value leads to overfitting, so the model looks good on the training set but often does poorly on the test set. Typical values range from 100 to 1000.
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
- early_stopping_rounds
This helps find the ideal n_estimators automatically. Set n_estimators to a high value, then let early_stopping_rounds decide when to stop iterating: training stops once the validation score has stopped improving for that many consecutive rounds. early_stopping_rounds=5 is a reasonable choice.
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
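Note: newer XGBoost releases (2.0 and later) no longer accept early_stopping_rounds in fit(); as far as I know it is passed to the constructor instead, roughly as follows:
my_model = XGBRegressor(n_estimators=500, early_stopping_rounds=5)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)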
- learning_rate
In general, a smaller learning_rate paired with a larger n_estimators yields a more accurate XGBoost model. The default is learning_rate=0.1.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
- n_jobs
Parallelism does little for runtime on small datasets but helps noticeably on large ones; it is common to set n_jobs to the number of CPU cores on the machine.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
2.5 Pipelines
Pipelines are mainly a way to organize code: for complex models they bundle the preprocessing and modeling steps together, keeping the code clearer and more concise.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('./data/melb_data.csv')
# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
Building a complete pipeline generally takes three steps:
- 1. Preprocess the data (categorical and numerical variables);
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
- 2. Choose a model;
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)
- 3. Predict and evaluate the model.
from sklearn.metrics import mean_absolute_error
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE: %.4f' % score)
3. Model Validation
Model validation asks how well the model actually fits. For most models, what matters most is predictive accuracy.
3.1 MAE (Mean Absolute Error)
The error is the actual value minus the predicted value; take the absolute value of each error and average them to get the MAE. In Python, compute it with mean_absolute_error from the sklearn.metrics module.
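Written as a formula (with $y_i$ the actual values and $\hat{y}_i$ the predictions):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$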
#melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv("./data/melb_data.csv")
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]
from sklearn.tree import DecisionTreeRegressor
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)
from sklearn.metrics import mean_absolute_error
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
Note: even if the model performs well on this dataset, it will not necessarily perform well on new data. That is why the dataset must be split; below, train_test_split divides it into a training set and a validation set:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)
# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
The function score_dataset computes the MAE evaluation metric for a model:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
3.2 Cross-Validation
A train/validation split is random: a model that scores well on one 20% holdout may not score well on a different 20%. In general, the larger the validation set, the less random error, or "noise", in the quality measure and the more reliable it is; but a larger validation set leaves a smaller training set.
For example, 5-fold cross-validation splits the dataset into 5 folds, and each fold in turn serves as the validation set while the model is fitted on the rest.
Two rules of thumb:
- small datasets: use cross-validation;
- large datasets: a single validation split already gives a reliable measure, so cross-validation is usually unnecessary.
In Python, run cross-validation with cross_val_score():
import pandas as pd
# Read the data
data = pd.read_csv('./data/melb_data.csv')
# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
# Select target
y = data.Price
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50,
                                                              random_state=0))])
from sklearn.model_selection import cross_val_score
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)#\n 换行符
print("Average MAE score (across experiments):%.4f"% scores.mean())
4. Model Improvement
- Rebuild the model or switch to a different type of model;
- Re-engineer the features;
- Drop irrelevant features, etc. (one way to rank them is sketched below).
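As a starting point for dropping irrelevant features, a fitted random forest can rank them by importance. A minimal sketch, assuming forest_model and train_X from section 2.3 are still in scope:
importances = pd.Series(forest_model.feature_importances_, index=train_X.columns)
print(importances.sort_values(ascending=False))  # low-importance features are candidates to drop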