Today I'd like to share the ensemble algorithms used in data mining. I won't go into the underlying theory here; the code examples are on my GitHub.
These are also the two blog posts I mainly referenced:
1. Overview of the XGBoost Library
Besides Python, XGBoost also supports R, Java, and other languages. This post focuses on the Python XGBoost library; it can be installed with "pip install xgboost", and version 0.90 is used here. Beyond decision trees as weak learners, XGBoost also supports linear models and DART, decision trees with dropout. In most cases, though, the default decision-tree weak learner is sufficient, and it is the only one discussed in this post.
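For reference, the weak-learner type is selected through the booster parameter. A minimal sketch (the values shown are illustrative, not taken from this post's experiments):

import xgboost as xgb
params_tree = {'booster': 'gbtree'}                  # default: gradient-boosted decision trees
params_linear = {'booster': 'gblinear'}              # linear models as weak learners
params_dart = {'booster': 'dart', 'rate_drop': 0.1}  # DART: trees with dropout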
XGBoost has two Python interface styles. One is XGBoost's own native Python API; the other is an sklearn-style API. The two implementations are essentially the same, differing only in minor API details, mainly parameter naming and how the dataset is initialized. A quick side-by-side sketch follows.
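A hedged illustration of that difference (the toy data and parameter values below are made up purely for demonstration):

import numpy as np
import xgboost as xgb
X = np.random.rand(100, 5)  # toy feature matrix
y = np.random.rand(100)     # toy regression targets
# Native API: wrap the data in a DMatrix and pass parameters as a dict
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({'eta': 0.1, 'max_depth': 3}, dtrain, num_boost_round=10)
# sklearn-style API: an estimator with fit/predict; note the renaming, e.g. eta -> learning_rate
model = xgb.XGBRegressor(learning_rate=0.1, max_depth=3, n_estimators=10)
model.fit(X, y)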
2. Using the Native Python API
With the native XGBoost API, the dataset first has to be split into its input-feature part and its output part, and then loaded into a DMatrix data structure. We don't need to care about DMatrix internals; initializing it with our training set X and y is enough.
import xgboost as xgb
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
import warnings
warnings.filterwarnings('ignore')  ## suppress warnings
train = pd.read_csv(r'./data/insurance/train.csv')  ## load the data
features = [x for x in train.columns if x not in ['id', 'loss', 'log_loss']]  ## feature columns
cat_features = [x for x in train.select_dtypes(
    include=['object']).columns if x not in ['id', 'loss', 'log_loss']]  ## categorical features
num_features = [x for x in train.select_dtypes(
    exclude=['object']).columns if x not in ['id', 'loss', 'log_loss']]  ## numerical features
print("Categorical features:", len(cat_features))
print("Numerical features:", len(num_features))
ntrain = train.shape[0]
train_x = train[features]  ## X: feature matrix
train_y = train['log_loss']  ## y: labels (log-transformed loss)
# Map every categorical feature from strings to integer codes
for c in cat_features:
    train_x[c] = train_x[c].astype('category').cat.codes
print("Xtrain:", train_x.shape)
print("ytrain:", train_y.shape)
xtrain, xtest, ytrain, ytest = train_test_split(train_x, train_y, test_size=0.33, random_state=42)
def xg_eval_mae(yhat, dtrain):
    """
    Custom evaluation metric: mean absolute error on the original scale.
    :param yhat: predicted values
    :param dtrain: DMatrix holding the true labels
    :return: (metric name, metric value)
    """
    y = dtrain.get_label()  ## true label values
    # np.exp assumes log_loss = np.log(loss); if the labels were built with
    # np.log1p, np.expm1 would be the matching inverse (as used further below)
    return 'mae', mean_absolute_error(np.exp(y), np.exp(yhat))
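A custom feval receives the raw predictions and the DMatrix, and must return a (name, value) pair. A quick sanity check on toy data (purely illustrative):

toy = xgb.DMatrix(np.random.rand(10, 3), label=np.random.rand(10))  # toy DMatrix
name, value = xg_eval_mae(np.random.rand(10), toy)  # call the metric directly
print(name, value)  # -> 'mae' and a float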
dtrain = xgb.DMatrix(xtrain, ytrain)  ## wrap the training split in a DMatrix
dtest = xgb.DMatrix(xtest, ytest)  ## wrap the test split in a DMatrix
xgb_params = {
    'seed': 0,                  ## random seed
    'eta': 0.07,                ## shrinkage, similar to a learning rate
    'colsample_bytree': 0.6,    ## fraction of features sampled per tree
    'silent': 1,                ## suppress logging ('verbosity' in newer versions)
    'subsample': 0.9,           ## fraction of training samples used per tree
    'objective': 'reg:linear',  ## squared-error regression ('reg:squarederror' in newer versions)
    'max_depth': 8,             ## maximum depth of each tree
    'min_child_weight': 6,      ## minimum child-node weight; a node whose weight
                                ## falls below this threshold becomes a leaf
    'gamma': 0.2                ## minimum loss reduction required to split a node
}  ## parameter dictionary
## Cross-validation
bst_cv1 = xgb.cv(xgb_params, dtrain, num_boost_round=50, nfold=3, seed=0,
                 feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)
print(bst_cv1)
print('CV score:', bst_cv1.iloc[-1, :]['test-mae-mean'])  ## CV score: 1223.1750083333334
bst_cv1[['train-mae-mean', 'test-mae-mean']].plot()  ## learning curves from the CV run
plt.show()
Figure: train-mae-mean and test-mae-mean over boosting rounds.
raw_model = xgb.train(xgb_params, dtrain, num_boost_round=200, feval=xg_eval_mae)  ## train 200 weak-learner trees on the training data
with open('./save/insurance/xgboost_insurance.pickle', 'wb') as f:  ## save the trained model
    pickle.dump(raw_model, f)
with open('./save/insurance/xgboost_insurance.pickle', 'rb') as f:  ## load the model back
    raw_model = pickle.load(f)
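As an aside, pickle is not the only option: the Booster object has its own serialization methods. A minimal sketch (the .model file path is illustrative):

raw_model.save_model('./save/insurance/xgboost_insurance.model')  # native binary format
loaded = xgb.Booster(model_file='./save/insurance/xgboost_insurance.model')  # load it back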
pred_test_raw = np.expm1(raw_model.predict(dtest))  ## predict on the test split, back-transforming from the log scale
print(mean_absolute_error(np.expm1(dtest.get_label()), pred_test_raw))  ## mean absolute error
predictions = pd.DataFrame({"predict": pred_test_raw, "true": np.expm1(dtest.get_label())})
predictions.plot(x="predict", y="true", kind="scatter")
predictions.to_csv('./save/insurance/output.csv')
plt.show()
Ideally, the true and predicted values in this scatter plot would lie along the diagonal; since no parameter tuning or feature engineering has been done yet, the error is still somewhat large!
3. Using the sklearn-Style API with sklearn-Style Parameters
The sklearn-style interface offers two main classes: XGBClassifier for classification and XGBRegressor for regression (a minimal classifier sketch follows below).
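This post only demonstrates XGBRegressor; for completeness, here is a small XGBClassifier sketch on toy data (all values are illustrative):

Xc = np.random.rand(100, 4)             # toy features
yc = np.random.randint(0, 2, size=100)  # toy binary labels
clf = xgb.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=50)
clf.fit(Xc, yc)
print(clf.predict_proba(Xc[:5]))  # class probabilities for the first 5 rows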
XGBoost can be combined with sklearn's grid-search class GridSearchCV for parameter tuning, and it is then used exactly like any ordinary sklearn classifier or regressor. A concrete example of the workflow:
xgb_reg = xgb.XGBRegressor(min_child_weight=3, subsample=0.8, learning_rate=0.5, verbosity=1)
def mae_score(y_true, y_pred):
    # MAE on the original scale (np.exp assumes log-transformed labels)
    return mean_absolute_error(np.exp(y_true), np.exp(y_pred))
mae_scorer = make_scorer(mae_score, greater_is_better=False)  # sklearn maximizes scores, so the MAE is negated
xgb_param_grid = {'max_depth': list(range(5, 8)), 'min_child_weight': [1, 3, 6]}  # parameter grid
grid = GridSearchCV(xgb_reg, param_grid=xgb_param_grid, cv=4, scoring=mae_scorer)
grid.fit(xtrain, ytrain)
print(grid.scorer_)
print(grid.best_params_)
print(grid.best_score_)  # negative because greater_is_better=False
model = grid.best_estimator_
with open('./save/insurance/xgboost_insurance1.pickle', 'wb') as f:  # save the best model
    pickle.dump(model, f)
with open('./save/insurance/xgboost_insurance1.pickle', 'rb') as f:  # load it back
    xgb_reg = pickle.load(f)
xgb_preds = np.expm1(xgb_reg.predict(xtest))  # back-transform predictions
predictions = pd.DataFrame({"predict": xgb_preds, "true": np.expm1(ytest)})
predictions.plot(x="predict", y="true", kind="scatter")
predictions.to_csv('./save/insurance/output1.csv')
plt.show()
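The native-API section used early stopping inside xgb.cv; the sklearn-style interface supports it as well. A hedged sketch reusing the split from above (eval_set, eval_metric and early_stopping_rounds are standard fit arguments in this XGBoost version; the parameter values are illustrative):

es_model = xgb.XGBRegressor(max_depth=7, learning_rate=0.07, n_estimators=500)
es_model.fit(xtrain, ytrain,
             eval_set=[(xtest, ytest)],  # held-out set monitored during training
             eval_metric='mae',          # built-in MAE on the log scale
             early_stopping_rounds=10,   # stop after 10 rounds without improvement
             verbose=False)
print(es_model.best_iteration)  # index of the best boosting round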