1. How it works
A clearly written, easy-to-follow introduction, in my opinion:
https://www.jianshu.com/p/7467e616f227
2. Grid-search tuning
xgboost can be used for binary classification, multiclass classification, and regression. Beyond the features themselves, what most affects the model is how it is tuned; the usual approach is a stepwise grid search for the best parameters. The two articles below, one a classification case and one a regression case, give detailed tuning steps and code:
https://blog.csdn.net/han_xiaoyang/article/details/52665396 (classification, XGBClassifier)
https://segmentfault.com/a/1190000014040317 (regression, XGBRegressor)
3. Practice
Following the posts above, run a multiclass prediction (three classes) on the classic iris dataset:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import model_selection, metrics
from sklearn.model_selection import GridSearchCV  # performing grid search (sklearn.grid_search was removed in newer versions)
import matplotlib.pylab as plt
%matplotlib inline
import warnings
warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
label = iris.target
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
train_x, test_x, train_y, test_y = train_test_split(data, label,test_size=0.3, random_state=0)
# DMatrix is xgboost's native data structure; it is not needed by the
# sklearn-style GridSearchCV below.
dtrain = xgb.DMatrix(train_x, label=train_y)
dtest = xgb.DMatrix(test_x, label=test_y)
cv_params = {
'n_estimators': [1,2,3,4,5,6]}
other_params = {
'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
model = xgb.XGBClassifier(**other_params)
optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='accuracy', cv=5, verbose=1, n_jobs=4)
optimized_GBM.fit(train_x, train_y)
evaluate_result = optimized_GBM.cv_results_  # grid_scores_ was removed in newer sklearn
#print('Per-candidate CV results: {0}'.format(evaluate_result))
print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
print('Best model score: {0}'.format(optimized_GBM.best_score_))
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameter values: {'n_estimators': 4}
Best model score: 0.9619047619047619
[Parallel(n_jobs=4)]: Done 30 out of 30 | elapsed: 8.9s finished
cv_params = {
'max_depth': [3, 4, 5, 6, 7, 8,