Main references: https://blog.csdn.net/kicilove/article/details/78413112#comments
https://wuhuhu800.github.io/2018/02/28/XGboost_param_share/#xgboost%E7%9A%84%E5%8F%82%E6%95%B0
Getting started:
Import the libraries and load the data.
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize']=(12.0,4.0)
train = pd.read_csv("C:\\Users\\Nihil\\Documents\\pythonlearn\\data\\train_modified.csv")
target = 'Disbursed'  # Disbursed is the binary classification target
IDcol = 'ID'  # kept as a variable so the column lists below are easy to compare
x_columns = [x for x in train.columns if x not in [target, IDcol]]
X_train = train[x_columns]
y_train = train['Disbursed']
For rcParams usage, see "plt.rcParams in Matplotlib (setting figure details)".
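rcParams is a dict-like global configuration, so any Matplotlib default can be set the same way (the second key below is just an illustrative example):
plt.rcParams['figure.figsize'] = (12.0, 4.0)  # default figure width/height in inches, as above
plt.rcParams['font.size'] = 12                # any other default is set the same way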
The code above imports XGBoost in two forms:
- xgb: the native library. Its cv function runs cross-validation at each boosting round and returns the ideal number of trees, but full cross-validation is slow.
- XGBClassifier: XGBoost's sklearn wrapper, imported so the remaining parameters can be tuned with GridSearchCV.
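For example, the sklearn wrapper is what plugs into GridSearchCV; a minimal sketch, with an illustrative grid mirroring the later tuning steps (the specific values are assumptions, not fixed by this post):
param_test = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 6, 2)}
gsearch = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, seed=27),
                       param_grid=param_test, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
print(gsearch.best_params_, gsearch.best_score_)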
Step 1, finding the number of iterations: define a modelfit function that builds an XGBoost model and cross-validates it.
def modelfit(alg, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(X_train, label=y_train)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc',
                          early_stopping_rounds=early_stopping_rounds, verbose_eval=False)
        alg.set_params(n_estimators=cvresult.shape[0])  # keep only the CV-chosen number of trees
    alg.fit(X_train, y_train, eval_metric='auc')
    dtrain_predictions = alg.predict(X_train)
    dtrain_predprob = alg.predict_proba(X_train)[:, 1]
    acc = metrics.accuracy_score(y_train.values, dtrain_predictions)
    auc = metrics.roc_auc_score(y_train, dtrain_predprob)
    print("Accuracy is {:.4f}".format(acc))
    print('Best number of trees = {}'.format(cvresult.shape[0]))  # number of trees picked by CV
    print("AUC Score (Train) is {:.4f}".format(auc))
    # Plot the feature importances
    print(alg.feature_importances_)
    plt.bar(range(len(alg.feature_importances_)), alg.feature_importances_)
    plt.show()
    # Alternative (recommended): rank features by the booster's f-scores
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
    plt.show()
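Calling it with a baseline classifier; the parameter values below are the usual tutorial starting points (assumptions, not something fixed by this post):
xgb1 = XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5,
                     min_child_weight=1, gamma=0, subsample=0.8,
                     colsample_bytree=0.8, objective='binary:logistic',
                     scale_pos_weight=1, seed=27)
modelfit(xgb1)  # cross-validates, refits, and plots feature importances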
A few points worth noting:
- alg.set_params(n_estimators=cvresult.shape[0]): xgb.cv returns one row of results per boosting round kept by early stopping, so cvresult.shape[0] is the CV-chosen number of trees; after this call, alg.get_params()['n_estimators'] holds the same value. (In NumPy, .shape[0] is the number of rows and .shape[1] the number of columns; see the shape attribute of numpy.array.)
- dtrain_predprob = alg.predict_proba(X_train)[:, 1]: column 1 is the predicted probability of the positive (second) class; a small predict vs. predict_proba example follows after this list.
- "{:.4f}".format(...) uses Python's format function for string formatting.
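A quick toy illustration of predict vs. predict_proba, on synthetic data purely for demonstration:
from sklearn.datasets import make_classification
Xd, yd = make_classification(n_samples=100, n_features=5, random_state=0)
toy = XGBClassifier(n_estimators=10).fit(Xd, yd)
print(toy.predict(Xd[:3]))        # hard class labels, e.g. [0 1 0]
print(toy.predict_proba(Xd[:3]))  # shape (3, 2); column [:, 1] is P(class 1)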
Using early stopping:
The code is modified slightly below; this part follows https://www.yuque.com/zhaoshijun/md/mtx7ty
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 4.0)
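A minimal sketch of the early-stopping variant on load_breast_cancer, assuming the usual eval_set / early_stopping_rounds pattern (the split size and parameter values are illustrative):
data = load_breast_cancer()
X_tr, X_val, y_tr, y_val = train_test_split(data.data, data.target,
                                            test_size=0.2, random_state=27)
clf = XGBClassifier(learning_rate=0.1, n_estimators=1000)
clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], eval_metric='auc',
        early_stopping_rounds=50, verbose=False)
print(clf.best_iteration)  # boosting round with the best validation AUC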