XGBoost Early Stopping to Avoid Overfitting (early_stopping_rounds)

From: http://blog.csdn.net/lujiandong1/article/details/52777168

Like any other model, XGBoost overfits when the number of boosting rounds is too large. At first the test error decreases as the rounds go on; once overfitting (over-training) sets in, the test error starts to rise, or simply fluctuates. The following experiments illustrate this behaviour:

The dataset used in the experiments below: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

```python
# monitor training performance
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit the model on the training data, evaluating on the test set at every boosting round
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric="error", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```
Next, look at how the test error evolves at each boosting round:


Analysis: once the number of rounds is large enough, the test error essentially stops decreasing; it fluctuates around a fixed level and can even creep back up. This indicates that the model has entered the over-training phase.

==============================================================================================================================

Next, we plot the training loss and the test loss to visualize the overfitting:

```python
# plot learning curves
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit the model, evaluating on both the training set and the test set at every round
model = XGBClassifier()
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# retrieve performance metrics
results = model.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
# plot log loss
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()
# plot classification error
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
pyplot.show()
```
A few notes on this code:

```python
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)
```
Note: at every boosting round, both the training set and the test set are evaluated, and the evaluation metrics are "error" and "logloss".

```python
# retrieve performance metrics
results = model.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
# plot log loss
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()
```
Note: the per-round evaluation results can be retrieved with evals_result(); results['validation_0'] holds the results on the training set, and results['validation_1'] holds the results on the test set.
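For reference, a minimal sketch of inspecting the structure returned by evals_result() directly; the key names simply follow the order of the entries in eval_set above:

```python
results = model.evals_result()
# results has one key per entry in eval_set, e.g.
# {'validation_0': {'error': [...], 'logloss': [...]},   # (X_train, y_train)
#  'validation_1': {'error': [...], 'logloss': [...]}}   # (X_test, y_test)
print(results.keys())
print("final test logloss: %.4f" % results['validation_1']['logloss'][-1])
```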

The training-set and test-set error curves are plotted below:

The logloss plot makes it clear that once nround exceeds about 40, the test-set loss starts to rise again: the model has entered the overfitting regime.

XGBoost can counter overfitting caused by too many boosting rounds through the early_stopping_rounds parameter.

```python
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
```
Note: with early_stopping_rounds=10, training stops as soon as the logloss on the evaluation set has not improved for 10 consecutive rounds. If eval_metric lists several metrics, the last one in the list is the one used for early stopping.
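For example, combining early stopping with both metrics; a minimal sketch reusing the variables defined above, where stopping is judged on "logloss" because it is listed last:

```python
model.fit(X_train, y_train,
          early_stopping_rounds=10,
          eval_metric=["error", "logloss"],  # the last metric, "logloss", drives early stopping
          eval_set=[(X_test, y_test)],
          verbose=True)
```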

```python
# early stopping
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit the model with early stopping on the test-set logloss
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```


Note: in this run training stops at nround = 42, which means the best logloss was reached at nround = 32. As a rule of thumb, set early_stopping_rounds to roughly 10% of the total number of boosting rounds.
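As a rough illustration of that rule of thumb (the n_estimators value below is an assumption for the sketch, not from the original experiment):

```python
n_estimators = 100                                   # planned total number of boosting rounds (assumed)
early_stopping_rounds = max(1, n_estimators // 10)   # ~10% of the total, per the rule of thumb

model = XGBClassifier(n_estimators=n_estimators)
model.fit(X_train, y_train,
          early_stopping_rounds=early_stopping_rounds,
          eval_metric="logloss",
          eval_set=[(X_test, y_test)],
          verbose=True)
```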

When early_stopping_rounds is used, the best number of rounds can be read from the best_iteration attribute, and prediction can then be limited to the trees built up to that point.

```python
print(model.best_iteration)
# best_iteration is the 0-based index of the best round, so add 1 to include that round's tree
limit = model.best_iteration + 1
y_pred = model.predict(X_test, ntree_limit=limit)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```
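In newer xgboost releases (roughly 1.4 and later) ntree_limit is deprecated in favour of iteration_range; a minimal sketch of the equivalent call, assuming such a version:

```python
# use trees from round 0 up to and including the best round
y_pred = model.predict(X_test, iteration_range=(0, model.best_iteration + 1))
```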
