Overfitting is a common problem when training complex non-linear learning algorithms such as gradient boosting; we have already covered overfitting in detail in an earlier post.
This post shows how to avoid overfitting in XGBoost by using early stopping.
Dataset used in this project:
Pima Indians Diabetes Data Set
The dataset consists of medical records of Pima Indians, together with whether each patient developed diabetes within five years. All values are numeric and the target is binary (1 = diabetes, 0 = no diabetes), so this is a binary classification problem. There are 8 attributes and 2 classes (0/1):
【1】Pregnancies: number of times pregnant
【2】Glucose: plasma glucose concentration
【3】BloodPressure: blood pressure (mm Hg)
【4】SkinThickness: skin fold thickness (mm)
【5】Insulin: 2-hour serum insulin (mu U/ml)
【6】BMI: body mass index (weight in kg / height in m squared)
【7】DiabetesPedigreeFunction: diabetes pedigree function
【8】Age: age (years)
Class:
【9】Outcome: class label (0 or 1)
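Before modeling, it can help to take a quick look at the data. The snippet below is a minimal inspection sketch, assuming the file is the usual headerless CSV named pima-indians-diabetes.csv (the same file the loadtxt calls below expect) and using the column names listed above.
# quick data inspection (a minimal sketch, assuming a headerless CSV with the 9 columns above)
import pandas as pd
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv('pima-indians-diabetes.csv', header=None, names=columns)
print(df.shape)                      # the standard dataset has 768 rows and 9 columns
print(df['Outcome'].value_counts())  # class balance of the 0/1 label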
The code below trains the model on 67% of the dataset and evaluates it on the remaining 33% at every boosting round. The classification error is printed for each round, and the final classification accuracy is printed at the end.
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 17 10:28:59 2019
@author: ZQQ
"""
# monitor training performance
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on training data
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric="error", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Output:
Looking at the full output, we can see that towards the end of training the model's performance on the test set flattens out and may even start to get worse.
Visualizing the training process with learning curves:
We can retrieve the model's performance on the evaluation sets and plot it as learning curves, which gives a much clearer picture of how learning evolved over the whole training run.
Python 3 code:
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 17 10:45:21 2019
@author: ZQQ
"""
# plot learning curve
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on training data
model = XGBClassifier()
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# retrieve performance metrics
results = model.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
# plot log loss
fig1, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()
# plot classification error
fig2, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
pyplot.show()
Output:
The first plot shows the log loss of the model on both datasets at each boosting round;
the second plot shows the classification error rate.
From the first plot, the test log loss stops improving after roughly 20 rounds and then starts to rise, which suggests there is an opportunity for early stopping somewhere around rounds 20 to 40.
The second plot tells a similar story: performance looks best at around round 40, after which the error starts to climb again.
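One simple way to act on what the learning curves show (a minimal sketch, not part of the original listing; the value 40 is an assumption read off the plots) is to cap the number of boosting rounds near the observed inflection point:
# fix the number of boosting rounds near the inflection point seen in the curves
model = XGBClassifier(n_estimators=40)
model.fit(X_train, y_train)  # reuses the train/test split from the listing above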
Early stopping in XGBoost
XGBoost supports stopping training early once a specified number of rounds has passed without improvement.
In addition to specifying an evaluation metric and an evaluation dataset for each round, you must specify a window size: the number of consecutive rounds over which no improvement is observed. This is set with the early_stopping_rounds parameter.
Python 3 code:
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 17 11:13:15 2019
@author: ZQQ
"""
# early stopping
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Output:
We can see that the model stopped training at round 42, and that the best performance on the hold-out set was observed at round 32.
It usually works well to set early_stopping_rounds as a function of the total number of training rounds (10% in this example), or to inspect the learning curves and pick a value large enough that training covers the inflection point; both approaches are reasonable.
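If you want to reuse the best round that early stopping found, the fitted model exposes it. The exact attribute and argument names depend on your xgboost version (recent releases expose best_iteration and accept iteration_range in predict; older ones use best_ntree_limit / ntree_limit), so treat the sketch below as version-dependent:
# the "10% of total rounds" rule of thumb for the early-stopping window
n_estimators = 100  # XGBClassifier default
early_stopping_rounds = max(1, n_estimators // 10)
# read back the best round found by early stopping and predict with it
print("best score:", model.best_score)
print("best iteration:", model.best_iteration)
y_pred_best = model.predict(X_test, iteration_range=(0, model.best_iteration + 1))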
The code, together with the dataset, will be published on GitHub later. (The dataset is public, so you can also download it yourself.)
References:
https://www.cnblogs.com/xxtalhr/p/10859517.html
https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/
https://coolboygym.github.io/2018/12/15/early-stop-in-xgboost/
This post is for personal study and sharing only; if anything here infringes your rights, leave a comment and it will be removed immediately.
Respect other people's intellectual property; don't just copy and paste!
If you find this useful, feel free to follow me;
your follows are what keep me writing.