Overfitting is a common problem when training complex non-linear learning algorithms such as gradient boosting; we have already covered overfitting in detail in an earlier post.
This post shows how to avoid overfitting in XGBoost by using early stopping.
Dataset used in this project:
Pima Indians Diabetes Data Set
The dataset consists of medical records of Pima Indians, together with whether each patient developed diabetes within five years. All values are numeric and the target is binary (1 = diabetes, 0 = no diabetes), so this is a binary classification problem. There are 8 attributes and 2 classes (0/1):
【1】Pregnancies: number of times pregnant
【2】Glucose: plasma glucose concentration
【3】BloodPressure: blood pressure (mm Hg)
【4】SkinThickness: skin fold thickness (mm)
【5】Insulin: 2-hour serum insulin (mu U/ml)
【6】BMI: body mass index (weight in kg / height in m squared)
【7】DiabetesPedigreeFunction: diabetes pedigree function
【8】Age: age (years)
Class:
【9】Outcome: class label (0 or 1)
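Before modeling, it can help to take a quick look at the data. The snippet below is a minimal inspection sketch, assuming the file is the usual headerless CSV named pima-indians-diabetes.csv (the same file the loadtxt calls below expect) and using the column names listed above.
# quick data inspection (a minimal sketch, assuming a headerless CSV with the 9 columns above)
import pandas as pd
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
df = pd.read_csv('pima-indians-diabetes.csv', header=None, names=columns)
print(df.shape)                      # the standard dataset has 768 rows and 9 columns
print(df['Outcome'].value_counts())  # class balance of the 0/1 label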
The code below trains the model on 67% of the dataset and evaluates it on the remaining 33% at every boosting round. The classification error is printed for each round, and the final classification accuracy is printed at the end.
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 17 10:28:59 2019
@author: ZQQ
"""
# monitor training performance
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on training data
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric="error", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Output:
Looking at the full output, we can see that towards the end of training the model's performance on the test set flattens out and may even start to get worse.
Visualizing the training process with learning curves:
We can retrieve the model's performance on the evaluation sets and plot it as learning curves, which gives a much clearer picture of how learning evolved over the whole training run.
Python 3 code:
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 17 10:45:21 2019
@author: ZQQ
"""
# plot learning curve
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on training data
model = XGBClassifier()
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# retrieve performance metrics
results = model.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
# plot log loss
fig1, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()
# plot classification error
fig2, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
pyplot.show()
Output:
The first plot shows the log loss of the model on both datasets at each boosting round;
the second plot shows the classification error rate.
From the first plot, the test log loss stops improving after roughly 20 rounds and then starts to rise, which suggests there is an opportunity for early stopping somewhere around rounds 20 to 40.
The second plot tells a similar story: performance looks best at around round 40, after which the error starts to climb again.
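One simple way to act on what the learning curves show (a minimal sketch, not part of the original listing; the value 40 is an assumption read off the plots) is to cap the number of boosting rounds near the observed inflection point:
# fix the number of boosting rounds near the inflection point seen in the curves
model = XGBClassifier(n_estimators=40)
model.fit(X_train, y_train)  # reuses the train/test split from the listing above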
Early stopping in XGBoost
XGBoost supports stopping training early once a specified number of rounds has passed without improvement.
In addition to specifying an evaluation metric and an evaluation dataset for each round, you must specify a window size: the number of consecutive rounds over which no improvement is observed. This is set with the early_stopping_rounds parameter.
Python 3 code:
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 17 11:13:15 2019
@author: ZQQ
"""
# early stopping
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Output:
We can see that the model stopped training at round 42, and that the best performance on the hold-out set was observed at round 32.
It usually works well to set early_stopping_rounds as a function of the total number of training rounds (10% in this example), or to inspect the learning curves and pick a value large enough that training covers the inflection point; both approaches are reasonable.
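If you want to reuse the best round that early stopping found, the fitted model exposes it. The exact attribute and argument names depend on your xgboost version (recent releases expose best_iteration and accept iteration_range in predict; older ones use best_ntree_limit / ntree_limit), so treat the sketch below as version-dependent:
# the "10% of total rounds" rule of thumb for the early-stopping window
n_estimators = 100  # XGBClassifier default
early_stopping_rounds = max(1, n_estimators // 10)
# read back the best round found by early stopping and predict with it
print("best score:", model.best_score)
print("best iteration:", model.best_iteration)
y_pred_best = model.predict(X_test, iteration_range=(0, model.best_iteration + 1))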
The code, together with the dataset, will be published on GitHub later. (The dataset is public, so you can also download it yourself.)
References:
https://www.cnblogs.com/xxtalhr/p/10859517.html
https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/
https://coolboygym.github.io/2018/12/15/early-stop-in-xgboost/
This post is for personal study and sharing only; if anything here infringes your rights, leave a comment and it will be removed immediately.
Respect other people's intellectual property; don't just copy and paste!
If you find this useful, feel free to follow me;
your follows are what keep me writing.