Kaggle Study Notes: XGBoost

Latest recommended article published 2020-05-04 22:56:32

weixin_44398470

Tags: python, machine learning

Original link: https://www.kaggle.com/alexisbcook/xgboost


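This article introduces the concept and advantages of XGBoost and its practical use while working through the Kaggle course. By adjusting model parameters step by step, such as n_estimators and learning_rate, it explores how to tune an XGBoost model to reduce the mean absolute error (MAE) of its predictions, and it highlights the role of features such as regularization and parallel processing in improving model performance.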


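Original course: https://www.kaggle.com/alexisbcook/xgboost

Introduction

What is XGBoost

XGBoost is a boosting algorithm. The idea behind boosting is to combine many weak learners into one strong learner. Because XGBoost is a boosted-tree model, it combines many tree models into a single strong model, and the trees it uses are CART regression trees. XGBoost improves on GBDT, making it more powerful and applicable to a wider range of problems.
[For a detailed introduction, see https://www.cnblogs.com/wj-1314/p/9402324.html]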

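Gradient boosting (XGBoost) iteratively adds models to an ensemble

1. Start by initializing the ensemble with a single model, whose predictions may be quite naive. (Even if they are wildly inaccurate, later additions to the ensemble will correct those errors.)
2. Then start the cycle:
   First, use the current ensemble to generate a prediction for each observation in the dataset. To make a prediction, add up the predictions of all models in the ensemble. These predictions are used to compute a loss function (for example, mean squared error).
   Then, use the loss function to fit a new model that will be added to the ensemble. Specifically, the model parameters are chosen so that adding the new model to the ensemble reduces the loss. (Side note: the "gradient" in "gradient boosting" refers to using gradient descent on the loss function to determine the parameters of this new model.)
3. Finally, add the new model to the ensemble and repeat!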
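The cycle above can be sketched in a few lines. This is a minimal illustration of plain gradient boosting for squared-error loss, not XGBoost's actual implementation (which also uses second-order information and regularization). The class name ToyGradientBooster, its parameters, and the synthetic sine data are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class ToyGradientBooster:
    """Hypothetical teaching class: gradient boosting for squared-error loss."""

    def __init__(self, n_rounds=50, learning_rate=0.1, max_depth=2):
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        # Step 1: start from a single naive model, here the mean of the target.
        self.base_ = float(np.mean(y))
        self.trees_ = []
        current = np.full(len(y), self.base_)
        for _ in range(self.n_rounds):
            # Step 2a: `current` holds the summed predictions of the ensemble so far.
            # Step 2b: for the loss 0.5 * (y - pred)^2, the negative gradient
            # with respect to pred is the residual y - pred.
            residuals = y - current
            tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=0)
            tree.fit(X, residuals)
            # Step 3: add the new model to the ensemble and repeat.
            self.trees_.append(tree)
            current = current + self.learning_rate * tree.predict(X)
        return self

    def predict(self, X):
        # The ensemble prediction is the base value plus the scaled tree outputs.
        pred = np.full(X.shape[0], self.base_)
        for tree in self.trees_:
            pred = pred + self.learning_rate * tree.predict(X)
        return pred


# Synthetic data (an assumption for this sketch): a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

booster = ToyGradientBooster().fit(X_demo, y_demo)
mse_initial = np.mean((y_demo - booster.base_) ** 2)
mse_final = np.mean((y_demo - booster.predict(X_demo)) ** 2)
print("training MSE, naive model:", mse_initial)
print("training MSE, after boosting:", mse_final)
```

Each tree is fitted to the residuals of the current ensemble, so the training error shrinks as rounds are added; the learning rate scales how much each new tree contributes.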

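Advantages of XGBoost

1. Regularization
   XGBoost adds a regularization term to its cost function to control model complexity. The term consists of the number of leaf nodes in a tree and the sum of the squared scores output at the leaves (the squared L2 norm of the leaf scores). From a bias-variance tradeoff perspective, the regularization term lowers the variance of the model, so the learned model is simpler and less prone to overfitting. This is one way XGBoost improves on traditional GBDT.
2. Parallel processing
   XGBoost supports parallelism. Isn't boosting a sequential procedure? How can it run in parallel? Note that XGBoost does not parallelize at the tree level: it must finish one iteration before starting the next, because the cost function at iteration t contains the predictions of the previous t-1 iterations. Its parallelism is at the feature level.
   The most time-consuming step in learning a decision tree is sorting the feature values (needed to find the best split point). XGBoost sorts the data once before training and stores it in a block structure, which is reused in later iterations and greatly reduces computation. The block structure also makes parallelism possible: when splitting a node, the gain of each feature must be computed and the feature with the largest gain is chosen for the split, so the gain computations for different features can run in multiple threads.
3. Flexibility
   XGBoost supports user-defined objective functions and evaluation metrics; the objective function only needs to be twice differentiable.
4. Missing-value handling
   For samples with missing feature values, XGBoost automatically learns which direction to send them at each split.
5. Pruning
   XGBoost first grows every subtree it can from the top down, then prunes backward from the bottom up. Compared with GBM, this makes it less likely to get stuck in a local optimum.
6. Built-in cross-validation
   XGBoost allows cross-validation to be run at each boosting iteration, so the optimal number of boosting iterations is easy to obtain. GBM relies on grid search, which can only test a limited number of values.
[Source: https://blog.csdn.net/luanpeng825485697/article/details/79907149.]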
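The regularization described in point 1 is usually written as follows. This is the standard form of the XGBoost objective, given here as a reading aid; the symbols are not defined in the notes above, so treat the notation as an assumption: l is the training loss, f_k is the k-th tree, T is the number of leaves in a tree, w_j is the score of leaf j, and gamma and lambda are the regularization coefficients (exposed in the scikit-learn wrapper as the gamma and reg_lambda parameters).

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

A larger gamma penalizes trees with many leaves, and a larger lambda shrinks the leaf scores toward zero; both push the model toward simpler trees.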

Loading the data

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('.../train.csv', index_col='Id')
X_test_full = pd.read_csv('.../test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
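The one-hot encoding and alignment at the end of the block above can be easier to follow on a tiny example. The sketch below uses two made-up frames (the column names and values are illustrative assumptions) in which the validation data lacks a category that appears in the training data.

```python
import pandas as pd

# Toy stand-ins for the training and validation frames (illustrative values).
train = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                      "Street": ["Pave", "Grvl", "Pave"]})
valid = pd.DataFrame({"LotArea": [9550, 14260],
                      "Street": ["Pave", "Pave"]})

# One-hot encode each frame separately; valid never sees the "Grvl" category.
train_enc = pd.get_dummies(train)
valid_enc = pd.get_dummies(valid)

# join='left' keeps exactly the columns of the left (training) frame.
# Columns missing from the right frame are created and filled with NaN.
train_enc, valid_enc = train_enc.align(valid_enc, join="left", axis=1)

print(list(train_enc.columns))
print(list(valid_enc.columns))
print(valid_enc["Street_Grvl"].isna().all())
```

After alignment both frames have the same columns in the same order, which the model requires at prediction time. The NaN entries that alignment introduces are treated by XGBoost as missing values.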

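Step 1: Build an XGBoost model

Here we import the scikit-learn API for XGBoost (xgboost.XGBRegressor). This lets us build and fit the model just as we would in scikit-learn. The XGBRegressor class has many tunable parameters.

The steps are:
1. Set my_model_1 to an XGBoost model, using the XGBRegressor class with the random seed set to 0 (random_state=0). Leave all other parameters at their default values.
2. Fit the model to the training data in X_train and y_train.
3. Set predictions_1 to the model's predictions for the validation data. The validation features are stored in X_valid.
4. Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) of the predictions on the validation set. The labels for the validation data are stored in y_valid.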

from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Define the model
my_model_1 = XGBRegressor(random_state=0)

# Fit the model
my_model_1.fit(X_train, y_train)

# Get predictions
predictions_1 = my_model_1.predict(X_valid)

# Calculate MAE
mae_1 = mean_absolute_error(y_valid, predictions_1)

print("Mean Absolute Error:", mae_1)

Mean Absolute Error: 16803.434690710616
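For reference, the MAE is the average of the absolute differences between the true and predicted values, MAE = (1/n) * sum(|y_i - yhat_i|). The small sketch below checks this by hand against mean_absolute_error; the three prices are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up sale prices and predictions (illustrative values only).
y_true = np.array([200000.0, 150000.0, 310000.0])
y_pred = np.array([190000.0, 165000.0, 305000.0])

# Absolute errors are 10000, 15000 and 5000, so their mean is 10000.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)  # both are 10000.0
```

Read this way, the result above means the default model's predictions are off by about 16,800 in SalePrice units on average.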

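Step 2: Improve the model (1), getting a lower MAE

1. Set my_model_2 to an XGBoost model, for example by setting the n_estimators and learning_rate parameters to get better results.
2. Fit the model to the training data in X_train and y_train.
3. Set predictions_2 to the model's predictions for the validation data. The validation features are stored in X_valid.
4. Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) of the predictions on the validation set. The labels for the validation data are stored in y_valid.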

# Define the model
my_model_2 = XGBRegressor(learning_rate=0.1, n_estimators=450, random_state=0)

# Fit the model
my_model_2.fit(X_train, y_train)

# Get predictions
predictions_2 = my_model_2.predict(X_valid)

# Calculate MAE
mae_2 = mean_absolute_error(y_valid, predictions_2)

print("Mean Absolute Error:", mae_2)

Mean Absolute Error: 15875.706670055652

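Step 3: Modify the model (2), getting a higher MAE

1. Set my_model_3 to an XGBoost model, changing the n_estimators and learning_rate parameters to get worse results.
2. Fit the model to the training data in X_train and y_train.
3. Set predictions_3 to the model's predictions for the validation data. The validation features are stored in X_valid.
4. Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) of the predictions on the validation set. The labels for the validation data are stored in y_valid.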

# Define the model
my_model_3 = XGBRegressor(n_estimators=10, learning_rate=0.5)

# Fit the model
my_model_3.fit(X_train, y_train)

# Get predictions
predictions_3 = my_model_3.predict(X_valid)

# Calculate MAE
mae_3 = mean_absolute_error(y_valid, predictions_3)

print("Mean Absolute Error:", mae_3)

Mean Absolute Error: 21031.549991973458

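Summary

Changing the n_estimators and learning_rate arguments of XGBRegressor produced better or worse results. Across the trials here, the validation MAE followed this pattern:
A larger n_estimators with a smaller learning_rate (450 and 0.1) gave a better model: MAE of about 15876, against about 16803 with the default settings.
A smaller n_estimators with a larger learning_rate (10 and 0.5) gave a worse model: MAE of about 21032.
This is an observation on this dataset rather than a general rule. Setting n_estimators too high can cause overfitting, so both parameters should be chosen by checking the error on validation data.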

[Link]: XGBRegressor parameter tuning
