XGBoost Optimization: What Is XGBoost and How to Optimize It

This article introduces XGBoost, an efficient gradient boosting framework widely used in machine learning tasks. It explains how to optimize an XGBoost model by tuning its parameters, including the regularization parameters, the learning rate, and the number and depth of the trees, with the aim of improving predictive performance and preventing overfitting.

Introduction

Like many data scientists, I now count XGBoost as part of my toolkit. This algorithm is among the most popular in the world of data science (in real-world projects and in competitions). Its versatility allows it to be used in regression and classification projects, and it can be applied to tabular, structured, and unstructured data.

A notebook containing the code is available on GitHub. It is intended for document (text) classification.

XGBoost

XGBoost, or eXtreme Gradient Boosting, is a tree-based algorithm (Chen and Guestrin, 2016 [2]). It belongs to the family of tree methods (decision trees, random forests, bagging, boosting, gradient boosting).

Boosting is an ensemble method whose primary objective is to reduce bias and variance. The idea is to create weak trees sequentially so that each new tree (or learner) focuses on the weakness (misclassified data) of the previous ones. After a weak learner is added, the data weights are readjusted, a step known as "re-weighting". Thanks to this self-correction after every new learner, the ensemble converges to a strong model.

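Since the mechanism is described only in prose here, below is a toy sketch of the re-weighting idea behind boosting. It uses an AdaBoost-style update with decision stumps, not XGBoost's gradient-based variant, and the numbers are purely illustrative:

```python
# A toy sketch of boosting by re-weighting (AdaBoost-style): each new
# stump trains on weights that emphasize the points the previous
# learners got wrong.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
weights = np.full(len(y), 1 / len(y))        # start with uniform weights

learners, alphas = [], []
for _ in range(10):                          # 10 sequential weak learners
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    miss = stump.predict(X) != y
    err = weights[miss].sum()
    alpha = 0.5 * np.log((1 - err) / err)    # learner's vote weight
    weights *= np.exp(np.where(miss, alpha, -alpha))  # up-weight mistakes
    weights /= weights.sum()                 # re-normalize ("re-weighting")
    learners.append(stump)
    alphas.append(alpha)

# The weak learners together form a strong model (weighted vote).
votes = sum(a * (2 * m.predict(X) - 1) for a, m in zip(alphas, learners))
print("ensemble accuracy:", ((votes > 0).astype(int) == y).mean())
```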

The strength of XGBoost lies in parallelism and hardware optimization. The data is stored in in-memory units called blocks, in the compressed sparse column (CSC) format. The algorithm can perform tree pruning in order to remove branches that are unlikely to improve the model. The loss function includes a term that penalizes the complexity of the model with regularization, smoothing the learning process and decreasing the possibility of overfitting.

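For reference, the regularized objective that this penalty term belongs to, from Chen and Guestrin (2016), is

$$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2,$$

where $T$ is the number of leaves of a tree, $w$ its leaf weights, and $\gamma$, $\lambda$ the regularization strengths.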

Thanks to sparsity awareness, the model performs well even with missing values or lots of zero values. XGBoost also uses an algorithm called the "weighted quantile sketch", which allows it to focus on data that are misclassified: the goal of each new learner is to learn how to classify the wrong data after each iteration. The method sorts the data by quantiles in order to find the right splitting points. This is the role of the ϵ parameter, which controls the granularity of the candidate splits, giving roughly 1/ϵ candidates (ϵ = 0.1 yields the quantiles [10%, 20%, …, 90%]).

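As a minimal illustration of quantile-based split candidates (the actual weighted quantile sketch also accounts for per-instance weights, which this sketch omits):

```python
# Candidate split points from quantiles: with eps = 0.1 we get roughly
# 1/eps candidates, at the 10%, 20%, ..., 90% quantiles of the feature.
import numpy as np

feature = np.random.default_rng(0).lognormal(size=1_000)

eps = 0.1
quantiles = np.arange(eps, 1.0, eps)        # [0.1, 0.2, ..., 0.9]
candidates = np.quantile(feature, quantiles)
print(candidates)                           # candidate split points
```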

The number of iterations for the boosting process can be determined automatically by the algorithm with an integrated cross-validation method.

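A sketch of this built-in cross-validation using xgboost's native API with early stopping (the dataset and parameter values here are illustrative):

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2_000, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eval_metric": "logloss"}

# xgb.cv keeps adding trees until the CV metric stops improving.
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=1000,          # upper bound on boosting iterations
    nfold=3,
    early_stopping_rounds=20,      # stop after 20 rounds with no gain
    seed=0,
)
print("rounds kept:", len(cv_results))
```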

The authors provide a table with a comparison of different tree algorithms:

[Table 1 from Chen and Guestrin, 2016, p. 7: a comparison of major tree-boosting systems.]

Optimizations

XGBoost has hyperparameters (parameters that are not learned by the estimator) that must be determined with parameter optimization. The process is simple: each parameter to be estimated is given a list of candidate values, every combination is then tested by the model, and the resulting metrics are compared to deduce the best combination. The search for parameters needs to be guided by metrics computed with cross-validation. Don't be afraid: sklearn has two tools to do that for you, RandomizedSearchCV and GridSearchCV.

Grid search

Grid search is a tuning technique that attempts to compute the optimal values for hyperparameters. It performs an exhaustive search through a manually specified subset of the hyperparameter space of the algorithm.

Random search

This second method is considered more powerful (Bergstra and Bengio, 2012 [1]) because the parameters are drawn at random. Random search can be used in continuous and mixed spaces and is very efficient when only a small number of hyperparameters affect the final performance.

How to use these methods?

Notebook

Random search

Let's begin with a random search. The code below shows how to use RandomizedSearchCV(). The hyperparameters to be estimated are stored in a dictionary. XGBoost can also take other settings into account during training, such as early stopping and a validation set; you can configure them with another dictionary passed to the fit() method.

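The original gist is not embedded in this page, so here is a sketch of what such a setup might look like, assuming the scikit-learn wrapper xgboost.XGBClassifier on a toy dataset; the parameter ranges are illustrative, and the exact fit-time names for early stopping vary across xgboost versions:

```python
import xgboost as xgb
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Hyperparameters to estimate, stored in a dictionary of distributions.
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.5, 0.5),          # draws from [0.5, 1.0]
    "colsample_bytree": uniform(0.5, 0.5),
}

# Extra training settings (validation set, early stopping) go in a second
# dictionary passed to fit(); recent xgboost versions expect
# early_stopping_rounds on the estimator instead of in fit().
fit_params = {
    "eval_set": [(X_val, y_val)],
    "early_stopping_rounds": 20,
    "verbose": False,
}

search = RandomizedSearchCV(
    xgb.XGBClassifier(objective="binary:logistic"),
    param_distributions,
    n_iter=1000,        # 1,000 combinations x 3 folds = 3,000 runs
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train, **fit_params)
print(search.best_params_)
```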

In the next snippet, I use the best parameters obtained with the random search (contained in the variable best_params_) to initialize the dictionary for the grid search.

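Continuing the sketch above, one plausible way to build that dictionary around the random-search winner (the exact grid in the notebook may differ):

```python
# Center a small grid on the best random-search combination
# (search.best_params_ from the previous step); ranges are illustrative.
best = search.best_params_

param_grid = {
    "n_estimators": [best["n_estimators"]],
    "max_depth": [best["max_depth"] - 1, best["max_depth"], best["max_depth"] + 1],
    "learning_rate": [best["learning_rate"] * f for f in (0.5, 1.0, 2.0)],
    "subsample": [best["subsample"]],
    "colsample_bytree": [best["colsample_bytree"]],
}
```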

Grid search

In the following code, you will find the configuration for the grid search. As previously shown, the cross-validation is fixed at 3 folds.

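A matching sketch of that configuration, reusing the grid and fit parameters defined above:

```python
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    xgb.XGBClassifier(objective="binary:logistic"),
    param_grid,          # grid built from the random-search results
    cv=3,                # cross-validation fixed at 3 folds
    scoring="accuracy",
)
grid_search.fit(X_train, y_train, **fit_params)
print(grid_search.best_params_, grid_search.best_score_)
```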

Conclusion

You have reached the end of this (small) tutorial on hyperparameter optimization for XGBoost. But these methods can be used with any machine learning algorithm. Keep in mind that they are expensive in terms of computation (the first code above generates 3,000 runs).

You are now ready to use these methods in your work or competitions. Enjoy!

Translated from: https://towardsdatascience.com/what-is-xgboost-and-how-to-optimize-it-d3c24e0e41b4
