Read the Paper with Me: XGBoost

A sentence-by-sentence walkthrough of the paper.

Theory

TITLE

XGBoost: A Scalable Tree Boosting System

AUTHORS

  • Tianqi Chen 陈天奇
  • Carlos Guestrin

ABSTRACT

In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges.

More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system.

By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

Keywords: Large-scale Machine Learning

INTRODUCTION

There are two important factors that drive these successful applications: usage of effective (statistical) models that capture the complex data dependencies and scalable learning systems that learn the model of interest from large datasets.

Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks.

LambdaMART, a variant of tree boosting for ranking, achieves state-of-the-art result for ranking problems. Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click through rate prediction.

In this paper, we describe XGBoost, a scalable machine learning system for tree boosting. The system is available as an open source package. The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the machine learning competition site Kaggle for example. Among the 29 challenge winning solutions published at Kaggle's blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10. Moreover, the winning teams reported that ensemble methods outperform a well-configured XGBoost by only a small amount.

This paragraph is simply saying that XGBoost is very popular in Kaggle competitions.

These results demonstrate that our system gives state-of-the-art results on a wide range of problems. Examples of the problems in these winning solutions include: store sales prediction; high energy physics event classification; web text classification; customer behavior prediction; motion detection; ad click through rate prediction; malware classification; product categorization; hazard risk prediction; massive online course dropout rate prediction. While domain dependent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consensus choice of learner shows the impact and importance of our system and tree boosting.

The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include: a novel tree learning algorithm is for handling sparse data; a theoretically justified weighted quantile sketch procedure enables handling instance weights in approximate tree learning. Parallel and distributed computing makes learning faster which enables quicker model exploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundred millions of examples on a desktop. Finally, it is even more exciting to combine these techniques to make an end-to-end system that scales to even larger data with the least amount of cluster resources.

The most important factor behind XGBoost's success is its scalability in all scenarios: it runs more than ten times faster than existing popular solutions on a single machine, and scales to billions of examples in distributed or memory-limited settings. This scalability comes from several important systems and algorithmic optimizations:

  • a novel tree learning algorithm for handling sparse data;
  • a theoretically justified weighted quantile sketch procedure that can handle instance weights in approximate tree learning;
  • parallel and distributed computing, which makes learning faster and enables quicker model exploration;
  • more importantly, out-of-core computation, which lets data scientists process hundreds of millions of examples on a desktop.

Finally, and even more exciting, combining these techniques yields an end-to-end system that scales to even larger data with the least amount of cluster resources.

The major contributions of this paper is listed as follows:

  • We design and build a highly scalable end-to-end tree boosting system.
  • We propose a theoretically justified weighted quantile sketch for efficient proposal calculation.
  • We introduce a novel sparsity-aware algorithm for parallel tree learning.
  • We propose an effective cache-aware block structure for out-of-core tree learning.

TREE BOOSTING IN A NUTSHELL

Regularized Learning Objective
  • Model formula

Given a dataset $D = \{(x_i, y_i)\}$ with $n$ examples and $m$ features, the tree ensemble model makes predictions of the form:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where $F = \{f(x) = \omega_{q(x)}\}\ (q: \mathbb{R}^m \rightarrow T,\ \omega \in \mathbb{R}^T)$ is the space of regression trees (CART).

Here $q$ represents the structure of each tree and maps an example to the corresponding leaf index,

and each $f_k$ corresponds to an independent tree structure $q$ together with leaf weights $\omega$.
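
To make the notation concrete, here is a minimal Python sketch (illustrative names only, not the actual XGBoost implementation): a tree is a structure $q$ that maps an example to a leaf index plus a weight vector $\omega$ with one entry per leaf, and the ensemble prediction sums the selected leaf weights over all $K$ trees.

```python
import numpy as np

class Tree:
    """A toy regression tree: q maps an example to a leaf index,
    and w holds one weight per leaf (omega in R^T)."""
    def __init__(self, q, w):
        self.q = q               # q: R^m -> {0, ..., T-1}, the tree structure
        self.w = np.asarray(w)   # leaf weights omega

    def __call__(self, x):
        return self.w[self.q(x)]  # f(x) = w_{q(x)}

def ensemble_predict(trees, x):
    """y_hat = phi(x) = sum_k f_k(x): add up the leaf weights of all K trees."""
    return sum(f(x) for f in trees)

# Example: two depth-1 trees ("stumps") splitting on feature 0
t1 = Tree(q=lambda x: 0 if x[0] < 0.5 else 1, w=[-0.4, 0.7])
t2 = Tree(q=lambda x: 0 if x[0] < 0.2 else 1, w=[-0.1, 0.3])
print(ensemble_predict([t1, t2], np.array([0.8, 1.0])))  # 0.7 + 0.3 = 1.0
```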

  • Objective function

$$L(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \|\omega\|^2$$

where $\Omega(f)$ is the regularization term, which helps prevent overfitting.
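
Using the same toy representation as above, a minimal sketch of this regularized objective, assuming a squared-error loss $l(\hat{y}, y) = (\hat{y} - y)^2$ (the loss choice and the gamma/lam defaults are illustrative assumptions, not the paper's settings):

```python
def omega(tree, gamma, lam):
    """Omega(f) = gamma*T + 0.5*lam*||w||^2: penalize leaf count and leaf weights."""
    T = len(tree.w)
    return gamma * T + 0.5 * lam * np.sum(tree.w ** 2)

def objective(trees, X, y, gamma=1.0, lam=1.0):
    """L(phi) = sum_i l(y_hat_i, y_i) + sum_k Omega(f_k), with squared-error loss."""
    preds = np.array([ensemble_predict(trees, x) for x in X])
    loss = np.sum((preds - y) ** 2)
    penalty = sum(omega(f, gamma, lam) for f in trees)
    return loss + penalty
```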

Gradient Tree Boosting

The tree ensemble model in Eq. (2) includes functions as parameters and cannot be optimized using traditional optimization methods in Euclidean space.

Let $\hat{y}_i^{(t)}$ denote the prediction for the $i$-th example at the $t$-th iteration; the goal then becomes finding the $f_t$ that minimizes:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

That is, each step greedily searches for the currently optimal $f_t$.

Taking a second-order Taylor expansion of this objective around $\hat{y}_i^{(t-1)}$ gives:

$$L^{(t)} \approx \sum_{i=1}^{n}\left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are the first- and second-order gradient statistics of the loss.
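
As a small worked example (not the paper's derivation itself, and building on the illustrative sketches above), for squared-error loss the two statistics are $g_i = 2(\hat{y}_i^{(t-1)} - y_i)$ and $h_i = 2$, so the second-order approximation of the step-$t$ objective can be evaluated directly:

```python
def second_order_objective(trees, f_t, X, y, gamma=1.0, lam=1.0):
    """L^(t) ~ sum_i [ l(y_i, y_hat^(t-1)) + g_i*f_t(x_i) + 0.5*h_i*f_t(x_i)^2 ] + Omega(f_t)
    for squared-error loss, where g_i = 2*(y_hat^(t-1) - y_i) and h_i = 2."""
    prev = np.array([ensemble_predict(trees, x) for x in X])  # y_hat^(t-1)
    g = 2.0 * (prev - y)                                      # first-order gradients
    h = np.full_like(g, 2.0)                                  # second-order gradients
    step = np.array([f_t(x) for x in X])                      # f_t(x_i)
    approx = np.sum((prev - y) ** 2 + g * step + 0.5 * h * step ** 2)
    return approx + omega(f_t, gamma, lam)
```

Minimizing this approximation over candidate trees $f_t$, rather than the exact loss, is what makes the greedy step tractable for arbitrary twice-differentiable losses.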
