Reading XGBoost: Weighted Quantile Sketch

Article link: https://blog.csdn.net/Xiaoyi_Zhang/article/details/89608553

3.3 Weighted Quantile Sketch

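One important step in the approximate algorithm is to propose candidate split points. Usually percentiles of a feature are used to make candidates distribute evenly on the data. Formally, let multi-set $\mathcal{D}_k = \{(x_{1k}, h_1), (x_{2k}, h_2), \cdots, (x_{nk}, h_n)\}$ represent the k-th feature values and second order gradient statistics of each training instance. We can define a rank function $r_k : \mathbb{R} \to [0, +\infty)$ as

$$r_k(z) = \frac{1}{\sum_{(x,h) \in \mathcal{D}_k} h} \sum_{(x,h) \in \mathcal{D}_k,\, x < z} h, \tag{8}$$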

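which represents the proportion of instances whose k-th feature value is smaller than z. The goal is to find candidate split points $\{s_{k1}, s_{k2}, \cdots, s_{kl}\}$ such that

$$|r_k(s_{k,j}) - r_k(s_{k,j+1})| < \epsilon, \quad s_{k1} = \min_i x_{ik}, \quad s_{kl} = \max_i x_{ik}. \tag{9}$$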

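Here ε is an approximation factor. Intuitively, this means that there are roughly 1/ε candidate points. Here each data point is weighted by $h_i$. To see why $h_i$ represents the weight, we can rewrite Eq. (3) as

$$\sum_{i=1}^{n} \frac{1}{2} h_i \left(f_t(x_i) - g_i/h_i\right)^2 + \Omega(f_t) + \text{constant},$$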

which is exactly weighted squared loss with labels $g_i/h_i$ and weights $h_i$. For large datasets, it is non-trivial to find candidate splits that satisfy the criteria. When every instance has equal weights, an existing algorithm called quantile sketch [14, 24] solves the problem. However, there is no existing quantile sketch for weighted datasets. Therefore, most existing approximate algorithms either resorted to sorting on a random subset of data, which has a chance of failure, or to heuristics that do not have a theoretical guarantee.
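To make the rank function and the ε criterion concrete, here is a minimal Python sketch. It assumes the whole feature column fits in memory and can be sorted exactly, which is precisely the case the sketch algorithm is meant to avoid; the names `make_rank_fn` and `propose_candidates` are hypothetical and not part of XGBoost.

```python
from bisect import bisect_left
from itertools import accumulate


def make_rank_fn(values, hessians):
    """Build r(z): the share of total hessian weight on instances with value < z."""
    pairs = sorted(zip(values, hessians))
    xs = [x for x, _ in pairs]
    # prefix[i] is the summed weight of the i smallest feature values
    prefix = [0.0] + list(accumulate(h for _, h in pairs))
    total = prefix[-1]

    def rank(z):
        # bisect_left returns how many sorted values are strictly below z
        return prefix[bisect_left(xs, z)] / total

    return rank


def propose_candidates(values, hessians, eps):
    """Greedily pick split points whose adjacent rank gaps stay below eps."""
    rank = make_rank_fn(values, hessians)
    distinct = sorted(set(values))
    candidates = [distinct[0]]  # always keep the minimum
    for i in range(1, len(distinct)):
        is_last = i == len(distinct) - 1
        # Keep this value if skipping ahead to the next one would open a gap >= eps
        if is_last or rank(distinct[i + 1]) - rank(candidates[-1]) >= eps:
            candidates.append(distinct[i])
    return candidates
```

For example, `propose_candidates([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 2, 2], 0.3)` keeps `[1, 3, 5, 6]`: the heavier weights on the last two values pull candidates toward them. If a single feature value carries a weight share of ε or more, no candidate set can satisfy the criterion, and this greedy pass simply leaves that one gap too wide.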

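A key step in the approximate algorithm (bucket the values first, then search greedily over the buckets) is proposing candidate split points. Typically, percentiles of a feature are chosen so that the candidates are spread evenly. Let the multiset $\mathcal{D}_k$ hold, for each sample, the pair (value of the k-th feature, second-order gradient of the loss at that sample). On this basis we define a rank function $r_k$ (Eq. 8). For a split point, its value is the share of the total second-order gradient carried by the samples that fall below that point. The goal is to find a set of candidate split points $s_{kl}$ that satisfies Eq. 9, that is, the rank gap between two adjacent split points is smaller than the given ε. Here ε is an approximation factor; intuitively, roughly 1/ε candidate points are selected in total. Every point is weighted by its second-order gradient $h$. To see why $h$ can serve as the weight, Eq. 3 can be rewritten in the form below. On the sign error in that form, see the following answers: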

https://datascience.stackexchange.com/questions/10997/need-help-understanding-xgboosts-approximate-split-points-proposal

https://www.hrwhisper.me/machine-learning-xgboost/

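That is exactly a squared (L2) loss with $g_i/h_i$ as the label and $h_i$ as the weight. For large datasets, finding candidate split points that meet the criterion is non-trivial. A quantile sketch algorithm [14, 24] already solves the problem when all samples have equal weights. However, no existing quantile sketch handles weighted samples. Most approximate algorithms therefore fall back on sorting a random subset or on heuristics; the former can fail and the latter has no theoretical guarantee.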
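The sign question raised above can be checked by completing the square. The derivation below is a sketch under two assumptions: that Eq. (3) has the form given in the XGBoost paper, $\sum_i [g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)$, and that every $h_i > 0$.

$$
\begin{aligned}
g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)
  &= \tfrac{1}{2} h_i \left( f_t^2(x_i) + 2\,\frac{g_i}{h_i}\, f_t(x_i) \right) \\
  &= \tfrac{1}{2} h_i \left( f_t(x_i) + \frac{g_i}{h_i} \right)^2 - \frac{g_i^2}{2 h_i}
\end{aligned}
$$

Summing over $i$ and adding $\Omega(f_t)$ gives

$$
\sum_{i=1}^{n} \tfrac{1}{2} h_i \left( f_t(x_i) - \left(-\frac{g_i}{h_i}\right) \right)^2 + \Omega(f_t) - \sum_{i=1}^{n} \frac{g_i^2}{2 h_i},
$$

where the last sum does not depend on $f_t$. Under these assumptions the label is $-g_i/h_i$ rather than $g_i/h_i$, while the weight $h_i$ is unaffected by the sign.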

To solve this problem, we introduced a novel distributed weighted quantile sketch algorithm that can handle weighted data with a provable theoretical guarantee. The general idea is to propose a data structure that supports merge and prune operations, with each operation proven to maintain a certain accuracy level. A detailed description of the algorithm as well as proofs are given in the supplementary material (link in footnote 5).

5: Link to the supplementary material: https://homes.cs.washington.edu/~tqchen/pdf/xgboost-supp.pdf

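To solve this problem, we introduce a new distributed weighted quantile sketch algorithm with a theoretical guarantee. The idea is to propose a data structure that supports merge and prune operations, where each merge and each prune preserves a certain level of accuracy. The detailed algorithm and proofs are at the link.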
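As a rough illustration of what "merge" and "prune" mean for a weighted summary, here is a deliberately simplified Python sketch. It is not the data structure from the supplementary material and carries no accuracy guarantee; it assumes a summary is just a sorted list of (value, weight) pairs, and the function names `merge` and `prune` are hypothetical.

```python
def merge(a, b):
    """Combine two summaries, summing the weights of equal values."""
    merged = {}
    for value, weight in a + b:
        merged[value] = merged.get(value, 0.0) + weight
    return sorted(merged.items())


def prune(summary, max_size):
    """Shrink a summary to at most max_size entries (max_size >= 2).

    Entries nearest to evenly spaced weighted ranks are kept, plus the
    minimum and maximum. The weight of each dropped entry is folded into
    the next kept entry, so the total weight is unchanged.
    """
    if len(summary) <= max_size:
        return list(summary)
    total = sum(w for _, w in summary)
    keep = {0, len(summary) - 1}
    cumulative, k = 0.0, 1
    for i, (_, w) in enumerate(summary):
        cumulative += w
        # Keep the first entry whose cumulative weight reaches each rank target
        while k < max_size - 1 and cumulative >= k * total / (max_size - 1):
            keep.add(i)
            k += 1
    pruned, carried = [], 0.0
    for i, (value, w) in enumerate(summary):
        carried += w
        if i in keep:
            pruned.append((value, carried))
            carried = 0.0
    return pruned
```

In a distributed setting, each worker would build a summary of its own data shard, and the summaries would then be merged and pruned back to a fixed size. Bounding the error that each of these steps introduces is what the supplementary material works out; this sketch makes no such claim.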

3.3 Summary:

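Proposes an algorithm for finding candidate bucket boundaries when the samples carry weights; the detailed derivation is in the supplementary material.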
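Explains why using the second-order gradient as the weight is reasonable. Two problems remain, though. First, the sign is wrong. Second, even with the sign corrected, the paper does not make clear why this is reasonable: what does it mean to use $-g_i/h_i$ as the label?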
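On the second question: look at Eq. (3). The objective consists of the two terms that remain after the second-order Taylor expansion of the loss, so with a single sample the minimizer is clearly $-g_i/h_i$. That explains why Section 3.3 rewrites the objective with $-g_i/h_i$ as the label. For a set of samples, ignoring the regularization term, the optimum is $-\sum_i g_i / \sum_i h_i$. If the loss had the same second-order gradient everywhere, this would reduce to $-\mathrm{avg}(g_i)/\mathrm{const}$, independent of how the second-order gradient varies across samples. Because there are losses whose second-order gradient differs from point to point, the optimum over all samples is not just a rescaled average of the $g_i$. Instead, the optimal $w$ is the $h_i$-weighted average of the per-sample optima: $\sum_i h_i (-g_i/h_i) / \sum_i h_i = -\sum_i g_i / \sum_i h_i$.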
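The argument above can be checked numerically. The sketch below assumes all samples share a single output value w (as they would inside one leaf), drops the regularization term, and uses made-up gradient values; the function names are hypothetical.

```python
import math


def leaf_objective(w, grads, hessians):
    """Second-order objective sum(g*w + 0.5*h*w^2), regularization omitted."""
    return sum(g * w + 0.5 * h * w * w for g, h in zip(grads, hessians))


def optimal_weight(grads, hessians):
    """Closed-form minimizer of leaf_objective: -sum(g) / sum(h)."""
    return -sum(grads) / sum(hessians)


def weighted_mean_label(grads, hessians):
    """Hessian-weighted average of the per-sample optima -g_i / h_i."""
    labels = [-g / h for g, h in zip(grads, hessians)]
    return sum(h * y for h, y in zip(hessians, labels)) / sum(hessians)


# Illustrative values only, not taken from any real dataset
grads = [0.5, -0.2, 0.1]
hessians = [0.25, 0.16, 0.09]
w_star = optimal_weight(grads, hessians)

print(w_star, weighted_mean_label(grads, hessians))
```

Both printed numbers agree up to floating-point rounding, and moving w away from `w_star` in either direction increases `leaf_objective`. Setting all hessians to the same constant c makes `optimal_weight` equal to minus the mean gradient divided by c.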