The basic idea of boosting is to combine many decision trees to get a better result. Any single tree may classify poorly, but aggregating the outputs of many such weak classifiers yields a far more accurate prediction. Gradient tree boosting goes by several names, including gradient boosting machine (GBM), gradient boosted regression tree (GBRT), and GBDT; they all refer to essentially the same method. XGBoost is so widely praised mainly because of optimizations in its implementation: it can run boosting more than ten times faster on an ordinary PC, which lets users, especially competition teams, process data quickly and conveniently. The implementation is also highly scalable, which makes XGBoost easy to use in other development environments. On the algorithmic side, the paper highlights the main improvements:
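To make the "many weak trees, one strong predictor" idea concrete, here is a minimal gradient boosting sketch for squared loss using depth-1 stumps on 1-D data. This is illustrative only, not the paper's algorithm; helper names like `fit_stump` and `boost` are mine.

```python
# Minimal gradient boosting sketch: squared loss, depth-1 threshold stumps.
# Each round fits a weak learner to the current residuals, so the ensemble
# prediction improves gradually. Names here are illustrative, not from the paper.

def fit_stump(x, residuals):
    """Find the 1-D threshold split minimizing squared error on residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_rounds=50, lr=0.1):
    """Train n_rounds stumps, each one fitted to the residuals so far."""
    base = sum(y) / len(y)               # start from the mean prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)
```

With a step-function target such as `y = [1, 1, 1, 5, 5, 5]`, the ensemble's residual shrinks geometrically each round, so after 50 rounds the predictions sit very close to the two plateau values.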
Highly optimized split finding for sparse data
An adjusted regularization term in the loss function
Added support for distributed, parallel computation
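The regularization point above comes directly from the paper's objective: with gradient sum G and hessian sum H on a leaf, the optimal leaf weight is w* = -G / (H + λ), and a split is scored by the gain ½[G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)] − γ. A small sketch of those two formulas (function names are mine):

```python
# The XGBoost paper's regularized leaf weight and split gain.
# lam is the L2 penalty on leaf weights; gamma penalizes adding a leaf.

def leaf_weight(G, H, lam):
    """Optimal leaf score w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Loss reduction from splitting a node into (GL, HL) and (GR, HR)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma
```

Setting λ and γ to zero recovers the unregularized gain; larger values shrink leaf weights and discourage splits with little benefit.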
The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include: a novel tree learning algorithm is for handling sparse data; a theoretically justified weighted quantile sketch procedure enables handling instance weights in approximate tree learning. Parallel and distributed computing makes learning faster which enables quicker model exploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundred millions of examples on a desktop.
The paper also notes:
While there are some existing works on parallel tree boosting [22, 23, 19], the directions such as out-of-core computation, cache-aware and sparsity-aware learning have not been explored. In other words, the latter three techniques have not appeared in existing boosting implementations. Not being formally trained in this area, I am not entirely sure what those three terms mean in detail; corrections from readers are welcome.
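My rough understanding of "sparsity-aware learning" is the paper's default-direction trick: at each split, instances with a missing feature value are tentatively routed all-left and then all-right, and the direction giving the larger gain becomes that node's default. A hedged sketch under that reading (the function name and argument layout are my own, not the paper's code):

```python
# Sketch of the default-direction idea behind sparsity-aware split finding.
# Gradients/hessians of missing-value instances are tried in both branches;
# the branch with the larger regularized gain wins.

def choose_default(gl, hl, gr, hr, g_miss, h_miss, lam=1.0):
    """Return (direction, gain) for routing missing values left or right.

    gl/hl and gr/hr are gradient/hessian sums of present values on each side;
    g_miss/h_miss are the sums over instances with the feature missing.
    """
    def score(G, H):
        return G * G / (H + lam)

    parent = score(gl + gr + g_miss, hl + hr + h_miss)
    gain_left = 0.5 * (score(gl + g_miss, hl + h_miss) + score(gr, hr) - parent)
    gain_right = 0.5 * (score(gl, hl) + score(gr + g_miss, hr + h_miss) - parent)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right
```

Because only present values are enumerated and missing ones take the learned default path, split finding costs time proportional to the number of non-missing entries, which is the point of the optimization.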