小象学院 (Xiaoxiang Academy), Chapter 11: Boosting - CSDN Blog

The lessons in this middle part of the course are very important:

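Perhaps we can come up with an approach:

is it possible to do better by weighting the samples, or by weighting the classifiers?

Suppose we attach a weight to each classifier.

How should that weight be chosen?

We can use a loss function,

and move along its gradient (the gradient being the first derivative), optionally also using the second derivative (the Hessian matrix).

Using the first derivative leads to GBDT; adding the second derivative leads to XGBoost.

And with the exponential loss as the loss function, we arrive at AdaBoost, short for adaptive boosting.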
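To make the first-order versus second-order distinction concrete, here is a sketch of the Taylor expansion involved. The symbols are assumptions introduced for illustration: $F$ is the current model, $h$ the new tree, $g$ the first derivative of the loss with respect to the prediction, and $H$ the second derivative (the per-sample Hessian entry).

```latex
L\bigl(y, F(x) + h(x)\bigr) \approx L\bigl(y, F(x)\bigr) + g\,h(x) + \tfrac{1}{2}\,H\,h(x)^2,
\qquad
g = \frac{\partial L\bigl(y, F(x)\bigr)}{\partial F(x)}, \quad
H = \frac{\partial^2 L\bigl(y, F(x)\bigr)}{\partial F(x)^2}
```

Roughly speaking, GBDT-style methods work from the $g$ term alone, while XGBoost also keeps the $H$ term when building each tree.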

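So what we explore today is essentially how to ensemble a number of basic classifiers.

But this kind of ensembling is not a matter of simply adding them up, as a random forest does;

instead, the classifiers are combined with weights.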

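An example.

The idea behind boosting is an interesting one.

Think about what happens when we build a classifier or a regressor.

When I build a random forest, in theory ...

the N classifiers simply take a majority vote,

and the classifiers (say, 100 of them) are relatively independent of one another.

Now suppose we have already obtained M-1 decision trees.

Can we use the information in those existing trees, built on the samples, to have a beneficial influence on how the M-th tree is built?

And once we have a number of decision trees, is a simple majority vote really the only way to make the final prediction?

Could we put weights on the decision trees?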
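The difference between a plain majority vote and a weighted vote can be sketched in a few lines of Python. This is only an illustration: the three classifiers, their predictions, and the weights below are made up, and labels are assumed to be in {-1, +1}.

```python
def majority_vote(preds):
    """preds: one list of labels per classifier, labels in {-1, +1}.
    Each classifier gets exactly one vote (ties fall to -1)."""
    return [1 if sum(col) > 0 else -1 for col in zip(*preds)]


def weighted_vote(preds, weights):
    """Same idea, but each classifier's vote is scaled by its weight."""
    return [
        1 if sum(w * p for w, p in zip(weights, col)) > 0 else -1
        for col in zip(*preds)
    ]


# Three hypothetical classifiers voting on four samples.
preds = [
    [1, 1, -1, -1],
    [-1, 1, -1, 1],
    [-1, -1, 1, 1],
]
# Made-up weights: the first classifier is trusted the most.
weights = [2.0, 0.5, 0.5]

print(majority_vote(preds))           # [-1, 1, -1, 1]
print(weighted_vote(preds, weights))  # [1, 1, -1, -1]
```

With these weights the first classifier outvotes the other two combined, so the weighted result follows it on every sample. Boosting methods choose such weights from the data rather than by hand.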

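The explanation on Wikipedia:

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

In other words, gradient boosting is a machine learning technique that can be used for both classification and regression;

the prediction model it produces is an ensemble of weak learners, most commonly decision trees.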

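The explanation in the course slides:

Boosting is a machine learning technique that can be used for classification and regression.

At each step it produces a weak prediction model (such as a decision tree) and adds it, with a weight, to the overall model.

If each step's weak prediction model is generated according to the gradient direction of the loss function, the method is called gradient boosting.

A gradient boosting algorithm starts from a given objective loss function whose domain is the set of all feasible weak functions (the base functions).

The boosting algorithm then iterates, at each step choosing a base function along the negative gradient direction, gradually approaching a local minimum.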
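Written out, one common formulation of this procedure looks as follows. The notation is an assumption added here, not taken from the slides: $F_m$ is the current model, $h_m$ the base function chosen at step $m$, $\gamma_m$ its weight, and $r_i^{(m)}$ the negative gradient at sample $i$.

```latex
\begin{aligned}
r_i^{(m)} &= -\left.\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right|_{F = F_m},
  \qquad i = 1, \dots, n \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \bigl(r_i^{(m)} - h(x_i)\bigr)^2 \\
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i,\; F_m(x_i) + \gamma\, h_m(x_i)\bigr) \\
F_{m+1}(x) &= F_m(x) + \gamma_m\, h_m(x)
\end{aligned}
```

The second line is where "choosing a base function along the negative gradient direction" happens: the base function is fitted to the negative gradient values rather than to the labels themselves.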

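Given different trees, the loss function takes different values:

we can view the loss function as a function whose "independent variable" is the tree.

If we take the partial derivative of the loss with respect to the tree and move along the negative gradient, won't the loss keep getting smaller?

This is really just modeled on gradient descent.

The loss function is given by the actual problem at hand;

it has nothing to do with boosting itself.

We might be given MSE, or cross-entropy.

Once we have the loss function, we compute its gradient with respect to the classifier, and then take the negative gradient.

The "step size" of the descent is then used as the weight of the classifier itself.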
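As a rough illustration of these two ingredients, the sketch below computes the negative gradient for a squared loss and for a cross-entropy (log) loss, and finds the step size for the squared-loss case in closed form. The function names are hypothetical, the squared loss is written with a factor of 1/2 so that constants drop out, and the log loss assumes labels in {0, 1} with the model output treated as a raw score (logit).

```python
import math


def neg_gradient_squared(y, f):
    # L = 0.5 * (y - f)^2, so -dL/df = y - f (the residual).
    return y - f


def neg_gradient_logloss(y, f):
    # y in {0, 1}; f is a raw score and p = sigmoid(f).
    # L = -y*log(p) - (1 - y)*log(1 - p), so -dL/df = y - p.
    return y - 1.0 / (1.0 + math.exp(-f))


def best_step_squared(ys, fs, hs):
    # Step size gamma minimising sum_i (y_i - f_i - gamma * h_i)^2,
    # where h_i is the new weak model's output on sample i.
    numerator = sum((y - f) * h for y, f, h in zip(ys, fs, hs))
    denominator = sum(h * h for h in hs)
    return numerator / denominator


print(neg_gradient_squared(3.0, 2.5))   # 0.5
print(neg_gradient_logloss(1, 0.0))     # 0.5
# The weak model below points in the right direction but is half as large
# as the residuals, so the best step size (its weight) is 2.
print(best_step_squared([1, 2, 3], [0, 0, 0], [0.5, 1.0, 1.5]))  # 2.0
```

The last call shows the sense in which the step size acts as the weight of the new model: it scales the weak model's output before it is added to the ensemble.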

Back in lesson 4 we presented a piece of code;

at the time it was called Bagging.

Informal introduction

(This section follows the exposition of gradient boosting by Li.[6])

Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to "teach" a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $\frac{1}{n}\sum_{i}(\hat{y}_{i} - y_{i})^{2}$, where $i$ indexes over some training set of size $n$ of actual values of the output variable $y$.

At each stage $m$, $1 \le m \le M$, of gradient boosting, it may be assumed that there is some imperfect model $F_m$ (at the outset, a very weak model that just predicts the mean of $y$ in the training set could be used). The gradient boosting algorithm improves on $F_m$ by constructing a new model that adds an estimator $h$ to provide a better model: $F_{m+1}(x) = F_m(x) + h(x)$. To find $h$, the gradient boosting solution starts with the observation that a perfect $h$ would imply

$F_{m+1}(x) = F_m(x) + h(x) = y$

or, equivalently,

$h(x) = y - F_m(x)$.

Therefore, gradient boosting will fit $h$ to the residual $y - F_m(x)$. As in other boosting variants, each $F_{m+1}$ attempts to correct the errors of its predecessor $F_m$.
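The residual-fitting loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the course's or any library's implementation: the weak learner is a hand-written one-split regression stump on 1-D inputs, every stump is added with weight 1, and the helper names (`fit_stump`, `boost`, `predict`, `mse`) and the toy data are hypothetical.

```python
def fit_stump(xs, rs):
    """Least-squares regression stump on 1-D inputs.
    Returns (threshold, left_value, right_value)."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    if best is None:  # all inputs identical: fall back to a constant
        mean = sum(rs) / len(rs)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def boost(xs, ys, n_rounds):
    f0 = sum(ys) / len(ys)  # initial model: predict the training mean
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # y - F_m(x)
        stump = fit_stump(xs, residuals)                # h fitted to the residual
        stumps.append(stump)
        # F_{m+1}(x) = F_m(x) + h(x)
        preds = [p + predict_stump(stump, x) for p, x in zip(preds, xs)]
    return f0, stumps


def predict(model, x):
    f0, stumps = model
    return f0 + sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: y = x^2 on eight points.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [x * x for x in xs]
for rounds in (0, 1, 5, 20):
    print(rounds, mse(boost(xs, ys, rounds), xs, ys))
```

Each round fits a new stump to what the current model still gets wrong, so the training error printed by the loop shrinks as rounds are added. Practical implementations usually also scale each stump by a learning rate or a fitted step size, which this sketch leaves out.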