[Machine Learning Algorithms] GBDT: Principles and Implementation

Latest recommended article published on 2025-03-22 16:01:24

九年义务漏网鲨鱼

Category: Algorithms. Tags: machine learning, algorithms, artificial intelligence, boosting trees, GBDT, python

Article link: https://blog.csdn.net/weixin_51908696/article/details/144142275

1. Basics

Types of boosting trees

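-	Binary classification	Regression
Model	A special case of AdaBoost: each weak classifier is a tree of height 2 and has weight 1
Loss function	Exponential loss	Squared-error loss
Optimization	Fit each new weak classifier by empirical risk minimization	Fit each new weak learner to the residuals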

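Different problems call for different loss functions, and each loss function has its own optimization procedure. GBDT generalizes this into a single optimization scheme that applies to any differentiable loss function.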

General loss function

$L(y, f(x))$

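Optimization goal: after a new weak learner is added, the loss of the model should be lower than that of the previous model: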

$L(y,f_m(x))< L(y,f_{m-1}(x))$

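Write $f_m(x) = f_{m-1}(x) + T(x,\theta_m)$, where $T(x,\theta_m)$ is the newly added tree. A first-order Taylor expansion of $L(y,f_m(x))$ around $f_{m-1}(x)$ gives:
$L(y,f_m(x))=L(y,f_{m-1}(x)) + \frac{\partial L(y,f_{m-1}(x))}{\partial f_{m-1}(x)} \cdot T(x,\theta_m)+\alpha$
where $\alpha$ denotes the higher-order remainder. Dropping $\alpha$, we have:
$L(y,f_{m-1}(x))-L(y,f_m(x)) \approx -\frac{\partial L(y,f_{m-1}(x))}{\partial f_{m-1}(x)} \cdot T(x,\theta_m)$
If we choose $T(x,\theta_m) \approx -\frac{\partial L(y,f_{m-1}(x))}{\partial f_{m-1}(x)}$, the right-hand side becomes $\left(\frac{\partial L(y,f_{m-1}(x))}{\partial f_{m-1}(x)}\right)^2 \geq 0$, so $L(y,f_{m-1}(x))-L(y,f_m(x)) \geq 0$ to first order. It is therefore enough to fit each new weak learner to the negative gradient of the loss. For the squared-error loss $L(y,f(x)) = \frac{1}{2}(y-f(x))^2$, the negative gradient is exactly the residual $y - f_{m-1}(x) = y-y_{pred}$.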
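
The argument above can be checked numerically for the squared-error case. The sketch below is only an illustration under stated assumptions: the toy targets, the shrinkage factor 0.1, and the helper names `squared_loss` and `negative_gradient` are made up for this example, and the "tree" is idealized as a perfect fit of the negative gradient.

```python
import numpy as np

def squared_loss(y, f):
    """Per-sample squared-error loss L(y, f) = 0.5 * (y - f)^2."""
    return 0.5 * (y - f) ** 2

def negative_gradient(y, f):
    """Negative gradient of the squared-error loss w.r.t. f, i.e. the residual."""
    return y - f

y = np.array([3.0, -1.0, 2.5, 0.0])        # toy targets (made up)
f_prev = np.full_like(y, y.mean())          # f_{m-1}: a constant initial model
step = 0.1 * negative_gradient(y, f_prev)   # idealized T(x, theta_m), shrunk by 0.1
f_new = f_prev + step                       # f_m = f_{m-1} + T

loss_prev = squared_loss(y, f_prev).sum()
loss_new = squared_loss(y, f_new).sum()
# First-order estimate of the decrease: -(gradient) * T, summed over samples
first_order_drop = (negative_gradient(y, f_prev) * step).sum()

print("loss before:", loss_prev)
print("loss after: ", loss_new)
print("first-order estimate of the decrease:", first_order_drop)
```

The actual decrease is slightly smaller than the first-order estimate because the expansion ignores the second-order term, which is why a small learning rate keeps the approximation reliable.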

[!tip]

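Unlike AdaBoost, GBDT is optimized through gradients, whereas AdaBoost is optimized through sample weights: it concentrates training on misclassified samples and is therefore fairly sensitive to outliers. GBDT uses the negative gradient of the loss function as an approximate residual to guide the growth of each tree. The gradient carries both a direction and a magnitude, which helps the model follow a more precise optimization path.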

2. GBDT for classification

Basic approach

The final regression output is mapped to a probability with the sigmoid function:
$\frac{1}{1+e^{-f_m(x)}}$

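Loss function: the cross-entropy loss, where $p$ denotes the sigmoid output above:

$L(y,p) = -y\log p-(1-y) \log(1-p)$

Substituting $p = \frac{1}{1+e^{-f_m(x)}}$ into the cross-entropy loss gives:

$L(y,f_m(x))=y\log (1+e^{-f_m(x)})+(1-y)\log (1+e^{f_m(x)})=\log (1+e^{-f_m(x)})+(1-y)f_m(x)$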
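
As a quick sanity check, the cross-entropy written in terms of the probability and the form written in terms of the raw score $f_m(x)$ can be compared numerically. This is a minimal sketch; the labels, the scores, and the function names are invented for illustration.

```python
import numpy as np

def sigmoid(f):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + np.exp(-f))

def cross_entropy_from_prob(y, p):
    """Cross-entropy in terms of the predicted probability p."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def cross_entropy_from_score(y, f):
    """The same loss after substituting p = sigmoid(f)."""
    return np.log1p(np.exp(-f)) + (1 - y) * f

y = np.array([0.0, 1.0, 1.0, 0.0])      # toy labels (made up)
f = np.array([-2.0, -0.5, 1.5, 3.0])    # toy raw scores (made up)

loss_a = cross_entropy_from_prob(y, sigmoid(f))
loss_b = cross_entropy_from_score(y, f)
print(np.allclose(loss_a, loss_b))
```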

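The negative gradient of the loss function is:

$r_m(x,y)=-\left[\frac{1}{1+e^{-f_{m-1}(x)}} - y\right] = y - y_{m-1}$

where $y_{m-1}$ is the probability predicted by the previous model $f_{m-1}$.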
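
The closed-form negative gradient can be compared with a finite-difference estimate. The snippet below is a sketch under assumptions: the data, the step size `eps`, and the helper names are chosen only for this illustration.

```python
import numpy as np

def sigmoid(f):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + np.exp(-f))

def log_loss_from_score(y, f):
    """Cross-entropy loss expressed in terms of the raw score f."""
    return np.log1p(np.exp(-f)) + (1 - y) * f

y = np.array([0.0, 1.0, 1.0, 0.0])      # toy labels (made up)
f = np.array([-2.0, -0.5, 1.5, 3.0])    # toy raw scores (made up)

eps = 1e-6
# Central finite difference of the loss with respect to each score
numeric_grad = (log_loss_from_score(y, f + eps) - log_loss_from_score(y, f - eps)) / (2 * eps)
analytic_neg_grad = y - sigmoid(f)      # label minus predicted probability

print(np.allclose(-numeric_grad, analytic_neg_grad, atol=1e-6))
```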

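3. Code implementation

Initialization: the model fits each weak learner to residuals, but no residuals exist before the first weak learner, so the prediction has to be initialized.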

if self.task == "regression":
    self.init_value = np.mean(y)  # regression: start from the mean of the targets
elif self.task == "classification":
    self.init_value = np.log(np.mean(y) / (1 - np.mean(y)))  # classification: start from the log-odds

y_pred = np.full(y.shape, self.init_value)
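
To see why these two initial values are natural choices, the sketch below checks that the mean is the best constant under squared error and that the log-odds maps back to the positive rate under the sigmoid. The toy arrays and the candidate grid are assumptions made for this illustration. Note that the log-odds is undefined when all labels are identical, an edge case this sketch does not handle.

```python
import numpy as np

def sigmoid(f):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + np.exp(-f))

# Regression: the mean is the constant that minimizes the squared error.
y_reg = np.array([1.0, 2.0, 4.0, 9.0])        # toy targets (made up)
init_reg = np.mean(y_reg)
candidates = np.linspace(0.0, 10.0, 101)      # constants 0.0, 0.1, ..., 10.0
sse = np.array([np.sum((y_reg - c) ** 2) for c in candidates])
best_constant = candidates[np.argmin(sse)]

# Classification: the log-odds of the positive rate, so sigmoid(init) == mean(y).
y_clf = np.array([0, 1, 1, 1, 0, 1, 1, 1])    # toy labels (made up)
positive_rate = np.mean(y_clf)
init_clf = np.log(positive_rate / (1 - positive_rate))

print(init_reg, best_constant)
print(positive_rate, sigmoid(init_clf))
```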

Gradient update of the weak learners: each new weak learner is fit to the negative gradient of the loss function

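for i in range(self.n_estimators):
    if self.task == "regression":
        # Residual, i.e. the negative gradient of the squared-error loss
        residual = y - y_pred
    elif self.task == "classification":
        # Negative gradient of the cross-entropy loss: label minus predicted probability
        prob = 1 / (1 + np.exp(-y_pred))  # Sigmoid
        residual = y - prob

    # Fit a regression tree to the residual (negative gradient)
    tree = DecisionTreeRegressor(max_depth=self.max_depth)
    tree.fit(X, residual)
    self.trees.append(tree)

    # Update the prediction: since the tree fits the residual, add its scaled output to the current ensemble
    y_pred += self.learning_rate * tree.predict(X)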
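
The effect of this loop can be observed on a toy regression problem. The sketch below deliberately swaps `DecisionTreeRegressor` for a hand-written one-split stump (`fit_stump`, a hypothetical helper written for this example) so that it depends only on NumPy; the synthetic sine data, the 50 rounds, and the learning rate are likewise assumptions.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression tree on a 1-D feature x to targets r (least squares)."""
    values = np.unique(x)                          # sorted distinct feature values
    thresholds = (values[:-1] + values[1:]) / 2.0  # midpoints as candidate splits
    best = None
    for t in thresholds:
        left = x <= t
        left_val, right_val = r[left].mean(), r[~left].mean()
        sse = np.sum((r[left] - left_val) ** 2) + np.sum((r[~left] - right_val) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left_val, right_val)
    _, t, left_val, right_val = best
    return lambda q: np.where(q <= t, left_val, right_val)

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)               # synthetic feature (made up)
y = np.sin(x) + 0.1 * rng.normal(size=200)         # synthetic noisy target (made up)

learning_rate = 0.1
y_pred = np.full(y.shape, y.mean())                # initial prediction: the mean
mse_history = [np.mean((y - y_pred) ** 2)]
for _ in range(50):
    residual = y - y_pred                          # negative gradient of squared error
    stump = fit_stump(x, residual)                 # weak learner fit to the residual
    y_pred += learning_rate * stump(x)             # shrunken additive update
    mse_history.append(np.mean((y - y_pred) ** 2))

print("training MSE before boosting:", mse_history[0])
print("training MSE after 50 rounds:", mse_history[-1])
```

Because each stump is a least-squares fit to the current residual and the learning rate lies between 0 and 2, the training error cannot increase from one round to the next in this setup.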

Prediction: add up the scaled predictions of all weak learners

def predict(self, X):
    y_pred = np.full((X.shape[0],), self.init_value)  # start from the initial value used during training

    for tree in self.trees:
        y_pred += self.learning_rate * tree.predict(X)

    if self.task == "classification":
        # For classification, return the probability
        return 1 / (1 + np.exp(-y_pred))
    return y_pred

Complete code

import numpy as np
from sklearn.tree import DecisionTreeRegressor


class GBDT:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3, task="regression"):
        assert task in ["regression", "classification"], "Task must be 'regression' or 'classification'."
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.task = task
        self.trees = []  # stores the fitted weak learners
        self.init_value = None  # initial prediction

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        self.trees = []  # reset so that refitting does not keep old trees

        # Initialize the prediction
        if self.task == "regression":
            self.init_value = np.mean(y)  # regression: start from the mean of the targets
        elif self.task == "classification":
            # classification: start from the log-odds; clip to avoid log(0) when only one class is present
            pos_rate = np.clip(np.mean(y), 1e-12, 1 - 1e-12)
            self.init_value = np.log(pos_rate / (1 - pos_rate))

        y_pred = np.full(y.shape, self.init_value)

        for i in range(self.n_estimators):
            if self.task == "regression":
                # Residual, i.e. the negative gradient of the squared-error loss
                residual = y - y_pred
            elif self.task == "classification":
                # Negative gradient of the cross-entropy loss: label minus predicted probability
                prob = 1 / (1 + np.exp(-y_pred))  # Sigmoid
                residual = y - prob

            # Fit a regression tree to the residual (negative gradient)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residual)
            self.trees.append(tree)

            # Update the prediction
            y_pred += self.learning_rate * tree.predict(X)

        return self

    def predict(self, X):
        y_pred = np.full((X.shape[0],), self.init_value)

        for tree in self.trees:
            y_pred += self.learning_rate * tree.predict(X)

        if self.task == "classification":
            # For classification, return the probability
            return 1 / (1 + np.exp(-y_pred))
        return y_pred

    def predict_class(self, X, threshold=0.5):
        if self.task != "classification":
            raise ValueError("This method is only available for classification tasks.")
        prob = self.predict(X)
        return (prob >= threshold).astype(int)
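
For a reference point when testing the class above, scikit-learn's own `GradientBoostingRegressor` accepts the same three hyperparameters (`n_estimators`, `learning_rate`, `max_depth`). The sketch below is not part of the original implementation: the synthetic data and the train/test split are assumptions, and it only shows how a baseline number could be obtained for comparison.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(300, 1))            # synthetic feature (made up)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)     # synthetic noisy target (made up)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Library model with the same hyperparameters as the GBDT defaults above
reference = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
reference.fit(X_train, y_train)

# Error of always predicting the training mean, i.e. the initial value alone
baseline_mse = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
reference_mse = mean_squared_error(y_test, reference.predict(X_test))

print("constant-mean baseline MSE:", baseline_mse)
print("GradientBoostingRegressor MSE:", reference_mse)
```

Fitting `GBDT(task="regression")` on the same split and computing its test MSE in the same way should give a number comparable to the library result, since both fit regression trees to squared-error residuals.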