XGBoost (eXtreme Gradient Boosting)

appron

Category: Machine Learning. Tags: machine learning, python, decision tree

Article link: https://blog.csdn.net/pingguolou/article/details/129413989

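XGBoost (eXtreme Gradient Boosting) is an ensemble learning algorithm built on decision trees. It combines the strengths of boosted trees and the gradient boosting framework and adds its own optimization strategy, so that tree models can be trained quickly and efficiently on large datasets.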

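XGBoost follows the same basic principle as gradient boosting: both build an ensemble of base learners as an additive model. Unlike plain gradient boosting, XGBoost takes a second-order Taylor expansion of the loss function and adds a regularization term to the objective, which makes the model more robust and helps limit overfitting. It also uses feature subsampling and column-oriented storage blocks to cut computational cost, which greatly improves efficiency on large datasets.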
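To make the second-order expansion concrete, here is a short derivation in the standard XGBoost notation. The symbols ($g_i$, $h_i$, $G_j$, $H_j$, $\lambda$, $\gamma$, $T$) are not defined in the original post and are introduced here as assumptions for illustration. At boosting round $t$, a new tree $f_t$ is added to the previous prediction $\hat{y}_i^{(t-1)}$, and the regularized objective is

$$
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 ,
$$

where $T$ is the number of leaves and $w_j$ is the value of leaf $j$. Expanding the loss to second order around $\hat{y}_i^{(t-1)}$ gives

$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t),
\qquad
g_i = \frac{\partial l}{\partial \hat{y}_i^{(t-1)}},\quad
h_i = \frac{\partial^2 l}{\partial \big(\hat{y}_i^{(t-1)}\big)^2} .
$$

Writing $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ for the samples $I_j$ that fall in leaf $j$, the objective is a separate quadratic in each $w_j$, so the optimal leaf value and the resulting objective (up to a constant that does not depend on the tree) are

$$
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\mathcal{L}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T .
$$

The gain from splitting a leaf into left and right children then follows by comparing $\mathcal{L}^{*}$ before and after the split:

$$
\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma .
$$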

Below we implement XGBoost in Python.

First, we need a basic decision tree to act as the base learner. scikit-learn's DecisionTreeRegressor class can be used for this, wrapped as follows:

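```python
from sklearn.tree import DecisionTreeRegressor

class RegressionTree:
    """Thin wrapper around scikit-learn's DecisionTreeRegressor."""

    def __init__(self, max_depth=5):
        # 'squared_error' replaces the old 'mse' name, which was
        # removed in scikit-learn 1.2.
        self.tree = DecisionTreeRegressor(
            max_depth=max_depth, criterion='squared_error', random_state=42)

    def fit(self, X, y):
        self.tree.fit(X, y)

    def predict(self, X):
        return self.tree.predict(X)
```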

Next, we implement the XGBoost algorithm itself:

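```python
import numpy as np

def _sigmoid(z):
    # Map raw scores (log-odds) to probabilities of the positive class
    return 1.0 / (1.0 + np.exp(-z))

class XGBoost:
    def __init__(self, n_trees=50, learning_rate=0.1, max_depth=5, reg_lambda=1.0):
        self.n_trees = n_trees
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.reg_lambda = reg_lambda

    def fit(self, X, y):
        self.trees = []
        Fm = np.zeros(len(y), dtype=float)  # raw scores, starting at 0
        for m in range(self.n_trees):
            Pm = _sigmoid(Fm)     # current probability of the positive class
            g = Pm - y            # first derivative of the log loss w.r.t. Fm
            h = Pm * (1.0 - Pm)   # second derivative of the log loss w.r.t. Fm
            # Per-sample regularized Newton step; a simplification of the
            # leaf-level weight -G / (H + lambda) used by XGBoost itself
            rm = -g / (h + self.reg_lambda)
            tree = RegressionTree(max_depth=self.max_depth)
            tree.fit(X, rm)
            self.trees.append(tree)

            Fm += self.learning_rate * tree.predict(X)  # update the model

    def predict(self, X):
        Fm = np.zeros(X.shape[0], dtype=float)
        for tree in self.trees:
            Fm += self.learning_rate * tree.predict(X)

        Pm = _sigmoid(Fm)
        return (Pm > 0.5).astype(int)
```

In each round, the code computes the first derivative `g = Pm - y` and the second derivative `h = Pm * (1 - Pm)` of the log loss with respect to the current raw score `Fm`, then fits a tree to the regularized Newton step `-g / (h + reg_lambda)`. The `*` operator multiplies arrays element by element. This is a simplified sketch rather than a full XGBoost: the real algorithm computes leaf values from summed gradients and chooses splits by a gain criterion, which scikit-learn's DecisionTreeRegressor does not do. At prediction time, samples whose predicted probability exceeds 0.5 are labeled positive.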
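The leaf-level quantities that XGBoost works with can be sketched in a few lines of NumPy. The snippet below is an illustration under stated assumptions, not the library's implementation: the function names `logistic_grad_hess`, `leaf_weight`, and `split_gain`, the `gamma` parameter, and the toy data are all hypothetical and chosen only for this example. It computes the first and second derivatives of the log loss, the regularized leaf value -G / (H + lambda), and the gain of a candidate split.

```python
import numpy as np

def logistic_grad_hess(y, raw_score):
    """First and second derivatives of the log loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    g = p - y
    h = p * (1.0 - p)
    return g, h

def leaf_weight(g, h, reg_lambda=1.0):
    """Regularized leaf value: -G / (H + lambda)."""
    return -np.sum(g) / (np.sum(h) + reg_lambda)

def split_gain(g, h, left_mask, reg_lambda=1.0, gamma=0.0):
    """Objective reduction from splitting a node into left/right children."""
    def score(gs, hs):
        return np.sum(gs) ** 2 / (np.sum(hs) + reg_lambda)

    left = score(g[left_mask], h[left_mask])
    right = score(g[~left_mask], h[~left_mask])
    parent = score(g, h)
    return 0.5 * (left + right - parent) - gamma

# Toy data (illustrative): one feature, labels separate cleanly at x = 0.5
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# Start from raw scores of zero, i.e. a predicted probability of 0.5
g, h = logistic_grad_hess(y, np.zeros_like(y))

good_split = split_gain(g, h, x < 0.5)    # separates the two classes
bad_split = split_gain(g, h, x < 0.25)    # mixes classes on the right side
w_left = leaf_weight(g[x < 0.5], h[x < 0.5])
w_right = leaf_weight(g[x >= 0.5], h[x >= 0.5])

print('gain at x < 0.5 :', good_split)
print('gain at x < 0.25:', bad_split)
print('leaf weights    :', w_left, w_right)
```

On this toy data the split at 0.5 yields the larger gain, and the two leaf values have opposite signs: the left leaf pushes the raw score down (toward class 0) and the right leaf pushes it up (toward class 1).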

We can use this hand-written XGBoost model for binary classification. Example code:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a random dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, n_redundant=5, random_state=42)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the XGBoost model and fit it on the training set
xgb = XGBoost(n_trees=50, learning_rate=0.1, max_depth=5, reg_lambda=0.1)
xgb.fit(X_train, y_train)

# Predict on the test set
y_pred = xgb.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
```

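The author reports the following output (the exact value depends on the implementation details and library versions):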

```
Accuracy: 0.965
```

That covers the basic principles of XGBoost and a simple Python implementation.