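Summary of Li Hang's "Statistical Learning Methods" and Zhou Zhihua's "Machine Learning" (the Watermelon Book) with Python Code, Part 6: Ensemble Learning (RF/GBDT/XGBoost)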

xiao韩

Article link: https://blog.csdn.net/qq_28821995/article/details/101996594

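This article summarizes the material on random forests, gradient boosting decision trees (GBDT), and XGBoost from "Statistical Learning Methods" and "Machine Learning" (the Watermelon Book), and provides Python implementations. Random forests increase the diversity of the decision trees through random attribute selection; GBDT optimizes iteratively, using the negative gradient of the loss function as an approximation of the residual; XGBoost adds a regularization term to the objective function and splits nodes with a greedy strategy. The complete code is available on GitHub.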

(Summary generated automatically by CSDN.)

I. Random Forest (RF)

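Random forest (RF) is a variant of the Bagging family of ensemble methods. RF builds a Bagging ensemble with decision trees as the base learners and additionally introduces random attribute selection into the training of each tree. A conventional decision tree picks the best splitting attribute from all attributes available at the current node. In RF, for each node of a base tree, a subset of k attributes is first drawn at random from that node's attribute set, and the best attribute within this subset is then used for the split.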
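As a rough illustration of the random attribute selection described above, the sketch below chooses a split for a single node. The names `gini` and `best_split_in_random_subset`, the use of the Gini index as the split criterion, and the toy data are assumptions made for this example, not code from the books.

```python
import random
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_in_random_subset(X, y, k, rng):
    """Draw k candidate features at random, then return the best split among them.

    Returns (candidate_features, (weighted_gini, feature, threshold)).
    The split tuple is None if no candidate feature can separate the samples.
    """
    n_features = len(X[0])
    candidates = rng.sample(range(n_features), k)  # the random attribute subset
    best = None
    for feature in candidates:
        for threshold in sorted({row[feature] for row in X}):
            left = [label for row, label in zip(X, y) if row[feature] >= threshold]
            right = [label for row, label in zip(X, y) if row[feature] < threshold]
            if not left or not right:
                continue  # this threshold does not separate anything
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return candidates, best


# Toy data (assumed): feature 0 separates the classes, features 1 and 2 are noise.
X = [[1.0, 5.0, 0.3], [2.0, 1.0, 0.9], [3.0, 4.0, 0.1], [4.0, 2.0, 0.7]]
y = [0, 0, 1, 1]

rng = random.Random(0)
candidates, best = best_split_in_random_subset(X, y, k=2, rng=rng)
print("candidate features:", candidates)
print("chosen split:", best)
```

With k equal to the number of attributes, the node is split exactly as in a conventional decision tree; with k = 1, the splitting attribute is chosen purely at random. The Watermelon Book suggests k = log2(d) for d attributes as a default.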

II. Gradient Boosting Decision Tree (GBDT)

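As covered in Part 5 of this series, a boosting tree fits the residual of the previous round at every iteration. GBDT instead uses the negative gradient of the loss function as an approximation of the residual in regression problems.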

Why can GBDT use the negative gradient of the loss function in place of the previous step's residual?

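Let the loss at round t be $L(y,H_{t}(x))$, where $H_{t}(x)=H_{t-1}(x)+h_{t}(x)$ and $h_{t}(x)$ is the learner fitted at round t. A first-order Taylor expansion of this loss around $H_{t-1}(x)$ gives:

$L(y,H_{t}(x))\approx L(y,H_{t-1}(x))+\frac{\partial L(y,H_{t-1}(x))}{\partial H_{t-1}(x)}h_{t}(x)$

From this expansion, to obtain $L(y,H_{t}(x))\leq L(y,H_{t-1}(x))$ we can set $h_{t}(x)=-\frac{\partial L(y,H_{t-1}(x))}{\partial H_{t-1}(x)}$, because the first-order term then becomes $-\left(\frac{\partial L(y,H_{t-1}(x))}{\partial H_{t-1}(x)}\right)^{2}\leq 0$. This is why GBDT fits the negative gradient of the previous step's loss. For the squared loss $\frac{1}{2}(y-H_{t-1}(x))^{2}$, the negative gradient is exactly the residual $y-H_{t-1}(x)$.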
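The argument above can be checked numerically. The sketch below is only an illustration: the function names, the toy values for y and the previous prediction, and the step size are assumptions. It estimates the negative gradient by central differences for two losses and confirms that a small step along it lowers the loss.

```python
def squared_loss(y, h):
    return 0.5 * (y - h) ** 2


def absolute_loss(y, h):
    return abs(y - h)


def negative_gradient(loss, y, h, eps=1e-6):
    """Numerical -dL/dH at H = h, by central differences."""
    return -(loss(y, h + eps) - loss(y, h - eps)) / (2 * eps)


y, h_prev = 3.0, 1.0  # target and previous prediction H_{t-1}(x) (toy values)

g_sq = negative_gradient(squared_loss, y, h_prev)
g_abs = negative_gradient(absolute_loss, y, h_prev)
print("residual y - H_{t-1}:", y - h_prev)
print("negative gradient, squared loss:", round(g_sq, 6))
print("negative gradient, absolute loss:", round(g_abs, 6))

# A small step along the negative gradient lowers the loss in both cases.
step = 0.1
for loss, g in ((squared_loss, g_sq), (absolute_loss, g_abs)):
    h_new = h_prev + step * g
    print(loss.__name__, loss(y, h_prev), "->", loss(y, h_new))
```

For the squared loss the negative gradient coincides with the residual; for the absolute loss it is only the sign of the residual, which is why the negative gradient is described as an approximation of the residual in general.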

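Algorithm steps:

Take the boosting tree algorithm from Part 5 and replace the residual computation with the negative gradient of the loss function at the previous step.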
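The step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the article's implementation: the base learners are one-split regression stumps on a single feature, the loss is the squared loss, and the `learning_rate` shrinkage factor, the function names, and the toy data are all introduced here for the example.

```python
def fit_stump(x, targets):
    """Least-squares regression stump on 1-D inputs.

    Returns (threshold, left_value, right_value); samples with x < threshold
    get left_value, the others get right_value.
    """
    best = None
    for threshold in sorted(set(x))[1:]:  # skip the minimum: the left side would be empty
        left = [t for xi, t in zip(x, targets) if xi < threshold]
        right = [t for xi, t in zip(x, targets) if xi >= threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi < threshold else right_value


def negative_gradient(y, pred):
    """-dL/dH for the squared loss L = 1/2 (y - H)^2, which equals y - H."""
    return [yi - pi for yi, pi in zip(y, pred)]


def fit_gbdt(x, y, n_rounds=20, learning_rate=0.5):
    init = sum(y) / len(y)  # H_0: a constant prediction
    pred = [init] * len(y)
    stumps = []
    for _ in range(n_rounds):
        grad = negative_gradient(y, pred)  # takes the place of the residual computation
        stump = fit_stump(x, grad)         # h_t is fitted to the negative gradient
        stumps.append(stump)
        pred = [p + learning_rate * predict_stump(stump, xi) for p, xi in zip(pred, x)]
    return init, stumps, learning_rate


def predict_gbdt(model, xi):
    init, stumps, learning_rate = model
    return init + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


# Toy 1-D regression data (assumed for illustration).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]

model = fit_gbdt(x, y)
baseline = mse(y, [sum(y) / len(y)] * len(y))
fitted = mse(y, [predict_gbdt(model, xi) for xi in x])
print("MSE of the constant model:", baseline)
print("MSE after boosting:", fitted)
```

Only `negative_gradient` depends on the loss, so switching to another differentiable loss means replacing that one function while the fitting loop stays the same.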

Code implementation:

import numpy as np
import math
# Compute the information entropy of a label array
def calculate_entropy(y):
    log2 = math.log2
    unique_labels = np.unique(y)
    entropy = 0
    for label in unique_labels:
        count = len(y[y == label])
        p = count / len(y)
        entropy += -p * log2(p)
    return entropy
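# A node of the tree: an internal decision node or a leaf
class DecisionNode():
    def __init__(self, feature_i=None, threshold=None,
                 value=None, true_branch=None, false_branch=None):
        self.feature_i = feature_i          # index of the feature tested at this node
        self.threshold = threshold          # threshold the feature value is compared against
        self.value = value                  # prediction stored at a leaf
        self.true_branch = true_branch      # subtree for samples that meet the threshold
        self.false_branch = false_branch    # subtree for samples that do not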
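# Split the rows of X in two by comparing feature_i against threshold
def divide_on_feature(X, feature_i, threshold):
    # np.number covers NumPy integer scalars, which are not instances of int
    if isinstance(threshold, (int, float, np.number)):
        # numeric feature: rows at or above the threshold go into the first part
        split_func = lambda sample: sample[feature_i] >= threshold
    else:
        # categorical feature: rows equal to the threshold go into the first part
        split_func = lambda sample: sample[feature_i] == threshold

    X_1 = np.array([sample for sample in X if split_func(sample)])
    X_2 = np.array([sample for sample in X if not split_func(sample)])

    # Return a tuple: the two parts usually differ in length, and recent NumPy
    # versions refuse to build a single array from ragged pieces
    return X_1, X_2
# Base class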
class DecisionTree(object):
    def __init__(self, min_samples_split=2, min_impurity=1e-7,
                 max_depth=float("inf"), loss=None):
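        self.root = None  # root node of the tree
        self.min_samples_split = min_samples_split
        self.min_impurity = min_impurity
        self.max_depth = max_depth
        # Split-quality function: information gain for classification,
        # variance reduction for regression (None in this base class)
        self._impurity_calculation = None
        self._leaf_value_calculation = None  # computes the value stored at a leaf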
        self.one_dim = None
        self.loss = loss

    def fit(self, X, y, loss=None):
        self.one_dim = len(np.shape(y)) == 1
        self.root = self._build_tree(X, y)
        self.loss=None

    def _build_tree(self, X, y, current_depth=0):
        """
        Recursively build the tree.
        """

        largest_impurity = 0
        best_criteria = None
        best_sets = None
        
        if len(np.shape(y)) == 1:
            y = np.expand_dims(y, axis=1)

        Xy = np.concatenate((X, y), axis=1)

        n_samples, n_features = np.shape(X)

        if n_samples >= self.min_samples_split and current_depth <= self.max_depth:
            # Evaluate the candidate splits on every feature
            for feature_i in range(n_features):
                feature_values = np.expand_dims(X[:, feature_i], axis=1)
                unique_values = np.unique(feature_values)

                for threshold in unique_values:
                    Xy1, Xy2 = divide_on_feature(Xy, feature_i, threshold)
                    
                    if len(Xy1) > 0 and len(Xy2) > 0:
                        y1 = Xy1[:, n_features:]
                        y2 = Xy2[:, n_features:]

                        # Compute the gain (impurity reduction) of this split
                        impurity = self._impurity_calculation(y, y1, y2)

                        if impurity > largest_impurity:
                            largest_impurity = impurity
                            best_criteria = {"feature_i": feature_i, "threshold": threshold}
                            best_sets = {
                                "leftX": Xy1[:, :n_features],  
                                "lefty": Xy1[:, n_features:],   
                                "rightX": Xy2[:, :n_features],  
                                "righty": Xy2[:, n_features:]   
                                }

        if largest_impurity > self.min_impurity:
            true_branch = self._build_tree(best_sets["leftX"], best_sets["lefty"], current_depth + 1)
            false_branch = self._build_tree(best_sets["rightX"], best_sets["righty"], current_depth + 1)
            return DecisionNode(feature_i=best_criteria["feature_i"], threshold=best_criteria[
                                "threshold"], true_branch=true_branch, false_branch=false_branch)

        # (The source listing is cut off at this point; the rest of the class is not available.)

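The listing leaves `_impurity_calculation` unset in the base class, and its comment names information gain as the measure for classification. A function with the same `(y, y1, y2)` call shape as the line `impurity = self._impurity_calculation(y, y1, y2)` might look like the sketch below. The names `entropy` and `information_gain` are hypothetical, and plain Python lists stand in for the NumPy arrays used in the listing.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(y, y1, y2):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    p = len(y1) / len(y)
    return entropy(y) - p * entropy(y1) - (1 - p) * entropy(y2)


parent = [0, 0, 1, 1]
print("clean split:", information_gain(parent, [0, 0], [1, 1]))
print("uninformative split:", information_gain(parent, [0, 1], [0, 1]))
```

A split that separates the classes completely recovers the full entropy of the parent, while a split that leaves both children as mixed as the parent gains nothing, which is the quantity the tree-building loop maximizes through `largest_impurity`.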