Decision Tree Summary (Personal Study Notes)

盒饭立flag

Article link: https://blog.csdn.net/weixin_43973436/article/details/106751198

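Algorithm Definition

Decision tree: a supervised learning method. It repeatedly tests one attribute at a time, with each test producing a branch, until the branches form a tree that can be used to classify new data.
Representative algorithms: ID3, C4.5
Loss function: regularized maximum likelihood function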

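Algorithm Flow

1. Treat each feature as a candidate node.
2. For each feature, try its possible splits and find the best split point; partition the data into child nodes N₁, N₂, …, N_m and compute the purity of each child node.
3. From the results of step 2, choose the best feature and split, which yields the child nodes N₁, N₂, …, N_m.
4. Repeat steps 2 and 3 on each of N₁, N₂, …, N_m until every final child node is sufficiently pure.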

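The flow can be written as the following pseudocode for createBranch():

Check whether every item in the dataset belongs to the same class:
If so return the class label
Else
     find the best feature for splitting the dataset
     split the dataset
     create a branch node
         for each subset produced by the split
             call createBranch() and add the returned result to the branch node
         return the branch node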
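
The pseudocode above can be turned into a small runnable sketch. This is only an illustration under my own assumptions: the names `create_branch`, `split_by` and `misclassified`, the nested-dict tree layout, and the choice of "fewest majority-vote errors" as the score for the best feature are all hypothetical and not taken from any particular library. The feature-selection criteria discussed in the next section could be swapped in for the score.

```python
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def split_by(rows, labels, feat):
    """Group (rows, labels) by the value of feature index `feat`."""
    groups = {}
    for row, y in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[feat], ([], []))
        sub_rows.append(row)
        sub_labels.append(y)
    return groups

def misclassified(rows, labels, feat):
    """Errors made if every subset predicted its majority label (lower is better)."""
    return sum(
        len(ys) - Counter(ys).most_common(1)[0][1]
        for _, ys in split_by(rows, labels, feat).values()
    )

def create_branch(rows, labels, feats):
    # every item belongs to the same class: return the class label
    if len(set(labels)) == 1:
        return labels[0]
    # nothing left to split on: fall back to the majority label
    if not feats:
        return majority(labels)
    # find the best feature, split the dataset, create a branch node
    best = min(feats, key=lambda f: misclassified(rows, labels, f))
    rest = [f for f in feats if f != best]
    branch = {"feature": best, "children": {}}
    for value, (sub_rows, sub_labels) in split_by(rows, labels, best).items():
        # recurse on each subset and attach the result to the branch node
        branch["children"][value] = create_branch(sub_rows, sub_labels, rest)
    return branch

# toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]
tree = create_branch(rows, labels, feats=[0, 1])
print(tree)
```

On this toy data the first feature separates the two classes perfectly, so the tree has a single split on feature 0.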

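Feature Selection

Goal: select the features that are actually able to separate the classes in the training set.
Criteria: information gain, information gain ratio, Gini index.
Method: split the dataset on each candidate feature and measure how much the information (impurity) changes between before and after the split; the feature that produces the largest change is the best choice.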
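
For reference, the three criteria can be written out as follows. The notation is my own assumption, following the usual textbook convention: D is the dataset, C_k is the subset of D belonging to class k, and D_1, …, D_n are the subsets produced by splitting D on feature A.

```latex
\begin{align}
H(D)        &= -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}
            && \text{(entropy)} \\
H(D \mid A) &= \sum_{i=1}^{n} \frac{|D_i|}{|D|}\, H(D_i)
            && \text{(conditional entropy after the split)} \\
g(D, A)     &= H(D) - H(D \mid A)
            && \text{(information gain)} \\
g_R(D, A)   &= \frac{g(D, A)}{H_A(D)}, \quad
               H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}
            && \text{(information gain ratio)} \\
\mathrm{Gini}(D) &= 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^{2}
            && \text{(Gini index)}
\end{align}
```

Information gain and gain ratio are maximized, while the Gini index of the resulting split is minimized.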

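Tree Generation

Feature selection: choose the feature with the largest information gain (ID3), the largest information gain ratio (C4.5), or the smallest Gini index (CART).
Generation process: starting from the root node, keep selecting the locally optimal feature (a greedy procedure).
Tree generation method (ID3 algorithm): apply the information gain criterion to select a feature at each node of the tree, and build the tree recursively.
1. Starting from the root node, compute the information gain of every candidate feature at the node and choose the feature with the largest information gain as the node's feature.
2. Create one child node for each distinct value of that feature, then recursively apply the same procedure to each child node.
3. Stop when the information gain of every remaining feature is very small or no features are left to choose from.
4. The result is a decision tree.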
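
The four ID3 steps above can be sketched for categorical features as follows. This is a minimal illustration, not a reference implementation: the function names, the nested-dict tree layout, the `eps` threshold for "very small" gain, and the toy weather data are all my own assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feat):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[feat], []).append(y)
    cond = sum(len(ys) / n * entropy(ys) for ys in subsets.values())
    return entropy(labels) - cond

def id3(rows, labels, feats, eps=1e-6):
    if len(set(labels)) == 1:                 # pure node
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not feats:                             # step 3: no features left
        return majority
    # step 1: pick the feature with the largest information gain
    gains = {f: info_gain(rows, labels, f) for f in feats}
    best = max(gains, key=gains.get)
    if gains[best] < eps:                     # step 3: gain is very small
        return majority
    # step 2: one child per value of the chosen feature, built recursively
    node = {"feature": best, "children": {}}
    rest = [f for f in feats if f != best]
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx], rest, eps
        )
    return node

def predict(node, row):
    # leaves are plain labels, internal nodes are dicts
    while isinstance(node, dict):
        node = node["children"][row[node["feature"]]]
    return node

# toy data: (outlook, windy) -> play?
rows = [("sunny", False), ("sunny", True), ("overcast", False),
        ("rainy", False), ("rainy", True)]
labels = ["no", "no", "yes", "yes", "no"]
tree = id3(rows, labels, feats=[0, 1])
print(tree)
print(predict(tree, ("rainy", True)))
```

Here the outlook feature has the larger information gain at the root, and only the "rainy" branch needs a second split on the windy feature.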

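Tree Pruning

Goal: pruning prevents the tree from overfitting and improves its generalization ability.
Approaches: pre-pruning (stop growing a branch early during construction) and post-pruning (grow the full tree first, then remove or collapse subtrees).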
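
To make the two approaches concrete, here is a rough sketch of each. Everything in it is an assumption for illustration: the nested-dict tree with a stored `majority` label per node, the limit values in `should_stop`, and the use of a simple reduced-error style rule (collapse a subtree if accuracy on held-out data does not drop) for post-pruning.

```python
def predict(node, row):
    # internal nodes are dicts; anything else is a leaf label
    while isinstance(node, dict):
        node = node["children"].get(row[node["feature"]], node["majority"])
    return node

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)

def should_stop(depth, n_samples, gain, max_depth=3, min_samples=2, min_gain=1e-3):
    """Pre-pruning: refuse to split further once any limit is hit."""
    return depth >= max_depth or n_samples < min_samples or gain < min_gain

def prune(node, root, rows, labels):
    """Post-pruning: bottom-up, replace a subtree by its majority label
    when that does not lower accuracy on the validation data."""
    if not isinstance(node, dict):
        return
    for value, child in list(node["children"].items()):
        prune(child, root, rows, labels)       # prune deeper levels first
        if isinstance(child, dict):
            before = accuracy(root, rows, labels)
            node["children"][value] = child["majority"]
            if accuracy(root, rows, labels) < before:
                node["children"][value] = child  # collapsing hurt: undo

# a hand-built tree whose "a" subtree overfits
tree = {
    "feature": 0, "majority": "yes",
    "children": {
        "a": {"feature": 1, "majority": "yes",
              "children": {"x": "yes", "y": "no"}},
        "b": "no",
    },
}
val_rows = [("a", "x"), ("a", "y"), ("b", "x")]
val_labels = ["yes", "yes", "no"]
prune(tree, tree, val_rows, val_labels)
print(tree["children"]["a"])
```

In this example the split on feature 1 misclassifies one validation sample, so the "a" subtree is collapsed into a single leaf.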

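Advantages and Disadvantages

Advantages:
1. They use a visual model.
2. They are easy to understand and interpret.
3. They can be combined with other decision techniques.
4. They are useful even with very little data.
5. They can be used to determine the worst, best, and expected values for different scenarios under analysis.
Disadvantages:
1. They are unstable: a small change in the data can lead to a large change in the structure of the optimal decision tree.
2. They are often relatively inaccurate; many other predictors perform better on similar data.
3. For data containing categorical variables with different numbers of levels, information gain is biased toward the attributes with more levels.
4. Calculations can become very complex, particularly if many values are uncertain and/or many outcomes are linked.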
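
Disadvantage 3 is easy to see with a small experiment. In the sketch below (toy data and function names are my own assumptions), a meaningless `row_id` attribute with one level per sample gets the maximum information gain, while the gain ratio used by C4.5 divides by the split information and prefers the genuinely useful attribute.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(feature, labels):
    n = len(labels)
    groups = {}
    for v, y in zip(feature, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(feature, labels):
    # split information = entropy of the feature's own value distribution
    split_info = entropy(feature)
    return info_gain(feature, labels) / split_info if split_info > 0 else 0.0

labels  = ["yes", "yes", "no", "no", "yes", "no"]
row_id  = [1, 2, 3, 4, 5, 6]                       # six levels, no real meaning
weather = ["sunny", "sunny", "rainy", "rainy", "sunny", "sunny"]  # two levels

print(info_gain(row_id, labels), info_gain(weather, labels))
print(gain_ratio(row_id, labels), gain_ratio(weather, labels))
```

Information gain ranks `row_id` first because every one-sample subset is trivially pure; the gain ratio reverses the ranking.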

Implementation

Implementation 1 (Python, dt.py)

import numpy as np

class Node:
    def __init__(self, left, right, rule):
        self.left = left
        self.right = right
        self.feature = rule[0]
        self.threshold = rule[1]

class Leaf:
    def __init__(self, value):
        self.value = value

class DecisionTree:
    def __init__(self,classifier=True,max_depth=None,
        n_feats=None,criterion="entropy",seed=None):
        if seed is not None:  # `if seed:` would silently ignore seed=0
            np.random.seed(seed)

        self.depth = 0
        self.root = None

        self.n_feats = n_feats
        self.criterion = criterion
        self.classifier = classifier
        self.max_depth = max_depth if max_depth else np.inf

        if not classifier and criterion in ["gini", "entropy"]:
            raise ValueError(
                "{} is a valid criterion only when classifier = True.".format(criterion)
            )
        if classifier and criterion == "mse":
            raise ValueError("`mse` is a valid criterion only when classifier = False.")

    def fit(self, X, Y):
        self.n_classes = max(Y) + 1 if self.classifier else None
        self.n_feats = X.shape[1] if not self.n_feats else min(self.n_feats, X.shape[1])
        self.root = self._grow(X, Y)

    def predict(self, X):
        return np.array([self._traverse(x, self.root) for x in X])

    def predict_class_probs(self, X):
        assert self.classifier, "`predict_class_probs` undefined for classifier = False"
        return np.array([self._traverse(x, self.root, prob=True) for x in X])

    def _grow(self, X, Y, cur_depth=0):
        # if all labels are the same, return a leaf
        if len(set(Y)) == 1:
            if self.classifier:
                prob = np.zeros(self.n_classes)
                prob[Y[0]] = 1.0
            return Leaf(prob) if self.classifier else Leaf(Y[0])

        # if we have reached max_depth, return a leaf
        if cur_depth >= self.max_depth:
            v = np.mean(Y, axis=0)
            if self.classifier:
                v = np.bincount(Y, minlength=self.n_classes) / len(Y)
            return Leaf(v)

        cur_depth += 1
        self.depth = max(self.depth, cur_depth)

        N, M = X.shape
        feat_idxs = np.random.choice(M, self.n_feats, replace=False)

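        # greedily select the best split according to `criterion`
        feat, thresh = self._segment(X, Y, feat_idxs)
        l = np.argwhere(X[:, feat] <= thresh).flatten()
        r = np.argwhere(X[:, feat] > thresh).flatten()

        # no split separates the samples (e.g. identical rows with different
        # labels): return a leaf instead of recursing on an empty child
        if len(l) == 0 or len(r) == 0:
            v = np.mean(Y, axis=0)
            if self.classifier:
                v = np.bincount(Y, minlength=self.n_classes) / len(Y)
            return Leaf(v)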

        # grow the children that result from the split
        left = self._grow(X[l, :], Y[l], cur_depth)
        right = self._grow(X[r, :], Y[r], cur_depth)
        return Node(left, right, (feat, thresh))

    def _segment(self, X, Y, feat_idxs):
        # find the (feature, threshold) pair with the largest impurity gain
        best_gain = -np.inf
        split_idx, split_thresh = None, None
        for i in feat_idxs:
            vals = X[:, i]
            levels = np.unique(vals)
            thresholds = (levels[:-1] + levels[1:]) / 2 if len(levels) > 1 else levels
            gains = np.array([self._impurity_gain(Y, t, vals) for t in thresholds])

            if gains.max() > best_gain:
                split_idx = i
                best_gain = gains.max()
                split_thresh = thresholds[gains.argmax()]

        return split_idx, split_thresh

    def _impurity_gain(self, Y, split_thresh, feat_values):
        """
        Compute the impurity gain associated with a given split.
        IG(split) = loss(parent) - weighted_avg[loss(left_child), loss(right_child)]
        """
        if self.criterion == "entropy":
            loss = entropy
        elif self.criterion == "gini":
            loss = gini
        elif self.criterion == "mse":
            loss = mse

        parent_loss = loss(Y)

        # generate split
        left = np.argwhere(feat_values <= split_thresh).flatten()
        right = np.argwhere(feat_values > split_thresh).flatten()

        if len(left) == 0 or len(right) == 0:
            return 0

        # compute the weighted avg. of the loss for the children
        n = len(Y)
        n_l, n_r = len(left), len(right)
        e_l, e_r = loss(Y[left]), loss(Y[right])
        child_loss = (n_l / n) * e_l + (n_r / n) * e_r

        # impurity gain is difference in loss before vs. after split
        ig = parent_loss - child_loss
        return ig

    def _traverse(self, X, node, prob=False):
        if isinstance(node, Leaf):
            if self.classifier:
                return node.value if prob else node.value.argmax()
            return node.value
        if X[node.feature] <= node.threshold:
            return self._traverse(X, node.left, prob)
        return self._traverse(X, node.right, prob)

def mse(y):
    return np.mean((y - np.mean(y)) ** 2)

def entropy(y):
    hist = np.bincount(y)
    ps = hist / np.sum(hist)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])

def gini(y):
    hist = np.bincount(y)
    N = np.sum(hist)
    return 1 - sum([(i / N) ** 2 for i in hist])
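
To see what `_segment` and `_impurity_gain` compute, here is a small standalone walk-through of the same idea on one numeric feature. It is only a sketch: the toy arrays and the helper name `gain` are my own, and `entropy` is redefined so the snippet runs by itself. Candidate thresholds are the midpoints between consecutive distinct feature values, and the chosen split is the one with the largest drop in entropy.

```python
import numpy as np

def entropy(y):
    ps = np.bincount(y) / len(y)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])

# one feature column and its integer class labels
x = np.array([2.0, 3.0, 10.0, 11.0])
y = np.array([0, 0, 1, 1])

# candidate thresholds: midpoints between consecutive distinct values
levels = np.unique(x)
thresholds = (levels[:-1] + levels[1:]) / 2

def gain(t):
    """Parent entropy minus the size-weighted entropy of the two children."""
    left, right = y[x <= t], y[x > t]
    child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - child

gains = np.array([gain(t) for t in thresholds])
best = thresholds[gains.argmax()]
print(thresholds, gains, best)
```

The middle threshold separates the two classes completely, so it recovers the full parent entropy of 1 bit and is selected.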