Recently I've been reading about decision trees and Gradient Boosting Machines, with 《统计学习方法》 (Statistical Learning Methods) as the main reference.
Decision Trees
A decision tree can be seen as a sequence of rules that split the data space, also called "if-then" rules. Take the shipwreck survival classification problem: the input variables are {sex, cabin class} and the output is {survived, died}; a decision tree partitions the data into subspaces along such rules.
These "if-then" rules divide the data space into non-overlapping subspaces whose samples respond as uniformly as possible in the output, and then a constant is fit as the output of each subspace: a classification tree uses the majority class of the samples in the subspace, while a regression tree uses the mean of the samples' output values. Once the split variables and split values are fixed, the decision tree is fixed.
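As a quick illustration of the two kinds of leaf constants (a toy example of my own, not from the book):

```python
import numpy as np
from collections import Counter

# Toy subspace: outputs of the samples that fall into one region
y = np.array([1.2, 0.8, 1.0, 1.4])

# Regression tree: the leaf constant is the mean of the outputs
print(y.mean())  # 1.1

# Classification tree: the leaf constant is the majority class
labels = ['survived', 'survived', 'died']
print(Counter(labels).most_common(1)[0][0])  # 'survived'
```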
Store the tree structure in a dictionary (an example follows the list):
- split feature
- split value
- left subtree
- right subtree
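With the keys used in the code below ('spInd', 'spVal', 'left', 'right'), a fitted tree might look like this (the values here are invented for illustration):

```python
tree = {
    'spInd': 1,     # index of the split feature
    'spVal': 0.5,   # split value
    'right': 1.02,  # a leaf: its constant output
    'left': {       # an internal node: another subtree
        'spInd': 2,
        'spVal': 10.0,
        'right': 0.35,
        'left': -0.12,
    },
}
```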
“createTree” pseudocode:
```
if we can't split the data, this node becomes a leaf node
make a binary split of the data
call createTree() on the right split of the data
call createTree() on the left split of the data
return the tree
```
This requires functions that implement the following:
- binarySplitDataSet(): split the data into left and right subsets by a split variable and split value;
- regLeaf(): compute the leaf output (the mean of the samples' output values);
```python
from numpy import nonzero, mean, var, shape, inf

def binarySplitDataSet(dataSet, feature, value):
    # nonzero() returns the indices of the non-zero elements of an array,
    # so these lines pick the rows above / at-or-below the split value
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1

def regLeaf(dataSet):
    # the constant used for each leaf: the mean of the target (column 0)
    return mean(dataSet[:, 0])
```
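A quick sanity check on a toy array (the data is made up for illustration; column 0 is the target, column 1 a feature):

```python
from numpy import array

data = array([[1.0, 0.2],
              [1.2, 0.4],
              [3.0, 0.9],
              [3.4, 0.8]])

right, left = binarySplitDataSet(data, feature=1, value=0.5)
print(right)           # the two rows with feature 1 > 0.5
print(left)            # the two rows with feature 1 <= 0.5
print(regLeaf(right))  # 3.2, the mean target of the right subset
```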
```python
def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)
    if feat is None:
        # no worthwhile split: this node is a leaf, val is its constant
        return val
    regTree = {}
    regTree['spInd'] = feat
    regTree['spVal'] = val
    rset, lset = binarySplitDataSet(dataSet, feat, val)
    regTree['right'] = createTree(rset, leafType, errType, ops)
    regTree['left'] = createTree(lset, leafType, errType, ops)
    return regTree
```
CART chooses the best split variable and split value with the Gini index (Breiman et al., 1984). Here we don't compute and compare Gini indices yet; we first look at a simpler version: choose the split that minimizes the sum of the squared errors of the left and right subtrees. The Gini index calculation is introduced in the next part.
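In the notation of 《统计学习方法》, splitting on variable $j$ at value $s$ defines the regions $R_1(j,s) = \{x \mid x_j \le s\}$ and $R_2(j,s) = \{x \mid x_j > s\}$, and the best split solves

$$\min_{j,\,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right]$$

where the inner minima are attained at the region means $c_1$, $c_2$, which is exactly what regLeaf returns.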
“chooseBestSplit” pseudocode:
```
if all the target values are the same
    return None, mean of the target
for every feature:
    for every unique value:
        call "binarySplitDataSet" to split the dataset into two regions
        if either region has fewer samples than the threshold
            continue
        measure the error of the two regions
        if the error is less than bestError
            update bestSplit point and bestError
```
The code:
```python
def regErr(dataSet):
    # total squared error of the target: variance * number of samples
    return var(dataSet[:, 0]) * shape(dataSet)[0]
```
```python
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    '''
    leafType: leaf value function, could be a constant or a linear func
    errType:  error function
    ops: (minimum decrease in error, minimum number of samples per leaf)
    '''
    tolS = ops[0]
    tolN = ops[1]
    # if all the target values are the same, return a leaf
    if len(set(dataSet[:, 0].T.tolist())) == 1:  # set() drops duplicates
        return None, leafType(dataSet)
    m, n = shape(dataSet)
    S = errType(dataSet)
    bestS = inf
    bestIndex = 0
    bestVal = 0
    for featIndex in range(1, n):  # column 0 is the target
        for splitVal in set(dataSet[:, featIndex]):
            r, l = binarySplitDataSet(dataSet, featIndex, splitVal)
            if (shape(r)[0] < tolN) or (shape(l)[0] < tolN):
                continue
            newS = errType(r) + errType(l)
            if newS < bestS:
                bestIndex = featIndex
                bestS = newS
                bestVal = splitVal
    # if the decrease (S - bestS) is less than the threshold,
    # don't do this split and return a leaf
    if (S - bestS) < tolS:
        return None, leafType(dataSet)
    r, l = binarySplitDataSet(dataSet, bestIndex, bestVal)
    # if either subspace has fewer samples than the threshold,
    # don't do this split and return a leaf
    if (shape(r)[0] < tolN) or (shape(l)[0] < tolN):
        return None, leafType(dataSet)
    return bestIndex, bestVal
```
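Putting the pieces together on a toy regression dataset (the data and the predict helper below are my own additions for illustration, not from the book):

```python
from numpy import array

# column 0: target y; column 1: feature x, with a jump at x = 0.5
toy = array([[0.1, 0.05], [0.2, 0.15], [0.15, 0.25], [0.1, 0.35],
             [1.1, 0.55], [1.0, 0.65], [0.9, 0.75], [1.05, 0.85]])

tree = createTree(toy, ops=(0.01, 2))
print(tree)  # e.g. {'spInd': 1, 'spVal': 0.35, 'right': 1.0125, 'left': 0.1375}

def predict(tree, x):
    # walk the dict until we hit a leaf (a plain number)
    if not isinstance(tree, dict):
        return tree
    subtree = tree['right'] if x[tree['spInd']] > tree['spVal'] else tree['left']
    return predict(subtree, x)

print(predict(tree, [None, 0.7]))  # falls in the right region: 1.0125
```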
Gini index
Suppose dataset $D$ has $K$ classes, and the probability that a sample belongs to class $k$ is $p_k$. The Gini value, which reflects the "purity" of $D$, is

$$\mathrm{Gini}(D) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

The smaller the Gini value, the purer the dataset $D$. If we split $D$ on a feature at value splVal into $V$ subsets $D^1, \dots, D^V$ (usually two), then the Gini index is

$$\mathrm{Gini\_index}(D, \mathrm{feature}) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, \mathrm{Gini}(D^v)$$
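A minimal numpy sketch of both formulas (my own illustration, reusing binarySplitDataSet from above; class labels are assumed to sit in column 0, matching the column convention of the earlier code):

```python
from numpy import array, unique, shape

def gini(dataSet):
    # Gini(D) = 1 - sum_k p_k^2, with class labels in column 0
    _, counts = unique(dataSet[:, 0], return_counts=True)
    p = counts / shape(dataSet)[0]
    return 1.0 - (p ** 2).sum()

def giniIndex(dataSet, feature, value):
    # weighted Gini of the two subsets produced by a binary split
    r, l = binarySplitDataSet(dataSet, feature, value)
    n = shape(dataSet)[0]
    return shape(r)[0] / n * gini(r) + shape(l)[0] / n * gini(l)

D = array([[0, 1.0], [0, 2.0], [1, 3.0], [1, 4.0]])
print(gini(D))               # 0.5 for a 50/50 mix of two classes
print(giniIndex(D, 1, 2.0))  # 0.0: this split separates the classes perfectly
```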