CART: Classification and Regression Trees

This post walks through how a CART (Classification And Regression Tree) is built, including how the best split is selected by minimizing the sum of squared errors, and the role of the Gini index in choosing the split feature and split value. It also covers tree pruning, both pre-pruning and post-pruning, and how a loss function drives the pruning decision: by computing the tree's loss function and g(t) for each internal node, the optimal subtree is found.

I have recently been reading about decision trees and the Gradient Boosting Machine, with 《统计学习方法》 (Statistical Learning Methods) as the main reference.

Decision Trees

A decision tree can be viewed as a series of rules that partition the data space, also known as "if-then" rules. Take the classification problem of surviving a shipwreck: with inputs {gender, cabin class} and outputs {survived, died}, a decision tree partitions the data into subspaces rule by rule. A rough sketch of such rules is given below.
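As a hand-written illustration (my own, not from the post), such rules might look like the following, assuming each passenger is a dict with hypothetical 'sex' and 'pclass' fields:

def classify(passenger):
    # nested if-then rules in the spirit of the shipwreck example
    if passenger['sex'] == 'female':
        return 'survived'
    if passenger['pclass'] == 1:     # first-class cabin
        return 'survived'
    return 'died'

print(classify({'sex': 'male', 'pclass': 3}))    # died

A learned tree encodes exactly this kind of nested if-then structure, except that the conditions are chosen from the data rather than written by hand.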

[Figure: an example tree, showing the split rules and the tree-shaped structure]

These "if-then" rules divide the data space into non-overlapping subspaces whose outputs are as homogeneous as possible, and a constant is then fit as the output of each subspace: a classification tree uses the majority class of the samples in the subspace as the constant, while a regression tree uses the mean of the samples' output variable. Once the split variables and split values are determined, the decision tree is determined. A minimal sketch of both kinds of leaf constant follows.
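A minimal sketch (mine, not from the post) of the two leaf constants, given the outputs of the samples that fall into a leaf:

from collections import Counter
from numpy import mean

def classification_leaf(labels):
    # majority class of the samples in the subspace
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(values):
    # mean of the samples' output variable in the subspace
    return mean(values)

print(classification_leaf(['died', 'survived', 'survived']))   # survived
print(regression_leaf([1.0, 2.0, 3.0]))                        # 2.0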

The tree structure is stored in a dict with four fields (an example instance is shown after this list):

  • split feature
  • split value
  • left subtree
  • right subtree
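For instance, a two-level tree could look like this (the numbers here are made up for illustration; the keys match the createTree code below):

tree = {
    'spInd': 1,        # split feature index
    'spVal': 0.5,      # split value
    'right': 1.8,      # leaf: constant output of the right subspace
    'left': {          # internal node: a dict of the same shape
        'spInd': 2,
        'spVal': -0.3,
        'right': 0.2,
        'left': -1.1,
    },
}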

Pseudocode for createTree:

find the best feature and value to split on
if we can't split the data, this node becomes a leaf node
make a binary split of the data
call createTree() on the right split of the data
call createTree() on the left split of the data
return the tree

To do this we need functions that implement the following:
Function binarySplitDataSet(), which splits the data into left and right subsets according to the split variable and split value;
Function regLeaf(), which computes the leaf output (the mean of the sample outputs).

from numpy import nonzero, mean, var, shape, inf

def binarySplitDataSet(dataSet, feature, value):
    # nonzero() returns the indices of the non-zero (True) elements,
    # so mat0 holds the rows with feature > value and mat1 the rest
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1

def regLeaf(dataSet):
    # the constant used for each leaf: the mean of the targets (column 0)
    return mean(dataSet[:, 0])

def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    # NOTE: regErr and chooseBestSplit are defined in the next section
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)
    if feat is None:          # no worthwhile split: return the leaf value
        return val
    regTree = {}
    regTree['spInd'] = feat
    regTree['spVal'] = val
    rset, lset = binarySplitDataSet(dataSet, feat, val)
    regTree['right'] = createTree(rset, leafType, errType, ops)
    regTree['left'] = createTree(lset, leafType, errType, ops)
    return regTree

For classification trees, CART selects the best split variable and split value with the Gini index (Breiman et al., 1984). We will not compute and compare Gini indices here yet; let us first look at a simpler, regression version: choose the split for which the sum of the squared errors of the left and right subsets is smallest. The Gini index computation is introduced in the next part.
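In the notation of 《统计学习方法》, this criterion picks the split variable $j$ and split point $s$ that solve

$$\min_{j,\,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i-c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i-c_2)^2\right],$$

where $R_1(j,s)$ and $R_2(j,s)$ are the two subspaces produced by the split, and each inner minimum is attained at the subspace mean, $\hat c_m = \mathrm{mean}(y_i \mid x_i \in R_m)$. This is exactly what the errType function below measures.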

"chooseBestSplit" pseudocode:
if all the targets are the same value
  return None, mean of the target
for every feature:
  for every unique value:
    call binarySplitDataSet() to split the dataset into two regions
    if the number of samples in either region is less than the threshold
      continue
    measure the error of the two regions
    if the error is less than bestError
      update bestSplit point and bestError
if the decrease in error is less than a threshold
  return None, mean of the target
return bestSplit point and value

The code is as follows:

def regErr(dataSet):
    # total squared error: variance * number of samples
    return var(dataSet[:, 0]) * shape(dataSet)[0]

def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    '''
    leafType: leaf-node value function, could be a constant or a linear func
    errType: error-measuring function
    ops: (minimum decrease in error, minimum number of samples in a leaf)
    '''
    tolS = ops[0]
    tolN = ops[1]
    # if all the targets are the same value, return a leaf
    if len(set(dataSet[:, 0].T.tolist())) == 1:  # set() drops duplicates
        return None, leafType(dataSet)

    m, n = shape(dataSet)
    S = errType(dataSet)
    bestS = inf
    bestIndex = 0
    bestVal = 0
    for featIndex in range(1, n):   # column 0 is the target
        for splitVal in set(dataSet[:, featIndex]):
            r, l = binarySplitDataSet(dataSet, featIndex, splitVal)
            if (shape(r)[0] < tolN) or (shape(l)[0] < tolN):
                continue
            newS = errType(r) + errType(l)
            if newS < bestS:
                bestIndex = featIndex
                bestS = newS
                bestVal = splitVal
    # if the decrease (S - bestS) is less than a threshold,
    # don't do this split and return a leaf
    if (S - bestS) < tolS:
        return None, leafType(dataSet)
    r, l = binarySplitDataSet(dataSet, bestIndex, bestVal)
    # if either subspace has fewer samples than the threshold,
    # don't do this split and return a leaf
    if (shape(r)[0] < tolN) or (shape(l)[0] < tolN):
        return None, leafType(dataSet)
    return bestIndex, bestVal
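To check that everything hangs together, here is a small usage sketch (mine) on synthetic data in which the target (column 0) is a noisy step function of feature 1:

import numpy as np

np.random.seed(0)
x = np.random.rand(200, 1)                               # one feature in [0, 1)
y = np.where(x > 0.5, 1.0, -1.0) + 0.1 * np.random.randn(200, 1)
dataSet = np.hstack((y, x))                              # target in column 0

tree = createTree(dataSet, ops=(1, 10))
print(tree)   # e.g. {'spInd': 1, 'spVal': 0.49..., 'right': 0.99..., 'left': -1.0...}

The tree should split feature 1 close to 0.5 and return leaf means near +1 and -1; with tolS = 1, the residual noise does not justify any further split.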

Gini Index

Suppose dataset $D$ contains $K$ classes, and the probability that a sample belongs to class $k$ is $p_k$. The Gini value, defined below, reflects the "purity" of $D$:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2$$

The smaller the Gini value, the purer the dataset $D$. If a feature partitions $D$ at a split value splVal into $V$ subsets $D^1, \dots, D^V$ (usually two subsets), then the Gini index of the split is

$$\mathrm{Gini\_index}(D, \mathrm{feature}) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v),$$

and the feature and split value with the smallest Gini index are chosen. A minimal computation sketch follows.
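A minimal sketch (mine, not from the post) of both quantities for a binary split:

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(left_labels, right_labels):
    # |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

print(gini(['a', 'a', 'b', 'b']))           # 0.5
print(gini_index(['a', 'a'], ['b', 'b']))   # 0.0, a perfect split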