Recently I've been reading about decision trees and Gradient Boosting Machines, with 《统计学习方法》 (Statistical Learning Methods) as the main reference.
Decision Trees
A decision tree can be seen as a sequence of rules that split the data space, also called "if-then" rules. Take the shipwreck survival classification problem: the input variables are {sex, cabin class} and the output is {survived, died}; a decision tree partitions the data into subspaces along such rules.
These "if-then" rules divide the data space into non-overlapping subspaces whose samples respond as uniformly as possible in the output, and then a constant is fit as the output of each subspace: a classification tree uses the majority class of the samples in the subspace, while a regression tree uses the mean of the samples' output values. Once the split variables and split values are fixed, the decision tree is fixed.
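As a quick illustration of the two kinds of leaf constants (a toy example of my own, not from the book):

```python
import numpy as np
from collections import Counter

# Toy subspace: outputs of the samples that fall into one region
y = np.array([1.2, 0.8, 1.0, 1.4])

# Regression tree: the leaf constant is the mean of the outputs
print(y.mean())  # 1.1

# Classification tree: the leaf constant is the majority class
labels = ['survived', 'survived', 'died']
print(Counter(labels).most_common(1)[0][0])  # 'survived'
```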
Store the tree structure in a dictionary (an example follows the list):
- split feature
- split value
- left subtree
- right subtree
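With the keys used in the code below ('spInd', 'spVal', 'left', 'right'), a fitted tree might look like this (the values here are invented for illustration):

```python
tree = {
    'spInd': 1,     # index of the split feature
    'spVal': 0.5,   # split value
    'right': 1.02,  # a leaf: its constant output
    'left': {       # an internal node: another subtree
        'spInd': 2,
        'spVal': 10.0,
        'right': 0.35,
        'left': -0.12,
    },
}
```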
“createTree” pseudocode:
```
if we can't split the data, this node becomes a leaf node
make a binary split of the data
call createTree() on the right split of the data
call createTree() on the left split of the data
return the tree
```
This requires functions that implement the following:
- binarySplitDataSet(): split the data into left and right subsets by a split variable and split value;
- regLeaf(): compute the leaf output (the mean of the samples' output values);
```python
from numpy import nonzero, mean, var, shape, inf

def binarySplitDataSet(dataSet, feature, value):
    # nonzero() returns the indices of the non-zero elements of an array,
    # so these lines pick the rows above / at-or-below the split value
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1

def regLeaf(dataSet):
    # the constant used for each leaf: the mean of the target (column 0)
    return mean(dataSet[:, 0])
```
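A quick sanity check on a toy array (the data is made up for illustration; column 0 is the target, column 1 a feature):

```python
from numpy import array

data = array([[1.0, 0.2],
              [1.2, 0.4],
              [3.0, 0.9],
              [3.4, 0.8]])

right, left = binarySplitDataSet(data, feature=1, value=0.5)
print(right)           # the two rows with feature 1 > 0.5
print(left)            # the two rows with feature 1 <= 0.5
print(regLeaf(right))  # 3.2, the mean target of the right subset
```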
```python
def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)
    if feat is None:
        # no worthwhile split: this node is a leaf, val is its constant
        return val
    regTree = {}
    regTree['spInd'] = feat
    regTree['spVal'] = val
    rset, lset = binarySplitDataSet(dataSet, feat, val)
    regTree['right'] = createTree(rset, leafType, errType, ops)
    regTree['left'] = createTree(lset, leafType, errType, ops)
    return regTree
```
CART chooses the best split variable and split value with the Gini index (Breiman et al., 1984). Here we don't compute and compare Gini indices yet; we first look at a simpler version: choose the split that minimizes the sum of the squared errors of the left and right subtrees. The Gini index calculation is introduced in the next part.
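In the notation of 《统计学习方法》, splitting on variable $j$ at value $s$ defines the regions $R_1(j,s) = \{x \mid x_j \le s\}$ and $R_2(j,s) = \{x \mid x_j > s\}$, and the best split solves

$$\min_{j,\,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right]$$

where the inner minima are attained at the region means $c_1$, $c_2$, which is exactly what regLeaf returns.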
“chooseBestSplit” pseudocode:
```
if all the target values are the same
    return None, mean of the target
for every feature:
    for every unique value:
        call "binarySplitDataSet" to split the dataset into two regions
        if either region has fewer samples than the threshold
            continue
        measure the error of the two regions
        if the error is less than bestError
            update bestSplit point and bestError
```
The code:
```python
def regErr(dataSet):
    # total squared error of the target: variance * number of samples
    return var(dataSet[:, 0]) * shape(dataSet)[0]
```
```python
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    '''
    leafType: leaf value function, could be a constant or a linear func
    errType:  error function
    ops: (minimum decrease in error, minimum number of samples per leaf)
    '''
    tolS = ops[0]
    tolN = ops[1]
    # if all the target values are the same, return a leaf
    if len(set(dataSet[:, 0].T.tolist())) == 1:  # set() drops duplicates
        return None, leafType(dataSet)
    m, n = shape(dataSet)
    S = errType(dataSet)
    bestS = inf
    bestIndex = 0
    bestVal = 0
    for featIndex in range(1, n):  # column 0 is the target
        for splitVal in set(dataSet[:, featIndex]):
            r, l = binarySplitDataSet(dataSet, featIndex, splitVal)
            if (shape(r)[0] < tolN) or (shape(l)[0] < tolN):
                continue
            newS = errType(r) + errType(l)
            if newS < bestS:
                bestIndex = featIndex
                bestS = newS
                bestVal = splitVal
    # if the decrease (S - bestS) is less than the threshold,
    # don't do this split and return a leaf
    if (S - bestS) < tolS:
        return None, leafType(dataSet)
    r, l = binarySplitDataSet(dataSet, bestIndex, bestVal)
    # if either subspace has fewer samples than the threshold,
    # don't do this split and return a leaf
    if (shape(r)[0] < tolN) or (shape(l)[0] < tolN):
        return None, leafType(dataSet)
    return bestIndex, bestVal
```
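Putting the pieces together on a toy regression dataset (the data and the predict helper below are my own additions for illustration, not from the book):

```python
from numpy import array

# column 0: target y; column 1: feature x, with a jump at x = 0.5
toy = array([[0.1, 0.05], [0.2, 0.15], [0.15, 0.25], [0.1, 0.35],
             [1.1, 0.55], [1.0, 0.65], [0.9, 0.75], [1.05, 0.85]])

tree = createTree(toy, ops=(0.01, 2))
print(tree)  # e.g. {'spInd': 1, 'spVal': 0.35, 'right': 1.0125, 'left': 0.1375}

def predict(tree, x):
    # walk the dict until we hit a leaf (a plain number)
    if not isinstance(tree, dict):
        return tree
    subtree = tree['right'] if x[tree['spInd']] > tree['spVal'] else tree['left']
    return predict(subtree, x)

print(predict(tree, [None, 0.7]))  # falls in the right region: 1.0125
```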
Gini index
Suppose dataset $D$ has $K$ classes, and the probability that a sample belongs to class $k$ is $p_k$. The Gini value, which reflects the "purity" of $D$, is

$$\mathrm{Gini}(D) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

The smaller the Gini value, the purer the dataset $D$. If we split $D$ on a feature at value splVal into $V$ subsets $D^1, \dots, D^V$ (usually two), then the Gini index is

$$\mathrm{Gini\_index}(D, \mathrm{feature}) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, \mathrm{Gini}(D^v)$$
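A minimal numpy sketch of both formulas (my own illustration, reusing binarySplitDataSet from above; class labels are assumed to sit in column 0, matching the column convention of the earlier code):

```python
from numpy import array, unique, shape

def gini(dataSet):
    # Gini(D) = 1 - sum_k p_k^2, with class labels in column 0
    _, counts = unique(dataSet[:, 0], return_counts=True)
    p = counts / shape(dataSet)[0]
    return 1.0 - (p ** 2).sum()

def giniIndex(dataSet, feature, value):
    # weighted Gini of the two subsets produced by a binary split
    r, l = binarySplitDataSet(dataSet, feature, value)
    n = shape(dataSet)[0]
    return shape(r)[0] / n * gini(r) + shape(l)[0] / n * gini(l)

D = array([[0, 1.0], [0, 2.0], [1, 3.0], [1, 4.0]])
print(gini(D))               # 0.5 for a 50/50 mix of two classes
print(giniIndex(D, 1, 2.0))  # 0.0: this split separates the classes perfectly
```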