The regression methods introduced in the previous section are mainly suited to linear problems; once the dataset grows large and the number of features increases, they become impractical. This section introduces CART (Classification And Regression Trees) for regression, covering two kinds of trees: regression trees and model trees.
While studying CART, it helps to review the decision-tree post from earlier in this series:
http://blog.csdn.net/sunnyxiaohu/article/details/50826016
I. Regression Trees
Each leaf node holds a single value: the mean of the target values that fall into that leaf.
Algorithm principle: find the best feature and value to split on; if the node cannot usefully be split (all targets are identical, a subset would be too small, or the error reduction is too small), store it as a leaf; otherwise perform a binary split and recurse on each half.
Main implementation steps:
1. Split the dataset on a given feature dimension and value
from numpy import *

def binSplitDataSet(dataSet, feature, value):
    # partition dataSet (a numpy matrix whose last column is the target)
    # into the rows where the chosen feature exceeds value, and the rest
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1
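A quick way to sanity-check the split (the 4x4 test matrix here is a hypothetical example, not from the original post):

testMat = mat(eye(4))
mat0, mat1 = binSplitDataSet(testMat, 1, 0.5)
print(mat0)  # the single row whose column-1 entry exceeds 0.5: [[0. 1. 0. 0.]]
print(mat1)  # the remaining three rows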
2. Build leaf nodes and compute the corresponding total squared error
def regLeaf(dataSet):
    # a leaf stores the mean of the target values in this subset
    return mean(dataSet[:, -1])

def regErr(dataSet):
    # total squared error of the subset: variance of the targets
    # times the number of samples
    return var(dataSet[:, -1]) * shape(dataSet)[0]
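To see what these two functions return on concrete numbers, here is a hypothetical two-column dataset (last column is the target):

toyData = mat([[1.0, 2.0],
               [2.0, 2.0],
               [3.0, 8.0],
               [4.0, 8.0]])
print(regLeaf(toyData))  # 5.0, the mean of the targets [2, 2, 8, 8]
print(regErr(toyData))   # 36.0, i.e. variance 9.0 times 4 samples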
3. Choose the best feature dimension and value to split on
Algorithm principle: for every feature, try every distinct value of that feature as a split point; split the dataset in two, measure the total squared error of the two halves, and keep the split with the lowest error. If the best split does not reduce the error by at least tolS, or either half would contain fewer than tolN samples, give up and return a leaf instead.
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    tolS = ops[0]; tolN = ops[1]  # min error reduction, min samples per side
    # exit condition 1: all target values are the same, so return a leaf
    if len(set(dataSet[:, -1].T.tolist()[0])) == 1:
        return None, leafType(dataSet)
    m, n = shape(dataSet)
    # the best feature is chosen by the reduction in RSS error from the mean
    S = errType(dataSet)
    bestS = inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n - 1):
        for splitVal in set(dataSet[:, featIndex].T.tolist()[0]):
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):
                continue
            newS = errType(mat0) + errType(mat1)
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    # exit condition 2: the best split barely reduces the error
    if (S - bestS) < tolS:
        return None, leafType(dataSet)
    # exit condition 3: the best split leaves one side with too few samples
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):
        return None, leafType(dataSet)
    return bestIndex, bestValue
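With chooseBestSplit in place, the tree itself is built by a short recursive routine. The sketch below follows the standard CART construction these helpers are designed for; storing the tree as a nested dict with 'spInd' and 'spVal' keys is an assumed representation:

def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    # find the best split; if every stop condition fired, val is a leaf value
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)
    if feat is None:
        return val
    retTree = {}
    retTree['spInd'] = feat  # index of the feature to split on
    retTree['spVal'] = val   # threshold value for that feature
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree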