Using the CART Algorithm for Regression

1. CART stands for Classification And Regression Trees. The following code is the skeleton of a regression tree:

from numpy import *  # the book's module-level import: pulls mat, nonzero, mean, var, inf into the namespace

# (in the actual module, regLeaf, regErr, and binSplitDataSet below must be
# defined before createTree, since the default arguments are evaluated at def time)
def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):
    # assume dataSet is a NumPy matrix so we can use array filtering
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)  # choose the best split
    if feat is None: return val  # the split search hit a stop condition: return a leaf value
    retTree = {}
    retTree['spInd'] = feat   # index of the feature we split on
    retTree['spVal'] = val    # value we split at
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree
The code above returns a tree represented as a dictionary. Here feat is the single best feature among the samples' many features (what "best" means is explained below), and val is a value of that feature taken from some sample. The function binSplitDataSet bisects the dataset by comparing every sample's value of that feature against val, producing a left set and a right set; these two parts are then recursed on to build the left and right subtrees, and finally the whole tree is returned. The key technique is therefore how to find the best feature feat and split value val, where "best" means minimum error: the sum over samples of the squared difference between each target value and the mean.

def binSplitDataSet(dataSet, feature, value):
    # note: the trailing [0] in the book's first printing truncated each half
    # to a single row; the errata removes it so the full partitions are kept
    mat0 = dataSet[nonzero(dataSet[:,feature] > value)[0],:]
    mat1 = dataSet[nonzero(dataSet[:,feature] <= value)[0],:]
    return mat0,mat1
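To make the split concrete, here is a minimal standalone run (using the `np` prefix rather than the book's star import; the four-row matrix is made up for illustration):

```python
import numpy as np

def binSplitDataSet(dataSet, feature, value):
    # rows whose chosen feature exceeds value go into mat0, the rest into mat1
    mat0 = dataSet[np.nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[np.nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1

# four samples: one feature column plus a target column (made-up numbers)
data = np.mat([[1.0, 0.5],
               [2.0, 1.5],
               [3.0, 2.5],
               [4.0, 3.5]])
left, right = binSplitDataSet(data, 0, 2.0)  # split on feature 0 at value 2.0
# left holds the two rows with feature 0 > 2.0, right holds the other two
```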

def regLeaf(dataSet):#returns the value used for each leaf
    return mean(dataSet[:,-1])

def regErr(dataSet):
    return var(dataSet[:,-1]) * shape(dataSet)[0]
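A quick check of these two helpers on a made-up two-sample matrix: the leaf value is the mean of the target column, and the error is the variance times the sample count, i.e. the total squared deviation from the mean:

```python
import numpy as np

def regLeaf(dataSet):
    # model for a leaf: the mean of the target (last) column
    return np.mean(dataSet[:, -1])

def regErr(dataSet):
    # total squared error: per-sample variance times the number of samples
    return np.var(dataSet[:, -1]) * np.shape(dataSet)[0]

data = np.mat([[0.0, 1.0],
               [0.0, 3.0]])
# mean of the targets is 2.0; their variance is 1.0, so total error is 1.0 * 2 = 2.0
```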
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):
    tolS = ops[0]; tolN = ops[1]
    #if all the target variables are the same value: quit and return value
    if len(set(dataSet[:,-1].T.tolist()[0])) == 1: #exit cond 1
        return None, leafType(dataSet)
    m,n = shape(dataSet)
    #the choice of the best feature is driven by Reduction in RSS error from mean
    S = errType(dataSet)
    bestS = inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n-1):
        for splitVal in set(dataSet[:,featIndex].T.tolist()[0]): # .T.tolist()[0] flattens the matrix column into a hashable list (errata fix)
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN): continue
            newS = errType(mat0) + errType(mat1)
            if newS < bestS: 
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    #if the decrease (S-bestS) is less than a threshold don't do the split
    if (S - bestS) < tolS: 
        return None, leafType(dataSet) #exit cond 2
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):  #exit cond 3
        return None, leafType(dataSet)
    return bestIndex,bestValue

regLeaf and regErr produce, respectively, the value for a leaf node and the error of a dataset. chooseBestSplit selects the best feature and split value; it contains two loops: the outer loop tries each feature in turn, and the inner loop tries each value of that feature that appears in the samples. tolS is the error-reduction tolerance: if the best split reduces the total error by less than tolS, the split changes little and a leaf is returned instead. tolN is the minimum number of samples in a partition: splitting continues only when both sides have at least tolN samples, and otherwise the split is rejected.
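As a self-contained sanity check (NumPy-prefixed so it runs on its own; the toy dataset is invented here), the split search finds the obvious breakpoint in a dataset whose target jumps from 0 to 10 after x = 3:

```python
import numpy as np

def binSplitDataSet(dataSet, feature, value):
    mat0 = dataSet[np.nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[np.nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1

def regLeaf(dataSet):
    return np.mean(dataSet[:, -1])

def regErr(dataSet):
    return np.var(dataSet[:, -1]) * np.shape(dataSet)[0]

def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    tolS, tolN = ops
    # exit cond 1: all targets identical
    if len(set(dataSet[:, -1].T.tolist()[0])) == 1:
        return None, leafType(dataSet)
    m, n = np.shape(dataSet)
    S = errType(dataSet)
    bestS, bestIndex, bestValue = np.inf, 0, 0
    for featIndex in range(n - 1):
        for splitVal in set(dataSet[:, featIndex].T.tolist()[0]):
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if np.shape(mat0)[0] < tolN or np.shape(mat1)[0] < tolN:
                continue  # a partition would be too small
            newS = errType(mat0) + errType(mat1)
            if newS < bestS:
                bestIndex, bestValue, bestS = featIndex, splitVal, newS
    # exit cond 2: the best split barely reduces the error
    if (S - bestS) < tolS:
        return None, leafType(dataSet)
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    # exit cond 3: the best split leaves too few samples on one side
    if np.shape(mat0)[0] < tolN or np.shape(mat1)[0] < tolN:
        return None, leafType(dataSet)
    return bestIndex, bestValue

# eight samples; the target is 0 for x <= 3 and 10 for x >= 4,
# so the only split satisfying tolN=4 with zero error is at x = 3
data = np.mat([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
               [4.0, 10.0], [5.0, 10.0], [6.0, 10.0], [7.0, 10.0]])
feat, val = chooseBestSplit(data)
```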
2. Final output
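The transcript below calls loadDataSet, which is not listed above; in Machine Learning in Action it simply parses a tab-delimited text file of floats, roughly:

```python
def loadDataSet(fileName):
    # each line of the file is tab-separated floats; convert every field to float
    dataMat = []
    with open(fileName) as fr:
        for line in fr:
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))
            dataMat.append(fltLine)
    return dataMat
```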

>>> import regTrees
>>> from regTrees import *
>>> from numpy import *
>>> mydata=loadDataSet('C:\\Users\\WM\\Desktop\\python\\ex0.txt')
>>> myMat=mat(mydata)
>>> createTree(myMat)
{'spInd': 1, 'spVal': matrix([[ 0.441815]]), 'right': {'spInd': 1, 'spVal': matrix([[ 0.212575]]), 'right': 3.1889351956521743, 'left': 3.5637090000000002}, 'left': {'spInd': 1, 'spVal': matrix([[ 0.808177]]), 'right': {'spInd': 1, 'spVal': matrix([[ 0.621523]]), 'right': 3.9120475757575757, 'left': 4.2337471562500006}, 'left': 4.5816485}}

The return value is a dictionary!
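Once the dictionary tree is built, prediction is a recursive descent: a node's spInd picks the feature, spVal picks the branch, and a non-dict node is a leaf. A minimal sketch (the one-split toy tree here is hypothetical, with spVal as a plain float rather than the 1x1 matrix shown in the output above):

```python
def isTree(obj):
    # internal nodes are dicts; leaves are plain numbers
    return isinstance(obj, dict)

def predict(tree, inData):
    # go left when the sample's split feature exceeds spVal, else go right
    if not isTree(tree):
        return tree
    if inData[tree['spInd']] > tree['spVal']:
        return predict(tree['left'], inData)
    return predict(tree['right'], inData)

# hypothetical one-split tree for illustration
toy = {'spInd': 0, 'spVal': 0.5, 'left': 1.0, 'right': 0.0}
```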





