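Tree Regression
The previous post covered several linear regression methods, all of which are global models. When the data has many features with complex relationships among them, a global model tends to show large bias. Many real-world problems are nonlinear, and a single global linear model is a poor tool for analyzing them. Tree regression comes in two main forms: regression trees with numeric (constant) leaves, and model trees.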
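CART Regression Trees
Decision trees are mainly used for classification and usually handle discrete data, with Shannon entropy measuring how disordered a set is. Replacing Shannon entropy with a measure suited to continuous targets (the sample code in this post uses total squared error) lets the same tree-building algorithm perform regression. CART is a widely used regression tree method.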
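To make the change of measure concrete, here is a minimal sketch of total squared error as a disorder measure for continuous targets. The helper name `total_squared_error` and the target values are made up for illustration; the only idea taken from this post is that the error is the variance multiplied by the number of samples.

```python
import numpy as np

def total_squared_error(y):
    """Sum of squared deviations from the mean, i.e. variance times sample count."""
    y = np.asarray(y, dtype=float)
    return float(np.var(y) * y.size)

# Hypothetical target values forming two tight clusters.
y = np.array([1.0, 1.2, 0.8, 5.0, 5.1, 4.9])

mixed = total_squared_error(y)                                      # both clusters in one node
split = total_squared_error(y[:3]) + total_squared_error(y[3:])    # one cluster per child node
print(mixed, split)
```

Separating the two clusters lowers the summed error sharply, which is the signal a regression tree uses in place of an entropy drop.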
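Pseudocode for building the tree
Find the best feature to split on:
    If the node cannot be split further, make it a leaf node
    Perform the binary split
    Call the tree-building method on the left subtree (recursively)
    Call the tree-building method on the right subtree (recursively)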
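In a CART tree, the best split is chosen as follows:
For each feature:
    For each value of that feature:
        Split the dataset in two
        Compute the error of the split
        If the current error is below the minimum error, make the current split the best split and update the minimum error
Return the feature and threshold of the best split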
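Sample code
from numpy import *  # the code in this post uses NumPy names (nonzero, mean, var, shape, inf, ...) unqualified

# dataSet is expected to be a NumPy matrix whose last column holds the target values
def binSplitDataSet(dataSet, feature, value):
    # Rows whose feature value is above the threshold go to mat0, the rest to mat1
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1

# Regression tree model
def regLeaf(dataSet):  # returns the value used for each leaf
    return mean(dataSet[:, -1])

def regErr(dataSet):  # total squared error: variance times number of samples
    return var(dataSet[:, -1]) * shape(dataSet)[0]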
# Tree creation and pruning
#################################################################################
# Choose the best split
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    # tolS: minimum error reduction required; tolN: minimum number of samples on each side
    tolS = ops[0]; tolN = ops[1]
    # if all the target variables are the same value: quit and return value
    if len(set((dataSet[:, -1].T.A.tolist())[0])) == 1:  # exit cond 1
        return None, leafType(dataSet)
    m, n = shape(dataSet)
    # the choice of the best feature is driven by reduction in RSS error from mean
    S = errType(dataSet)
    bestS = inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n-1):
        for splitVal in set((dataSet[:, featIndex].T.A.tolist())[0]):
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN): continue
            newS = errType(mat0) + errType(mat1)
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    # if the decrease (S-bestS) is less than a threshold don't do the split
    if (S - bestS) < tolS:
        return None, leafType(dataSet)  # exit cond 2
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):  # exit cond 3
        return None, leafType(dataSet)
    # returns the best feature to split on and the value used for that split
    return bestIndex, bestValue

def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)
    if feat is None: return val  # no worthwhile split: val is the leaf value
    retTree = {}
    retTree['spInd'] = feat
    retTree['spVal'] = val
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree
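The split search can be checked on concrete numbers. What follows is a minimal, self-contained sketch that uses plain ndarrays rather than the matrix-based functions above. The names `sse`, `best_split`, and `min_leaf` and the tiny dataset are assumptions made for this illustration, not part of the original code.

```python
import numpy as np

def sse(y):
    """Total squared error of y around its mean (0 for an empty array)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(X, y, min_leaf=1):
    """Try every (feature, value) pair and keep the one with the lowest error.

    Returns (feature_index, threshold, error); feature_index is None when
    no split leaves at least min_leaf samples on both sides.
    """
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j]):
            above = X[:, j] > v      # same test as in the post: > value versus <= value
            below = ~above
            if above.sum() < min_leaf or below.sum() < min_leaf:
                continue
            err = sse(y[above]) + sse(y[below])
            if err < best[2]:
                best = (j, float(v), err)
    return best

# Hypothetical data: the target is a clean step in feature 0; feature 1 is irrelevant.
X = np.array([[0.1, 3.0], [0.2, 1.0], [0.4, 2.0],
              [0.6, 1.0], [0.8, 3.0], [0.9, 2.0]])
y = np.array([1.0, 1.0, 1.0, 4.0, 4.0, 4.0])

feat, thresh, err = best_split(X, y)
print(feat, thresh, err)
```

On this data the search settles on feature 0 with threshold 0.4, the only split that separates the two target levels completely.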
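Tree Pruning
A tree with too many nodes has probably overfit the data. Reducing a tree's complexity to avoid overfitting is called pruning. Pruning comes in two forms, prepruning and postpruning. Prepruning appears in the sample code above as tolS (the minimum error reduction required to accept a split) and tolN (the minimum number of samples on each side of a split). These parameters have to be set by hand, and suitable values vary widely from one scenario to another. Postpruning uses a test set to prune the tree; because it does not need user-specified parameters, it is the preferable approach.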
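Pseudocode for postpruning:
Split the test data according to the existing tree:
    If either subset is a tree, recursively prune on that subset
    Compute the error after merging the two current leaf nodes
    Compute the error without merging
    If merging lowers the error, merge the leaf nodes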
Sample code
def isTree(obj):
    # Subtrees are stored as dicts; leaves are plain values
    return isinstance(obj, dict)

def getMean(tree):
    # Collapse a subtree into a single value by averaging its branches
    if isTree(tree['right']):
        tree['right'] = getMean(tree['right'])
    if isTree(tree['left']):
        tree['left'] = getMean(tree['left'])
    return (tree['left'] + tree['right'])/2.0

def prune(tree, testData):
    # No test data reaches this subtree: collapse it
    if shape(testData)[0] == 0: return getMean(tree)
    if isTree(tree['right']) or isTree(tree['left']):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
    if isTree(tree['left']):
        tree['left'] = prune(tree['left'], lSet)
    if isTree(tree['right']):
        tree['right'] = prune(tree['right'], rSet)
    # Both branches are leaves now: check whether merging them lowers the test error
    if not isTree(tree['left']) and not isTree(tree['right']):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum(power(lSet[:, -1] - tree['left'], 2)) + sum(power(rSet[:, -1] - tree['right'], 2))
        treeMean = (tree['left'] + tree['right'])/2.0
        errorMerge = sum(power(testData[:, -1] - treeMean, 2))
        if errorMerge < errorNoMerge:
            print('merging')
            return treeMean
        else:
            return tree
    else:
        return tree
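The merge decision at the heart of postpruning can be worked through on small numbers. The sketch below is illustrative only: `should_merge` is a hypothetical helper, the leaf values and test targets are invented, and the merged leaf is taken to be the plain average of the two leaves, as in the code above.

```python
import numpy as np

def should_merge(left_leaf, right_leaf, y_left, y_right):
    """Compare test-set squared error with and without merging two constant leaves."""
    err_no_merge = np.sum((y_left - left_leaf) ** 2) + np.sum((y_right - right_leaf) ** 2)
    merged = (left_leaf + right_leaf) / 2.0          # unweighted average of the two leaves
    y_all = np.concatenate([y_left, y_right])
    err_merge = np.sum((y_all - merged) ** 2)
    return bool(err_merge < err_no_merge), float(err_merge), float(err_no_merge)

# Hypothetical case 1: the leaves differ (1.0 vs 1.4), but the test targets
# on both sides sit around 1.2, so the split looks like noise.
noisy = should_merge(1.0, 1.4,
                     np.array([1.3, 1.1, 1.2]), np.array([1.2, 1.3, 1.1]))

# Hypothetical case 2: the test targets confirm two clearly different levels.
real = should_merge(1.0, 5.0,
                    np.array([1.1, 0.9]), np.array([5.2, 4.8]))

print(noisy)
print(real)
```

In the first case merging gives the lower test error, so the two leaves would be collapsed; in the second, the split is kept.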
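Model Trees
A model tree sets each leaf node to one piece of a piecewise linear function, so the model is made up of several linear segments. With small changes to the code above, the leaf nodes can produce linear models instead of constant values.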
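Sample code:
# Model tree
def linearSolve(dataSet):
    m, n = shape(dataSet)
    # asmatrix is used instead of mat, which newer NumPy releases no longer provide
    X = asmatrix(ones((m, n))); Y = asmatrix(ones((m, 1)))
    # Column 0 of X stays at 1 (bias term); copy the features of every row, not just row 0
    X[:, 1:n] = dataSet[:, 0:n-1]; Y = dataSet[:, -1]
    xTx = X.T*X
    if linalg.det(xTx) == 0.0:
        raise NameError('This matrix is singular, cannot do inverse,\n \
        try increasing the second value of ops')
    ws = xTx.I*(X.T*Y)
    return ws, X, Y

def modelLeaf(dataSet):  # a leaf stores the regression weights
    ws, X, Y = linearSolve(dataSet)
    return ws

def modelErr(dataSet):  # squared error of the linear fit on this subset
    ws, X, Y = linearSolve(dataSet)
    yHat = X*ws
    return sum(power((Y-yHat), 2))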
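To see why a linear leaf can help, here is a rough sketch that compares a constant leaf with a linear leaf on each side of a split. It uses `np.linalg.lstsq` in place of the normal-equation solve above. The helper names and the piecewise-linear data are assumptions for illustration.

```python
import numpy as np

def fit_linear_leaf(x, y):
    """Least-squares fit of y = w0 + w1*x; returns the weights and the squared error."""
    A = np.column_stack([np.ones_like(x), x])     # bias column first, as in linearSolve
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    err = float(np.sum((y - A @ w) ** 2))
    return w, err

def constant_leaf_error(y):
    """Squared error when the leaf predicts the mean of y."""
    return float(np.sum((y - y.mean()) ** 2))

# Hypothetical piecewise-linear data: slope 2 up to x = 0.5, slope -1 afterwards.
x = np.linspace(0.0, 1.0, 11)
y = np.where(x <= 0.5, 2.0 * x, 1.5 - x)
low = x <= 0.5

w_low, err_low = fit_linear_leaf(x[low], y[low])
w_high, err_high = fit_linear_leaf(x[~low], y[~low])
const_err = constant_leaf_error(y[low]) + constant_leaf_error(y[~low])

print(w_low, w_high)
print(err_low + err_high, const_err)
```

With a single split, the two linear leaves recover both segments almost exactly, while constant leaves would need many more splits to approximate the same shape.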
Prediction (works for both tree types)
Sample code:
def regTreeEval(model, inDat):
    # Regression tree leaf: the stored constant is the prediction
    return float(model)

def modelTreeEval(model, inDat):
    # Model tree leaf: prepend the bias term, then apply the stored weights
    n = shape(inDat)[1]
    X = asmatrix(ones((1, n+1)))
    X[:, 1:n+1] = inDat
    return float((X*model)[0, 0])

def treeForceCast(tree, inData, modelEval=regTreeEval):
    if not isTree(tree): return modelEval(tree, inData)
    # inData is a 1 x n matrix, so index it by row 0 and the split column
    if inData[0, tree['spInd']] > tree['spVal']:
        if isTree(tree['left']):
            return treeForceCast(tree['left'], inData, modelEval)
        else:
            return modelEval(tree['left'], inData)
    else:
        if isTree(tree['right']):
            return treeForceCast(tree['right'], inData, modelEval)
        else:
            return modelEval(tree['right'], inData)

def createForeCast(tree, testData, modelEval=regTreeEval):
    # Predict every row of testData and collect the results in a column vector
    m = len(testData)
    yHat = asmatrix(zeros((m, 1)))
    for i in range(m):
        yHat[i, 0] = treeForceCast(tree, asmatrix(testData[i]), modelEval)
    return yHat
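The traversal above can be traced by hand on a small tree. The following sketch is self-contained and uses plain ndarrays. The function `predict_one`, the hand-built tree, and the input rows are hypothetical; only the dict layout (`spInd`, `spVal`, `left`, `right`) and the "greater than goes left" rule come from the code in this post.

```python
import numpy as np

def predict_one(tree, x):
    """Walk a nested-dict regression tree until a constant leaf is reached."""
    while isinstance(tree, dict):
        # Values above the threshold follow 'left', all others follow 'right'.
        branch = 'left' if x[tree['spInd']] > tree['spVal'] else 'right'
        tree = tree[branch]
    return float(tree)

# Hypothetical hand-built tree: split on feature 0 first, then on feature 1.
tree = {'spInd': 0, 'spVal': 0.5,
        'left': {'spInd': 1, 'spVal': 2.0, 'left': 9.0, 'right': 7.0},
        'right': 1.0}

X = np.array([[0.2, 3.0],    # feature 0 <= 0.5                   -> right leaf
              [0.8, 1.0],    # feature 0 > 0.5, feature 1 <= 2.0  -> left, then right
              [0.8, 2.5]])   # feature 0 > 0.5, feature 1 > 2.0   -> left, then left
preds = [predict_one(tree, row) for row in X]
print(preds)
```

A model tree would follow the same path and then apply the leaf's weights to the input instead of returning a stored constant.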
Algorithm Characteristics
Pros: can model complex, nonlinear data
Cons: results are hard to interpret
Works with: numeric and nominal data