Pruning
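Because pessimistic error pruning (PEP), cost-complexity pruning (CCP), error-based pruning (EBP), and minimum error pruning (MEP) are all aimed at classification models, we prune with reduced error pruning (REP) instead. The basic idea is as follows: for every non-leaf subtree s of the decision tree T, try replacing s with a leaf. If the error on the dataset D of the new tree, with s replaced by a leaf, is less than or equal to the error that s produces on D, then s is replaced by the leaf. Here D is a held-out set, which in the code is the test data passed to prune. The advantages of REP are low computational complexity, small prediction bias on unseen examples, and bottom-up processing.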
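The idea above can be sketched in a few self-contained lines. This is only an illustrative sketch under stated assumptions, not the code used in this post: the names rep_prune, is_tree, collapse, split and sq_err are hypothetical, a subtree is assumed to be a dict with the keys spInd, spVal, left and right, a leaf is a plain number, the target is assumed to be the last column of each row, and rows whose feature value is greater than spVal are assumed to go left.

```python
def is_tree(node):
    """A subtree is a dict; a leaf is a plain number (the predicted value)."""
    return isinstance(node, dict)

def collapse(node):
    """Turn a subtree into one leaf: the average of its two collapsed children."""
    if not is_tree(node):
        return node
    return (collapse(node['left']) + collapse(node['right'])) / 2.0

def split(rows, sp_ind, sp_val):
    """Assumed convention: feature value > sp_val goes left, the rest goes right."""
    left = [r for r in rows if r[sp_ind] > sp_val]
    right = [r for r in rows if r[sp_ind] <= sp_val]
    return left, right

def sq_err(rows, value):
    """Sum of squared errors when predicting `value` for every row."""
    return sum((r[-1] - value) ** 2 for r in rows)

def rep_prune(node, rows):
    """Bottom-up reduced error pruning on held-out rows (modifies the tree in place)."""
    if not is_tree(node):
        return node
    if not rows:                      # no held-out data reaches this subtree
        return collapse(node)
    left_rows, right_rows = split(rows, node['spInd'], node['spVal'])
    node['left'] = rep_prune(node['left'], left_rows)
    node['right'] = rep_prune(node['right'], right_rows)
    if not is_tree(node['left']) and not is_tree(node['right']):
        err_split = sq_err(left_rows, node['left']) + sq_err(right_rows, node['right'])
        merged = (node['left'] + node['right']) / 2.0
        # Replace the subtree by a leaf when the error does not get worse.
        if sq_err(rows, merged) <= err_split:
            return merged
    return node

# Toy tree: the inner split at 0.75 does not help on the held-out rows.
tree = {'spInd': 0, 'spVal': 0.5,
        'left': {'spInd': 0, 'spVal': 0.75, 'left': 1.25, 'right': 0.75},
        'right': 0.0}
held_out = [(0.9, 1.0), (0.8, 1.0), (0.6, 1.0), (0.2, 0.0), (0.4, 0.0)]
pruned = rep_prune(tree, held_out)
print(pruned)
```

In this toy example the inner subtree is merged into a single leaf because the merged leaf fits the held-out rows at least as well, while the root split is kept because merging it would increase the error.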
The pruning code is as follows:
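# Assumes the NumPy names mat and corrcoef are in scope (e.g. from numpy import *)
# and that loadDataSet, createTree and createForeCast are already defined.
myDat2 = loadDataSet('ex2.txt')            # training data
myMat2 = mat(myDat2)
myTree = createTree(myMat2, ops=(0, 1))    # grow the tree on the training data
myDatTest = loadDataSet('ex2test.txt')     # held-out data used for pruning
myMat2Test = mat(myDatTest)
pruneTree = prune(myTree, myMat2Test)
# print("prune tree", pruneTree)
yModelHat = createForeCast(pruneTree, myMat2[:, 0])
# Correlation between the pruned tree's predictions and the actual targets
print("model tree", corrcoef(yModelHat, myMat2[:, 1], rowvar=0)[0, 1])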
The prune function is as follows:
def prune(tree, testData):
    # If no test data reaches this subtree, collapse it into a single leaf
    if shape(testData)[0] == 0: return getMean(tree)
    # If either branch is still a subtree, split the test data and prune the branches recursively
    if (isTree(tree['right']) or isTree(tree['left'])):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
    if isTree(tree['left']): tree['left'] = prune(tree['left'], lSet)
    if isTree(tree['right']): tree['right'] = prune(tree['right'], rSet)
    # If both branches are now leaves, see if we can merge them
    if not isTree(tree['left']) and not isTree(tree['right']):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])