This post covers the following topics:
- The CART algorithm
- Regression trees and model trees
- Tree-pruning algorithms
- Building a GUI in Python
Problem: many real-world problems are nonlinear, and no global linear model can fit such data well.
Solution: one workable approach is to partition the dataset into many pieces that are each easy to model, and then apply the linear regression techniques of Chapter 8 to each piece. If a piece is still hard to fit with a linear model after the first split, split it again. Under this divide-and-conquer scheme, tree structures and regression methods work well together.
9.1 Locally modeling complex data
9.2 Building trees with continuous and discrete features
While building the tree, we need a way to store several different types of data.
We're going to use a Python dictionary as our tree data structure. The dictionary will have the following four items:
- Feature: a symbol representing the feature this tree splits on.
- Value: the value of the feature used for the split.
- Right: the right subtree; this could also be a single value if the algorithm decides we don't need another split.
- Left: the left subtree, similar to the right subtree.
We'll build two types of trees:
- Regression tree: each leaf contains a single constant value.
- Model tree: each leaf contains a linear equation.
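To make this concrete, here is a hand-built toy tree using the dictionary layout above. The key names `spInd`/`spVal`/`left`/`right` match the listing later in this post; the tree and its values are made up for illustration.

```python
# A hand-built regression tree stored as nested dictionaries.
# By the convention used later in treeForeCast(), feature values
# greater than spVal go to the left branch.
tree = {
    'spInd': 0,          # index of the feature to split on
    'spVal': 0.5,        # split value
    'left': 1.2,         # leaf: a constant prediction (regression tree)
    'right': {           # right branch is itself a subtree
        'spInd': 0,
        'spVal': 0.2,
        'left': 0.6,
        'right': -0.4,
    },
}

def predict(tree, x):
    """Walk the tree until a non-dict leaf is reached."""
    while isinstance(tree, dict):
        tree = tree['left'] if x[tree['spInd']] > tree['spVal'] else tree['right']
    return tree

print(predict(tree, [0.9]))   # falls in the x > 0.5 region -> 1.2
print(predict(tree, [0.3]))   # 0.2 < x <= 0.5 -> 0.6
print(predict(tree, [0.1]))   # x <= 0.2 -> -0.4
```

A model tree would store a coefficient vector at each leaf instead of a constant, but the nesting is identical.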
9.3 Using CART for regression
To model the complex interactions in our data, we've decided to use trees to partition it. A regression tree breaks up the data using a tree with constant values on the leaf nodes. This strategy assumes that the complex interactions in the data can be summarized by the tree.
9.3.1 Building the tree
The leafType argument is a reference to a function that we use to create the leaf node.
The errType argument is a reference to a function that will be used to calculate the squared deviation from the mean described earlier.
ops is a tuple of user-defined parameters to help with tree building.
# Regression tree split functions.
# regLeaf() generates the model for a leaf node; chooseBestSplit() calls it
# when it decides to stop splitting. The model in a regression tree is simply
# the mean of the target variables.
def regLeaf(dataSet):
    return np.mean(dataSet[:,-1])

# regErr() returns the total squared error of the target variables in a given
# dataset: the variance times the number of samples.
def regErr(dataSet):
    return np.var(dataSet[:,-1]) * np.shape(dataSet)[0]
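As a sanity check on `regErr()`: `np.var` is the *mean* squared deviation from the mean, so multiplying by the sample count recovers the *total* squared error, which is the quantity the split search minimizes. A tiny made-up example:

```python
import numpy as np

data = np.array([[1.0], [2.0], [3.0], [6.0]])  # last column = target
y = data[:, -1]
total_sq_err = np.sum((y - y.mean()) ** 2)     # sum of squared deviations: 14.0
reg_err = np.var(y) * data.shape[0]            # what regErr() computes
print(total_sq_err, reg_err)                   # both 14.0
```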
# chooseBestSplit() is the real workhorse of the tree-building process.
# Its job is to find the best way to do a binary split on the data.
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):
    tolS = ops[0]  # tolS is a tolerance on the error reduction
    tolN = ops[1]  # tolN is the minimum number of instances in a split
    # exit if all target values are identical
    if len(set(dataSet[:,-1].T.tolist()[0])) == 1:
        return None, leafType(dataSet)
    m,n = np.shape(dataSet)
    S = errType(dataSet)
    bestS = np.inf
    bestIndex = 0
    bestValue = 0
    for featIndex in range(n-1):
        for splitVal in set(dataSet[:,featIndex].T.tolist()[0]):
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (np.shape(mat0)[0] < tolN) or (np.shape(mat1)[0] < tolN):
                continue
            newS = errType(mat0) + errType(mat1)
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    # if the error reduction is below tolS, don't do the split
    if (S - bestS) < tolS:
        return None, leafType(dataSet)
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    # if either side of the best split is too small, don't do the split
    if (np.shape(mat0)[0] < tolN) or (np.shape(mat1)[0] < tolN):
        return None, leafType(dataSet)
    return bestIndex, bestValue
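To see the split search in action, here is a condensed, plain-`ndarray` paraphrase of the logic above (not the book's `np.mat` version) run on made-up step data, where the best split should land exactly at the step:

```python
import numpy as np

def bin_split(data, feat, val):
    # same convention as binSplitDataSet(): "> val" rows first
    return data[data[:, feat] > val], data[data[:, feat] <= val]

def total_err(data):
    return np.var(data[:, -1]) * data.shape[0]

def best_split(data, tolS=1, tolN=4):
    S = total_err(data)
    bestS, bestIndex, bestValue = np.inf, 0, 0
    for feat in range(data.shape[1] - 1):
        for val in set(data[:, feat]):
            m0, m1 = bin_split(data, feat, val)
            if m0.shape[0] < tolN or m1.shape[0] < tolN:
                continue
            newS = total_err(m0) + total_err(m1)
            if newS < bestS:
                bestIndex, bestValue, bestS = feat, val, newS
    if S - bestS < tolS:
        return None, np.mean(data[:, -1])
    return bestIndex, bestValue

# Step data: y jumps from 0 to 10 at x = 5
x = np.arange(10.0)
y = np.where(x < 5, 0.0, 10.0)
data = np.column_stack([x, y])
print(best_split(data))   # splits on feature 0 at value 4.0
```

Splitting at 4.0 puts five points on each side with zero error in each half, so no other candidate can beat it.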
9.3.2 Executing the code
9.4 Tree pruning
Trees with too many nodes are an example of a model overfit to the data.
The procedure of reducing the complexity of a decision tree to avoid overfitting is known as pruning. The early-termination conditions in chooseBestSplit() are in fact a form of what is called prepruning. Another form of pruning uses a test set and a training set; this is known as postpruning.
9.4.1 Prepruning
9.4.2 Postpruning
The method we'll use first splits our data into a test set and a training set.
A large number of nodes were pruned off the tree, but it wasn't reduced to two nodes as we had hoped. It turns out that postpruning isn't as effective as prepruning; you can employ both to get the best possible model.
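The merge test at the heart of prune() can be illustrated in isolation. With made-up test data and two hypothetical leaf values, we compare the test error of keeping both leaves against the error of collapsing them to their mean, exactly as the prune() function in the listing below does:

```python
import numpy as np

# Hypothetical subtree: split on feature 0 at 0.5;
# left leaf (x > 0.5) predicts 5.0, right leaf predicts 1.0.
spVal, left_leaf, right_leaf = 0.5, 5.0, 1.0

test = np.array([[0.9, 4.8], [0.8, 5.1], [0.2, 0.9], [0.1, 1.1]])
lSet = test[test[:, 0] > spVal]
rSet = test[test[:, 0] <= spVal]

# error of keeping the split vs. error of merging to one mean leaf
error_no_merge = np.sum((lSet[:, -1] - left_leaf) ** 2) + \
                 np.sum((rSet[:, -1] - right_leaf) ** 2)
tree_mean = (left_leaf + right_leaf) / 2.0
error_merge = np.sum((test[:, -1] - tree_mean) ** 2)

print(error_no_merge, error_merge)   # 0.07 vs 15.67: keep the split
```

Because the test data clusters tightly around each leaf's value, merging would hurt, so this subtree survives pruning; with test targets near 3.0 on both sides, the decision would flip.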
9.5 Model trees
An alternative to modeling the data as a simple constant value at each leaf node is to model it as a piecewise linear model at each leaf node. Piecewise linear means that you have a model that consists of multiple linear segments.
One of the advantages of decision trees over other machine learning algorithms is that humans can understand the results. Two straight lines are easier to interpret than a big tree of constant values. The interpretability of model trees is one reason why you’d choose them over regression trees. The second reason is higher accuracy.
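A model-tree leaf stores regression coefficients rather than a constant. The normal-equations solve that linearSolve() performs in the listing below can be sketched on its own; the line y = 2 + 3x here is made up for the demo:

```python
import numpy as np

x = np.arange(0.0, 1.0, 0.1)
y = 2.0 + 3.0 * x                 # noise-free line, so the fit is exact

# Build X with a leading column of 1s, as linearSolve() does,
# then solve the normal equations (X^T X) ws = X^T y.
X = np.column_stack([np.ones_like(x), x])
ws = np.linalg.solve(X.T @ X, X.T @ y)
print(ws)   # recovers [2.0, 3.0]
```

In a model tree, each leaf would hold its own `ws` fitted to the subset of data reaching that leaf, giving the piecewise linear model described above.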
'''
Author: Maxwell Pan
Date: 2022-04-28 12:59:00
LastEditTime: 2022-04-28 13:02:58
FilePath: \cp09\regTrees.py
Description:
Software:VSCode,env:
'''
import numpy as np

def loadDataSet(fileName):      # general function to parse tab-delimited floats
    dataMat = []                # assume last column is target value
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float,curLine))  # map all elements to float()
        dataMat.append(fltLine)
    return dataMat

def binSplitDataSet(dataSet, feature, value):
    mat0 = dataSet[np.nonzero(dataSet[:,feature] > value)[0],:]
    mat1 = dataSet[np.nonzero(dataSet[:,feature] <= value)[0],:]
    return mat0,mat1

def regLeaf(dataSet):  # returns the value used for each leaf
    return np.mean(dataSet[:,-1])

def regErr(dataSet):
    return np.var(dataSet[:,-1]) * np.shape(dataSet)[0]

def linearSolve(dataSet):  # helper function used in two places
    m,n = np.shape(dataSet)
    X = np.mat(np.ones((m,n))); Y = np.mat(np.ones((m,1)))  # create a copy of data with 1 in 0th position
    X[:,1:n] = dataSet[:,0:n-1]; Y = dataSet[:,-1]  # and strip out Y
    xTx = X.T*X
    if np.linalg.det(xTx) == 0.0:
        raise NameError('This matrix is singular, cannot do inverse,\n\
        try increasing the second value of ops')
    ws = xTx.I * (X.T * Y)
    return ws,X,Y

def modelLeaf(dataSet):  # create linear model and return coefficients
    ws,X,Y = linearSolve(dataSet)
    return ws

def modelErr(dataSet):
    ws,X,Y = linearSolve(dataSet)
    yHat = X * ws
    return sum(np.power(Y - yHat,2))

def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):
    tolS = ops[0]; tolN = ops[1]
    # if all the target variables are the same value: quit and return value
    if len(set(dataSet[:,-1].T.tolist()[0])) == 1:  # exit cond 1
        return None, leafType(dataSet)
    m,n = np.shape(dataSet)
    # the choice of the best feature is driven by reduction in RSS error from the mean
    S = errType(dataSet)
    bestS = np.inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n-1):
        for splitVal in set(dataSet[:,featIndex].T.tolist()[0]):
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (np.shape(mat0)[0] < tolN) or (np.shape(mat1)[0] < tolN): continue
            newS = errType(mat0) + errType(mat1)
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    # if the decrease (S - bestS) is less than a threshold, don't do the split
    if (S - bestS) < tolS:
        return None, leafType(dataSet)  # exit cond 2
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (np.shape(mat0)[0] < tolN) or (np.shape(mat1)[0] < tolN):  # exit cond 3
        return None, leafType(dataSet)
    return bestIndex,bestValue  # returns the best feature to split on
                                # and the value used for that split

def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):  # assume dataSet is a NumPy mat so we can use array filtering
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)  # choose the best split
    if feat is None: return val  # if the splitting hit a stop condition, return val
    retTree = {}
    retTree['spInd'] = feat
    retTree['spVal'] = val
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree

def isTree(obj):
    return (type(obj).__name__=='dict')

def getMean(tree):
    if isTree(tree['right']): tree['right'] = getMean(tree['right'])
    if isTree(tree['left']): tree['left'] = getMean(tree['left'])
    return (tree['left']+tree['right'])/2.0

def prune(tree, testData):
    if np.shape(testData)[0] == 0: return getMean(tree)  # if we have no test data, collapse the tree
    if (isTree(tree['right']) or isTree(tree['left'])):  # if either branch is a tree, try to prune it
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        if isTree(tree['left']): tree['left'] = prune(tree['left'], lSet)
        if isTree(tree['right']): tree['right'] = prune(tree['right'], rSet)
    # if they are now both leaves, see if we can merge them
    if not isTree(tree['left']) and not isTree(tree['right']):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum(np.power(lSet[:,-1] - tree['left'],2)) +\
            sum(np.power(rSet[:,-1] - tree['right'],2))
        treeMean = (tree['left']+tree['right'])/2.0
        errorMerge = sum(np.power(testData[:,-1] - treeMean,2))
        if errorMerge < errorNoMerge:
            print("merging")
            return treeMean
        else: return tree
    else: return tree

def regTreeEval(model, inDat):
    return float(model)

def modelTreeEval(model, inDat):
    n = np.shape(inDat)[1]
    X = np.mat(np.ones((1,n+1)))
    X[:,1:n+1] = inDat
    return float(X*model)

def treeForeCast(tree, inData, modelEval=regTreeEval):
    if not isTree(tree): return modelEval(tree, inData)
    if inData[tree['spInd']] > tree['spVal']:
        if isTree(tree['left']): return treeForeCast(tree['left'], inData, modelEval)
        else: return modelEval(tree['left'], inData)
    else:
        if isTree(tree['right']): return treeForeCast(tree['right'], inData, modelEval)
        else: return modelEval(tree['right'], inData)

def createForeCast(tree, testData, modelEval=regTreeEval):
    m = len(testData)
    yHat = np.mat(np.zeros((m,1)))
    for i in range(m):
        yHat[i,0] = treeForeCast(tree, np.mat(testData[i]), modelEval)
    return yHat
9.6 Example : comparing tree methods to standard regression
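The chapter's comparison measures forecast quality with R² via np.corrcoef. As a stand-in (the book uses bike-rental data not reproduced here), this toy sketch pits a two-leaf regression tree, i.e. two group means, against a straight-line fit on made-up step data:

```python
import numpy as np

x = np.arange(10.0)
y = np.where(x < 5, 0.0, 10.0)        # step function: nonlinear in x

# "Tree" forecast: split at x = 4, predict each side's mean
tree_hat = np.where(x > 4, y[x > 4].mean(), y[x <= 4].mean())

# Standard linear regression forecast
ws = np.polyfit(x, y, 1)
lin_hat = np.polyval(ws, x)

tree_r = np.corrcoef(tree_hat, y)[0, 1]   # the R measure the book uses
lin_r = np.corrcoef(lin_hat, y)[0, 1]
print(tree_r, lin_r)   # the tree fits the step exactly; the line cannot
```

On this data the tree's correlation is 1.0 while the line's falls short, which mirrors the chapter's finding that tree methods can beat standard regression on data with local structure.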