Decision trees are mainly used for classification, though some variants, such as CART, also handle regression. The basic idea is to evaluate the entropy of each candidate feature on the given data, use that measure to decide how to split, and build a tree structure that makes classification decisions from the top down.
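Concretely, the splitting criterion rests on Shannon entropy: for a data set $D$ whose class labels occur with proportions $p_k$,

$$H(D) = -\sum_k p_k \log_2 p_k,$$

and ID3 splits on the feature $A$ with the largest information gain $g(D, A) = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v)$, where $D_v$ is the subset of $D$ taking value $v$ on $A$.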
Below is a minimal decision-tree implementation (ID3, which uses information gain as its splitting criterion):
from math import log
import operator

# Compute the Shannon entropy of a data set
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        # The class label is the last column of each record
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
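As a quick sanity check: the sample data set defined next has 2 'yes' and 3 'no' labels, so calcShannonEnt should return $-(2/5)\log_2(2/5) - (3/5)\log_2(3/5) \approx 0.971$.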
# Build the toy data set and its feature labels
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
# Split the data set on the given feature (axis) and value
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # Take everything before position axis, then everything
            # after it, so the feature just used for splitting is dropped
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
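For example, splitting the sample data on feature 0 with value 1 keeps the three records whose first feature equals 1 and strips that column:

>>> myDat, labels = createDataSet()
>>> splitDataSet(myDat, 0, 1)
[[1, 'yes'], [1, 'yes'], [0, 'no']]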
# Choose the best feature on which to split the data set
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            # Split on feature i and accumulate the weighted entropy
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        # Information gain: how much the split reduces entropy
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    # Return the index of the best splitting feature
    return bestFeature
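On the sample data this picks feature 0: splitting on it leaves a weighted entropy of about 3/5 × 0.918 ≈ 0.551 (gain ≈ 0.420), whereas splitting on feature 1 leaves 4/5 × 1.0 = 0.8 (gain ≈ 0.171), so chooseBestFeatureToSplit(myDat) returns 0.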
# Return the class label that occurs most frequently
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
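As a side note, the same majority vote can be written more compactly with the standard library; a minimal equivalent sketch (not part of the original listing):

from collections import Counter

def majorityCnt(classList):
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(classList).most_common(1)[0][0]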
# Recursively build the decision tree
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # Stop when every record has the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop when no features remain; fall back to a majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # Remove the label of the feature used for this split
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    # A set keeps only the distinct feature values
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # Copy labels so recursive calls do not mutate this level's list
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
if __name__ == '__main__':
    myDat, labels = createDataSet()
    Ent = calcShannonEnt(myDat)
    choice = chooseBestFeatureToSplit(myDat)
    # print(choice)
    # print(myDat)
    myTree = createTree(myDat, labels)
    print(myTree)
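Running the script prints the learned tree as nested dictionaries; for this toy data set the output should be:

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

Branch 0 of 'no surfacing' is immediately pure, while branch 1 needs a further split on 'flippers'.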
ID3 is biased toward features with many distinct values, and the models it builds tend to be complex and prone to overfitting. C4.5 instead selects features by the information gain ratio, while the CART classification tree selects the optimal feature with the Gini index (the CART regression tree uses mean squared error as its loss function).
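For comparison, here is a minimal sketch of the Gini index, $\mathrm{Gini}(D) = 1 - \sum_k p_k^2$, over the same data-set format (the helper name calcGini is my own, not part of the original listing); CART picks the split that minimizes the weighted Gini of the children, just as ID3 minimizes the weighted entropy:

def calcGini(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    # Gini impurity: 1 minus the sum of squared class probabilities;
    # 0 means a pure node, larger values mean a more mixed node
    gini = 1.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        gini -= prob * prob
    return gini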
Decision trees have several advantages:
1. They are easy to understand, their mechanics are simple to explain, and the resulting models are interpretable.
2. They are relatively insensitive to missing values.
3. They can be pruned, which effectively reduces the tree's complexity and strikes a balance between performance and model complexity (a pruning sketch follows this list).
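To see pruning in practice without implementing it by hand, here is a quick sketch using scikit-learn's cost-complexity pruning (the library and the iris data are illustrative choices, not part of the original code; ccp_alpha controls how aggressively subtrees are collapsed):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# ccp_alpha=0 grows the full tree; larger values prune harder
full = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(criterion='entropy', ccp_alpha=0.02,
                                random_state=0).fit(X, y)
# The pruned tree should end up with fewer leaves than the full one
print(full.get_n_leaves(), pruned.get_n_leaves())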