Computing the Shannon Entropy of a Dataset
This chapter covers how a decision tree is trained and the role information gain plays in training. Information gain is computed from an impurity measure, and there are two common choices of impurity: Shannon entropy and Gini impurity (the book uses entropy). The first piece of code computes Shannon entropy. Back in school I worked on structured random forests and had to evaluate Shannon entropy, so I know it reasonably well and the code isn't hard to read (honestly, it's simple anyway). No fear: a quick refresher on the definitions below, and then straight to the code from the book.
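For reference, these are the standard textbook definitions, not anything specific to this book's code. For a dataset D with K classes, where class k makes up a proportion p_k of the samples:

H(D) = -sum_k p_k * log2(p_k)        (Shannon entropy)
Gini(D) = 1 - sum_k p_k^2            (Gini impurity)

Both are 0 when the set is pure (only one class present) and largest when the classes are evenly mixed.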
from math import log

def calcShannonEnt(dataSet):
    """Compute the Shannon entropy of a dataset (class label in the last column)."""
    numEntries = len(dataSet)
    labelCounts = {}
    # count the occurrences of each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
With the entropy code written, let's check that it actually works. Throw together a quick test case:
def createDataSet():
    # toy dataset from the book: two binary features plus a class label
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
Here's the test code as well:
myData, labels = createDataSet()
shannonEnt = calcShannonEnt(myData)
print(shannonEnt)
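As a sanity check, you can do this one by hand: the toy set has 2 'yes' and 3 'no' out of 5 samples, so

H = -(2/5)*log2(2/5) - (3/5)*log2(3/5) ≈ 0.9710

which is exactly what the print above should show.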
Splitting the Dataset
Running the test code above prints roughly 0.9710, matching the hand calculation. So at this point we can compute Shannon entropy, but that alone is far from enough. The idea is to split recursively so that the entropy of the subsets keeps dropping, i.e. each split maximizes the information gain, until the impurity of every subset falls within a range we can tolerate. To get there, we break the large set we want to classify into progressively smaller subsets; once each subset's impurity (entropy here, though Gini impurity plays the same role — Gini, not bikini, stay focused!) meets our stopping condition, we can say: OK, we have a classifier (i.e. a tree) and the classification problem is solved! In the spirit of decomposing the big problem, the small pieces are: splitting a set on a feature value, choosing the best split, and recursively evaluating the impurity. First, the dataset-splitting code:
def splitDataSet(dataSet, axis, value):
    """Return the rows whose feature `axis` equals `value`, with that feature removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # chop out the split feature: keep everything before and after axis
            splitSet = featVec[:axis]
            splitSet.extend(featVec[axis + 1:])
            retDataSet.append(splitSet)
    return retDataSet
One small detail here: a list's append method is not the same as its extend method — append adds its argument as a single element, while extend splices in each element of its argument (quick demo below). And do your testing! A good developer always tests; I'm nudging you toward it here, but how far you take it is up to your own standards :)
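A minimal illustration of the difference (plain Python, nothing from the book):

a = [1, 2]
a.append([3, 4])   # a is now [1, 2, [3, 4]] — the whole list goes in as one element
b = [1, 2]
b.extend([3, 4])   # b is now [1, 2, 3, 4] — the elements are spliced in

Now, the test for splitDataSet: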
if __name__ == "__main__":
    myData, labels = createDataSet()
    reduced_data = splitDataSet(myData, 0, 0)
    print(reduced_data)
The expected output is [[1, 'no'], [1, 'no']] — the two rows whose first feature is 0, with that feature removed.
Note that the stored result simply throws away the feature used for splitting; this is one difference between this style of decision tree and CART, which can reuse a feature. If you want to keep the feature column, just change featVec[:axis] to featVec[:axis + 1]. Next we iterate over the whole dataset to find the best way to split it.
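For completeness, here's a sketch of the keep-the-feature variant just described — my own illustration, not code from the book:

def splitDataSetKeepFeature(dataSet, axis, value):
    # same as splitDataSet, but the split feature stays in the output rows
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            splitSet = featVec[:axis + 1]       # keep the feature itself
            splitSet.extend(featVec[axis + 1:])
            retDataSet.append(splitSet)         # equivalent to appending featVec[:]
    return retDataSet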
def chooseBestFeatureToSplit(dataset):
    featNum = len(dataset[0]) - 1          # last column is the class label
    baseEntropy = calcShannonEnt(dataset)  # entropy before any split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(featNum):
        featList = [example[i] for example in dataset]
        uniqueValue = set(featList)
        newEntropy = 0.0
        # weighted entropy of the subsets produced by splitting on feature i
        for value in uniqueValue:
            subDataSet = splitDataSet(dataset, i, value)
            prob = len(subDataSet) / float(len(featList))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
Test code:
if __name__ == "__main__":
    myData, labels = createDataSet()
    bf = chooseBestFeatureToSplit(myData)
    print(bf)
Expected result: 0, i.e. the first feature ('no surfacing') gives the larger information gain.
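You can verify this on paper. Splitting on feature 0 gives the subsets {no, no} (entropy 0) and {yes, yes, no} (entropy ≈ 0.9183), so the gain is 0.9710 - (2/5)*0 - (3/5)*0.9183 ≈ 0.4200. Splitting on feature 1 gives {no} (entropy 0) and {yes, yes, no, no} (entropy 1), so the gain is 0.9710 - (1/5)*0 - (4/5)*1 ≈ 0.1710. Feature 0 wins.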
Building the Tree
The last task: actually building the tree.
import operator

def majorityCnt(classList):
    """Return the class label that appears most often in classList."""
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # sort (label, count) pairs by count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def createTree(dataset, labels):
    classList = [example[-1] for example in dataset]
    # stop if every sample has the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop if no features are left; fall back to a majority vote
    if len(dataset[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataset)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataset]
    uniqueValue = set(featValues)
    for value in uniqueValue:
        subLabels = labels[:]  # copy, so recursion doesn't clobber this level's labels
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataset, bestFeat, value), subLabels)
    return myTree
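As an aside, majorityCnt above can be written more compactly with the standard library's collections.Counter — an equivalent alternative, not the book's version:

from collections import Counter

def majorityCnt(classList):
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(classList).most_common(1)[0][0]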
Test code:
if __name__ == "__main__":
    myData, labels = createDataSet()
    myTree = createTree(myData, labels)
    print(myTree)
The test result should be: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
And with that, a decision tree has been built. Don't worry too much about the details inside createTree — this is just for practice, and using an open-source library is safer in real work.
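For example, here is a minimal sketch of the same toy problem with scikit-learn's DecisionTreeClassifier — assuming scikit-learn is installed; it's not part of the book's code:

from sklearn.tree import DecisionTreeClassifier

X = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 1]]   # the two features
y = ['yes', 'yes', 'no', 'no', 'no']           # the class labels
clf = DecisionTreeClassifier(criterion='entropy')  # split by information gain
clf.fit(X, y)
print(clf.predict([[1, 0]]))                   # expected: ['no']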
Testing the Model
Once the decision tree has been trained, we can use it to classify real data. Classification also needs a recursive function. Because a tree node stores the feature's name rather than its column index, featLabels has to be passed in so the function can look up the index. Code:
def classify(inputTree, featLabels, testVec):
    """Walk the tree recursively until a leaf (a class label) is reached."""
    firstStr = list(inputTree.keys())[0]      # the feature tested at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)    # map feature name -> column index
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)  # internal node: recurse
            else:
                classLabel = secondDict[key]  # leaf: the class label
    return classLabel
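One caveat: if testVec carries a feature value the tree never saw during training, classLabel is never assigned and Python raises an UnboundLocalError. A slightly more defensive variant (my own tweak, not the book's) returns None in that case:

def classifySafe(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    subtree = secondDict.get(testVec[featIndex])  # None if the value was never seen
    if isinstance(subtree, dict):
        return classifySafe(subtree, featLabels, testVec)
    return subtree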
Now a quick test of the classification code. We first build a tree with createTree and then make up a sample to classify. Careful: createTree deletes entries from labels one by one, so copy labels first and pass the copy to classify, otherwise it will error out. Test code:
if __name__ == "__main__":
    myData, labels = createDataSet()
    myLabels = labels[:]  # copy, because createTree mutates labels
    myTree = createTree(myData, labels)
    label0 = classify(myTree, myLabels, [1, 0])
    print(label0)
The test result for [1, 0] is 'no': no surfacing is 1, so we descend into the flippers subtree, and flippers is 0, which lands on a 'no' leaf.
Saving the Model
Retraining the model every time we want to predict on a new dataset is unrealistic; a trained model should be saved so that prediction can simply load it. Python's pickle module can serialize our tree — which is just a nested dict — to disk and read it back, so that's what we use here.
import pickle

def storeTree(inputTree, filename):
    # serialize the tree (a nested dict) to a binary file
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    # load a previously pickled tree back into memory
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
Test code:
if __name__ == "__main__":
    myData, labels = createDataSet()
    myLabels = labels[:]  # copy, because createTree mutates labels
    myTree = createTree(myData, labels)
    label0 = classify(myTree, myLabels, [1, 0])
    storeTree(myTree, "classifier.txt")
    myTree0 = grabTree("classifier.txt")
    print(myTree0)
The test result: the reloaded tree prints the same structure as before, {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}.