Machine Learning Broken Down series, Decision Trees (1): The ID3 Decision Tree

FSak47

Tags: machine learning, decision tree, entropy, ID3, C4.5

Article link: https://blog.csdn.net/u010246947/article/details/53258931

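I. The theory behind decision trees:

1. The concept of entropy:

Entropy measures how "stable" a data distribution is (textbooks call this purity), or equivalently how spread out the distribution is. Broken down step by step, take the following data:
tech positivity age prospects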

6 8 old normal

8 9 old yes

3 3 old no

7 5 old normal

7 7 young normal

7 6 old normal

8 5 old normal

2 2 old no

7 5 old normal

6 6 young normal

7 4 old normal

8 4 old normal

4 3 old no

5 4 old no

6 4 old no

6 3 old normal

7 8 young yes

6 8 young yes

6 5 old no

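Above is a real training sample of 20 employees from my department, me included (19 of the rows appear in the listing). Each person is judged on three dimensions: technical ability (integer, 0-10), work positivity (integer, 0-10), and age (boolean: old or young). The outcome is career prospects (good, average or poor, written as yes, normal and no). First, let's see how entropy is computed:

The entropy formula is 0 - sum(p(i) * log(p(i), 2)). It comes from Shannon; I cannot yet explain why it has this form (to be continued). The calculation here is: 0 - (p(yes) * log(p(yes), 2)) - (p(no) * log(p(no), 2)) - (p(normal) * log(p(normal), 2)) = 0 - (0.15 * log(0.15, 2)) - (0.35 * log(0.35, 2)) - (0.5 * log(0.5, 2)) = 1.44064544962

15% of the people have good prospects, 50% are average, and 35% have none. The average group is the largest, but a sizeable share has no prospects and a small handful has good ones. The entropy is 1.44064544962.

One thing to notice clearly: entropy has nothing to do with how the features are distributed. It depends only on the distribution of the outcome values.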
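To make the arithmetic easy to re-check, here is a minimal Python 3 sketch of the same calculation. The function name `entropy` and the variable `department` are my own choices for illustration, not part of the program later in this article.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Outcome shares in the 20-person sample: yes, no, normal
department = entropy([0.15, 0.35, 0.5])
print(round(department, 4))  # 1.4406
```

Only the outcome shares go in; no feature column is involved, which is the point made above.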

Now the department has layoffs, and the sample shrinks to:

6 8 old normal

8 9 old yes

3 3 old no

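Computing the entropy again: 0 - (1/3 * log(1/3, 2)) - (1/3 * log(1/3, 2)) - (1/3 * log(1/3, 2)) = 0 - log(1/3, 2) = 1.58496250072. After the layoffs the distribution is more mixed: good, average and poor prospects each hold a third. Entropy went up.

Later, the people with some prospects could not stand it and left, the department ended up with two people who have no prospects at all, and the sample became: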

3 3 old no

5 4 old no

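Computing the entropy again, there is now a single outcome with probability 1: 0 - (1 * log(1, 2)) = 0

So the department is now very stable: everyone has no prospects. Entropy has dropped to its freezing point, 0.

To summarize:

What is entropy? Entropy reflects how stable the probability distribution of the current sample is. If the distribution is very "scattered" or "even", with every kind of outcome present in similar proportions, entropy is relatively large; otherwise it is smaller.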
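The three department snapshots fit the standard bounds on entropy. As a short worked statement (the symbol k for the number of outcome classes is my own notation):

```latex
0 \;\le\; H = -\sum_{i=1}^{k} p_i \log_2 p_i \;\le\; \log_2 k
% H = 0          when a single outcome has probability 1 (the all-"no" department)
% H = \log_2 k   when every outcome has probability 1/k
% For k = 3: \log_2 3 \approx 1.585, the value reached right after the layoffs;
% the original sample, at about 1.4406, sits just below that maximum.
```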

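2. Information gain

Before getting to information gain, first read 2.1.

2.1 What a decision tree roughly looks like:

In the figure, blue marks features and red marks outcomes.

A sample like the following could well train into a decision tree like the one above:

Weather   Wife at home   Wife on period   Action (outcome)
good      no             yes              go out with mistress
good      no             no               go out with mistress
good      yes            yes              go out with wife
good      yes            no               go out with wife
bad       no             yes              play games
bad       no             no               play games
bad       yes            yes              play games
bad       yes            no               have sex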
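As a rough illustration of what such a tree looks like as data, the sketch below writes one tree that reproduces every row of the table as a nested dict, the same shape the program later in this article builds. This particular tree is my own assumption (it may differ from the one in the original figure), and the helper `decide` is a hypothetical name.

```python
# One tree consistent with the table (an assumption, not necessarily the original figure).
# Inner dicts are decision nodes keyed by feature name; plain strings are leaves.
tree = {
    "weather": {
        "good": {"wife at home": {"no": "go out with mistress",
                                  "yes": "go out with wife"}},
        "bad": {"wife at home": {
            "no": "play games",
            "yes": {"wife on period": {"yes": "play games", "no": "have sex"}},
        }},
    }
}

def decide(node, sample):
    """Walk down from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        feature = next(iter(node))              # the single feature tested at this node
        node = node[feature][sample[feature]]   # follow the branch for this sample's value
    return node

print(decide(tree, {"weather": "bad", "wife at home": "yes", "wife on period": "no"}))
```

Note that the "good" branch never looks at "wife on period": a feature only appears where it still changes the outcome.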

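2.2 What we want a decision tree to look like:

A decision tree is a model generated from training samples, so it should capture what the samples have in common and avoid overfitting. In terms of tree structure, that means avoiding too many branches. Overfitting will come up again and again in the upcoming articles on regression algorithms.

How does a decision tree avoid overfitting?

1. Through the normal construction process:

The feature at the root of each level is not picked arbitrarily. At each level, the root is the feature whose split leaves the outcome distribution of the resulting sub-samples most stable, that is, the split that best captures what the samples have in common.

2. Through pruning (described in detail later with CART/C4.5):

Pre-pruning: set conditions during construction that stop growth likely to overfit

Post-pruning: trim the tree after it has been built

Point 2 is discussed later with C4.5 and CART, the improved versions of the decision tree.

Point 1 is the construction principle of the ID3 decision tree, which brings in the concept of "information gain".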

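2.3 Information gain

Definition of information gain: how much information a feature brings to the classification system. The more it brings, the more important the feature. It is computed from entropy.

More precisely, for a feature A of the sample data: information gain of A = entropy of the sample - sum over A's values of (probability of the value * conditional entropy for that value). The smaller that weighted sum, the larger the information gain, and the stronger the case for making A the root node of the decision tree for the current sample. Put another way, each value of the root feature A should correlate strongly with its outcomes.

Conditional entropy: the entropy of the sub-sample obtained by holding feature A fixed at one of its values.

This is how the feature at each level of the tree is determined. Take this sample data:

Wife on period   Weather   Decision
yes              good      play games
yes              bad       play games
no               good      have sex
no               bad       have sex

Sample entropy = 0 - 1/2 * log(1/2, 2) - 1/2 * log(1/2, 2) = 1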

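1. With "wife on period" as the root feature of the tree:

Value "yes": the probability of this value is 1/2, and the sub-sample is:

good   play games
bad    play games

Sub-sample entropy (a single outcome with probability 1): 0 - 1 * log(1, 2) = 0

Value "no": the probability of this value is 1/2, and the sub-sample is:

good   have sex
bad    have sex

Sub-sample entropy (a single outcome with probability 1): 0 - 1 * log(1, 2) = 0

Information gain = 1 (sample entropy) - 1/2 (probability of "yes") * 0 (sub-sample entropy for "yes") - 1/2 (probability of "no") * 0 (sub-sample entropy for "no") = 1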

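2. With "weather" as the root feature of the tree:

Value "good": the probability of this value is 1/2, and the sub-sample is:

yes   play games
no    have sex

Sub-sample entropy: 0 - 1/2 * log(1/2, 2) - 1/2 * log(1/2, 2) = 1

Value "bad": the probability of this value is 1/2, and the sub-sample is:

yes   play games
no    have sex

Sub-sample entropy: 0 - 1/2 * log(1/2, 2) - 1/2 * log(1/2, 2) = 1

Information gain = 1 (sample entropy) - 1/2 (probability of "good") * 1 (sub-sample entropy for "good") - 1/2 (probability of "bad") * 1 (sub-sample entropy for "bad") = 0

So "wife on period" gives a larger information gain than "weather" and should be the root feature. Drawn with matplotlib, the tree looks like this:

The figure means: if the wife is not on her period, have sex; otherwise play games. Conversely, with "weather" as the root feature, whether the weather is "good" or "bad", a second decision based on "wife on period" is still needed.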
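The two hand calculations above can be re-checked with a few lines of Python 3. This is only a sketch: `entropy`, `information_gain` and the tuple layout of `rows` are names I picked for illustration, separate from the full program later on.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of outcome labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature_index):
    """Sample entropy minus the probability-weighted entropy of each sub-sample."""
    gain = entropy([row[-1] for row in rows])
    for value in set(row[feature_index] for row in rows):
        subset = [row[-1] for row in rows if row[feature_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# Columns: wife on period, weather, decision
rows = [
    ("yes", "good", "play games"),
    ("yes", "bad", "play games"),
    ("no", "good", "have sex"),
    ("no", "bad", "have sex"),
]
print(information_gain(rows, 0))  # 1.0  ("wife on period")
print(information_gain(rows, 1))  # 0.0  ("weather")
```

A gain of 1 means the feature removes all uncertainty about the decision; a gain of 0 means knowing the feature tells us nothing.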

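2.4 Summary

When sample S has N features (N > 1), the root feature A is the one that minimizes the probability-weighted conditional entropy over its values: min(sum(p(i) * Ent(S|A = Ai))), where p(i) is the probability that A takes the value Ai. In other words, each value of the root feature A should lead to outcomes that are as little spread out as possible (lower entropy).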
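Why minimizing this sum is the same as maximizing information gain can be spelled out in one line. The notation Gain(S, A) and A* is mine; Ent(S) is the same for every candidate feature, so it drops out of the comparison:

```latex
A^{*} = \arg\max_{A}\, \mathrm{Gain}(S, A)
      = \arg\max_{A} \Big[ \mathrm{Ent}(S) - \sum_{i} p(A = A_i)\, \mathrm{Ent}(S \mid A = A_i) \Big]
      = \arg\min_{A} \sum_{i} p(A = A_i)\, \mathrm{Ent}(S \mid A = A_i)
```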

3. Recursive decision tree

Back to the sample from the beginning. This is not self-mockery; it is a real sample:

tech positivity age prospects

6 8 old normal

8 9 old yes

3 3 old no

7 5 old normal

7 7 young normal

7 6 old normal

8 5 old normal

2 2 old no

7 5 old normal

6 6 young normal

7 4 old normal

8 4 old normal

4 3 old no

5 4 old no

6 4 old no

6 3 old normal

7 8 young yes

6 8 young yes

6 5 old no

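There are three features: technical ability 'tech', positivity 'ispositive' (spelled 'is positive' in the program's labels list), and age 'age'. Using the method described above we can compute the root feature of the decision tree, which is "ispositive". Next, for each value of "ispositive", we compute which feature becomes the root of the next level. An analogy: suppose the root feature deciding whether a professional athlete succeeds is "physical fitness". Given a fitness score of 9, other features such as "professional attitude" further decide how much success is possible; given fitness 9 and attitude 9, many more factors still influence the result. In real life, every outcome is indeed determined by many different factors.

The recursive ID3 decision tree uses probability and entropy, following the min(sum(p(i) * Ent(S|A = Ai))) rule, to compute level by level the most broadly significant feature of the current sample, shrink the sample according to each of that feature's values, then compute the most significant feature of the shrunken sample, and so on until the final outcome is reached.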
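The "shrink the sample" step is the part that is easiest to get wrong, so here is a small Python 3 sketch of it on a few rows taken from the sample. The helper name `shrink` is hypothetical; the full program does the same job in `split_by_feature`.

```python
# Excerpt of the training rows: tech, positivity, age, prospects
rows = [
    ["6", "8", "old", "normal"],
    ["8", "9", "old", "yes"],
    ["3", "3", "old", "no"],
    ["7", "8", "young", "yes"],
    ["6", "8", "young", "yes"],
]

def shrink(rows, feature_index, value):
    """Keep the rows where the feature equals value, and drop that feature's column."""
    return [row[:feature_index] + row[feature_index + 1:]
            for row in rows if row[feature_index] == value]

# Branch "positivity == 8": the remaining columns are tech, age, prospects
subset = shrink(rows, 1, "8")
print(subset)
# [['6', 'old', 'normal'], ['7', 'young', 'yes'], ['6', 'young', 'yes']]
```

Dropping the used column is what guarantees the recursion ends: each level has one feature fewer to choose from.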

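Here is the program:

The training data is the data above, tab-separated (the loader treats the first column of each line as a name and skips it).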

#coding:utf8
from math import log
import operator
from treeplot import createPlot


# status.txt holds the training data above, one tab-separated sample per line.
# The first column is treated as a name and dropped; the rest are features plus the outcome.
def createdataset():
    dataset = []
    with open('status.txt') as f:
        for line in f:
            line = line.strip('\n')
            if line == "":
                break
            ary = line.split('\t')
            dataset.append(ary[1:])
    labels = ['tech', 'is positive', 'age']
    return dataset, labels


# The larger the entropy, the more mixed the outcomes; the smaller, the clearer the situation
def calcEnt(dataset):
    counts = {}
    for data in dataset:
        label = data[-1]
        counts[label] = counts.get(label, 0) + 1

    ent = 0.0
    for label in counts:
        p = float(counts[label]) / float(len(dataset))
        print(label, p, p * log(p, 2))
        ent -= p * log(p, 2)
    return ent


# Rows whose feature at position idx equals val, with that feature column removed
def split_by_feature(dataset, idx, val):
    res = []
    for data in dataset:
        if data[idx] == val:
            vec = data[:idx]
            vec.extend(data[idx + 1:])
            res.append(vec)
    return res


# For each feature, sum "probability * entropy of the sub-sample" over its values,
# compute the information gain, and return the index of the feature with the largest gain
def find_bestfeature_tosplit_dataset(dataset, labels):
    feature_num = len(dataset[0]) - 1
    # Gain is never below 0, so starting at -1.0 means a feature is always chosen,
    # even when every gain is 0 (the original start of 0.0 could return -1)
    bestentgain = -1.0
    bestfeature = -1
    ent = calcEnt(dataset)

    for i in range(feature_num):
        values = set([data[i] for data in dataset])
        cur_ent = 0.0
        for value in values:
            res = split_by_feature(dataset, i, value)
            p = float(len(res)) / float(len(dataset))
            cur_ent += p * calcEnt(res)
        entgain = ent - cur_ent

        # Larger entgain means smaller cur_ent (probability * entropy), i.e. a clearer split
        if entgain > bestentgain:
            bestentgain = entgain
            bestfeature = i
    return bestfeature


# Majority vote: the most frequent outcome
def vote(classes):
    classcount = {}
    for c in classes:
        classcount[c] = classcount.get(c, 0) + 1
    sortedclasscount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedclasscount[0][0]


# Recursive decision tree: pick the best feature at each level and build the tree
def createdtree(dataset, labels):
    # classes holds all outcomes of the current sample
    classes = [data[-1] for data in dataset]
    # No features left, only outcomes: take whichever outcome is most common
    if len(dataset[0]) == 1:
        return vote(classes)
    # Only one outcome left: no root feature to compute, this is the result
    if len(classes) == classes.count(classes[0]):
        return classes[0]

    # Compute the root feature and build the current (sub)tree
    bestfeatureidx = find_bestfeature_tosplit_dataset(dataset, labels)
    bestfeature = labels[bestfeatureidx]
    tree = {bestfeature: {}}

    values = set(data[bestfeatureidx] for data in dataset)
    # This feature is now the root of the current tree, so remove it
    del labels[bestfeatureidx]
    # Decide again on the sub-sample for each value of the root feature
    for value in values:
        # Never write newlabels = labels here: that is a reference and would corrupt
        # the labels seen by the other branches. newlabels = labels[:] makes a copy
        newlabels = labels[:]
        print(value, newlabels)
        # Shrink the sample to the rows with the current value of the root feature
        newdataset = split_by_feature(dataset, bestfeatureidx, value)
        tree[bestfeature][value] = createdtree(newdataset, newlabels)

    return tree


if __name__ == "__main__":
    # Load the training samples
    dataset, labels = createdataset()
    # Build the recursive decision tree
    tree = createdtree(dataset, labels)
    # Plot it
    createPlot(tree)

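For drawing the decision tree with matplotlib, I will just paste the program without going into detail for now; matplotlib probably deserves a big topic of its own. This is a fairly general program: pass it a decision tree and it works.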

#coding: utf8
import matplotlib.pyplot as plt

# Text-box and arrow styles
decisionNode = dict(boxstyle="sawtooth", fc="0.8")  # style of a decision node
leafNode = dict(boxstyle="round4", fc="0.8")  # style of a leaf node
arrow_args = dict(arrowstyle="<-")  # arrow style

# Draw an annotated node with an arrow
# nodeTxt: node text, centerPt: center of the node,
# parentPt: start of the arrow (position of the parent node), nodeType: node style
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

# Count the leaf nodes
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):  # a dict is a subtree
            numLeafs += getNumLeafs(secondDict[key])  # recurse into the subtree
        else:
            numLeafs += 1  # a leaf node: add 1
    return numLeafs

# Compute the depth of the tree
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])  # subtree: one more level, then recurse
        else:
            thisDepth = 1
        # keep the maximum depth
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

# Write text halfway between a parent and a child node
# cntrPt: child position, parentPt: parent position, txtString: text to write
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

# Draw the tree
# myTree: the tree as a dict, parentPt: parent node position, nodeTxt: text on the incoming edge
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)  # number of leaves under this node
    firstStr = list(myTree.keys())[0]  # label of this node
    # compute the position of the current node
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)  # text between parent and child
    plotNode(firstStr, cntrPt, parentPt, decisionNode)  # the annotated node with its arrow
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):  # a subtree
            plotTree(secondDict[key], cntrPt, str(key))  # draw it recursively
        else:  # a leaf node
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))  # width of the tree
    plotTree.totalD = float(getTreeDepth(inTree))  # depth of the tree
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

The result is shown in the figure below: