Companion book: Zhou Zhihua's Machine Learning (the "watermelon book")
IDE: PyCharm
OS: Windows 10
Python: 3.6.4
Overall, what the decision tree algorithm ultimately has to produce is a tree made up of a root node, internal nodes, and leaf nodes. Translated into Python terms, what needs to be generated is a multiply nested dictionary: the algorithm is essentially a process of building that nested dictionary.
Page 74 of Zhou Zhihua's watermelon book gives the basic decision tree algorithm:
Input: training set D = {(x1, y1), (x2, y2), ..., (xm, ym)}
       attribute set A = {a1, a2, ..., ad}
Process: function create_tree(training set D, attribute set A)
1. Generate node `node`.
2. if all samples in D belong to the same class C then
       mark node as a class-C leaf node; return
   end if
3. if A is empty, or every sample in D takes the same values on A then
       mark node as a leaf node whose class is the majority class in D; return
   end if
4. Select the optimal splitting attribute a* from A.
   for each value a*^v of a*:
       generate a branch for node; let D^v be the subset of samples in D that take value a*^v on a*
       if D^v is empty then
           mark the branch node as a leaf whose class is the majority class in D; return
       else
           create_tree(D^v, A \ {a*})
       end if
   end for
Output: a multiply nested dictionary (the decision tree)
Expressed in Python, the input is clearly an m*n list together with a k-dimensional list: k feature labels, m samples, each sample of dimension n.
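Concretely, with the toy data used throughout this post (it comes from creatdataset() in the full listing at the end), the two inputs look like this:

dataset = [[1, 1, 'yes'],   # m = 5 samples; the first n - 1 = 2 columns are feature values
           [1, 1, 'yes'],   # and the last column is the class label
           [1, 0, 'no'],
           [0, 1, 'no'],
           [0, 1, 'no']]
labels = ['no surfacing', 'flippers']   # k = 2 feature names, one per feature column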
Process:
def creattree(self, dataset, labels):
    classlist = [example[-1] for example in dataset]  # pull the class label (last column) out of every sample
    if classlist.count(classlist[0]) == len(classlist):  # every sample belongs to the same class: nothing to split, return that class
        return classlist[0]
    if len(dataset[0]) == 1:  # only the label column remains, i.e. all features are used up: call majoritycnt(), which returns the most frequent class name
        return self.majoritycnt(classlist)
    bestfeat = self.choosebestshannonent(dataset)  # choosebestshannonent() picks the feature that reduces the entropy the most and returns its column index
    bestfeatlabel = labels[bestfeat]  # this line and the next exist purely to build the dictionary node
    mytree = {bestfeatlabel: {}}
    del labels[bestfeat]  # remove the chosen feature from the label list
    featvalues = [example[bestfeat] for example in dataset]
    uniqualset = set(featvalues)
    # splitdataset() strips out the chosen best feature, and the recursion continues on the remaining
    # features. With the toy data the first chosen feature sits at column 0 and takes the values 0 and 1:
    # branch 0 ends in the leaf 'no', branch 1 recurses further.
    for value in uniqualset:
        sublabels = labels[:]
        mytree[bestfeatlabel][value] = self.creattree(self.splitdataset(dataset, bestfeat, value), sublabels)
    return mytree
If the for loop is commented out, the result is just {'no surfacing': {}}; only with the for loop do the 0 and 1 branches appear.
The three important helper functions, choosebestshannonent(), majoritycnt(), and splitdataset(), are explained below.
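For reference, running creattree() on the toy dataset should produce the nested dictionary below; the 0 branch is the 'no' leaf just mentioned, and the 1 branch recurses into the 'flippers' feature:

mytree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}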
1. splitdataset(): when the greedy step picks the current best feature, the dataset has to be pruned accordingly, i.e. the matching samples are kept and the column holding that feature's values is removed from each of them.
def splitdataset(self, dataset, axis, value):
    retdateset = []
    for featvet in dataset:
        if featvet[axis] == value:  # keep only the samples whose feature at column `axis` equals `value`...
            reducedat = featvet[:axis]
            reducedat.extend(featvet[axis + 1:])  # ...and drop that column from each kept sample
            retdateset.append(reducedat)
    return retdateset
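A quick sanity check of splitdataset() on the same toy data (a standalone sketch; Decicion_Tree is the class defined in the full listing below):

dataset = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
# keep only the samples whose column 0 equals 1, then drop column 0
print(Decicion_Tree().splitdataset(dataset, 0, 1))
# -> [[1, 'yes'], [1, 'yes'], [0, 'no']]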
2. majoritycnt(): in the algorithm above, when len(dataset[0]) == 1, every feature has been split away and only the label column remains, so no further split is possible. The class assigned to such a node is then decided by majority vote: the value that occurs most often in the remaining class list.
def majoritycnt(self, classlist):
    classcount = {}
    for vote in classlist:  # tally how many times each class label occurs
        if vote not in classcount.keys():
            classcount[vote] = 0
        classcount[vote] += 1
    sortedclasscount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)  # sort the (label, count) pairs by count, descending
    return sortedclasscount[0][0]  # the label of the most frequent class
In Python 2.x, the sort can equivalently be written as
    sortedclasscount = sorted(classcount.iteritems(), key=operator.itemgetter(1), reverse=True)
iteritems() was removed in Python 3.x; its replacement is items(), which is what the code above uses.
items() returns the dictionary's (key, value) pairs. For example, with
    a = {'0': 0, '1': 1, '2': 2}
    list(a.items())
gives [('0', 0), ('1', 1), ('2', 2)]. (In Python 3, items() returns a view object rather than a list, so wrap it in list() when an actual list is needed; sorted() accepts the view directly.)
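As an aside, the same majority vote can be written in one line with collections.Counter from the standard library; this is an equivalent alternative added here for comparison, not code from the original listing:

from collections import Counter

def majoritycnt_alt(classlist):
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(classlist).most_common(1)[0][0]

print(majoritycnt_alt(['no', 'no', 'yes']))   # -> 'no'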
3. The most important function: choosebestshannonent().
def choosebestshannonent(self, dataset):
    numfeature = len(dataset[0]) - 1  # dimensionality of a single sample (minus the label column), needed for the entropy loop
    baseentroy = self.calshannonent(dataset)  # calshannonent() computes the entropy of the whole dataset
    bestinfshannonent = 0.0
    bestfeature = -1
    for i in range(numfeature):
        featlist = [example[i] for example in dataset]  # list every sample's value for feature i
        uniqualist = set(featlist)  # set() removes the duplicates, leaving the distinct values of this feature
        newentroy = 0.0
        for value in uniqualist:  # for every distinct value, take its frequency (its count divided by the total)...
            subdataset = self.splitdataset(dataset, i, value)  # ...and the entropy of the subset left after splitting on it; the weighted sum is the post-split entropy
            probs = len(subdataset) / float(len(dataset))
            newentroy += probs * self.calshannonent(subdataset)
        infshannonent = baseentroy - newentroy
        if infshannonent > bestinfshannonent:  # keep the feature whose split reduces the entropy the most, i.e. the optimal one
            bestinfshannonent = infshannonent
            bestfeature = i
    return bestfeature  # the position (column index) of that feature, an integer
Note that this function takes len(dataset[0]) as the dimensionality of a single sample, so this decision tree requires every sample to have the same length.
The complete code follows:
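To make the selection step concrete, here is the information-gain arithmetic worked out by hand for the toy dataset (2 'yes' / 3 'no' labels), a standalone sketch that mirrors what choosebestshannonent() computes:

import math

# Base entropy of the whole set: 2 'yes' and 3 'no' out of 5 samples.
base = -(2/5) * math.log(2/5, 2) - (3/5) * math.log(3/5, 2)    # ≈ 0.971

# Feature 0 ('no surfacing'): value 0 -> 2 samples, all 'no' (entropy 0);
# value 1 -> 3 samples with 2 'yes' / 1 'no'.
ent_v1 = -(2/3) * math.log(2/3, 2) - (1/3) * math.log(1/3, 2)  # ≈ 0.918
gain_0 = base - (2/5) * 0 - (3/5) * ent_v1                     # ≈ 0.420

# Feature 1 ('flippers'): value 0 -> 1 sample, 'no' (entropy 0);
# value 1 -> 4 samples split 2 'yes' / 2 'no' (entropy 1.0).
gain_1 = base - (1/5) * 0 - (4/5) * 1.0                        # ≈ 0.171

print(gain_0 > gain_1)   # True: feature 0 wins, matching bestfeature = 0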
#_*_coding:utf-8_*_
#_*_python3.6_*_
"""
@filename:decision_tree
@data    :2018/8/30 7:26
"""
import math
import operator


class Decicion_Tree():
    def __init__(self, filename="decision_tree.txt"):
        self.filename = filename

    def creatdataset(self):
        dataset = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
        label = ['no surfacing', 'flippers']
        return dataset, label

    def calshannonent(self, dataset):
        nument = len(dataset)
        labelcount = {}
        for featvet in dataset:
            currentlabel = featvet[-1]
            if currentlabel not in labelcount.keys():
                labelcount[currentlabel] = 0
            labelcount[currentlabel] += 1
        shannonent = 0
        for key in labelcount:
            prob = float(labelcount[key]) / nument
            shannonent -= prob * math.log(prob, 2)
        return shannonent

    def splitdataset(self, dataset, axis, value):
        retdateset = []
        for featvet in dataset:
            if featvet[axis] == value:
                reducedat = featvet[:axis]
                reducedat.extend(featvet[axis + 1:])
                retdateset.append(reducedat)
        return retdateset

    def choosebestshannonent(self, dataset):
        numfeature = len(dataset[0]) - 1
        baseentroy = self.calshannonent(dataset)
        bestinfshannonent = 0.0
        bestfeature = -1
        for i in range(numfeature):
            featlist = [example[i] for example in dataset]
            uniqualist = set(featlist)
            newentroy = 0.0
            for value in uniqualist:
                subdataset = self.splitdataset(dataset, i, value)
                probs = len(subdataset) / float(len(dataset))
                newentroy += probs * self.calshannonent(subdataset)
            infshannonent = baseentroy - newentroy
            if infshannonent > bestinfshannonent:
                bestinfshannonent = infshannonent
                bestfeature = i
        return bestfeature

    def majoritycnt(self, classlist):
        classcount = {}
        for vote in classlist:
            if vote not in classcount.keys():
                classcount[vote] = 0
            classcount[vote] += 1
        sortedclasscount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedclasscount[0][0]

    def creattree(self, dataset, labels):
        classlist = [example[-1] for example in dataset]
        if classlist.count(classlist[0]) == len(classlist):
            return classlist[0]
        if len(dataset[0]) == 1:
            return self.majoritycnt(classlist)
        bestfeat = self.choosebestshannonent(dataset)
        bestfeatlabel = labels[bestfeat]
        mytree = {bestfeatlabel: {}}
        del labels[bestfeat]
        featvalues = [example[bestfeat] for example in dataset]
        uniqualset = set(featvalues)
        for value in uniqualset:
            sublabels = labels[:]
            mytree[bestfeatlabel][value] = self.creattree(self.splitdataset(dataset, bestfeat, value), sublabels)
        return mytree

That completes the decision tree. The visualization part is still to come...
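A minimal driver to run the class end to end; the __main__ block below is an addition for illustration, not part of the original listing:

if __name__ == "__main__":
    dt = Decicion_Tree()
    dataset, labels = dt.creatdataset()
    print(dt.creattree(dataset, labels))
    # -> {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}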