Information gain serves as the splitting criterion at each node of the decision tree. Since the training set contains no continuous-valued features, this article uses the ID3 algorithm.
With information gain, the criterion is how much information a feature contributes to the classification system: the more information it brings, the more important the feature. For a given feature, the amount of information in the system changes depending on whether that feature is present or absent, and the difference between the two is the amount of information the feature contributes. This amount of information is measured by entropy. Suppose a variable X can take n possible values, the i-th occurring with probability Pi; the entropy of X is then defined as:

H(X) = -∑_{i=1}^{n} Pi · log2(Pi)
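For example, a fair coin with P1 = P2 = 1/2 has H(X) = -(1/2)·log2(1/2) - (1/2)·log2(1/2) = 1 bit, the maximum uncertainty for two outcomes, while a coin that always lands heads has H(X) = 0: no uncertainty, and therefore no information to gain.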
(1) Computing the information entropy of the root node
from math import log

def CalRootEntropy(dataset):  # compute the entropy of the root node (the whole dataset)
    Num = len(dataset)
    labels = {}
    for i in dataset:  # count how many samples fall into each class (last column = class label)
        currentlabel = i[-1]
        if currentlabel not in labels.keys():
            labels[currentlabel] = 0
        labels[currentlabel] += 1
    RootEntropy = 0.0
    for i in labels:  # accumulate -p * log2(p) over all classes
        Cali = float(labels[i]) / Num
        RootEntropy -= Cali * log(Cali, 2)
    return RootEntropy
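A quick way to sanity-check the function is to run it on a small hand-made dataset. The dataset below is purely illustrative (each row lists the feature values followed by the class label) and is reused in the later snippets:

dataset = [[1, 1, 'yes'],
           [1, 1, 'yes'],
           [1, 0, 'no'],
           [0, 1, 'no'],
           [0, 1, 'no']]
print(CalRootEntropy(dataset))  # about 0.971, i.e. -(2/5)*log2(2/5) - (3/5)*log2(3/5)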
(2) Splitting the dataset on a given value of a feature, storing the result in a list
def splitDataset(dataset, axis, value):
    ReturnList = []
    for i in dataset:
        if i[axis] == value:  # keep only samples whose feature 'axis' equals 'value'
            ReturnSingle = i[:axis]
            ReturnSingle.extend(i[axis + 1:])  # drop the column that was split on
            ReturnList.append(ReturnSingle)
    return ReturnList
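Reusing the toy dataset above, splitting on value 1 of feature 0 keeps three rows and removes that column:

print(splitDataset(dataset, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]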
(3) Choosing the feature on which to branch at a node
def choose(dataset):
    NumOfFeature = len(dataset[0]) - 1  # the last column is the class label, not a feature
    RootEntropy = CalRootEntropy(dataset)
    BestP = 0.0  # best information gain found so far
    bestfeature = -1
    for i in range(NumOfFeature):
        ListLabel = [L[i] for L in dataset]
        UniqLabel = set(ListLabel)  # distinct values taken by feature i
        newEntropy = 0.0
        for j in UniqLabel:  # weighted entropy of the subsets after splitting on feature i
            SD = splitDataset(dataset, i, j)
            prob = len(SD) / float(len(dataset))
            newEntropy += prob * CalRootEntropy(SD)
        BestHA = RootEntropy - newEntropy  # information gain of feature i
        if BestHA > BestP:
            BestP = BestHA
            bestfeature = i
    return bestfeature
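On the illustrative dataset, feature 0 yields an information gain of about 0.420 versus about 0.171 for feature 1, so the function picks feature 0:

print(choose(dataset))  # 0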
(4) Majority voting at leaf nodes
import operator

def major(classlist):  # return the class that occurs most often in classlist
    classcount = {}
    for i in classlist:
        if i not in classcount.keys():
            classcount[i] = 0
        classcount[i] += 1
    sortedClass = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClass[0][0]
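This is used when the features are exhausted but the remaining samples still disagree; the leaf is then labelled with the majority class:

print(major(['yes', 'no', 'no']))  # 'no'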
(5) Building the final decision tree
This is the most involved part of the code: it relies on nested dictionaries and recursion.
def createTree(dataset, labels):
    classList = [i[-1] for i in dataset]
    if classList.count(classList[0]) == len(classList):  # all samples share one class: return a leaf
        return classList[0]
    if len(dataset[0]) == 1:  # no features left: fall back to majority voting
        return major(classList)
    bestFeat = choose(dataset)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [E[bestFeat] for E in dataset]
    Uniq = set(featValues)
    for value in Uniq:  # build one subtree per value of the chosen feature
        subLabels = labels[:]  # copy so the recursive calls do not share one list
        myTree[bestFeatLabel][value] = createTree(splitDataset(dataset, bestFeat, value), subLabels)
    return myTree
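Putting everything together on the illustrative dataset (the feature names below are made up for the example):

labels = ['no surfacing', 'flippers']  # hypothetical names for the two feature columns
myTree = createTree(dataset, labels[:])  # pass a copy, since createTree deletes entries from labels
print(myTree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

The nested dictionaries mirror the tree structure: each top-level key is a feature name, each key one level down is a value of that feature, and each value is either a class label (a leaf) or another subtree.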
This completes the construction of the decision tree. The ID3 algorithm has a number of shortcomings, the most common being overfitting; the C4.5 and CART algorithms improve on it.