Decision Trees and Their Implementation

This post implements a Decision Tree step by step in Python, in the following stages:

  • Loading the dataset
  • Computing entropy
  • Splitting the data on the best split feature
  • Choosing the best split feature by maximum information gain
  • Building the tree recursively
  • Classifying samples

The theory of decision trees is barely covered here; for the background, search for the keywords "decision tree", "information gain", and "entropy".

The theory is reflected directly in the code.

All of the code in this post lives in a single .py file.



1. Loading the dataset

We use the classic UCI Iris dataset as the example.

Brief of IRIS:

Data Set Characteristics: Multivariate
Number of Instances: 150
Area: Life
Attribute Characteristics: Real
Number of Attributes: 4
Date Donated: 1988-07-01
Associated Tasks: Classification
Missing Values? No
Number of Web Hits: 533125


Code:

from numpy import *
#load "iris.data" into the workspace (raw strings keep the Windows backslashes literal)
traindata = loadtxt(r"D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data",delimiter = ',',usecols = (0,1,2,3),dtype = float)
trainlabel = loadtxt(r"D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data",delimiter = ',',usecols = (4,),dtype = str)
feaname = ["#0","#1","#2","#3"] # feature names of the 4 attributes (features)

Result:

traindata now holds the four real-valued features of each sample, and trainlabel holds the class label (three classes: Iris-setosa, Iris-versicolor, Iris-virginica).
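
A quick sanity check on the loaded arrays (a sketch, not in the original post; the expected values follow from the UCI description above):

#check the shapes and class names of the loaded data
print traindata.shape           # expected: (150, 4)
print trainlabel.shape          # expected: (150,)
print set(trainlabel.tolist())  # the three Iris class names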




2. Computing entropy

Entropy was introduced by Shannon (the founder of information theory); see Wikipedia for the formal definition.

Note that the entropy computed here is H(C|X=xi), not H(C|X); H(C|X) is computed in step 4 as the probability-weighted sum over the branches.
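
For reference, in the notation used above, calentropy computes the entropy of a set of labels, and the conditional entropy used later is the weighted sum over branches:

H(C) = -\sum_{k} p_k \log_2 p_k,  where p_k is the fraction of samples with label k

H(C \mid X) = \sum_{i} p(x_i) \, H(C \mid X = x_i)

Applied to the labels of the branch X = xi, the first formula gives H(C|X=xi).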

Code:

from math import log
def calentropy(label):
    n = label.size # the number of samples
    #print n
    count = {} #create dictionary "count"
    for curlabel in label:
        if curlabel not in count.keys():
            count[curlabel] = 0
        count[curlabel] += 1
    entropy = 0
    #print count
    for key in count:
        pxi = float(count[key])/n #cast to float first to avoid integer division
        entropy -= pxi*log(pxi,2)
    return entropy

#testcode:
#x = calentropy(trainlabel)
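
A tiny hand-checkable example (a sketch, not from the post): a two-class label set split 50/50 should give exactly 1 bit.

#quick check: a balanced two-class label set has entropy 1.0
print calentropy(array(["a","a","b","b"]))   # 1.0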


Result:

For the full label set (50 samples of each of the three classes), calentropy(trainlabel) returns log2(3) ≈ 1.585.

3. Splitting the data on the best split feature

Assume for now that the best split feature has already been found (its index is splitfea_idx); this step performs the actual split.

The second function, idx2data, takes the two index sets produced by splitdata and returns datal (samples whose feature value is less than the pivot), datag (samples greater than the pivot), and the corresponding labels labell and labelg. The mean of the selected feature is used as the pivot.

#split the dataset according to feature "splitfea_idx" (pivot = the mean of that feature)
def splitdata(oridata,splitfea_idx):
    arg = args[splitfea_idx] #the mean of this feature, used as the pivot
    idx_less = [] #create new list including data with feature less than pivot
    idx_greater = [] #includes entries with feature greater than pivot
    n = len(oridata)
    for idx in range(n):
        d = oridata[idx]
        if d[splitfea_idx] < arg:
            #add the newentry into newdata_less set
            idx_less.append(idx)
        else:
            idx_greater.append(idx)
    return idx_less,idx_greater

#testcode:2
#idx_less,idx_greater = splitdata(traindata,2)


#give the data and labels according to index
def idx2data(oridata,label,splitidx,fea_idx):
    idxl = splitidx[0] #split_less_indices
    idxg = splitidx[1] #split_greater_indices
    datal = []
    datag = []
    labell = []
    labelg = []
    for i in idxl:
        datal.append(append(oridata[i][:fea_idx],oridata[i][fea_idx+1:]))
    for i in idxg:
        datag.append(append(oridata[i][:fea_idx],oridata[i][fea_idx+1:]))
    labell = label[idxl]
    labelg = label[idxg]
    return datal,datag,labell,labelg


Here args holds the thresholds that decide the split at a node: one value per feature, samples with a larger value go to the ">" branch and smaller ones to the "<" branch. We define it as the per-feature mean:

args = mean(traindata,axis = 0)



Test: splitting on feature #2 gives the "less" and "greater" index sets. That is, splitting the sample set at args[2] puts 57 samples in the "<" branch and 93 in the ">" branch.
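
A minimal sketch reproducing that test, plus the corresponding sub-datasets via idx2data (the 57/93 counts are the ones reported above):

idx_less,idx_greater = splitdata(traindata,2)
print len(idx_less), len(idx_greater)   # reported above: 57 93
datal,datag,labell,labelg = idx2data(traindata,trainlabel,(idx_less,idx_greater),2)
print len(datal), len(datag)            # same 57/93 split, with feature #2 removed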




4. Choosing the best split feature by maximum information gain

The information gain is info_gain in the code; the conditional entropy is the probability-weighted sum computed in the commented lines.
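
In formulas, the gain of splitting on feature X (base_entropy minus cur_entropy in choosebest_splitnode below) is:

IG(C, X) = H(C) - H(C \mid X) = H(C) - \sum_{i} p(x_i) \, H(C \mid X = x_i)

and the feature with the largest gain is selected.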

#select the best feature to split on
def choosebest_splitnode(oridata,label):
    n_fea = len(oridata[0])
    n = len(label)
    base_entropy = calentropy(label)
    best_gain = -1
    for fea_i in range(n_fea): #calculate entropy under each splitting feature
        cur_entropy = 0
        idxset_less,idxset_greater = splitdata(oridata,fea_i)
        prob_less = float(len(idxset_less))/n
        prob_greater = float(len(idxset_greater))/n
        
        #entropy(value|X) = \sum{p(xi)*entropy(value|X=xi)}
        cur_entropy += prob_less*calentropy(label[idxset_less])
        cur_entropy += prob_greater * calentropy(label[idxset_greater])
        
        info_gain = base_entropy - cur_entropy #gain = entropy before the split minus entropy after
        if(info_gain>best_gain):
            best_gain = info_gain
            best_idx = fea_i
    return best_idx  

#testcode:
#x = choosebest_splitnode(traindata,trainlabel)



The test above runs the selection over the full dataset: which feature is chosen for the very first split?






5. Building the tree recursively

buildtree constructs the tree recursively; see the comments in the code for the details.

The recursion stops when:

① the branch contains no samples (the subset is empty), or

② all samples in the subset belong to the same class, or

③ no features remain (each split consumes one feature); in that case the majority label of the current subset is returned.

#create the decision tree based on information gain
def buildtree(oridata, label):
    if label.size==0: #if no samples belong to this branch
        return "NULL"
    listlabel = label.tolist()
    #stop when all samples in this subset belong to one class
    if listlabel.count(label[0])==label.size:
        return label[0]

    #return the majority label of this subset if no features remain available
    if len(feanamecopy)==0:
        cnt = {}
        for cur_l in label:
            if cur_l not in cnt.keys():
                cnt[cur_l] = 0
            cnt[cur_l] += 1
        maxx = -1 
        for keys in cnt:
            if maxx < cnt[keys]:
                maxx = cnt[keys]
                maxkey = keys
        return maxkey
    
    bestsplit_fea = choosebest_splitnode(oridata,label) #get the best splitting feature
    print bestsplit_fea,len(oridata[0])
    cur_feaname = feanamecopy[bestsplit_fea] # add the feature name to dictionary
    print cur_feaname
    nodedict = {cur_feaname:{}} 
    del(feanamecopy[bestsplit_fea]) #delete current feature from feaname
    split_idx = splitdata(oridata,bestsplit_fea) #split_idx: the split index for both less and greater
    data_less,data_greater,label_less,label_greater = idx2data(oridata,label,split_idx,bestsplit_fea)
    
    #build the tree recursively, the left and right tree are the "<" and ">" branch, respectively
    nodedict[cur_feaname]["<"] = buildtree(data_less,label_less)
    nodedict[cur_feaname][">"] = buildtree(data_greater,label_greater)
    return nodedict
    
#testcode:
#mytree = buildtree(traindata,trainlabel)
#print mytree
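
buildtree reads and mutates the global list feanamecopy, which is never defined in the snippets above; a reasonable assumption (labeled as such) is a fresh copy of feaname made right before building the tree:

feanamecopy = feaname[:]  # assumed setup, not shown in the post: a copy, since buildtree deletes entries from it
mytree = buildtree(traindata,trainlabel)
print mytree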


Result:

mytree is the result: a nested dict in which a key such as '#1' names the feature used for the split at that node, and its '<' and '>' entries hold the subtrees for the less-than and greater-than branches.





6. Classifying samples

Classification walks the branches of mytree recursively, following the '<' or '>' edge at each node, until it reaches a leaf (a class label).

#classify a new sample
def classify(mytree,testdata):
    if type(mytree).__name__ != 'dict':
        return mytree
    fea_name = mytree.keys()[0] #get the name of first feature
    fea_idx = feaname.index(fea_name) #the index of feature 'fea_name'
    val = testdata[fea_idx]
    nextbranch = mytree[fea_name]
    
    #judge the current value > or < the pivot (average)
    if val>args[fea_idx]:
        nextbranch = nextbranch[">"]
    else:
        nextbranch = nextbranch["<"]
    return classify(nextbranch,testdata)

#testcode
tt = traindata[0]
x = classify(mytree,tt)
print x
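
As a rough check (a sketch, not part of the original post), the learned tree can be run over the whole training set to get a training accuracy:

#classify every training sample and count how many labels the tree reproduces
correct = 0
for i in range(len(traindata)):
    if classify(mytree,traindata[i]) == trainlabel[i]:
        correct += 1
print float(correct)/len(traindata)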

Result:

print x shows the class predicted for the first training sample.

To sanity-check the code, we change the args thresholds, setting them all to 0 (smaller than any feature value):

args = [0,0,0,0]

Building the tree and classifying with these thresholds confirms that no sample falls below the pivot (0), so the value stored under every '<' key of the dict is the empty branch ("NULL").
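
A sketch of that check (reusing the assumed feanamecopy setup from above; checktree is a hypothetical name for this second tree):

args = [0,0,0,0]          # thresholds smaller than every feature value
feanamecopy = feaname[:]  # fresh copy, since the previous build consumed it
checktree = buildtree(traindata,trainlabel)
print checktree           # every "<" entry should come back as "NULL"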




Download the full code from this post: Decision Tree implementation in Python

Reference: Machine Learning in Action

