Implementing Decision Tree Classification (ID3) in Python

Series index: Python Data Analysis Notes

The data and most of the code in this article come from Machine Learning in Action.

Imports

import pandas as pd
import numpy as np
# trees is a .py file we write ourselves, kept in the same directory; its code appears below
import trees
from math import log
import operator

Dataset

file.txt

No.	no_surfacing	flippers	fish
1	L1	R1	yes
2	L1	R1	yes
3	L1	R0	no
4	L0	R1	no
5	L0	R1	no
6	L0	R1	yes
7	L0	R1	why
8	L2	R2	yes
9	L0	R0	why

The exact data doesn't matter here. I suggest creating a new txt file named file.txt; to make the later print output easy to follow, each column uses its own distinctive markers (L* and R*). You can change the data at any time to see what each step of the code means.

Information Entropy

There's no way around it: even if we only want working code, the code is hard to follow without at least a basic grasp of this concept. In short, information entropy measures disorder. An event that is certain to happen is the "purest", cleanest case: its entropy is 0. Conversely, the more possible outcomes there are and the more uncertain the event, the higher the entropy (entropy can be greater than 1). Entropy reaches its maximum when all outcomes are equally likely. For class probabilities p_i it is H = -Σ p_i · log2(p_i). Look up "Shannon entropy" for more; it's an interesting read.
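
As a quick sanity check, here is a minimal from-scratch sketch of that formula (my own illustration, separate from the trees.py code below):

from collections import Counter
from math import log2

def entropy(labels):
    # H = -sum(p * log2(p)) over the observed class frequencies
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy([1, 1, 1, 2, 2, 2]))  # two equally likely classes -> 1.0
print(entropy(['a', 'a', 'a']))     # a certain outcome -> 0 bits (prints -0.0)
print(entropy([1, 2, 3, 4]))        # four equally likely classes -> 2.0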

Computing the Entropy

Code in trees.py:

def calcShannonEnt(dataSet):
    dataSet = pd.DataFrame(dataSet)            # accepts a list, ndarray, or DataFrame
    n = dataSet.shape[0]                       # number of rows
    iset = dataSet.iloc[:, -1].value_counts()  # count of each class in the last column
    p = iset / n                               # class probabilities
    ent = (-p * np.log2(p)).sum()              # Shannon entropy: -sum(p * log2(p))
    return ent

The code is short: it accepts a list, a NumPy array, or a pandas object, and returns the entropy of the data (for multi-column data, it returns only the entropy of the last column).

Test code:

print(trees.calcShannonEnt([1,1,1,2,2,2]))
print(trees.calcShannonEnt(np.array([1,1,2,2,2,2])))
data_file = pd.read_csv('file.txt', sep='\t')
print(trees.calcShannonEnt(data_file))
print(trees.calcShannonEnt(data_file['flippers']))
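
With the data above, the four calls should print roughly 1.0 (two equally likely classes), 0.9183, 1.5305 (the 'fish' column: 4 yes, 3 no, 2 why), and 1.2244 (the 'flippers' column: 6 R1, 2 R0, 1 R2).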

Splitting the Dataset

Code in trees.py:

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:              # keep only rows whose column `axis` equals `value`
            reducedFeatVec = featVec[:axis]     # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

Argument 1: the dataset (<class 'list'>)
Argument 2: the position of the splitting feature, i.e. which column (0 is the first column)
Argument 3: the value of that feature to split on
Return value: the split result (<class 'list'>)

Test code:

data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
a = data_file.values
b = a.tolist()
print(trees.splitDataSet(b, 0, 'L1'))
print(trees.splitDataSet(b, 1, 'R1'))
print(trees.splitDataSet(b, 2, "no"))
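
With the data above, the three calls should print:

[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
[['L1', 'R0'], ['L0', 'R1'], ['L0', 'R1']]

Notice that the column used for splitting is removed from each returned row.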

Finding the Split that Minimizes Entropy

One more note: material I found later puts it this way: "The ID3 algorithm uses information gain: after I split on this feature, by how much does my entropy decrease? The more the uncertainty shrinks, the more information we gain, so we choose the feature with the largest information gain as the node."
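
To make that concrete, here is a minimal sketch (my own, reusing the two functions already defined) that computes the information gain of one feature by hand; feature index 0 ('no_surfacing') is just an example:

import pandas as pd
import trees

data = pd.read_csv('file.txt', sep='\t').iloc[:, 1:].values.tolist()
base = trees.calcShannonEnt(data)    # entropy before splitting
i = 0                                # feature index: 'no_surfacing'
weighted = 0.0
for value in set(row[i] for row in data):
    subset = trees.splitDataSet(data, i, value)
    # weight each subset's entropy by its share of the rows
    weighted += len(subset) / len(data) * trees.calcShannonEnt(subset)
print(base - weighted)               # information gain, about 0.3789
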
Code in trees.py:

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            # in practice, comment these out; the three print calls here exist only to show what the function is doing
            print(subDataSet)
            print(calcShannonEnt(subDataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        print()
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

This one is longer, so I copied it straight from the book. It builds on the two functions above, calcShannonEnt and splitDataSet. Its job is to find the split criterion that leaves the target column with the lowest weighted entropy after splitting, i.e. the largest information gain; in other words, the first split should leave things looking as clean as possible. Let's see it in action.

Test code:
The chooseBestFeatureToSplit above already includes the print calls, so just run:

data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
print(data_file)
a = data_file.values
b = a.tolist()

best = trees.chooseBestFeatureToSplit(b)
print(best)

Output:

  no_surfacing flippers fish
0           L1       R1  yes
1           L1       R1  yes
2           L1       R0   no
3           L0       R1   no
4           L0       R1   no
5           L0       R1  yes
6           L0       R1  why
7           L2       R2  yes
8           L0       R0  why
[['R2', 'yes']]
0.0
[['R1', 'no'], ['R1', 'no'], ['R1', 'yes'], ['R1', 'why'], ['R0', 'why']]
1.5219280948873621
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
0.9182958340544896

[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
1.4591479170272448
[['L1', 'no'], ['L0', 'why']]
1.0
[['L2', 'yes']]
0.0

0

Let's use the test output to see what this function does. First it splits on the first column, 'no_surfacing', dividing the data into:

[['R2', 'yes']]
0.0
[['R1', 'no'], ['R1', 'no'], ['R1', 'yes'], ['R1', 'why'], ['R0', 'why']]
1.5219280948873621
[['R1', 'yes'], ['R1', 'yes'], ['R0', 'no']]
0.9182958340544896

Three subsets in total. The entropy reported is that of the last column, 'fish'; the three subsets' entropies are 0.0, 1.5219280948873621, and 0.9182958340544896. The code then weights each by the subset's share of the 9 rows, giving a split entropy of (1/9)·0.0 + (5/9)·1.5219 + (3/9)·0.9183 ≈ 1.1516.

Then it splits on the second column, 'flippers', again into three subsets:

[['L1', 'yes'], ['L1', 'yes'], ['L0', 'no'], ['L0', 'no'], ['L0', 'yes'], ['L0', 'why']]
1.4591479170272448
[['L1', 'no'], ['L0', 'why']]
1.0
[['L2', 'yes']]
0.0

Weighted the same way, this split comes to (6/9)·1.4591 + (2/9)·1.0 + (1/9)·0.0 ≈ 1.1950, a little higher than the first split's ≈ 1.1516 (note that the code weights each subset's entropy by its share of the rows; it is not the plain sum of the three values). Splitting on the first column ('no_surfacing') therefore leaves the data more ordered: its information gain is about 1.5305 − 1.1516 ≈ 0.3789, versus about 0.3355 for 'flippers'. That is why the function returns 0, telling us the first column is best (0 means the first column).
You can change the data in file.txt at will and rerun to see exactly what this function does.

Building the Full Tree

Code in trees.py:

def majorityCnt(classList):
    # majority vote: return the most frequent class label
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree
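
A quick sanity check of majorityCnt on its own (my own test, not from the book):

print(trees.majorityCnt(['yes', 'no', 'yes', 'why']))  # majority vote -> 'yes'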

Time and energy are limited (so is ability), so I won't pull my hair out over a line-by-line walkthrough. In short, createTree recurses: at each node it picks the best feature with chooseBestFeatureToSplit and branches on each of that feature's values; it stops when all remaining samples share one class, or when no features are left, in which case majorityCnt returns the most common class. Straight to the test code.

data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
print(data_file)
a = data_file.values
b = a.tolist()

label = list(data_file.keys()[:-1])
print(b)
print(label)
mytree = trees.createTree(b, label)
print(mytree)

The last bit of code is simple; the output is:

  no_surfacing flippers fish
0           L1       R1  yes
1           L1       R1  yes
2           L1       R0   no
3           L0       R1   no
4           L0       R1   no
5           L0       R1  yes
6           L0       R1  why
7           L2       R2  yes
8           L0       R0  why
[['L1', 'R1', 'yes'], ['L1', 'R1', 'yes'], ['L1', 'R0', 'no'], ['L0', 'R1', 'no'], ['L0', 'R1', 'no'], ['L0', 'R1', 'yes'], ['L0', 'R1', 'why'], ['L2', 'R2', 'yes'], ['L0', 'R0', 'why']]
['no_surfacing', 'flippers']
{'no_surfacing': {'L1': {'flippers': {'R0': 'no', 'R1': 'yes'}}, 'L0': {'flippers': {'R0': 'why', 'R1': 'no'}}, 'L2': 'yes'}}

Calling createTree() is simple: the second argument is ['no_surfacing', 'flippers'], the feature labels (a list), i.e. the header row's column names (excluding the final result column); the first argument is the whole dataset (a list).
The raw output isn't very readable, so let's format it.

{'no_surfacing': {'L1': {'flippers': {'R0': 'no', 'R1': 'yes'}}, 'L0': {'flippers': {'R0': 'why', 'R1': 'no'}}, 'L2': 'yes'}}

Just copy it into the SoJson site and click Format.

{
	'no_surfacing': {
		'L1': {
			'flippers': {
				'R0': 'no',
				'R1': 'yes'
			}
		},
		'L0': {
			'flippers': {
				'R0': 'why',
				'R1': 'no'
			}
		},
		'L2': 'yes'
	}
}

Now it's easy to read: the first split on 'no_surfacing' gives three branches, of which L2 yields a result immediately. The L0 and L1 branches each split once more and then yield their results as well.
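
The article's code stops at building the tree. As a small extra (my own sketch, not part of the book's trees.py), here is one way to walk the nested dict and classify a new sample; it assumes every feature value of the sample appears in the tree:

def classify(tree, labels, sample):
    feature = next(iter(tree))          # the feature this node tests
    branch = tree[feature][sample[labels.index(feature)]]
    # a nested dict is another decision node; anything else is a class label
    return classify(branch, labels, sample) if isinstance(branch, dict) else branch

print(classify(mytree, ['no_surfacing', 'flippers'], ['L1', 'R1']))  # -> 'yes'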

Full Code

trees.py

'''
Created on Oct 12, 2010
Decision Tree Source Code for Machine Learning in Action Ch. 3
@author: Peter Harrington
'''
from math import log
import operator
import numpy as np
import pandas as pd

def calcShannonEnt(dataSet):
    dataSet = pd.DataFrame(dataSet)
    n = dataSet.shape[0]
    iset = dataSet.iloc[:, -1].value_counts()
    p = iset / n
    ent = (-p * np.log2(p)).sum()
    return ent
    
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
    
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree

test.py

import pandas as pd
import numpy as np
import trees
from math import log

data_file = pd.read_csv('file.txt', sep='\t')
data_file = data_file.iloc[:,1:]
print(data_file)
a = data_file.values
b = a.tolist()

label = list(data_file.keys()[:-1])
print(b)
print(label)
mytree = trees.createTree(b, label)
print(mytree)