# Machine Learning: Decision Trees

## How Decision Trees Work

### How is a decision tree constructed?

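    Check whether every item in the data set has the same class label:
        If so, return that class label
        Else:
            find the best feature for splitting the data set (the feature whose split gives the lowest entropy, i.e. the largest information gain)
            split the data set
            create a branch node
            for each subset produced by the split:
                call createBranch (the function that builds a branch) and add its result to the branch node
            return the branch node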
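
The entropy and information gain mentioned in the steps above are usually defined as follows. These are the standard textbook definitions; the symbols are chosen here for illustration and do not appear in the original post. $D$ is the data set, $p_k$ is the fraction of items in $D$ with class label $k$, $A$ is a feature, and $D_v$ is the subset of $D$ in which $A$ takes the value $v$.

$$
H(D) = -\sum_{k} p_k \log_2 p_k
$$

$$
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
$$

Since $H(D)$ is fixed for a given node, picking the feature with the largest gain is the same as picking the one with the smallest weighted entropy after the split.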

## Python Implementation

### Computing Shannon entropy (often simply called entropy)

    from math import log

    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)  # number of training samples
        labelCounts = {}
        for featVec in dataSet:  # count how often each class label occurs
            currentLabel = featVec[-1]
            if currentLabel not in labelCounts:
                labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key]) / numEntries
            shannonEnt -= prob * log(prob, 2)  # log base 2
        return shannonEnt
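
As a quick sanity check of the entropy calculation, the sketch below computes the same quantity with `collections.Counter` on a small made-up data set. The name `shannon_entropy` and the sample rows are illustrative assumptions, not part of the original post.

```python
from collections import Counter
from math import log2

def shannon_entropy(data_set):
    # Count class labels (last column), then sum -p * log2(p) over the classes.
    counts = Counter(row[-1] for row in data_set)
    total = len(data_set)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Hypothetical toy data: two binary features followed by a class label.
sample = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(round(shannon_entropy(sample), 4))  # 2 'yes' and 3 'no' labels -> 0.971
```

The more evenly the class labels are mixed, the closer the value gets to 1 for two classes; a data set with a single class has entropy 0.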

### Splitting the data set on a given feature

    # Arguments: the data set, the index of the feature to split on,
    # and the feature value whose rows should be returned.
    def splitDataSet(dataSet, axis, value):
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                reducedFeatVec = featVec[:axis]  # drop the column used for splitting
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
        return retDataSet
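
To see what the split returns, here is a small sketch using an equivalent list-comprehension version. `split_rows` and the sample rows are hypothetical, chosen only for illustration.

```python
def split_rows(data_set, axis, value):
    # Keep rows whose feature at index `axis` equals `value`,
    # and drop that feature column from each kept row.
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]

# Hypothetical toy data: two binary features followed by a class label.
sample = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(split_rows(sample, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_rows(sample, 0, 0))  # [[1, 'no'], [1, 'no']]
```

Neither version modifies the input rows: slicing creates new lists, so the original data set can be split again on a different feature.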

### Choosing the best way to split the data

    def chooseBestFeatureToSplit(dataSet):
        numFeatures = len(dataSet[0]) - 1  # the last column holds the class labels
        baseEntropy = calcShannonEnt(dataSet)
        bestInfoGain = 0.0
        bestFeature = -1  # stays -1 if no split reduces the entropy
        for i in range(numFeatures):  # compute the information gain of each feature in turn
            featList = [example[i] for example in dataSet]  # all values of this feature
            uniqueVals = set(featList)  # unique values of this feature
            newEntropy = 0.0
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)
            infoGain = baseEntropy - newEntropy  # information gain, i.e. the reduction in entropy
            if infoGain > bestInfoGain:  # keep the best gain seen so far
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature
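
The selection step can be checked by hand on a tiny example. The sketch below computes the information gain of each feature on made-up data; `entropy`, `info_gain`, and the sample rows are hypothetical and only illustrate the calculation.

```python
from collections import Counter
from math import log2

def entropy(rows):
    # Shannon entropy of the class labels stored in the last column.
    counts = Counter(row[-1] for row in rows)
    return -sum((n / len(rows)) * log2(n / len(rows)) for n in counts.values())

def info_gain(rows, axis):
    # Entropy before the split minus the size-weighted entropy of each subset.
    # The split column is not removed here because entropy() only reads labels.
    remainder = 0.0
    for value in {row[axis] for row in rows}:
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

# Hypothetical toy data: two binary features followed by a class label.
sample = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(sample, i) for i in range(len(sample[0]) - 1)]
print([round(g, 3) for g in gains])  # [0.42, 0.171]
print(gains.index(max(gains)))       # 0, so feature 0 would be chosen
```

Splitting on feature 0 leaves one pure subset (two 'no' rows), which is why its gain is larger than that of feature 1.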

### Building the decision tree

    def createTree(dataSet, labels):  # the data set and the list of feature names
        classList = [example[-1] for example in dataSet]  # class labels from the last column of dataSet
        if classList.count(classList[0]) == len(classList):
            return classList[0]  # stop splitting when all classes are equal
        if len(dataSet[0]) == 1:  # stop splitting when no features are left
            return majorityCnt(classList)
        # The two checks above are the stopping conditions of the recursion.

        bestFeat = chooseBestFeatureToSplit(dataSet)
        bestFeatLabel = labels[bestFeat]  # name of the best feature to split on
        myTree = {bestFeatLabel: {}}  # initialize myTree
        copylabels = labels.copy()
        del copylabels[bestFeat]  # remove the feature name that has just been used
        featValues = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featValues)  # set() removes duplicates and returns an unordered collection of unique values
        for value in uniqueVals:
            subLabels = copylabels[:]  # copy the labels so recursive calls don't modify a shared list
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
        return myTree
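
`createTree` calls `majorityCnt`, which is not defined in this post. Below is a minimal sketch of what it presumably does, assuming it returns the most frequent class label in `classList`. The `collections.Counter` implementation is an assumption made here, not the author's code.

```python
from collections import Counter

def majorityCnt(classList):
    # Assumed behaviour: return the class label that occurs most often.
    return Counter(classList).most_common(1)[0][0]

print(majorityCnt(['yes', 'no', 'no']))  # no
```

This majority vote is only reached when every feature has been used up and the remaining rows still disagree on the class label.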

## PS

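The book *Machine Learning in Action* uses Python 2, so a few things differ slightly when running its code under Python 3. For example, this Python 2 line

    firstStr = myTree.keys()[0]  # get the first key of the tree

has to be written as follows in Python 3, because `dict.keys()` now returns a view that does not support indexing:

    firstStr = list(myTree.keys())[0]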
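
A short sketch of the Python 3 form in use. The tree below is a hypothetical example in the nested-dict shape that `createTree` returns; its feature names are made up for illustration.

```python
# Hypothetical tree in the nested-dict form that createTree returns.
myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

# dict.keys() returns a view in Python 3, so convert it to a list before indexing.
firstStr = list(myTree.keys())[0]

# next(iter(...)) retrieves the same key without building a list.
assert firstStr == next(iter(myTree))
print(firstStr)  # no surfacing
```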
