Information gain is the splitting criterion that decision trees use.
1. Entropy (Shannon entropy, the quantity described by Claude Shannon)
First, the formula for entropy. For a random variable X that can take n possible values x_1, …, x_n with probabilities p(x_1), …, p(x_n), its entropy is:

H(X) = −∑ p(x_i) · log₂ p(x_i), summed over i = 1 … n
```python
# -*- coding: utf-8 -*-
from math import log


def calcShannonEnt(dataSet):
    """Return the Shannon entropy of dataSet, a 2-D list whose
    last column holds the class label of each row."""
    numEntries = len(dataSet)
    labelCounts = {}
    # Count how many rows fall into each class.
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # H = -sum(p * log2 p)
    return shannonEnt
```
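As a quick sanity check on the formula, an evenly split two-class set should come out to exactly one bit:

```python
from math import log

# Two classes, two rows each: p = 0.5 for both.
labels = ['yes', 'no', 'yes', 'no']
counts = {}
for lab in labels:
    counts[lab] = counts.get(lab, 0) + 1
ent = -sum(c / len(labels) * log(c / len(labels), 2) for c in counts.values())
print(ent)  # 1.0
```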
Here is the sample dataset:
```python
def createDataSet():
    # Each column corresponds to a name in labels: column 0 is
    # 'no surfacing', column 1 is 'flippers', and the last column
    # (index -1, as in any conventional dataset) is the class label.
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
```
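On this dataset, calcShannonEnt gives about 0.971 bits: two 'yes' rows and three 'no' rows mean p = 0.4 and 0.6. The arithmetic alone, as a standalone check:

```python
from math import log

# The dataset above has 2 'yes' and 3 'no' labels.
probs = [2 / 5, 3 / 5]
entropy = -sum(p * log(p, 2) for p in probs)
print(round(entropy, 4))  # 0.971
```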
2. Information gain
Information gain is the drop in entropy after a split: gain(D, A) = H(D) − ∑ (|D_v| / |D|) · H(D_v), where D_v is the subset of D in which feature A takes value v. The code below should make it concrete. Before computing the gain, we first need a function that performs the split:

```python
def splitDataSet(dataSet, axis, value):
    """Return the rows of dataSet whose feature `axis` equals `value`,
    with that feature column removed (e.g. axis=1, value=1 keeps the
    rows whose second feature is 1)."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]            # columns before axis
            reducedFeatVec.extend(featVec[axis + 1:])  # columns after axis
            retDataSet.append(reducedFeatVec)
    return retDataSet
```
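The two slices are what drop the chosen feature column from each matching row; on a single row this looks like:

```python
featVec = [1, 0, 'no']
axis = 1  # drop the 'flippers' column
reduced = featVec[:axis] + featVec[axis + 1:]
print(reduced)  # [1, 'no']
```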
Information gain determines which split of the dataset is best:

```python
def chooseBestFeatureToSplit(dataSet):
    """Return the index of the feature whose split yields the
    largest information gain."""
    numFeatures = len(dataSet[0]) - 1  # last column is the label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        # Weighted entropy of the subsets produced by splitting on i.
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
```
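Worked by hand on the five-row dataset (a standalone check; `H` is a small helper, not part of the code above, that computes entropy from class counts): splitting on 'no surfacing' gains more than splitting on 'flippers', so chooseBestFeatureToSplit returns 0.

```python
from math import log

def H(counts):
    """Entropy from a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total, 2) for c in counts if c)

base = H([2, 3])  # whole set: 2 'yes', 3 'no'
# Feature 0 ('no surfacing'): value 1 -> {2 yes, 1 no}, value 0 -> {2 no}
gain0 = base - (3 / 5 * H([2, 1]) + 2 / 5 * H([2]))
# Feature 1 ('flippers'): value 1 -> {2 yes, 2 no}, value 0 -> {1 no}
gain1 = base - (4 / 5 * H([2, 2]) + 1 / 5 * H([1]))
print(round(gain0, 3), round(gain1, 3))  # 0.42 0.171
```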
3. Building the tree

```python
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]  # the label column
    # If every row has the same label, this branch is pure: stop.
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Each split consumes one feature; if only the label column is left,
    # no features remain, so fall back to a majority vote.
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so recursion does not mutate labels
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
```
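createTree calls a majorityCnt helper that is not shown above; a minimal sketch of it (a plain majority vote over the remaining labels, assuming the usual count-and-sort approach):

```python
import operator

def majorityCnt(classList):
    """Return the label that occurs most often in classList."""
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

print(majorityCnt(['no', 'no', 'yes']))  # no
```

On the sample dataset, tracing the code by hand gives the tree {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}: rows without 'no surfacing' are all 'no', and among the remaining rows 'flippers' separates 'yes' from 'no'.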