Chapter 5: Decision Tree Notes (Statistical Learning Methods, Li Hang)
This post records my own study of Li Hang's little blue book; notes for the first four chapters will be filled in when I have time. It is only a summary of the key points, not a detailed explanation, because at my level I cannot explain things better than the book does. If you want the full treatment, don't be put off by the difficulty or the length: read Chapter 5 of the original book carefully. There are no shortcuts in learning.
Contents
- 1. Handwritten review of the key points
- 2. Brief supplement and written summary
- 3. Implementing a decision tree in Python
- 4. Source code download
1. Handwritten review of the key points
The key points are summarized in handwritten notes. They are not very readable and are mainly for my own review; typesetting the formulas and layout here is too much work right now.
2. Brief supplement and written summary
- Key point summary (the three splitting criteria are written out right after this list)
  - ID3: selects the feature to split on by information gain.
  - C4.5: selects the feature to split on by the information gain ratio.
  - CART (Classification and Regression Tree): a binary tree that handles both classification and regression; selects the splitting attribute by the Gini index.
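For quick reference, the three criteria can be written out. With D the training set, A a candidate feature, K the number of classes, and p_k the proportion of class k, the standard forms from Chapter 5 of the book are:

$$ g(D, A) = H(D) - H(D \mid A) \qquad \text{(information gain, ID3)} $$
$$ g_R(D, A) = \frac{g(D, A)}{H_A(D)} \qquad \text{(information gain ratio, C4.5)} $$
$$ \mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^{2} \qquad \text{(Gini index, CART)} $$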
3. Implementing a decision tree in Python
The code below is adapted from the book Machine Learning in Action, reorganized and annotated in more detail so that it is easier to read.
# _*_ coding: utf-8 _*_
import operator
from math import log


def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    # compute the Shannon entropy: H(D) = -sum(p * log2(p))
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
def createDataSet():
    # toy data set: two binary features and a class label
    dataSet = [
        [1, 1, 'yes'],
        [1, 0, 'no'],
        [1, 1, 'yes'],
        [0, 0, 'no'],
        [1, 0, 'no'],
        [0, 1, 'no']
    ]
    labels = ['is young', 'owns house']
    return dataSet, labels
def splitDataSet(dataSet, axis, value):
    '''
    Split the data set on a given feature.
    :param dataSet: the data set to split
    :param axis: index of the feature to split on
    :param value: keep only samples whose feature equals this value
    :return: the sub data set, with the splitting feature column removed
    '''
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)  # this is H(D)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        # collect the values of feature i
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)  # deduplicate the feature values
        newEntropy = 0.0
        # this loop computes the conditional entropy H(D|A)
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        # information gain: g(D, A) = H(D) - H(D|A)
        infoGain = baseEntropy - newEntropy
        print("information gain of feature %d: %f" % (i, infoGain))
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
def majorityCnt(classList):
    '''
    Majority vote: return the name of the most frequent class.
    :param classList: list of class labels
    :return: the most frequent class label
    '''
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # all samples belong to the same class: stop splitting
    if len(dataSet[0]) == 1:  # all features have been used: return the most frequent class
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del (labels[bestFeat])
    # all values taken by the chosen feature
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)  # deduplicate
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]  # feature tested at the root of this subtree
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    classLabel = None
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
def main():
    dataSet, labels = createDataSet()
    # print(dataSet)
    # shannonEnt = calcShannonEnt(dataSet=dataSet)
    # print(shannonEnt)
    #
    # print(splitDataSet(dataSet, 0, 1))
    # print(splitDataSet(dataSet, 1, 1))
    print(chooseBestFeatureToSplit(dataSet))
    # pass a copy of labels, because createTree deletes entries from the list it receives
    myTree = createTree(dataSet, labels[:])
    print(myTree)
    print(classify(myTree, labels, [1, 1]))


main()
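Running main() on the toy data, the tree splits on 'owns house' first (it has the larger information gain), and classify(myTree, labels, [1, 1]) should come back with 'yes'.

The listing above only implements the ID3 criterion (information gain). As a minimal sketch of the Gini index that CART uses (see section 2), one could add a helper along the following lines; the name calcGini is my own and is not part of the original code, and a real CART tree would additionally need binary splits and pruning:

def calcGini(dataSet):
    # Gini(D) = 1 - sum(p_k^2), where p_k is the share of class k in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        label = featVec[-1]
        labelCounts[label] = labelCounts.get(label, 0) + 1
    gini = 1.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        gini -= prob * prob
    return gini

On the toy data set above (2 'yes', 4 'no') this gives 1 - (2/6)^2 - (4/6)^2, roughly 0.444.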
4. Source code download
Finally, the source code can be downloaded here:
https://download.csdn.net/download/u012324136/10515528