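Preface
This section shows how to build a decision tree from raw data.
A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class.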
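To make the definition concrete, here is a minimal sketch of one common way to hold such a tree in Python: nested dictionaries, where a dictionary is an internal node and a plain string is a leaf. The tree below is written by hand for illustration only. It borrows two feature names from the lens data used later in this section, but it is deliberately simplified, was not learned from that data, and does not reproduce every row. The names `toy_tree` and `classify` are hypothetical.

```python
# A hand-written toy tree (illustrative only, not learned from data).
# Internal node: {feature_name: {feature_value: subtree_or_leaf}}
# Leaf: a plain string naming the class.
toy_tree = {
    "tearRate": {
        "reduced": "no lenses",
        "normal": {
            "astigmatic": {
                "yes": "hard",
                "no": "soft",
            }
        },
    }
}


def classify(tree, sample):
    """Walk the tree from the root until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))        # the attribute tested at this node
        branches = tree[feature]          # one branch per test outcome
        tree = branches[sample[feature]]  # follow the branch matching the sample
    return tree


sample = {"tearRate": "normal", "astigmatic": "no"}
print(classify(toy_tree, sample))  # prints "soft"
```

Each loop iteration performs one attribute test, so classifying a sample costs at most one dictionary lookup per level of the tree.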
Predicting contact lens type with a decision tree
We start from a set of raw data (lenses.txt):
young myope no reduced no lenses
young myope no normal soft
young myope yes reduced no lenses
young myope yes normal hard
young hyper no reduced no lenses
young hyper no normal soft
young hyper yes reduced no lenses
young hyper yes normal hard
pre myope no reduced no lenses
pre myope no normal soft
pre myope yes reduced no lenses
pre myope yes normal hard
pre hyper no reduced no lenses
pre hyper no normal soft
pre hyper yes reduced no lenses
pre hyper yes normal no lenses
presbyopic myope no reduced no lenses
presbyopic myope no normal no lenses
presbyopic myope yes reduced no lenses
presbyopic myope yes normal hard
presbyopic hyper no reduced no lenses
presbyopic hyper no normal soft
presbyopic hyper yes reduced no lenses
presbyopic hyper yes normal no lenses
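* There are four features (the first four columns): age, prescript (the prescription type, myope or hyper), astigmatic (whether the patient has astigmatism), and tearRate (tear production rate).
* There are three contact lens classes (the last column): hard, soft, and no lenses (not suited to wearing contact lenses).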
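As a rough sketch of how these columns could be read into the list-of-lists form that the code below works on, the snippet parses a few records and counts the classes. It assumes whitespace-separated fields exactly as printed above; because the label "no lenses" contains a space, everything after the fourth field is joined back into one label. If the actual file is tab-delimited, `line.strip().split("\t")` would be enough. The names `FEATURE_NAMES` and `parse_line` are hypothetical and not part of trees.py.

```python
from collections import Counter

# Feature names as listed above; the class label is the last column.
FEATURE_NAMES = ["age", "prescript", "astigmatic", "tearRate"]


def parse_line(line):
    """Split one record into its four feature values plus a class label.

    Assumes whitespace-separated fields. The label may contain a space
    ("no lenses"), so the fields after the fourth are joined back together.
    """
    fields = line.split()
    n = len(FEATURE_NAMES)
    return fields[:n] + [" ".join(fields[n:])]


# Three records copied from the listing above.
sample_lines = [
    "young myope no reduced no lenses",
    "young myope no normal soft",
    "young myope yes normal hard",
]
records = [parse_line(line) for line in sample_lines]

# Tally how often each class label occurs.
class_counts = Counter(record[-1] for record in records)

print(records[0])
print(class_counts)
```

Keeping the label as the last element of each record matches what `calcShannonEnt` expects, since it reads the class from `featVec[-1]`.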
Code for building the decision tree (trees.py)
# coding:utf-8
from math import log
import operator
import treePlotter

"""
Compute the Shannon entropy of a data set
"""
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # Count how many times each class label occurs
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    # prob is the probability of each class; the logarithm is taken base 2
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
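"""
Split the data set on a given feature
Inputs: dataSet - the data set to split
        axis - index of the feature to split on
        value - the feature value that returned rows must match
"""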
def splitDataSet(dataSet, axis, value):