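Decision Trees
The central question in the ID3 algorithm is which attribute to test at each node of the tree. We want the attribute that is most useful for classifying the examples. What, then, is a good quantitative measure of an attribute's worth? We define a statistical property called "information gain", which measures how well a given attribute separates the training examples. At every step of growing the tree, ID3 uses information gain to choose among the candidate attributes.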
Entropy
Entropy is the expected information content over all possible outcomes. The information content of an outcome $x$ is $-\log_2 p(x)$, so

$$H(X) = E[-\log_2 p(x)] = -\sum_x p(x)\log_2 p(x)$$
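As a quick numeric check of the entropy definition, here is a minimal sketch that evaluates it for a few small distributions. The function name `entropy` and the example distributions are illustrative assumptions, not part of the original notes.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    # Outcomes with p == 0 are skipped: the limit of p * log2(p) as p -> 0 is 0.
    return sum(-p * log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])   # two equally likely outcomes: 1 bit
certain = entropy([1.0, 0.0])     # one outcome is certain: 0 bits
four_way = entropy([0.25] * 4)    # uniform over four outcomes: 2 bits
```

Entropy is largest when the outcomes are equally likely and drops to zero when one outcome is certain, which is why it works as a measure of how mixed a set of class labels is.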
Information Gain
$$Gain(S,A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$
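ID3 builds the decision tree by computing information gain: the larger the gain, the better the attribute is as a split. In essence, each split partitions the feature space into regions so that the samples in each region belong, as far as possible, to the same class.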
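To make the gain formula concrete, here is a small sketch that computes it on a four-row toy dataset. The dataset, the column meanings, and the function names `entropy` and `information_gain` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Gain(S, A) for rows whose last element is the class label."""
    base = entropy([row[-1] for row in rows])
    # Group the class labels by the value of the chosen attribute (the S_v subsets).
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return base - remainder

# Hypothetical toy data: [outlook, windy, play]
rows = [
    ["sunny", "no", "yes"],
    ["sunny", "yes", "yes"],
    ["rainy", "no", "no"],
    ["rainy", "yes", "no"],
]
gain_outlook = information_gain(rows, 0)  # outlook separates the classes perfectly
gain_windy = information_gain(rows, 1)    # windy says nothing about the class
```

Here `gain_outlook` equals the full entropy of the labels (1 bit) because both subsets are pure, while `gain_windy` is 0 because each subset is as mixed as the original set. ID3 would therefore split on the first attribute.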
Gini Index
$$GINI(t) = 1 - \sum_j [p(j|t)]^2$$
where $p(j|t)$ is the relative frequency of class $j$ at node $t$.
Gini Split
$$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)$$
where $n_i$ is the number of samples in child node $i$ and $n$ is the number of samples at the parent node.
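The two Gini formulas can be sketched in a few lines. The function names `gini` and `gini_split` and the label lists are assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2 for the class labels at one node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(children):
    """Weighted Gini of a split; children is a list of label lists."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * gini(child) for child in children)

parent = ["a", "a", "a", "b", "b", "b"]
parent_gini = gini(parent)                                    # 0.5 for a 50/50 node
pure_split = gini_split([["a", "a", "a"], ["b", "b", "b"]])   # both children pure
mixed_split = gini_split([["a", "a", "b"], ["a", "b", "b"]])  # children still mixed
```

A lower `gini_split` value means a better split: the pure split scores 0, while the mixed one scores 4/9, only slightly below the parent's 0.5.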
Misclassification Error
$$Error(t) = 1 - \max_i P(i|t)$$
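For comparison, the sketch below evaluates the misclassification error next to entropy and the Gini index on a few two-class nodes. The function names and the chosen distributions are illustrative assumptions.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes are skipped."""
    return sum(-p * log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index of a class distribution."""
    return 1.0 - sum(p * p for p in probs)

def misclassification_error(probs):
    """Error(t) = 1 - max_i P(i|t) for the class distribution at a node."""
    return 1.0 - max(probs)

# Two-class nodes, from perfectly mixed to perfectly pure.
for probs in ([0.5, 0.5], [0.7, 0.3], [0.9, 0.1], [1.0, 0.0]):
    print(probs,
          round(entropy(probs), 3),
          round(gini(probs), 3),
          round(misclassification_error(probs), 3))
```

All three measures peak at the 50/50 node and reach zero at the pure node. They differ in how quickly they fall in between.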
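Training and Test Error
The plot shows that as the number of nodes in the tree increases, the error on the training data keeps decreasing, while the error on the test data first decreases and then increases. This is the characteristic sign of overfitting.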
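Hands-on project (predicting a patient's contact lens type)
“The algorithm used here is ID3. It is a good algorithm, but not a perfect one. ID3 cannot handle numeric data directly. Numeric values can be quantized into nominal values, but ID3 still runs into other problems when there are too many feature splits.”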
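One simple way to quantize a numeric column into nominal values before handing it to ID3 is fixed-threshold binning. The sketch below is a minimal illustration of that idea; the function `discretize`, the "age" column, the cut points, and the bin names are all hypothetical and are not taken from the contact lens dataset.

```python
from bisect import bisect_right

def discretize(value, cut_points, names):
    """Map a numeric value to a nominal bin; cut_points must be sorted ascending."""
    if len(names) != len(cut_points) + 1:
        raise ValueError("need exactly one more bin name than cut points")
    # bisect_right counts how many cut points are <= value, which is the bin index.
    return names[bisect_right(cut_points, value)]

# Hypothetical thresholds for an "age" column; the cut points are illustrative only.
AGE_CUTS = [30, 50]
AGE_NAMES = ["young", "middle", "senior"]

ages = [12, 30, 49.5, 77]
age_bins = [discretize(a, AGE_CUTS, AGE_NAMES) for a in ages]
# age_bins is ["young", "middle", "middle", "senior"]
```

Using more bins preserves more of the numeric information but also produces more branches per split, which is where the problem with too many feature splits comes in.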
Core code
Computing the Shannon entropy of a dataset
from math import log

def calcShannonEnt(dataSet):
    """Return the Shannon entropy (in bits) of the class labels in dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each class label
        currentLabel = featVec[-1]  # the label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
Choosing the best feature to split the dataset
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    # Start below any possible gain so that a valid feature index is returned
    # even when no feature has a strictly positive information gain.
    bestInfoGain = -1.0
    bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # the info gain, i.e. reduction in entropy
        if infoGain > bestInfoGain:  # compare this to the best gain so far
            bestInfoGain = infoGain  # if better than current best, set to best
            bestFeature = i
    return bestFeature  # returns an integer
Building the decision tree recursively
from collections import Counter

def majorityCnt(classList):
    # Helper not shown in the original; assumed here to return the most common label.
    return Counter(classList).most_common(1)[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # Build the reduced label list without mutating the caller's list, so the
    # same labels can still be passed to classify() afterwards.
    subLabels = labels[:bestFeat] + labels[bestFeat + 1:]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
Classifying with the decision tree
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel
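The tree used for classification is a nested dict of the form {feature: {value: subtree_or_label}}. The sketch below walks such a tree iteratively and, unlike a plain dictionary lookup, returns a default when the test vector contains a value the tree has no branch for. The function `classify_or_default`, the hand-written `toy_tree`, and its feature names are hypothetical illustrations, not the output of training on the contact lens data.

```python
def classify_or_default(tree, feat_labels, test_vec, default=None):
    """Walk a nested-dict decision tree; return default for unseen feature values."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))   # the single key is the feature tested at this node
        branches = node[feature]
        value = test_vec[feat_labels.index(feature)]
        if value not in branches:    # no branch was built for this value
            return default
        node = branches[value]
    return node                      # a leaf holds the class label

# Hypothetical hand-written tree in the same nested-dict layout.
toy_labels = ["tearRate", "astigmatic"]
toy_tree = {
    "tearRate": {
        "reduced": "no lenses",
        "normal": {"astigmatic": {"yes": "hard", "no": "soft"}},
    }
}

soft = classify_or_default(toy_tree, toy_labels, ["normal", "no"])
none_needed = classify_or_default(toy_tree, toy_labels, ["reduced", "yes"])
unseen = classify_or_default(toy_tree, toy_labels, ["unknown", "yes"], default="unknown")
```

Returning a default is only one possible policy for unseen values. Falling back to the majority class of the training data at that node is another common choice.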