1. Information theory basics (entropy, joint entropy, conditional entropy, information gain, Gini impurity)
2. The different decision tree classification algorithms (ID3, C4.5, CART): principles and application scenarios
3. Regression tree principles
4. Techniques for preventing overfitting in decision trees
5. Model evaluation
6. sklearn usage example
Appendix: code
(If you spot any mistakes, corrections are appreciated!)
1. Information theory basics (entropy, joint entropy, conditional entropy, information gain, Gini impurity)
- Information entropy: solves the problem of measuring information, i.e. it quantifies information. For a dataset, the information entropy measures the purity of the sample set.
- Joint entropy and conditional entropy: both derived from information entropy. The former is the entropy of the joint random variables $(X, Y)$; the latter is the remaining uncertainty in $Y$ once $X$ is known.
- Information gain: the information entropy of the sample set $D$ minus the weighted sum of the entropies of the branches produced by splitting on attribute $a$.
- Gain ratio: the information gain divided by $IV(a)$, which is called the "intrinsic value" of attribute $a$.
- Gini index: the probability that two samples drawn at random from the dataset carry different class labels. (A small code sketch of these quantities follows this list.)
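To make these quantities concrete, here is a minimal NumPy sketch (the helper names and the toy arrays are mine, chosen for illustration) that computes entropy, conditional entropy, information gain, and Gini impurity for a small labeled sample:

```python
import numpy as np

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), from empirical class frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: chance two random samples differ in label."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def conditional_entropy(feature, labels):
    """Ent(Y|X): entropy of the label within each feature value, weighted by size."""
    total = len(labels)
    h = 0.0
    for v in np.unique(feature):
        mask = feature == v
        h += mask.sum() / total * entropy(labels[mask])
    return h

# Toy example: one binary feature and binary labels.
x = np.array([0, 0, 1, 1, 1, 0])
y = np.array(['N', 'N', 'Y', 'Y', 'N', 'N'])
print(entropy(y))                              # Ent(D)
print(conditional_entropy(x, y))               # Ent(D | x)
print(entropy(y) - conditional_entropy(x, y))  # information gain Gain(D, x)
print(gini(y))                                 # Gini impurity
```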
2. The different decision tree classification algorithms (ID3, C4.5, CART): principles and application scenarios
- ID3: selects the optimal splitting attribute by information gain.
  Suppose the proportion of class-$k$ samples in set $D$ is $p_k$ $(k=1,2,...,|y|)$. The information entropy of $D$ is defined as $$Ent(D)=-\sum_{k=1}^{|y|}p_k\log_2 p_k$$ The smaller $Ent(D)$ is, the higher the purity of $D$. Splitting on attribute $a$ produces $V$ branches, where the $v$-th branch holds the samples of $D$ whose value on attribute $a$ is $a^v$, denoted $D^v$. The information gain is then $$Gain(D,a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$$ where $\frac{|D^v|}{|D|}$ is the weight of each branch node. The attribute with the largest information gain is chosen as the current split. However, this criterion is biased toward attributes with many possible values; for example, splitting on a unique ID attribute would score highly, which is clearly not what we want.
- C4.5: improves on ID3 in several ways; the biggest change is selecting attributes by gain ratio instead of information gain: $$Gain\_ratio(D,a)=\frac{Gain(D,a)}{IV(a)}$$ where $$IV(a)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$ is called the "intrinsic value" of attribute $a$. The more possible values attribute $a$ has, the larger $IV(a)$ usually is.
  Note that the gain ratio criterion is in turn biased toward attributes with fewer possible values, so C4.5 uses a heuristic: first pick the candidate attributes whose information gain is above average, then choose among them the one with the highest gain ratio.
  C4.5 also discretizes continuous values (binary splitting), handles missing values (by assigning probabilities), and includes pruning (to prevent overfitting).
- CART (Classification and Regression Tree): selects the splitting attribute with the "Gini index". The purity of dataset $D$ is measured by $$Gini(D)=\sum_{k=1}^{|y|}\sum_{k'\neq k}p_k p_{k'}=1-\sum_{k=1}^{|y|}p_k^2$$ and the Gini index of attribute $a$ is $$Gini\_index(D,a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$$ The attribute that yields the smallest Gini index after splitting is chosen as the optimal splitting attribute. (The sketch below compares all three criteria on a toy attribute.)
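To see how the three criteria relate in practice, the following self-contained sketch (the helper names are mine; the toy columns mirror the outlook attribute and labels used in the appendix code) scores a single candidate attribute under information gain, gain ratio, and Gini index:

```python
import numpy as np

def class_probs(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def ent(labels):
    p = class_probs(labels)
    return -np.sum(p * np.log2(p))

def score_attribute(feature, labels):
    """Return (information gain, gain ratio, Gini index) for splitting on `feature`."""
    n = len(labels)
    cond_ent, iv, gini_index = 0.0, 0.0, 0.0
    for v in np.unique(feature):
        sub = labels[feature == v]
        w = len(sub) / n
        cond_ent += w * ent(sub)                          # weighted branch entropy
        iv -= w * np.log2(w)                              # intrinsic value IV(a)
        gini_index += w * (1.0 - np.sum(class_probs(sub) ** 2))
    gain = ent(labels) - cond_ent
    gain_ratio = gain / iv if iv > 0 else 0.0             # IV = 0 if a single value
    return gain, gain_ratio, gini_index

outlook = np.array([0, 0, 1, 2, 2, 2, 1])                 # toy attribute column
play    = np.array(['N', 'N', 'Y', 'Y', 'Y', 'N', 'Y'])   # toy class labels
print(score_attribute(outlook, play))
```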
3. Regression tree principles
Rough steps for building a regression tree: divide the space of the predictor variables $(X_1, X_2, ..., X_p)$ into $n$ non-overlapping regions $(R_1, R_2, ..., R_n)$, and take the arithmetic mean of the responses in each region $R_j$ as that region's prediction. Then compute the loss of each candidate partition and take the partition with the smallest loss as the optimal one.
The loss is defined as $$L=\sum_{j=1}^{n}\sum_{i:\,x_i\in R_j}(y_i-c_j)^2$$ where $c_j$ is the mean of region $R_j$.
Repeat the splitting step (recursive binary splitting) to make $L$ as small as possible, and stop splitting once some threshold is reached.
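The heart of recursive binary splitting is the search for the one split that most reduces $L$. A minimal one-dimensional sketch (function and variable names are mine) that scans every candidate threshold might look like this:

```python
import numpy as np

def best_split(x, y):
    """Scan all midpoints between sorted x values; return (threshold, loss)
    for the binary split that minimizes the total squared error."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_t, best_loss = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                      # no threshold between equal values
        t = (xs[i] + xs[i - 1]) / 2.0
        left, right = ys[:i], ys[i:]
        # each region predicts its mean; loss is the summed squared deviation
        loss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 3.9, 4.1, 4.0])
print(best_split(x, y))   # splits near x = 3.5, where the mean jumps
```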
4. Techniques for preventing overfitting in decision trees
- Pruning: the main strategy decision tree learning algorithms use against overfitting. Its two basic flavors are "prepruning" and "postpruning". (A pruning sketch using sklearn follows this list.)
- Prepruning: during tree generation, estimate each node before splitting it; if the current split would not improve the tree's performance, stop splitting and mark the node as a leaf.
  Prepruning leaves many branches unexpanded, which not only lowers the risk of overfitting but also reduces training and testing time. However, it expands branches greedily with no lookahead (at an unexpanded node, later splits might have improved performance significantly), so it also carries a risk of underfitting.
- Postpruning: starting from a fully grown tree, examine the non-leaf nodes bottom-up; if replacing a node's subtree with a leaf improves the tree's performance, make the replacement.
  Postpruned trees usually retain more branches than prepruned ones. In general, the risk of underfitting with postpruning is small, and its generalization performance often beats prepruning. But since postpruning runs only after the complete tree has been built, and must examine the nodes bottom-up one by one, its training time is much larger than for unpruned or prepruned trees.
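For a concrete example, sklearn implements postpruning via minimal cost-complexity pruning; this sketch (dataset and parameter choices are mine, for illustration only) grows a full tree and then prunes it with increasing `ccp_alpha`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grow the full tree and inspect the effective alphas along the pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Refit once per alpha; larger alpha prunes more aggressively (postpruning).
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={clf.get_n_leaves()}  test acc={clf.score(X_te, y_te):.3f}")
```

Prepruning corresponds to the stopping criteria passed at construction time, such as `max_depth` or `min_samples_leaf`.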
5. Model evaluation
Bootstrap: use bootstrap sampling (sampling with replacement) to split the dataset into a training set $D'$ and a test set $D \setminus D'$; then about 36.8% of the samples never appear in the training set. Drawing $b$ such samples produces $b$ bootstrap samples, and the overall accuracy is $$ac_{boot}=\frac{1}{b}\sum_{i=1}^{b}\left(0.632\times\epsilon_i+0.368\times ac_s\right)$$ where $\epsilon_i$ is the accuracy of the model trained on bootstrap sample $i$ on its test set, and $ac_s$ is the accuracy computed on the full sample set.
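The 36.8% figure comes from $(1-\frac{1}{N})^N \approx e^{-1} \approx 0.368$; the quick simulation below (purely illustrative) confirms it empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1000, 200
oob_fractions = []
for _ in range(trials):
    drawn = rng.integers(0, N, size=N)   # one bootstrap sample, with replacement
    oob = N - len(np.unique(drawn))      # samples never drawn = out-of-bag
    oob_fractions.append(oob / N)
print(np.mean(oob_fractions))            # ~0.368, i.e. about 36.8% held out
```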
Interval estimation of accuracy: view the classification outcome as a binomial experiment. Let $X$ be the number of correctly classified samples and $p$ the true accuracy; then $X$ follows a binomial distribution with mean $Np$ and variance $Np(1-p)$, so $ac = X/N$ has mean $p$ and variance $p(1-p)/N$. A confidence interval for $ac$ follows from
$$P\left(-Z_{\frac{\alpha }{2}}\leq \frac{ac-p}{\sqrt{p(1-p)/N}}\leq Z_{1-\frac{\alpha}{2}}\right)=1-\alpha$$
which, solved for $p$, gives
$$p\in\frac{2\times N \times ac +Z_{\frac{\alpha}{2}}^{2}\pm Z_{\frac{\alpha}{2}}\sqrt{Z_{\frac{\alpha}{2}}^{2}+4\times N \times ac-4\times N \times ac^{2}}}{2\left(N+Z_{\frac{\alpha}{2}}^{2}\right)}$$
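Plugging numbers into this interval is straightforward; the short helper below (the example values are made up) evaluates both endpoints for $N$ test samples with observed accuracy $ac$ at 95% confidence ($Z_{\frac{\alpha}{2}} \approx 1.96$):

```python
import math

def accuracy_interval(ac, N, z=1.96):
    """Endpoints of the binomial confidence interval for the true accuracy p,
    using the closed form above (z = 1.96 for a 95% interval)."""
    center = 2 * N * ac + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * N * ac - 4 * N * ac ** 2)
    denom = 2 * (N + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

print(accuracy_interval(ac=0.85, N=200))   # roughly (0.79, 0.89)
```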
6. sklearn usage example
Methods:
classification tree: DecisionTreeClassifier(); regression tree: DecisionTreeRegressor()
Usage example:
from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
dtc = tree.DecisionTreeClassifier()
dtc = dtc.fit(X, Y)
dtc.predict([[0, 1]])
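The list of methods above also names DecisionTreeRegressor, but only the classifier is demonstrated, so here is a matching regression sketch (the tiny dataset is mine); `tree.export_text` also prints the learned rules of either kind of tree:

```python
from sklearn import tree

# Fit a regression tree on a tiny 1-D dataset.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.1, 0.9, 4.0, 4.2]
dtr = tree.DecisionTreeRegressor(max_depth=2)
dtr = dtr.fit(X, y)
print(dtr.predict([[2.5]]))   # predicts the mean of the matched region

# Inspect the learned splits as text.
print(tree.export_text(dtr, feature_names=['x']))
```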
Appendix: code
ID3 code:
import numpy as np
import operator
import treePlotter as treePlotter
def createDataSet():
"""
outlook-> 0: sunny | 1: overcast | 2: rain
temperature-> 0: hot | 1: mild | 2: cool
humidity-> 0: high | 1: normal
windy-> 0: false | 1: true
"""
dataSet = [[0, 0, 0, 0, 'N'],
[0, 0, 0, 1, 'N'],
[1, 0, 0, 0, 'Y'],
[2, 1, 0, 0, 'Y'],
[2, 2, 1, 0, 'Y'],
[2, 2, 1, 1, 'N'],
[1, 2, 1, 1, 'Y']]
labels = ['outlook', 'temperature', 'humidity', 'windy']
return dataSet, labels
def createTestSet():
"""
outlook-> 0: sunny | 1: overcast | 2: rain
temperature-> 0: hot | 1: mild | 2: cool
humidity-> 0: high | 1: normal
windy-> 0: false | 1: true
"""
testSet = [[0, 1, 0, 0],
[0, 2, 1, 0],
[2, 1, 1, 0],
[0, 1, 1, 1],
[1, 1, 0, 1],
[1, 0, 1, 0],
[2, 1, 0, 1]]
return testSet
def calcShannonEnt(dataset):
"""
计算数据集的信息熵
:param dataset:
:return:
"""
classLabel = np.array(dataset)[:,-1]
classCount={}
for key in classLabel:
classCount[key] = classCount.get(key,0)+1
size = len(classLabel)
h = 0
for key,count in classCount.items():
h -= count/size*np.log2(count/size)
return h
# Find the majority class among the samples at a leaf node
def majorityCnt(classList):
classCount={}
for key in classList:
classCount[key]=classCount.get(key,0)+1
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
return sortedClassCount[0][0]
# (alternative: group with pandas, count, and take the max)
def splitDataSet(dataSet,index,value):
"""
根据给定的数据集 以及切分列 以及列所对应的组值
取其子集
:param dataSet:
:param index:
:param value:
:return:
"""
subDataSet = []
for example in dataSet:
newExample = []
if example[index] == value:
newExample = example[0:index]
newExample.extend(example[index+1:])
subDataSet.append(newExample)
return subDataSet
def chooseBestFeatureToSplit(dataSet):
"""
选择最佳属性:找出每个属性的信息熵
找属性的个数 size
for i in range(size):
找i所对应的列的值set value
value=0 根据value划分子数据集
计算
"""
    # number of attributes
featureSize = len(dataSet[0])-1
    # Track the smallest conditional entropy and its attribute index.
    # Start at infinity: with more than two classes the entropy can exceed 1,
    # so a default of 1 could leave index as None.
    minEnt,index = float('inf'),None
    # compute the conditional entropy under each attribute
for i in range(featureSize):
featureList = [example[i] for example in dataSet]
#featureList = np.array(dataSet)[:,i]
featureGroup = set(featureList)
        # conditional entropy for this attribute
featureEnt = 0
for value in featureGroup:
subDataSet = splitDataSet(dataSet,i,value)
featureEnt+=len(subDataSet)/len(dataSet)*calcShannonEnt(subDataSet)
if featureEnt < minEnt:
minEnt = featureEnt
index = i
return index
def createTree(dataSet, labels):
"""
输入:数据集,特征标签
输出:决策树
描述:递归构建决策树,利用上述的函数
"""
classList = [example[-1] for example in dataSet]
"""
['N', 'N', 'Y', 'Y', 'Y', 'N', 'Y']
"""
if classList.count(classList[0]) == len(classList):
        # all classes identical: stop splitting
return classList[0]
if len(dataSet[0]) == 1:
        # all features exhausted: return the majority class
return majorityCnt(classList)
bestFeat = chooseBestFeatureToSplit(dataSet)
bestFeatLabel = labels[bestFeat]
myTree = {bestFeatLabel:{}}
del(labels[bestFeat])
    # collect every value the chosen attribute takes at this node
featValues = [example[bestFeat] for example in dataSet]
uniqueVals = set(featValues)
for value in uniqueVals:
subLabels = labels[:]
myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
return myTree
def classification(tree,labels,testSample):
    # name of the attribute tested at this node
    nodeName = set(tree.keys()).pop()
    # index of that attribute in the feature label list
labelIndex = labels.index(nodeName)
value = testSample[labelIndex]
for k,v in tree[nodeName].items():
if value == k:
if type(v).__name__=="dict":
return classification(v,labels,testSample)
else:
return v
def classificationAll(tree,labels,testSet):
result = []
for testSample in testSet:
result.append(classification(tree,labels,testSample))
return result
if __name__=="__main__":
dataSet,label = createDataSet()
desicionTree = createTree(dataSet,label.copy())
print('desicionTree:\n', desicionTree)
testSet = createTestSet()
result = classificationAll(desicionTree,label,testSet)
print(result)
treePlotter.createPlot(desicionTree)
C4.5 code:
import numpy as np
import operator
import treePlotter as treePlotter
def createDataSet():
"""
outlook-> 0: sunny | 1: overcast | 2: rain
temperature-> 0: hot | 1: mild | 2: cool
humidity-> 0: high | 1: normal
windy-> 0: false | 1: true
"""
dataSet = [[0, 0, 0, 0, 'N'],
[0, 0, 0, 1, 'N'],
[1, 0, 0, 0, 'Y'],
[2, 1, 0, 0, 'Y'],
[2, 2, 1, 0, 'Y'],
[2, 2, 1, 1, 'N'],
[1, 2, 1, 1, 'Y']]
labels = ['outlook', 'temperature', 'humidity', 'windy']
return dataSet, labels
def createTestSet():
"""
outlook-> 0: sunny | 1: overcast | 2: rain
temperature-> 0: hot | 1: mild | 2: cool
humidity-> 0: high | 1: normal
windy-> 0: false | 1: true
"""
testSet = [[0, 1, 0, 0],
[0, 2, 1, 0],
[2, 1, 1, 0],
[0, 1, 1, 1],
[1, 1, 0, 1],
[1, 0, 1, 0],
[2, 1, 0, 1]]
return testSet
def calcShannonEnt(dataset):
"""
计算数据集的信息熵
:param dataset:
:return:
"""
classLabel = np.array(dataset)[:,-1]
classCount={}
for key in classLabel:
classCount[key] = classCount.get(key,0)+1
size = len(classLabel)
h = 0
for key,count in classCount.items():
h -= count/size*np.log2(count/size)
return h
# Find the majority class among the samples at a leaf node
def majorityCnt(classList):
classCount={}
for key in classList:
classCount[key]=classCount.get(key,0)+1
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
return sortedClassCount[0][0]
# (alternative: group with pandas, count, and take the max)
def splitDataSet(dataSet,index,value):
"""
根据给定的数据集 以及切分列 以及列所对应的组值
取其子集
:param dataSet:
:param index:
:param value:
:return:
"""
subDataSet = []
for example in dataSet:
newExample = []
if example[index] == value:
newExample = example[0:index]
newExample.extend(example[index+1:])
subDataSet.append(newExample)
return subDataSet
def chooseBestFeatureToSplit(dataSet):
"""
选择最佳属性:找出每个属性的信息熵
找属性的个数 size
for i in range(size):
找i所对应的列的值set value
value=0 根据value划分子数据集
计算
"""
    # number of attributes
featureSize = len(dataSet[0])-1
gainRatio,index = 0,None
    # information entropy of the whole dataset
baseEnt= calcShannonEnt(dataSet)
    # compute the conditional entropy under each attribute
for i in range(featureSize):
featureList = [example[i] for example in dataSet]
#featureList = np.array(dataSet)[:,i]
featureGroup = set(featureList)
        # conditional entropy for this attribute
featureEnt = 0
splitInfo = 0
for value in featureGroup:
subDataSet = splitDataSet(dataSet,i,value)
featureEnt+=len(subDataSet)/len(dataSet)*calcShannonEnt(subDataSet)
splitInfo-=len(subDataSet)/len(dataSet)*np.log2(len(subDataSet)/len(dataSet))
        if splitInfo == 0:
            # IV(a) = 0 when the attribute takes a single value here;
            # the gain ratio is undefined, so skip this attribute
            continue
        newGainRatio = (baseEnt-featureEnt)/splitInfo
        if gainRatio<newGainRatio:
            gainRatio = newGainRatio
            index = i
return index
def createTree(dataSet, labels):
"""
输入:数据集,特征标签
输出:决策树
描述:递归构建决策树,利用上述的函数
"""
classList = [example[-1] for example in dataSet]
"""
['N', 'N', 'Y', 'Y', 'Y', 'N', 'Y']
"""
if classList.count(classList[0]) == len(classList):
        # all classes identical: stop splitting
return classList[0]
if len(dataSet[0]) == 1:
        # all features exhausted: return the majority class
return majorityCnt(classList)
bestFeat = chooseBestFeatureToSplit(dataSet)
bestFeatLabel = labels[bestFeat]
myTree = {bestFeatLabel:{}}
del(labels[bestFeat])
    # collect every value the chosen attribute takes at this node
featValues = [example[bestFeat] for example in dataSet]
uniqueVals = set(featValues)
for value in uniqueVals:
subLabels = labels[:]
myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
return myTree
def classification(tree,labels,testSample):
    # name of the attribute tested at this node
    nodeName = set(tree.keys()).pop()
    # index of that attribute in the feature label list
labelIndex = labels.index(nodeName)
value = testSample[labelIndex]
for k,v in tree[nodeName].items():
if value == k:
if type(v).__name__=="dict":
return classification(v,labels,testSample)
else:
return v
def classificationAll(tree,labels,testSet):
result = []
for testSample in testSet:
result.append(classification(tree,labels,testSample))
return result
if __name__=="__main__":
dataSet,label = createDataSet()
desicionTree = createTree(dataSet,label.copy())
print('desicionTree:\n', desicionTree)
testSet = createTestSet()
result = classificationAll(desicionTree,label,testSet)
print(result)
treePlotter.createPlot(desicionTree)
Plotting code (file name: treePlotter, imported by the scripts above):
import matplotlib.pyplot as plt
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction', \
xytext=centerPt, textcoords='axes fraction', \
va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)
def getNumLeafs(myTree):
numLeafs = 0
firstStr = list(myTree.keys())[0]
secondDict = myTree[firstStr]
for key in secondDict.keys():
if type(secondDict[key]).__name__ == 'dict':
numLeafs += getNumLeafs(secondDict[key])
else:
numLeafs += 1
return numLeafs
def getTreeDepth(myTree):
maxDepth = 0
firstStr = list(myTree.keys())[0]
secondDict = myTree[firstStr]
for key in secondDict.keys():
if type(secondDict[key]).__name__ == 'dict':
thisDepth = getTreeDepth(secondDict[key]) + 1
else:
thisDepth = 1
if thisDepth > maxDepth:
maxDepth = thisDepth
return maxDepth
def plotMidText(cntrPt, parentPt, txtString):
xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
createPlot.ax1.text(xMid, yMid, txtString)
def plotTree(myTree, parentPt, nodeTxt):
numLeafs = getNumLeafs(myTree)
depth = getTreeDepth(myTree)
firstStr = list(myTree.keys())[0]
cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalw, plotTree.yOff)
plotMidText(cntrPt, parentPt, nodeTxt)
plotNode(firstStr, cntrPt, parentPt, decisionNode)
secondDict = myTree[firstStr]
plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
for key in secondDict.keys():
if type(secondDict[key]).__name__ == 'dict':
plotTree(secondDict[key], cntrPt, str(key))
else:
plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalw
plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD
def createPlot(inTree):
fig = plt.figure(1, facecolor='white')
fig.clf()
axprops = dict(xticks=[], yticks=[])
createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
plotTree.totalw = float(getNumLeafs(inTree))
plotTree.totalD = float(getTreeDepth(inTree))
plotTree.xOff = -0.5 / plotTree.totalw
plotTree.yOff = 1.0
plotTree(inTree, (0.5, 1.0), '')
plt.show()