Machine Learning (4): Decision Trees
4.1 Overview of Decision Trees
A decision tree is a non-parametric supervised learning method: it learns rules from the training data and uses them to predict on new data; informally, it is a set of if-then rules. The goal is to make the samples in each node belong to one class as far as possible. The tree is built by recursively selecting the best feature and splitting the dataset on it, so that every subset ends up as correctly classified as possible.
Commonly used feature-selection criteria and the algorithms built on them:
Information gain: the ID3 algorithm
Information gain ratio: the C4.5 algorithm
Gini index: the CART algorithm
A comparison of the three algorithms:
Model | Task | Continuous values | Missing values
---|---|---|---
ID3 | Classification | Not supported | Not supported
C4.5 | Classification | Supported | Supported
CART | Classification & regression | Supported | Supported
4.1.1 The ID3 Algorithm
ID3 is a classification algorithm that uses information gain as its splitting criterion. It is built on entropy: the lower the entropy, the purer the set, so the feature with the smallest conditional entropy (i.e., the largest information gain) is chosen at each node.
The general procedure is as follows:
Let D be the dataset, |D| the number of samples, and K the number of classes, the k-th class being C_k. Suppose feature A takes j distinct values {a_1, ..., a_j}.
A then partitions D into j subsets D_1, ..., D_j.
D_ik denotes the samples of subset D_i that belong to class C_k.
Given
$p_{k}=\frac{\left | C_{k} \right |}{\left | D \right |},\qquad p_{i}=\frac{\left | D_{i} \right |}{\left | D \right |}$
① The total information entropy is
$$Entropy(D) = -\sum_{k=1}^{K} p_{k}\log_{2} p_{k} = -\sum_{k=1}^{K} \frac{\left | C_{k} \right |}{\left | D \right |}\log_{2}\frac{\left | C_{k} \right |}{\left | D \right |}$$
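For example, the watermelon dataset used in Section 4.2 contains 8 good and 9 bad melons, so $Entropy(D) = -\frac{8}{17}\log_{2}\frac{8}{17} - \frac{9}{17}\log_{2}\frac{9}{17} \approx 0.998$.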
② The empirical conditional entropy given feature A is
$$Entropy(D|A) = \sum_{i=1}^{j} p_{i}\,Entropy(D_{i}) = -\sum_{i=1}^{j}\frac{\left | D_{i} \right |}{\left | D \right |}\sum_{k=1}^{K}\frac{\left | D_{ik} \right |}{\left | D_{i} \right |}\log_{2}\frac{\left | D_{ik} \right |}{\left | D_{i} \right |}$$
③ The information gain of feature A is
$$Gain(D,A) = Entropy(D) - Entropy(D|A)$$
④ Apply ①-③ to compute the information gain of every candidate feature and split on the one with the largest value.
⑤ Repeat ①-④ until every leaf node is pure; the decision tree is then complete. A toy sketch of these steps follows.
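As a minimal sketch of steps ①-③ (a toy example with hypothetical helper names, separate from the watermelon implementation in Section 4.2):

import math
from collections import Counter

def entropy(labels):
    # step ①: Shannon entropy of a list of class labels
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    # step ②: conditional entropy of a discrete feature; step ③: the gain
    n = len(labels)
    cond = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        cond += len(subset) / n * entropy(subset)  # weighted child entropy
    return entropy(labels) - cond

rows = [(0,), (0,), (1,), (1,)]    # one binary feature
labels = [1, 1, 0, 1]
print(info_gain(rows, labels, 0))  # ~0.311, the gain of splitting on feature 0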
4.1.2 The C4.5 Algorithm
C4.5 uses the information gain ratio as its criterion. Building on ID3, it can handle continuous attributes as well as datasets with missing values.
Two extra steps are appended to steps ①-③ of ID3:
$$Split(D) = -\sum_{i=1}^{j} p_{i}\log_{2} p_{i} = -\sum_{i=1}^{j}\frac{\left | D_{i} \right |}{\left | D \right |}\log_{2}\frac{\left | D_{i} \right |}{\left | D \right |}$$
$$GainRate(A) = \frac{Gain(D,A)}{Split(D)}$$
The feature with the largest information gain ratio is chosen for the split.
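Continuing the toy sketch from 4.1.1 (reusing its entropy and info_gain helpers; the names are my own, not from the original code):

def split_info(rows, feature):
    # Split(D): entropy of the partition induced by the feature's own values
    n = len(rows)
    counts = Counter(row[feature] for row in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(rows, labels, feature):
    # C4.5's criterion: information gain normalised by the split information
    si = split_info(rows, feature)
    return info_gain(rows, labels, feature) / si if si > 0 else 0.0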
4.1.3 The CART Algorithm
CART is a classification-and-regression algorithm based on the Gini index; the split with the smallest Gini index is chosen.
① Compute the Gini index of the whole dataset:
$$Gini(D) = 1 - \sum_{k=1}^{K}\left(\frac{\left | C_{k} \right |}{\left | D \right |}\right)^{2}$$
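With the same 8 good / 9 bad melons as before, $Gini(D) = 1 - \left(\frac{8}{17}\right)^{2} - \left(\frac{9}{17}\right)^{2} \approx 0.498$.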
② Compute the Gini index of each candidate split of each feature (a split divides D into two parts D_1 and D_2):
$$Gini(D,A) = \frac{\left | D_{1} \right |}{\left | D \right |}Gini(D_{1}) + \frac{\left | D_{2} \right |}{\left | D \right |}Gini(D_{2})$$
③ Choose the split with the smallest Gini index.
Continuous attributes:
Discretize by sorting the values; m values give m-1 candidate split points, each dividing the data into D_1 and D_2. Compute the Gini index of every split and keep the smallest, as in the sketch below.
Discrete (categorical) attributes:
Put one value into D_1 and the remaining values into D_2, compute the Gini index of each such binary split, and keep the smallest.
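A small sketch of the continuous case (hypothetical helper names; it assumes at least two distinct values):

from collections import Counter

def gini(labels):
    # Gini index of a list of class labels
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, threshold):
    # weighted Gini of the binary split: values <= threshold vs. > threshold
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    # try the m-1 midpoints of adjacent sorted values, keep the smallest Gini
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: gini_split(values, labels, t))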
4.2 Implementation
4.2.1 Dataset
The dataset is the watermelon dataset from Exercise 4.3 of Zhou Zhihua's Machine Learning.
4.2.2 Code
4.2.2.1 Building the decision tree
Compute the information gain of each feature and pick the best one to grow the tree.
Entropy: computes the Shannon entropy.
Entropy_Gain: computes the information gain of a feature. It first splits the samples into groups by the feature's value, then computes the entropy of each group, and finally the gain.
select_best_feature: iterates over all candidate features, computes each one's information gain, and returns the feature with the largest gain.
import math

# Watermelon dataset (Zhou Zhihua, Machine Learning, Exercise 4.3):
# columns 0-5 are the discrete attributes (color, root, knock sound,
# texture, navel, touch), column 6 is the density, column 7 the label
# (1 = good melon, 0 = bad melon)
data = [[0, 0, 0, 0, 0, 0, 0.697, 1],
[1, 0, 1, 0, 0, 0, 0.774, 1],
[1, 0, 0, 0, 0, 0, 0.634, 1],
[0, 0, 1, 0, 0, 0, 0.608, 1],
[2, 0, 0, 0, 0, 0, 0.556, 1],
[0, 1, 0, 0, 1, 1, 0.403, 1],
[1, 1, 0, 1, 1, 1, 0.481, 1],
[1, 1, 0, 0, 1, 0, 0.437, 1],
[1, 1, 1, 1, 1, 0, 0.666, 0],
[0, 2, 2, 0, 2, 1, 0.243, 0],
[2, 2, 2, 2, 2, 0, 0.245, 0],
[2, 0, 0, 2, 2, 1, 0.343, 0],
[0, 1, 0, 1, 0, 0, 0.639, 0],
[2, 1, 1, 1, 0, 0, 0.657, 0],
[1, 1, 0, 0, 1, 1, 0.360, 0],
[2, 0, 0, 2, 2, 0, 0.593, 0],
[0, 0, 1, 1, 1, 0, 0.719, 0]]
# 16 candidate split points for density: the midpoints of adjacent sorted values
divide_point = [0.244, 0.294, 0.351, 0.381, 0.420, 0.459, 0.518, 0.574, 0.600, 0.621, 0.636, 0.648, 0.661, 0.681, 0.708,
0.746]
def Entropy(melons):
    # Shannon entropy of a melon set (labels are in column 7)
    melons_num = len(melons)
    pos_num = 0
    for i in range(melons_num):
        if melons[i][7] == 1:
            pos_num = pos_num + 1
    nag_num = melons_num - pos_num
    p_pos = pos_num / melons_num
    p_nag = nag_num / melons_num
    # the tree-building code only calls this on mixed-class sets,
    # so neither probability is zero here
    entropy = -(p_pos * math.log(p_pos, 2) + p_nag * math.log(p_nag, 2))
    return entropy
def Entropy_Gain(melons, charac):
    # Information gain of feature `charac`:
    #   0-4: three-valued discrete attributes; 5: the binary touch attribute;
    #   >= 6: a binary split of density at divide_point[charac - 6]
    charac_entropy = 0
    entropy_gain = 0
    melons_num = len(melons)
    if charac >= 6:
        # count good/bad melons on each side of the density split point
        class0_small_num = 0
        class0_big_num = 0
        class1_small_num = 0
        class1_big_num = 0
for i in range(melons_num):
if melons[i][7] == 1:
if melons[i][6] > divide_point[charac - 6]:
class1_big_num = class1_big_num + 1
else:
class1_small_num = class1_small_num + 1
else:
if melons[i][6] > divide_point[charac - 6]:
class0_big_num = class0_big_num + 1
else:
class0_small_num = class0_small_num + 1
if class0_small_num == 0 and class1_small_num == 0:
p0_small = 0
p1_small = 0
else:
p0_small = class0_small_num / (class0_small_num + class1_small_num)
p1_small = class1_small_num / (class0_small_num + class1_small_num)
if class0_big_num == 0 and class1_big_num == 0:
p0_big = 0
p1_big = 0
else:
p0_big = class0_big_num / (class0_big_num + class1_big_num)
p1_big = class1_big_num / (class0_big_num + class1_big_num)
        # entropy of the "small" side, weighted by its share (log(0) guarded)
        if p0_small != 0 and p1_small != 0:
entropy_small = -(class0_small_num + class1_small_num) / melons_num * (
-(p0_small * math.log(p0_small, 2)
+ p1_small * math.log(p1_small, 2)))
elif p0_small == 0 and p1_small != 0:
entropy_small = -(class0_small_num + class1_small_num) / melons_num * (
-p1_small * math.log(p1_small, 2))
elif p0_small != 0 and p1_small == 0:
entropy_small = -(class0_small_num + class1_small_num) / melons_num * (
-p0_small * math.log(p0_small, 2))
else:
entropy_small = 0
        # entropy of the "big" side, weighted by its share
        if p0_big != 0 and p1_big != 0:
entropy_big = -(class0_big_num + class1_big_num) / melons_num * (
-(p0_big * math.log(p0_big, 2) + p1_big *
math.log(p1_big, 2)))
elif p0_big == 0 and p1_big != 0:
entropy_big = -(class0_big_num + class1_big_num) / melons_num * (
-p1_big * math.log(p1_big, 2))
elif p0_big != 0 and p1_big == 0:
entropy_big = -(class0_big_num + class1_big_num) / melons_num * (
-p0_big * math.log(p0_big, 2))
else:
entropy_big = 0
        # gain = H(D) - weighted side entropies (the minus signs are already inside)
        entropy_gain = Entropy(melons) + entropy_small + entropy_big
    elif charac == 5:  # the binary touch attribute
class0_melons = []
class1_melons = []
class_melons = [[], []]
for i in range(melons_num):
if melons[i][5] == 0:
class0_melons.append(melons[i][7])
else:
class1_melons.append(melons[i][7])
class_melons[0] = class0_melons
class_melons[1] = class1_melons
        for i in range(2):
            class0_num = 0
            class1_num = 0
            total_num = len(class_melons[i])
            if total_num == 0:  # no melon takes this value: skip it
                continue
            for j in range(total_num):
                if class_melons[i][j] == 0:
                    class0_num = class0_num + 1
                else:
                    class1_num = class1_num + 1
            p_class0 = class0_num / total_num
            p_class1 = class1_num / total_num
            if p_class0 != 0 and p_class1 != 0:  # guard against log(0)
entropy_class = -p_class0 * math.log(p_class0, 2) - p_class1 * math.log(p_class1, 2)
elif p_class0 == 0 and p_class1 != 0:
entropy_class = - p_class1 * math.log(p_class1, 2)
else:
entropy_class = -p_class0 * math.log(p_class0, 2)
charac_entropy = charac_entropy - total_num / melons_num * entropy_class
entropy_gain = Entropy(melons) + charac_entropy
    else:  # the three-valued discrete attributes 0-4
class0_melons = []
class1_melons = []
class2_melons = []
class_melons = [[], [], []]
for i in range(melons_num):
if melons[i][charac] == 0:
class0_melons.append(melons[i][7])
elif melons[i][charac] == 1:
class1_melons.append(melons[i][7])
else:
class2_melons.append(melons[i][7])
class_melons[0] = class0_melons
class_melons[1] = class1_melons
class_melons[2] = class2_melons
for i in range(3):
class0_num = 0
class1_num = 0
total_num = len(class_melons[i])
if total_num != 0:
for j in range(total_num):
if class_melons[i][j] == 0:
class0_num = class0_num + 1
else:
class1_num = class1_num + 1
p_class0 = class0_num / total_num
p_class1 = class1_num / total_num
                if p_class0 != 0 and p_class1 != 0:  # guard against log(0)
entropy_class = -p_class0 * math.log(p_class0, 2) - p_class1 * math.log(p_class1, 2)
elif p_class0 == 0 and p_class1 != 0:
entropy_class = - p_class1 * math.log(p_class1, 2)
else:
entropy_class = -p_class0 * math.log(p_class0, 2)
                charac_entropy = charac_entropy - total_num / melons_num * entropy_class
        # empty value-subsets contribute nothing to the conditional entropy
        entropy_gain = Entropy(melons) + charac_entropy
return [entropy_gain, charac]
def select_best_feature(melons, features):
    # scan every candidate feature and keep the [gain, feature] pair
    # with the largest information gain
    max_entropy = Entropy_Gain(melons, features[0])
    for i in range(1, len(features)):
        entropy = Entropy_Gain(melons, features[i])
        if entropy[0] > max_entropy[0]:
            max_entropy = entropy
    return max_entropy
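A quick way to try it (reusing the data list above; the return value is a [gain, feature] pair):

# best first split among all 22 candidates: 6 discrete features + 16 density split points
print(select_best_feature(data, list(range(22))))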
4.2.2.2 Prediction
Calling tree_generate with the training data data and the feature set A builds a decision tree keyed on features and split points. find_most returns the most frequent class and select_best_feature (from the previous listing) picks the best feature; together they drive the recursive splitting.
from Divide_Select import *
import numpy as np

# Training set `data` and feature set A:
# 0 color, 1 root, 2 knock sound, 3 texture, 4 navel, 5 touch;
# for density, every split point counts as one feature (16 points in all),
# i.e. features 6-21
A = list(range(22))

def find_most(x):
    # most frequent class label in x
    x = np.asarray(x)
    return sorted([(np.sum(x == i), i) for i in np.unique(x)])[-1][-1]
def tree_generate(melons, features):
    melons_y = [i[7] for i in melons]
    # case 1: all samples share one class -> return a leaf with that class
    if len(np.unique(melons_y)) == 1:
        return melons_y[0]
    # case 2: no features left, or the samples agree on every discrete
    # attribute -> return a leaf with the majority class
    same_flag = 1
    for i in range(6):
        if len(np.unique([j[i] for j in melons])) > 1:
            same_flag = 0
    if not features or same_flag == 1:
        return find_most(melons_y)
    [max_entropy, best_feature] = select_best_feature(melons, features)
    node = {best_feature: {}}
    division = list()
    to_divide = list()
    if best_feature < 6:
        # discrete attribute: branch on every value seen in the FULL dataset,
        # so values absent from this subset still get a (majority-class) branch
        division = [i[best_feature] for i in data]
        to_divide = [i[best_feature] for i in melons]
    else:
        # density split point: binary branch on > / <= divide_point
        for j in [i[6] for i in melons]:
            if j > divide_point[best_feature - 6]:
                to_divide.append(1)
            else:
                to_divide.append(0)
        division = [0, 1]
    for i in np.unique(division):
        # rows of `melons` that fall into this branch
        loc = np.where(np.asarray(to_divide) == i)
        if len(loc[0]) == 0:
            # empty branch: fall back to the parent's majority class
            node[best_feature][i] = find_most(melons_y)
        else:
            new_melons = [melons[k] for k in loc[0]]
            if best_feature in features:
                features.remove(best_feature)  # each feature is used at most once
            node[best_feature][i] = tree_generate(new_melons, features)
    return node
print(tree_generate(data, A))
4.2.2.3 Plotting
Call createPlot with the decision tree myTree to draw it. The tree classifies watermelons as good or bad from the given features and split points.
import matplotlib.pyplot as plt

# node and arrow styles for the annotation-based tree drawing
decisionNode = dict(boxstyle="square", pad=0.5, fc="0.8")
leafNode = dict(boxstyle="circle", fc="0.8")
arrow_args = dict(arrowstyle="<-")
plt.rcParams['font.sans-serif'] = ['SimHei']  # a font that can render Chinese labels
def getNumLeafs(myTree):
    # number of leaves = the horizontal extent of the drawing
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]) is dict:
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    # tree depth = the vertical extent of the drawing
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]) is dict:
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        maxDepth = max(maxDepth, thisDepth)
    return maxDepth
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # draw a node box at centerPt with an arrow coming from parentPt
    createPlot.ax1.annotate(str(nodeTxt), xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def plotMidText(cntrPt, parentPt, txtString):
    # label the edge between parent and child at its midpoint
    xMid = (parentPt[0] - cntrPt[0]) / 2 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)
def plotTree(myTree, parentPt, nodeTxt):
    # the width of a subtree (its leaf count) decides how far apart leaves sit
    numLeafs = getNumLeafs(myTree)
    cntrPt = (plotTree.xOff + (1 + numLeafs) / 2 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    firstStr = list(myTree.keys())[0]
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1 / plotTree.totalD  # descend one level
    for key in secondDict.keys():
        if type(secondDict[key]) is dict:
            plotTree(secondDict[key], cntrPt, str(key))  # recurse into the subtree
        else:
            # leaf: advance one slot to the right and draw it
            plotTree.xOff = plotTree.xOff + 1 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1 / plotTree.totalD  # back up one level
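The listing above calls createPlot but never defines it; a minimal driver in the usual Machine Learning in Action style, which the plotTree helpers assume, would be:

def createPlot(inTree):
    # one borderless axes, stored on the function object so that
    # plotNode/plotMidText can reach it
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

createPlot(myTree)  # myTree: the dict returned by tree_generate(data, A)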
4.2.3 Results
4.3 Summary
4.3.1 Advantages
Fast to train and evaluate
Good accuracy
Handles both continuous and categorical fields
Needs no domain knowledge or parameter assumptions
Suitable for high-dimensional data
4.3.2 Disadvantages
Prone to overfitting
Ignores correlations between features
Performs poorly when class sample sizes are imbalanced
Feature selection (with information gain) is biased toward features with many distinct values
4.3.3 Further Thoughts
4.3.3.1 Ensemble learning: random forests
Ensemble learning grew out of the equivalence between strong and weak learnability: combining many classifiers produces a strong model with better generalization. Common ways to fuse the base learners are averaging and voting.
Common algorithms:
Bagging: train several classifiers independently and combine them by voting
Boosting: repeatedly build new models that correct the mistakes of the old ones
Stacking: a layered framework in which one group of learners feeds its predictions to the next before the final prediction is formed
A random forest builds on decision trees, random subspaces, and the Bagging idea: both the samples and the features used to grow each tree are drawn at random.
The basic steps are (see the sketch after this list):
Randomly sample the training set (with replacement)
Randomly select features
Build many decision trees
Let the forest vote
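As an illustration (a minimal scikit-learn sketch on synthetic data, not part of the original implementation):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
# every tree sees a bootstrap sample of the rows and, at each split,
# a random subset of about sqrt(n_features) candidate features
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]))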
Advantages:
Handles both classification and regression, with excellent generalization
Copes well with high-dimensional data
Tolerates missing data and needs no normalization
Offers ways to balance the error when classes are imbalanced
Trains fast, is highly parallel, and is easy to distribute
Disadvantages:
For regression it cannot produce genuinely continuous output and may overfit
Its inner workings are hard to control; one can only try different parameters and random seeds
Ignores correlations between features