Contents
1. Introduction to Decision Trees
A decision tree is a predictive model: it represents a mapping between object attributes and object values. Each internal node tests an attribute, each outgoing branch corresponds to a possible value of that attribute, and each leaf node holds the value assigned to objects that follow the path from the root to that leaf. A decision tree has a single output; if multiple outputs are needed, separate trees can be built for each. Decision trees are a widely used technique in data mining, applicable both to analyzing data and to making predictions.
2. How to Choose the Splitting Attribute
1 - Information Gain
To understand information gain, we first need to understand information entropy.
I. Overview
① Information entropy
In information theory, entropy is a measure of uncertainty: the higher the entropy, the more information a message can carry; the lower the entropy, the less.
Formula
H = -\sum_{i=1}^{k} p_i \log(p_i)
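As a quick sanity check on the formula, a fair coin (p = 0.5, 0.5) has entropy exactly 1 bit, while a certain outcome has entropy 0 (using log base 2, as in the rest of this article):

```python
import math

def entropy(probabilities):
    # H = -sum(p * log2(p)); terms with p = 0 contribute nothing
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([1.0]))       # certain outcome: 0.0
```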
② Information gain
Information gain is a statistic that measures how well an attribute separates the data samples. The larger the information gain, the simpler the resulting tree. The gain is measured as the reduction in information entropy.
Formula
The information gain g(D, A) of feature A on training set D is defined as the difference between the empirical entropy H(D) of set D and the empirical conditional entropy H(D|A) of D given feature A.
g(D,A) = H(D) - H(D|A)
II. How to Compute the Information Gain
We use the following dataset as an example.
# Column 1: car type (0|1|2); column 2: marital status (0 married | 1 single | 2 divorced);
# column 3: has a loan (0 no | 1 yes); column 4: already owns a car (0 no | 1 yes);
# column 5: buys a car (0 no | 1 yes)
dataset = [
    ["0", "0", "0", "0", "0"],
    ["0", "0", "0", "1", "0"],
    ["1", "1", "0", "0", "1"],
    ["2", "2", "1", "0", "1"],
    ["2", "0", "0", "1", "1"],
    ["1", "1", "1", "0", "1"],
    ["0", "1", "0", "0", "0"],
    ["0", "2", "1", "0", "1"],
    ["1", "2", "1", "1", "0"],
    ["0", "1", "1", "1", "1"],
    ["2", "1", "0", "1", "1"],
    ["2", "2", "1", "1", "1"],
    ["1", "0", "0", "1", "0"]
]
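The test code further below calls a `data_set()` helper that is never shown. A minimal sketch consistent with the table above, treating the last column as the label, might be (the label-splitting convention is an assumption):

```python
def data_set():
    # Rows from the example table; the last column is the label (buys a car: 0 no | 1 yes)
    dataset = [
        ["0", "0", "0", "0", "0"], ["0", "0", "0", "1", "0"],
        ["1", "1", "0", "0", "1"], ["2", "2", "1", "0", "1"],
        ["2", "0", "0", "1", "1"], ["1", "1", "1", "0", "1"],
        ["0", "1", "0", "0", "0"], ["0", "2", "1", "0", "1"],
        ["1", "2", "1", "1", "0"], ["0", "1", "1", "1", "1"],
        ["2", "1", "0", "1", "1"], ["2", "2", "1", "1", "1"],
        ["1", "0", "0", "1", "0"],
    ]
    labels = [row[-1] for row in dataset]
    return dataset, labels
```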
① Compute the empirical entropy
H(D) = -\left(\frac{8}{13}\log_2\frac{8}{13} + \frac{5}{13}\log_2\frac{5}{13}\right) = 0.961
The code:
import collections
import math

# Compute the empirical entropy H(D)
def calc_empirical_entropy(labels):
    labels_count = collections.Counter(labels)  # count each class
    labels_len = len(labels)
    empirical_entropy = 0
    for label in labels_count:
        proportion_label = labels_count[label] / labels_len  # class probability
        empirical_entropy -= proportion_label * math.log2(proportion_label)  # accumulate -p * log2(p)
    return empirical_entropy
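Applied to the thirteen labels of the example dataset (eight positives, five negatives), the function reproduces the hand-computed H(D) = 0.961; a self-contained check:

```python
import collections
import math

def calc_empirical_entropy(labels):
    counts = collections.Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

labels = ["1"] * 8 + ["0"] * 5  # class counts from the example dataset
print(round(calc_empirical_entropy(labels), 3))  # 0.961
```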
② Compute the conditional entropy of each feature
Denote the four features A1, A2, A3, A4.
Take A1 as an example.
H(D|A_1) = -\left[\frac{5}{13}\left(\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{13}\left(\frac{2}{4}\log_2\frac{2}{4} + \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{4}{13}\left(\frac{4}{4}\log_2\frac{4}{4}\right)\right] = 0.68
The code:
# Compute the conditional entropy H(D|A) for a feature
def calc_conditional_entropy(dataset, labels, feature_index):
    conditional_entropy = 0
    for value in set(row[feature_index] for row in dataset):  # each value of the feature
        subset_labels = [labels[i] for i, row in enumerate(dataset) if row[feature_index] == value]
        weight = len(subset_labels) / len(labels)  # proportion of samples with this value
        conditional_entropy += weight * calc_empirical_entropy(subset_labels)  # weighted subset entropy
    return conditional_entropy
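For A1 the dataset splits into the label groups (0,0,0,1,1), (1,1,0,0) and (1,1,1,1), with weights 5/13, 4/13 and 4/13, which reproduces H(D|A1) ≈ 0.68:

```python
import collections
import math

def calc_empirical_entropy(labels):
    counts = collections.Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Label subsets for A1 = "0", "1", "2" in the example dataset
groups = [["0", "0", "0", "1", "1"], ["1", "1", "0", "0"], ["1", "1", "1", "1"]]
total = sum(len(g) for g in groups)
h_cond = sum(len(g) / total * calc_empirical_entropy(g) for g in groups)
print(round(h_cond, 2))  # 0.68
```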
③ Compute the information gain
Information gain drives attribute selection in the decision tree: pick the feature with the largest gain (suppose that is A1).
From the two steps above, the information gain is
g(D,A_1) = H(D) - H(D|A_1) = 0.961 - 0.68 = 0.281
The code:
# Information gain = empirical entropy - conditional entropy
def calc_information_gain(dataset, labels, feature_index):
    empirical_entropy = calc_empirical_entropy(labels)
    new_entropy = 0
    for value in set(row[feature_index] for row in dataset):  # every possible value of the feature
        subset_labels = [labels[idx] for idx, row in enumerate(dataset) if row[feature_index] == value]
        weight = len(subset_labels) / len(labels)  # subset weight
        new_entropy += weight * calc_empirical_entropy(subset_labels)  # weighted entropy
    return empirical_entropy - new_entropy
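Running this on feature A1 of the example dataset gives 0.9612 − 0.6811 ≈ 0.280 (the 0.281 above comes from subtracting the already-rounded entropies); a self-contained check:

```python
import collections
import math

# Example dataset from above; the last column is the label
dataset = [
    ["0", "0", "0", "0", "0"], ["0", "0", "0", "1", "0"],
    ["1", "1", "0", "0", "1"], ["2", "2", "1", "0", "1"],
    ["2", "0", "0", "1", "1"], ["1", "1", "1", "0", "1"],
    ["0", "1", "0", "0", "0"], ["0", "2", "1", "0", "1"],
    ["1", "2", "1", "1", "0"], ["0", "1", "1", "1", "1"],
    ["2", "1", "0", "1", "1"], ["2", "2", "1", "1", "1"],
    ["1", "0", "0", "1", "0"],
]
labels = [row[-1] for row in dataset]

def calc_empirical_entropy(labels):
    counts = collections.Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def calc_information_gain(dataset, labels, feature_index):
    cond = 0
    for value in set(row[feature_index] for row in dataset):
        subset = [labels[i] for i, row in enumerate(dataset) if row[feature_index] == value]
        cond += len(subset) / len(labels) * calc_empirical_entropy(subset)
    return calc_empirical_entropy(labels) - cond

print(round(calc_information_gain(dataset, labels, 0), 3))  # 0.28
```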
2 - Information Gain Ratio
I. What the gain ratio is
The information gain is multiplied by a penalty term: small when the feature has many distinct values, large when it has few. The penalty term is the reciprocal of the entropy of dataset D with feature A as the random variable.
II. Formulas
Gain\_Ratio(D,a) = \frac{Gain(D,a)}{IV(a)}
IV(a) = -\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}
Gain_Ratio is the information gain ratio and IV is the split information.
The more distinct values a feature has, the larger the divisor; the fewer, the smaller.
The code:
def calc_information_gain_ratio(dataset, labels, feature_index):
    # Empirical entropy of the whole label set
    empirical_entropy = calc_empirical_entropy(labels)
    # Weighted conditional entropy H(D|A)
    conditional_entropy = 0
    # Split information IV(a)
    split_info = 0
    feature_values = set(row[feature_index] for row in dataset)
    for value in feature_values:
        # Label subset for this feature value
        subset_labels = [labels[idx] for idx, row in enumerate(dataset) if row[feature_index] == value]
        weight = len(subset_labels) / len(labels)
        conditional_entropy += weight * calc_empirical_entropy(subset_labels)
        split_info -= weight * math.log2(weight)  # IV accumulates -(|D^v|/|D|) * log2(|D^v|/|D|)
    information_gain = empirical_entropy - conditional_entropy
    if split_info == 0:  # feature has a single value; avoid division by zero
        return 0
    # Gain ratio = information gain divided by the split information
    return information_gain / split_info
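For the A1 split above (weights 5/13, 4/13, 4/13), the split information is IV(A1) ≈ 1.577, giving a gain ratio of about 0.2801 / 1.577 ≈ 0.178; a quick check of the formula:

```python
import math

weights = [5 / 13, 4 / 13, 4 / 13]  # |D^v| / |D| for A1 = 0, 1, 2
gain = 0.2801                       # g(D, A1) computed earlier
iv = -sum(w * math.log2(w) for w in weights)
print(round(iv, 3))                 # 1.577
print(round(gain / iv, 3))          # 0.178
```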
3. How to Build the Tree
Building with information gain (ID3)
Select the best feature
def choose_best_split_ID3(dataset, labels):
    num_features = len(dataset[0]) - 1  # number of features (the last column is the label)
    best_feature = None
    best_information_gain = -float('inf')  # start at negative infinity so any gain wins
    for feature_index in range(num_features):  # iterate over all features
        information_gain = calc_information_gain(dataset, labels, feature_index)
        if information_gain > best_information_gain:  # keep the largest gain
            best_feature = feature_index
            best_information_gain = information_gain
    return best_feature
Build the tree
# Split the dataset on a feature value
def split_data(dataset, feature_index, feature_value):
    return [row for row in dataset if row[feature_index] == feature_value]

# Build the ID3 decision tree
def build_tree_ID3(dataset, labels):
    # If all samples share one class, return a leaf node
    if labels.count(labels[0]) == len(labels):
        return DecisionTreeNode(value=labels[0])
    # If no features remain, return a leaf labeled with the majority class
    if len(dataset[0]) == 1:
        majority_class = collections.Counter(labels).most_common(1)[0][0]
        return DecisionTreeNode(value=majority_class)
    # Choose the best feature
    best_feature_index = choose_best_split_ID3(dataset, labels)
    feature_values = set(row[best_feature_index] for row in dataset)
    # Guard: a single-valued split makes no progress, so stop with the majority class
    if len(feature_values) == 1:
        majority_class = collections.Counter(labels).most_common(1)[0][0]
        return DecisionTreeNode(value=majority_class)
    node = DecisionTreeNode(feature_index=best_feature_index)
    # Create one branch per value of the chosen feature
    for feature_value in feature_values:
        subset_dataset = split_data(dataset, best_feature_index, feature_value)
        subset_labels = [labels[idx] for idx, row in enumerate(dataset) if row[best_feature_index] == feature_value]
        # Recursively build a subtree for each subset
        node.add_branch(feature_value, build_tree_ID3(subset_dataset, subset_labels))
    return node
# Decision tree node
class DecisionTreeNode:
    def __init__(self, feature_index=None, value=None):
        self.feature_index = feature_index  # feature tested at this node (internal nodes)
        self.value = value                  # class label (leaf nodes)
        self.branches = {}

    def add_branch(self, feature_value, branch):
        self.branches[feature_value] = branch

# Prediction
def predict(tree, sample):
    if tree.value is not None:
        return tree.value
    feature_val = sample[tree.feature_index]
    if feature_val in tree.branches:
        return predict(tree.branches[feature_val], sample)
    else:
        # Unseen feature value: fall back to the majority prediction over all branches
        all_labels = [predict(branch, sample) for branch in tree.branches.values()]
        return collections.Counter(all_labels).most_common(1)[0][0]
Testing a prediction
Use the following code to make a prediction:
if __name__ == "__main__":
    dataset, labels = data_set()
    decision_tree = build_tree_ID3(dataset, labels)
    sample = ["0", "1", "1", "1"]  # features of a new sample
    predicted_label = predict(decision_tree, sample)
    print("\nPredicted Label for sample", sample, "is:", predicted_label)
Below is the tree construction using the information gain ratio (C4.5).
# Select the best feature
def choose_best_split_4point5(dataset, labels):
    num_features = len(dataset[0]) - 1  # number of features (the last column is the label)
    best_feature = None
    best_gain_ratio = -float('inf')  # start at negative infinity so any ratio wins
    for feature_index in range(num_features):  # iterate over all features
        # Compute the information gain ratio
        gain_ratio = calc_information_gain_ratio(dataset, labels, feature_index)
        # Keep the largest gain ratio
        if gain_ratio > best_gain_ratio:
            best_feature = feature_index
            best_gain_ratio = gain_ratio
    return best_feature
# Build the C4.5 decision tree
def build_tree_c4point5(dataset, labels):
    # If all samples share one class, return a leaf node
    if len(set(labels)) == 1:
        return DecisionTreeNode(value=labels[0])
    # If no features remain, return a leaf labeled with the majority class
    if len(dataset[0]) == 1:
        majority_class = collections.Counter(labels).most_common(1)[0][0]
        return DecisionTreeNode(value=majority_class)
    # Choose the best feature
    best_feature_index = choose_best_split_4point5(dataset, labels)
    feature_values = set(row[best_feature_index] for row in dataset)
    # Guard: a single-valued split makes no progress, so stop with the majority class
    if len(feature_values) == 1:
        majority_class = collections.Counter(labels).most_common(1)[0][0]
        return DecisionTreeNode(value=majority_class)
    node = DecisionTreeNode(feature_index=best_feature_index)
    for value in feature_values:
        subset_dataset = [row for row in dataset if row[best_feature_index] == value]
        subset_labels = [labels[idx] for idx, row in enumerate(dataset) if row[best_feature_index] == value]
        node.add_branch(value, build_tree_c4point5(subset_dataset, subset_labels))
    return node
4. Model Evaluation (Hold-out and Leave-One-Out)
Hold-out method
import numpy as np

def hold_out_set_split(group, labels, test_size=0.3):
    index = np.arange(len(group))
    np.random.shuffle(index)  # shuffle the sample order
    group = [group[i] for i in index]    # reorder as lists
    labels = [labels[i] for i in index]
    test_size_index = int(len(group) * test_size)  # number of test samples
    train_group = group[test_size_index:]
    test_group = group[:test_size_index]
    train_labels = labels[test_size_index:]
    test_labels = labels[:test_size_index]
    return train_group, test_group, train_labels, test_labels
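With 13 samples and test_size=0.3, int(13 × 0.3) = 3 samples land in the test set and 10 in the training set. The split sizes can be checked with a stdlib-only sketch of the same logic (random.shuffle standing in for np.random.shuffle):

```python
import random

def hold_out_split_sketch(group, labels, test_size=0.3):
    index = list(range(len(group)))
    random.shuffle(index)  # permute the sample order
    group = [group[i] for i in index]
    labels = [labels[i] for i in index]
    k = int(len(group) * test_size)  # number of test samples
    return group[k:], group[:k], labels[k:], labels[:k]

data = [[str(i)] for i in range(13)]
labels = ["0"] * 13
train_g, test_g, train_l, test_l = hold_out_split_sketch(data, labels)
print(len(train_g), len(test_g))  # 10 3
```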
# Evaluate the C4.5 tree with the hold-out method
def hold_out_evaluate_tree_4point5(train_group, train_labels, test_group, test_labels):
    tree = build_tree_c4point5(train_group, train_labels)  # build the tree with C4.5
    correct = 0
    # Predict on each test sample
    for i, sample in enumerate(test_group):
        predicted_label = predict(tree, sample)
        if predicted_label == test_labels[i]:  # index directly rather than test_group.index(sample)
            correct += 1
    # Accuracy
    accuracy = correct / len(test_group)
    return accuracy

# Evaluate the ID3 tree with the hold-out method
def hold_out_evaluate_tree_ID3(train_group, train_labels, test_group, test_labels):
    tree = build_tree_ID3(train_group, train_labels)  # build the tree with ID3
    correct = 0
    # Predict on each test sample
    for i, sample in enumerate(test_group):
        predicted_label = predict(tree, sample)
        if predicted_label == test_labels[i]:
            correct += 1
    # Accuracy
    accuracy = correct / len(test_group)
    return accuracy
Use the following code to test:
if __name__ == "__main__":
    dataset, labels = data_set()
    test_size = 0.4
    train_group, test_group, train_labels, test_labels = hold_out_set_split(dataset, labels, test_size)
    accuracy1 = hold_out_evaluate_tree_ID3(train_group, train_labels, test_group, test_labels)
    accuracy2 = hold_out_evaluate_tree_4point5(train_group, train_labels, test_group, test_labels)
    print(f"ID3_Model accuracy (hold-out): {accuracy1:.2f}")
    print(f"C4.5_Model accuracy (hold-out): {accuracy2:.2f}")
Cross-validation (leave-one-out)
# Leave-one-out data splitting
def cross_validate_set_split(group, labels):
    # Every sample is used as the test set exactly once, so shuffling is unnecessary
    for i in range(len(group)):
        # Remove sample i to form the training set
        train_set = [row for j, row in enumerate(group) if j != i]
        train_labels = [label for j, label in enumerate(labels) if j != i]
        # The removed sample is the test set
        test_set = group[i]
        test_labels = labels[i]
        yield train_set, train_labels, test_set, test_labels
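The generator yields exactly n splits, one per sample, each with n − 1 training samples; this can be checked on a toy list:

```python
def loo_split(group, labels):
    # Each sample in turn is held out as the single test sample
    for i in range(len(group)):
        train_set = [row for j, row in enumerate(group) if j != i]
        train_labels = [lab for j, lab in enumerate(labels) if j != i]
        yield train_set, train_labels, group[i], labels[i]

group = [["a"], ["b"], ["c"], ["d"], ["e"]]
labels = ["0", "1", "0", "1", "0"]
splits = list(loo_split(group, labels))
print(len(splits))        # 5 splits
print(len(splits[0][0]))  # 4 training samples in each
```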
# Leave-one-out cross-validation, C4.5
def cross_validate_evaluate_tree_c4point5(group, labels):
    correct = 0
    for train_set, train_labels, test_set, test_labels in cross_validate_set_split(group, labels):
        # Build the tree on the training set
        tree = build_tree_c4point5(train_set, train_labels)
        # Predict the held-out sample
        predicted_label = predict(tree, test_set)
        if predicted_label == test_labels:
            correct += 1
    # Accuracy
    accuracy = correct / len(group)
    return accuracy

# Leave-one-out cross-validation, ID3
def cross_validate_evaluate_tree_ID3(group, labels):
    correct = 0
    for train_set, train_labels, test_set, test_labels in cross_validate_set_split(group, labels):
        # Build the tree on the training set
        tree = build_tree_ID3(train_set, train_labels)
        # Predict the held-out sample
        predicted_label = predict(tree, test_set)
        if predicted_label == test_labels:
            correct += 1
    # Accuracy
    accuracy = correct / len(group)
    return accuracy
Use the following code to test:
if __name__ == "__main__":
    dataset, labels = data_set()
    accuracy1 = cross_validate_evaluate_tree_ID3(dataset, labels)
    accuracy2 = cross_validate_evaluate_tree_c4point5(dataset, labels)
    print(f"ID3_Model accuracy (leave-one-out): {accuracy1:.2f}")
    print(f"C4.5_Model accuracy (leave-one-out): {accuracy2:.2f}")
Comparing ID3 and C4.5
Feature selection: ID3 uses information gain; C4.5 uses the information gain ratio.
Continuous attributes: ID3 does not support continuous attributes directly; C4.5 can handle them.
Missing values: C4.5 can handle missing values in the data; ID3 requires them to be preprocessed.
Pruning: C4.5 prunes the tree to avoid overfitting; ID3 typically does not.
Summary of pros and cons
ID3:
Pros:
Simple; easy to understand and implement.
For classification problems, the tree yields an intuitive set of rules.
Cons:
Biased toward features with many distinct values.
Sensitive to noise and outliers, which can lead to overfitting.
C4.5:
Pros:
Fixes ID3's bias toward many-valued features.
Can handle continuous attributes and missing values.
Pruning reduces the risk of overfitting.
Cons:
More complex to implement than ID3.
For trees with many branches, computing the gain ratio is more expensive.
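ID3's preference for many-valued features (the first drawback listed for ID3) is easy to see numerically: a hypothetical ID-like column with a unique value per sample makes every subset pure, so its gain equals H(D), the maximum possible, while the gain ratio divides that gain by a large IV = log2(n):

```python
import math

n = 13            # one distinct value per sample (an "ID" column)
h_d = 0.9612      # H(D) for the example dataset
gain = h_d - 0.0  # every subset is pure, so H(D|A) = 0 and the gain is maximal
iv = -sum((1 / n) * math.log2(1 / n) for _ in range(n))  # = log2(13), about 3.700
print(round(gain, 3))       # 0.961
print(round(gain / iv, 3))  # much smaller once divided by IV
```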