Introduction to the Decision Tree Algorithm
What Is a Decision Tree
As the name suggests, a decision tree organizes the training samples into a tree structure and then classifies new data by following the rules encoded in that tree.
The flowchart above is a decision tree: squares represent decision blocks and ovals represent terminating blocks, which hold the final classification results.
How to Implement a Decision Tree
How to Choose the Best Feature
Choosing the best feature of the data set and splitting the data on it is the core of a decision tree.
To choose the best feature, we first need the concept of information gain: the change in information before and after splitting the data set. The best feature is the one whose split yields the highest information gain.
So how do we compute information gain? We use the Shannon entropy (entropy for short, introduced by Claude Shannon, the father of information theory), which measures the information content of a set, or can be defined as the expected value of the information.
Definition of information: if the data to be classified may fall into multiple classes, the information of the symbol $x_i$ is defined as
$$l(x_i) = -\log_2 p(x_i)$$
where $p(x_i)$ is the probability of choosing that class.
To compute the entropy, we sum the expected information over all possible values of all classes:
$$E(x) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$
where $n$ is the number of classes.
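For a quick worked example (toy numbers of my own, not from the original text): a set with two "yes" samples and three "no" samples has
$$E = -0.4 \log_2 0.4 - 0.6 \log_2 0.6 \approx 0.971,$$
so the set carries about 0.971 bits of uncertainty; a perfectly pure set would have entropy 0.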
Advantages and Disadvantages of Decision Trees
Advantages: low computational complexity; results that are easy to interpret; insensitivity to missing intermediate values; able to handle irrelevant features.
Disadvantages: prone to overfitting.
Code Implementation
from math import log
import numpy as np
def shannon_entropy(train_data, train_labels):
"""
计算熵
:param train_data: 需要计算想农熵的数据集
:param train_labels: 数据集的标签
:return: 香农熵
"""
num_samples = len(train_data) # 统计样本数量
label_counts = {} # 定义标签字典
assert train_data.shape[0] == train_labels.shape[0], '维度不匹配'
for sample, current_label in zip(train_data, train_labels): # 统计每个标签的数量
if current_label not in label_counts.keys():
label_counts[current_label] = 0
label_counts[current_label] += 1
shannon_ent = 0 # 初始化想农熵
for k in label_counts: # 计算总的信息熵
prob = float(label_counts[k]) / num_samples
shannon_ent -= prob * log(prob, 2)
return shannon_ent
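As a quick sanity check, here is a minimal usage sketch; the toy arrays below are my own illustration, not part of the original article. Two "yes" and three "no" labels reproduce the ~0.971 bits computed above.

# Toy data: 5 samples, 2 binary features (hypothetical example).
toy_data = np.array([[1, 1], [1, 1], [1, 0], [0, 1], [0, 1]])
toy_labels = np.array(['yes', 'yes', 'no', 'no', 'no'])
print(shannon_entropy(toy_data, toy_labels))  # ~0.9710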
def split_data(train_data, train_labels, feature, value):
"""
划分数据集
:param train_data: 待划分的数据集
:param feature: 划分数据集的特征
:param value: 特征的返回值
:return: 分割后的数据集(即删除最好特征列,剩余的数据)
"""
rest_data = []
rest_labels = []
for index, sampleVec in enumerate(train_data):
if sampleVec[feature] == value:
rest_labels.append(train_labels[index])
temp = np.concatenate([sampleVec[:feature], sampleVec[feature + 1:]])
rest_data.append(temp)
return np.array(rest_data), np.array(rest_labels)
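A hypothetical call on the toy arrays from above (again my own illustration, not from the original): splitting on feature 0 with value 1 keeps three samples and removes that column.

sub_data, sub_labels = split_data(toy_data, toy_labels, feature=0, value=1)
print(sub_data)    # [[1] [1] [0]] - column 0 removed
print(sub_labels)  # ['yes' 'yes' 'no']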
def choose_best_feature(train_data, train_labels):
"""
选择最好特征
:param train_data: 训练数据
:param train_labels: 训练数据标签
:return: 样本最好特征的序号
"""
num_features = len(train_data[0])
base_entropy = shannon_entropy(train_data, train_labels)
best_info_gain = 0
best_feature = -1
for n in range(num_features):
feat_values = [sample[n] for sample in train_data]
unique_values = set(feat_values)
new_entropy = 0
for value in unique_values:
sub_data, sub_labels = split_data(train_data, train_labels, n, value)
prob = len(sub_data) / float(len(train_data))
new_entropy += prob * shannon_entropy(sub_data, sub_labels)
info_gain = base_entropy - new_entropy
if info_gain > best_info_gain:
best_info_gain = info_gain
best_feature = n
return best_feature
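On the toy set this picks feature 0, whose split yields the larger information gain (about 0.42 bits versus 0.17 for feature 1; my own numbers, for illustration only):

print(choose_best_feature(toy_data, toy_labels))  # 0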
def majority_vote(clist):
"""
:param clist: 输入列表
:return: 列表中最多的元素
"""
import collections
return collections.Counter(clist).most_common(1)[0][0]
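For example (an illustrative call of my own):

print(majority_vote(['no', 'yes', 'no']))  # 'no'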
def DecisionTree(train_data, train_labels, categories):
"""
构建决策树
:param train_data: 训练数据
:param train_labels: 训练数据标签
:param categories: 特征名称列表
:return: 构建完成的决策树
"""
class_list = train_labels
if len(set(class_list)) == 1:
return class_list[0]
if len(train_data[0]) == 0:
return majority_vote(class_list)
best_feat = choose_best_feature(train_data, train_labels)
best_feat_label = categories[best_feat]
my_tree = {best_feat_label: {}}
del (categories[best_feat])
feat_values = [sample[best_feat] for sample in train_data]
unique_feat_values = set(feat_values)
for value in unique_feat_values:
subcategories = categories[:]
sub_data, sublabels = split_data(train_data, train_labels, best_feat, value)
my_tree[best_feat_label][value] = DecisionTree(sub_data, sublabels, subcategories)
return my_tree
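To build a tree on the toy set we need feature names; 'no surfacing' and 'flippers' below are hypothetical names of my own invention, not from the original:

feature_names = ['no surfacing', 'flippers']  # hypothetical feature names
tree = DecisionTree(toy_data, toy_labels, feature_names)
print(tree)
# Roughly: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
# (the exact repr of keys and leaf values depends on the NumPy version)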
def DecisionTreePredict(tree, test_data, feature_names):
"""
:param tree: 构建完成的决策树
:param test_data: 测试数据
:param feature_names: 特征名称列表
:return: 预测值
"""
first_feat = list(tree.keys())[0]
second_dict = tree[first_feat]
feat_index = feature_names.index(first_feat)
for k in second_dict.keys():
if test_data[feat_index] == k:
if type(second_dict[k]) is dict:
class_label = DecisionTreePredict(second_dict[k], test_data, feature_names)
else:
class_label = second_dict[k]
return class_label
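Finally, a minimal end-to-end sketch, predicting with the hypothetical toy tree built above:

print(DecisionTreePredict(tree, [1, 0], feature_names))  # 'no'
print(DecisionTreePredict(tree, [1, 1], feature_names))  # 'yes'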