Decision Trees
This post covers decision trees; the code comes from the companion exercises of Andrew Ng's course.
The running example uses a few features to decide whether a mushroom is edible or poisonous, using a decision tree model.
1. Import Packages
import numpy as np
import matplotlib.pyplot as plt
from public_tests import *
%matplotlib inline
2. Dataset
You will start by loading the dataset for this task. The dataset you have collected is as follows:
Cap Color | Stalk Shape | Solitary | Edible |
---|---|---|---|
Brown | Tapering | Yes | 1 |
Brown | Enlarging | Yes | 1 |
Brown | Enlarging | No | 0 |
Brown | Enlarging | No | 0 |
Brown | Tapering | Yes | 1 |
Red | Tapering | Yes | 0 |
Red | Enlarging | No | 0 |
Brown | Enlarging | Yes | 1 |
Red | Tapering | No | 1 |
Brown | Enlarging | No | 0 |
- You have 10 examples of mushrooms. For each example, you have
    - Three features
        - Cap Color (`Brown` or `Red`),
        - Stalk Shape (`Tapering` or `Enlarging`), and
        - Solitary (`Yes` or `No`)
    - Label
        - Edible (`1` indicating yes or `0` indicating poisonous)
2.1 One-hot Encoded Dataset
For ease of implementation, we have one-hot encoded the features (turned them into 0 or 1 valued features)
Brown Cap | Tapering Stalk Shape | Solitary | Edible |
---|---|---|---|
1 | 1 | 1 | 1 |
1 | 0 | 1 | 1 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 1 | 1 | 1 |
0 | 1 | 1 | 0 |
0 | 0 | 0 | 0 |
1 | 0 | 1 | 1 |
0 | 1 | 0 | 1 |
1 | 0 | 0 | 0 |
Therefore,
- `X_train` contains three features for each example
    - Brown Color (a value of `1` indicates "Brown" cap color and `0` indicates "Red" cap color)
    - Tapering Shape (a value of `1` indicates "Tapering" stalk shape and `0` indicates "Enlarging" stalk shape)
    - Solitary (a value of `1` indicates "Yes" and `0` indicates "No")
- `y_train` indicates whether the mushroom is edible
    - `y = 1` indicates edible
    - `y = 0` indicates poisonous
Some features may have n possible values rather than just two; representing such a feature with one-hot encoding then requires n columns.
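For example, here is a minimal sketch (not part of the original lab) of one-hot encoding a feature with n = 3 values using plain numpy; the `colors` array and the category order are made up for illustration:

# Hypothetical 3-valued feature: one indicator column per category
colors = np.array(["Brown", "Red", "White", "Brown"])
categories = ["Brown", "Red", "White"]  # fixed category order defines the columns
one_hot = np.array([[int(c == cat) for cat in categories] for c in colors])
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]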
2.2 Inspect the Data
It's good practice to start by printing the data and its types:
print("First few elements of X_train:\n", X_train[:5])
print("Type of X_train:",type(X_train))
First few elements of X_train:
[[1 1 1]
[1 0 1]
[1 0 0]
[1 0 0]
[1 1 1]]
Type of X_train: <class 'numpy.ndarray'>
print("First few elements of y_train:", y_train[:5])
print("Type of y_train:",type(y_train))
First few elements of y_train: [1 1 0 0 1]
Type of y_train: <class 'numpy.ndarray'>
Print the dimensions as well:
print ('The shape of X_train is:', X_train.shape)
print ('The shape of y_train is: ', y_train.shape)
print ('Number of training examples (m):', len(X_train))
The shape of X_train is: (10, 3)
The shape of y_train is: (10,)
Number of training examples (m): 10
3. Decision Tree Refresher
In this exercise, you will build a decision tree based on the dataset provided.
- Recall that the steps for building a decision tree are as follows:
    - Start with all examples at the root node
    - Calculate the information gain for splitting on each possible feature, and pick the one with the highest information gain
    - Split the dataset according to the selected feature, and create left and right branches of the tree
    - Keep repeating the splitting process until a stopping criterion is met
- In this lab, you'll implement the following functions, which let you split a node into left and right branches using the feature with the highest information gain:
    - Calculate the entropy at a node
    - Split the dataset at a node into left and right branches based on a given feature
    - Calculate the information gain from splitting on a given feature
    - Choose the feature that maximizes information gain
- We'll then use the helper functions you implement to build a decision tree by repeating the splitting process until the stopping criterion is met
    - For this lab, the stopping criterion we've chosen is a maximum depth of 2
3.1 Calculate Entropy
First, you'll write a helper function called `compute_entropy` that computes the entropy (a measure of impurity) at a node.
- The function takes in a numpy array (`y`) indicating whether each example mushroom at that node is edible (`1`) or poisonous (`0`)
- Complete the `compute_entropy()` function below to:
    - Compute $p_1$, the fraction of examples that are edible (i.e. have value `1` in `y`)
    - Calculate the entropy as $H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\log_2(1 - p_1)$
    - Note:
        - The log is calculated with base 2
        - For implementation purposes, $0\log_2(0) = 0$. That is, if `p_1 = 0` or `p_1 = 1`, set the entropy to `0` (this requires a special case in the code)
        - Make sure to check that the data at a node is not empty (i.e. `len(y) != 0`). Return `0` if it is.
The code is as follows:
# UNQ_C1
# GRADED FUNCTION: compute_entropy
def compute_entropy(y):
    """
    Computes the entropy for the examples at a node

    Args:
        y (ndarray): Numpy array indicating whether each example at a node is
            edible (`1`) or poisonous (`0`)

    Returns:
        entropy (float): Entropy at that node
    """
    # You need to return the following variables correctly
    entropy = 0.

    ### START CODE HERE ###
    if len(y) != 0:
        # Fraction of examples labeled 1; `y == 1` selects the subarray of 1s
        p1 = len(y[y == 1]) / len(y)
        if p1 != 0 and p1 != 1:
            entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
        else:
            entropy = 0.
    ### END CODE HERE ###

    return entropy
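As a quick sanity check (an illustrative call, not part of the graded code): the root node has 5 edible mushrooms out of 10, so $p_1 = 0.5$ and the entropy should be exactly 1.

print("Entropy at root node: ", compute_entropy(y_train))  # expected: 1.0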
3.2 Split Dataset
Next, you'll write a helper function called `split_dataset` that takes in the data at a node and a feature to split on, and splits the data into left and right branches. Later in the lab, you'll implement code to calculate how good the split is.
- The function takes in the training data, the list of indices of data points at that node, and the feature to split on.
- It splits the data and returns the subsets of indices at the left and the right branch.
- For example, say we're starting at the root node (so `node_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`), and we chose to split on feature `0`, i.e. whether or not the example has a brown cap.
    - The output of the function would then be `left_indices = [0, 1, 2, 3, 4, 7, 9]` and `right_indices = [5, 6, 8]`
Index | Brown Cap | Tapering Stalk Shape | Solitary | Edible |
---|---|---|---|---|
0 | 1 | 1 | 1 | 1 |
1 | 1 | 0 | 1 | 1 |
2 | 1 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 |
4 | 1 | 1 | 1 | 1 |
5 | 0 | 1 | 1 | 0 |
6 | 0 | 0 | 0 | 0 |
7 | 1 | 0 | 1 | 1 |
8 | 0 | 1 | 0 | 1 |
9 | 1 | 0 | 0 | 0 |
Exercise 2
Please complete the `split_dataset()` function shown below.
- For each index in `node_indices`:
    - If the value of `X` at that index for that feature is `1`, add the index to `left_indices`
    - If the value of `X` at that index for that feature is `0`, add the index to `right_indices`
Implementation:
# UNQ_C2
# GRADED FUNCTION: split_dataset
def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into
    left and right branches

    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered at this step.
        feature (int):          Index of feature to split on

    Returns:
        left_indices (ndarray):  Indices with feature value == 1
        right_indices (ndarray): Indices with feature value == 0
    """
    # You need to return the following variables correctly
    left_indices = []
    right_indices = []

    ### START CODE HERE ###
    for i in node_indices:
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    ### END CODE HERE ###

    return left_indices, right_indices
Calling the function:
root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Feel free to play around with these variables
# The dataset only has three features, so this value can be 0 (Brown Cap), 1 (Tapering Stalk Shape) or 2 (Solitary)
feature = 0
left_indices, right_indices = split_dataset(X_train, root_indices, feature)
print("Left indices: ", left_indices)
print("Right indices: ", right_indices)
3.3 Compute Information Gain
Next, you'll write a function called `compute_information_gain` that takes in the training data, the indices at a node, and a feature to split on, and returns the information gain from the split.
Exercise 3
Please complete the `compute_information_gain()` function shown below to compute
$$\text{Information Gain} = H(p_1^\text{node}) - \left(w^\text{left} H(p_1^\text{left}) + w^\text{right} H(p_1^\text{right})\right)$$

where
- $H(p_1^\text{node})$ is the entropy at the node
- $H(p_1^\text{left})$ and $H(p_1^\text{right})$ are the entropies at the left and right branches resulting from the split
- $w^\text{left}$ and $w^\text{right}$ are the proportions (weights) of examples at the left and right branches, respectively
Implementation. Note that `len(X_node) == len(X_left) + len(X_right)`:
# UNQ_C3
# GRADED FUNCTION: compute_information_gain
def compute_information_gain(X, y, node_indices, feature):
    """
    Compute the information gain of splitting the node on a given feature

    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        feature (int):          Index of feature to split on

    Returns:
        information_gain (float): Information gain computed
    """
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)

    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]

    # You need to return the following variables correctly
    information_gain = 0

    ### START CODE HERE ###
    # Weights: proportion of examples sent to each branch
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)

    # Entropies (remember these are computed on the y values, not on X)
    H_p1_node = compute_entropy(y_node)
    H_p1_left = compute_entropy(y_left)
    H_p1_right = compute_entropy(y_right)

    # Information gain
    information_gain = H_p1_node - (w_left * H_p1_left + w_right * H_p1_right)
    ### END CODE HERE ###

    return information_gain
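As a quick check, splitting the root on each of the three features gives the gains below (the calls are illustrative; the values follow from the dataset above):

print("Gain from splitting the root on brown cap: ",
      compute_information_gain(X_train, y_train, root_indices, feature=0))  # ≈ 0.0349
print("Gain from splitting the root on tapering stalk shape: ",
      compute_information_gain(X_train, y_train, root_indices, feature=1))  # ≈ 0.1245
print("Gain from splitting the root on solitary: ",
      compute_information_gain(X_train, y_train, root_indices, feature=2))  # ≈ 0.2781

Splitting on Solitary (feature 2) gives the highest information gain, which is why the root split in section 4 lands on feature 2.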
3.4 Get Best Split
Now get the best feature to split on by computing the information gain from splitting on each feature, as above, and returning the feature that gives the maximum information gain.
Exercise 4
Please complete the `get_best_split()` function shown below.
- The function takes in the training data, along with the indices of the data points at that node
- The output of the function is the feature that gives the maximum information gain
- You can use the `compute_information_gain()` function to iterate over the features and calculate the information gain for each one
The code is as follows:
# UNQ_C4
# GRADED FUNCTION: get_best_split
def get_best_split(X, y, node_indices):
    """
    Returns the optimal feature
    to split the node data

    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int): The index of the best feature to split
    """
    # Some useful variables
    num_features = X.shape[1]

    # You need to return the following variables correctly
    best_feature = -1

    ### START CODE HERE ###
    max_gain = 0
    for i in range(num_features):
        info_gain = compute_information_gain(X, y, node_indices, i)
        if info_gain > max_gain:
            max_gain = info_gain
            best_feature = i
    ### END CODE HERE ###

    return best_feature
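Checking it at the root (an illustrative call; per the gains computed above, the winner should be Solitary):

best_feature = get_best_split(X_train, y_train, root_indices)
print("Best feature to split on: %d" % best_feature)  # expected: 2 (Solitary)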
4. Build the Decision Tree
Now build the decision tree recursively:
# Not graded
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that splits the dataset into 2 subgroups at each node.
    This function just prints the tree.

    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree.
        current_depth (int):    Current depth. Parameter used during recursive call.
    """
    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " " * current_depth + "-" * current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return

    # Otherwise, get the best feature at this node and split the data on it
    best_feature = get_best_split(X, y, node_indices)
    tree.append((current_depth, branch_name, best_feature, node_indices))

    formatting = "-" * current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))

    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)

    # Continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth + 1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth + 1)
build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)
Depth 0, Root: Split on feature: 2
- Depth 1, Left: Split on feature: 0
-- Left leaf node with indices [0, 1, 4, 7]
-- Right leaf node with indices [5]
- Depth 1, Right: Split on feature: 1
-- Left leaf node with indices [8]
-- Right leaf node with indices [2, 3, 6, 9]
5. Review Questions
Applying the entropy formula:
$$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\log_2(1 - p_1)$$
Applying the information gain formula:
$$\text{Information Gain} = H(p_1^\text{node}) - \left(w^\text{left} H(p_1^\text{left}) + w^\text{right} H(p_1^\text{right})\right)$$
Finding split points for continuous-valued features in a decision tree
Try the midpoint between every pair of adjacent sorted values; the candidate threshold with the largest information gain is the split point (see the sketch below).
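A minimal sketch of this idea (not from the lab; `weights` is a made-up continuous feature, and the helper reuses `compute_entropy` from above):

def best_continuous_split(values, y):
    """Try the midpoint between each pair of adjacent sorted values;
    return the threshold with the largest information gain."""
    order = np.argsort(values)
    v, labels = values[order], y[order]
    best_threshold, best_gain = None, 0.0
    for i in range(len(v) - 1):
        threshold = (v[i] + v[i + 1]) / 2  # midpoint between adjacent values
        left, right = labels[v <= threshold], labels[v > threshold]
        w_left, w_right = len(left) / len(labels), len(right) / len(labels)
        gain = compute_entropy(labels) - (w_left * compute_entropy(left)
                                          + w_right * compute_entropy(right))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

weights = np.array([1.2, 0.9, 1.5, 1.1, 0.7])  # hypothetical continuous feature
labels = np.array([1, 0, 1, 1, 0])
print(best_continuous_split(weights, labels))  # (1.0, ~0.971): splitting at 1.0 separates the classes perfectly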
When to stop splitting a node
- The number of examples at the node is below a threshold
- The tree has reached its maximum depth
Random forests
For a random forest, how do you build each tree so that the trees are not all identical to one another?
Sample the training data with replacement: this generates a training set that is unique to each tree.
What is sampling with replacement?
Drawing a sequence of examples where, before picking the next example, all previously drawn examples are first put back into the pool (see the sketch below).
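A minimal numpy sketch of sampling with replacement (illustrative, not from the lab); `np.random.choice`-style sampling with `replace=True` can pick the same row more than once:

rng = np.random.default_rng(0)
m = len(X_train)
sampled_rows = rng.choice(m, size=m, replace=True)  # row indices, duplicates allowed
X_bag, y_bag = X_train[sampled_rows], y_train[sampled_rows]
print(sampled_rows)  # some indices appear twice, others not at all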
Neural networks vs. decision trees
Decision trees tend to work better on structured (tabular) data, while neural networks tend to work better on unstructured data such as images, audio, and text.