【Machine Learning】21. Decision Trees


This post covers decision trees. The code comes from the companion assignments of Andrew Ng's course.

The example uses a few features to predict whether a mushroom is edible or poisonous, with a decision tree model.

1. Import packages

import numpy as np
import matplotlib.pyplot as plt
from public_tests import *

%matplotlib inline

2. Dataset

You will start by loading the dataset for this task. The dataset you have collected is as follows:

Cap Color | Stalk Shape | Solitary | Edible
Brown     | Tapering    | Yes      | 1
Brown     | Enlarging   | Yes      | 1
Brown     | Enlarging   | No       | 0
Brown     | Enlarging   | No       | 0
Brown     | Tapering    | Yes      | 1
Red       | Tapering    | Yes      | 0
Red       | Enlarging   | No       | 0
Brown     | Enlarging   | Yes      | 1
Red       | Tapering    | No       | 1
Brown     | Enlarging   | No       | 0
  • You have 10 examples of mushrooms. For each example, you have
    • Three features
      • Cap Color (Brown or Red),
      • Stalk Shape (Tapering or Enlarging), and
      • Solitary (Yes or No)
    • Label
      • Edible (1 indicating yes or 0 indicating poisonous)

2.1 One-hot encoded dataset

For ease of implementation, we have one-hot encoded the features (turned them into 0 or 1 valued features)

Brown Cap | Tapering Stalk Shape | Solitary | Edible
1         | 1                    | 1        | 1
1         | 0                    | 1        | 1
1         | 0                    | 0        | 0
1         | 0                    | 0        | 0
1         | 1                    | 1        | 1
0         | 1                    | 1        | 0
0         | 0                    | 0        | 0
1         | 0                    | 1        | 1
0         | 1                    | 0        | 1
1         | 0                    | 0        | 0

Therefore,

  • X_train contains three features for each example

    • Brown Color (A value of 1 indicates “Brown” cap color and 0 indicates “Red” cap color)
    • Tapering Shape (A value of 1 indicates “Tapering Stalk Shape” and 0 indicates “Enlarging” stalk shape)
    • Solitary (A value of 1 indicates “Yes” and 0 indicates “No”)
  • y_train is whether the mushroom is edible

    • y = 1 indicates edible
    • y = 0 indicates poisonous

Some features may take n possible values rather than just two; in that case the one-hot encoding needs n columns, one per value (see the sketch below).
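
As a minimal sketch (not part of the assignment), here is how a hypothetical three-valued Cap Color feature (Brown/Red/White) could be one-hot encoded into three columns with plain numpy:

# Hypothetical 3-valued feature: 0 = Brown, 1 = Red, 2 = White
cap_color = np.array([0, 1, 2, 0, 1])

# One column per possible value; each row gets a 1 in the column of its value
cap_color_one_hot = np.eye(3, dtype=int)[cap_color]
print(cap_color_one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]
#  [0 1 0]]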

2.2 Inspecting the data

It is always a good idea to start by printing the data and its types:

print("First few elements of X_train:\n", X_train[:5])
print("Type of X_train:",type(X_train))

First few elements of X_train:
 [[1 1 1]
 [1 0 1]
 [1 0 0]
 [1 0 0]
 [1 1 1]]
Type of X_train: <class 'numpy.ndarray'>


print("First few elements of y_train:", y_train[:5])
print("Type of y_train:",type(y_train))

First few elements of y_train: [1 1 0 0 1]
Type of y_train: <class 'numpy.ndarray'>

The dimensions should also be printed:

print ('The shape of X_train is:', X_train.shape)
print ('The shape of y_train is: ', y_train.shape)
print ('Number of training examples (m):', len(X_train))


The shape of X_train is: (10, 3)
The shape of y_train is:  (10,)
Number of training examples (m): 10

3. Decision tree refresher

In this exercise, you will build a decision tree from the dataset provided above.

  • The steps for building a decision tree are as follows:

    • Start with all examples at the root node
    • Calculate the information gain for splitting on every possible feature, and pick the feature with the highest information gain
    • Split the dataset according to the selected feature, and create left and right branches of the tree
    • Keep repeating the splitting process until a stopping criterion is met
  • In this lab, you will implement the following functions, which let you split a node into left and right branches using the feature with the highest information gain:

    • Calculate the entropy at a node
    • Split the dataset at a node into left and right branches based on a given feature
    • Calculate the information gain from splitting on a given feature
    • Choose the feature that maximizes the information gain
  • We will then use the helper functions you implemented to build the decision tree by repeating the splitting process until the stopping criterion is met
    • For this lab, the stopping criterion we have chosen is a maximum depth of 2

3.1 Calculate entropy

First, you will write a helper function called compute_entropy that computes the entropy (a measure of impurity) at a node.

  • The function takes in a numpy array (y) indicating whether the examples at that node are edible (1) or poisonous (0)
  • Complete the compute_entropy() function below
  • Compute $p_1$, which is the fraction of examples that are edible (i.e. have value = 1 in y)
  • The entropy is then calculated as

$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\log_2(1 - p_1)$

  • Note
    • The log is calculated with base $2$
    • For implementation purposes, $0\log_2(0) = 0$. That is, if $p_1 = 0$ or $p_1 = 1$, set the entropy to 0 (this needs a special case in the code)
    • Make sure to check that the data at a node is not empty (i.e. len(y) != 0), and return 0 if it is

The code is as follows:

# UNQ_C1
# GRADED FUNCTION: compute_entropy


def compute_entropy(y):
    """
    Computes the entropy for the examples at a node
    
    Args:
       y (ndarray): Numpy array indicating whether each example at a node is
           edible (`1`) or poisonous (`0`)
       
    Returns:
        entropy (float): Entropy at that node
        
    """
    # You need to return the following variables correctly
    entropy = 0.
    
    ### START CODE HERE ###
    if len(y)!=0:
        p1 = len(y[y == 1]) / len(y)  # fraction of examples labeled 1 at this node; y == 1 selects the subarray of ones

        if p1!=1 and p1!=0:
            entropy = -p1*np.log2(p1) - (1-p1)*np.log2(1 - p1)
        else:
            entropy = 0.
    ### END CODE HERE ###        
    
    return entropy
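
As a quick sanity check (assuming the X_train and y_train loaded above): 5 of the 10 examples are edible, so $p_1 = 0.5$ and the entropy at the root node should be exactly 1.

print("Entropy at root node: ", compute_entropy(y_train))
# Expected output: 1.0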

3.2 Split dataset

Next, you will write a helper function called split_dataset that takes in the data at a node and a feature to split on, and splits it into left and right branches. Later in the lab, you will implement code to calculate how good the split is.

  • The function takes in the training data, the list of indices of data points at that node, and the feature to split on.
  • It splits the data and returns the subsets of indices at the left and the right branch.
  • For example, say we start at the root node (so node_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) and choose to split on feature 0, which is whether the example has a brown cap.
    • The output of the function is then left_indices = [0, 1, 2, 3, 4, 7, 9] and right_indices = [5, 6, 8]
Index | Brown Cap | Tapering Stalk Shape | Solitary | Edible
0     | 1         | 1                    | 1        | 1
1     | 1         | 0                    | 1        | 1
2     | 1         | 0                    | 0        | 0
3     | 1         | 0                    | 0        | 0
4     | 1         | 1                    | 1        | 1
5     | 0         | 1                    | 1        | 0
6     | 0         | 0                    | 0        | 0
7     | 1         | 0                    | 1        | 1
8     | 0         | 1                    | 0        | 1
9     | 1         | 0                    | 0        | 0

Exercise 2

Please complete the split_dataset() function shown below

  • For each index in node_indices
    • If the value of X at that index for that feature is 1, add the index to left_indices
    • If the value of X at that index for that feature is 0, add the index to right_indices

Implementation:

# UNQ_C2
# GRADED FUNCTION: split_dataset

def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into
    left and right branches
    
    Args:
        X (ndarray):             Data matrix of shape(n_samples, n_features)
        node_indices (ndarray):  List containing the active indices. I.e, the samples being considered at this step.
        feature (int):           Index of feature to split on
    
    Returns:
        left_indices (ndarray): Indices with feature value == 1
        right_indices (ndarray): Indices with feature value == 0
    """
    
    # You need to return the following variables correctly
    left_indices = []
    right_indices = []


    
    ### START CODE HERE ###
    for i in node_indices:
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)           
    ### END CODE HERE ###
        
    return left_indices, right_indices

Calling the function:

root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Feel free to play around with these variables
# The dataset only has three features, so this value can be 0 (Brown Cap), 1 (Tapering Stalk Shape) or 2 (Solitary)
feature = 0

left_indices, right_indices = split_dataset(X_train, root_indices, feature)

print("Left indices: ", left_indices)
print("Right indices: ", right_indices)

3.3 Compute information gain

Next, you will write a function called compute_information_gain that takes in the training data, the indices at a node, and a feature to split on, and returns the information gain from the split.

Exercise 3

Please complete the compute_information_gain() function shown below to compute

$\text{Information Gain} = H(p_1^\text{node}) - \left(w^\text{left} H(p_1^\text{left}) + w^\text{right} H(p_1^\text{right})\right)$

where

  • $H(p_1^\text{node})$ is the entropy at the node
  • $H(p_1^\text{left})$ and $H(p_1^\text{right})$ are the entropies at the left and the right branches resulting from the split
  • $w^\text{left}$ and $w^\text{right}$ are the proportions of examples at the left and right branch, respectively (the weights of the two branches)

Implementation:
Note that len(X_node) = len(X_left) + len(X_right).

# UNQ_C3
# GRADED FUNCTION: compute_information_gain

def compute_information_gain(X, y, node_indices, feature):
    
    """
    Compute the information gain of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        feature (int):          Index of feature to split on
   
    Returns:
        information_gain (float): Information gain from splitting on the given feature
    
    """    
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    
    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]
    
    # You need to return the following variables correctly
    information_gain = 0
    
    ### START CODE HERE ###
    
    # Weights: fraction of the node's examples that went to each branch
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)
    # Entropy at the node and at each branch (computed on the labels y, not on X)
    H_p1_node = compute_entropy(y_node)
    H_p1_left = compute_entropy(y_left)
    H_p1_right = compute_entropy(y_right)
    # Information gain
    information_gain = H_p1_node - (w_left*H_p1_left + w_right*H_p1_right)
    ### END CODE HERE ###  
    
    return information_gain
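
As a quick check (a sketch using the variables defined above), compare the information gain from splitting the root node on each of the three features. Feature 2 (Solitary) should give the largest gain, which is consistent with the root split of the tree built below.

for feature in range(3):
    gain = compute_information_gain(X_train, y_train, root_indices, feature)
    print("Information gain from splitting the root on feature %d: %.4f" % (feature, gain))
# Roughly 0.0349 (Brown Cap), 0.1245 (Tapering Stalk Shape), 0.6100 (Solitary)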

3.4 Get best split

Get the best feature to split on by computing the information gain of each feature as described above and returning the feature that gives the maximum information gain.

Exercise 4

Please complete the get_best_split() function shown below.

  • The function takes in the training data, along with the indices of the data points at that node
  • The output of the function is the feature that gives the maximum information gain
  • You can use the compute_information_gain() function to iterate through the features and calculate the information gain for each feature

The code is as follows:

# UNQ_C4
# GRADED FUNCTION: get_best_split

def get_best_split(X, y, node_indices):   
    """
    Returns the optimal feature to split the node data on
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split
    """    
    
    # Some useful variables
    num_features = X.shape[1]
    
    # You need to return the following variables correctly
    best_feature = -1
    
    ### START CODE HERE ###
    max_gain = 0
    for i in range(num_features):
        info_gain = compute_information_gain(X,y,node_indices,i)
        if  info_gain > max_gain:
            max_gain = info_gain
            best_feature = i
    ### END CODE HERE ##    
   
    return best_feature
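
Calling it at the root node (using X_train, y_train, and root_indices from above) should pick feature 2, Solitary, matching the gains computed earlier:

best_feature = get_best_split(X_train, y_train, root_indices)
print("Best feature to split on: %d" % best_feature)
# Expected: 2 (Solitary)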

4. Building the decision tree

Build the decision tree recursively:

# Not graded
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using a recursive algorithm that splits the dataset into 2 subgroups at each node.
    This function just prints the tree.
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree. 
        current_depth (int):    Current depth. Parameter used during recursive call.
   
    """ 

    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
   
    # Otherwise, get best split and split the data
    # Get the best feature and threshold at this node
    best_feature = get_best_split(X, y, node_indices) 
    tree.append((current_depth, branch_name, best_feature, node_indices))
    
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    
    # continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)

build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)

Depth 0, Root: Split on feature: 2
- Depth 1, Left: Split on feature: 0
  -- Left leaf node with indices [0, 1, 4, 7]
  -- Right leaf node with indices [5]
- Depth 1, Right: Split on feature: 1
  -- Left leaf node with indices [8]
  -- Right leaf node with indices [2, 3, 6, 9]

5. Quiz questions

Applying the entropy formula

$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\log_2(1 - p_1)$

Applying the information gain formula

$\text{Information Gain} = H(p_1^\text{node}) - \left(w^\text{left} H(p_1^\text{left}) + w^\text{right} H(p_1^\text{right})\right)$

Finding decision tree split points for continuous-valued features

Try the midpoint between every pair of adjacent values (after sorting); the candidate with the largest information gain is chosen as the split point. A small sketch follows.
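
Here is a minimal sketch (not part of the assignment) of this idea for a hypothetical continuous feature, reusing the compute_entropy helper from above; the feature values and labels are made up for illustration:

weights = np.array([2.1, 3.5, 4.0, 5.2, 6.8, 7.4])  # hypothetical continuous feature (e.g. weight in grams)
labels  = np.array([0,   0,   1,   1,   1,   0])    # hypothetical edible labels

best_gain, best_threshold = 0.0, None
sorted_vals = np.sort(weights)
for lo, hi in zip(sorted_vals[:-1], sorted_vals[1:]):
    threshold = (lo + hi) / 2  # midpoint between two adjacent sorted values
    left, right = labels[weights <= threshold], labels[weights > threshold]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    gain = compute_entropy(labels) - (w_left * compute_entropy(left) + w_right * compute_entropy(right))
    if gain > best_gain:
        best_gain, best_threshold = gain, threshold

print("Best threshold: %.2f, information gain: %.4f" % (best_threshold, best_gain))
# With these made-up values the best threshold should be 3.75 (gain roughly 0.46)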

When to stop splitting a node

  • The number of examples at the node is below a threshold
  • The tree has reached its maximum depth

Random forests

For a random forest, how do you build each tree so that they are not all identical to each other?
Sample the training data with replacement: you can generate a training set that is unique to each tree by sampling the training data with replacement.

What is sampling with replacement?

Draw a sequence of examples where, when picking the next example, all previously drawn examples are first put back into the pool (a sketch is shown below).
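
As a minimal numpy sketch (an illustration, not from the assignment), one bootstrap sample of the 10 training indices drawn with replacement looks like this; each tree in a random forest would be trained on its own such sample:

rng = np.random.default_rng(0)  # fixed seed just for reproducibility
bootstrap_indices = rng.choice(len(X_train), size=len(X_train), replace=True)
print(bootstrap_indices)        # some indices repeat, others are missing entirely
X_bag, y_bag = X_train[bootstrap_indices], y_train[bootstrap_indices]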

Neural networks vs. decision trees

Decision trees tend to work better on structured (tabular) data, while neural networks work better on unstructured data.
