Andrew Ng's 2022 Machine Learning: Decision Trees

Contents

I. Problem Statement

II. Dataset

III. Decision Tree Review

IV. Building the Decision Tree

Notes:

  • Required packages: numpy and matplotlib, plus the utils.py file that ships with the exercise.

  • It is fine if you do not have utils.py; it is only used to check that your functions are written correctly. Below each code block I show the expected output, which matches the official answer. The code blocks are my own; some differ from the answer hints, but the results are identical. If you spot a bug, please point it out in the comments.

  • The source code is available via the link at the bottom of this article.

I. Problem Statement

Suppose you are starting a company that grows and sells wild mushrooms.

  • Since not all mushrooms are edible, you would like to be able to tell whether a given mushroom is edible or poisonous based on its physical attributes.

  • You have some existing data that you can use for this task.

Can you use this data to help you decide which mushrooms are safe to sell?

Note: the dataset used is for illustrative purposes only. It is not meant to be a guide for identifying edible mushrooms.

II. Dataset

The dataset has the following format (one row per mushroom):

  Cap Color | Stalk Shape | Solitary | Edible
  ----------|-------------|----------|-------
  Brown     | Tapering    | Yes      | 1
  Brown     | Enlarging   | Yes      | 1
  Brown     | Enlarging   | No       | 0
  Brown     | Enlarging   | No       | 0
  Brown     | Tapering    | Yes      | 1
  Red       | Tapering    | Yes      | 0
  Red       | Enlarging   | No       | 0
  Brown     | Enlarging   | Yes      | 1
  Red       | Tapering    | No       | 1
  Brown     | Enlarging   | No       | 0

Notes:

  • There are 10 examples. Each example has three features:

  • Cap Color (Brown or Red)

  • Stalk Shape (Enlarging or Tapering)

  • Solitary (Yes or No)

  • and one label: Edible (1 = edible, 0 = not edible)

One-hot encoding

To make the code easier to write, the features can be rewritten with a one-hot (0/1) encoding: Brown Cap = 1 / Red = 0, Tapering Stalk Shape = 1 / Enlarging = 0, Solitary Yes = 1 / No = 0.

This way X_train and y_train contain only 0s and 1s. After one-hot encoding, X_train and y_train look like this:

import numpy as np

X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y_train = np.array([1,1,0,0,1,0,0,1,1,0])
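
As a small illustration (my own sketch, not part of the assignment), the hypothetical encode helper below simply applies the mapping described above to one row of the table:

# Minimal sketch: map one categorical example to the 0/1 feature vector used in X_train,
# assuming the encoding Brown = 1, Tapering = 1, Yes = 1 described above.
cap_color   = {"Brown": 1, "Red": 0}
stalk_shape = {"Tapering": 1, "Enlarging": 0}
solitary    = {"Yes": 1, "No": 0}

def encode(cap, stalk, sol):
    return [cap_color[cap], stalk_shape[stalk], solitary[sol]]

print(encode("Brown", "Tapering", "Yes"))  # [1, 1, 1], the first row of X_train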

Tip: you can use np.shape and type(object) to take a closer look at the type and size of the training set.

print ('The shape of X_train is:', X_train.shape)
print ('The shape of y_train is: ', y_train.shape)
print ('Number of training examples (m):', len(X_train))

Output:

The shape of X_train is: (10, 3)

The shape of y_train is: (10,)

Number of training examples (m): 10

III. Decision Tree Review

  • Recall how a decision tree is built:

  • Start with all examples at the root node.

  • Compute the information gain for every possible feature and select the feature with the highest information gain.

  • Split the dataset according to the selected feature, creating the left and right branches of the tree.

  • Keep repeating the splitting process until a stopping criterion is met.

  • In this section you will implement the following functions, which let you split a node into a left and a right branch using the feature with the highest information gain:

  • compute the entropy at a node;

  • split the dataset at a node into a left and a right branch based on a given feature;

  • compute the information gain (IG) of splitting on a given feature;

  • choose the feature that maximizes the information gain (IG).

  • In this exercise the stopping criterion is a maximum tree depth of 2.

  1. Computing the entropy

Write a compute_entropy function that computes the entropy at a node.

Notes:

  • The function takes the labels y as input and returns the entropy.

  • The entropy is computed as H(p1) = -p1 * log2(p1) - (1 - p1) * log2(1 - p1) (a quick numeric check follows this list).

  • Here p1 is the fraction of edible examples (label 1) at the node.

  • For convenience, define 0 * log2(0) = 0.

  • Remember to check that the node is not empty (len(y) != 0).
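
As a quick sanity check of the formula on the training labels above: the root node has 5 edible and 5 poisonous examples, so p1 = 5/10 = 0.5 and H(0.5) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1, which is exactly the value printed further below.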

The code for this part is as follows:

# GRADED FUNCTION: compute_entropy
import math
def compute_entropy(y):
    """
    Computes the entropy at a node
    
    Args:
       y (ndarray): Numpy array indicating whether each example at a node is
           edible (`1`) or poisonous (`0`)
       
    Returns:
        entropy (float): Entropy at that node
        
    """
    # You need to return the following variables correctly
    entropy = 0.
    
    ### START CODE HERE ###
    if len(y) != 0:
        p_1 = len(y[y == 1]) / len(y)
        if p_1 == 1 or p_1 == 0:
            entropy = 0
        else:
            entropy = -p_1*math.log(p_1,2)-(1-p_1)*math.log(1-p_1,2)
    ### END CODE HERE ###        
    
    return entropy

Check that the function is correct:

# Since we have 5 edible and 5 non-edible mushrooms, the entropy should be 1

print("Entropy at root node: ", compute_entropy(y_train)) 

# UNIT TESTS
compute_entropy_test(compute_entropy)

Output:

Entropy at root node: 1.0
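
As a usage note, the edge cases behave as expected: for a pure node, e.g. compute_entropy(np.array([1, 1, 1])), the function returns 0 because p1 = 1, and compute_entropy(np.array([])) also returns 0 thanks to the emptiness check.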

  2. Splitting the dataset

Write a split_dataset function that divides the examples at a node into a left branch and a right branch according to a given feature.

Notes:

  • The function takes as input the training data, the list of indices of the data points at this node, and the feature to split on.

  • The function returns the subsets of indices that go to the left branch and to the right branch.

  • Convention: examples whose value for the chosen feature is 1 go to the left branch, and examples whose value is 0 go to the right branch.

The code is as follows:

# GRADED FUNCTION: split_dataset

def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into
    left and right branches
    
    Args:
        X (ndarray):             Data matrix of shape(n_samples, n_features)
        node_indices (ndarray):  List containing the active indices. I.e, the samples being considered at this step.
        feature (int):           Index of feature to split on
    
    Returns:
        left_indices (ndarray): Indices with feature value == 1
        right_indices (ndarray): Indices with feature value == 0
    """
    
    # You need to return the following variables correctly
    left_indices = []
    right_indices = []
    
    ### START CODE HERE ###
    for i in node_indices:
        if X[i,feature] == 1:
            left_indices.append(i)
        elif X[i,feature] == 0:
            right_indices.append(i)
    ### END CODE HERE ###
        
    return left_indices, right_indices
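
Purely for comparison (my own addition, not the assignment's hint), an equivalent vectorized sketch using NumPy boolean indexing could look like this; it assumes node_indices can be converted to a 1-D integer index array:

import numpy as np

def split_dataset_vectorized(X, node_indices, feature):
    # Boolean mask over the active indices: True where the feature value is 1
    node_indices = np.asarray(node_indices)
    mask = X[node_indices, feature] == 1
    left_indices = node_indices[mask].tolist()    # feature value == 1
    right_indices = node_indices[~mask].tolist()  # feature value == 0
    return left_indices, right_indices

For the 0/1 features used here this returns the same index lists as the loop version above.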

Check that the function is correct:

root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Feel free to play around with these variables
# The dataset only has three features, so this value can be 0 (Brown Cap), 1 (Tapering Stalk Shape) or 2 (Solitary)
feature = 0

left_indices, right_indices = split_dataset(X_train, root_indices, feature)

print("Left indices: ", left_indices)
print("Right indices: ", right_indices)

# UNIT TESTS    
split_dataset_test(split_dataset)

Output:

Left indices: [0, 1, 2, 3, 4, 7, 9]

Right indices: [5, 6, 8]

  3. Computing the information gain (IG)

Write a compute_information_gain function that computes the information gain.

Notes:

  • The function takes as input the indices at the node, the feature to split on, and the training examples (X, y); it returns the information gain.

  • The information gain is computed as: information gain = H(p1_node) - (w_left * H(p1_left) + w_right * H(p1_right)) (a worked example for the Solitary feature follows this list).

  • Here H(p1_node) is the entropy at the node being split;

  • H(p1_left) and H(p1_right) are the entropies of the left and right branches after the split;

  • w_left and w_right are the fractions of the node's examples that go to the left and right branches, respectively.

  • You can use the compute_entropy() function to compute the entropies;

  • and the split_dataset() function to split into the left and right branches.
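
As a quick check of the formula with the training data above: splitting the root on the Solitary feature (feature 2) sends the 5 examples with value 1 to the left branch (4 of them edible, so H(p1_left) = H(0.8) ≈ 0.722) and the 5 examples with value 0 to the right branch (1 edible, so H(p1_right) = H(0.2) ≈ 0.722). With w_left = w_right = 0.5 and H(p1_node) = 1 at the root, the information gain is 1 - (0.5 * 0.722 + 0.5 * 0.722) ≈ 0.278, matching the value printed below.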

The code is as follows:

# GRADED FUNCTION: compute_information_gain

def compute_information_gain(X, y, node_indices, feature):
    
    """
    Compute the information gain of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        feature (int):          Index of feature to split on
   
    Returns:
        information_gain (float): Information gain of splitting the node on the given feature
    
    """    
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    
    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]
    
    # You need to return the following variables correctly
    information_gain = 0
    
    ### START CODE HERE ###
    
    # Weights 
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)
    #Weighted entropy
    H_p1_node = compute_entropy(y_node)   # entropy of the labels at this node, not of all labels
    H_p1_left = compute_entropy(y_left)
    H_p1_right = compute_entropy(y_right)
    #Information gain                                                   
    information_gain = H_p1_node - (w_left*H_p1_left+w_right*H_p1_right)
    ### END CODE HERE ###  
    
    return information_gain

Check the results:

info_gain0 = compute_information_gain(X_train, y_train, root_indices, feature=0)
print("Information Gain from splitting the root on brown cap: ", info_gain0)
    
info_gain1 = compute_information_gain(X_train, y_train, root_indices, feature=1)
print("Information Gain from splitting the root on tapering stalk shape: ", info_gain1)

info_gain2 = compute_information_gain(X_train, y_train, root_indices, feature=2)
print("Information Gain from splitting the root on solitary: ", info_gain2)

# UNIT TESTS
compute_information_gain_test(compute_information_gain)

Output:

Information Gain from splitting the root on brown cap: 0.034851554559677034

Information Gain from splitting the root on tapering stalk shape: 0.12451124978365313

Information Gain from splitting the root on solitary: 0.2780719051126377

(The results show that Solitary (feature = 2) has the largest information gain, so it is the best feature to split the root node on.)

  4. Finding the best split

Write a get_best_split() function that uses the information gain computed above to return the corresponding best feature.

Notes:

  • The function takes the training examples and the indices at the node, and returns the best feature to split on.

  • You can use the compute_information_gain() function to help you compute the information gain.

The code is as follows:

# GRADED FUNCTION: get_best_split

def get_best_split(X, y, node_indices):   
    """
    Returns the best feature to split the node data on
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split
    """    
    
    # Some useful variables
    num_features = X.shape[1]
    
    # You need to return the following variables correctly
    best_feature = -1
    
    ### START CODE HERE ###
    IG = []
    for i in range(num_features):
        IG.append(compute_information_gain(X, y, node_indices, i))
    best_feature = np.argmax(IG)
    ### END CODE HERE ###    
   
    return best_feature
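
One small design note on the sketch above: np.argmax returns the index of the first maximum, so if two features happened to have the same information gain, the one with the lower feature index would be chosen.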

Check the results:

best_feature = get_best_split(X_train, y_train, root_indices)
print("Best feature to split on: %d" % best_feature)

# UNIT TESTS
get_best_split_test(get_best_split)

Output:

Best feature to split on: 2

IV. Building the Decision Tree

The code is as follows:

# List used to record the splits made while building the tree
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that splits the dataset into 2 subgroups at each node.
    This function just prints the tree.
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree. 
        current_depth (int):    Current depth. Parameter used during recursive call.
   
    """ 

    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
   
    # Otherwise, get best split and split the data
    # Get the best feature and threshold at this node
    best_feature = get_best_split(X, y, node_indices) 
    tree.append((current_depth, branch_name, best_feature, node_indices))
    
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    
    # continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)

Check the results:

build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)

Output:

Depth 0, Root: Split on feature: 2

- Depth 1, Left: Split on feature: 0

-- Left leaf node with indices [0, 1, 4, 7]

-- Right leaf node with indices [5]

- Depth 1, Right: Split on feature: 1

-- Left leaf node with indices [8]

-- Right leaf node with indices [2, 3, 6, 9]
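
Reading the printed tree: the root splits on Solitary (feature 2); the left branch then splits on Brown Cap (feature 0) and the right branch on Tapering Stalk Shape (feature 1). Checking the leaves against y_train, the leaf [0, 1, 4, 7] contains only edible examples and the leaf [2, 3, 6, 9] only poisonous ones, while the single-example leaves [5] and [8] are poisonous and edible respectively, so at depth 2 the tree separates the training set perfectly.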

Summary

The main functions for the decision tree are: compute_entropy() for computing the entropy, split_dataset() for splitting the dataset into left and right branches, compute_information_gain() for computing the information gain, get_best_split() for finding the best feature to split on, and build_tree_recursive() for building the decision tree.

Source code link:

Quark Netdisk (夸克网盘)

Link: https://pan.quark.cn/s/1ea562d2a249
