Machine Learning Week 7 (Andrew Ng)

This post walks through the decision tree learning process, including the entropy function for measuring purity, information gain for choosing the best feature to split on, and how the tree is built. It then covers tree ensembles that combine multiple decision trees, such as random forests and XGBoost, and why they reduce sensitivity to the training data and improve predictive performance.


1、Decision tree

1.1、Decision tree model

(lecture slide)

1.2、Learning process

(lecture slides)

2、Decision tree learning

2.1、Measuring purity

The entropy function
$H(p_1) = -p_1 \log_2(p_1) - p_0 \log_2(p_0) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)$, where $p_1$ is the fraction of examples at the node with label 1 and $p_0 = 1 - p_1$.

import numpy as np

def compute_entropy(y):
    """
    Computes the entropy at a node
    
    Args:
       y (ndarray): Numpy array indicating whether each example at a node is
           edible (`1`) or poisonous (`0`)
       
    Returns:
        entropy (float): Entropy at that node
        
    """
    # You need to return the following variables correctly
    entropy = 0.
    
    ### START CODE HERE ###
    m = len(y)
    if m != 0:
        # Fraction of examples at this node with label 1 (edible)
        p1 = len(y[y == 1]) / m
        # A pure node (p1 == 0 or 1) has zero entropy; guard against log2(0)
        if p1 != 0 and p1 != 1:
            entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
        else:
            entropy = 0.
    ### END CODE HERE ###
    
    return entropy
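
A quick sanity check of the function above: a node split 50/50 between the two classes has the maximum entropy of 1 bit, while a pure node has entropy 0.

# 2 edible, 2 poisonous -> maximally impure
print(compute_entropy(np.array([1, 1, 0, 0])))   # 1.0
# all edible -> pure node
print(compute_entropy(np.array([1, 1, 1, 1])))   # 0.0
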
2.2、Choosing a split: Information Gain

$\text{Information Gain} = H(p_1^{\text{node}}) - \left(w^{\text{left}} H(p_1^{\text{left}}) + w^{\text{right}} H(p_1^{\text{right}})\right)$, where $w^{\text{left}}$ and $w^{\text{right}}$ are the fractions of the node's examples that go to the left and right branches.
Pick the feature with the highest information gain.

def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into
    left and right branches
    
    Args:
        X (ndarray):             Data matrix of shape(n_samples, n_features)
        node_indices (list):  List containing the active indices. I.e, the samples being considered at this step.
        feature (int):           Index of feature to split on
    
    Returns:
        left_indices (list): Indices with feature value == 1
        right_indices (list): Indices with feature value == 0
    """
    
    # You need to return the following variables correctly
    left_indices = []
    right_indices = []
    
    ### START CODE HERE ###
    # Examples whose feature value is 1 go to the left branch, the rest go right
    for i in node_indices:
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    ### END CODE HERE ###
        
    return left_indices, right_indices


def compute_information_gain(X, y, node_indices, feature):
    
    """
    Compute the information of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
   
    Returns:
        cost (float):        Cost computed
    
    """    
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    
    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]
    
    # You need to return the following variables correctly
    information_gain = 0
    
    ### START CODE HERE ###
    # Weights: fraction of the node's examples sent to each branch
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)
    # Entropy at the node and at each branch
    node_entropy = compute_entropy(y_node)
    left_entropy = compute_entropy(y_left)
    right_entropy = compute_entropy(y_right)
    # Information gain = node entropy minus the weighted entropy of the branches
    information_gain = node_entropy - (w_left * left_entropy + w_right * right_entropy)
    ### END CODE HERE ###
    
    return information_gain


def get_best_split(X, y, node_indices):   
    """
    Returns the optimal feature to split the node data on
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split
    """    
    
    # Some useful variables
    num_features = X.shape[1]
    
    # You need to return the following variables correctly
    best_feature = -1
    
    ### START CODE HERE ###
    max_info_gain = 0
    # Try every feature and keep the one with the largest information gain
    for i in range(num_features):
        now_info_gain = compute_information_gain(X, y, node_indices, i)
        if now_info_gain > max_info_gain:
            max_info_gain = now_info_gain
            best_feature = i
    ### END CODE HERE ###
   
    return best_feature
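
A minimal usage sketch on a toy dataset of binary features (the feature values and labels below are made up for illustration, not the course's data):

X_train = np.array([[1, 1, 1],
                    [1, 0, 1],
                    [1, 0, 0],
                    [1, 0, 0],
                    [0, 1, 1],
                    [0, 1, 0],
                    [0, 0, 0],
                    [1, 0, 1]])
y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1])

root_indices = list(range(len(X_train)))
print("Best feature to split the root on:", get_best_split(X_train, y_train, root_indices))
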
2.3、Putting it together

The steps: start with all examples at the root node, compute the information gain of splitting on each feature and pick the one with the highest gain, split the examples into left and right branches, and repeat on each branch until a stopping criterion is met (e.g. a node is pure, the maximum depth is reached, or the information gain falls below a threshold).

# Not graded
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that splits the dataset into 2 subgroups at each node.
    This function just prints the tree.
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree. 
        current_depth (int):    Current depth. Parameter used during recursive call.
   
    """ 

    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
   
    # Otherwise, get best split and split the data
    # Get the best feature and threshold at this node
    best_feature = get_best_split(X, y, node_indices) 
    tree.append((current_depth, branch_name, best_feature, node_indices))
    
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    
    # continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)
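
Continuing the toy example from above, this prints the split chosen at each depth (a sketch reusing X_train, y_train and root_indices from the earlier snippet):

build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)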

2.4、Using one-hot encoding of categorical features

If a categorical feature can take on k possible values, replace it with k binary (0/1) features, exactly one of which is 1 for each example.
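
A minimal sketch of one-hot encoding with plain numpy (the ear-shape values below are made up for illustration):

import numpy as np

ear_shape = np.array(["pointy", "floppy", "oval", "pointy", "oval"])

categories = np.unique(ear_shape)                         # ['floppy' 'oval' 'pointy']
one_hot = (ear_shape[:, None] == categories).astype(int)  # one binary column per category
print(one_hot)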

2.5、Continuous valued features

Sort the examples by the feature value, take the 9 mid-points between the 10 examples as candidate thresholds, and pick the threshold that gives the highest information gain.
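
A sketch of that search for a continuous feature, reusing compute_entropy from above (the weights and labels are made up):

import numpy as np

weights = np.array([7.2, 8.4, 7.6, 10.2, 9.2, 11.0, 18.0, 20.0, 15.0, 8.8])
labels  = np.array([1,   1,   0,   1,    1,   1,    0,    0,    0,    0])

# Candidate thresholds: mid-points between consecutive sorted values
sorted_w = np.sort(weights)
thresholds = (sorted_w[:-1] + sorted_w[1:]) / 2

best_gain, best_threshold = 0.0, None
h_node = compute_entropy(labels)
for t in thresholds:
    left, right = labels[weights <= t], labels[weights > t]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    gain = h_node - (w_left * compute_entropy(left) + w_right * compute_entropy(right))
    if gain > best_gain:
        best_gain, best_threshold = gain, t

print("Best threshold:", best_threshold, "information gain:", best_gain)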

2.6、Regression with Decision Trees

Use some features to predict a continuous target such as the weight. The split is still chosen by $\text{Information Gain}$, but with the variance of the target at each node in place of the entropy: pick the split that gives the largest reduction in the weighted variance.

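A minimal sketch of that variance-reduction criterion (the helper name variance_reduction and the weights are my own, for illustration):

import numpy as np

def variance_reduction(y_node, y_left, y_right):
    """Regression analogue of information gain: reduction in weighted variance."""
    w_left = len(y_left) / len(y_node)
    w_right = len(y_right) / len(y_node)
    return np.var(y_node) - (w_left * np.var(y_left) + w_right * np.var(y_right))

# Target weights at a node and after one candidate split
y_node  = np.array([7.2, 8.4, 7.6, 10.2, 9.2, 11.0, 18.0, 20.0, 15.0, 8.8])
y_left  = np.array([7.2, 8.4, 7.6, 10.2, 9.2])
y_right = np.array([11.0, 18.0, 20.0, 15.0, 8.8])
print(variance_reduction(y_node, y_left, y_right))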

3、Tree ensembles

3.1、Using multiple decision trees

Trees are highly sensitive to small changes in the training data; training many trees and letting them vote makes the overall prediction more robust.

3.2、Sampling with replacement

Sample with replacement from the original training set to create a new training set of the same size; some examples will appear more than once and some not at all.
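
A sketch of building one such bootstrap training set with numpy, reusing the toy X_train and y_train from the earlier snippet (bootstrap_sample is my own helper name):

import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw a training set of the same size by sampling indices with replacement."""
    m = len(X)
    idx = rng.choice(m, size=m, replace=True)   # indices may repeat
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X_boot, y_boot = bootstrap_sample(X_train, y_train, rng)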

3.3、Random forest algorithm

Given a training set of size m, for b = 1 to B (e.g. B = 100): use sampling with replacement to create a new training set of size m and train a decision tree on it. At each node, instead of considering all n features, pick the split feature from a random subset of k < n features (a common choice is k ≈ √n). The B trees then vote on the final prediction.
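
scikit-learn's RandomForestClassifier implements this combination of bootstrap sampling and random feature subsets; a minimal sketch, assuming scikit-learn is installed and reusing the toy X_train and y_train:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each trained on a bootstrap sample with a random feature subset per split
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.predict(X_train[:2]))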

3.4、XGBoost

(Like deliberate practice) When building the next decision tree, focus more attention on the subset of examples that the trees trained so far are not yet doing well on, and train the new decision tree on those harder examples.
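
The xgboost library provides a scikit-learn style classifier; a minimal sketch, assuming the xgboost package is installed and reusing the toy X_train and y_train:

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)
print(model.predict(X_train[:2]))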

3.5、When to use decision trees

Decision trees and tree ensembles work well on tabular (structured) data and are fast to train, with small trees being human-interpretable; they are not recommended for unstructured data such as images, audio, and text, where neural networks tend to do better.
