Machine Learning Week 7 (Andrew Ng)

This post walks through the decision tree learning process, including the entropy function for measuring purity, information gain for choosing the best feature to split on, and how the tree is built. It then covers tree ensembles that combine multiple decision trees, such as random forests and XGBoost, and why they reduce sensitivity to the training data and improve predictive performance.


1、Decision tree

1.1、Decision tree model

(lecture slide)

1.2、Learning process

(lecture slides)

2、Decision tree learning

2.1、Measuring purity

The entropy function
$H(p_1) = -p_1 \log_2(p_1) - p_0 \log_2(p_0) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)$, where $p_1$ is the fraction of examples at the node with label 1 and $p_0 = 1 - p_1$.

import numpy as np

def compute_entropy(y):
    """
    Computes the entropy at a node
    
    Args:
       y (ndarray): Numpy array indicating whether each example at a node is
           edible (`1`) or poisonous (`0`)
       
    Returns:
        entropy (float): Entropy at that node
        
    """
    # You need to return the following variables correctly
    entropy = 0.
    
    ### START CODE HERE ###
    m = len(y)
    if m != 0:
        # Fraction of examples at this node with label 1 (edible)
        p1 = len(y[y == 1]) / m
        # A pure node (p1 == 0 or 1) has zero entropy; guard against log2(0)
        if p1 != 0 and p1 != 1:
            entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
        else:
            entropy = 0.
    ### END CODE HERE ###
    
    return entropy
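
A quick sanity check of the function above: a node split 50/50 between the two classes has the maximum entropy of 1 bit, while a pure node has entropy 0.

# 2 edible, 2 poisonous -> maximally impure
print(compute_entropy(np.array([1, 1, 0, 0])))   # 1.0
# all edible -> pure node
print(compute_entropy(np.array([1, 1, 1, 1])))   # 0.0
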
2.2、Choosing a split: Information Gain

$\text{Information Gain} = H(p_1^{\text{node}}) - \left(w^{\text{left}} H(p_1^{\text{left}}) + w^{\text{right}} H(p_1^{\text{right}})\right)$, where $w^{\text{left}}$ and $w^{\text{right}}$ are the fractions of the node's examples that go to the left and right branches.
Pick the feature with the highest information gain.

def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into
    left and right branches
    
    Args:
        X (ndarray):             Data matrix of shape(n_samples, n_features)
        node_indices (list):  List containing the active indices. I.e, the samples being considered at this step.
        feature (int):           Index of feature to split on
    
    Returns:
        left_indices (list): Indices with feature value == 1
        right_indices (list): Indices with feature value == 0
    """
    
    # You need to return the following variables correctly
    left_indices = []
    right_indices = []
    
    ### START CODE HERE ###
    # Examples whose feature value is 1 go to the left branch, the rest go right
    for i in node_indices:
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    ### END CODE HERE ###
        
    return left_indices, right_indices


def compute_information_gain(X, y, node_indices, feature):
    
    """
    Compute the information of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
   
    Returns:
        cost (float):        Cost computed
    
    """    
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    
    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]
    
    # You need to return the following variables correctly
    information_gain = 0
    
    ### START CODE HERE ###
    # Weights: fraction of the node's examples sent to each branch
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)
    # Entropy at the node and at each branch
    node_entropy = compute_entropy(y_node)
    left_entropy = compute_entropy(y_left)
    right_entropy = compute_entropy(y_right)
    # Information gain = node entropy minus the weighted entropy of the branches
    information_gain = node_entropy - (w_left * left_entropy + w_right * right_entropy)
    ### END CODE HERE ###
    
    return information_gain


def get_best_split(X, y, node_indices):   
    """
    Returns the optimal feature to split the node data on
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split
    """    
    
    # Some useful variables
    num_features = X.shape[1]
    
    # You need to return the following variables correctly
    best_feature = -1
    
    ### START CODE HERE ###
    max_info_gain = 0
    # Try every feature and keep the one with the largest information gain
    for i in range(num_features):
        now_info_gain = compute_information_gain(X, y, node_indices, i)
        if now_info_gain > max_info_gain:
            max_info_gain = now_info_gain
            best_feature = i
    ### END CODE HERE ###
   
    return best_feature
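
A minimal usage sketch on a toy dataset of binary features (the feature values and labels below are made up for illustration, not the course's data):

X_train = np.array([[1, 1, 1],
                    [1, 0, 1],
                    [1, 0, 0],
                    [1, 0, 0],
                    [0, 1, 1],
                    [0, 1, 0],
                    [0, 0, 0],
                    [1, 0, 1]])
y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1])

root_indices = list(range(len(X_train)))
print("Best feature to split the root on:", get_best_split(X_train, y_train, root_indices))
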
2.3、Putting it together

The steps: start with all examples at the root node, compute the information gain of splitting on each feature and pick the one with the highest gain, split the examples into left and right branches, and repeat on each branch until a stopping criterion is met (e.g. a node is pure, the maximum depth is reached, or the information gain falls below a threshold).

# Not graded
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that splits the dataset into 2 subgroups at each node.
    This function just prints the tree.
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree. 
        current_depth (int):    Current depth. Parameter used during recursive call.
   
    """ 

    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
   
    # Otherwise, get best split and split the data
    # Get the best feature and threshold at this node
    best_feature = get_best_split(X, y, node_indices) 
    tree.append((current_depth, branch_name, best_feature, node_indices))
    
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    
    # continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)
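
Continuing the toy example from above, this prints the split chosen at each depth (a sketch reusing X_train, y_train and root_indices from the earlier snippet):

build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)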

2.4、Using one-hot encoding of categorical features

If a categorical feature can take on k possible values, replace it with k binary (0/1) features, exactly one of which is 1 for each example.
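
A minimal sketch of one-hot encoding with plain numpy (the ear-shape values below are made up for illustration):

import numpy as np

ear_shape = np.array(["pointy", "floppy", "oval", "pointy", "oval"])

categories = np.unique(ear_shape)                         # ['floppy' 'oval' 'pointy']
one_hot = (ear_shape[:, None] == categories).astype(int)  # one binary column per category
print(one_hot)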

2.5、Continuous valued features

Sort the examples by the feature value, take the 9 mid-points between the 10 examples as candidate thresholds, and pick the threshold that gives the highest information gain.
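
A sketch of that search for a continuous feature, reusing compute_entropy from above (the weights and labels are made up):

import numpy as np

weights = np.array([7.2, 8.4, 7.6, 10.2, 9.2, 11.0, 18.0, 20.0, 15.0, 8.8])
labels  = np.array([1,   1,   0,   1,    1,   1,    0,    0,    0,    0])

# Candidate thresholds: mid-points between consecutive sorted values
sorted_w = np.sort(weights)
thresholds = (sorted_w[:-1] + sorted_w[1:]) / 2

best_gain, best_threshold = 0.0, None
h_node = compute_entropy(labels)
for t in thresholds:
    left, right = labels[weights <= t], labels[weights > t]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    gain = h_node - (w_left * compute_entropy(left) + w_right * compute_entropy(right))
    if gain > best_gain:
        best_gain, best_threshold = gain, t

print("Best threshold:", best_threshold, "information gain:", best_gain)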

2.6、Regression with Decision Trees

Use some features to predict a continuous target such as the weight. The split is still chosen by $\text{Information Gain}$, but with the variance of the target at each node in place of the entropy: pick the split that gives the largest reduction in the weighted variance.

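A minimal sketch of that variance-reduction criterion (the helper name variance_reduction and the weights are my own, for illustration):

import numpy as np

def variance_reduction(y_node, y_left, y_right):
    """Regression analogue of information gain: reduction in weighted variance."""
    w_left = len(y_left) / len(y_node)
    w_right = len(y_right) / len(y_node)
    return np.var(y_node) - (w_left * np.var(y_left) + w_right * np.var(y_right))

# Target weights at a node and after one candidate split
y_node  = np.array([7.2, 8.4, 7.6, 10.2, 9.2, 11.0, 18.0, 20.0, 15.0, 8.8])
y_left  = np.array([7.2, 8.4, 7.6, 10.2, 9.2])
y_right = np.array([11.0, 18.0, 20.0, 15.0, 8.8])
print(variance_reduction(y_node, y_left, y_right))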

3、Tree ensembles

3.1、Using multiple decision trees

Trees are highly sensitive to small changes in the training data; training many trees and letting them vote makes the overall prediction more robust.

3.2、Sampling with replacement

Sample with replacement from the original training set to create a new training set of the same size; some examples will appear more than once and some not at all.
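
A sketch of building one such bootstrap training set with numpy, reusing the toy X_train and y_train from the earlier snippet (bootstrap_sample is my own helper name):

import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw a training set of the same size by sampling indices with replacement."""
    m = len(X)
    idx = rng.choice(m, size=m, replace=True)   # indices may repeat
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X_boot, y_boot = bootstrap_sample(X_train, y_train, rng)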

3.3、Random forest algorithm

Given a training set of size m, for b = 1 to B (e.g. B = 100): use sampling with replacement to create a new training set of size m and train a decision tree on it. At each node, instead of considering all n features, pick the split feature from a random subset of k < n features (a common choice is k ≈ √n). The B trees then vote on the final prediction.
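
scikit-learn's RandomForestClassifier implements this combination of bootstrap sampling and random feature subsets; a minimal sketch, assuming scikit-learn is installed and reusing the toy X_train and y_train:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each trained on a bootstrap sample with a random feature subset per split
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.predict(X_train[:2]))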

3.4、XGBoost

(Like deliberate practice) When building the next decision tree, focus more attention on the subset of examples that the trees trained so far are not yet doing well on, and train the new decision tree on those harder examples.
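
The xgboost library provides a scikit-learn style classifier; a minimal sketch, assuming the xgboost package is installed and reusing the toy X_train and y_train:

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)
print(model.predict(X_train[:2]))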

3.5、When to use decision trees

Decision trees and tree ensembles work well on tabular (structured) data and are fast to train, with small trees being human-interpretable; they are not recommended for unstructured data such as images, audio, and text, where neural networks tend to do better.
