【统计学习方法】决策树

最新推荐文章于 2021-05-15 19:14:08 发布

Scau_Jack

最新推荐文章于 2021-05-15 19:14:08 发布

阅读量216

点赞数

分类专栏：概率图模型统计学习方法文章标签：决策树 python 算法机器学习

本文链接：https://blog.csdn.net/Scau_Jack/article/details/107999593

版权

概率图模型同时被 2 个专栏收录

3 篇文章 0 订阅

订阅专栏

统计学习方法

3 篇文章 0 订阅

订阅专栏

文章目录

特征选择

信息增益

信息熵

信息熵就是用来表示随机变量不确定性的度量。随机变脸的熵可以定义为：
$-\sum_{i=1}^{n}p_ilogp_i \tag{1}$
我们以贷款申请样本数据表为例子。

ID	年龄	有工作	有自己的房子	信贷情况	类别
1	青年	否	否	一般	否
2	青年	否	否	好	否
3	青年	是	否	好	是
4	青年	是	是	一般	是
5	青年	否	否	一般	否
6	中年	否	否	一般	否
7	中年	否	否	好	否
8	中年	是	是	好	是
9	中年	否	是	非常好	是
10	中年	否	是	非常好	是
11	老年	否	是	非常好	是
12	老年	否	是	好	是
13	老年	是	否	好	是
14	老年	是	否	非常好	是
15	老年	否	否	一般	否

条件信息熵

$\tag{2}$

基尼指数

在分类任务中，假设有K个类别，样本点属于第k类的概率是 $p_k$ ,那么概率分布的基尼指数定义为：
$\begin{aligned} Gini(p) =& \sum_{k=1}^{K}p_k(1-p_k) \\ =& \sum_{k=1}^{K}p_k -\sum_{k=1}^{K}p_k^2 \\ =& 1 - \sum_{k=1}^{K}p_k^2 \tag{1} \end{aligned}$
在给定特征A的条件下，集合D的基尼指数可以定义为:
$\frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$

信息增益比

以信息增益作为划分训练数据集的特征，存在偏向于选择取值较多的特征问题，使用信息增益比可以对这个问题进行校正，这是特征选择的另一准则。
特征A对训练数据集D的信息增益比 $g_R(D,A) = \frac{g(D,A)}{H_A(D)}$ ，其中 $H_A(D) = \sum_{i=1}^{n}\frac{|D_i|}{|D|}log_2\frac{|D_i|}{|D|}$

决策树生成

ID3

'''
Author: your name
Date: 2020-08-08 11:52:19
LastEditTime: 2020-08-22 16:43:41
LastEditors: Please set LastEditors
Description: In User Settings Edit
FilePath: /统计学习方法/决策树/main.py
'''
# %%
import random
import numpy as np
import pprint
from math import log
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from collections import defaultdict

import math
import pandas as pd
from graphviz import Digraph
import networkx as nx
import matplotlib.pyplot as plt
# 书上题目5.1
g = Digraph('决策树')


def H(D: pd.DataFrame, condition=None, class_label='class') -> float:
    class_list = D.groupby(class_label).size().to_list()
    total = sum(class_list)
    entropy = -1 * sum([c_num / total * math.log2(c_num / total)
                        for c_num in class_list])
    if condition is not None:
        condition_entropy = sum(D.groupby(condition).apply(
            lambda x: len(x)/len(D) * H(x)))
        return entropy - condition_entropy
    else:
        return entropy


def create_data():
    datasets = [['青年', '否', '否', '一般', '否'],
                ['青年', '否', '否', '好', '否'],
                ['青年', '是', '否', '好', '是'],
                ['青年', '是', '是', '一般', '是'],
                ['青年', '否', '否', '一般', '否'],
                ['中年', '否', '否', '一般', '否'],
                ['中年', '否', '否', '好', '否'],
                ['中年', '是', '是', '好', '是'],
                ['中年', '否', '是', '非常好', '是'],
                ['中年', '否', '是', '非常好', '是'],
                ['老年', '否', '是', '非常好', '是'],
                ['老年', '否', '是', '好', '是'],
                ['老年', '是', '否', '好', '是'],
                ['老年', '是', '否', '非常好', '是'],
                ['老年', '否', '否', '一般', '否'],
                ]
    labels = [u'age', u'work', u'house', u'credit', 'class']
    # 返回数据集和每个维度的名称
    return datasets, labels


class Node:
    def __init__(self, is_leaf, class_type):
        self.son_nodes = []
        self.is_leaf = is_leaf
        self.class_type = str(class_type)
        if is_leaf:
            self.class_type += str(random.randint(1,1000))
        self.edges = []

    def __str__(self):
        return "{} {}".format(self.is_leaf, self.class_type)


def create_disicion_tree_by_id3(D, A, delta, class_label='class'):
    class_num = len(D.groupby(class_label).size())
    most_frequence_class = D[class_label].value_counts().keys()[0]
    
    if class_num == 1:
        return Node(True, most_frequence_class)
    if len(A) == 0:
        return Node(True, most_frequence_class)

    entropy_list = []
    for feature in A:
        entropy_list.append(H(D, condition=feature, class_label=class_label))

    idx = np.argmax(entropy_list)
    selected_feature = A[idx]
    max_entropy = entropy_list[idx]
    if max_entropy < delta:
        return Node(True, most_frequence_class)

    base = Node(False, selected_feature)
    for group in D.groupby(selected_feature):
        base.son_nodes.append(create_disicion_tree_by_id3(group[1], list(
            set(A) - set([selected_feature])), delta))
        base.edges.append(group[0])
    return base


datasets, labels = create_data()
train_data = pd.DataFrame(datasets, columns=labels)
train_data['class'] = train_data['class'] == '是'
root = create_disicion_tree_by_id3(train_data, [
    u'age', u'work', u'house', u'credit'], 0.0)
G = nx.Graph()
queue = [root]
G.add_node(root.class_type)
while len(queue) > 0:
    head = queue.pop(0)
    for son in head.son_nodes:
        G.add_node(son.class_type)
        G.add_edge(head.class_type, son.class_type)
        queue.append(son)
    print(head.class_type)

pos = nx.spring_layout(G)
nx.draw(G, pos,with_labels=True)
plt.show()

# %%

在这里插入图片描述

C4.5

和ID3的区别在于使用信息增益比作为特征选择标准

CART

分类回归树(classification and regression tree,CART)是给定随机变量X条件下，输出随机变量Y的条件概率分布的学习方法。CART假设决策树是二叉树。也即是在计算现有特征A，对其所有可能取到的每一个值a，根据样本点对A=a的测试为”是“或者”否“，将训练数据集D切分为D1和D2，样本分割位置叫样本切分点。再利用公示X计算A=a的时候的基尼指数。

Scau_Jack

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
【统计学习方法】决策树

文章目录特征选择信息增益信息熵条件信息熵基尼指数信息增益比决策树生成ID3C4.5CART决策树剪枝CART剪枝特征选择信息增益信息熵信息熵就是用来表示随机变量不确定性的度量。我们以贷款申请样本数据表为例子。ID年龄有工作有房子贷款情况类别条件信息熵基尼指数信息增益比决策树生成ID3C4.5CART决策树剪枝CART剪枝...
复制链接

扫一扫

专栏目录