Understanding Decision Trees, with Python Code

GTXY90

Last modified: 2024-09-02 17:14:01

Tags: decision tree, python, algorithm

First published: 2024-08-31 10:20:30

Article link: https://blog.csdn.net/Llcm3030zzstj81/article/details/141745048

Decision Trees

What is information entropy?

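Suppose there are two boxes in front of you, box A and box B.

Box A holds 10 identical red balls. Every draw from box A yields a red ball, so the outcome is completely certain. Box A therefore has no uncertainty, and its information entropy is zero.
[Figure: balls of different colors]

Box B holds 5 red balls and 5 blue balls. On each draw you cannot tell in advance whether you will get a red ball or a blue one, so the outcome is uncertain. Box B is more uncertain than box A, and its information entropy is larger.

Information entropy is a measure of the uncertainty, or disorder, of a system. The more uncertain and disordered the system, the larger its entropy; the more certain and ordered the system, the smaller its entropy.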

Mathematical formulation

Information entropy

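$\operatorname{Info}(D)=-\sum_{k=1}^{n}p_k\log_2 p_k$
Here $n$ is the number of classes in the set $D$ and $p_k$ is the proportion of samples in $D$ that belong to class $k$; a term with $p_k=0$ is taken to contribute 0.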
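As a quick check of the entropy formula, the two boxes from the example above can be plugged in directly. This is only a minimal sketch; the helper name `entropy_from_counts` is made up for illustration and is not part of the code later in this article.

```python
import math

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a distribution given as class counts."""
    total = sum(counts)
    # Skip empty classes: a term with p = 0 contributes nothing.
    probs = [c / total for c in counts if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)

box_a = entropy_from_counts([10])      # box A: 10 red balls
box_b = entropy_from_counts([5, 5])    # box B: 5 red, 5 blue
print(box_a, box_b)
```

Box A comes out at 0 bits and box B at 1 bit, which matches the intuition that a fair two-way choice carries exactly one bit of uncertainty.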

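Information gain

$Gain\left(A\right)=Info\left(D\right)-Info_A\left(D\right)$
In this formula, $Gain(A)$ is the information gain obtained by splitting on feature $A$; $Info(D)$ is the information entropy of the set $D$; and $Info_A(D)$ is the weighted sum of the entropies of the subsets produced by splitting on $A$, each subset $D_j$ being weighted by its share of the samples: $Info_A(D)=\sum_{j=1}^{m}\frac{|D_j|}{|D|}Info(D_j)$, where $m$ is the number of distinct values of $A$.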
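To make the gain formula concrete, here is a small sketch that works from class counts alone. The function names are hypothetical. The counts are chosen to match the play-tennis table used later in this article when it is split by the weather attribute (9 "Yes" and 5 "No" overall).

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count list such as [9, 5]."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def gain_from_counts(parent_counts, subset_counts):
    """Info(D) minus the size-weighted entropy of the subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subset_counts)
    return entropy(parent_counts) - weighted

# [Yes, No] counts for the whole set and for the three subsets of the split.
parent = [9, 5]
subsets = [[2, 3], [4, 0], [3, 2]]
print(round(entropy(parent), 3))                    # Info(D)
print(round(gain_from_counts(parent, subsets), 3))  # Gain(A)
```

With these counts, $Info(D)$ is roughly 0.940 and the gain roughly 0.247. The pure subset `[4, 0]` contributes no entropy at all, which is where most of the gain comes from.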

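Information gain ratio

$GainRatio\left(A\right)=\frac{Gain\left(A\right)}{SplitInfo\left(A\right)}$
$SplitInfo\left(A\right)=-\sum_{j=1}^{m}\frac{|D_{j}|}{|D|}\times\log_{2}\frac{|D_{j}|}{|D|}$
Here $SplitInfo(A)$ is the split information of attribute $A$, $m$ is the number of distinct values of $A$, and $D_j$ is the subset of $D$ in which $A$ takes its $j$-th value. Dividing by $SplitInfo(A)$ penalizes attributes with many distinct values, which plain information gain tends to favor.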
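The effect of the $SplitInfo$ denominator is easiest to see on a toy case. The sketch below uses made-up data and hypothetical function names. It compares an ID-like attribute (every value unique) with a coarser attribute that separates the classes just as well.

```python
import math
from collections import Counter

def entropy(items):
    """Entropy in bits of the empirical distribution of `items`."""
    total = len(items)
    return sum((c / total) * math.log2(total / c) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Gain(A) / SplitInfo(A) for one attribute column and its class labels."""
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - weighted
    split_info = entropy(values)  # SplitInfo is the entropy of the attribute itself
    # An attribute with a single value has SplitInfo = 0; treat its ratio as 0.
    return gain / split_info if split_info > 0 else 0.0

# Toy data (made up for illustration).
labels = ['Yes', 'Yes', 'No', 'No']
unique_id = ['a', 'b', 'c', 'd']   # every sample has its own value
coarse = ['x', 'x', 'y', 'y']      # two values, aligned with the classes
print(gain_ratio(unique_id, labels), gain_ratio(coarse, labels))
```

Both attributes have an information gain of 1 bit, but the ID-like attribute has $SplitInfo = 2$ and so a gain ratio of 0.5, while the coarse attribute keeps a ratio of 1.0.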

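Gini index

With $p_k$ again denoting the proportion of class $k$ in $D$, and $|\mathcal{Y}|$ the number of classes, the purity of the data set $D$ can be measured by its Gini value:
$\begin{aligned}\mathrm{Gini}(D)=&\sum_{k=1}^{|\mathcal{Y}|}\sum_{k^{\prime}\neq k}p_{k}p_{k^{\prime}}\\=&1-\sum_{k=1}^{|\mathcal{Y}|}p_{k}^{2}\:.\end{aligned}$
The smaller $\mathrm{Gini}(D)$ is, the purer $D$ is. For an attribute $a$ that takes $V$ distinct values and splits $D$ into subsets $D^1,\dots,D^V$, the Gini index is defined as:
$\mathrm{Gini\_index}(D,a)=\sum_{v=1}^V\frac{|D^v|}{|D|}\mathrm{Gini}(D^v)\:.$
From the candidate attribute set $A$ (here $A$ denotes the whole set of attributes, not a single attribute as above), we then choose the attribute whose split gives the smallest Gini index as the optimal splitting attribute, i.e. $a_*=\underset{a\in A}{\arg\min}\ \mathrm{Gini\_index}(D,a)$.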
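The selection rule above can be sketched in a few lines. The rows and function names here are made up for illustration, with the class label assumed to be in the last column.

```python
from collections import Counter

def gini(labels):
    """Gini value: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_index(rows, attr):
    """Size-weighted Gini value of the subsets produced by splitting on column `attr`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[-1])  # last column is the class
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

# Made-up rows: [colour, size, label]
rows = [
    ['red', 'big', 'Yes'],
    ['red', 'small', 'Yes'],
    ['blue', 'big', 'No'],
    ['blue', 'small', 'No'],
]
# Pick the attribute (column 0 or 1) with the smallest Gini index.
best = min(range(2), key=lambda a: gini_index(rows, a))
print(gini_index(rows, 0), gini_index(rows, 1), best)
```

Splitting on colour yields two pure subsets (Gini index 0), while splitting on size leaves both subsets evenly mixed (Gini index 0.5), so column 0 is selected.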

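ID3 decision tree: Python implementation

We build a data set about whether to go and play tennis. It has four feature attributes and one decision attribute:

The first feature attribute is the weather, with three possible values: Sunny, Overcast, Rain.
The second feature attribute is the temperature, with three possible values: Hot, Mild, Cool.
The third feature attribute is the humidity, with two possible values: High, Normal.
The fourth feature attribute is the wind, with two possible values: Weak, Strong.
The decision attribute is whether to play, given as "Yes" or "No".

The implementation is as follows: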

import math

def entropy(data):
    """Information entropy of the class labels (last column) of the rows in data."""
    counts = {}
    total = len(data)
    # Count how many rows carry each class label.
    for row in data:
        label = row[-1]
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    ent = 0
    for label in counts:
        prob = counts[label] / total
        ent += -prob * math.log2(prob)
    return ent

def information_gain(data, attribute_index):
    """Information gain obtained by splitting data on the given column."""
    total_entropy = entropy(data)
    values = {row[attribute_index] for row in data}
    weighted_entropy = 0
    for value in values:
        subset = [row for row in data if row[attribute_index] == value]
        prob = len(subset) / len(data)
        weighted_entropy += prob * entropy(subset)
    return total_entropy - weighted_entropy

def id3(data, attributes):
    """Build an ID3 tree as nested dicts: {column_index: {value: subtree_or_label}}."""
    labels = [row[-1] for row in data]
    # All rows share one label: return it as a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to split on: return the majority label.
    if len(attributes) == 0:
        return max(set(labels), key=labels.count)
    # Split on the attribute with the largest information gain.
    best_attribute = max(attributes, key=lambda attr: information_gain(data, attr))
    tree = {best_attribute: {}}
    remaining_attributes = [attr for attr in attributes if attr != best_attribute]
    values = {row[best_attribute] for row in data}
    for value in values:
        subset = [row for row in data if row[best_attribute] == value]
        subtree = id3(subset, remaining_attributes)
        tree[best_attribute][value] = subtree
    return tree

# Columns: weather, temperature, humidity, wind, decision
data = [
    ['Sunny', 'Hot', 'High', 'Weak', 'No'],
    ['Sunny', 'Hot', 'High', 'Strong', 'No'],
    ['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Strong', 'No'],
    ['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
    ['Sunny', 'Mild', 'High', 'Weak', 'No'],
    ['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
    ['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
    ['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
    ['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Strong', 'No']
]

# Attributes are identified by column index; the last column is the label.
attributes = list(range(len(data[0]) - 1))
tree = id3(data, attributes)
print(tree)

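The output is as follows. The integer keys are column indices (0 = weather, 2 = humidity, 3 = wind); because the branch values are collected in a set, the order of the keys may differ from run to run: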

{0: {'Overcast': 'Yes', 'Rain': {3: {'Weak': 'Yes', 'Strong': 'No'}}, 'Sunny': {2: {'High': 'No', 'Normal': 'Yes'}}}}
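A tree in this nested-dict form can be used to classify a new sample by following the branches from the root. The sketch below is one possible way to do that; the `classify` function and its `default` parameter are assumptions added for illustration and are not part of the code above.

```python
# Tree copied from the output above; keys are column indices.
tree = {0: {'Overcast': 'Yes',
            'Rain': {3: {'Weak': 'Yes', 'Strong': 'No'}},
            'Sunny': {2: {'High': 'No', 'Normal': 'Yes'}}}}

def classify(node, sample, default=None):
    """Walk the nested dict until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        attr = next(iter(node))      # each internal node has exactly one key
        branches = node[attr]
        value = sample[attr]
        if value not in branches:    # value never seen at this node during training
            return default
        node = branches[value]
    return node

print(classify(tree, ['Sunny', 'Cool', 'Normal', 'Weak']))
print(classify(tree, ['Rain', 'Mild', 'High', 'Strong']))
```

The first sample follows Sunny and then Normal humidity and is classified "Yes"; the second follows Rain and then Strong wind and is classified "No". Returning `default` for an unseen value is a design choice made here; falling back to the majority label at that node would be another reasonable option.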

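Drawn with the graphviz library, the tree looks like this (not very pretty):
[Figure: the decision tree]
An introduction to the visualization software Graphviz, together with a download and installation tutorial, is available here.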
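The article does not show its plotting code. One possible approach, sketched below, is to convert the nested dict into Graphviz DOT text by hand, which needs no extra Python package. The function name `tree_to_dot` and the display names in `names` are assumptions made for this illustration.

```python
def tree_to_dot(tree, names):
    """Return Graphviz DOT source for a nested-dict decision tree."""
    lines = ['digraph tree {']
    counter = [0]  # mutable counter used to give every node a unique id

    def add(node):
        node_id = f'n{counter[0]}'
        counter[0] += 1
        if isinstance(node, dict):
            # Internal node: label it with the attribute name, then add its branches.
            attr = next(iter(node))
            lines.append(f'  {node_id} [label="{names[attr]}", shape=box];')
            for value, child in node[attr].items():
                child_id = add(child)
                lines.append(f'  {node_id} -> {child_id} [label="{value}"];')
        else:
            # Leaf: label it with the predicted class.
            lines.append(f'  {node_id} [label="{node}"];')
        return node_id

    add(tree)
    lines.append('}')
    return '\n'.join(lines)

tree = {0: {'Overcast': 'Yes',
            'Rain': {3: {'Weak': 'Yes', 'Strong': 'No'}},
            'Sunny': {2: {'High': 'No', 'Normal': 'Yes'}}}}
names = ['Outlook', 'Temperature', 'Humidity', 'Wind']  # assumed display names
dot = tree_to_dot(tree, names)
print(dot)
```

If the printed text is saved to a file such as `tree.dot`, the Graphviz command-line tool can render it, for example with `dot -Tpng tree.dot -o tree.png`.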
