Machine Learning Study Notes (15): The ID3 (Iterative Dichotomizer 3) Algorithm

swordmanwk

Category: Machine Learning

Article link: https://blog.csdn.net/swordmanwk/article/details/107444379

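ID3 (Iterative Dichotomizer 3) is one of the simplest decision tree algorithms. It uses information gain as the criterion for choosing the feature to split on. The algorithm is recursive: calling ID3(S, X), where S is the training set and X is the full feature set, returns a decision tree. The pseudocode of the recursion is as follows: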

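ID3(S, A)

Input: training data S, feature subset $A \subseteq X$

If all samples in S have the same class label:

Use that label as the class of the node and return a leaf node;

If $A = \emptyset$ (the empty set):

Use the most frequent class label in S as the class of the node and return a leaf node;

Otherwise:

Create a node T,

Let $x_{j} = \arg\max_{x_{i} \in A} Gain(S, x_{i})$

Suppose $x_{j}$ takes k distinct values ( $r_{1}, r_{2}, \ldots, r_{k}$ ); then:

$T_{1}$ is the subtree returned by $ID3(\{ (x,y) \in S : x_{j} = r_{1} \}, A \setminus \{x_{j}\})$

......

$T_{k}$ is the subtree returned by $ID3(\{ (x,y) \in S : x_{j} = r_{k} \}, A \setminus \{x_{j}\})$

Make $T_{1}, T_{2}, \ldots, T_{k}$ the children of T, and return T.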
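The pseudocode uses $Gain$ without defining it. As a reminder, the standard definition of information gain can be written as follows. The symbols $H$ (entropy), $p_{c}$ (the fraction of samples in S with class label c) and $S_{r}$ (the subset of S where $x_{i} = r$) are notation introduced here and do not appear in the pseudocode:

$$H(S) = -\sum_{c} p_{c} \log_{2} p_{c}, \qquad Gain(S, x_{i}) = H(S) - \sum_{r} \frac{|S_{r}|}{|S|} H(S_{r})$$

The first term is the entropy of the labels in S, and the second is the conditional entropy of the labels given $x_{i}$. These correspond to the _entropy and _conditional_entropy methods in the implementation that follows.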

Next, we implement the ID3 algorithm in Python (id3tree.py), adapted from 《Python机器学习算法：原理，实现与案例》:

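import numpy as np

class ID3DecisionTree:
    class Node:
        def __init__(self):
            # Majority class label of the samples that reach this node
            self.value = None
            # Attributes used only by internal nodes
            self.feature_index = None
            self.children = {}

        def __str__(self):
            # The labels are kept in Chinese so that the printed tree matches
            # the session output: '内部节点' = internal node, '叶节点' = leaf node
            if self.children:
                s = '内部节点<%s>:\n' % self.feature_index
                for fv, node in self.children.items():
                    ss = '[%s]-> %s' %(fv, node)
                    s += '\t' + ss.replace('\n', '\n\t')  + '\n'
            else:
                s = '叶节点(%s)' % self.value
            return s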
            
    def __init__(self, gain_threshold=0.01):
        # Minimum information gain required to split a node
        self.gain_threshold = gain_threshold

    def _entropy(self, y):
        # Entropy: -sum(p_i * log2(p_i)); y must hold non-negative integers
        c = np.bincount(y)
        p = c[np.nonzero(c)]/y.size
        return -np.sum(p*np.log2(p))

    def _conditional_entropy(self, feature, y):
        # Conditional entropy of y given the feature column
        feature_values = np.unique(feature)
        h = 0
        for v in feature_values:
            y_sub = y[feature == v]
            p = y_sub.size /y.size
            h += p * self._entropy(y_sub)
        return h

    def _information_gain(self, feature, y):
        # Information gain = empirical entropy - empirical conditional entropy
        return self._entropy(y) - self._conditional_entropy(feature, y)

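    def _select_feature(self, X, y, feature_list):
        # Select the feature with the largest information gain.
        # Normally returns the position of that feature within feature_list
        # (not the column index in X).
        if feature_list:
            gains = np.apply_along_axis(self._information_gain, 0,
                X[:, feature_list], y)
            index = np.argmax(gains)
            if gains[index] > self.gain_threshold:
                return index
        # Return None when feature_list is empty or when no feature has an
        # information gain above the threshold
        return None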
        
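    def _build_tree(self, X, y, feature_list):
        # Recursive construction of the decision tree
        # Create the node
        node = ID3DecisionTree.Node()
        # Count how often each class label occurs in the data set
        labels_count = np.bincount(y)
        # The node value is always the most frequent class label
        node.value = np.argmax(labels_count)

        # Check whether all class labels are identical
        if np.count_nonzero(labels_count) != 1:
            # Select the feature with the largest information gain
            index = self._select_feature(X, y, feature_list)
            # Create an internal node if a suitable feature was found,
            # otherwise leave the node as a leaf
            if index is not None:
                # Remove the selected feature from the feature list
                node.feature_index = feature_list.pop(index)

                # Split the data set by the values of the selected feature
                # and build a subtree from each subset
                feature_values = np.unique(X[:, node.feature_index])
                for v in feature_values:
                    # Select the data subset
                    idx = (X[:, node.feature_index] == v)
                    X_sub, y_sub = X[idx], y[idx]
                    # Build the subtree
                    node.children[v] = self._build_tree(X_sub, y_sub, feature_list.copy())
        return node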
        
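    def _predict_one(self, x):
        # Walk down the tree to predict a single instance
        node = self.tree_
        while node.children:
            child = node.children.get(x[node.feature_index])
            # Feature value not seen during training: stop here and fall
            # back to the majority label of the current node
            if child is None:
                break
            node = child
        return node.value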
        
    def train(self, X_train, y_train):
        # Training: start with all features available
        _, n = X_train.shape
        self.tree_ = self._build_tree(X_train, y_train, list(range(n)))

    def predict(self, X):
        # Apply _predict_one to every instance and return the collected results
        return np.apply_along_axis(self._predict_one, axis=1, arr=X)

    def __str__(self):
        # String representation of the decision tree
        if hasattr(self, 'tree_'):
            return str(self.tree_)
        return ''

The code uses several numpy functions that I had not used before, so they are explained first.

The np.bincount function:

>>> y = [1,2,3,2]
>>> np.bincount(y)
array([0, 1, 2, 1], dtype=int32)
This means that in the array y, the value 0 occurs 0 times, 1 occurs once, 2 occurs twice, and 3 occurs once. (https://blog.csdn.net/m0_37885275/article/details/92992078)

The np.nonzero function:

>>> y = [1,2,3,3,1]
>>> c = np.bincount(y)
>>> c
array([0, 2, 1, 2], dtype=int32)
>>> d = np.nonzero(c)
>>> d
(array([1, 2, 3], dtype=int32),)

This means that the elements of c at indices 1, 2 and 3 are non-zero.
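The two functions are combined in _entropy to turn label counts into class probabilities. A minimal sketch of that step, reusing the array from the session above (the variable names counts and probs are chosen here for illustration):

```python
import numpy as np

y = np.array([1, 2, 3, 3, 1])

# Occurrences of each label value 0..3
counts = np.bincount(y)

# Keep only labels that actually occur, then normalise to probabilities.
# Dropping the zeros avoids evaluating log2(0) in the entropy formula.
probs = counts[np.nonzero(counts)] / y.size

print(counts)  # [0 2 1 2]
print(probs)   # [0.4 0.2 0.4]
```

Note that np.bincount only accepts non-negative integers, so class labels have to be encoded that way before they reach this step.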

The np.count_nonzero function:

>>> d = [0,2,3,4,0]
>>> np.count_nonzero(d)
3

This means that d contains 3 non-zero elements.

The np.argmax function:

>>> d = [2,4,0,5,1]
>>> np.argmax(d)
3

This means that the largest element of d (5) is at index 3.
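In _build_tree, np.argmax is applied to the output of np.bincount to obtain the most frequent class label. A small sketch with made-up labels (the array below is illustrative, not taken from the data set):

```python
import numpy as np

y = np.array([3, 2, 3, 1, 3])

# bincount gives [0, 1, 1, 3]; the index of the largest count is the
# majority label. On ties, argmax returns the first (smallest) label.
majority = np.argmax(np.bincount(y))

print(majority)  # 3
```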

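The np.apply_along_axis function:

It applies a function to each 1-D slice of an array along the given axis; any extra arguments are passed through to the function. See https://www.cnblogs.com/zz22--/p/7498868.html for details.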
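The implementation calls np.apply_along_axis in two ways: along axis 0 with an extra argument (one call per feature column, in _select_feature) and along axis 1 (one call per sample, in predict). The sketch below shows both patterns on a toy array; count_matches is a hypothetical helper written only for this illustration.

```python
import numpy as np

def count_matches(column, y):
    # Hypothetical helper: number of positions where the column equals y
    return np.sum(column == y)

X = np.array([[1, 2],
              [2, 2],
              [1, 1]])
y = np.array([1, 2, 1])

# axis=0: the function receives each column of X; y is forwarded unchanged
per_column = np.apply_along_axis(count_matches, 0, X, y)

# axis=1: the function receives each row of X
per_row = np.apply_along_axis(np.sum, 1, X)

print(per_column)  # [3 2]
print(per_row)     # [3 4 2]
```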

Now for an explanation of the code:

train: calls _build_tree to construct the decision tree, starting with all features available.

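_build_tree: a recursive function. It first creates a node and stores the most frequent class label in node.value. If the labels in y for this node are not all identical, it finds the index of the feature with the largest information gain, splits the sample set by that feature's values, and recurses on each subset. The keys of the node.children dictionary are the distinct values of the selected feature.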
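The splitting step can be looked at in isolation. The following is a minimal sketch, assuming a toy data set and a fixed feature_index (both invented for this example), of how boolean masks partition X and y by feature value:

```python
import numpy as np

X = np.array([[1, 1],
              [1, 2],
              [2, 1],
              [2, 2]])
y = np.array([0, 0, 1, 0])
feature_index = 0  # assumed to be the feature chosen for the split

subsets = {}
for v in np.unique(X[:, feature_index]):
    # Boolean mask selecting the rows where the feature equals v
    mask = X[:, feature_index] == v
    subsets[int(v)] = (X[mask], y[mask])

for v, (X_sub, y_sub) in subsets.items():
    print(v, y_sub)
```

Each subset would then be passed to a recursive call, with the chosen feature removed from the list of candidates.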

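_select_feature: selects the feature with the largest information gain, or returns None if no gain exceeds the threshold.

_information_gain: information gain = empirical entropy - empirical conditional entropy. For the formulas, see "Machine Learning Study Notes (14): Decision Trees" (《机器学习学习笔记（14）----决策树》).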
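To make the information gain concrete, here is a standalone sketch that mirrors the computation as plain functions rather than methods. The names entropy and information_gain and the toy arrays are assumptions made for this example.

```python
import numpy as np

def entropy(y):
    # Empirical entropy of a vector of non-negative integer labels
    counts = np.bincount(y)
    p = counts[counts > 0] / y.size
    return -np.sum(p * np.log2(p))

def information_gain(feature, y):
    # Entropy of y minus the conditional entropy of y given the feature
    cond = 0.0
    for v in np.unique(feature):
        y_sub = y[feature == v]
        cond += y_sub.size / y.size * entropy(y_sub)
    return entropy(y) - cond

y = np.array([0, 0, 1, 1])
perfect = np.array([1, 1, 2, 2])  # separates the two classes completely
useless = np.array([1, 2, 1, 2])  # each value covers both classes equally

print(information_gain(perfect, y))  # 1.0
print(information_gain(useless, y))  # 0.0
```

A feature that separates the classes perfectly recovers the full entropy of the labels (1 bit here), while a feature that carries no information about the label has zero gain and would be rejected by the gain threshold.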

Next, let us test it on the contact lenses data set (http://archive.ics.uci.edu/ml/machine-learning-databases/lenses/):

The first column of the data is an ID and is not used. The feature columns that follow are:

age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
spectacle prescription: (1) myope, (2) hypermetrope
astigmatic: (1) no, (2) yes
tear production rate: (1) reduced, (2) normal

The label has 3 classes:

1 : the patient should be fitted with hard contact lenses,
2 : the patient should be fitted with soft contact lenses,
3 : the patient should not be fitted with contact lenses.

>>> import numpy as np
>>> dataset = np.genfromtxt('lenses.data',dtype=int)
>>> X = dataset[:, 1:-1]
>>> y = dataset[:,-1]
>>> from id3tree import ID3DecisionTree
>>> id3 = ID3DecisionTree()
>>> id3.train(X,y)
>>> print(id3)
内部节点<3>:
	[1]-> 叶节点(3)
	[2]-> 内部节点<2>:
		[1]-> 内部节点<0>:
			[1]-> 叶节点(2)
			[2]-> 叶节点(2)
			[3]-> 内部节点<1>:
				[1]-> 叶节点(3)
				[2]-> 叶节点(2)
			
		
		[2]-> 内部节点<1>:
			[1]-> 叶节点(1)
			[2]-> 内部节点<0>:
				[1]-> 叶节点(1)
				[2]-> 叶节点(3)
				[3]-> 叶节点(3)
			
>>> y_predict = id3.predict(X)
>>> y_predict
array([3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 3, 3, 3, 3, 1, 3, 2,
       3, 3], dtype=int32)
>>> y
array([3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 3, 3, 3, 3, 1, 3, 2,
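       3, 3])

In the printed tree, 内部节点<i> denotes an internal node that splits on the feature with column index i, and 叶节点(c) denotes a leaf node that predicts class c. The predictions agree with the labels on all 24 training samples.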
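Rather than comparing the two arrays by eye, the agreement can be measured directly. The sketch below copies both arrays from the session output above so that it runs without lenses.data; the variable name accuracy is introduced here.

```python
import numpy as np

y = np.array([3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 3, 3, 3, 3, 1, 3, 2,
              3, 3])
y_predict = np.array([3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 3, 3, 3, 3, 1, 3, 2,
                      3, 3])

# Fraction of samples whose predicted label equals the true label
accuracy = np.mean(y_predict == y)
print(accuracy)  # 1.0
```

This is accuracy on the training data itself, so it only shows that the tree fits the samples it was built from; it says nothing about performance on unseen data.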

References:

《深入理解机器学习---从原理到算法》

《Python机器学习算法：原理，实现与案例》
