Implementing the ID3, C4.5, and CART Decision Tree Algorithms with sklearn

Latest recommended article published 2022-10-22 00:15:00

伊木子曦

Category: Artificial Intelligence. Tags: algorithms, decision trees, machine learning

Article link: https://blog.csdn.net/Mouer__/article/details/121057922

I. ID3 Algorithm

1. Pseudocode

ID3 (Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    If all examples are positive, Return the single-node tree Root, with label = +.
    If all examples are negative, Return the single-node tree Root, with label = -.
    If the set of predicting attributes is empty, then Return the single-node tree Root,
    with label = most common value of the target attribute in the examples.
    Otherwise Begin
        A ← The Attribute that best classifies examples.
        Decision Tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
            Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A})
    End
    Return Root

We will not implement this pseudocode directly in what follows; instead we use the sklearn library.
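For reference, the pseudocode above can be sketched in plain Python. This is a minimal illustration, not the code used in the rest of the article: the function names (`entropy`, `information_gain`, `id3`, `predict`), the nested-dict tree representation, and the four-row toy dataset are all assumptions made for the example. "Best classifies" is taken to mean "highest information gain", which is the usual ID3 choice.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the rows on attribute `attr`."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes, domains):
    """Return a leaf label, or a nested dict {attribute: {value: subtree}}."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no attributes are left to split on.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in domains[best]:
        pairs = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        if not pairs:
            # Empty branch: fall back to the majority label of the parent node.
            tree[best][value] = majority
        else:
            sub_rows, sub_labels = zip(*pairs)
            tree[best][value] = id3(list(sub_rows), list(sub_labels), remaining, domains)
    return tree

def predict(tree, row):
    """Walk the nested dict until a leaf label is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Hypothetical toy data: the label depends only on "texture".
rows = [
    {"texture": "clear", "touch": "hard"},
    {"texture": "clear", "touch": "soft"},
    {"texture": "blurry", "touch": "hard"},
    {"texture": "blurry", "touch": "soft"},
]
labels = ["yes", "yes", "no", "no"]
domains = {"texture": ["clear", "blurry"], "touch": ["hard", "soft"]}
tree = id3(rows, labels, ["texture", "touch"], domains)
print(tree)
```

Unlike sklearn's trees, this sketch creates one branch per attribute value, as the pseudocode does.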

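2. Drawbacks

ID3 is very sensitive to attributes with many distinct values. For example, if an attribute takes a different value for nearly every sample, or in the extreme case a unique value for each sample, splitting on it yields a very large information gain, yet such a split is not what we want because it does not generalize.
ID3 cannot handle attributes with continuous values.
ID3 cannot handle samples with missing attribute values.
Because the algorithm above grows very deep trees, it is prone to overfitting.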
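The first drawback can be checked numerically. The sketch below takes the 纹理 column and the 好瓜 labels from the watermelon table used later in this article and adds a hypothetical per-sample ID attribute (`sample_id`, not part of the dataset). The helper names are illustrative only.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of a list of values."""
    total = len(items)
    return -sum(n / total * math.log2(n / total) for n in Counter(items).values())

def information_gain(values, labels):
    """Entropy reduction from splitting `labels` by the attribute `values`."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# 好瓜 labels and 纹理 values of the 17 samples, in table order.
labels = ["是"] * 8 + ["否"] * 9
texture = ["清晰"] * 6 + ["稍糊", "清晰", "稍糊", "清晰", "模糊", "模糊",
                          "稍糊", "稍糊", "清晰", "模糊", "稍糊"]
# Hypothetical attribute that is unique for every sample.
sample_id = list(range(17))

gain_texture = information_gain(texture, labels)
gain_id = information_gain(sample_id, labels)
print(f"texture: {gain_texture:.3f}  sample_id: {gain_id:.3f}")
```

The ID attribute splits the data into 17 pure single-sample subsets, so its information gain equals the full entropy of the labels and beats every real attribute, even though it is useless for prediction.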

3. Implementation

1. Import modules

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

2. Read the data

data = pd.read_csv('./西瓜数据集.csv')
data

	色泽	根蒂	敲击	纹理	脐部	触感	好瓜
0	青绿	蜷缩	浊响	清晰	凹陷	硬滑	是
1	乌黑	蜷缩	沉闷	清晰	凹陷	硬滑	是
2	乌黑	蜷缩	浊响	清晰	凹陷	硬滑	是
3	青绿	蜷缩	沉闷	清晰	凹陷	硬滑	是
4	浅白	蜷缩	浊响	清晰	凹陷	硬滑	是
5	青绿	稍蜷	浊响	清晰	稍凹	软粘	是
6	乌黑	稍蜷	浊响	稍糊	稍凹	软粘	是
7	乌黑	稍蜷	浊响	清晰	稍凹	硬滑	是
8	乌黑	稍蜷	沉闷	稍糊	稍凹	硬滑	否
9	青绿	硬挺	清脆	清晰	平坦	软粘	否
10	浅白	硬挺	清脆	模糊	平坦	硬滑	否
11	浅白	蜷缩	浊响	模糊	平坦	软粘	否
12	青绿	稍蜷	浊响	稍糊	凹陷	硬滑	否
13	浅白	稍蜷	沉闷	稍糊	凹陷	硬滑	否
14	乌黑	稍蜷	浊响	清晰	稍凹	软粘	否
15	浅白	蜷缩	浊响	模糊	平坦	硬滑	否
16	青绿	蜷缩	沉闷	稍糊	稍凹	硬滑	否

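3. Encode the data

# Create a LabelEncoder to map category strings to integer codes
label = LabelEncoder()

# Encode every feature column (all but the last, which is the label)
for col in data[data.columns[:-1]]:
    data[col] = label.fit_transform(data[col])
data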

	色泽	根蒂	敲击	纹理	脐部	触感	好瓜
0	2	2	1	1	0	0	是
1	0	2	0	1	0	0	是
2	0	2	1	1	0	0	是
3	2	2	0	1	0	0	是
4	1	2	1	1	0	0	是
5	2	1	1	1	2	1	是
6	0	1	1	2	2	1	是
7	0	1	1	1	2	0	是
8	0	1	0	2	2	0	否
9	2	0	2	1	1	1	否
10	1	0	2	0	1	0	否
11	1	2	1	0	1	1	否
12	2	1	1	2	0	0	否
13	1	1	0	2	0	0	否
14	0	1	1	1	2	1	否
15	1	2	1	0	1	0	否
16	2	2	0	2	2	0	否
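One side effect of reusing a single LabelEncoder is that it is refitted on every column, so only the last column's mapping survives. A possible variation, sketched here on a three-row excerpt of the table, keeps one encoder per column so that new samples can be encoded and codes decoded later. The `encoders` dict and the excerpt `df` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Three-row excerpt of two feature columns from the watermelon table.
df = pd.DataFrame({
    "色泽": ["青绿", "乌黑", "浅白"],
    "根蒂": ["蜷缩", "稍蜷", "硬挺"],
})

# Keep one fitted encoder per column instead of overwriting a shared one.
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Encode a raw value for prediction, then map the code back to the string.
code = encoders["色泽"].transform(["浅白"])[0]
original = encoders["色泽"].inverse_transform([code])[0]
print(code, original)
```

LabelEncoder assigns codes in sorted order of the class strings, which is why 乌黑, 浅白, and 青绿 map to 0, 1, and 2 in the encoded table above.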

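4. Fit with sklearn

# Fit with the entropy (information gain) criterion, as in ID3.
# Note: sklearn's DecisionTreeClassifier builds a binary CART-style tree,
# so this approximates ID3 rather than reproducing it exactly.
dtc = DecisionTreeClassifier(criterion='entropy')
# Train on the encoded feature columns against the label column
dtc.fit(data.iloc[:,:-1].values.tolist(),data.iloc[:,-1].values) 
# Predict the label of one encoded sample ('是' = good melon, '否' = not)
result = dtc.predict([[1,1,1,1,0,0]])
# Prediction result
result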

array(['是'], dtype=object)
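To see which splits the fitted tree actually uses, sklearn's `export_text` can print it as indented rules. The sketch below is self-contained: it retypes the encoded table from step 3 as a list, and it sets `random_state=0` (an addition, not in the original code) so that ties between equally good splits are broken reproducibly.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["色泽", "根蒂", "敲击", "纹理", "脐部", "触感"]
# Encoded feature rows, copied from the table in step 3.
X = [
    [2, 2, 1, 1, 0, 0], [0, 2, 0, 1, 0, 0], [0, 2, 1, 1, 0, 0],
    [2, 2, 0, 1, 0, 0], [1, 2, 1, 1, 0, 0], [2, 1, 1, 1, 2, 1],
    [0, 1, 1, 2, 2, 1], [0, 1, 1, 1, 2, 0], [0, 1, 0, 2, 2, 0],
    [2, 0, 2, 1, 1, 1], [1, 0, 2, 0, 1, 0], [1, 2, 1, 0, 1, 1],
    [2, 1, 1, 2, 0, 0], [1, 1, 0, 2, 0, 0], [0, 1, 1, 1, 2, 1],
    [1, 2, 1, 0, 1, 0], [2, 2, 0, 2, 2, 0],
]
y = ["是"] * 8 + ["否"] * 9

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
# Render the tree as text; every internal node is a binary threshold test.
report = export_text(clf, feature_names=feature_names)
print(report)
```

Each rule has the form `feature <= threshold`, which shows that the integer codes are treated as ordered numbers rather than as unordered categories.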

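II. C4.5 Algorithm

C4.5 follows the same overall approach as ID3: both classify by building a decision tree. The difference lies in how the splitting attribute is chosen. ID3 uses information gain as the measure, whereas C4.5 introduces the information gain ratio:
$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{IV}(A)}, \qquad \mathrm{IV}(A) = -\sum_{i=1}^{v} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$
Here D is the sample set, A is an attribute with v distinct values, and D_i is the subset of D that takes the i-th value. The formula shows that when v is large, IV(A) tends to be large, so the gain ratio drops noticeably. This mitigates, to some extent, ID3's tendency to choose splitting attributes with many values.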
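A small numeric check of this effect, assuming the standard definition of the gain ratio (information gain divided by the entropy of the attribute's own value distribution). It compares the 纹理 column of the watermelon table against a hypothetical attribute that is unique per sample. The helper names are illustrative.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of a list of values."""
    total = len(items)
    return -sum(n / total * math.log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of the split divided by the split information IV."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # IV: entropy of the attribute's value distribution
    return gain / split_info if split_info > 0 else 0.0

labels = ["是"] * 8 + ["否"] * 9
texture = ["清晰"] * 6 + ["稍糊", "清晰", "稍糊", "清晰", "模糊", "模糊",
                          "稍糊", "稍糊", "清晰", "模糊", "稍糊"]
sample_id = list(range(17))  # hypothetical unique-per-sample attribute

ratio_texture = gain_ratio(texture, labels)
ratio_id = gain_ratio(sample_id, labels)
print(f"texture: {ratio_texture:.3f}  sample_id: {ratio_id:.3f}")
```

With 17 distinct values the ID attribute has a split information of log2(17), which pulls its ratio below that of 纹理 even though its raw information gain is the largest possible.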

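III. CART Algorithm

CART builds a binary decision tree. Once built, the tree still needs pruning before it generalizes well to unseen data. When building a classification tree, CART selects features using the Gini index.

1. Gini index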

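$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2, \qquad \mathrm{Gini}(D, A) = \sum_{i} \frac{|D_i|}{|D|}\,\mathrm{Gini}(D_i)$$
Here p_k is the proportion of class k in D, and the D_i are the subsets produced by splitting D on A. The smaller the Gini index of a split, the purer the resulting subsets.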
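As a worked example under the standard definition of Gini impurity, the sketch below evaluates one candidate binary split on the watermelon data: 纹理 equal to 清晰 versus everything else. The function names and the choice of split are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_of_split(mask, labels):
    """Size-weighted Gini impurity of the two sides of a binary split."""
    left = [lab for m, lab in zip(mask, labels) if m]
    right = [lab for m, lab in zip(mask, labels) if not m]
    total = len(labels)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

labels = ["是"] * 8 + ["否"] * 9
texture = ["清晰"] * 6 + ["稍糊", "清晰", "稍糊", "清晰", "模糊", "模糊",
                          "稍糊", "稍糊", "清晰", "模糊", "稍糊"]

# Candidate binary split: is the texture 清晰 or not?
is_clear = [t == "清晰" for t in texture]
before = gini(labels)
after = gini_of_split(is_clear, labels)
print(f"before: {before:.3f}  after: {after:.3f}")
```

CART would compare this weighted impurity across all candidate splits and keep the one with the lowest value.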

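2. Fitting with CART

# Fit with CART (the default criterion is Gini impurity)
dtc = DecisionTreeClassifier()
# Train on the encoded feature columns against the label column
dtc.fit(data.iloc[:,:-1].values.tolist(),data.iloc[:,-1].values) 
# Predict the label of one encoded sample
result = dtc.predict([[1,1,1,1,0,0]])
# Prediction result
result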

array(['是'], dtype=object)
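The fit above uses the default settings, which grow the tree without pruning. One way to add the pruning step mentioned at the start of this section is sklearn's minimal cost-complexity pruning, controlled by `ccp_alpha`. The sketch below is self-contained and retypes the encoded table as a list. Picking the middle alpha from the pruning path is an arbitrary choice for illustration; in practice it would be selected by cross-validation.

```python
from sklearn.tree import DecisionTreeClassifier

# Encoded feature rows and labels, copied from the table in section I.
X = [
    [2, 2, 1, 1, 0, 0], [0, 2, 0, 1, 0, 0], [0, 2, 1, 1, 0, 0],
    [2, 2, 0, 1, 0, 0], [1, 2, 1, 1, 0, 0], [2, 1, 1, 1, 2, 1],
    [0, 1, 1, 2, 2, 1], [0, 1, 1, 1, 2, 0], [0, 1, 0, 2, 2, 0],
    [2, 0, 2, 1, 1, 1], [1, 0, 2, 0, 1, 0], [1, 2, 1, 0, 1, 1],
    [2, 1, 1, 2, 0, 0], [1, 1, 0, 2, 0, 0], [0, 1, 1, 1, 2, 1],
    [1, 2, 1, 0, 1, 0], [2, 2, 0, 2, 2, 0],
]
y = ["是"] * 8 + ["否"] * 9

# Unpruned tree, as in the fit above (random_state added for repeatability).
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Candidate alphas at which successive subtrees would be pruned away.
path = full.cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary middle value

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
print(full.tree_.node_count, pruned.tree_.node_count)
```

A larger `ccp_alpha` removes more of the tree, trading training accuracy for a simpler model.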

IV. References

https://blog.csdn.net/xlinsist/article/details/51468741

https://blog.csdn.net/qq_47281915/article/details/120928915