[Li Hang, Statistical Learning Methods] [Theory and Code] Chapter 5: Decision Tree (Python)

Jie Ou

Categories: Machine Learning, python. Tags: Statistical Learning Methods, decision tree

Article link: https://blog.csdn.net/github_36923418/article/details/89222668

I. Decision Trees

The basic idea:

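A decision tree learning algorithm is usually a recursive process: select the best feature, then split the training data on that feature so that each resulting subset is classified as well as possible. This process corresponds to a partition of the feature space and, at the same time, to the construction of the tree. To start, build the root node and place all training data in it. Select the best feature and split the training set into subsets by that feature, so that each subset has the best classification achievable under the current conditions. If a subset can already be classified basically correctly, build a leaf node and assign the subset to it; if a subset still cannot be classified basically correctly, select a new best feature for it, split it again, and build the corresponding node. Continue recursively until every subset of the training data is classified basically correctly or no suitable feature remains. In the end every subset is assigned to a leaf node, that is, to a definite class, and the result is a decision tree.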

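A tree generated this way may classify the training data well yet classify unseen test data poorly, that is, it may overfit. The generated tree therefore needs to be pruned from the bottom up to make it simpler and so improve its generalization. Concretely, pruning removes leaf nodes that are too finely divided, falls back to their parent node or an even higher node, and turns that node into a new leaf.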
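The pruning decision can be made concrete with the loss function the book uses, C_alpha(T) = sum over leaves t of N_t * H_t(T) + alpha * |T|, where N_t is the number of samples in leaf t, H_t(T) its empirical entropy, and |T| the number of leaves. The following is a minimal sketch under assumptions: the class counts are hypothetical and the helper names `leaf_entropy` and `tree_loss` are not from the original post.

```python
import math

def leaf_entropy(counts):
    """Empirical entropy (base 2) of a leaf given its per-class sample counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def tree_loss(leaves, alpha):
    """C_alpha(T) = sum over leaves of N_t * H_t(T) + alpha * |T|."""
    fit = sum(sum(counts) * leaf_entropy(counts) for counts in leaves)
    return fit + alpha * len(leaves)

# Hypothetical subtree: a parent with two leaves, each given as [count_class_0, count_class_1].
children = [[4, 1], [1, 2]]
# Collapsing the subtree merges both children into a single leaf.
parent = [[5, 3]]

for alpha in (0.1, 2.0):
    keep = tree_loss(children, alpha)
    prune = tree_loss(parent, alpha)
    # Prune when the collapsed tree has a loss no larger than the original one.
    decision = "prune" if prune <= keep else "keep"
    print(f"alpha={alpha}: keep={keep:.3f}, prune={prune:.3f} -> {decision}")
```

With these counts the small alpha keeps the split and the large alpha collapses it, which shows how alpha trades fit on the training data against tree size.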

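Key points:

How is the best feature selected? Use the empirical entropy, the empirical conditional entropy, and the information gain to choose the best feature at the current node.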

C denotes the classes, D the whole data set, and Di the subset of samples that take one particular value of a given feature A.
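For reference, the three quantities can be written out as they are defined in the book, with K classes C_k and n distinct values of feature A. The symbol D_{ik} (the samples of D_i that belong to class C_k) follows the book's notation and does not appear in the original post.

```latex
H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}

H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)
            = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}

g(D, A) = H(D) - H(D \mid A)
```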

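ID3 algorithm:

First use information gain to select the best feature A, then split the data on A. The helper `train_by_cal_information_gain` below separates the value of A with the largest |Di| from the rest, giving two subsets, whereas ID3 as given in the book (and the `DTree` class below) creates one subset per value of A. Each subset then goes back to step 1 and the process repeats.

Code: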
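The recursion can be sketched compactly in plain Python, assuming the textbook form of ID3 in which the chosen feature produces one branch per value. The toy rows, the feature names, and the helpers `entropy`, `info_gain` and `id3` are illustrative assumptions, not part of the original post.

```python
import math
from collections import Counter

def entropy(rows):
    """Empirical entropy of the class label (last element of each row)."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def info_gain(rows, col):
    """g(D, A) = H(D) - H(D|A) for the feature stored in column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row)
    cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - cond

def id3(rows, names, eps=1e-3):
    """Return a class label (leaf) or a (feature_name, {value: subtree}) pair."""
    majority = Counter(row[-1] for row in rows).most_common(1)[0][0]
    # Stop when the subset is pure or no features are left.
    if len({row[-1] for row in rows}) == 1 or not names:
        return majority
    gains = [info_gain(rows, c) for c in range(len(names))]
    best = max(range(len(names)), key=gains.__getitem__)
    # Stop when even the best feature gains less than the threshold.
    if gains[best] < eps:
        return majority
    branches = {}
    for value in {row[best] for row in rows}:
        # Keep the rows with this value and drop the used column.
        subset = [row[:best] + row[best + 1:] for row in rows if row[best] == value]
        branches[value] = id3(subset, names[:best] + names[best + 1:], eps)
    return names[best], branches

# Toy data (hypothetical): (outlook, windy, play)
rows = [
    ("sunny", "no", "yes"),
    ("sunny", "yes", "yes"),
    ("sunny", "yes", "yes"),
    ("rainy", "no", "yes"),
    ("rainy", "yes", "no"),
]
tree = id3(rows, ("outlook", "windy"))
print(tree)
```

The nested tuple and dict structure is only one possible representation; the post's own implementation stores the same information in `Node` objects.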

# encoding:utf-8
import math

import numpy as np
import pandas as pd

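# The data set below is borrowed directly from material shared by the "机器学习初学者" (Machine Learning Beginners) WeChat account
# Example 5.1 from the book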
def create_data():
    datasets = [[u'青年', u'否', u'否', u'一般', u'否'],
               [u'青年', u'否', u'否', u'好', u'否'],
               [u'青年', u'是', u'否', u'好', u'是'],
               [u'青年', u'是', u'是', u'一般', u'是'],
               [u'青年', u'否', u'否', u'一般', u'否'],
               [u'中年', u'否', u'否', u'一般', u'否'],
               [u'中年', u'否', u'否', u'好', u'否'],
               [u'中年', u'是', u'是', u'好', u'是'],
               [u'中年', u'否', u'是', u'非常好', u'是'],
               [u'中年', u'否', u'是', u'非常好', u'是'],
               [u'老年', u'否', u'是', u'非常好', u'是'],
               [u'老年', u'否', u'是', u'好', u'是'],
               [u'老年', u'是', u'否', u'好', u'是'],
               [u'老年', u'是', u'否', u'非常好', u'是'],
               [u'老年', u'否', u'否', u'一般', u'否'],
               ]
    labels = [u'年龄', u'有工作', u'有自己的房子', u'信贷情况', u'类别']
    # Return the data set and the name of each column
    # (age, has a job, owns a house, credit rating, class)
    return datasets, labels

# Empirical entropy H(D)
def cal_entropy(datasets):
    dataset_length=len(datasets)
    # Count how many samples fall into each class (the class is the last column)
    labels={}
    for i in range(dataset_length):
        label=datasets[i][-1]
        if label not in labels:
            labels[label]=0
        labels[label]+=1
    entropy=0
    for key in labels.keys():
        # Every counted class has at least one sample, so probability > 0
        probability= float(labels[key])/float(dataset_length)
        # Base-2 logarithm, following the textbook
        entropy+=probability*math.log(probability,2)
    return -entropy

# Empirical conditional entropy H(D|A)
def conditional_entropy(dataset,select_Feature):
    # dataset must be a 2-D numpy array; select_Feature is a column index
    length_D=float(dataset.shape[0])
    Di_={}
    # Collect the distinct values taken by the selected feature
    select_Feature_choice=[]
    for unit in dataset[:,select_Feature]:
        if unit not in select_Feature_choice:
            select_Feature_choice.append(unit)
    conditional_entropy_=0
    # For each value, take the matching subset Di and add |Di|/|D| * H(Di)
    for unit in select_Feature_choice:
        select_dataset=dataset[dataset[:,select_Feature]==unit]
        length_Di= float(select_dataset.shape[0])
        Di_[unit]=length_Di
        entropy_Di=cal_entropy(select_dataset)
        conditional_entropy_+=((length_Di/length_D)*entropy_Di)
    # Return H(D|A) together with the size of each subset Di
    return conditional_entropy_,Di_

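def cal_information_gain(H_entropy,Condi_entropy): # information gain g(D,A) = H(D) - H(D|A)
    return H_entropy-Condi_entropy

def cal_information_gain_ratio(H_entropy,Condi_entropy,H_A_entropy): # information gain ratio
    # In the book the gain is divided by H_A(D), the entropy of D with respect to
    # the values of feature A, so that quantity is passed in as a third argument.
    return cal_information_gain(H_entropy,Condi_entropy)/H_A_entropy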

def train_by_cal_information_gain(datasets,labels):
    # One split step: pick the feature with the largest information gain,
    # then separate the samples of its most frequent value from the rest.
    datasets=np.array(datasets)
    labels=list(labels)
    num_features=len(labels)-1 # the last column is the class, not a feature
    information_gain_=[]
    Diss=[]
    entropy=cal_entropy(datasets)
    for i in range(num_features):
        conditional_entropy_,Dis=conditional_entropy(datasets,i)
        information_gain=cal_information_gain(entropy,conditional_entropy_)
        information_gain_.append(information_gain)
        Diss.append(Dis)
    selected_feature=int(np.argmax(information_gain_))
    selected_name=labels[selected_feature]
    print("Selected feature:")
    print(selected_name)
    # Value of the selected feature with the most samples (largest |Di|)
    Di_selected=max(Diss[selected_feature].items(), key=lambda x:x[-1])[0]
    mask=datasets[:,selected_feature]==Di_selected
    # Samples with that value and all remaining samples, both without the used column
    selected_dataset=np.delete(datasets[mask],selected_feature,axis=1)
    new_datasets=np.delete(datasets[~mask],selected_feature,axis=1)
    new_labels=labels[:selected_feature]+labels[selected_feature+1:]
    return Di_selected,selected_name,selected_dataset,new_datasets,new_labels

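# Node class; each internal node holds one child per value of its feature
class Node:
    def __init__(self, root=True, label=None, feature_name=None, feature=None):
        self.root = root  # True for a leaf (single-node tree), False for an internal node
        self.label = label
        self.feature_name = feature_name
        self.feature = feature
        self.tree = {}
        self.result = {'label': self.label, 'feature': self.feature, 'tree': self.tree}

    def __repr__(self):
        return '{}'.format(self.result)

    def add_node(self, val, node):
        self.tree[val] = node

    def predict(self, features):
        if self.root is True:
            return self.label
        # self.feature is a column index into the features that were still unused at
        # this node during training, so drop the used feature before descending.
        value = features[self.feature]
        remaining = list(features[:self.feature]) + list(features[self.feature + 1:])
        return self.tree[value].predict(remaining)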
    
class DTree:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self._tree = {}

    # Entropy and conditional entropy are computed by the module-level functions above

    def info_gain_train(self, datasets):
        count = len(datasets[0]) - 1
        ent = cal_entropy(datasets)
        best_feature = []
        for c in range(count):
            c_info_gain = cal_information_gain(ent, conditional_entropy(datasets, c)[0])
            best_feature.append((c, c_info_gain))
        # Pick the (feature index, gain) pair with the largest gain
        best_ = max(best_feature, key=lambda x: x[-1])
        return best_

    def train(self, train_data):
        """
        input: data set D (a DataFrame whose last column is the class); the feature
               set A is the remaining columns and the threshold is self.epsilon
        output: decision tree T
        """
        y_train, features = train_data.iloc[:, -1], train_data.columns[:-1]
        # 1. If all instances in D belong to the same class Ck, T is a single-node tree
        #    labelled with Ck; return T
        if len(y_train.value_counts()) == 1:
            return Node(root=True,
                        label=y_train.iloc[0])

        # 2. If A is empty, T is a single-node tree labelled with the class Ck that has
        #    the most instances in D; return T
        if len(features) == 0:
            return Node(root=True, label=y_train.value_counts().sort_values(ascending=False).index[0])

        # 3. Compute the information gains as in Example 5.1; Ag is the feature with the largest gain
        max_feature, max_info_gain = self.info_gain_train(np.array(train_data))
        max_feature_name = features[max_feature]

        # 4. If the gain of Ag is below the threshold, T is a single-node tree labelled
        #    with the class Ck that has the most instances in D; return T
        if max_info_gain < self.epsilon:
            return Node(root=True, label=y_train.value_counts().sort_values(ascending=False).index[0])

        # 5. Split D into one subset per value of Ag
        node_tree = Node(root=False, feature_name=max_feature_name, feature=max_feature)

        feature_list = train_data[max_feature_name].value_counts().index
        for f in feature_list:
            sub_train_df = train_data.loc[train_data[max_feature_name] == f].drop([max_feature_name], axis=1)

            # 6. Build the subtree for this subset recursively
            sub_tree = self.train(sub_train_df)
            node_tree.add_node(f, sub_tree)

        return node_tree

    def fit(self, train_data):
        self._tree = self.train(train_data)
        return self._tree

    def predict(self, X_test):
        return self._tree.predict(X_test)

datasets, labels = create_data()
data_df = pd.DataFrame(datasets, columns=labels)
dt = DTree()
tree = dt.fit(data_df)

# Classify one applicant: age 老年 (senior), no job, owns a house, credit rating 好 (good)
print(dt.predict([u'老年', u'否', u'是', u'好']))
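As a cross-check on the feature chosen at the root, the information gains for the loan data can be recomputed directly from class counts. This is a minimal sketch under assumptions: the counts are tallied by hand from the 15 rows in `create_data`, the English feature names are glosses of the Chinese column names, and the helpers `H` and `gain` belong to this example only.

```python
import math

def H(counts):
    """Entropy (base 2) of a class distribution given as a list of counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain(groups):
    """Information gain of a feature; `groups` holds [yes, no] counts per feature value."""
    total = [sum(col) for col in zip(*groups)]  # class counts over the whole set
    n = sum(total)
    cond = sum(sum(g) / n * H(g) for g in groups)
    return H(total) - cond

# [approved, rejected] counts per feature value, tallied from the 15 rows
features = {
    "age": [[2, 3], [3, 2], [4, 1]],      # 青年, 中年, 老年
    "has job": [[5, 0], [4, 6]],          # 是, 否
    "owns house": [[6, 0], [3, 6]],       # 是, 否
    "credit": [[1, 4], [4, 2], [4, 0]],   # 一般, 好, 非常好
}
gains = {name: gain(groups) for name, groups in features.items()}
for name, g in gains.items():
    print(f"{name}: {g:.3f}")
print("best:", max(gains, key=gains.get))
```

The largest gain belongs to "owns house" (有自己的房子), in line with the values the book reports for this example (0.083, 0.324, 0.420 and 0.363).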
