Decision Tree Algorithm and Accuracy Test on Zhou Zhihua's Watermelon Dataset

1. Introduction to Decision Trees
To explain what a decision tree is with an everyday example, imagine a mother introducing a potential boyfriend to her daughter:

Daughter: Does he own a house? Mother: Yes.

Daughter: Is he good-looking? Mother: Quite handsome.

Daughter: Is his income high?
Mother: Not especially, about average.

Daughter: Is he a civil servant? Mother: Yes, he works at the tax bureau.

Daughter: Alright, I'll go meet him.

The daughter's decision process is a typical classification-tree decision: based on house ownership, looks, income, and civil-servant status, men are sorted into two classes, "meet" and "don't meet". The flowchart below shows the daughter's decision process as a tree:
[Figure: flowchart of the daughter's decision tree]

This example should give you a basic feel for the decision tree algorithm, and it also shows one of its major strengths: the model it produces is very easy for people to understand.
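In code, such a tree is naturally represented as nested dictionaries: each key is the attribute tested at a node, and each value is either a subtree or a class label. A rough sketch of the matchmaking tree above (the attribute names and decision rules here are illustrative, not from a real dataset):

mate_tree = {'has_house': {
    'no': 'decline',
    'yes': {'handsome': {
        'no': 'decline',
        'yes': {'income': {
            'high': 'meet',
            'medium': {'civil_servant': {'yes': 'meet', 'no': 'decline'}},
            'low': 'decline'
        }}
    }}
}}

This nested-dictionary shape is exactly what the TreeGenerate function later in this post produces and what tree_predict walks.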

2. Building a Decision Tree in Python
The figure below shows the basic decision-tree learning algorithm from the Watermelon Book; following this procedure, we will write a decision tree ourselves in Python.
[Figure: basic decision-tree learning algorithm from the Watermelon Book]

The first problem to solve when building a decision tree is deciding which feature of the current dataset is most decisive for classifying the data. In the matchmaking example, the daughter asked about the house first because that feature carries the most "information". Choosing a split means finding the feature that provides the largest amount of information, which is formally called information gain.

3. Split Selection (by Information Gain)
What is information gain? For the formal treatment, see the Watermelon Book; intuitively, I think of it as a purification process. Imagine a pile of yellow beans mixed with red beans: the purity of that pile is low, but once we pick the red beans out into their own pile, the purity is high. That purification is an information gain, and the yardstick for measuring purity is information entropy.
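Formally, if splitting the sample set $D$ on a discrete attribute $a$ with possible values $a^1, \ldots, a^V$ yields subsets $D^1, \ldots, D^V$, the information gain is

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, \mathrm{Ent}(D^v)$$

where $\mathrm{Ent}$ is the information entropy defined below; this is the quantity the Gain function in the code computes.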

Information entropy is the most common measure of the purity of a sample set. My personal intuition: if you encode the outcomes of an event, the average code length you need is the entropy; the higher the purity, the shorter the average code, and the lower the entropy.

Let $p_k$ ($k = 1, 2, \ldots, n$) be the proportion of class-$k$ samples in the current sample set $D$. The information entropy of $D$ is then defined as

$$\mathrm{Ent}(D) = -\sum_{k=1}^{n} p_k \log_2 p_k$$

The smaller $\mathrm{Ent}(D)$ is, the higher the purity of $D$.
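As a quick sanity check, here is a minimal sketch using the Watermelon Book's running example, where 8 of the 17 melons in watermelon dataset 2.0 are good:

from math import log

# 8 good melons and 9 bad ones out of 17
p = [8/17, 9/17]
ent = -sum(pk * log(pk, 2) for pk in p)
print(round(ent, 3))  # 0.998 bits: close to the 1-bit maximum, i.e. low purity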
Watermelon dataset: link: https://pan.baidu.com/s/1jxZvzUYX6QUk0cVH3d1vfw extraction code: 3ee9
A random third of the data is held out as the test set.
Initial version of the code:

import pandas as pd
from math import log


#Compute the Shannon entropy of a sequence of class labels
def Ent(dataset):
    n = len(dataset)
    label_counts = {}
    for item in dataset:  # tally how often each label occurs
        # item is normally a label value (e.g. '是'/'否'); if a full row is
        # passed instead, its last element is the label
        label_current = item if isinstance(item, str) else item[-1]
        if label_current not in label_counts:
            label_counts[label_current] = 0  # first time we see this label
        label_counts[label_current] += 1
    ent = 0.0
    for key in label_counts:
        prob = label_counts[key] / n   # p_k
        ent -= prob * log(prob, 2)     # Ent = -sum p_k * log2(p_k)
    return ent

#Test the Shannon entropy function on the watermelon dataset
data = pd.read_csv('xigua1.csv', encoding='gbk')
print(data)
print(Ent(data.iloc[:, -1]))  # entropy of the label column (last column)


#Weighted entropy of one branch: |D^v|/|D| * Ent(D^v)
def sum_weight(grouped, total_len):
    weight = len(grouped) / total_len
    return weight * Ent(grouped.iloc[:, -1])

#Information gain of attribute `column`:
#Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)
def Gain(column, data):
    length = len(data)
    # group the rows by the attribute's values, compute each branch's
    # weighted entropy, then sum over the branches
    ent_sum = data.groupby(column).apply(lambda x: sum_weight(x, length)).sum()
    ent_D = Ent(data.iloc[:, -1])
    return ent_D - ent_sum

#e.g. Gain('色泽', data) gives the information gain of attribute '色泽'

# Find the attribute with the largest information gain.
# `data` is a DataFrame whose last column is the label; returns a column name.
def get_max_gain(data):
    max_gain = 0.0
    cols = data.columns[:-1]
    max_label = cols[0]  # fall back to the first attribute if no gain is positive
    for col in cols:
        gain = Gain(col, data)
        if gain > max_gain:
            max_gain = gain
            max_label = col
    return max_label

#Majority class of a node; input is a Series of labels, returns the label value
def get_most_label(label_list):
    return label_list.value_counts().idxmax()  # value_counts tallies each distinct label

# Build the decision tree recursively; input is a DataFrame whose last column is the label

def TreeGenerate(data):
    feature = data.columns[:-1]  # attributes still available for splitting
    label_list = data.iloc[:, -1]
    # If all samples belong to one class C, mark this node as a C leaf
    if len(pd.unique(label_list)) == 1:
        return label_list.values[0]
    # If the attribute set is empty, or all samples agree on every remaining
    # attribute, make this a leaf labelled with the majority class
    elif len(feature) == 0 or len(data.loc[:, feature].drop_duplicates()) == 1:
        return get_most_label(label_list)
    # Pick the best splitting attribute from the remaining set
    best_attr = get_max_gain(data)
    tree = {best_attr: {}}
    # Create one branch per value of the best attribute
    for attr, gb_data in data.groupby(by=best_attr):
        if len(gb_data) == 0:
            tree[best_attr][attr] = get_most_label(label_list)
        else:
            # drop the attribute we just split on
            new_data = gb_data.drop(best_attr, axis=1)
            # recurse on this branch's subset
            tree[best_attr][attr] = TreeGenerate(new_data)
    return tree


#Classify one sample by walking the tree recursively
def tree_predict(tree, data):
    feature = list(tree.keys())[0]     # attribute tested at this node
    label = data[feature]              # the sample's value for that attribute
    next_tree = tree[feature][label]   # follow the matching branch
    if type(next_tree) == str:         # reached a leaf: return its class
        return next_tree
    else:
        return tree_predict(next_tree, data)






from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#Split into training and test sets (~1/3 of the rows held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    data.iloc[:, :-1], data.iloc[:, -1], test_size=0.3, random_state=1)
train = pd.concat([X_train, y_train], axis=1)
print("train", X_train)
print("test", y_test)

#Train the model
decision_tree = TreeGenerate(train)
print(decision_tree)

#Predict each test row and score the tree
y_predict = X_test.apply(lambda x: tree_predict(decision_tree, x), axis=1)
score = accuracy_score(y_test, y_predict)
print('Accuracy: ' + repr(score * 100) + '%')
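Once the tree is built, tree_predict can also classify a single sample directly; a minimal sketch (the attribute values below are made up in the style of the dataset):

sample = {'色泽': '青绿', '根蒂': '蜷缩', '敲声': '浊响',
          '纹理': '清晰', '脐部': '凹陷', '触感': '硬滑'}
print(tree_predict(decision_tree, sample))

Note that the lookup tree[feature][label] inside tree_predict raises a KeyError whenever a sample's attribute value has no branch in the tree, which is exactly the problem discussed next.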


The algorithm above actually has a flaw: an attribute value that never occurs in a node's training subset gets no branch, so prediction can fail on such samples. The missing branches need to be completed:
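A minimal illustration of the flaw, with a hypothetical hand-built tree:

tree = {'纹理': {'清晰': '是', '稍糊': '否'}}  # no branch for the value '模糊'
sample = {'纹理': '模糊'}
# tree_predict(tree, sample)  # raises KeyError: '模糊'

createFullDecisionTree below avoids this by enumerating every value each attribute takes in the full training set and, when a branch receives no samples, attaching a leaf labelled with the parent node's majority class.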

import math
import random
import pandas as pd
from sklearn.metrics import accuracy_score

#Compute the entropy of a dataset (list of rows; the last element of each row is the label)
def calcEntropy(dataSet):
    mD = len(dataSet)
    dataLabelList = [x[-1] for x in dataSet]
    dataLabelSet = set(dataLabelList)
    ent = 0
    for label in dataLabelSet:
        mDv = dataLabelList.count(label)
        prop = float(mDv) / mD
        ent = ent - prop * math.log(prop, 2)

    return ent

# Split the dataset
# index - column index of the attribute to split on
# feature - the attribute value to keep
# Returns the rows whose value at `index` equals `feature`, with that column removed
def splitDataSet(dataSet, index, feature):
    splitedDataSet = []
    for data in dataSet:
        if(data[index] == feature):
            sliceTmp = data[:index]
            sliceTmp.extend(data[index + 1:])
            splitedDataSet.append(sliceTmp)
    return splitedDataSet

#Choose the best attribute by information gain
#Returns the index of the best attribute
def chooseBestFeature(dataSet):
    entD = calcEntropy(dataSet)
    mD = len(dataSet)
    featureNumber = len(dataSet[0]) - 1
    maxGain = -100
    maxIndex = -1
    for i in range(featureNumber):
        entDCopy = entD
        featureI = [x[i] for x in dataSet]
        featureSet = set(featureI)
        for feature in featureSet:
            splitedDataSet = splitDataSet(dataSet, i, feature)  # split on this value
            mDv = len(splitedDataSet)
            # subtract each branch's weighted entropy; entDCopy ends up as the gain
            entDCopy = entDCopy - float(mDv) / mD * calcEntropy(splitedDataSet)
        if(maxIndex == -1):
            maxGain = entDCopy
            maxIndex = i
        elif(maxGain < entDCopy):
            maxGain = entDCopy
            maxIndex = i

    return maxIndex

# Majority label, used for leaf nodes
def mainLabel(labelList):
    labelRec = labelList[0]
    maxLabelCount = -1
    labelSet = set(labelList)
    for label in labelSet:
        if(labelList.count(label) > maxLabelCount):
            maxLabelCount = labelList.count(label)
            labelRec = label
    return labelRec

#Build the decision tree (basic version, without branch completion)
#featureNames - attribute names; note that pop() below mutates the caller's list
def createDecisionTree(dataSet, featureNames):
    labelList = [x[-1] for x in dataSet]
    if(len(dataSet[0]) == 1): # no attributes left to split on
        return mainLabel(labelList)  # use the majority label for this subset
    elif(labelList.count(labelList[0]) == len(labelList)): # all samples share one label
        return labelList[0]

    bestFeatureIndex = chooseBestFeature(dataSet)
    bestFeatureName = featureNames.pop(bestFeatureIndex)
    myTree = {bestFeatureName: {}}
    featureList = [x[bestFeatureIndex] for x in dataSet]
    featureSet = set(featureList)
    for feature in featureSet:
        featureNamesNext = featureNames[:]
        splitedDataSet = splitDataSet(dataSet, bestFeatureIndex, feature)
        myTree[bestFeatureName][feature] = createDecisionTree(splitedDataSet, featureNamesNext)
    return myTree

#Build the decision tree with branch completion
#featureNamesSet - for each attribute, the list of values it takes in the training set
#labelListParent - labels of the parent node (fallback for empty branches)
def createFullDecisionTree(dataSet, featureNames, featureNamesSet, labelListParent):
    labelList = [x[-1] for x in dataSet]
    if(len(dataSet) == 0):
        # empty branch: fall back to the parent's majority label
        return mainLabel(labelListParent)
    elif(len(dataSet[0]) == 1): # no attributes left to split on
        return mainLabel(labelList)  # use the majority label for this subset
    elif(labelList.count(labelList[0]) == len(labelList)): # all samples share one label
        return labelList[0]

    bestFeatureIndex = chooseBestFeature(dataSet)
    bestFeatureName = featureNames.pop(bestFeatureIndex)
    myTree = {bestFeatureName: {}}
    # iterate over *all* values the attribute takes in the full training set,
    # not just the values present in this subset - this completes the branches
    featureList = featureNamesSet.pop(bestFeatureIndex)
    featureSet = set(featureList)
    for feature in featureSet:
        featureNamesNext = featureNames[:]
        featureNamesSetNext = featureNamesSet[:]  # shallow copy suffices: inner lists are never mutated
        splitedDataSet = splitDataSet(dataSet, bestFeatureIndex, feature)
        myTree[bestFeatureName][feature] = createFullDecisionTree(splitedDataSet, featureNamesNext, featureNamesSetNext, labelList)
    return myTree




#Read watermelon dataset 2.0 and split it 2:1 into training and test sets
def readWatermelonDataSet():
    with open("xigua1.txt") as ifile:
        featureName = ifile.readline()  # header row
        featureName = featureName.rstrip("\n")
        featureNames = featureName.split(' ')[0].split(',')  # attribute names
        lines = ifile.readlines()
    dataSet = []
    for line in lines:
        tmp = line.split('\n')[0]
        tmp = tmp.split(',')
        dataSet.append(tmp)
    # shuffle, then keep 2/3 for training and 1/3 for testing
    random.shuffle(dataSet)
    dlen = int(len(dataSet) * 2 / 3)
    D = dataSet[0:dlen]
    testD = dataSet[dlen:len(dataSet)]

    labelList = [x[-1] for x in D]
    # featureNamesSet: for each attribute, the list of values it takes in D
    featureNamesSet = []
    for i in range(len(D[0]) - 1):
        col = [x[i] for x in D]
        colSet = set(col)
        featureNamesSet.append(list(colSet))

    return D, featureNames, featureNamesSet, labelList, testD

#Classify one sample by walking the tree recursively
def tree_predict(tree, data):
    feature = list(tree.keys())[0]     # attribute tested at this node
    label = data[feature]              # the sample's value for that attribute
    next_tree = tree[feature][label]   # follow the matching branch
    if type(next_tree) == str:         # reached a leaf: return its class
        return next_tree
    else:
        return tree_predict(next_tree, data)



def main():
    total = 0.0
    # repeat 10 times with different random splits
    for i in range(1, 11):
        dataSet, featureNames, featureNamesSet, labelList, testD = readWatermelonDataSet()
        tree = createFullDecisionTree(dataSet, featureNames, featureNamesSet, labelList)
        print(tree)
        train = pd.DataFrame(dataSet, columns=['色泽','根蒂','敲声','纹理','脐部','触感','好瓜'])
        test = pd.DataFrame(testD, columns=['色泽','根蒂','敲声','纹理','脐部','触感','好瓜'])

        y_predict = test.apply(lambda x: tree_predict(tree, x), axis=1)
        label_list = test.iloc[:, -1]
        score = accuracy_score(label_list, y_predict)
        total += score
        print('Run ' + repr(i) + ' accuracy with completed branches: ' + repr(score * 100) + '%')
    # mean over 10 runs, as a percentage: total / 10 * 100 = total * 10
    print('Average accuracy: ' + repr(total * 10) + '%')


if __name__ == "__main__":
    main()