Data Mining: The ID3 Algorithm

伽利略的猫

Category: Python. Tags: Python, ID3, data mining

Article link: https://blog.csdn.net/qq_40232802/article/details/90405050

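This is an implementation of the ID3 algorithm for data mining. I wrote it with the help of tutorials found online, so treat it as a reference only.

The dataset is as follows: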

Outlook	Temperature	Humidity	Windy	Play
sunny	hot	high	FALSE	no
sunny	hot	high	TRUE	no
overcast	hot	high	FALSE	yes
rain	mild	high	FALSE	yes
rain	cool	normal	FALSE	yes
rain	cool	normal	TRUE	no
overcast	cool	normal	TRUE	yes
sunny	mild	high	FALSE	no
sunny	cool	normal	FALSE	yes
rain	mild	normal	FALSE	yes
sunny	mild	normal	TRUE	yes
overcast	mild	high	TRUE	yes
overcast	hot	normal	FALSE	yes
rain	mild	high	TRUE	no
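
Before the full script, here is a small standalone sketch of the quantity ID3 ranks features by: the information gain of each column with respect to Play. It is not part of the original script. The names ROWS, FEATURES, entropy and info_gain are made up for this illustration, and the Windy values are assumed to be the strings "FALSE" and "TRUE" as printed in the table.

```python
from collections import Counter
from math import log2

# The 14 rows of the table above; the last field is the class label (Play).
# Assumption: Windy is kept as the strings "FALSE"/"TRUE".
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rain", "mild", "high", "FALSE", "yes"),
    ("rain", "cool", "normal", "FALSE", "yes"),
    ("rain", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rain", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rain", "mild", "high", "TRUE", "no"),
]
FEATURES = ["Outlook", "Temperature", "Humidity", "Windy"]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, index):
    """Label entropy minus the weighted label entropy after splitting on one column."""
    base = entropy([row[-1] for row in rows])
    groups = {}
    for row in rows:
        # Collect the class labels of the rows that share this feature value
        groups.setdefault(row[index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


gains = {name: info_gain(ROWS, i) for i, name in enumerate(FEATURES)}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
```

With 9 "yes" and 5 "no" rows the label entropy is about 0.940 bits. Outlook has the largest gain (about 0.247), so it is the feature ID3 would be expected to place at the root for this table.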

import pandas as pd
import numpy as np
from math import log


def getDataSet():
    # Load the dataset; the last column (Play) is the class label.
    # read_excel takes no encoding argument in current pandas versions.
    DataSet = pd.read_excel(r"ID3数据集.xlsx")
    DataArr = np.array(DataSet)
    # Feature names: every column header except the last (label) one
    columns = np.array(DataSet.columns[:-1])
    return DataArr.tolist(), columns.tolist()

# Compute the Shannon entropy of the class labels (last item of each row)
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for feaVec in dataSet:
        currentLabel = feaVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt


# Return the rows whose feature at index `axis` equals `value`, with that feature column removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


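# Return the index of the feature with the largest information gain,
# or -1 if no split gives a positive gain
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last item in each row is the class label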
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


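# The tree is built recursively and each split consumes one feature, so the features
# may run out before all samples at a node share one class. In that case the node's
# class is decided by majority vote.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Pick the label with the highest count; key= makes max compare counts rather than label names
    return max(classCount, key=classCount.get)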


def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all samples share one class: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # every feature has been used up
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    if bestFeat == -1:  # no feature gives a positive information gain
        return majorityCnt(classList)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]  # note: this modifies the list passed in by the caller
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so that each recursive call works on its own list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


def main():
    data, label = getDataSet()
    myTree = createTree(data, label)
    print(myTree)


if __name__ == '__main__':
    main()
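
The script only prints the tree. As a rough sketch of how such a nested dict could be used to classify a new sample, the snippet below walks the tree from the root to a leaf. The function classify and the constant TREE are hypothetical additions, not part of the original script. TREE is the tree ID3 yields for the table above, worked out by hand, and its Windy keys are assumed to be Python booleans (what pandas would typically return if the Excel cells hold booleans rather than text).

```python
def classify(tree, sample):
    """Follow the branches of a nested-dict decision tree until a leaf label is reached.

    `sample` maps feature names to values. Returns None when the sample has a
    feature value that the tree has no branch for.
    """
    while isinstance(tree, dict):
        feature = next(iter(tree))   # each internal node has exactly one key: the feature name
        branches = tree[feature]
        value = sample[feature]
        if value not in branches:
            return None              # value never seen while building the tree
        tree = branches[value]
    return tree


# Hand-written tree for the table above (assumption: Windy keys are booleans).
TREE = {
    "Outlook": {
        "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"Windy": {False: "yes", True: "no"}},
    }
}

sample = {"Outlook": "rain", "Temperature": "mild", "Humidity": "high", "Windy": True}
print(classify(TREE, sample))
```

Returning None for an unseen value is just one possible choice. Falling back to the majority class at that node would be another, but it requires storing class counts in the tree.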
