Decision Trees
The decision tree (Decision Tree) is a common algorithm in machine learning and belongs to supervised learning.
At each internal node the tree tests one attribute of the data and splits the data into different branches according to that attribute's value, continuing until a leaf node is reached; the leaf node gives the sample's label. Each path from the root to a leaf represents one classification (or regression) rule.
Sklearn interface
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_diabetes
from sklearn import tree
from sklearn.model_selection import train_test_split

# Classification tree
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
print("Classifier Score:", clf.score(X_test, y_test))
tree.plot_tree(clf)  # plot the already-fitted tree; no need to refit
plt.show()

# Regression tree
# (load_boston was removed in scikit-learn 1.2; load_diabetes is used here instead)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
reg = tree.DecisionTreeRegressor()
reg = reg.fit(X_train, y_train)
print("Regression Score:", reg.score(X_test, y_test))
tree.plot_tree(reg)
plt.show()
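Besides plot_tree, scikit-learn can also render a fitted tree as plain-text rules via export_text; a minimal sketch reusing the iris classifier (max_depth=2 and random_state=0 are choices made here just to keep the output short and stable):

```python
from sklearn.datasets import load_iris
from sklearn import tree

X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned splits as indented text rules
rules = tree.export_text(clf, feature_names=load_iris().feature_names)
print(rules)
```

The text form is handy for logging or quick inspection when no plotting backend is available.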
Three decision tree models are common (the CART tree can do both classification and regression):
- ID3: selects features by information gain, which is equivalent to model selection by maximum likelihood for a probabilistic model.
- C4.5: similar to ID3, but selects features by the information gain ratio.
- CART: recursively builds a binary decision tree; regression trees use squared error, classification trees use the Gini index.
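The split criteria above are easy to compare numerically. A small sketch (the function names are my own) computing the entropy used by ID3/C4.5 and the Gini index used by CART classification trees:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy: H(D) = -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini index: Gini(D) = 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

labels = ["yes", "yes", "no", "no", "no"]
print(entropy(labels))  # entropy of a 2-vs-3 split, about 0.971
print(gini(labels))     # 1 - (0.4^2 + 0.6^2) = 0.48
```

Both measures are zero for a pure node and maximal when the classes are evenly mixed, which is why either can serve as an impurity criterion.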
How a decision tree is built
Construction generally has three parts:
1. Feature selection: choose, from the many features in the training data, one feature as the splitting criterion for the current node. The many different quantitative criteria for this choice give rise to the different algorithms, such as CART, ID3, and C4.5.
2. Tree generation: using the chosen criterion, recursively generate child nodes from the top down, and stop growing the tree once the data can no longer be split. For a tree structure, recursion is the most natural way to think about this.
3. Pruning: decision trees overfit easily, so the tree usually needs to be pruned to shrink its size and mitigate overfitting. Pruning comes in two flavors: pre-pruning and post-pruning.
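In scikit-learn, post-pruning is exposed through minimal cost-complexity pruning (the ccp_alpha parameter of the tree estimators); a sketch showing that a positive complexity penalty yields a smaller tree (the value 0.02 is an arbitrary illustration, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unpruned tree vs. one pruned with a positive complexity penalty
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print(full.tree_.node_count, pruned.tree_.node_count)
```

In practice ccp_alpha is tuned by cross-validation, e.g. over the candidate values returned by cost_complexity_pruning_path.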
Pseudocode:
if a stopping condition is met:
    return the class label
else:
    find the best feature and split the dataset on it
    create a branch node
    recurse on each branch and attach the returned subtree to the branch node
    return the branch node
An ID3 implementation:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import operator

def loadDataSet():
    """
    Load the data.
    @ return dataSet: the loaded dataset
    @ return labelSet: the feature names
    """
    dataSet = pd.read_csv('isFish.csv', delimiter=',')
    # dataSet = dataSet.replace('yes', 1).replace('no', 0)
    labelSet = list(dataSet.columns.values)
    dataSet = dataSet.values
    return dataSet, labelSet
def calcShannonEnt(dataSet):
    """
    Compute the Shannon entropy (information entropy) of a dataset.
    @ param dataSet: the dataset
    @ return shannonEnt: the Shannon entropy
    """
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        # The current sample's class label
        currentLabel = featVec[-1]
        # Create an entry for this class if it is not yet in labelCounts
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * np.log2(prob)
    return shannonEnt
def splitDataSet(dataSet, axis, value):
    """
    Split the dataset: extract the samples whose feature `axis` equals
    `value`, with that feature column removed.
    @ param dataSet: the dataset
    @ param axis: the feature (column index) to split on
    @ param value: the feature value to keep
    @ return retDataSet: the matching subset
    """
    retDataSet = []
    for featVec in dataSet:
        # Keep samples that share this feature value
        if featVec[axis] == value:
            reducedFeatVec = list(featVec[:axis])
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
def chooseBestFeature(dataSet):
    """
    Choose the best feature to split on, by information gain.
    @ param dataSet: the dataset
    @ return bestFeature: index of the best feature
    """
    # Number of features (the last column is the label)
    numFeature = len(dataSet[0]) - 1
    baseEntroy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeature):
        # All values that feature i takes
        featureList = [example[i] for example in dataSet]
        # Remove duplicates
        uniqueVals = set(featureList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            # Fraction of samples whose feature i equals this value
            prob = len(subDataSet) / float(len(dataSet))
            # Conditional entropy: weight each subset's entropy by its size
            newEntropy += prob * calcShannonEnt(subDataSet)
        inforGain = baseEntroy - newEntropy
        if inforGain > bestInfoGain:
            bestInfoGain = inforGain
            bestFeature = i
    return bestFeature
def majorityCnt(classList):
    """
    Majority vote: return the most frequent class label.
    @ param classList: list of class labels
    @ return sortedClassCount[0][0]: the most frequent class
    """
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort by count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # Return the label with the highest count
    return sortedClassCount[0][0]
def createTree(dataSet, labels):
    """
    Recursively build the decision tree.
    @ param dataSet: the dataset
    @ param labels: the feature names
    @ return myTree: the decision tree as nested dicts
    """
    classList = [example[-1] for example in dataSet]
    # Stop when all samples belong to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop when all features are used up; return the majority class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    # Pick the best feature to split on
    bestFeat = chooseBestFeature(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # Remove the used feature name
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        # Recurse to build the subtree for this branch
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
if __name__ == '__main__':
    dataSet, labelSet = loadDataSet()
    shannonEnt = calcShannonEnt(dataSet)
    tree = createTree(dataSet, labelSet)
    print(tree)
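The script above depends on reading isFish.csv from disk. As a self-contained check, condensed versions of the same four functions can be run on an inline toy dataset (the two features and labels below are illustrative, not the contents of isFish.csv):

```python
import numpy as np

# Condensed versions of the ID3 functions above, for a quick self-contained run
def calcShannonEnt(dataSet):
    labels = [row[-1] for row in dataSet]
    probs = [labels.count(l) / len(labels) for l in set(labels)]
    return -sum(p * np.log2(p) for p in probs)

def splitDataSet(dataSet, axis, value):
    return [row[:axis] + row[axis+1:] for row in dataSet if row[axis] == value]

def chooseBestFeature(dataSet):
    base = calcShannonEnt(dataSet)
    bestGain, bestFeat = 0.0, -1
    for i in range(len(dataSet[0]) - 1):
        newEnt = 0.0
        for v in set(row[i] for row in dataSet):
            sub = splitDataSet(dataSet, i, v)
            newEnt += len(sub) / len(dataSet) * calcShannonEnt(sub)
        if base - newEnt > bestGain:
            bestGain, bestFeat = base - newEnt, i
    return bestFeat

def createTree(dataSet, labels):
    classList = [row[-1] for row in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return max(set(classList), key=classList.count)
    best = chooseBestFeature(dataSet)
    name = labels[best]
    myTree = {name: {}}
    rest = labels[:best] + labels[best+1:]
    for v in set(row[best] for row in dataSet):
        myTree[name][v] = createTree(splitDataSet(dataSet, best, v), rest[:])
    return myTree

# Toy dataset: columns are [no surfacing, has flippers, label]
dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']
print(createTree(dataSet, labels))
# → {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```

The first split lands on 'no surfacing' because it has the larger information gain (about 0.42 vs. 0.17 for 'flippers'), matching what chooseBestFeature computes.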
References:
1. https://github.com/datawhalechina/team-learning
2. https://blog.csdn.net/csqazwsxedc/article/details/65697652
3. https://cloud.tencent.com/developer/article/1057143