python算法的缺陷和不足_决策树基本概念及算法优缺点

本文详细介绍了决策树的基本概念、结构、种类、算法(如ID3、C4.5、CART)及其优缺点。决策树是一种快速且易于理解的分类模型,适用于处理离散和连续数据,但易受样本不平衡和过拟合影响。同时,文章通过Python实现了熵和信息增益的计算,并展示了如何使用Scikit-Learn库构建决策树模型。
摘要由CSDN通过智能技术生成

1. 什么是决策树

分类决策树模型是一种描述对实例进行分类的树形结构. 决策树由结点和有向边组成. 结点有两种类型: 内部结点和叶节点. 内部节点表示一个特征或属性, 叶节点表示一个类.

决策树(Decision Tree),又称为判定树, 是一种以树结构(包括二叉树和多叉树)形式表达的预测分析模型.

通过把实例从根节点排列到某个叶子节点来分类实例

叶子节点为实例所属的分类

树上每个节点说明了对实例的某个属性的测试, 节点的每个后继分支对应于该属性的一个可能值

2.决策树结构

决策树结构.png

3.决策树种类

分类树--对离散变量做决策树

回归树--对连续变量做决策树

4.决策树算法(贪心算法)

有监督的学习

非参数学习算法

自顶向下递归方式构造决策树

在每一步选择中都采取在当前状态下最好/优的选择

决策树学习的算法通常是一个递归地选择最优特征, 并根据该特征对训练数据进行分割, 使得各个子数据集有一个最好的分类的过程.

在决策树算法中,ID3基于信息增益作为属性选择的度量, C4.5基于信息增益作为属性选择的度量, CART基于基尼指数作为属性选择的度量

5.决策树学习过程

特征选择

决策树生成: 递归结构, 对应于模型的局部最优

决策树剪枝: 缩小树结构规模, 缓解过拟合, 对应于模型的全局选择

6.决策树优缺点

优点:

(1)速度快: 计算量相对较小, 且容易转化成分类规则. 只要沿着树根向下一直走到叶, 沿途的分裂条件就能够唯一确定一条分类的谓词.

(2)准确性高: 挖掘出来的分类规则准确性高, 便于理解, 决策树可以清晰的显示哪些字段比较重要, 即可以生成可以理解的规则.

(3)可以处理连续和种类字段

(4)不需要任何领域知识和参数假设

(5)适合高维数据

缺点:

(1)对于各类别样本数量不一致的数据, 信息增益偏向于那些更多数值的特征

(2)容易过拟合

(3)忽略属性之间的相关性

5.2 决策树数学知识

1.信息论:

若一事假有k种结果, 对应概率为

, 则此事件发生后所得到的信息量I为:

2.熵:

给定包含关于某个目标概念的正反样例的样例集S, 那么S相对这个布尔型分类的熵为:

其中

代表正样例,

代表反样例

3.条件熵:

假设随机变量(X,Y), 其联合分布概率为P(X=xi,Y=yi)=Pij, i=1,2,...,n;j=1,2,..,m

则条件熵H(Y|X)表示在已知随机变量X的条件下随机变量Y的不确定性, 其定义为X在给定条件下Y的条件概率分布的熵对X的数学期望

5.3 决策树算法Hunt

在Hunt算法中, 通过递归的方式建立决策树.

如果数据集D种所有的数据都属于一个类, 那么将该节点标记为节点.

如果数据集D中包含属于多个类的训练数据, 那么选择一个属性将训练数据划分为较小的子集, 对于测试条件的每个输出, 创建一个子节点, 并根据测试结果将D种的记录分布到子节点中, 然后对每一个子节点重复1,2过程, 对子节点的子节点依然是递归地调用该算法, 直至最后停止.

5.4 决策树算法ID3

1.分类系统信息熵

2.条件熵

3.信息增益Gain(S, A) 定义

4.属性选择度量

使用信息增益, 选择最高信息增益的属性作为当前节点的测试属性

5.算法不足

使用ID3算法构建决策树时, 若出现各属性值取值数分布偏差大的情况, 分类精度会大打折扣

ID3算法本身并未给出处理连续数据的方法

ID3算法不能处理带有缺失值的数据集, 故在算法挖掘之前需要对数据集中的缺失值进行预处理

ID3算法只有树的生成, 所以该算法生成的树容易产生过拟合

6.算法流程

ID3(Examples,Target_attribute,Attributes)

Examples即训练样例集. Target_attribute是这棵树要预测的目标属性. Attributes是除目标属性外供学习到的决策树测试的属性列表. 返回能正确分类给定Examples的决策树.

创建树的Root结点

如果Examples都为正, 那么返回label=+的单节点树Root

如果Examples都为负, 那么返回label=-的单节点树Root

如果Attributes为空, 那么返回单节点树Root, label=Examples中最普通的Target_attribute值

否则

A ← Attributes中分类Examples能力最好*的属性

7.算法Python实现

Python实现熵的计算

from math import log

def calcShanNonEnt(dataSet):

numEntries = len(dataSet)

labelCounts = {}

for featVec in dataSet:

currentLabel = featVec[-1]

if currentLabel not in labelCounts.keys():

labelCounts[currentLabel] = 0

labelCounts[currentLabel] += 1

shannonEnt = 0.0

for key in labelCounts:

prob = float(labelCounts[key])/numEntries

shannonEnt -= prob*log(prob,2)

return shannonEnt

# example

dataset = [[1],[2],[3],[3],]

sne = calcShanNonEnt(dataset)

print(sne)

Sklearn.tree参数介绍及使用建议

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

# Examples

from sklearn.datasets import load_iris

from sklearn.model_selection import cross_val_score

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)

iris = load_iris()

cross_val_score(clf, iris.data, iris.target, cv=10)

from sklearn.model_selection import train_test_split

X_train,X_test, y_train, y_test = train_test_split(iris.data,iris.target,test_size=0.3)

res = clf.fit(X_train,y_train)

pre = clf.predict(X_test)

sco = clf.score(X_test, y_test)

print(y_test)

print(pre)

print(sco)

clf.apply(X_train)

clf.apply(X_test)

clf.decision_path(X_train)

type(clf.decision_path(X_train))

X_train.shape

clf.feature_importances_

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)

clf.feature_importances_

clf.get_params()

clf.predict_log_proba(X_test)

clf.predict_proba(X_test)

限制决策树层数为4的DecisionTreeClassifier实例

from itertools import product

import numpy as np

import matplotlib.pyplot as plt

from sklearn import datasets

from sklearn.tree import DecisionTreeClassifier

# 使用iris数据

iris = datasets.load_iris()

X = iris.data[:, [0, 2]]

y = iris.target

# 训练模型, 限制树的最大深度为4

clf = DecisionTreeClassifier(max_depth=4)

clf.fit(X,y)

# Plot

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, .1),

np.arange(y_min, y_max, .1))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=.4)

plt.scatter(X[:, 0], X[:, 1], c=y, alpha=.8)

plt.show()

output_12_0.png

This plot compares the decision surfaces learned by a dcision tree classifier(first column), by a random forest classifier(second column), by an extra-trees classifier(third column) and by an AdaBoost classifier(fouth column).

print(__doc__)

import numpy as np

import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap

from sklearn import clone

from sklearn.datasets import load_iris

from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,

AdaBoostClassifier)

from sklearn.tree import DecisionTreeClassifier

# Parameters

n_classes = 3

n_estimators = 30

cmap = plt.cm.RdYlBu

plot_step = 0.02

plot_step_coarser = 0.5

RANDOM_SEED = 13

# Load data

iris = load_iris()

plot_idx = 1

models = [DecisionTreeClassifier(max_depth=None),

RandomForestClassifier(n_estimators=n_estimators),

ExtraTreesClassifier(n_estimators=n_estimators),

AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=n_estimators)]

for pair in ([0,1], [0,2], [2,3]):

for model in models:

# print(pair, model)

# only take the two correspoding features

X = iris.data[:, pair]

y = iris.target

# Shuffle

idx = np.arange(X.shape[0])

np.random.seed(RANDOM_SEED)

np.random.shuffle(idx)

X = X[idx]

y = y[idx]

# Standardize

mean = X.mean(axis=0)

std = X.std(axis=0)

X = (X - mean) / std

# Train

clf = clone(model)

clf = model.fit(X, y)

scores = clf.score(X, y)

# Create a title for each column and the console by using str() and

# slicing away useless parts of the string

model_title = str(type(model)).split(".")[-1][:-2][:-len('Classifier')]

model_details = model_title

if hasattr(model, "estimators_"):

model_details += " with {} estimators".format(len(model.estimators_))

print(model_details + " with features", pair,

"has a score of", scores)

plt.subplot(3, 4, plot_idx)

if plot_idx <= len(models):

# Add a title at the top of eeach column

plt.title(model_title)

# Now plot the decision boundary using a fine mesh as input to a filled contour plot

x_min, x_max = X[:,0].min() - 1, X[:,0].max() + 1

y_min, y_max = X[:,1].min() - 1, X[:,1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),

np.arange(y_min, y_max, plot_step))

# Plot either a single DecisionTreeClassifier or alpha blend the

# decision surfaces of the ensemble of classifiers

if isinstance(model, DecisionTreeClassifier):

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

cs = plt.contourf(xx, yy, Z, cmap=cmap)

else:

# Choose alpha blend level with respect to the number of estimators

# that are in use (nothing that AdaBoost can use fewer estimtors

# than its maximum if it achieves a good enough fit early on)

estimator_alpha = 1.0 / len(model.estimators_)

print(len(model.estimators_))

for tree in model.estimators_:

Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

cs = plt.contourf(xx, yy, Z, alpha=estimator_alpha, cmap=cmap)

# Build a coarser grid to plot a set of ensemble classifications

# to show how these are different to what we see in the decision

# surfaces. These points are regularly space and do not have a

# black outline

xx_coarser, yy_coarser = np.meshgrid(

np.arange(x_min, x_max, plot_step_coarser),

np.arange(y_min, y_max, plot_step_coarser))

Z_points_coarser = model.predict(np.c_[xx_coarser.ravel(),

yy_coarser.ravel()]

).reshape(xx_coarser.shape)

cs_points = plt.scatter(xx_coarser, yy_coarser, s=15,

c=Z_points_coarser, cmap=cmap,

edgecolors="none")

plt.scatter(X[:, 0], X[:, 1], c=y,

cmap=ListedColormap(['r', 'y', 'b']),

edgecolor='k', s=20)

plot_idx += 1

Output:

Automatically created module for IPython interactive environment

DecisionTree with features [0, 1] has a score of 0.9266666666666666

RandomForest with 30 estimators with features [0, 1] has a score of 0.9266666666666666

30

ExtraTrees with 30 estimators with features [0, 1] has a score of 0.9266666666666666

30

AdaBoost with 30 estimators with features [0, 1] has a score of 0.84

30

DecisionTree with features [0, 2] has a score of 0.9933333333333333

RandomForest with 30 estimators with features [0, 2] has a score of 0.9933333333333333

30

ExtraTrees with 30 estimators with features [0, 2] has a score of 0.9933333333333333

30

AdaBoost with 30 estimators with features [0, 2] has a score of 0.9933333333333333

30

DecisionTree with features [2, 3] has a score of 0.9933333333333333

RandomForest with 30 estimators with features [2, 3] has a score of 0.9933333333333333

30

ExtraTrees with 30 estimators with features [2, 3] has a score of 0.9933333333333333

30

AdaBoost with 30 estimators with features [2, 3] has a score of 0.9933333333333333

30

output_14_1.png

A comparison of a several classifiers in scikit-learn on synthetic datasets.

The point of this examples is to illustrate the nature of decision boundaries of different classifiers.

Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.

print(__doc__)

# Code source: Gaël Varoquaux

# Andreas Müller

# Mmodified for documentation by Jaques Grobler

# License: BSD 3 clause

import numpy as np

import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.datasets import make_moons, make_circles, make_classification

# classifier

from sklearn.neural_network import MLPClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.svm import SVC

from sklearn.gaussian_process import GaussianProcessClassifier

from sklearn.gaussian_process.kernels import RBF

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from sklearn.naive_bayes import GaussianNB

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

h = .02 # step size in the mesh

names = ['Nearest Neighbors', 'Linear SVM', 'RBF SVM', 'Gaussian Process', 'Decision Tree', 'Random Forest', 'Neural Net', 'AdaBoost',' Naive Bayes','QDA']

classifiers = [

KNeighborsClassifier(3),

SVC(kernel="linear",C=0.025), # C is pantly parameter

SVC(gamma=2, C=1), # kernel: rbf(default), gamma: Kernel coefficient

GaussianProcessClassifier(1.0 * RBF(1.0)),

DecisionTreeClassifier(max_depth=5),

RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),

MLPClassifier(alpha=1), # MultiLayer Perceptron

AdaBoostClassifier(),

GaussianNB(), # Gaussian Naive Bayes

QuadraticDiscriminantAnalysis()

]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,

random_state=1, n_clusters_per_class=1)

rng = np.random.RandomState(2)

X += 2 * rng.uniform(size=X.shape)

linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),

make_circles(noise=0.2, factor=0.5, random_state=1),

linearly_separable

]

figure = plt.figure(figsize=(27,9))

i = 1

# iterate over datasets

for ds_cnt, ds in enumerate(datasets):

# preprocess datasset, split into training and test part

X, y = ds

X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = \

train_test_split(X,y,test_size=.4,random_state=42)

x_min, x_max = X[:,0].min() - .5, X[:,0].max() + .5

y_min, y_max = X[:,1].min() - .5, X[:,0].max() + .5

xx, yy = np.meshgrid(np.arange(x_min, x_max, h),

np.arange(y_min, y_max, h))

# just plot the dataset first

cm = plt.cm.RdBu

cm_bright = ListedColormap(['#FF0000', '#0000FF'])

ax = plt.subplot(len(datasets), len(classifiers) + 1, i)

if ds_cnt == 0:

ax.set_title("Input data")

# Plot the training points

ax.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap=cm_bright,

edgecolors='k')

# and testing points

ax.scatter(X_test[:,0], X_test[:,1], c=y_test, cmap=cm_bright, alpha=0.6,

edgecolors='k')

ax.set_xlim(xx.min(), xx.max())

ax.set_ylim(yy.min(), yy.max())

ax.set_xticks(())

ax.set_yticks(())

i += 1

# iterate over classifiers

for name, clf in zip(names, classifiers):

ax = plt.subplot(len(datasets), len(classifiers) + 1, i)

clf.fit(X_train, y_train)

score = clf.score(X_test, y_test)

# plot the decision boundary, For that, we will assign a color to each

# point in the mesh [x_min, x_max]*[y_min, y_max].

if hasattr(clf, 'decision_function'):

Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])

else:

Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

# Put the result into a color plot

Z = Z.reshape(xx.shape)

ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

# plot also the training points

ax.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap=cm_bright,

edgecolors='k')

# and testing points

ax.scatter(X_test[:,0], X_test[:,1], c=y_test, cmap=cm_bright,

edgecolors='k', alpha=.6)

ax.set_xlim(xx.min(), xx.max())

ax.set_ylim(yy.min(), yy.max())

ax.set_xticks(())

ax.set_yticks(())

if ds_cnt == 0:

ax.set_title(name)

ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),

size=15, horizontalalignment='right')

i += 1

plt.tight_layout()

plt.show()

Classifier comparison.png

This example fits an AdaBoost decisin stump on a non-linearly separable classification dataset composed of two "Gaussian quantiles" clusters and plots the decision boundary and decision scores.

print(__doc__)

# Author: Noel Dawe

#

# License; BSD 3 clause

import numpy as np

import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.datasets import make_gaussian_quantiles

# Construct dataset

X1, y1 = make_gaussian_quantiles(cov=2.,

n_samples=200, n_features=2,

n_classes=2, random_state=1)

X2, y2 = make_gaussian_quantiles(mean=(3,3), cov=1.5,

n_samples=300, n_features=2,

n_classes=2, random_state=1)

X = np.concatenate((X1, X2))

y = np.concatenate((y1, -y2 + 1))

# Create and fit an AdaBoosted decision tree

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),

algorithm='SAMME',

n_estimators=200)

bdt.fit(X, y)

plot_colors = 'br'

plot_step = .02

class_names = 'AB'

plt.figure(figsize=(10,5))

# plot the decision boundaries

plt.subplot(121)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),

np.arange(y_min, y_max, plot_step))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

plt.axis("tight")

# Plot the training points

for i, n, c in zip(range(2), class_names, plot_colors):

idx = np.where(y == i)

plt.scatter(X[idx, 0], X[idx, 1],

c=c, cmap=plt.cm.Paired,

s=20, edgecolor='k',

label=("Class %s" % n))

plt.xlim(x_min, x_max)

plt.ylim(y_min, y_max)

plt.legend(loc='upper right')

plt.xlabel('x')

plt.ylabel('y')

plt.title('Decision Boundary')

# Plot the two-class decision scors

twoclass_output = bdt.decision_function(X)

plot_range = (twoclass_output.min(), twoclass_output.max())

plt.subplot(122)

for i, n, c in zip(range(2), class_names, plot_colors):

plt.hist(twoclass_output[y == i],

bins=10,

range=plot_range,

facecolor=c,

label=('Class %s' % n),

alpha=.5,

edgecolor='k')

x1, x2, y1, y2 = plt.axis()

plt.axis((x1, x2, y1, y2 * 1.2))

plt.legend(loc='upper right')

plt.ylabel('Samples')

plt.xlabel('Score')

plt.title('Decision Scores')

plt.tight_layout()

plt.subplots_adjust(wspace=0.35)

plt.show()

Output:

Automatically created module for IPython interactive environment

Two-class AdaBoost.png

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值