Decision Trees in sklearn

大龄coder

Categories: Reading Notes, Machine Learning. Tags: sklearn, decision tree, machine learning, DecisionTreeClassifier

Article link: https://blog.csdn.net/weixin_42341153/article/details/89078813

Parameters

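DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort=False)

This is the signature as of the scikit-learn release this post was written against; later releases renamed or removed some of these parameters, so check the documentation for the version you have installed.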

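Parameter descriptions:

criterion : string, optional (default="mse"). The function used to measure the quality of a split. Other options are "friedman_mse" (mean squared error with Friedman's improvement score) and "mae" (mean absolute error).
splitter : string, optional (default="best"). The strategy used to choose the split at each node. "best" picks the best split; "random" picks the best random split.
max_depth : int or None, optional (default=None). The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split : int, float, optional (default=2). The minimum number of samples required to split an internal node. An int is used as the sample count directly; a float is treated as a fraction, and ceil(min_samples_split * n_samples) samples are required.
min_samples_leaf : int, float, optional (default=1). The minimum number of samples required at a leaf node. An int is the sample count; a float is a fraction, giving ceil(min_samples_leaf * n_samples) samples.
min_weight_fraction_leaf : float, optional (default=0.). The minimum fraction of the total sample weight (summed over all input samples) required at a leaf node. Samples have equal weight when sample_weight is not provided.
max_features : int, float, string or None, optional (default=None). The number of features to consider when looking for the best split. An int means max_features features; a float is a fraction, giving int(max_features * n_features) features; "auto" means n_features; "sqrt" means sqrt(n_features); "log2" means log2(n_features); None means n_features.
random_state : int, RandomState instance or None, optional (default=None). An int is used as the seed of the random number generator; a RandomState instance is used as the generator itself; None uses the global np.random generator.
max_leaf_nodes : int or None, optional (default=None). Grow a tree with at most max_leaf_nodes leaves in best-first fashion, where the best nodes are those with the largest relative reduction in impurity. None means an unlimited number of leaf nodes.
min_impurity_decrease : float, optional (default=0.). A node is split if the split induces a decrease of the impurity greater than or equal to this value.
min_impurity_split : float, (default=1e-7). Threshold for early stopping in tree growth: a node is split if its impurity is above the threshold, otherwise it becomes a leaf. Deprecated; use min_impurity_decrease instead.
presort : bool, optional (default=False). Whether to presort the data to speed up the search for the best splits during fitting.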
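
To see how the size-related parameters interact, here is a minimal sketch that fits one unconstrained regressor and one constrained by max_depth, min_samples_leaf, and max_leaf_nodes. The synthetic sine data and the variable names are assumptions made for illustration, and criterion is left at its default because the name of that default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption, for illustration only).
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# Unconstrained tree: grows until the leaves cannot be split any further.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Constrained tree: cap the depth, the leaf size, and the number of leaves.
small_tree = DecisionTreeRegressor(
    max_depth=4,
    min_samples_leaf=5,
    max_leaf_nodes=8,
    random_state=0,
).fit(X, y)

print("full tree :", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("small tree:", small_tree.get_depth(), "levels,", small_tree.get_n_leaves(), "leaves")
```

Whichever limit is hit first stops the growth, so the constrained tree never exceeds any of the three bounds.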

Advantages and disadvantages

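Decision trees are a non-parametric supervised learning method that can be used for both classification and regression.
Their advantages are:

Simple to understand and interpret; trees can be visualized;
Require little data preparation;
The cost of using the tree (that is, predicting) is logarithmic in the number of data points used to train it;
Able to handle both continuous and discrete values;
Able to handle multi-output problems;
A white box model: the condition behind each decision can be read directly from the tree;
The model can be validated with statistical tests;
Perform well even if their assumptions are somewhat violated by the true model that generated the data.
Their disadvantages are:
Decision trees can overfit, producing over-complex trees that generalize poorly;
Decision trees can be unstable, because a small variation in the input data may result in a completely different tree;
Learning an optimal decision tree is an NP-complete problem, so practical algorithms rely on greedy heuristics that make locally optimal choices at each node;
Some concepts, such as XOR, parity, or multiplexer problems, are hard to learn because decision trees do not express them easily;
If the data is imbalanced and some classes dominate, the resulting tree will be biased.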
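
The overfitting point in the list above can be checked with a small experiment. The sketch below compares an unrestricted tree with a depth-limited one on a noisy synthetic dataset; the dataset settings (500 samples, 20 features, 10% label noise) are assumptions chosen only for illustration, and the exact scores will vary with them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic classification data (an assumption, for illustration only).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unrestricted tree: keeps splitting until every training leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Depth-limited tree: a simple form of pre-pruning.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(name,
          "train=%.3f" % model.score(X_train, y_train),
          "test=%.3f" % model.score(X_test, y_test))
```

The unrestricted tree memorizes the training set, label noise included, so the gap between its training and test scores is the quantity to watch.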

DecisionTreeClassifier in sklearn

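As with other classifiers, DecisionTreeClassifier takes two arrays as input: X, sparse or dense, of shape [n_samples, n_features], holding the training samples; and Y, an array of integer values of size [n_samples], holding the class labels of the training samples.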

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)

Once fitted, the model can be used to predict the class of samples:

>>> clf.predict([[2., 2.]])
array([1])

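It can also predict the probability of each class, which is the fraction of training samples of that class in the leaf the sample falls into: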

>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])

DecisionTreeClassifier can be used for binary classification as well as multiclass classification.
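
For the multiclass case, a minimal sketch on the three-class iris dataset looks like this. The max_depth=2 and random_state=0 settings are arbitrary choices for the illustration, not values from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

sample = iris.data[:1]               # first sample, kept as a 2-D array
proba = clf.predict_proba(sample)

print(clf.classes_)                  # class labels, in the column order of predict_proba
print(proba.shape)                   # one row per sample, one column per class
print(clf.predict(sample))           # the class with the highest probability
```

The columns of predict_proba follow clf.classes_, so the predicted label is the class whose column holds the largest value.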

Visualization

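A trained model can be exported in Graphviz format with the export_graphviz function for visualization. Install the package first:

conda install python-graphviz
The code below trains a classifier on the iris dataset and writes the rendered tree to the file "iris.pdf".
>>> import graphviz
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)
>>> dot_data = tree.export_graphviz(clf, out_file=None)
>>> graph = graphviz.Source(dot_data)
>>> graph.render("iris")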

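export_graphviz also supports several aesthetic options, including coloring nodes by their class (or by their value for regression) and showing explicit feature and class names, as follows: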

>>> dot_data = tree.export_graphviz(clf, out_file=None, 
...                      feature_names=iris.feature_names,  
...                      class_names=iris.target_names,  
...                      filled=True, rounded=True,  
...                      special_characters=True)  
>>> graph = graphviz.Source(dot_data)  
>>> graph
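
When Graphviz is not installed, a plain-text dump of the tree is often enough for a quick look. The sketch below uses sklearn.tree.export_text, which is available in more recent scikit-learn releases than the one this post was written against, so treat its availability in your environment as an assumption. The max_depth=2 setting only keeps the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Render the fitted tree as indented text rules, one line per node.
report = export_text(clf, feature_names=list(iris.feature_names))
print(report)
```

Each indentation level corresponds to one level of the tree, and leaf lines show the predicted class.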

Tips on practical use

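Decision trees tend to overfit on data with a large number of features, because a tree trained on few samples in a high-dimensional space is very likely to overfit;
Consider reducing the dimensionality beforehand (PCA, ICA, or feature selection) to give the tree a better chance of finding discriminative features;
Visualize the tree while training: start with max_depth=3 to get a feel for how the tree fits the data, then increase the depth;
The number of samples required to populate the tree doubles with each additional level, so use max_depth to control the size of the tree and prevent overfitting;
Use min_samples_split or min_samples_leaf to ensure that several samples inform every decision in the tree, by controlling which splits are considered. A very small value usually leads to overfitting, while a large value prevents the tree from learning the data. min_samples_leaf=5 is a reasonable initial value;
Balance the training data before fitting to prevent the tree from being biased toward the dominant classes;
If the samples are weighted, the weight-based pre-pruning criterion min_weight_fraction_leaf makes it easier to optimize the tree structure;
All decision trees use np.float32 arrays internally; if the training data is not in this format, a copy of the dataset is made;
If the input X is very sparse, it is recommended to convert it to a sparse csc_matrix before calling fit and to a sparse csr_matrix before calling predict.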
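
Two of the tips above, the min_samples_leaf=5 starting value and the sparse matrix formats, can be combined in one minimal sketch. The mostly-zero random matrix and the label rule are assumptions made up for illustration.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Mostly-zero feature matrix (synthetic; an assumption, for illustration only).
dense = rng.rand(200, 30) * (rng.rand(200, 30) < 0.1)
y = (dense[:, 0] + dense[:, 1] > 0).astype(int)

X_fit = csc_matrix(dense)    # column-oriented format, recommended for fit
X_pred = csr_matrix(dense)   # row-oriented format, recommended for predict

clf = DecisionTreeClassifier(min_samples_leaf=5, max_depth=3, random_state=0)
clf.fit(X_fit, y)
pred = clf.predict(X_pred)

# Leaves are the nodes without a left child (child index -1).
leaf_mask = clf.tree_.children_left == -1
print(clf.tree_.n_node_samples[leaf_mask])   # training samples in each leaf
```

Every leaf ends up with at least min_samples_leaf training samples, which is what makes the parameter an effective guard against splits driven by a handful of points.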

Example (plotting the decision surface of a decision tree on the iris dataset)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "ryb"
plot_step = 0.02

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = y == i  # boolean mask selecting the samples of class i
        # cmap is omitted: it has no effect when c is a single named color
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    edgecolor='black', s=15)

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend(loc='lower right', borderpad=0, handletextpad=0)
plt.axis("tight")
plt.show()

Reference: https://scikit-learn.org/stable/modules/tree.html?tdsourcetag=s_pctim_aiomsg