决策树

最新推荐文章于 2020-12-22 04:18:03 发布

u200710

最新推荐文章于 2020-12-22 04:18:03 发布

阅读量98

点赞数

分类专栏： scikit-learn 文章标签：决策树 python 机器学习

原文链接：https://scikit-learn.org/stable/modules/tree.html

版权

scikit-learn 专栏收录该内容

21 篇文章 1 订阅

订阅专栏

决策树

决策树是一类可用于分类和回归的无参数监督学习算法。其目标是为了构建一个模型，能够从数据特征中学习简单的决策规则，以此预测目标变量的值。

分类

DecisionTreeClassifier是一种能够执行多类别分类的算法。

# coding: utf-8
# Plot the decision surface of a decision tree on the iris dataset

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

n_classes = 3
plot_colors = "ryb"
plot_step = 0.02

iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    X = iris.data[:, pair]
    y = iris.target

    clf = DecisionTreeClassifier().fit(X, y)

    plt.subplot(2, 3, pairidx+1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])

    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    cmap=plt.cm.RdYlBu, edgecolor='black', s= 15)

plt.suptitle('Decision surface of a decision tree using paired features')
plt.legend(loc='lower right', borderpad=0, handletextpad=0)
plt.axis('tight')

plt.figure()
clf = DecisionTreeClassifier().fit(iris.data, iris.target)
plot_tree(clf, filled=True)
plt.show()

回归

决策树也能够应用于回归问题，使用DecisionTreeRegressor类

# coding: utf-8
# Decision Tree Regression

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

plt.figure()
plt.scatter(X, y, s=20, edgecolor="black",
            c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

多输出问题

一个多输出问题(Multi-output problem)是一个有多个输出预测的监督学习问题，此时Y的大小为[n_samples, n_outputs]。

当输出之间没有相关性时，最简单的方式是为每个输出构建独立的模型。但是，对于同一输入的输出值自身通常是相关的，一个常用的更好的方式是构建一个可以同时预测所有n个输出的模型。

决策树可以很容易地支持求解这类问题。但需要做如下变化：

将n个输出值保存在叶子节点中
使用划分准则计算所有n个输出上的平均减少值

DecisionTreeClassifier和DecisionTreeRegressor这两类可以完成上述工作。

# coding: utf-8
# Multi-output Decision Tree Regression

import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y[::5, :] += (0.5 - rng.rand(20, 2))

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_3 = DecisionTreeRegressor(max_depth=8)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)

X_test = np.arange(-100.0, 100.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)

plt.figure()
s = 25
plt.scatter(y[:, 0], y[:, 1], c='navy', s=s,
            edgecolor='black', label='data')
plt.scatter(y_1[:, 0], y_1[:, 1], c="cornflowerblue", s=s,
            edgecolor="black", label="max_depth=2")
plt.scatter(y_2[:, 0], y_2[:, 1], c="red", s=s,
            edgecolor="black", label="max_depth=5")
plt.scatter(y_3[:, 0], y_3[:, 1], c="orange", s=s,
            edgecolor="black", label="max_depth=8")
plt.xlim([-6, 6])
plt.ylim([-6, 6])
plt.xlabel('target 1')
plt.ylabel('target 2')
plt.title('Multi-output Decision Tree Regression')
plt.legend(loc='best')
plt.show()