Cross-Validation and Decision Tree Plotting with sklearn

1. Importing the required datasets and preparation for plotting the decision tree

make_blobs is called to generate a synthetic classification or clustering dataset. Its main parameters:
n_samples: total number of samples to generate
n_features: number of features per sample
centers: number of cluster centers, i.e. the number of label classes
random_state: random seed, so the generated data can be reproduced
cluster_std: standard deviation of each cluster (default 1.0)
shuffle: whether to randomly shuffle the samples
# import the dataset generator
from sklearn.datasets import make_blobs
help(make_blobs)
Help on function make_blobs in module sklearn.datasets._samples_generator:

make_blobs(n_samples=100, n_features=2, *, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False)
    Generate isotropic Gaussian blobs for clustering.
    
    Read more in the :ref:`User Guide <sample_generators>`.
    
    Parameters
    ----------
    n_samples : int or array-like, default=100
        If int, it is the total number of points equally divided among
        clusters.
        If array-like, each element of the sequence indicates
        the number of samples per cluster.
    
        .. versionchanged:: v0.20
            one can now pass an array-like to the ``n_samples`` parameter
    
    n_features : int, default=2
        The number of features for each sample.
    
    centers : int or ndarray of shape (n_centers, n_features), default=None
        The number of centers to generate, or the fixed center locations.
        If n_samples is an int and centers is None, 3 centers are generated.
        If n_samples is array-like, centers must be
        either None or an array of length equal to the length of n_samples.
    
    cluster_std : float or array-like of float, default=1.0
        The standard deviation of the clusters.
    
    center_box : tuple of float (min, max), default=(-10.0, 10.0)
        The bounding box for each cluster center when centers are
        generated at random.
    
    shuffle : bool, default=True
        Shuffle the samples.
    
    random_state : int, RandomState instance or None, default=None
        Determines random number generation for dataset creation. Pass an int
        for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.
    
    return_centers : bool, default=False
        If True, then return the centers of each cluster
    
        .. versionadded:: 0.23
    
    Returns
    -------
    X : ndarray of shape (n_samples, n_features)
        The generated samples.
    
    y : ndarray of shape (n_samples,)
        The integer labels for cluster membership of each sample.
    
    centers : ndarray of shape (n_centers, n_features)
        The centers of each cluster. Only returned if
        ``return_centers=True``.
    
    Examples
    --------
    >>> from sklearn.datasets import make_blobs
    >>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
    ...                   random_state=0)
    >>> print(X.shape)
    (10, 2)
    >>> y
    array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
    >>> X, y = make_blobs(n_samples=[3, 3, 4], centers=None, n_features=2,
    ...                   random_state=0)
    >>> print(X.shape)
    (10, 2)
    >>> y
    array([0, 1, 2, 0, 2, 2, 2, 1, 1, 0])
    
    See Also
    --------
    make_classification : A more intricate variant.

data = make_blobs(n_samples = 200, centers = 2, random_state = 8)
print(data)
(array([[ 6.75445054,  9.74531933],
       [ 6.80526026, -0.2909292 ],
       [ 7.07978644,  7.81427747],
       [ 6.87472003, -0.16069949],
       [ 8.06164078,  8.43736968],
       [ 7.4934131 , 11.00892356],
       [ 4.69777002,  0.59687317],
       [ 9.19642422, 11.57536954],
       [ 8.80996213, 11.9021701 ],
       [ 7.5952749 ,  1.32739544],
       [ 8.20330317,  1.27929111],
       [ 8.59258191, -0.29022607],
       [ 6.89228905,  8.60634293],
       [ 8.00405631, 10.53695374],
       [ 8.14715032,  2.09399376],
       [ 7.06363179, -0.57743891],
       [ 6.34526126,  8.70677779],
       [ 5.28435774, 10.16972385],
       [ 6.62257531,  2.04423066],
       [ 7.40314915, 10.42342437],
       [ 7.27423265,  9.18459991],
       [ 8.77188508,  0.768341  ],
       [ 6.39995999,  0.07580004],
       [ 7.44636985, 11.43674954],
       [ 7.74488453,  0.14409178],
       [ 9.10088858,  9.14807411],
       [ 8.10044749,  0.7596783 ],
       [ 8.73747674,  2.0086222 ],
       [ 6.51876894, -1.36881715],
       [ 7.16251356,  9.74878714],
       [ 6.57119411, -0.74277359],
       [ 7.1354011 , -0.63951267],
       [ 7.31294296,  9.92166331],
       [ 7.52733204,  0.2744698 ],
       [ 6.0160163 ,  0.53637761],
       [ 6.73117031,  1.20886838],
       [ 6.11962018,  0.21527805],
       [ 7.88579276,  0.78743005],
       [ 7.32112244,  0.78510422],
       [ 7.62051584,  9.37144814],
       [ 6.96767867,  8.9622523 ],
       [ 8.51730001, -0.42711053],
       [ 7.92672195,  0.44823051],
       [ 5.52161775,  7.98446372],
       [ 6.93568163,  0.50274121],
       [ 7.89765814,  8.21954764],
       [ 7.40292703,  9.16217702],
       [ 8.28827095, 10.71730803],
       [ 7.33912656, -0.07533921],
       [ 5.27801757,  8.93474119],
       [ 5.57550594,  0.4274511 ],
       [ 8.67425268, -0.37860274],
       [ 7.55303352, 11.85706105],
       [ 6.84661976, -0.85945209],
       [ 6.26977193,  2.11033394],
       [ 7.09962807,  0.5655205 ],
       [ 5.5987887 ,  7.59170022],
       [ 8.0060449 ,  0.80933758],
       [ 6.85769503, 10.30105929],
       [ 6.19399963,  8.19786954],
       [ 8.68173394,  0.54980379],
       [ 5.82259795,  8.88727231],
       [ 5.30528133,  0.29441074],
       [ 6.89703841,  7.98081009],
       [ 5.9389756 ,  1.19214956],
       [ 7.13760133,  9.84345464],
       [ 7.51718983,  1.31532401],
       [ 8.08034605, 10.02847377],
       [ 6.89078889, 10.61298902],
       [ 6.95802459,  9.19924611],
       [ 8.91111219,  9.14933265],
       [ 7.57818277,  9.58629233],
       [ 6.24007751,  0.55847799],
       [ 7.79924692, 10.59576952],
       [ 7.49985237,  9.55274284],
       [ 9.94109903,  9.22395667],
       [ 7.07232613,  1.26533062],
       [ 7.50126258,  0.62517001],
       [ 6.63110319,  2.65308097],
       [ 6.6060513 ,  3.19799895],
       [ 8.81545663,  8.76386046],
       [ 6.5688005 ,  0.09522898],
       [ 9.15668309,  9.59459888],
       [ 7.45637594,  0.24440634],
       [ 7.29548244, -0.22293119],
       [ 8.20316159, 12.01375618],
       [ 6.97321804,  2.576281  ],
       [ 6.42049196,  0.26683712],
       [ 7.40783871,  6.93633083],
       [ 6.54464509,  0.89987351],
       [ 7.58423725, 10.70124388],
       [ 8.80002143,  8.54323521],
       [ 7.1847723 ,  2.22950427],
       [ 7.80361128,  9.74561264],
       [ 7.96481592,  8.03914659],
       [ 6.6571269 ,  7.72756233],
       [ 7.29433984,  9.79486468],
       [ 7.237824  ,  1.70291874],
       [ 8.37153676,  0.98810496],
       [ 6.49932355,  0.24955722],
       [ 9.02255525, 10.06777901],
       [ 7.61227907,  9.4463627 ],
       [ 8.89464606, 10.29806397],
       [ 7.01747287, -1.22016798],
       [ 8.10434971,  1.83659293],
       [ 7.68373899,  1.5632695 ],
       [ 9.43042008,  0.68726533],
       [ 6.26211747,  1.577057  ],
       [ 9.59017028,  0.58441955],
       [ 7.82182216,  0.52633087],
       [ 7.6025272 ,  8.98962387],
       [ 8.48011698,  0.69122126],
       [ 7.63890536, -0.06731493],
       [ 5.84965451,  0.72241791],
       [ 7.46996922,  8.44935323],
       [ 6.8117005 , 10.8840413 ],
       [ 8.67502392,  0.37561206],
       [ 8.12519495,  1.67159478],
       [ 5.07337492, 10.52482973],
       [ 7.48665378,  0.21345453],
       [ 8.11950967,  0.56120493],
       [ 6.15895483,  8.70208685],
       [ 7.94310647,  8.20622208],
       [ 7.95311372,  8.36897664],
       [ 4.96938735,  1.32531048],
       [ 8.8583269 , -0.34648253],
       [10.01367527, 10.52089453],
       [ 8.99334153,  9.7313491 ],
       [ 8.22871505,  1.23014656],
       [ 6.19407512, -0.03183561],
       [ 7.26697254,  9.87045836],
       [ 7.94970781, -0.37340645],
       [ 5.62803952,  9.77585443],
       [ 8.50049461,  9.12147855],
       [ 7.31054144,  0.39102866],
       [ 7.49814373,  9.29677019],
       [ 8.32245091,  9.67819196],
       [ 8.32813617,  9.14002426],
       [ 7.56475962, 11.24762868],
       [ 7.92129785,  0.78018447],
       [ 8.00236864, 10.1691733 ],
       [ 4.33366829, 10.51034676],
       [ 6.02937898, 10.31974057],
       [ 6.88953097,  0.80526874],
       [ 7.51239046,  2.06597042],
       [ 9.17061801, 10.37690696],
       [ 7.63027116,  8.69797933],
       [ 8.35312192,  0.20325714],
       [ 8.72578696, 10.34691678],
       [ 5.44099009,  1.59585563],
       [ 7.56093115, -0.51702689],
       [ 6.02376341, -0.52025947],
       [ 7.15013321,  9.52893935],
       [ 7.56833386,  9.32443309],
       [ 7.09022949,  8.57919798],
       [ 5.94356564,  0.6092466 ],
       [ 6.25817082,  9.79505477],
       [ 5.94205586, 10.50768333],
       [ 7.82510107,  8.41865266],
       [ 5.88994248,  2.1198068 ],
       [ 6.40269472,  0.08495368],
       [ 7.64534862, -1.89105765],
       [ 6.8830708 ,  1.38045511],
       [ 7.24044576,  1.07171623],
       [ 9.4035308 ,  8.09592099],
       [ 6.55819206,  8.84793239],
       [ 6.58341965,  8.42678679],
       [ 7.83939881, -0.10906103],
       [ 7.22095192,  8.06544414],
       [ 7.8440213 , 10.29060403],
       [ 7.39634594,  8.90196559],
       [ 9.10772988, -0.06937041],
       [ 6.93540782,  1.74268311],
       [ 7.9465776 , -0.37622421],
       [ 7.92430026,  0.10451121],
       [ 6.79156708,  0.47231026],
       [ 6.28516091, 11.28717687],
       [ 7.54257819,  7.02403019],
       [ 7.40565933,  8.8292448 ],
       [ 7.51463404, 10.14107588],
       [ 6.40863862,  0.09433704],
       [ 6.5342397 ,  9.45532341],
       [ 5.17209648, 11.78064756],
       [ 5.49953213,  9.04384494],
       [ 9.86936252,  0.76402347],
       [ 7.84725158, -0.25808463],
       [ 8.14330144,  1.05961829],
       [ 7.28724996,  7.620998  ],
       [ 6.0888764 , -0.01613322],
       [ 7.59635095,  8.0197955 ],
       [ 6.71388804,  1.38741885],
       [ 7.3307687 ,  0.97105895],
       [ 8.18240421,  8.16999978],
       [ 8.53178848,  1.68305022],
       [ 6.91511696,  8.64812384],
       [ 7.82944816,  9.62627158],
       [ 6.09382282,  9.38044447],
       [ 7.24211001,  7.48506871],
       [ 8.2634157 , 10.34723435],
       [ 8.39800148,  2.8397151 ]]), array([0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 1]))
X, y = data  # unpack into features (X) and labels (y)
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = plt.cm.spring, edgecolors = 'k')
<matplotlib.collections.PathCollection at 0xaf15f70>

# import the iris dataset
from sklearn.datasets import load_iris
iris  = load_iris()
# import the boston dataset (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
from sklearn.datasets import load_boston
boston = load_boston()
class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
Scales the data to a specified range
class sklearn.preprocessing.MaxAbsScaler(copy=True)
Scales the data so that the maximum absolute value of each feature is 1
# rescale the boston data to the range (10, 100)
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler(feature_range=(10,100))  # instantiate
mms.fit(boston.data)  # fit only computes each feature's min and max (the preparation step)
MinMaxScaler(feature_range=(10, 100))
boston_mms = mms.transform(boston.data)
mms2 = MinMaxScaler(feature_range=(10,100), copy = False)
mms2.fit_transform(boston.data)
array([[ 10.        ,  26.2       ,  16.10337243, ...,  35.85106383,
        100.        ,  18.07119205],
       [ 10.02123303,  10.        ,  31.80718475, ...,  59.78723404,
        100.        ,  28.40231788],
       [ 10.0212128 ,  10.        ,  31.80718475, ...,  59.78723404,
         99.07635282,  15.71192053],
       ...,
       [ 10.05507032,  10.        ,  47.84090909, ...,  90.42553191,
        100.        ,  19.7102649 ],
       [ 10.10446569,  10.        ,  47.84090909, ...,  90.42553191,
         99.21705583,  21.79635762],
       [ 10.04156575,  10.        ,  47.84090909, ...,  90.42553191,
        100.        ,  25.27317881]])
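MinMaxScaler applies X_scaled = (X - X_min) / (X_max - X_min) * (range_max - range_min) + range_min per feature. A minimal sketch checking this formula on a toy single-feature array (the toy data and variable names below are illustrative, not part of the original notebook):
import numpy as np
from sklearn.preprocessing import MinMaxScaler
toy = np.array([[1.0], [3.0], [5.0]])  # one feature: min = 1, max = 5
scaled = MinMaxScaler(feature_range=(10, 100)).fit_transform(toy)
manual = (toy - toy.min()) / (toy.max() - toy.min()) * (100 - 10) + 10
print(np.allclose(scaled, manual))  # expected: True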
from sklearn.preprocessing import MaxAbsScaler
mas = MaxAbsScaler()  # divides each feature by its maximum absolute value; non-negative data ends up in [0, 1]
mas.fit_transform(boston.data)  # the original data is left unchanged because copy=True (the default)
array([[0.1       , 0.262     , 0.16103372, ..., 0.35851064, 1.        ,
        0.18071192],
       [0.10021233, 0.1       , 0.31807185, ..., 0.59787234, 1.        ,
        0.28402318],
       [0.10021213, 0.1       , 0.31807185, ..., 0.59787234, 0.99076353,
        0.15711921],
       ...,
       [0.1005507 , 0.1       , 0.47840909, ..., 0.90425532, 1.        ,
        0.19710265],
       [0.10104466, 0.1       , 0.47840909, ..., 0.90425532, 0.99217056,
        0.21796358],
       [0.10041566, 0.1       , 0.47840909, ..., 0.90425532, 1.        ,
        0.25273179]])
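Because MaxAbsScaler only divides by the maximum absolute value, negative entries keep their sign and map into [-1, 0). A small illustrative sketch on toy data (not from the original notebook):
from sklearn.preprocessing import MaxAbsScaler
import numpy as np
toy = np.array([[-4.0], [2.0], [8.0]])  # max absolute value is 8
print(MaxAbsScaler().fit_transform(toy))  # expected: [[-0.5], [0.25], [1.0]]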
Normalization: scaling each sample vector to unit norm
sklearn.preprocessing.normalize(
    X, axis=1, copy=True,
    norm='l2' : 'l1', 'l2' or 'max', the norm used for normalization,
    return_norm=False : whether to also return the norms that were used
)
from sklearn.preprocessing import normalize
help(normalize)
Help on function normalize in module sklearn.preprocessing._data:

normalize(X, norm='l2', *, axis=1, copy=True, return_norm=False)
    Scale input vectors individually to unit norm (vector length).
    
    Read more in the :ref:`User Guide <preprocessing_normalization>`.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The data to normalize, element by element.
        scipy.sparse matrices should be in CSR format to avoid an
        un-necessary copy.
    
    norm : {'l1', 'l2', 'max'}, default='l2'
        The norm to use to normalize each non zero sample (or each non-zero
        feature if axis is 0).
    
    axis : {0, 1}, default=1
        axis used to normalize the data along. If 1, independently normalize
        each sample, otherwise (if 0) normalize each feature.
    
    copy : bool, default=True
        set to False to perform inplace row normalization and avoid a
        copy (if the input is already a numpy array or a scipy.sparse
        CSR matrix and if axis is 1).
    
    return_norm : bool, default=False
        whether to return the computed norms
    
    Returns
    -------
    X : {ndarray, sparse matrix} of shape (n_samples, n_features)
        Normalized input X.
    
    norms : ndarray of shape (n_samples, ) if axis=1 else (n_features, )
        An array of norms along given axis for X.
        When X is sparse, a NotImplementedError will be raised
        for norm 'l1' or 'l2'.
    
    See Also
    --------
    Normalizer : Performs normalization using the Transformer API
        (e.g. as part of a preprocessing :class:`~sklearn.pipeline.Pipeline`).
    
    Notes
    -----
    For a comparison of the different scalers, transformers, and normalizers,
    see :ref:`examples/preprocessing/plot_all_scaling.py
    <sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.

X1 = [[1,1,2],[2,2,4]]
normalize(X1,
          norm = 'l2' # choose the norm type (the default)
         )
array([[0.40824829, 0.40824829, 0.81649658],
       [0.40824829, 0.40824829, 0.81649658]])
normalize(X1, 
          norm = 'l1', # choose the norm type
          return_norm=True # also return the norm of each sample vector
         )
(array([[0.25, 0.25, 0.5 ],
        [0.25, 0.25, 0.5 ]]),
 array([4., 8.]))
Standardization that allows for outliers
Robust scaling
Uses the median and a percentile range (the interquartile range by default) in place of the mean and standard deviation
Better suited to data known to contain outliers
sklearn.preprocessing.robust_scale(
    X, axis=0, with_centering=True, with_scaling=True,
    quantile_range=(25.0, 75.0) : percentiles used to measure the spread,
    copy=True
)
class sklearn.preprocessing.RobustScaler(
    with_centering=True, with_scaling=True,
    quantile_range=(25.0, 75.0), copy=True
)
# robust scaling
from sklearn.preprocessing import robust_scale
from sklearn.preprocessing import RobustScaler
robust_scale(boston.data)
array([[-0.06959315,  1.44      , -0.57164988, ..., -1.33928571,
         0.26190191, -0.63768116],
       [-0.06375455,  0.        , -0.20294345, ..., -0.44642857,
         0.26190191, -0.22188906],
       [-0.06376011,  0.        , -0.20294345, ..., -0.44642857,
         0.06667466, -0.73263368],
       ...,
       [-0.05445006,  0.        ,  0.17350891, ...,  0.69642857,
         0.26190191, -0.57171414],
       [-0.04086745,  0.        ,  0.17350891, ...,  0.69642857,
         0.09641444, -0.48775612],
       [-0.05816351,  0.        ,  0.17350891, ...,  0.69642857,
         0.26190191, -0.34782609]])
rs = RobustScaler()  # instantiate
rs.fit_transform(boston.data)
array([[-0.06959315,  1.44      , -0.57164988, ..., -1.33928571,
         0.26190191, -0.63768116],
       [-0.06375455,  0.        , -0.20294345, ..., -0.44642857,
         0.26190191, -0.22188906],
       [-0.06376011,  0.        , -0.20294345, ..., -0.44642857,
         0.06667466, -0.73263368],
       ...,
       [-0.05445006,  0.        ,  0.17350891, ...,  0.69642857,
         0.26190191, -0.57171414],
       [-0.04086745,  0.        ,  0.17350891, ...,  0.69642857,
         0.09641444, -0.48775612],
       [-0.05816351,  0.        ,  0.17350891, ...,  0.69642857,
         0.26190191, -0.34782609]])
S-fold cross-validation (k-fold CV for short)
S is a hyperparameter: the data is split into S folds and the model is trained S times, each fold serving once as the validation set
This is fairer than a single train/validation split
The extreme case is leave-one-out cross-validation (LOOCV)
LOOCV holds out a single data point; holding out P points instead gives leave-P-out cross-validation (LPOCV). A small sketch of both follows.
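A minimal sketch of these extremes with sklearn's LeaveOneOut and LeavePOut splitters, run here on iris purely for illustration (the X_iris/y_iris names are mine, not from the original notebook):
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, LeavePOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier
X_iris, y_iris = load_iris(return_X_y=True)
loo_scores = cross_val_score(DecisionTreeClassifier(), X_iris, y_iris, cv=LeaveOneOut())
print(len(loo_scores))     # one score per sample: 150
print(loo_scores.mean())   # average accuracy over all leave-one-out rounds
lpo = LeavePOut(p=2)       # LPOCV: every possible pair is held out, which gets expensive quickly
print(lpo.get_n_splits(X_iris))  # C(150, 2) = 11175 candidate splits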
Splitting into training and test sets
sklearn.model_selection.train_test_split(
    *arrays : the data objects to split; several can be passed at once, but they must all have the same length
    test_size = 0.25 : float, int or None, the proportion (0-1) or number of samples used for testing;
        if None, it is set to the complement of train_size (and defaults to 0.25 when train_size is also None)
    train_size = None : float, int or None, the proportion (0-1) or number of samples used for training;
        if None, it is computed from test_size
    random_state = None : random seed
    shuffle = True : whether to shuffle the samples before splitting
    stratify = None : array-like or None, class labels used to make a stratified split
) Returns: a list of the split objects, with length = 2 * len(arrays)
# split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size = 0.3)
len(X_train)
354
len(boston.data)
506
len(y_train)
354
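The stratify parameter described above is not exercised in this notebook. A minimal sketch of a stratified split on the iris data loaded earlier, so that each class keeps its one-third share in both subsets (the Xi_*/yi_* names are illustrative):
from collections import Counter
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    iris.data, iris.target, test_size = 0.3, stratify = iris.target, random_state = 0)
print(Counter(yi_train))  # 35 samples per class
print(Counter(yi_test))   # 15 samples per class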
Cross-validation: splitting and evaluation in a single call
sklearn.model_selection provides
cross_val_score : splits the data and returns one score per fold
    estimator : the estimator object used to fit the data
    X : array-like, the data matrix used to fit the model
cross_validate : evaluates several metrics at once
cross_val_predict : returns out-of-fold predictions from the cross-validated models
(cross_validate and cross_val_predict are sketched after the iris example in part 2)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
scores = cross_val_score(reg, boston.data, boston.target, cv = 10)
scores
array([ 0.73376082,  0.4730725 , -1.00631454,  0.64113984,  0.54766046,
        0.73640292,  0.37828386, -0.12922703, -0.76843243,  0.4189435 ])
scores.mean(), scores.std()
(0.20252899006055367, 0.5952960169512383)
The boston dataset is stored in a sorted order, so consecutive folds are unrepresentative and the scores vary widely
# shuffle the dataset so that the folds are split evenly
import numpy as np
X, y = boston.data, boston.target
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X, y = X[indices], y[indices]
reg = LinearRegression()
scores = cross_val_score(reg, X, y, cv = 10)
scores
array([0.77212498, 0.79470905, 0.59899391, 0.80717087, 0.76007414,
       0.75699564, 0.72688181, 0.24256808, 0.6518304 , 0.66100191])
scores.mean(), scores.std()
(0.6772350793373447, 0.1585378148669398)
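Instead of shuffling the arrays by hand as above, a shuffled splitter can be handed to cross_val_score directly. A minimal sketch with KFold (the random_state value is arbitrary); the scores should be comparable to the manually shuffled result:
from sklearn.model_selection import KFold
kf = KFold(n_splits = 10, shuffle = True, random_state = 8)
kf_scores = cross_val_score(LinearRegression(), boston.data, boston.target, cv = kf)
kf_scores.mean(), kf_scores.std()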
Creating a decision tree with sklearn
class sklearn.tree.DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
ct = DecisionTreeClassifier()  # instantiate
ct.fit(iris.data, iris.target)  # fit the model
DecisionTreeClassifier()
ct.max_features_
4
ct.feature_importances_  # feature importance scores
array([0.01333333, 0.        , 0.06405596, 0.92261071])
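The importance scores are easier to read when paired with the corresponding feature names; a small illustrative snippet:
for name, score in zip(iris.feature_names, ct.feature_importances_):
    print(name, round(score, 3))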
ct.predict(iris.data)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
# classification_report prints the most commonly used classification metrics, handy for comparing models
from sklearn.metrics import classification_report
print(classification_report(iris.target, ct.predict(iris.data)))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50
           2       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150
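Note that the perfect scores above are training-set performance: an unpruned decision tree can memorize the data it was fitted on. A minimal sketch of evaluating on a held-out split instead (the Xtr/Xte/ytr/yte and ct2 names are illustrative):
Xtr, Xte, ytr, yte = train_test_split(iris.data, iris.target, test_size = 0.3, random_state = 0)
ct2 = DecisionTreeClassifier().fit(Xtr, ytr)
print(classification_report(yte, ct2.predict(Xte)))  # usually still high, but no longer guaranteed to be 1.00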

# presenting classification results: the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(iris.target, ct.predict(iris.data), labels = [2,1,0])  # output the confusion matrix with a custom class order
cm
array([[50,  0,  0],
       [ 0, 50,  0],
       [ 0,  0, 50]], dtype=int64)
# show it as a heatmap
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(cm, cmap = sns.color_palette("Blues"), annot = True)
<AxesSubplot:>
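Because the confusion matrix was built with labels=[2, 1, 0], the heatmap's default 0/1/2 tick indices no longer match the class labels. A small sketch that labels the axes explicitly (reusing the cm computed above):
sns.heatmap(cm, cmap = sns.color_palette("Blues"), annot = True,
            xticklabels = [2, 1, 0], yticklabels = [2, 1, 0])
plt.xlabel("predicted label")
plt.ylabel("true label")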


2. Plotting a decision tree using the iris dataset as an example


# import the iris dataset
from sklearn.datasets import load_iris
iris  = load_iris()
import numpy as np
X, y = iris.data, iris.target
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X, y = X[indices], y[indices]
# split the iris data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.3)
# create the decision tree
from sklearn.tree import DecisionTreeClassifier
rt = DecisionTreeClassifier()  # instantiate
rt.fit(iris.data, iris.target)  # note: fitted on the full dataset here, not just X_train
DecisionTreeClassifier()
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
import graphviz
reg = LinearRegression()
scores = cross_val_score(reg, X, y, cv = 10)
scores
array([0.86316316, 0.87764635, 0.90032253, 0.89369341, 0.94963924,
       0.96141896, 0.93654241, 0.93546444, 0.88819228, 0.95601217])
scores.mean(), scores.std()
(0.916209493818017, 0.03372720008929205)
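cross_validate and cross_val_predict were imported above but never called. A minimal sketch of both on the same shuffled arrays, using standard sklearn scoring strings:
cv_results = cross_validate(reg, X, y, cv = 10,
                            scoring = ('r2', 'neg_mean_squared_error'))
cv_results['test_r2'].mean()
y_pred = cross_val_predict(reg, X, y, cv = 10)  # out-of-fold prediction for every sample
y_pred.shape  # (150,)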
rt.max_features_
4
rt.feature_importances_
array([0.        , 0.01333333, 0.56405596, 0.42261071])
rt.predict(iris.data)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
from sklearn.metrics import classification_report
print(classification_report(iris.target, rt.predict(iris.data)))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50
           2       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150

from sklearn.tree import export_graphviz 
dot_data = export_graphviz(rt, 
               feature_names = iris.feature_names,
               class_names = iris.target_names)
graph = graphviz.Source(dot_data)
graph

Below is the resulting decision tree:
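Since the inline rendering depends on a working Graphviz installation, the tree can also be written to a file with graph.render, or drawn with sklearn's own plot_tree; a minimal sketch (the output filename is arbitrary):
graph.render("iris_tree")  # writes iris_tree.pdf next to the notebook (requires the Graphviz binaries)
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize = (12, 8))
plot_tree(rt, feature_names = iris.feature_names,
          class_names = list(iris.target_names), filled = True)
plt.show()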
