class1: Decision Trees, Classification Tree Basics (detailed version)

-------------------------------class1: Classification Tree-------------------------------------
from sklearn import tree
from sklearn.datasets import load_wine     # datasets: sklearn's built-in toy datasets (Boston housing, iris, wine, ...)
from sklearn.model_selection import train_test_split   # splits data into training and test sets
wine = load_wine()   # load the dataset
wine     # a dict-like Bunch object {  }
{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2]),
 'frame': None,
 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7'),
 'DESCR': '.. _wine_dataset:\n\nWine recognition dataset\n------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 178 (50 in each of three classes)\n    :Number of Attributes: 13 numeric, predictive attributes and the class\n    :Attribute Information:\n \t\t- Alcohol\n \t\t- Malic acid\n \t\t- Ash\n\t\t- Alcalinity of ash  \n \t\t- Magnesium\n\t\t- Total phenols\n \t\t- Flavanoids\n \t\t- Nonflavanoid phenols\n \t\t- Proanthocyanins\n\t\t- Color intensity\n \t\t- Hue\n \t\t- OD280/OD315 of diluted wines\n \t\t- Proline\n\n    - class:\n            - class_0\n            - class_1\n            - class_2\n\t\t\n    :Summary Statistics:\n    \n    ============================= ==== ===== ======= =====\n                                   Min   Max   Mean     SD\n    ============================= ==== ===== ======= =====\n    Alcohol:                      11.0  14.8    13.0   0.8\n    Malic Acid:                   0.74  5.80    2.34  1.12\n    Ash:                          1.36  3.23    2.36  0.27\n    Alcalinity of Ash:            10.6  30.0    19.5   3.3\n    Magnesium:                    70.0 162.0    99.7  14.3\n    Total Phenols:                0.98  3.88    2.29  0.63\n    Flavanoids:                   0.34  5.08    2.03  1.00\n    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12\n    Proanthocyanins:              0.41  3.58    1.59  0.57\n    Colour Intensity:              1.3  13.0     5.1   2.3\n    Hue:                          0.48  1.71    0.96  0.23\n    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71\n    Proline:                       278  1680     746   315\n    ============================= ==== ===== ======= =====\n\n    :Missing Attribute Values: None\n    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThis is a copy of UCI ML Wine recognition datasets.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\n\nThe data is the results of a chemical analysis of wines grown in the same\nregion in Italy by three different cultivators. There are thirteen different\nmeasurements taken for different constituents found in the three types of\nwine.\n\nOriginal Owners: \n\nForina, M. et al, PARVUS - \nAn Extendible Package for Data Exploration, Classification and Correlation. \nInstitute of Pharmaceutical and Food Analysis and Technologies,\nVia Brigata Salerno, 16147 Genoa, Italy.\n\nCitation:\n\nLichman, M. (2013). UCI Machine Learning Repository\n[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,\nSchool of Information and Computer Science. \n\n.. topic:: References\n\n  (1) S. Aeberhard, D. Coomans and O. de Vel, \n  Comparison of Classifiers in High Dimensional Settings, \n  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  \n  Mathematics and Statistics, James Cook University of North Queensland. \n  (Also submitted to Technometrics). \n\n  The data was used with many others for comparing various \n  classifiers. The classes are separable, though only RDA \n  has achieved 100% correct classification. \n  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) \n  (All results using the leave-one-out technique) \n\n  (2) S. Aeberhard, D. Coomans and O. de Vel, \n  "THE CLASSIFICATION PERFORMANCE OF RDA" \n  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. 
of \n  Mathematics and Statistics, James Cook University of North Queensland. \n  (Also submitted to Journal of Chemometrics).\n',
 'feature_names': ['alcohol',
  'malic_acid',
  'ash',
  'alcalinity_of_ash',
  'magnesium',
  'total_phenols',
  'flavanoids',
  'nonflavanoid_phenols',
  'proanthocyanins',
  'color_intensity',
  'hue',
  'od280/od315_of_diluted_wines',
  'proline']}
print(wine.data)
print(wine.data.shape)     # 178 rows, 13 columns ---- 13 features

[[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
 ...
 [1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
 [1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
 [1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]
(178, 13)

wine.target     # the labels ----- a three-class dataset, 178 samples in total
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])
# If wine were a table, it would look like this

import pandas as pd
pd.concat([pd.DataFrame(wine.data), pd.DataFrame(wine.target)], axis=1)     # columns 0-12: features   last column: the label
         0     1     2     3      4     5     6     7     8      9    10    11      12  0
0    14.23  1.71  2.43  15.6  127.0  2.80  3.06  0.28  2.29   5.64  1.04  3.92  1065.0  0
1    13.20  1.78  2.14  11.2  100.0  2.65  2.76  0.26  1.28   4.38  1.05  3.40  1050.0  0
2    13.16  2.36  2.67  18.6  101.0  2.80  3.24  0.30  2.81   5.68  1.03  3.17  1185.0  0
3    14.37  1.95  2.50  16.8  113.0  3.85  3.49  0.24  2.18   7.80  0.86  3.45  1480.0  0
4    13.24  2.59  2.87  21.0  118.0  2.80  2.69  0.39  1.82   4.32  1.04  2.93   735.0  0
..     ...   ...   ...   ...    ...   ...   ...   ...   ...    ...   ...   ...     ... ..
173  13.71  5.65  2.45  20.5   95.0  1.68  0.61  0.52  1.06   7.70  0.64  1.74   740.0  2
174  13.40  3.91  2.48  23.0  102.0  1.80  0.75  0.43  1.41   7.30  0.70  1.56   750.0  2
175  13.27  4.28  2.26  20.0  120.0  1.59  0.69  0.43  1.35  10.20  0.59  1.56   835.0  2
176  13.17  2.59  2.37  20.0  120.0  1.65  0.68  0.53  1.46   9.30  0.60  1.62   840.0  2
177  14.13  4.10  2.74  24.5   96.0  2.05  0.76  0.56  1.35   9.20  0.61  1.60   560.0  2

178 rows × 14 columns
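
# As a small optional sketch (not in the original cell), the real column names can be attached to make the table easier to read:
df = pd.DataFrame(wine.data, columns=wine.feature_names)   # name the 13 feature columns
df['target'] = wine.target                                 # append the label column
df.head()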

print(wine.feature_names)       # the feature names, 13 in total
print(wine.target_names)        # the 3 class names
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
['class_0' 'class_1' 'class_2']
# split features and labels into a 70% training set and a 30% test set
# the original row order gets shuffled
xtrain, xtest, ytrain, ytest = train_test_split(wine.data, wine.target, test_size=0.3)        # returned in the order x---x---y---y
print(xtrain)
print(xtrain.shape)            # 124 ------ 178 * 70%
print(xtest.shape)             # 54 ------- 178 * 30%

print(ytrain)          
print(ytrain.shape)            
print(ytest.shape)
[[1.242e+01 1.610e+00 2.190e+00 ... 1.060e+00 2.960e+00 3.450e+02]
 [1.394e+01 1.730e+00 2.270e+00 ... 1.120e+00 3.100e+00 1.260e+03]
 [1.296e+01 3.450e+00 2.350e+00 ... 6.800e-01 1.750e+00 6.750e+02]
 ...
 [1.252e+01 2.430e+00 2.170e+00 ... 9.000e-01 2.780e+00 3.250e+02]
 [1.330e+01 1.720e+00 2.140e+00 ... 1.020e+00 2.770e+00 1.285e+03]
 [1.329e+01 1.970e+00 2.680e+00 ... 1.070e+00 2.840e+00 1.270e+03]]
(124, 13)
(54, 13)
[1 0 2 2 2 1 0 1 0 1 1 0 2 1 1 2 1 0 1 1 0 1 1 1 2 0 2 1 2 0 1 2 1 1 1 1 1
 2 2 0 1 1 0 1 1 1 0 2 0 1 1 1 1 1 0 2 2 0 2 2 2 1 2 1 0 1 2 0 2 0 0 0 1 2
 0 0 2 1 1 1 0 2 0 0 1 1 2 2 0 2 2 1 0 0 1 0 2 1 1 1 1 0 1 2 0 0 2 2 2 0 0
 0 1 0 1 1 0 0 0 1 1 1 0 0]
(124,)
(54,)

# First model
# train_test_split() ------ gives a different result each run
# DecisionTreeClassifier() ----- the decision tree is itself randomized

clf = tree.DecisionTreeClassifier(criterion='entropy')     # default: Gini impurity      entropy (information entropy): the other impurity measure
clf = clf.fit(xtrain, ytrain)      # train the model
score = clf.score(xtest, ytest)    # accuracy on the test set

print(score)                      # the model fits well
0.9444444444444444
# Decision tree ------- draw the classification tree
 
feature_name = ['alcohol','malic acid','ash','alcalinity of ash','magnesium','total phenols','flavanoids','nonflavanoid phenols','proanthocyanins','color intensity','hue','od280/od315 of diluted wines','proline']

import graphviz
dot_data = tree.export_graphviz(clf
                                ,feature_names = feature_name                    # the 13 feature names ------ wine.feature_names
                                ,class_names = ['Gin','Sherry','Vermouth']       # the 3 classes -------- wine.target_names
                                ,filled = True                                   # fill nodes with color; the lower the impurity, the deeper the color
                                ,rounded = True                                  # rounded node corners
                                )
graph = graphviz.Source(dot_data)

graph

[Figure: the decision tree rendered by graphviz]

# Feature importances

clf.feature_importances_              # the weight each of the 13 features carries
array([0.        , 0.        , 0.        , 0.05685274, 0.        ,
       0.        , 0.37844738, 0.        , 0.        , 0.40305332,
       0.        , 0.        , 0.16164656])
[*zip(feature_name, clf.feature_importances_)]        # color intensity contributes the most; flavanoids is next, then proline
[('alcohol', 0.0),
 ('malic acid', 0.0),
 ('ash', 0.0),
 ('alcalinity of ash', 0.056852743309727054),
 ('magnesium', 0.0),
 ('total phenols', 0.0),
 ('flavanoids', 0.37844737886553675),
 ('nonflavanoid phenols', 0.0),
 ('proanthocyanins', 0.0),
 ('color intensity', 0.40305331725356425),
 ('hue', 0.0),
 ('od280/od315 of diluted wines', 0.0),
 ('proline', 0.16164656057117208)]
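
# To rank the features, the same pairs can be sorted by weight (a small optional sketch):
sorted(zip(feature_name, clf.feature_importances_), key=lambda pair: pair[1], reverse=True)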



# Second model
# random_state=30 ------ fix the random seed so each run gives the same result
# train_test_split() ------ gives a different result each run
# DecisionTreeClassifier() ----- the decision tree is itself randomized


clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=30)  # default: Gini impurity    entropy: the other impurity measure ------- criterion sets how impurity is computed
clf = clf.fit(xtrain, ytrain)      # train the model
score = clf.score(xtest, ytest)    # accuracy on the test set

print(score)                      # the model fits well
0.9629629629629629
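
# Note: random_state=30 only fixes the tree's internal randomness; the train/test split still changes
# between runs. To make the whole experiment repeatable, the split can be seeded too (a sketch; the
# seed value 420 is arbitrary):
xtrain, xtest, ytrain, ytest = train_test_split(wine.data, wine.target, test_size=0.3, random_state=420)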

# Third model
# splitter = 'random'     guards against over-fitting
clf = tree.DecisionTreeClassifier(criterion = 'entropy'     # default: Gini impurity        entropy: the other impurity measure
                                  ,random_state = 30        # fix the random seed so each run gives the same result
                                  ,splitter = 'random'      # inject randomness to guard against over-fitting
                                 )
clf = clf.fit(xtrain, ytrain)                               # train the model
score = clf.score(xtest, ytest)                             # accuracy on the test set

score
0.9629629629629629
# Decision tree ------- draw the classification tree
 
feature_name = ['alcohol','malic acid','ash','alcalinity of ash','magnesium','total phenols','flavanoids','nonflavanoid phenols','proanthocyanins','color intensity','hue','od280/od315 of diluted wines','proline']

import graphviz
dot_data = tree.export_graphviz(clf
                                ,feature_names = feature_name                    # the 13 feature names -------- wine.feature_names
                                ,class_names = ['Gin','Sherry','Vermouth']       # the 3 classes -------- wine.target_names
                                ,filled = True                                   # fill nodes with color; the lower the impurity, the deeper the color
                                ,rounded = True                                  # rounded node corners
                                )
graph = graphviz.Source(dot_data)

graph

[Figure: the decision tree rendered by graphviz]

'''
How well does our tree fit the training set?
(1) If, say, the test score were only 80% while the training score is 100%, the model would be severely over-fit
(2) The signature of over-fitting: strong performance on the training set, poor performance on the test set
'''
# (1) The higher the test-set score, the better
# (2) A training score of 1.0 paired with a clearly lower test score signals over-fitting: the tree needs pruning
score_train = clf.score(xtrain, ytrain)   # training-set score
print(score_train)

score_test = clf.score(xtest, ytest)      # test-set score
print(score_test)
1.0
0.9629629629629629



-------------------------------class2: Pruning-------------------------------------
# Tuning the pruning parameters

'''
☺ Without constraints, a decision tree tends to over-fit
☺ Pruning strategy has an enormous effect on a decision tree; the right pruning strategy is the core of optimizing the algorithm
'''
# Pruning parameter tuning, part 1
'''
(1) max_depth
max_depth = 3 caps the depth of the tree; branches beyond the set depth are all pruned away. Start trying from 3. Best suited to high-dimensional data; the most commonly used pruning parameter.


(2) min_samples_leaf /// min_samples_split
min_samples_leaf = 5: a node only splits if each resulting child keeps at least 5 samples. Start trying from 5
min_samples_split = 5: a parent node must contain at least 5 samples before it may split. Used together with min_samples_leaf.
'''
# (1) training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(wine.data, wine.target, test_size=0.3)


# (2) build the model
clf = tree.DecisionTreeClassifier(criterion='entropy'    # default is Gini impurity; use information entropy here ------------ the impurity measure
                                  ,random_state=30       # fix the random seed so each run gives the same result
                                  ,splitter='random'     # inject randomness to guard against over-fitting
                                  ,max_depth = 3             # (1) if the score is below 50%, raise the depth; if above 80%, try lowering it
                                  ,min_samples_leaf = 20     # (2) minimum number of samples each child node must keep
                                  ,min_samples_split = 25     # (3) minimum number of samples a parent node needs before it may split
                                  )
clf = clf.fit(xtrain, ytrain)


# (3) draw the decision tree
feature_name = ['alcohol','malic acid','ash','alcalinity of ash','magnesium','total phenols','flavanoids','nonflavanoid phenols','proanthocyanins','color intensity','hue','od280/od315 of diluted wines','proline']
import graphviz
dot_data = tree.export_graphviz(clf
                                ,feature_names = feature_name                # the 13 feature names
                                ,class_names = ['Gin','Sherry','Vermouth']   # the 3 classes
                                ,filled = True                               # fill nodes with color
                                ,rounded = True                              # rounded node corners
                                )
graph = graphviz.Source(dot_data)

graph

[Figure: the pruned decision tree rendered by graphviz]

# Check the fit ------ the higher the test-set score, the better

score1 = clf.score(xtest, ytest)
print(score1)
0.8148148148148148

# Pruning parameter tuning, part 2
'''
(1) max_features ----- maximum number of features
max_features caps how many features may be considered when splitting; features beyond the limit are discarded. Suited to over-fitting on high-dimensional data.
Note: dimensionality reduction is the more recommended way to prevent this over-fitting; consider PCA, ICA, or the algorithms in the feature-selection module


(2) min_impurity_decrease ----- minimum impurity decrease
min_impurity_decrease puts a floor on the information gain: splits whose gain falls below the set value are not made
Information gain = the parent node's entropy minus the weighted entropy of its children; the children's weighted entropy is never larger than the parent's
The larger the information gain, the more that split contributes to the decision tree.
'''
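# A minimal sketch of these two parameters (the values are illustrative, not tuned):
clf_limited = tree.DecisionTreeClassifier(criterion='entropy'
                                          ,random_state=30
                                          ,max_features=8                # consider at most 8 of the 13 features per split
                                          ,min_impurity_decrease=0.01    # a split must cut impurity by at least 0.01
                                          )
clf_limited = clf_limited.fit(xtrain, ytrain)
print(clf_limited.score(xtest, ytest))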

# Pruning parameter tuning, part 3
# Sample-weight parameters (very rarely used; just know they exist); a sketch follows this list

class_weight
(1) Class imbalance means one label naturally takes up a very large share of the dataset
(2) Example: a bank predicting whether customers default, where by common sense non-defaulting customers make up 99%. Since the defaulting customers are the ones we study, they should carry more weight
(3) We therefore use the class_weight parameter to rebalance the labels, giving the minority label more weight.
(4) The default is None, which automatically gives every label in the dataset the same weight (1:1). Ratios such as 5:1 can be set by hand


min_weight_fraction_leaf
(1) Once weights are set, a node's sample size is no longer a plain record count but is weight-dependent.
     Pruning then needs the weight-based parameter min_weight_fraction_leaf
(2) Note: the weight-based criterion (min_weight_fraction_leaf) is less biased toward the dominant class than a criterion that ignores sample weights (min_samples_leaf)
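
# A sketch of how these two parameters could be set (the 5:1 weighting is purely illustrative):
clf_weighted = tree.DecisionTreeClassifier(class_weight={0: 1, 1: 1, 2: 5}    # or class_weight='balanced'
                                           ,min_weight_fraction_leaf=0.01     # each leaf must hold >= 1% of the total sample weight
                                           ,random_state=30
                                           )
clf_weighted = clf_weighted.fit(xtrain, ytrain)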



-------------------------------class3: Finding the Optimal Pruning Parameters-------------------------------------
# Finding the optimal pruning parameters
(1) How do we decide, concretely, what value each parameter should take? Use a hyperparameter learning curve.
(2) A hyperparameter learning curve plots the hyperparameter's values on the x-axis against a model metric on the y-axis.
     It measures how the model performs at each value of the hyperparameter.
(3) Here the trained decision-tree model is clf, and the metric is the tree's score

# Two thoughts:
(1) Will pruning always improve performance on the test set?   ---- tuning has no absolute answers; everything depends on the data itself
(2) With this many parameters, do we plot a learning curve for each one?   ---- the Titanic case study will answer this question
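# As a preview of that answer: scikit-learn's GridSearchCV can search several pruning parameters at
# once (a minimal sketch, not part of the original notebook; the grid values are illustrative):
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': range(1, 6)
              ,'min_samples_leaf': [1, 5, 10]
              }
search = GridSearchCV(tree.DecisionTreeClassifier(random_state=30), param_grid, cv=5)   # 5-fold cross-validation
search.fit(xtrain, ytrain)
print(search.best_params_, search.best_score_)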
from sklearn import tree
from sklearn.datasets import load_wine     # datasets: sklearn's built-in toy datasets (Boston housing, iris, wine, ...)
from sklearn.model_selection import train_test_split   # splits data into training and test sets
wine = load_wine()   # load the dataset
xtrain, xtest, ytrain, ytest = train_test_split(wine.data, wine.target, test_size=0.3)        

import matplotlib.pyplot as plt
test = []
for i in range(10):                               # i = 0..9, so depths 1 through 10
    clf = tree.DecisionTreeClassifier(max_depth = i + 1
                                     ,criterion = 'entropy'
                                     ,random_state = 30
                                     ,splitter = 'random'
                                     )
    clf = clf.fit(xtrain, ytrain)
    score = clf.score(xtest, ytest)               # check the fit ---- the higher the test score, the better
    test.append(score)
plt.plot(range(1,11), test, color='red', label='max_depth')
plt.legend()
plt.show()

# The plot shows the score peaks at max_depth = 3: the best fit

[Figure: learning curve of the test score against max_depth]
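
# An optional extension (a sketch): plot the training score on the same axes to see where the gap
# between training and test performance, i.e. over-fitting, opens up.
train_scores, test_scores = [], []
for i in range(10):
    clf = tree.DecisionTreeClassifier(max_depth = i + 1
                                     ,criterion = 'entropy'
                                     ,random_state = 30
                                     ,splitter = 'random'
                                     )
    clf = clf.fit(xtrain, ytrain)
    train_scores.append(clf.score(xtrain, ytrain))   # training-set score
    test_scores.append(clf.score(xtest, ytest))      # test-set score
plt.plot(range(1, 11), train_scores, color='blue', label='train')
plt.plot(range(1, 11), test_scores, color='red', label='test')
plt.xlabel('max_depth')
plt.legend()
plt.show()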



-------------------------------class4: Key Attributes and Interfaces-------------------------------------
(1) Many of sklearn's algorithms share similar interfaces; fit and score, which we have used already, exist on almost every estimator.
     Beyond those two, the decision tree's most commonly used interfaces are apply and predict
(2) apply takes the test set and returns the index of the leaf node each test sample lands in
(3) predict takes the test set and returns each test sample's label (the classification/regression result)

Note:
(1) sklearn only accepts feature matrices with at least 2 dimensions (a 1-D array is rejected)
(2) If your data is 1-D (e.g. a single sample), use reshape(1, -1) to add the missing dimension; see the sketch below
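# A minimal sketch of that note, reusing xtest and clf from the cells above:
one_sample = xtest[0]                          # shape (13,): a 1-D array, which sklearn rejects
print(clf.predict(one_sample.reshape(1, -1)))  # reshape to (1, 13): one sample, 13 features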
# (1) score
score  = clf.score(xtest, ytest)
print(score)

# (2) apply returns the index of the leaf node each test sample lands in
apply = clf.apply(xtest)
print(apply)

# (3) predict returns each test sample's classification/regression result
predict = clf.predict(xtest)
print(predict)
0.9444444444444444
[16 27  4 30 21 16 30 16 30  8  4 16 16 30 16  4 27  4 30 16 21 27 30 30
 16  4 21 21 21  4 16 16 27 16 27 30 30 10  8 21 21 16 21  4  4 13 30 21
 16 21  8 16 10 27]
[1 0 2 0 1 1 0 1 0 2 2 1 1 0 1 2 0 2 0 1 1 0 0 0 1 2 1 1 1 2 1 1 0 1 0 0 0
 2 2 1 1 1 1 2 2 1 0 1 1 1 2 1 2 0]


-------------------------------class5: Decision Tree Summary-------------------------------------
We have now covered the classification tree DecisionTreeClassifier and tree drawing with export_graphviz:
the classification tree's eight parameters, one attribute, and four interfaces, plus the code used for drawing

(1) Eight parameters
criterion
two randomness-related parameters (random_state, splitter)
five pruning parameters (max_depth, min_samples_leaf, min_samples_split, max_features, min_impurity_decrease)

(2) One attribute
feature_importances_

(3) Four interfaces
fit, score, apply, predict