3 - Decision Trees in Practice

进击的小杨人

Category: 机器学习实战 (Machine Learning in Practice). Tags: machine learning, artificial intelligence, decision trees

Article link: https://blog.csdn.net/weixin_42600072/article/details/88432142

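import matplotlib.pyplot as plt
import pandas as pd

# Import the loader from sklearn.datasets; the old
# sklearn.datasets.california_housing module path has been removed from current releases.
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()  # downloads the data on first use
print(housing.DESCR)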

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

housing.data.shape

(20640, 8)

housing.data[0]

array([   8.3252    ,   41.        ,    6.98412698,    1.02380952,
        322.        ,    2.55555556,   37.88      , -122.23      ])
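
The raw row is easier to read once each value is paired with its feature name. A minimal sketch of that idea: the feature names are taken from the dataset description above and the values from the printed row, both hard-coded here (an assumption made so the snippet runs without downloading the dataset).

```python
import numpy as np
import pandas as pd

# Feature names in the order listed in the dataset description.
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

# First row of housing.data, copied from the printed output.
first_row = np.array([8.3252, 41.0, 6.98412698, 1.02380952,
                      322.0, 2.55555556, 37.88, -122.23])

# Label each value with its feature name.
labeled = pd.Series(first_row, index=feature_names)
print(labeled)

# Columns 6 and 7 hold the coordinates of the block group.
coords = labeled.iloc[[6, 7]]
print(coords)
```

With the real dataset loaded, the same pairing would be `pd.Series(housing.data[0], index=housing.feature_names)`.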

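Tree model parameters:

1.criterion: gini or entropy for classification trees. Regression trees use an error-based criterion instead (the fitted regressor below reports criterion='mse').
2.splitter: best or random. best picks the best split among all candidate splits; random picks the best among randomly drawn splits, which can be useful when the dataset is large.
3.max_features: None (all features), log2, sqrt, or a number N. With fewer than 50 features, using all of them is usually fine.
4.max_depth: with little data or few features this can be left alone. With many samples and many features, try limiting it.
5.min_samples_split: a node holding fewer samples than min_samples_split is not split any further. With a small dataset this can be left alone; with a very large one, increasing it is recommended.
6.min_samples_leaf: the minimum number of samples a leaf must hold. If a leaf would end up with fewer samples than this value, it is pruned together with its sibling. With a small dataset this can be left alone; for larger data, say 100,000 samples, try 5.
7.min_weight_fraction_leaf: the minimum total sample weight a leaf must hold; a leaf below it is pruned together with its sibling. The default is 0, meaning weights are ignored. Sample weights usually come into play when many samples have missing values or when the class distribution of a classification tree is heavily skewed, and that is when this value matters.
8.max_leaf_nodes: capping the number of leaves helps prevent overfitting. The default is None, meaning no cap. With a cap, the algorithm builds the best tree it can within that number of leaves. With few features this can be ignored; with many features, set a cap and choose its value by cross-validation.
9.class_weight: the weight of each class, mainly to stop the tree from leaning toward classes that dominate the training set. The weights can be given explicitly, or set to "balanced", in which case they are computed automatically and classes with fewer samples get higher weights.
10.min_impurity_split: limits tree growth. If a node's impurity (Gini, entropy, mean squared error, or mean absolute error) falls below this threshold, the node is not split and becomes a leaf. Newer scikit-learn releases replace this parameter with min_impurity_decrease.
n_estimators: the number of trees to build. This is a parameter of ensembles such as the random forest, not of a single tree.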
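
To make the effect of the growth-limiting parameters concrete, here is a small sketch that fits the same data with different constraints and compares tree size. The data is synthetic (an assumption for illustration, not the housing data), so the exact numbers will differ from anything in this post.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# With default settings the tree grows until it cannot split any further.
full = DecisionTreeRegressor(random_state=0).fit(X, y)

# Each parameter caps growth in a different way.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
big_leaves = DecisionTreeRegressor(min_samples_leaf=20, random_state=0).fit(X, y)
few_leaves = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

for name, model in [("default", full),
                    ("max_depth=3", shallow),
                    ("min_samples_leaf=20", big_leaves),
                    ("max_leaf_nodes=8", few_leaves)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```

The unconstrained tree ends up far larger than any of the constrained ones, which is why these parameters are the usual first knobs against overfitting.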

from sklearn import tree

# Fit a depth-2 regression tree on Latitude and Longitude only (columns 6 and 7).
dtr = tree.DecisionTreeRegressor(max_depth = 2)
dtr.fit(housing.data[:,[6,7]], housing.target)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

# Visualization requires Graphviz to be installed first: http://www.graphviz.org/Download..php
dot_data = \
    tree.export_graphviz(
        dtr,
        out_file = None,
        feature_names = housing.feature_names[6:8],
        filled = True,
        impurity = False,
        rounded = True
    )

import pydotplus
from IPython.display import Image

# Build the graph from the DOT source and recolor one node before rendering it inline.
graph = pydotplus.graph_from_dot_data(dot_data)
graph.get_nodes()[7].set_fillcolor("#FFF2DD")
Image(graph.create_png())

[Figure: Graphviz rendering of the depth-2 regression tree; image not preserved]

graph.write_png("dtr_white_background.png")

True
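
When Graphviz is not available, scikit-learn's `export_text` can print the same split rules as plain text. A minimal sketch, assuming a synthetic stand-in for the latitude and longitude columns (the values and the target are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the two coordinate columns (assumed values).
rng = np.random.RandomState(42)
X = np.column_stack([rng.uniform(32, 42, 300),       # "Latitude"
                     rng.uniform(-124, -114, 300)])  # "Longitude"
y = 0.1 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.05, size=300)

# Same shape of model as above: a regression tree limited to depth 2.
model = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Render the split rules as indented text instead of an image.
report = export_text(model, feature_names=["Latitude", "Longitude"])
print(report)
```

On the real model the call would be `export_text(dtr, feature_names=housing.feature_names[6:8])`.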

from sklearn.model_selection import train_test_split

# Hold out 10% of the rows for testing, then fit an unconstrained tree on all 8 features.
data_train, data_test, target_train, target_test = \
    train_test_split(housing.data, housing.target, test_size = 0.1, random_state=42)
dtr = tree.DecisionTreeRegressor(random_state=42)
dtr.fit(data_train, target_train)
dtr.score(data_test, target_test)  # R^2 on the held-out rows

0.637355881715626
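
For a regressor, `score` returns the coefficient of determination R^2, so the value above means the tree explains roughly 64% of the variance in the held-out targets. The sketch below recomputes R^2 by hand on synthetic data (an assumption for illustration) and shows that it matches `score`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration.
rng = np.random.RandomState(42)
X = rng.normal(size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)
model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
pred = model.predict(X_test)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("manual R^2:", r2_manual)
print("model.score:", model.score(X_test, y_test))
print("train R^2:", model.score(X_train, y_train))
```

The training score of an unconstrained tree is essentially perfect, so the gap between it and the test score is a quick indicator of overfitting.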

from sklearn.ensemble import RandomForestRegressor

# Same split, but with a random forest; n_estimators is left at its default here.
rfr = RandomForestRegressor(random_state = 42)
rfr.fit(data_train, target_train)
rfr.score(data_test, target_test)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

0.7910601348350835
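
The forest scores higher than the single tree because its prediction is the average over many trees, each trained on a bootstrap sample. A small sketch on synthetic data (an assumption for illustration) that reproduces the forest's prediction by averaging its individual trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for illustration.
rng = np.random.RandomState(42)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=300)

forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Each fitted tree is available in estimators_; average their predictions by hand.
per_tree = np.array([t.predict(X) for t in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The hand-computed average agrees with the forest's own prediction.
print(np.allclose(manual_mean, forest.predict(X)))
```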

from sklearn.model_selection import GridSearchCV

# Try every combination of min_samples_split and n_estimators with 5-fold cross-validation.
tree_param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50, 100]}
grid = GridSearchCV(RandomForestRegressor(), param_grid=tree_param_grid, cv=5)
grid.fit(data_train, target_train)
grid.cv_results_, grid.best_params_, grid.best_score_

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:125: FutureWarning: You are accessing a training score ('split0_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:125: FutureWarning: You are accessing a training score ('split1_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:125: FutureWarning: You are accessing a training score ('split2_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:125: FutureWarning: You are accessing a training score ('split3_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:125: FutureWarning: You are accessing a training score ('split4_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:125: FutureWarning: You are accessing a training score ('mean_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:125: FutureWarning: You are accessing a training score ('std_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)

({'mean_fit_time': array([ 1.13386483,  6.54137421, 13.39396534,  1.29727416,  6.23775678,
         12.7155273 ,  1.20046864,  6.02114434, 11.96628437]),
  'std_fit_time': array([0.1676311 , 0.12323692, 0.31254778, 0.06784611, 0.03376336,
         0.33566092, 0.01870518, 0.07025612, 0.10596189]),
  'mean_score_time': array([0.0146008 , 0.07720447, 0.14040809, 0.01400075, 0.06260366,
         0.12600718, 0.01580095, 0.05400319, 0.11200643]),
  'std_score_time': array([0.00392962, 0.00530697, 0.01645216, 0.00477524, 0.00463066,
         0.00328652, 0.00453465, 0.00931714, 0.00874127]),
  'param_min_samples_split': masked_array(data=[3, 3, 3, 6, 6, 6, 9, 9, 9],
               mask=[False, False, False, False, False, False, False, False,
                     False],
         fill_value='?',
              dtype=object),
  'param_n_estimators': masked_array(data=[10, 50, 100, 10, 50, 100, 10, 50, 100],
               mask=[False, False, False, False, False, False, False, False,
                     False],
         fill_value='?',
              dtype=object),
  'params': [{'min_samples_split': 3, 'n_estimators': 10},
   {'min_samples_split': 3, 'n_estimators': 50},
   {'min_samples_split': 3, 'n_estimators': 100},
   {'min_samples_split': 6, 'n_estimators': 10},
   {'min_samples_split': 6, 'n_estimators': 50},
   {'min_samples_split': 6, 'n_estimators': 100},
   {'min_samples_split': 9, 'n_estimators': 10},
   {'min_samples_split': 9, 'n_estimators': 50},
   {'min_samples_split': 9, 'n_estimators': 100}],
  'split0_test_score': array([0.78976579, 0.80823569, 0.8110849 , 0.7881789 , 0.80860001,
         0.80966129, 0.79381159, 0.8076118 , 0.80888393]),
  'split1_test_score': array([0.78418412, 0.79726994, 0.80204225, 0.77963146, 0.79681302,
         0.79995257, 0.78619206, 0.79799212, 0.79933496]),
  'split2_test_score': array([0.77915685, 0.80552189, 0.80430674, 0.78249812, 0.79931462,
         0.80119057, 0.78363434, 0.8023625 , 0.80136363]),
  'split3_test_score': array([0.78982826, 0.81267086, 0.81066329, 0.78639345, 0.80946871,
         0.81021067, 0.78952342, 0.80655655, 0.81103163]),
  'split4_test_score': array([0.78781872, 0.80597808, 0.80921501, 0.78035292, 0.80545273,
         0.8079091 , 0.78643771, 0.80556252, 0.80505796]),
  'mean_test_score': array([0.78615094, 0.80593541, 0.80746263, 0.78341123, 0.80393007,
         0.80578505, 0.78792014, 0.80401729, 0.80513462]),
  'std_test_score': array([0.00405354, 0.00501827, 0.00362701, 0.00334966, 0.00503502,
         0.00434163, 0.0034882 , 0.00348704, 0.00439757]),
  'rank_test_score': array([8, 2, 1, 9, 6, 3, 7, 5, 4]),
  'split0_train_score': array([0.95792931, 0.96819181, 0.96998529, 0.94633218, 0.95609768,
         0.95703435, 0.93351209, 0.94313076, 0.94430289]),
  'split1_train_score': array([0.95917872, 0.96903203, 0.96981718, 0.94568888, 0.95604337,
         0.95740729, 0.93275355, 0.94466686, 0.94585449]),
  'split2_train_score': array([0.95799713, 0.96920172, 0.97037121, 0.94619778, 0.95688282,
         0.95805602, 0.93520693, 0.94462025, 0.94534767]),
  'split3_train_score': array([0.95838268, 0.96865722, 0.97013743, 0.94731506, 0.95665454,
         0.95768814, 0.93553841, 0.94405375, 0.94506609]),
  'split4_train_score': array([0.95707483, 0.96843915, 0.96972808, 0.94659881, 0.95621223,
         0.95729949, 0.93364081, 0.9436983 , 0.94430204]),
  'mean_train_score': array([0.95811253, 0.96870438, 0.97000784, 0.94642654, 0.95637813,
         0.95749706, 0.93413036, 0.94403398, 0.94497464]),
  'std_train_score': array([0.00068315, 0.00037148, 0.00022976, 0.00053376, 0.00033147,
         0.00034932, 0.00106387, 0.00057847, 0.0006042 ])},
 {'min_samples_split': 3, 'n_estimators': 100},
 0.8074626318808404)
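
The raw `cv_results_` dictionary is hard to scan. One common way to read it is to load it into a pandas DataFrame and sort by rank. A minimal sketch with a smaller grid and synthetic data (both assumptions, chosen so the example runs quickly):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=200)

# A reduced grid: 2 x 2 combinations, 3-fold cross-validation.
param_grid = {"min_samples_split": [3, 6], "n_estimators": [5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid=param_grid, cv=3)
search.fit(X, y)

# One row per parameter combination, best combination first.
results = pd.DataFrame(search.cv_results_)
summary = results[["param_min_samples_split", "param_n_estimators",
                   "mean_test_score", "std_test_score",
                   "rank_test_score"]].sort_values("rank_test_score")
print(summary)
print(search.best_params_, search.best_score_)
```

Read this way, the results above show that going from 10 to 50 trees helps much more than going from 50 to 100, while min_samples_split barely changes the test score.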

# Refit on the training split with the best parameters found by the grid search.
rfr = RandomForestRegressor(min_samples_split=3, n_estimators=100, random_state=42)
rfr.fit(data_train, target_train)
rfr.score(data_test, target_test)

0.8088623476993486

# Rank the features by their importance in the fitted forest.
pd.Series(rfr.feature_importances_, index = housing.feature_names).sort_values(ascending = False)

MedInc        0.524257
AveOccup      0.137947
Latitude      0.090622
Longitude     0.089414
HouseAge      0.053970
AveRooms      0.044443
Population    0.030263
AveBedrms     0.029084
dtype: float64
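
The importances are normalized to sum to 1, so each value reads as a share of the total: MedInc alone accounts for about half. A small sketch on synthetic data (an assumption for illustration, with made-up feature names) where only one feature drives the target:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first column influences the target.
rng = np.random.RandomState(42)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Hypothetical feature names, used only to label the output.
importances = pd.Series(
    forest.feature_importances_,
    index=["signal", "noise_a", "noise_b"],
).sort_values(ascending=False)

print(importances)
print("sum:", importances.sum())
```

The informative column takes nearly all of the importance, and the noise columns share what little is left.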
