Visualizing a decision tree with Graphviz, and saving DataFrame data locally


志存高远脚踏实地


Article link: https://blog.csdn.net/weixin_44451032/article/details/100063344


Decision tree visualization

Preparing the dataset

This walkthrough uses the California Housing dataset that ships with sklearn.

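import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing  # built-in dataset loader (downloads and caches the data on first use)
house_price = fetch_california_housing()
print(house_price.DESCR)  # print the dataset description
print(house_price)  # print the whole object: data, target, feature_names, DESCR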

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]]), 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]), 'feature_names': ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'], 'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block\n        - HouseAge      median house age in block\n        - AveRooms      average number of rooms\n        - AveBedrms     average number of bedrooms\n        - Population    block population\n        - AveOccup      average house occupancy\n        - Latitude      house block latitude\n        - Longitude     house block longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttp://lib.stat.cmu.edu/datasets/\n\nThe target variable is the median house value for California districts.\n\nThis dataset was derived from the 1990 U.S. census, using one row per census\nblock group. A block group is the smallest geographical unit for which the U.S.\nCensus Bureau publishes sample data (a block group typically has a population\nof 600 to 3,000 people).\n\nIt can be downloaded/loaded using the\n:func:`sklearn.datasets.fetch_california_housing` function.\n\n.. topic:: References\n\n    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,\n      Statistics and Probability Letters, 33 (1997) 291-297\n'}

Converting the fetch_california_housing dataset to a DataFrame to build a new table

Check the data type

# check the type of the returned object
type(house_price)

sklearn.utils.Bunch

Inspect data

house_price.data

array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]])

Inspect target (reshaped into a single column)

house_price.target.reshape(-1,1)

array([[4.526],
       [3.585],
       [3.521],
       ...,
       [0.923],
       [0.847],
       [0.894]])

Inspect feature_names

house_price.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

Convert the feature data to a DataFrame

data = pd.DataFrame(house_price.data)

data.head()

	0	1	2	3	4	5	6	7
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25

Insert the target values into data as a new column

data.insert(8, 'Target', house_price.target)  # position 8 places Target after the eight feature columns

data.head()

	0	1	2	3	4	5	6	7	Target
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422

Check the column labels

data.columns

Index([0, 1, 2, 3, 4, 5, 6, 7, 'Target'], dtype='object')

Rename the columns to produce a new table

columns = dict(zip([0, 1, 2, 3, 4, 5, 6, 7],house_price.feature_names))

data.rename(columns=columns).head()

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	Target
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422
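
The conversion steps above can also be condensed by naming the columns when the DataFrame is constructed. The following is a minimal sketch, assuming a small stand-in Bunch with made-up values so that it runs without downloading the dataset; with the real data, the stand-in would be replaced by the object returned by fetch_california_housing().

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

# Hypothetical stand-in for the object returned by fetch_california_housing();
# the numbers are made up for illustration only.
bunch = Bunch(
    data=np.array([[8.3, 41.0], [7.2, 52.0], [1.7, 17.0]]),
    target=np.array([4.5, 3.5, 0.9]),
    feature_names=["MedInc", "HouseAge"],
)

# Name the columns at construction time instead of renaming them afterwards.
table = pd.DataFrame(bunch.data, columns=bunch.feature_names)
# Assigning a new column appends it on the right, like insert() at the last position.
table["Target"] = bunch.target

print(table.columns.tolist())
```

In scikit-learn 0.23 and later, fetch_california_housing(as_frame=True) should make most of this unnecessary, since the returned object then carries the combined table in its frame attribute.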

Save the data locally

table = data.rename(columns=columns)
table.to_csv('table.csv', index=False)  # write the table to table.csv without the row index
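
One way to confirm that the file was written as intended is to read it back and compare it with the table in memory. This is only a sketch, assuming a tiny made-up table and a temporary directory instead of the real table.csv.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the housing table.
table = pd.DataFrame({"MedInc": [8.3252, 7.2574], "Target": [4.526, 3.521]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "table.csv")
    table.to_csv(csv_path, index=False)  # index=False: do not write the row labels
    restored = pd.read_csv(csv_path)     # read the file back into a new DataFrame

# Raises AssertionError if the round trip changed column names or values.
pd.testing.assert_frame_equal(restored, table)
print(restored.columns.tolist())
```

Without index=False, the row labels are written as an extra unnamed first column, which then shows up as an additional column when the file is read back.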


Visualizing the tree

Build the decision tree

from sklearn import tree
dtr = tree.DecisionTreeRegressor(max_depth = 2)  # limit the depth of the tree
dtr.fit(house_price.data[:, [6, 7]], house_price.target)  # fit on only two features: Latitude and Longitude

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
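
With max_depth=2 the tree makes at most two successive splits, so it has at most four leaves and can predict at most four distinct values. The sketch below illustrates this on synthetic data; the random coordinates and targets are invented here and are not the housing data.

```python
import numpy as np
from sklearn import tree

rng = np.random.RandomState(0)
# Synthetic stand-ins for (Latitude, Longitude) and the target value.
X = rng.uniform(low=[32.0, -124.0], high=[42.0, -114.0], size=(200, 2))
y = rng.uniform(0.5, 5.0, size=200)

dtr = tree.DecisionTreeRegressor(max_depth=2)
dtr.fit(X, y)

n_distinct = len(np.unique(dtr.predict(X)))  # one predicted value per leaf reached
print(dtr.get_depth(), dtr.get_n_leaves(), n_distinct)
```

A tree this shallow is coarse as a model, but it keeps the rendered diagram small enough to read.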

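To render the tree, Graphviz must be installed first: http://www.Graphviz.org

# export the fitted tree as DOT source; rendering it requires Graphviz (http://www.Graphviz.org)
dot_data = \
    tree.export_graphviz(
        dtr,
        out_file = None,  # return the DOT source as a string instead of writing a file
        feature_names = house_price.feature_names[6:8],  # Latitude, Longitude
        filled = True,  # fill the nodes with color
        impurity = False,  # hide the impurity (mean squared error) at each node
        rounded = True  # draw node boxes with rounded corners
    )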
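
Because out_file is None, export_graphviz returns the DOT source as an ordinary string, so it can be inspected before any Graphviz executable is involved. The sketch below, assuming synthetic data in place of the housing data, prints the first line of that string and then uses tree.export_text as a plain-text view of the same tree.

```python
import numpy as np
from sklearn import tree

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 2))                 # synthetic stand-in features
y = X[:, 0] + rng.normal(scale=0.1, size=100)  # synthetic stand-in target

dtr = tree.DecisionTreeRegressor(max_depth=2).fit(X, y)
feature_names = ["Latitude", "Longitude"]

# DOT source as a string; nothing is rendered at this point.
dot_data = tree.export_graphviz(dtr, out_file=None, feature_names=feature_names)
print(dot_data.splitlines()[0])

# Plain-text rendering that needs neither Graphviz nor pydotplus.
text_view = tree.export_text(dtr, feature_names=feature_names)
print(text_view)
```

The text view is handy for a quick check of the split thresholds when Graphviz is not installed.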

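Draw and display the graph

import pydotplus
from IPython.display import Image

graph = pydotplus.graph_from_dot_data(dot_data)  # parse the DOT source into a graph object
for node in graph.get_nodes():
    node.set_fillcolor("#FFE1FF")  # give every node the same fill color
Image(graph.create_png())  # render to PNG (calls the Graphviz executables) and show it in the notebook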
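
If installing Graphviz is not an option, sklearn's tree.plot_tree draws a similar diagram with matplotlib alone. This is an alternative to the pydotplus route above, not the method this article uses; the sketch assumes synthetic data and writes the image to a temporary location chosen here for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn import tree

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 2))                 # synthetic stand-in features
y = X[:, 0] + rng.normal(scale=0.1, size=100)  # synthetic stand-in target
dtr = tree.DecisionTreeRegressor(max_depth=2).fit(X, y)

fig, ax = plt.subplots(figsize=(8, 4))
artists = tree.plot_tree(
    dtr,
    feature_names=["Latitude", "Longitude"],
    impurity=False,  # hide the impurity at each node, as in the export above
    rounded=True,    # rounded node boxes
    ax=ax,
)

png_path = os.path.join(tempfile.mkdtemp(), "tree.png")  # hypothetical output location
fig.savefig(png_path)
plt.close(fig)
print(png_path)
```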


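However, this step sometimes fails with InvocationException: GraphViz's executables not found

This happens because the Python process cannot find the Graphviz executables on its PATH. Append the Graphviz bin directory to PATH from within Python, then rerun the drawing step.

import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'  # append the Graphviz bin directory to PATH; change this to your own install path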
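
Before drawing, it can help to check whether the dot executable is actually reachable, and to touch PATH only when it is not. The sketch below uses only the standard library; ensure_graphviz_on_path is a hypothetical helper name, and the directory passed to it is the example path from above, which will differ per machine.

```python
import os
import shutil


def ensure_graphviz_on_path(candidate_dir):
    """Return the path of the `dot` executable, or None if it cannot be found.

    candidate_dir is appended to PATH only if `dot` is not already reachable
    and the directory exists.
    """
    found = shutil.which("dot")
    if found is None and os.path.isdir(candidate_dir):
        os.environ["PATH"] += os.pathsep + candidate_dir
        found = shutil.which("dot")
    return found


# Example install directory; change it to match the local Graphviz installation.
dot_path = ensure_graphviz_on_path("C:/Program Files (x86)/Graphviz2.38/bin/")
print(dot_path if dot_path else "dot not found; check the Graphviz install directory")
```

The PATH change made this way lasts only for the current Python process, so it has to be repeated in each new session unless the system environment variable is updated as well.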
