[python] 基于matplotlib实现树形图的绘制

树形图Tree diagram (代码下载)

本文旨在描述如何使用Python实现基本的树形图。要实现这样的树形图,首先需要有一个数值矩阵。每一行代表一个实体(这里是一辆汽车)。每列都是描述汽车的变量。目标是将实体聚类以了解谁与谁有共同点。python下通过scipy中hierarchy.linkage进行聚类,hierarchy.dendrogram画树形图。参考文档:https://python-graph-gallery.com/dendrogram/
该章节主要内容有:

  1. 数据处理 data processing
  2. 基础树形图 basic dendrogram
  3. 自定义树形图 customised dendrogram
  4. 彩色树形图标签 color dendrogram labels

1. 数据处理 data processing

画树形图,往往第一列是数据实体名字,即物体种类。其他列分别为物体变量。

# 导入库
import pandas as pd
from matplotlib import pyplot as plt
from scipy.cluster import hierarchy
import numpy as np

# Import the mtcars dataset from the web + keep only numeric variables
url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
df = pd.read_csv(url)
df
modelmpgcyldisphpdratwtqsecvsamgearcarb
0Mazda RX421.06160.01103.902.62016.460144
1Mazda RX4 Wag21.06160.01103.902.87517.020144
2Datsun 71022.84108.0933.852.32018.611141
3Hornet 4 Drive21.46258.01103.083.21519.441031
4Hornet Sportabout18.78360.01753.153.44017.020032
5Valiant18.16225.01052.763.46020.221031
6Duster 36014.38360.02453.213.57015.840034
7Merc 240D24.44146.7623.693.19020.001042
8Merc 23022.84140.8953.923.15022.901042
9Merc 28019.26167.61233.923.44018.301044
10Merc 280C17.86167.61233.923.44018.901044
11Merc 450SE16.48275.81803.074.07017.400033
12Merc 450SL17.38275.81803.073.73017.600033
13Merc 450SLC15.28275.81803.073.78018.000033
14Cadillac Fleetwood10.48472.02052.935.25017.980034
15Lincoln Continental10.48460.02153.005.42417.820034
16Chrysler Imperial14.78440.02303.235.34517.420034
17Fiat 12832.4478.7664.082.20019.471141
18Honda Civic30.4475.7524.931.61518.521142
19Toyota Corolla33.9471.1654.221.83519.901141
20Toyota Corona21.54120.1973.702.46520.011031
21Dodge Challenger15.58318.01502.763.52016.870032
22AMC Javelin15.28304.01503.153.43517.300032
23Camaro Z2813.38350.02453.733.84015.410034
24Pontiac Firebird19.28400.01753.083.84517.050032
25Fiat X1-927.3479.0664.081.93518.901141
26Porsche 914-226.04120.3914.432.14016.700152
27Lotus Europa30.4495.11133.771.51316.901152
28Ford Pantera L15.88351.02644.223.17014.500154
29Ferrari Dino19.76145.01753.622.77015.500156
30Maserati Bora15.08301.03353.543.57014.600158
31Volvo 142E21.44121.01094.112.78018.601142
# 通常获得数据表格需要将车名设置行标题,这里model代表车的类型
df = df.set_index('model')
df
mpgcyldisphpdratwtqsecvsamgearcarb
model
Mazda RX421.06160.01103.902.62016.460144
Mazda RX4 Wag21.06160.01103.902.87517.020144
Datsun 71022.84108.0933.852.32018.611141
Hornet 4 Drive21.46258.01103.083.21519.441031
Hornet Sportabout18.78360.01753.153.44017.020032
Valiant18.16225.01052.763.46020.221031
Duster 36014.38360.02453.213.57015.840034
Merc 240D24.44146.7623.693.19020.001042
Merc 23022.84140.8953.923.15022.901042
Merc 28019.26167.61233.923.44018.301044
Merc 280C17.86167.61233.923.44018.901044
Merc 450SE16.48275.81803.074.07017.400033
Merc 450SL17.38275.81803.073.73017.600033
Merc 450SLC15.28275.81803.073.78018.000033
Cadillac Fleetwood10.48472.02052.935.25017.980034
Lincoln Continental10.48460.02153.005.42417.820034
Chrysler Imperial14.78440.02303.235.34517.420034
Fiat 12832.4478.7664.082.20019.471141
Honda Civic30.4475.7524.931.61518.521142
Toyota Corolla33.9471.1654.221.83519.901141
Toyota Corona21.54120.1973.702.46520.011031
Dodge Challenger15.58318.01502.763.52016.870032
AMC Javelin15.28304.01503.153.43517.300032
Camaro Z2813.38350.02453.733.84015.410034
Pontiac Firebird19.28400.01753.083.84517.050032
Fiat X1-927.3479.0664.081.93518.901141
Porsche 914-226.04120.3914.432.14016.700152
Lotus Europa30.4495.11133.771.51316.901152
Ford Pantera L15.88351.02644.223.17014.500154
Ferrari Dino19.76145.01753.622.77015.500156
Maserati Bora15.08301.03353.543.57014.600158
Volvo 142E21.44121.01094.112.78018.601142
# 同时需要删除行标索引的标题名
del df.index.name
df
mpgcyldisphpdratwtqsecvsamgearcarb
Mazda RX421.06160.01103.902.62016.460144
Mazda RX4 Wag21.06160.01103.902.87517.020144
Datsun 71022.84108.0933.852.32018.611141
Hornet 4 Drive21.46258.01103.083.21519.441031
Hornet Sportabout18.78360.01753.153.44017.020032
Valiant18.16225.01052.763.46020.221031
Duster 36014.38360.02453.213.57015.840034
Merc 240D24.44146.7623.693.19020.001042
Merc 23022.84140.8953.923.15022.901042
Merc 28019.26167.61233.923.44018.301044
Merc 280C17.86167.61233.923.44018.901044
Merc 450SE16.48275.81803.074.07017.400033
Merc 450SL17.38275.81803.073.73017.600033
Merc 450SLC15.28275.81803.073.78018.000033
Cadillac Fleetwood10.48472.02052.935.25017.980034
Lincoln Continental10.48460.02153.005.42417.820034
Chrysler Imperial14.78440.02303.235.34517.420034
Fiat 12832.4478.7664.082.20019.471141
Honda Civic30.4475.7524.931.61518.521142
Toyota Corolla33.9471.1654.221.83519.901141
Toyota Corona21.54120.1973.702.46520.011031
Dodge Challenger15.58318.01502.763.52016.870032
AMC Javelin15.28304.01503.153.43517.300032
Camaro Z2813.38350.02453.733.84015.410034
Pontiac Firebird19.28400.01753.083.84517.050032
Fiat X1-927.3479.0664.081.93518.901141
Porsche 914-226.04120.3914.432.14016.700152
Lotus Europa30.4495.11133.771.51316.901152
Ford Pantera L15.88351.02644.223.17014.500154
Ferrari Dino19.76145.01753.622.77015.500156
Maserati Bora15.08301.03353.543.57014.600158
Volvo 142E21.44121.01094.112.78018.601142

2. 基础树形图 basic dendrogram

# 执行分层聚类
Z = hierarchy.linkage(df, 'ward')
# 函数原型如下:
# scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)
# y输入矩阵,method聚类方法,metric距离计算方法。通常ward比较靠谱
# optimal_ordering重新排序链接矩阵,以使连续叶之间的距离最小,这样树形结构更为直观,但是计算速度变慢。
# 参数选择见:https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
# Make the dendrogram
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance (Ward)')
# 画聚类图,常用参数labels设定横坐标下标,leaf_rotation标题旋转
# 详细使用见:https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html
hierarchy.dendrogram(Z, labels=df.index, leaf_rotation=90);

png

3. 自定义树形图 customised dendrogram

  • 叶标签 leaf label
  • 聚类簇数 number of clusters
  • 颜色 color
  • 截减 truncate
  • 方向 orientation
# 叶标签 leaf label
# Calculate the distance between each sample
Z = hierarchy.linkage(df, 'ward')
 
# Plot with Custom leaves
# 常用参数labels设定横坐标下标,leaf_rotation标题旋转,leaf_font_size设置字号
hierarchy.dendrogram(Z, leaf_rotation=90, leaf_font_size=8, labels=df.index);

png

# 聚类簇数 number of clusters
# Calculate the distance between each sample
Z = hierarchy.linkage(df, 'ward')
 
# Control number of clusters in the plot + add horizontal line.
# color_threshold设定颜色阈值,小于olor_threshold根据簇节点为一簇
hierarchy.dendrogram(Z, color_threshold=240)
# 画水平线,y纵坐标,c颜色,lw线条粗细,linestyle线形
plt.axhline(y=240, c='grey', lw=1, linestyle='dashed');

png

# 颜色 color
# Calculate the distance between each sample
Z = hierarchy.linkage(df, 'ward')

# Set the colour of the cluster here: 设置聚类颜色
hierarchy.set_link_color_palette(['#b30000','#996600', '#b30086'])
 
# Make the dendrogram and give the colour above threshold
# above_threshold_color设置color_threshold上方链接的颜色
hierarchy.dendrogram(Z, color_threshold=240, above_threshold_color='grey')
 
# Add horizontal line.
plt.axhline(y=240, c='grey', lw=1, linestyle='dashed');

png

# 截减 truncate
# 原始观察矩阵很大时,树形图很难读取。截断用于压缩树形图。有几种模式:
# 1 None 不执行截断
# 2 lastp lastp设置叶子节点数,最底层节点数
# 3 level 根据level设置图中层最大数
# Calculate the distance between each sample
Z = hierarchy.linkage(df, 'ward')
  
# method 1: lastp
# you will have 4 leaf at the bottom of the plot  
hierarchy.dendrogram(Z, truncate_mode = 'lastp', p=4);

png

# method 2: level
# No more than ``p`` levels of the dendrogram tree are displayed.
hierarchy.dendrogram(Z, truncate_mode = 'level', p=2);

png

# 方向 orientation

# Calculate the distance between each sample
Z = hierarchy.linkage(df, 'ward')
 
# Orientation of the dendrogram
# 设置层次树的朝向,orientation可选"top", "left", "bottom", "right",默认top
hierarchy.dendrogram(Z, orientation="right", labels=df.index);

png

# Orientation of the dendrogram
hierarchy.dendrogram(Z, orientation="bottom", labels=df.index);

png

4. 彩色树形图标签 color dendrogram labels

# Calculate the distance between each sample
Z = hierarchy.linkage(df, 'ward')

# Make the dendro
# 画树状图
hierarchy.dendrogram(Z, labels=df.index, leaf_rotation=0, orientation="left", color_threshold=240, above_threshold_color='grey')

# Create a color palette with 3 color for the 3 cyl possibilities
# 设置渐变颜色,共三种颜色
my_palette = plt.cm.get_cmap("Accent", 3)

# transforme the 'cyl' column in a categorical variable. It will allow to put one color on each level.
# 根据cyl设置颜色参数,对参数进行分类
df['cyl']=pd.Categorical(df['cyl'])
# 获得每种汽车cyl对应的颜色
my_color=df['cyl'].cat.codes

# Apply the right color to each label
ax = plt.gca()
# 获得y轴坐标标签
xlbls = ax.get_ymajorticklabels()
num=-1
for lbl in xlbls:
    num+=1
    val=my_color[num]
    # 设置颜色
    lbl.set_color(my_palette(val))

png

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值