Hierarchical Clustering in Python: Machine Learning, Hierarchical Clustering Algorithm (Hierarchy Cluster)

Section I: Brief Introduction to Hierarchical Clustering

The two standard algorithms for agglomerative hierarchical clustering are single linkage and complete linkage. With single linkage, we compute the distance between the most similar members of each pair of clusters and merge the two clusters for which this distance is the smallest. Complete linkage is similar, but instead of comparing the most similar members of each pair of clusters, it compares the most dissimilar members to decide which clusters to merge.
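To make the difference between the two criteria concrete, here is a minimal sketch that computes both distances for one pair of clusters. The helper names `single_linkage_distance` and `complete_linkage_distance` and the toy coordinates are made up for this illustration; they are not part of the original post or of SciPy.

```python
import numpy as np
from scipy.spatial.distance import cdist


def single_linkage_distance(cluster_a, cluster_b):
    """Distance between the two closest (most similar) members."""
    return cdist(cluster_a, cluster_b, metric="euclidean").min()


def complete_linkage_distance(cluster_a, cluster_b):
    """Distance between the two farthest (most dissimilar) members."""
    return cdist(cluster_a, cluster_b, metric="euclidean").max()


# Two toy clusters of 2-D points (hypothetical values, for illustration only)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# Single linkage uses the pair (1, 0) and (4, 0)
print(single_linkage_distance(cluster_a, cluster_b))
# Complete linkage uses the pair (0, 0) and (6, 0)
print(complete_linkage_distance(cluster_a, cluster_b))
```

Both criteria start from the same table of pairwise distances between the members of the two clusters; they differ only in whether the minimum or the maximum entry is taken.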

Hierarchical complete linkage clustering is an iterative procedure that can be summarized by the following steps:

Step 1: Compute the distance matrix of all samples (Euclidean Distance)

Step 2: Represent each data point as a singleton cluster

Step 3: Merge the two closest clusters based on the distance between the most dissimilar (distant) members

Step 4: Update the distance matrix

Step 5: Repeat steps 3-4 until one single cluster remains
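The procedure above can be sketched as a short, deliberately naive loop. This is an illustration under stated assumptions, not how SciPy implements it: the function name `naive_complete_linkage` and the toy data are hypothetical, and instead of maintaining an updated matrix, the sketch recomputes each cluster-to-cluster distance from the original sample distances on every pass.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def naive_complete_linkage(points):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    # Step 1: pairwise Euclidean distances between all samples
    dist = squareform(pdist(points, metric="euclidean"))
    # Step 2: every sample starts as its own singleton cluster
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Step 3: find the pair of clusters whose most distant members
        # are closest to each other (complete linkage criterion)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        # Step 4: replace the two merged clusters with their union
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    # Step 5: the loop ends when a single cluster remains
    return merges


# Hypothetical 1-D toy data, chosen so the merges are easy to follow by hand
points = np.array([[0.0], [1.0], [5.0], [7.0]])
for members_a, members_b, d in naive_complete_linkage(points):
    print(members_a, members_b, d)
```

The nested search makes this sketch slow for anything beyond small inputs; it is meant only to mirror the listed steps one by one.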

Source:

Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.

Part 1: Data Initialization

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

# Suppress library warnings to keep the printed output clean
warnings.filterwarnings("ignore")

# Figure settings (no plots are drawn in this part)
plt.rcParams['figure.dpi'] = 200
plt.rcParams['savefig.dpi'] = 200
font = {'weight': 'light'}
plt.rc("font", **font)

# Fix the random seed so the results are reproducible
np.random.seed(123)

# Section 1: Generate random data
variables = ['X', 'Y', 'Z']
labels = ['ID_1', 'ID_2', 'ID_3', 'ID_4', 'ID_5']
X = np.random.random_sample([5, 3]) * 10
df = pd.DataFrame(X, columns=variables, index=labels)
print("Original DataFrame:\n", df)

Output

Original DataFrame:
             X         Y         Z
ID_1  6.964692  2.861393  2.268515
ID_2  5.513148  7.194690  4.231065
ID_3  9.807642  6.848297  4.809319
ID_4  3.921175  3.431780  7.290497
ID_5  4.385722  0.596779  3.980443

Part 2: Euclidean Distance Computation

Method 1: Using the pdist and squareform functions from SciPy

Code

# Section 2: Perform hierarchical clustering on a distance matrix
# Section 2.1: Via pdist and squareform methods
from scipy.spatial.distance import pdist, squareform

# pdist returns the condensed pairwise distances; squareform expands
# them into a symmetric 5x5 matrix
row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
                        columns=labels, index=labels)
print("\nData Distance via pdist and squareform: \n", row_dist)

Output

Data Distance via pdist and squareform:
          ID_1      ID_2      ID_3      ID_4      ID_5
ID_1  0.000000  4.973534  5.516653  5.899885  3.835396
ID_2  4.973534  0.000000  4.347073  5.104311  6.698233
ID_3  5.516653  4.347073  0.000000  7.244262  8.316594
ID_4  5.899885  5.104311  7.244262  0.000000  4.382864
ID_5  3.835396  6.698233  8.316594  4.382864  0.000000
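As a quick sanity check on the distance matrix, one entry can be recomputed by hand. The sketch below uses the rounded coordinates of ID_1 and ID_5 as printed in the original DataFrame, so the result should agree with the ID_1/ID_5 entry (3.835396) only up to rounding error.

```python
import numpy as np

# Rounded coordinates of ID_1 and ID_5, copied from the printed DataFrame
id_1 = np.array([6.964692, 2.861393, 2.268515])
id_5 = np.array([4.385722, 0.596779, 3.980443])

# Euclidean distance: square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((id_1 - id_5) ** 2))
print(distance)
```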

Method 2: Using the linkage function

Code

# Section 2.2: Via linkage method
from scipy.cluster.hierarchy import linkage

# Complete-linkage clustering on the raw samples; each row of the result
# describes one merge
row_cluster = linkage(df.values, method='complete', metric='euclidean')
row_dist_linkage = pd.DataFrame(
    row_cluster,
    columns=['Row Label 1', 'Row Label 2', 'Distance', 'Item Number in Cluster'],
    index=['Cluster %d' % (i + 1) for i in range(row_cluster.shape[0])])
print("\nData Distance via Linkage: \n", row_dist_linkage)

Output

Data Distance via Linkage:
           Row Label 1  Row Label 2  Distance  Item Number in Cluster
Cluster 1          0.0          4.0  3.835396                     2.0
Cluster 2          1.0          2.0  4.347073                     2.0
Cluster 3          3.0          5.0  5.899885                     3.0
Cluster 4          6.0          7.0  8.316594                     5.0
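The two methods connect naturally: besides raw samples, `linkage` also accepts the condensed distance vector returned by `pdist`. The minimal sketch below regenerates the same random data (seed 123) and compares both call styles; the variable names `from_samples` and `from_condensed` are chosen here for illustration. Note that the square matrix produced by `squareform` should not be passed to `linkage`, since a 2-D input is interpreted as a matrix of observations rather than as distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Same random data as in the post (seed 123, 5 samples, 3 features)
np.random.seed(123)
X = np.random.random_sample([5, 3]) * 10

# Option A: pass the raw samples and let linkage compute the distances
from_samples = linkage(X, method="complete", metric="euclidean")

# Option B: pass the condensed (1-D) distance vector from pdist
from_condensed = linkage(pdist(X, metric="euclidean"), method="complete")

# Both calls should describe the same sequence of merges
print(np.allclose(from_samples, from_condensed))
```

In each row of the linkage matrix, the first two columns are the indices of the merged clusters. Indices 0 to 4 are the original samples, and indices 5 and above refer to clusters formed in earlier rows.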

References

Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.
