Hierarchical Clustering Algorithm: A Worked Example
Dataset: Travel details dataset
Source: https://www.kaggle.com/code/rkiattisak/starter-for-traveler-trip-dataset
Field | Description |
---|---|
Trip ID | Unique identifier of the trip |
Destination | Trip destination |
Start date | Trip start date |
End date | Trip end date |
Duration (days) | Trip duration in days |
Traveler name | Traveler's name |
Traveler age | Traveler's age |
Traveler gender | Traveler's gender |
Traveler nationality | Traveler's nationality |
Accommodation type | Type of accommodation |
Accommodation cost | Cost of accommodation |
Transportation type | Mode of transportation |
Transportation cost | Cost of transportation |
1. Data acquisition and preprocessing
(1) Loading the dataset
import pandas as pd
# Read the file
file_path = 'D:/MachineLearningDesign/TotalDataset/Travel_detailsDataset/Travel details dataset.xlsx'
travel_data = pd.read_excel(file_path)
# Display the dataset
travel_data
Trip ID | Destination | Start date | End date | Duration (days) | Traveler name | Traveler age | Traveler gender | Traveler nationality | Accommodation type | Accommodation cost | Transportation type | Transportation cost | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | London, UK | 5/1/2023 | 5/8/2023 | 7 | John Smith | 35 | Male | American | Hotel | 1200 | Flight | 600.0 |
1 | 2 | Phuket, Thailand | 6/15/2023 | 6/20/2023 | 5 | Jane Doe | 28 | Female | Canadian | Resort | 800 | Flight | 500.0 |
2 | 3 | Bali, Indonesia | 7/1/2023 | 7/8/2023 | 7 | David Lee | 45 | Male | Korean | Villa | 1000 | Flight | 700.0 |
3 | 4 | New York, USA | 8/15/2023 | 8/29/2023 | 14 | Sarah Johnson | 29 | Female | British | Hotel | 2000 | Flight | 1000.0 |
4 | 5 | Tokyo, Japan | 9/10/2023 | 9/17/2023 | 7 | Kim Nguyen | 26 | Female | Vietnamese | Airbnb | 700 | Train | 200.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
132 | 135 | Rio de Janeiro, Brazil | 8/1/2023 | 8/10/2023 | 9 | Jose Perez | 37 | Male | Brazilian | Hostel | 2500 | Car | 2000.0 |
133 | 136 | Vancouver, Canada | 8/15/2023 | 8/21/2023 | 6 | Emma Wilson | 29 | Female | Canadian | Hotel | 5000 | Airplane | 3000.0 |
134 | 137 | Bangkok, Thailand | 9/1/2023 | 9/8/2023 | 7 | Ryan Chen | 34 | Male | Chinese | Hostel | 2000 | Train | 1000.0 |
135 | 138 | Barcelona, Spain | 9/15/2023 | 9/22/2023 | 7 | Sofia Rodriguez | 25 | Female | Spanish | Airbnb | 6000 | Airplane | 2500.0 |
136 | 139 | Auckland, New Zealand | 10/1/2023 | 10/8/2023 | 7 | William Brown | 39 | Male | New Zealander | Hotel | 7000 | Train | 2500.0 |
137 rows × 13 columns
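Before preprocessing, it is worth confirming the frame's shape, dtypes, and missing-value counts, since those drive the steps that follow. A minimal sketch on a tiny synthetic frame (the rows below are illustrative stand-ins, not the real Kaggle file):

```python
import pandas as pd

# A tiny synthetic frame standing in for travel_data (the real file is not bundled here)
travel_sample = pd.DataFrame({
    'Trip ID': [1, 2],
    'Destination': ['London, UK', 'Phuket, Thailand'],
    'Start date': ['5/1/2023', '6/15/2023'],
    'End date': ['5/8/2023', '6/20/2023'],
    'Duration (days)': [7, 5],
    'Transportation cost': [600.0, None],
})

print(travel_sample.shape)         # (rows, columns)
print(travel_sample.dtypes)        # object columns will need encoding later
print(travel_sample.isna().sum())  # missing values per column
```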
(2) Converting the date features to numeric form
# The dataset has two date features, Start date and End date: the start and end dates of the trip
# Convert the date features to datetime format
travel_data['Start date'] = pd.to_datetime(travel_data['Start date'])
travel_data['End date'] = pd.to_datetime(travel_data['End date'])
# Approach: date decomposition
# Split Start date and End date into year, month, and day
# The new numeric features are appended as columns at the end of travel_data
travel_data['sy'] = travel_data['Start date'].dt.year   # start year
travel_data['sm'] = travel_data['Start date'].dt.month  # start month
travel_data['sd'] = travel_data['Start date'].dt.day    # start day
travel_data['ey'] = travel_data['End date'].dt.year     # end year
travel_data['em'] = travel_data['End date'].dt.month    # end month
travel_data['ed'] = travel_data['End date'].dt.day      # end day
# Drop the original 'Start date' and 'End date' columns
travel_data.drop(['Start date', 'End date'], axis=1, inplace=True)
travel_data
Trip ID | Destination | Duration (days) | Traveler name | Traveler age | Traveler gender | Traveler nationality | Accommodation type | Accommodation cost | Transportation type | Transportation cost | sy | sm | sd | ey | em | ed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | London, UK | 7 | John Smith | 35 | Male | American | Hotel | 1200 | Flight | 600.0 | 2023 | 5 | 1 | 2023 | 5 | 8 |
1 | 2 | Phuket, Thailand | 5 | Jane Doe | 28 | Female | Canadian | Resort | 800 | Flight | 500.0 | 2023 | 6 | 15 | 2023 | 6 | 20 |
2 | 3 | Bali, Indonesia | 7 | David Lee | 45 | Male | Korean | Villa | 1000 | Flight | 700.0 | 2023 | 7 | 1 | 2023 | 7 | 8 |
3 | 4 | New York, USA | 14 | Sarah Johnson | 29 | Female | British | Hotel | 2000 | Flight | 1000.0 | 2023 | 8 | 15 | 2023 | 8 | 29 |
4 | 5 | Tokyo, Japan | 7 | Kim Nguyen | 26 | Female | Vietnamese | Airbnb | 700 | Train | 200.0 | 2023 | 9 | 10 | 2023 | 9 | 17 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
132 | 135 | Rio de Janeiro, Brazil | 9 | Jose Perez | 37 | Male | Brazilian | Hostel | 2500 | Car | 2000.0 | 2023 | 8 | 1 | 2023 | 8 | 10 |
133 | 136 | Vancouver, Canada | 6 | Emma Wilson | 29 | Female | Canadian | Hotel | 5000 | Airplane | 3000.0 | 2023 | 8 | 15 | 2023 | 8 | 21 |
134 | 137 | Bangkok, Thailand | 7 | Ryan Chen | 34 | Male | Chinese | Hostel | 2000 | Train | 1000.0 | 2023 | 9 | 1 | 2023 | 9 | 8 |
135 | 138 | Barcelona, Spain | 7 | Sofia Rodriguez | 25 | Female | Spanish | Airbnb | 6000 | Airplane | 2500.0 | 2023 | 9 | 15 | 2023 | 9 | 22 |
136 | 139 | Auckland, New Zealand | 7 | William Brown | 39 | Male | New Zealander | Hotel | 7000 | Train | 2500.0 | 2023 | 10 | 1 | 2023 | 10 | 8 |
137 rows × 17 columns
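Since Duration (days) sits alongside the two dates, a quick consistency check that (End date − Start date).days matches the recorded duration can catch data-entry errors before clustering. A sketch on two rows mirroring the head of the table above:

```python
import pandas as pd

# Two illustrative rows copied from the head of the dataset
df = pd.DataFrame({
    'Start date': ['5/1/2023', '6/15/2023'],
    'End date': ['5/8/2023', '6/20/2023'],
    'Duration (days)': [7, 5],
})
df['Start date'] = pd.to_datetime(df['Start date'])
df['End date'] = pd.to_datetime(df['End date'])

# The day difference should agree with the recorded duration
computed = (df['End date'] - df['Start date']).dt.days
print(bool((computed == df['Duration (days)']).all()))  # → True
```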
(3) Encoding non-numeric features and handling missing values
from sklearn.preprocessing import LabelEncoder  # label encoding: converts categorical data to integers
# Work on the loaded dataset (note: this is a reference to travel_data, not a copy)
data = travel_data
# Names of the non-numeric (categorical) columns
non_numeric_columns = ['Destination', 'Traveler gender', 'Traveler nationality',
                       'Accommodation type', 'Transportation type']
# Encode the non-numeric columns as integers
label_encoders = {}  # stores the LabelEncoder fitted for each feature
for column in non_numeric_columns:
    # Only encode columns whose dtype is object (strings)
    if data[column].dtype == 'object':
        # Create a LabelEncoder for this column and keep it in the
        # label_encoders dict so the encoding can be inverted later
        label_encoders[column] = LabelEncoder()
        # fit_transform() fits the encoder and converts the string values
        # to integer codes, which are written back into the column
        data[column] = label_encoders[column].fit_transform(data[column])
print(label_encoders)
# Reorder the columns: move the traveler name to the front for later processing
cols = list(data.columns)
cols.remove('Traveler name')     # remove 'Traveler name' from the column list
cols.insert(1, 'Traveler name')  # insert 'Traveler name' as the second column
# Reassign the reordered columns to the dataset
data = data[cols]
# Fill missing values in the numeric columns with the column means
numeric_columns = data.select_dtypes(include=['number']).columns
data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())
data
{'Destination': LabelEncoder(), 'Traveler gender': LabelEncoder(), 'Traveler nationality': LabelEncoder(), 'Accommodation type': LabelEncoder(), 'Transportation type': LabelEncoder()}
Trip ID | Traveler name | Destination | Duration (days) | Traveler age | Traveler gender | Traveler nationality | Accommodation type | Accommodation cost | Transportation type | Transportation cost | sy | sm | sd | ey | em | ed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | John Smith | 30 | 7 | 35 | 1 | 0 | 3 | 1200 | 5 | 600.0 | 2023 | 5 | 1 | 2023 | 5 | 8 |
1 | 2 | Jane Doe | 42 | 5 | 28 | 0 | 7 | 4 | 800 | 5 | 500.0 | 2023 | 6 | 15 | 2023 | 6 | 20 |
2 | 3 | David Lee | 6 | 7 | 45 | 1 | 23 | 7 | 1000 | 5 | 700.0 | 2023 | 7 | 1 | 2023 | 7 | 8 |
3 | 4 | Sarah Johnson | 36 | 14 | 29 | 0 | 4 | 3 | 2000 | 5 | 1000.0 | 2023 | 8 | 15 | 2023 | 8 | 29 |
4 | 5 | Kim Nguyen | 57 | 7 | 26 | 0 | 40 | 0 | 700 | 8 | 200.0 | 2023 | 9 | 10 | 2023 | 9 | 17 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
132 | 135 | Jose Perez | 44 | 9 | 37 | 1 | 3 | 2 | 2500 | 2 | 2000.0 | 2023 | 8 | 1 | 2023 | 8 | 10 |
133 | 136 | Emma Wilson | 58 | 6 | 29 | 0 | 7 | 3 | 5000 | 0 | 3000.0 | 2023 | 8 | 15 | 2023 | 8 | 21 |
134 | 137 | Ryan Chen | 9 | 7 | 34 | 1 | 9 | 2 | 2000 | 8 | 1000.0 | 2023 | 9 | 1 | 2023 | 9 | 8 |
135 | 138 | Sofia Rodriguez | 11 | 7 | 25 | 0 | 33 | 0 | 6000 | 0 | 2500.0 | 2023 | 9 | 15 | 2023 | 9 | 22 |
136 | 139 | William Brown | 3 | 7 | 39 | 1 | 26 | 3 | 7000 | 8 | 2500.0 | 2023 | 10 | 1 | 2023 | 10 | 8 |
137 rows × 17 columns
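Because each fitted LabelEncoder is kept in label_encoders, the integer codes can be mapped back to the original category names at any time with inverse_transform. A self-contained sketch (the category values are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform assigns codes in sorted (alphabetical) order of the classes
codes = le.fit_transform(['Hotel', 'Resort', 'Hotel', 'Airbnb'])
print(codes.tolist())                        # → [1, 2, 1, 0]
print(le.inverse_transform(codes).tolist())  # → ['Hotel', 'Resort', 'Hotel', 'Airbnb']
```

One caveat: label encoding imposes an arbitrary ordering (Airbnb < Hotel < Resort here) that Euclidean distance will treat as meaningful; one-hot encoding (e.g. pd.get_dummies) is a common alternative when that ordering is undesirable.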
(4) Standardizing the data
from sklearn.preprocessing import StandardScaler
# Take everything from the third column onward and convert to an array
array_data = data.iloc[:, 2:].values  # [:, 2:] — all rows; columns from the 3rd to the end
# array_data
transfer = StandardScaler()  # instantiate the scaler
scaler_data = transfer.fit_transform(array_data)
# Inspect the standardized data: number of samples (rows) x number of features (columns)
print(scaler_data, '\n', scaler_data.shape)
[[-0.06124973 -0.37973645 0.25631927 ... 0.1651258 -0.52158185
-1.30645256]
[ 0.61004731 -1.63332426 -0.72692145 ... 0.1651258 -0.19379876
0.35465186]
[-1.4038438 -0.37973645 1.66094888 ... 0.1651258 0.13398433
-1.30645256]
...
[-1.23601954 -0.37973645 0.11585631 ... 0.1651258 0.78955051
-1.30645256]
[-1.1241367 -0.37973645 -1.14831034 ... 0.1651258 0.78955051
0.63150259]
[-1.57166806 -0.37973645 0.81817112 ... 0.1651258 1.1173336
-1.30645256]]
(137, 15)
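StandardScaler should leave every column with mean ≈ 0 and (population) standard deviation ≈ 1; verifying this on a small array is a cheap sanity test of the step above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
scaled = StandardScaler().fit_transform(X)

print(np.allclose(scaled.mean(axis=0), 0.0))  # → True
print(np.allclose(scaled.std(axis=0), 1.0))   # → True (population std, ddof=0)
```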
(5) Extracting the sample names (traveler names) — the sample labels at iteration 0 of the clustering
# Get the name of each sample (the traveler)
column_names_array = data.iloc[:, 1:2].values
column_names_array
array([['John Smith'],
['Jane Doe'],
['David Lee'],
['Sarah Johnson'],
['Kim Nguyen'],
['Michael Brown'],
['Emily Davis'],
['Lucas Santos'],
['Laura Janssen'],
['Mohammed Ali'],
['Ana Hernandez'],
['Carlos Garcia'],
['Lily Wong'],
['Hans Mueller'],
['Fatima Khouri'],
['James MacKenzie'],
['Sarah Johnson'],
['Michael Chang'],
['Olivia Rodriguez'],
['Kenji Nakamura'],
['Emily Lee'],
['James Wilson'],
['Sofia Russo'],
['Raj Patel'],
['Lily Nguyen'],
['David Kim'],
['Maria Garcia'],
['Alice Smith'],
['Bob Johnson'],
['Charlie Lee'],
['Emma Davis'],
['Olivia Martin'],
['Harry Wilson'],
['Sophia Lee'],
['James Brown'],
['Mia Johnson'],
['William Davis'],
['Amelia Brown'],
['Mia Johnson'],
['Adam Lee'],
['Sarah Wong'],
['John Smith'],
['Maria Silva'],
['Peter Brown'],
['Emma Garcia'],
['Michael Davis'],
['Nina Patel'],
['Kevin Kim'],
['Laura van den Berg'],
['Jennifer Nguyen'],
['David Kim'],
['Rachel Lee'],
['Jessica Wong'],
['Felipe Almeida'],
['Nisa Patel'],
['Ben Smith'],
['Laura Gomez'],
['Park Min Woo'],
['Michael Chen'],
['Sofia Rossi'],
['Rachel Sanders'],
['Kenji Nakamura'],
['Emily Watson'],
['David Lee'],
['Ana Rodriguez'],
['Tom Wilson'],
['Olivia Green'],
['James Chen'],
['Lila Patel'],
['Marco Rossi'],
['Sarah Brown'],
['Sarah Lee'],
['Alex Kim'],
['Maria Hernandez'],
['John Smith'],
['Mark Johnson'],
['Amanda Chen'],
['David Lee'],
['Nana Kwon'],
['Tom Hanks'],
['Emma Watson'],
['James Kim'],
['John Smith'],
['Sarah Lee'],
['Maria Garcia'],
['David Lee'],
['Emily Davis'],
['James Wilson'],
['Fatima Ahmed'],
['Liam Nguyen'],
['Giulia Rossi'],
['Putra Wijaya'],
['Kim Min-ji'],
['John Smith'],
['Emily Johnson'],
['David Lee'],
['Sarah Brown'],
['Michael Wong'],
['Jessica Chen'],
['Ken Tanaka'],
['Maria Garcia'],
['Rodrigo Oliveira'],
['Olivia Kim'],
['Robert Mueller'],
['John Smith'],
['Sarah Lee'],
['Michael Wong'],
['Lisa Chen'],
['David Kim'],
['Emily Wong'],
['Mark Tan'],
['Emma Lee'],
['George Chen'],
['Sophia Kim'],
['Alex Ng'],
['Alice Smith'],
['Bob Johnson'],
['Cindy Chen'],
['David Lee'],
['Emily Kim'],
['Frank Li'],
['Gina Lee'],
['Henry Kim'],
['Isabella Chen'],
['Jack Smith'],
['Katie Johnson'],
['John Doe'],
['Jane Smith'],
['Michael Johnson'],
['Sarah Lee'],
['David Kim'],
['Emily Davis'],
['Jose Perez'],
['Emma Wilson'],
['Ryan Chen'],
['Sofia Rodriguez'],
['William Brown']], dtype=object)
# Check that the number of sample names matches the number of data rows
column_names_array.shape, len(scaler_data)
((137, 1), 137)
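Note that column_names_array has shape (137, 1), so each element is itself a one-element array — which is why the cluster listings later print names as ['John Smith']. Flattening with ravel() would yield plain strings instead; a small sketch:

```python
import numpy as np

names_2d = np.array([['John Smith'], ['Jane Doe']], dtype=object)  # shape (2, 1)
names_flat = names_2d.ravel()                                      # shape (2,)

print(str(names_2d[0]))    # → ['John Smith']
print(str(names_flat[0]))  # → John Smith
```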
2. Plot dendrograms with different similarity (linkage) criteria and choose the better one from the dendrograms
– Using the first 51 samples of the dataset –
– Distances are computed as Euclidean distances –
(1) Importing the hierarchical clustering tools and the plotting utilities
from scipy.cluster import hierarchy as sch  # hierarchical clustering, performed by row (sample)
import matplotlib.pyplot as plt
(2) Single linkage (minimum distance) as the similarity criterion
# Build dendrogram 1
Z1 = sch.linkage(scaler_data[:51], metric='euclidean', method='single')
# Plot the dendrogram with the traveler names as leaf labels
plt.figure(figsize=(10, 5))
dendrogram = sch.dendrogram(Z1, labels=column_names_array[:51].ravel())
# Rotate the leaf labels so they remain readable
plt.xticks(rotation=90, fontsize=10)  # rotate by 90 degrees
plt.ylabel('Euclidean distance + single linkage')
plt.title('Hierarchical clustering dendrogram 1')
plt.show()
(3) Complete linkage (maximum distance) as the similarity criterion
# Build dendrogram 2
Z2 = sch.linkage(scaler_data[:51], metric='euclidean', method='complete')
plt.figure(figsize=(10, 5))
dendrogram = sch.dendrogram(Z2, labels=column_names_array[:51].ravel())
plt.xticks(rotation=90, fontsize=10)
plt.ylabel('Euclidean distance + complete linkage')
plt.title('Hierarchical clustering dendrogram 2')
plt.show()
*Note: the statement Z2 = sch.linkage(scaler_data[:51], metric='euclidean', method='complete')
returns the linkage matrix of the hierarchical clustering tree — a two-dimensional array holding information about the clustering. Each row of the linkage matrix records one merge of two clusters and contains:
1. The first two columns: the indices (labels) of the two clusters being merged.
2. The third column: the distance (dissimilarity) between those two clusters.
3. The fourth column: the number of data points in the newly merged cluster.
Note: by point 2, the third column of the linkage matrix gives the tree height reached at each iteration of the model.
The linkage matrix is produced by the hierarchical clustering algorithm as it repeatedly merges data points into larger and larger clusters; it records the details of every merge performed along the way.
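The structure described above can be verified directly on a small random dataset (the names X and Z below are illustrative):

```python
import numpy as np
from scipy.cluster import hierarchy as sch

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))  # 6 samples, 3 features
Z = sch.linkage(X, metric='euclidean', method='complete')

# One row per merge: [cluster i, cluster j, merge distance, new cluster size]
print(Z.shape)                              # → (5, 4): n samples give n - 1 merges
# Column 2 holds the tree height of each iteration; complete linkage is non-decreasing
print(bool(np.all(np.diff(Z[:, 2]) >= 0)))  # → True
# The final merge contains all 6 samples
print(int(Z[-1, 3]))                        # → 6
```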
(4) Average linkage as the similarity criterion
# Build dendrogram 3
Z3 = sch.linkage(scaler_data[:51], metric='euclidean', method='average')
plt.figure(figsize=(10, 5))
dendrogram = sch.dendrogram(Z3, labels=column_names_array[:51].ravel())
plt.xticks(rotation=90, fontsize=10)
plt.ylabel('Euclidean distance + average linkage')
plt.title('Hierarchical clustering dendrogram 3')
plt.show()
Judging from the three dendrograms above, complete linkage and average linkage separate the merge distances more clearly at each iteration, so the levels of the hierarchy are well distinguished; single linkage separates the levels noticeably less well. For the cluster computations below we therefore use the complete-linkage result.
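A more quantitative way to compare linkage criteria than eyeballing dendrograms is the cophenetic correlation coefficient (scipy's sch.cophenet), which measures how faithfully the tree's merge distances preserve the original pairwise distances. A sketch on synthetic random data (not the travel dataset):

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X = rng.standard_normal((30, 5))
d = pdist(X, metric='euclidean')  # condensed pairwise distance matrix

for method in ('single', 'complete', 'average'):
    Z = sch.linkage(d, method=method)
    c, _ = sch.cophenet(Z, d)     # correlation in [-1, 1]; higher is more faithful
    print(method, round(float(c), 3))
```

Average linkage often scores highest on this measure, but the score should still be weighed against what the dendrograms show for the task at hand.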
3. Computing the clustering results
(1) Clusters under a given threshold: how many clusters there are and which samples belong to each
# Clustering result for a given threshold
yuzhi = float(input('Enter a threshold: '))
# yuzhi = 7.4  # the cut height: determines how many clusters the tree is cut into
label = []  # stores the cluster label of each data point
# Outer loop: cut the tree at height yuzhi; cut_tree returns each point's cluster label, e.g. [0], [1], ...
for i in sch.cut_tree(Z2, height=yuzhi):
    # Inner loop: unwrap the one-element label array of each data point
    for j in i:
        label.append(j)  # store each sample's cluster label in the list
labelCount = set(label)  # the set of labels: how many distinct clusters there are
print('Threshold = ' + str(yuzhi) + ' \nNumber of clusters = ' + str(len(list(labelCount))))
# Build the list of traveler names
guest_names = [str(name) for name in column_names_array[:51]]
print('-' * 15)
print('Cluster \t sample index && traveler name')
unique_labels = list(set(label))
if unique_labels:  # check that label is not empty
    for i in unique_labels:  # iterate over the distinct labels
        print(i, ' : ', end=' ')
        for j in range(len(label)):  # iterate over the label list
            if i == label[j]:  # does this sample carry the current label?
                print(j, guest_names[j], end='\t\t')  # if so, print its index and name
        print()  # newline before the next cluster
else:
    print('No clustering result')  # label is empty, so there is nothing to report
Threshold = 6.5
Number of clusters = 5
---------------
Cluster 	 sample index && traveler name
0 : 0 ['John Smith'] 2 ['David Lee'] 5 ['Michael Brown'] 13 ['Hans Mueller'] 19 ['Kenji Nakamura'] 21 ['James Wilson'] 25 ['David Kim'] 41 ['John Smith']
1 : 1 ['Jane Doe'] 11 ['Carlos Garcia'] 14 ['Fatima Khouri'] 15 ['James MacKenzie'] 17 ['Michael Chang'] 18 ['Olivia Rodriguez'] 23 ['Raj Patel'] 24 ['Lily Nguyen'] 26 ['Maria Garcia'] 27 ['Alice Smith'] 28 ['Bob Johnson'] 36 ['William Davis'] 38 ['Mia Johnson'] 44 ['Emma Garcia'] 45 ['Michael Davis'] 48 ['Laura van den Berg'] 49 ['Jennifer Nguyen']
2 : 3 ['Sarah Johnson'] 30 ['Emma Davis']
3 : 4 ['Kim Nguyen'] 6 ['Emily Davis'] 16 ['Sarah Johnson'] 20 ['Emily Lee'] 22 ['Sofia Russo'] 29 ['Charlie Lee'] 31 ['Olivia Martin'] 32 ['Harry Wilson'] 33 ['Sophia Lee'] 37 ['Amelia Brown'] 40 ['Sarah Wong'] 42 ['Maria Silva'] 47 ['Kevin Kim'] 50 ['David Kim']
4 : 7 ['Lucas Santos'] 8 ['Laura Janssen'] 9 ['Mohammed Ali'] 10 ['Ana Hernandez'] 12 ['Lily Wong'] 34 ['James Brown'] 35 ['Mia Johnson'] 39 ['Adam Lee'] 43 ['Peter Brown'] 46 ['Nina Patel']
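cut_tree is one of two scipy routines for extracting flat clusters; sch.fcluster(Z, t, criterion='distance') does the same job and returns a flat 1-D label array directly (with labels starting at 1 rather than 0), avoiding the nested unwrapping loop above. A sketch on random data:

```python
import numpy as np
from scipy.cluster import hierarchy as sch

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))
Z = sch.linkage(X, metric='euclidean', method='complete')

labels_cut = sch.cut_tree(Z, height=3.0).ravel()          # 0-based labels, needs flattening
labels_fc = sch.fcluster(Z, t=3.0, criterion='distance')  # 1-based labels, already flat

print(labels_cut.shape, labels_fc.shape)
print(len(set(labels_cut)), len(set(labels_fc)))  # both partitions have the same cluster count
```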
(2) Clusters for a given number of clusters: which samples belong to each
# Clustering result for a given number of clusters
n = int(input('Enter the number of clusters: '))
# n = 9
label = []
for i in sch.cut_tree(Z2, n_clusters=n):
    for j in i:
        label.append(j)
print('Number of clusters = ' + str(n))
# Build the list of traveler names
guest_names = [str(name) for name in column_names_array[:51]]
print('-' * 15)
print('Cluster \t sample index && traveler name')
unique_labels = list(set(label))
if unique_labels:  # check that label is not empty
    for i in unique_labels:  # iterate over the distinct labels
        print(i, ' : ', end=' ')
        # use zip() to walk the label indices and guest names together
        for idx, name in zip(range(len(label)), guest_names):
            if i == label[idx]:  # does this sample carry the current label?
                print(idx, name, end='\t\t')  # if so, print its index and name
        print()  # newline before the next cluster
else:
    print('No clustering result')  # label is empty, so there is nothing to report
Number of clusters = 5
---------------
Cluster 	 sample index && traveler name
0 : 0 ['John Smith'] 2 ['David Lee'] 5 ['Michael Brown'] 13 ['Hans Mueller'] 19 ['Kenji Nakamura'] 21 ['James Wilson'] 25 ['David Kim'] 41 ['John Smith']
1 : 1 ['Jane Doe'] 11 ['Carlos Garcia'] 14 ['Fatima Khouri'] 15 ['James MacKenzie'] 17 ['Michael Chang'] 18 ['Olivia Rodriguez'] 23 ['Raj Patel'] 24 ['Lily Nguyen'] 26 ['Maria Garcia'] 27 ['Alice Smith'] 28 ['Bob Johnson'] 36 ['William Davis'] 38 ['Mia Johnson'] 44 ['Emma Garcia'] 45 ['Michael Davis'] 48 ['Laura van den Berg'] 49 ['Jennifer Nguyen']
2 : 3 ['Sarah Johnson'] 30 ['Emma Davis']
3 : 4 ['Kim Nguyen'] 6 ['Emily Davis'] 16 ['Sarah Johnson'] 20 ['Emily Lee'] 22 ['Sofia Russo'] 29 ['Charlie Lee'] 31 ['Olivia Martin'] 32 ['Harry Wilson'] 33 ['Sophia Lee'] 37 ['Amelia Brown'] 40 ['Sarah Wong'] 42 ['Maria Silva'] 47 ['Kevin Kim'] 50 ['David Kim']
4 : 7 ['Lucas Santos'] 8 ['Laura Janssen'] 9 ['Mohammed Ali'] 10 ['Ana Hernandez'] 12 ['Lily Wong'] 34 ['James Brown'] 35 ['Mia Johnson'] 39 ['Adam Lee'] 43 ['Peter Brown'] 46 ['Nina Patel']
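The nested printing loops above scan the label list once per cluster; collecting samples into a dict keyed by label does the grouping in a single pass and leaves a reusable structure. A sketch with hypothetical labels and names (the values below are illustrative stand-ins for the cut_tree output and guest_names):

```python
from collections import defaultdict

# Hypothetical stand-ins for the cut_tree labels and traveler names
label = [0, 1, 0, 2, 1]
guest_names = ['John Smith', 'Jane Doe', 'David Lee', 'Sarah Johnson', 'Kim Nguyen']

clusters = defaultdict(list)
for idx, lab in enumerate(label):
    clusters[lab].append((idx, guest_names[idx]))  # one pass over the samples

for lab in sorted(clusters):
    print(lab, ':', clusters[lab])
```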
(3) The threshold range that yields a given number of clusters
# Collect the merge height of every iteration of the hierarchical clustering model
# Store the heights in TreehighList, with element 0 set to 0 to represent the initial state (no merges yet)
# The iteration heights are appended starting at index 1
# The resulting list length equals the number of samples
TreehighList = []
TreehighList.append(0)
for i in range(len(Z2)):
    TreehighList.append(Z2[i, 2])
TreehighList, len(TreehighList)
([0,
1.3207220250706178,
1.713192288277125,
1.948618219530062,
2.1938156142412155,
2.344051009400656,
2.385236125346607,
2.5227496951700554,
2.5847786704830993,
2.6609018439884236,
2.664531638770863,
2.7194643386552477,
2.7231357411789694,
2.838154747820297,
2.959983314668614,
3.014549824867137,
3.0870393522464195,
3.21498864327589,
3.228770291629527,
3.242982467896726,
3.293801779310498,
3.346960826020779,
3.396800112724491,
3.4031309441625845,
3.4379960453193914,
3.7387064608965064,
3.7774664570067644,
3.7988233271953695,
3.843327398012852,
3.9861651879658195,
4.2197659666669365,
4.291448853880782,
4.463605246207876,
4.6155892525418025,
4.782801321694507,
4.7839476461738535,
4.793592690309439,
4.851137858942306,
4.945266643885942,
4.994082102721412,
5.214885156903247,
5.29249457737719,
5.356639595890159,
5.835802656747644,
5.967735673534881,
6.1319341491966295,
6.340049875642774,
6.650802835756102,
6.85452516718054,
7.432114871942092,
8.118708906268715],
51)
# Height of the topmost merge of the model
top = TreehighList[-1]  # the last element of the list
top
8.118708906268715
# Number of merges: number of samples = number of merges + 1 = rows of the linkage matrix + 1
linkCounts = len(Z2)
linkCounts
50
# Function that solves for the threshold range
def ThresholdRange(n):
    # More clusters requested than there are samples: invalid input
    if n > len(TreehighList):
        return "No threshold yields {} clusters!".format(n)
    # n == 1: any threshold at or above the topmost merge height works
    elif n == 1:
        y = top
        return "Threshold range for {} cluster(s): [{} , INF).".format(n, y)
    else:
        # index of the lower bound in the height list
        low = (len(TreehighList)) - n
        # index of the upper bound
        high = low + 1
        # half-open interval: closed on the left, open on the right
        return "Threshold range for {} clusters: [{} , {}).".format(n, TreehighList[low], TreehighList[high])
n = int(input("Enter the number of clusters: "))
# n=46
ThresholdRange(n)
'Threshold range for 9 clusters: [5.356639595890159 , 5.835802656747644).'
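The returned interval can be cross-checked by cutting the tree at a point inside it and counting the resulting clusters. A self-contained sketch on random data, mirroring the TreehighList construction above:

```python
import numpy as np
from scipy.cluster import hierarchy as sch

rng = np.random.default_rng(7)
X = rng.standard_normal((20, 3))
Z = sch.linkage(X, metric='euclidean', method='complete')

# heights[0] = 0, then the merge height of each iteration (as in TreehighList)
heights = np.concatenate(([0.0], Z[:, 2]))

n = 5                                 # desired number of clusters
low = heights[len(heights) - n]       # lower bound of the threshold range
high = heights[len(heights) - n + 1]  # upper bound (exclusive)
mid = (low + high) / 2.0              # any threshold strictly inside the range

k = len(set(sch.cut_tree(Z, height=mid).ravel()))
print(k)  # → 5
```

Cutting anywhere in [low, high) leaves exactly the last n − 1 merges undone, so the cut produces n clusters.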