Hierarchical Clustering Algorithm: A Worked Example
Dataset: Travel details dataset
Source: https://www.kaggle.com/code/rkiattisak/starter-for-traveler-trip-dataset
Field | Description |
---|---|
Trip ID | Unique identifier of the trip |
Destination | Trip destination |
Start date | Trip start date |
End date | Trip end date |
Duration (days) | Trip duration in days |
Traveler name | Traveler's name |
Traveler age | Traveler's age |
Traveler gender | Traveler's gender |
Traveler nationality | Traveler's nationality |
Accommodation type | Type of accommodation |
Accommodation cost | Cost of accommodation |
Transportation type | Mode of transportation |
Transportation cost | Cost of transportation |
1. Data acquisition and preprocessing
(1) Loading the dataset
import pandas as pd
# Read the file
file_path = 'D:/MachineLearningDesign/TotalDataset/Travel_detailsDataset/Travel details dataset.xlsx'
travel_data = pd.read_excel(file_path)
# Display the dataset
travel_data
Trip ID | Destination | Start date | End date | Duration (days) | Traveler name | Traveler age | Traveler gender | Traveler nationality | Accommodation type | Accommodation cost | Transportation type | Transportation cost | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | London, UK | 5/1/2023 | 5/8/2023 | 7 | John Smith | 35 | Male | American | Hotel | 1200 | Flight | 600.0 |
1 | 2 | Phuket, Thailand | 6/15/2023 | 6/20/2023 | 5 | Jane Doe | 28 | Female | Canadian | Resort | 800 | Flight | 500.0 |
2 | 3 | Bali, Indonesia | 7/1/2023 | 7/8/2023 | 7 | David Lee | 45 | Male | Korean | Villa | 1000 | Flight | 700.0 |
3 | 4 | New York, USA | 8/15/2023 | 8/29/2023 | 14 | Sarah Johnson | 29 | Female | British | Hotel | 2000 | Flight | 1000.0 |
4 | 5 | Tokyo, Japan | 9/10/2023 | 9/17/2023 | 7 | Kim Nguyen | 26 | Female | Vietnamese | Airbnb | 700 | Train | 200.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
132 | 135 | Rio de Janeiro, Brazil | 8/1/2023 | 8/10/2023 | 9 | Jose Perez | 37 | Male | Brazilian | Hostel | 2500 | Car | 2000.0 |
133 | 136 | Vancouver, Canada | 8/15/2023 | 8/21/2023 | 6 | Emma Wilson | 29 | Female | Canadian | Hotel | 5000 | Airplane | 3000.0 |
134 | 137 | Bangkok, Thailand | 9/1/2023 | 9/8/2023 | 7 | Ryan Chen | 34 | Male | Chinese | Hostel | 2000 | Train | 1000.0 |
135 | 138 | Barcelona, Spain | 9/15/2023 | 9/22/2023 | 7 | Sofia Rodriguez | 25 | Female | Spanish | Airbnb | 6000 | Airplane | 2500.0 |
136 | 139 | Auckland, New Zealand | 10/1/2023 | 10/8/2023 | 7 | William Brown | 39 | Male | New Zealander | Hotel | 7000 | Train | 2500.0 |
137 rows × 13 columns
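Before preprocessing, it is worth confirming the frame's shape, dtypes, and missing-value counts, since those drive the steps that follow. A minimal sketch on a tiny synthetic frame (the rows below are illustrative stand-ins, not the real Kaggle file):

```python
import pandas as pd

# A tiny synthetic frame standing in for travel_data (the real file is not bundled here)
travel_sample = pd.DataFrame({
    'Trip ID': [1, 2],
    'Destination': ['London, UK', 'Phuket, Thailand'],
    'Start date': ['5/1/2023', '6/15/2023'],
    'End date': ['5/8/2023', '6/20/2023'],
    'Duration (days)': [7, 5],
    'Transportation cost': [600.0, None],
})

print(travel_sample.shape)         # (rows, columns)
print(travel_sample.dtypes)        # object columns will need encoding later
print(travel_sample.isna().sum())  # missing values per column
```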
(2) Converting the date features to numeric form
# The dataset has two date features, Start date and End date: the start and end dates of the trip
# Convert the date features to datetime format
travel_data['Start date'] = pd.to_datetime(travel_data['Start date'])
travel_data['End date'] = pd.to_datetime(travel_data['End date'])
# Approach: date decomposition
# Split Start date and End date into year, month, and day
# The new numeric features are appended as columns at the end of travel_data
travel_data['sy'] = travel_data['Start date'].dt.year   # start year
travel_data['sm'] = travel_data['Start date'].dt.month  # start month
travel_data['sd'] = travel_data['Start date'].dt.day    # start day
travel_data['ey'] = travel_data['End date'].dt.year     # end year
travel_data['em'] = travel_data['End date'].dt.month    # end month
travel_data['ed'] = travel_data['End date'].dt.day      # end day
# Drop the original 'Start date' and 'End date' columns
travel_data.drop(['Start date', 'End date'], axis=1, inplace=True)
travel_data
Trip ID | Destination | Duration (days) | Traveler name | Traveler age | Traveler gender | Traveler nationality | Accommodation type | Accommodation cost | Transportation type | Transportation cost | sy | sm | sd | ey | em | ed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | London, UK | 7 | John Smith | 35 | Male | American | Hotel | 1200 | Flight | 600.0 | 2023 | 5 | 1 | 2023 | 5 | 8 |
1 | 2 | Phuket, Thailand | 5 | Jane Doe | 28 | Female | Canadian | Resort | 800 | Flight | 500.0 | 2023 | 6 | 15 | 2023 | 6 | 20 |
2 | 3 | Bali, Indonesia | 7 | David Lee | 45 | Male | Korean | Villa | 1000 | Flight | 700.0 | 2023 | 7 | 1 | 2023 | 7 | 8 |
3 | 4 | New York, USA | 14 | Sarah Johnson | 29 | Female | British | Hotel | 2000 | Flight | 1000.0 | 2023 | 8 | 15 | 2023 | 8 | 29 |
4 | 5 | Tokyo, Japan | 7 | Kim Nguyen | 26 | Female | Vietnamese | Airbnb | 700 | Train | 200.0 | 2023 | 9 | 10 | 2023 | 9 | 17 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
132 | 135 | Rio de Janeiro, Brazil | 9 | Jose Perez | 37 | Male | Brazilian | Hostel | 2500 | Car | 2000.0 | 2023 | 8 | 1 | 2023 | 8 | 10 |
133 | 136 | Vancouver, Canada | 6 | Emma Wilson | 29 | Female | Canadian | Hotel | 5000 | Airplane | 3000.0 | 2023 | 8 | 15 | 2023 | 8 | 21 |
134 | 137 | Bangkok, Thailand | 7 | Ryan Chen | 34 | Male | Chinese | Hostel | 2000 | Train | 1000.0 | 2023 | 9 | 1 | 2023 | 9 | 8 |
135 | 138 | Barcelona, Spain | 7 | Sofia Rodriguez | 25 | Female | Spanish | Airbnb | 6000 | Airplane | 2500.0 | 2023 | 9 | 15 | 2023 | 9 | 22 |
136 | 139 | Auckland, New Zealand | 7 | William Brown | 39 | Male | New Zealander | Hotel | 7000 | Train | 2500.0 | 2023 | 10 | 1 | 2023 | 10 | 8 |
137 rows × 17 columns
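Since Duration (days) sits alongside the two dates, a quick consistency check that (End date − Start date).days matches the recorded duration can catch data-entry errors before clustering. A sketch on two rows mirroring the head of the table above:

```python
import pandas as pd

# Two illustrative rows copied from the head of the dataset
df = pd.DataFrame({
    'Start date': ['5/1/2023', '6/15/2023'],
    'End date': ['5/8/2023', '6/20/2023'],
    'Duration (days)': [7, 5],
})
df['Start date'] = pd.to_datetime(df['Start date'])
df['End date'] = pd.to_datetime(df['End date'])

# The day difference should agree with the recorded duration
computed = (df['End date'] - df['Start date']).dt.days
print(bool((computed == df['Duration (days)']).all()))  # → True
```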
(3) Encoding non-numeric features and handling missing values
from sklearn.preprocessing import LabelEncoder  # label encoding: converts categorical data to integers
# Work on the loaded dataset (note: this is a reference to travel_data, not a copy)
data = travel_data
# Names of the non-numeric (categorical) columns
non_numeric_columns = ['Destination', 'Traveler gender', 'Traveler nationality',
                       'Accommodation type', 'Transportation type']
# Encode the non-numeric columns as integers
label_encoders = {}  # stores the LabelEncoder fitted for each feature
for column in non_numeric_columns:
    # Only encode columns whose dtype is object (strings)
    if data[column].dtype == 'object':
        # Create a LabelEncoder for this column and keep it in the
        # label_encoders dict so the encoding can be inverted later
        label_encoders[column] = LabelEncoder()
        # fit_transform() fits the encoder and converts the string values
        # to integer codes, which are written back into the column
        data[column] = label_encoders[column].fit_transform(data[column])
print(label_encoders)
# Reorder the columns: move the traveler name to the front for later processing
cols = list(data.columns)
cols.remove('Traveler name')     # remove 'Traveler name' from the column list
cols.insert(1, 'Traveler name')  # insert 'Traveler name' as the second column
# Reassign the reordered columns to the dataset
data = data[cols]
# Fill missing values in the numeric columns with the column means
numeric_columns = data.select_dtypes(include=['number']).columns
data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())
data
{'Destination': LabelEncoder(), 'Traveler gender': LabelEncoder(), 'Traveler nationality': LabelEncoder(), 'Accommodation type': LabelEncoder(), 'Transportation type': LabelEncoder()}
Trip ID | Traveler name | Destination | Duration (days) | Traveler age | Traveler gender | Traveler nationality | Accommodation type | Accommodation cost | Transportation type | Transportation cost | sy | sm | sd | ey | em | ed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | John Smith | 30 | 7 | 35 | 1 | 0 | 3 | 1200 | 5 | 600.0 | 2023 | 5 | 1 | 2023 | 5 | 8 |
1 | 2 | Jane Doe | 42 | 5 | 28 | 0 | 7 | 4 | 800 | 5 | 500.0 | 2023 | 6 | 15 | 2023 | 6 | 20 |
2 | 3 | David Lee | 6 | 7 | 45 | 1 | 23 | 7 | 1000 | 5 | 700.0 | 2023 | 7 | 1 | 2023 | 7 | 8 |
3 | 4 | Sarah Johnson | 36 | 14 | 29 | 0 | 4 | 3 | 2000 | 5 | 1000.0 | 2023 | 8 | 15 | 2023 | 8 | 29 |
4 | 5 | Kim Nguyen | 57 | 7 | 26 | 0 | 40 | 0 | 700 | 8 | 200.0 | 2023 | 9 | 10 | 2023 | 9 | 17 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
132 | 135 | Jose Perez | 44 | 9 | 37 | 1 | 3 | 2 | 2500 | 2 | 2000.0 | 2023 | 8 | 1 | 2023 | 8 | 10 |
133 | 136 | Emma Wilson | 58 | 6 | 29 | 0 | 7 | 3 | 5000 | 0 | 3000.0 | 2023 | 8 | 15 | 2023 | 8 | 21 |
134 | 137 | Ryan Chen | 9 | 7 | 34 | 1 | 9 | 2 | 2000 | 8 | 1000.0 | 2023 | 9 | 1 | 2023 | 9 | 8 |
135 | 138 | Sofia Rodriguez | 11 | 7 | 25 | 0 | 33 | 0 | 6000 | 0 | 2500.0 | 2023 | 9 | 15 | 2023 | 9 | 22 |
136 | 139 | William Brown | 3 | 7 | 39 | 1 | 26 | 3 | 7000 | 8 | 2500.0 | 2023 | 10 | 1 | 2023 | 10 | 8 |
137 rows × 17 columns
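Because each fitted LabelEncoder is kept in label_encoders, the integer codes can be mapped back to the original category names at any time with inverse_transform. A self-contained sketch (the category values are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform assigns codes in sorted (alphabetical) order of the classes
codes = le.fit_transform(['Hotel', 'Resort', 'Hotel', 'Airbnb'])
print(codes.tolist())                        # → [1, 2, 1, 0]
print(le.inverse_transform(codes).tolist())  # → ['Hotel', 'Resort', 'Hotel', 'Airbnb']
```

One caveat: label encoding imposes an arbitrary ordering (Airbnb < Hotel < Resort here) that Euclidean distance will treat as meaningful; one-hot encoding (e.g. pd.get_dummies) is a common alternative when that ordering is undesirable.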
(4) Standardizing the data
from sklearn.preprocessing import StandardScaler
# Take everything from the third column onward and convert to an array
array_data = data.iloc[:, 2:].values  # [:, 2:] — all rows; columns from the 3rd to the end
# array_data
transfer = StandardScaler()  # instantiate the scaler
scaler_data = transfer.fit_transform(array_data)
# Inspect the standardized data: number of samples (rows) x number of features (columns)
print(scaler_data, '\n', scaler_data.shape)
[[-0.06124973 -0.37973645 0.25631927 ... 0.1651258 -0.52158185
-1.30645256]
[ 0.61004731 -1.63332426 -0.72692145 ... 0.1651258 -0.19379876
0.35465186]
[-1.4038438 -0.37973645 1.66094888 ... 0.1651258 0.13398433
-1.30645256]
...
[-1.23601954 -0.37973645 0.11585631 ... 0.1651258 0.78955051
-1.30645256]
[-1.1241367 -0.37973645 -1.14831034 ... 0.1651258 0.78955051
0.63150259]
[-1.57166806 -0.37973645 0.81817112 ... 0.1651258 1.1173336
-1.30645256]]
(137, 15)
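StandardScaler should leave every column with mean ≈ 0 and (population) standard deviation ≈ 1; verifying this on a small array is a cheap sanity test of the step above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
scaled = StandardScaler().fit_transform(X)

print(np.allclose(scaled.mean(axis=0), 0.0))  # → True
print(np.allclose(scaled.std(axis=0), 1.0))   # → True (population std, ddof=0)
```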
(5) Extracting the sample names (traveler names) — the sample labels at iteration 0 of the clustering
# Get the name of each sample (the traveler)
column_names_array = data.iloc[:, 1:2].values
column_names_array
array([['John Smith'],
['Jane Doe'],
['David Lee'],
['Sarah Johnson'],
['Kim Nguyen'],
['Michael Brown'],
['Emily Davis'],
['Lucas Santos'],
['Laura Janssen'],
['Mohammed Ali'],
['Ana Hernandez'],
['Carlos Garcia'],
['Lily Wong'],
['Hans Mueller'],
['Fatima Khouri'],
['James MacKenzie'],
['Sarah Johnson'],
['Michael Chang'],
['Olivia Rodriguez'],
['Kenji Nakamura'],
['Emily Lee'],
['James Wilson'],
['Sofia Russo'],
['Raj Patel'],
['Lily Nguyen'],
['David Kim'],
['Maria Garcia'],
['Alice Smith'],
['Bob Johnson'],
['Charlie Lee'],
['Emma Davis'],
['Olivia Martin'],
['Harry Wilson'],
['Sophia Lee'],
['James Brown'],
['Mia Johnson'],
['William Davis'],
['Amelia Brown'],
['Mia Johnson'],
['Adam Lee'],
['Sarah Wong'],
['John Smith'],
['Maria Silva'],
['Peter Brown'],
['Emma Garcia'],
['Michael Davis'],
['Nina Patel'],
['Kevin Kim'],
['Laura van den Berg'],
['Jennifer Nguyen'],
['David Kim'],
['Rachel Lee'],
['Jessica Wong'],
['Felipe Almeida'],
['Nisa Patel'],
['Ben Smith'],
['Laura Gomez'],
['Park Min Woo'],
['Michael Chen'],
['Sofia Rossi'],
['Rachel Sanders'],
['Kenji Nakamura'],
['Emily Watson'],
['David Lee'],
['Ana Rodriguez'],
['Tom Wilson'],
['Olivia Green'],
['James Chen'],
['Lila Patel'],
['Marco Rossi'],
['Sarah Brown'],
['Sarah Lee'],
['Alex Kim'],
['Maria Hernandez'],
['John Smith'],
['Mark Johnson'],
['Amanda Chen'],
['David Lee'],
['Nana Kwon'],
['Tom Hanks'],
['Emma Watson'],
['James Kim'],
['John Smith'],
['Sarah Lee'],
['Maria Garcia'],
['David Lee'],
['Emily Davis'],
['James Wilson'],
['Fatima Ahmed'],
['Liam Nguyen'],
['Giulia Rossi'],
['Putra Wijaya'],
['Kim Min-ji'],
['John Smith'],
['Emily Johnson'],
['David Lee'],
['Sarah Brown'],
['Michael Wong'],
['Jessica Chen'],
['Ken Tanaka'],
['Maria Garcia'],
['Rodrigo Oliveira'],
['Olivia Kim'],
['Robert Mueller'],
['John Smith'],
['Sarah Lee'],
['Michael Wong'],
['Lisa Chen'],
['David Kim'],
['Emily Wong'],
['Mark Tan'],
['Emma Lee'],
['George Chen'],
['Sophia Kim'],
['Alex Ng'],
['Alice Smith'],
['Bob Johnson'],
['Cindy Chen'],
['David Lee'],
['Emily Kim'],
['Frank Li'],
['Gina Lee'],
['Henry Kim'],
['Isabella Chen'],
['Jack Smith'],
['Katie Johnson'],
['John Doe'],
['Jane Smith'],
['Michael Johnson'],
['Sarah Lee'],
['David Kim'],
['Emily Davis'],
['Jose Perez'],
['Emma Wilson'],
['Ryan Chen'],
['Sofia Rodriguez'],
['William Brown']], dtype=object)
# Check that the number of sample names matches the number of data rows
column_names_array.shape, len(scaler_data)
((137, 1), 137)
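Note that column_names_array has shape (137, 1), so each element is itself a one-element array — which is why the cluster listings later print names as ['John Smith']. Flattening with ravel() would yield plain strings instead; a small sketch:

```python
import numpy as np

names_2d = np.array([['John Smith'], ['Jane Doe']], dtype=object)  # shape (2, 1)
names_flat = names_2d.ravel()                                      # shape (2,)

print(str(names_2d[0]))    # → ['John Smith']
print(str(names_flat[0]))  # → John Smith
```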
2. Plot dendrograms with different similarity (linkage) criteria and choose the better one from the dendrograms
– Using the first 51 samples of the dataset –
– Distances are computed as Euclidean distances –
(1) Importing the hierarchical clustering tools and the plotting utilities
from scipy.cluster import hierarchy as sch  # hierarchical clustering, performed by row (sample)
import matplotlib.pyplot as plt
(2) Single linkage (minimum distance) as the similarity criterion
# Build dendrogram 1
Z1 = sch.linkage(scaler_data[:51], metric='euclidean', method='single')
# Plot the dendrogram with the traveler names as leaf labels
plt.figure(figsize=(10, 5))
dendrogram = sch.dendrogram(Z1, labels=column_names_array[:51].ravel())
# Rotate the leaf labels so they remain readable
plt.xticks(rotation=90, fontsize=10)  # rotate by 90 degrees
plt.ylabel('Euclidean distance + single linkage')
plt.title('Hierarchical clustering dendrogram 1')
plt.show()
(3) Complete linkage (maximum distance) as the similarity criterion
# Build dendrogram 2
Z2 = sch.linkage(scaler_data[:51], metric='euclidean', method='complete')
plt.figure(figsize=(10, 5))
dendrogram = sch.dendrogram(Z2, labels=column_names_array[:51].ravel())
plt.xticks(rotation=90, fontsize=10)
plt.ylabel('Euclidean distance + complete linkage')
plt.title('Hierarchical clustering dendrogram 2')
plt.show()
*Note: the statement Z2 = sch.linkage(scaler_data[:51], metric='euclidean', method='complete')
returns the linkage matrix of the hierarchical clustering tree — a two-dimensional array holding information about the clustering. Each row of the linkage matrix records one merge of two clusters and contains:
1. The first two columns: the indices (labels) of the two clusters being merged.
2. The third column: the distance (dissimilarity) between those two clusters.
3. The fourth column: the number of data points in the newly merged cluster.
Note: by point 2, the third column of the linkage matrix gives the tree height reached at each iteration of the model.
The linkage matrix is produced by the hierarchical clustering algorithm as it repeatedly merges data points into larger and larger clusters; it records the details of every merge performed along the way.
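The structure described above can be verified directly on a small random dataset (the names X and Z below are illustrative):

```python
import numpy as np
from scipy.cluster import hierarchy as sch

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))  # 6 samples, 3 features
Z = sch.linkage(X, metric='euclidean', method='complete')

# One row per merge: [cluster i, cluster j, merge distance, new cluster size]
print(Z.shape)                              # → (5, 4): n samples give n - 1 merges
# Column 2 holds the tree height of each iteration; complete linkage is non-decreasing
print(bool(np.all(np.diff(Z[:, 2]) >= 0)))  # → True
# The final merge contains all 6 samples
print(int(Z[-1, 3]))                        # → 6
```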
(4) Average linkage as the similarity criterion
# Build dendrogram 3
Z3 = sch.linkage(scaler_data[:51], metric='euclidean', method='average')
plt.figure(figsize=(10, 5))
dendrogram = sch.dendrogram(Z3, labels=column_names_array[:51].ravel())
plt.xticks(rotation=90, fontsize=10)
plt.ylabel('Euclidean distance + average linkage')
plt.title('Hierarchical clustering dendrogram 3')
plt.show()
Judging from the three dendrograms above, complete linkage and average linkage separate the merge distances more clearly at each iteration, so the levels of the hierarchy are well distinguished; single linkage separates the levels noticeably less well. For the cluster computations below we therefore use the complete-linkage result.
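A more quantitative way to compare linkage criteria than eyeballing dendrograms is the cophenetic correlation coefficient (scipy's sch.cophenet), which measures how faithfully the tree's merge distances preserve the original pairwise distances. A sketch on synthetic random data (not the travel dataset):

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X = rng.standard_normal((30, 5))
d = pdist(X, metric='euclidean')  # condensed pairwise distance matrix

for method in ('single', 'complete', 'average'):
    Z = sch.linkage(d, method=method)
    c, _ = sch.cophenet(Z, d)     # correlation in [-1, 1]; higher is more faithful
    print(method, round(float(c), 3))
```

Average linkage often scores highest on this measure, but the score should still be weighed against what the dendrograms show for the task at hand.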
3. Computing the clustering results
(1) Clusters under a given threshold: how many clusters there are and which samples belong to each
# Clustering result for a given threshold
yuzhi = float(input('Enter a threshold: '))
# yuzhi = 7.4  # the cut height: determines how many clusters the tree is cut into
label = []  # stores the cluster label of each data point
# Outer loop: cut the tree at height yuzhi; cut_tree returns each point's cluster label, e.g. [0], [1], ...
for i in sch.cut_tree(Z2, height=yuzhi):
    # Inner loop: unwrap the one-element label array of each data point
    for j in i:
        label.append(j)  # store each sample's cluster label in the list
labelCount = set(label)  # the set of labels: how many distinct clusters there are
print('Threshold = ' + str(yuzhi) + ' \nNumber of clusters = ' + str(len(list(labelCount))))
# Build the list of traveler names
guest_names = [str(name) for name in column_names_array[:51]]
print('-' * 15)
print('Cluster \t sample index && traveler name')
unique_labels = list(set(label))
if unique_labels:  # check that label is not empty
    for i in unique_labels:  # iterate over the distinct labels
        print(i, ' : ', end=' ')
        for j in range(len(label)):  # iterate over the label list
            if i == label[j]:  # does this sample carry the current label?
                print(j, guest_names[j], end='\t\t')  # if so, print its index and name
        print()  # newline before the next cluster
else:
    print('No clustering result')  # label is empty, so there is nothing to report
Threshold = 6.5
Number of clusters = 5
---------------
Cluster 	 sample index && traveler name
0 : 0 ['John Smith'] 2 ['David Lee'] 5 ['Michael Brown'] 13 ['Hans Mueller'] 19 ['Kenji Nakamura'] 21 ['James Wilson'] 25 ['David Kim'] 41 ['John Smith']
1 : 1 ['Jane Doe'] 11 ['Carlos Garcia'] 14 ['Fatima Khouri'] 15 ['James MacKenzie'] 17 ['Michael Chang'] 18 ['Olivia Rodriguez'] 23 ['Raj Patel'] 24 ['Lily Nguyen'] 26 ['Maria Garcia'] 27 ['Alice Smith'] 28 ['Bob Johnson'] 36 ['William Davis'] 38 ['Mia Johnson'] 44 ['Emma Garcia'] 45 ['Michael Davis'] 48 ['Laura van den Berg'] 49 ['Jennifer Nguyen']
2 : 3 ['Sarah Johnson'] 30 ['Emma Davis']
3 : 4 ['Kim Nguyen'] 6 ['Emily Davis'] 16 ['Sarah Johnson'] 20 ['Emily Lee'] 22 ['Sofia Russo'] 29 ['Charlie Lee'] 31 ['Olivia Martin'] 32 ['Harry Wilson'] 33 ['Sophia Lee'] 37 ['Amelia Brown'] 40 ['Sarah Wong'] 42 ['Maria Silva'] 47 ['Kevin Kim'] 50 ['David Kim']
4 : 7 ['Lucas Santos'] 8 ['Laura Janssen'] 9 ['Mohammed Ali'] 10 ['Ana Hernandez'] 12 ['Lily Wong'] 34 ['James Brown'] 35 ['Mia Johnson'] 39 ['Adam Lee'] 43 ['Peter Brown'] 46 ['Nina Patel']
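cut_tree is one of two scipy routines for extracting flat clusters; sch.fcluster(Z, t, criterion='distance') does the same job and returns a flat 1-D label array directly (with labels starting at 1 rather than 0), avoiding the nested unwrapping loop above. A sketch on random data:

```python
import numpy as np
from scipy.cluster import hierarchy as sch

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))
Z = sch.linkage(X, metric='euclidean', method='complete')

labels_cut = sch.cut_tree(Z, height=3.0).ravel()          # 0-based labels, needs flattening
labels_fc = sch.fcluster(Z, t=3.0, criterion='distance')  # 1-based labels, already flat

print(labels_cut.shape, labels_fc.shape)
print(len(set(labels_cut)), len(set(labels_fc)))  # both partitions have the same cluster count
```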
(2) Clusters for a given number of clusters: which samples belong to each
# Clustering result for a given number of clusters
n = int(input('Enter the number of clusters: '))
# n = 9
label = []
for i in sch.cut_tree(Z2, n_clusters=n):
    for j in i:
        label.append(j)
print('Number of clusters = ' + str(n))
# Build the list of traveler names
guest_names = [str(name) for name in column_names_array[:51]]
print('-' * 15)
print('Cluster \t sample index && traveler name')
unique_labels = list(set(label))
if unique_labels:  # check that label is not empty
    for i in unique_labels:  # iterate over the distinct labels
        print(i, ' : ', end=' ')
        # use zip() to walk the label indices and guest names together
        for idx, name in zip(range(len(label)), guest_names):
            if i == label[idx]:  # does this sample carry the current label?
                print(idx, name, end='\t\t')  # if so, print its index and name
        print()  # newline before the next cluster
else:
    print('No clustering result')  # label is empty, so there is nothing to report
Number of clusters = 5
---------------
Cluster 	 sample index && traveler name
0 : 0 ['John Smith'] 2 ['David Lee'] 5 ['Michael Brown'] 13 ['Hans Mueller'] 19 ['Kenji Nakamura'] 21 ['James Wilson'] 25 ['David Kim'] 41 ['John Smith']
1 : 1 ['Jane Doe'] 11 ['Carlos Garcia'] 14 ['Fatima Khouri'] 15 ['James MacKenzie'] 17 ['Michael Chang'] 18 ['Olivia Rodriguez'] 23 ['Raj Patel'] 24 ['Lily Nguyen'] 26 ['Maria Garcia'] 27 ['Alice Smith'] 28 ['Bob Johnson'] 36 ['William Davis'] 38 ['Mia Johnson'] 44 ['Emma Garcia'] 45 ['Michael Davis'] 48 ['Laura van den Berg'] 49 ['Jennifer Nguyen']
2 : 3 ['Sarah Johnson'] 30 ['Emma Davis']
3 : 4 ['Kim Nguyen'] 6 ['Emily Davis'] 16 ['Sarah Johnson'] 20 ['Emily Lee'] 22 ['Sofia Russo'] 29 ['Charlie Lee'] 31 ['Olivia Martin'] 32 ['Harry Wilson'] 33 ['Sophia Lee'] 37 ['Amelia Brown'] 40 ['Sarah Wong'] 42 ['Maria Silva'] 47 ['Kevin Kim'] 50 ['David Kim']
4 : 7 ['Lucas Santos'] 8 ['Laura Janssen'] 9 ['Mohammed Ali'] 10 ['Ana Hernandez'] 12 ['Lily Wong'] 34 ['James Brown'] 35 ['Mia Johnson'] 39 ['Adam Lee'] 43 ['Peter Brown'] 46 ['Nina Patel']
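The nested printing loops above scan the label list once per cluster; collecting samples into a dict keyed by label does the grouping in a single pass and leaves a reusable structure. A sketch with hypothetical labels and names (the values below are illustrative stand-ins for the cut_tree output and guest_names):

```python
from collections import defaultdict

# Hypothetical stand-ins for the cut_tree labels and traveler names
label = [0, 1, 0, 2, 1]
guest_names = ['John Smith', 'Jane Doe', 'David Lee', 'Sarah Johnson', 'Kim Nguyen']

clusters = defaultdict(list)
for idx, lab in enumerate(label):
    clusters[lab].append((idx, guest_names[idx]))  # one pass over the samples

for lab in sorted(clusters):
    print(lab, ':', clusters[lab])
```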
(3) The threshold range that yields a given number of clusters
# Collect the merge height of every iteration of the hierarchical clustering model
# Store the heights in TreehighList, with element 0 set to 0 to represent the initial state (no merges yet)
# The iteration heights are appended starting at index 1
# The resulting list length equals the number of samples
TreehighList = []
TreehighList.append(0)
for i in range(len(Z2)):
    TreehighList.append(Z2[i, 2])
TreehighList, len(TreehighList)
([0,
1.3207220250706178,
1.713192288277125,
1.948618219530062,
2.1938156142412155,
2.344051009400656,
2.385236125346607,
2.5227496951700554,
2.5847786704830993,
2.6609018439884236,
2.664531638770863,
2.7194643386552477,
2.7231357411789694,
2.838154747820297,
2.959983314668614,
3.014549824867137,
3.0870393522464195,
3.21498864327589,
3.228770291629527,
3.242982467896726,
3.293801779310498,
3.346960826020779,
3.396800112724491,
3.4031309441625845,
3.4379960453193914,
3.7387064608965064,
3.7774664570067644,
3.7988233271953695,
3.843327398012852,
3.9861651879658195,
4.2197659666669365,
4.291448853880782,
4.463605246207876,
4.6155892525418025,
4.782801321694507,
4.7839476461738535,
4.793592690309439,
4.851137858942306,
4.945266643885942,
4.994082102721412,
5.214885156903247,
5.29249457737719,
5.356639595890159,
5.835802656747644,
5.967735673534881,
6.1319341491966295,
6.340049875642774,
6.650802835756102,
6.85452516718054,
7.432114871942092,
8.118708906268715],
51)
# Height of the topmost merge of the model
top = TreehighList[-1]  # the last element of the list
top
8.118708906268715
# Number of merges: number of samples = number of merges + 1 = rows of the linkage matrix + 1
linkCounts = len(Z2)
linkCounts
50
# Function that solves for the threshold range
def ThresholdRange(n):
    # More clusters requested than there are samples: invalid input
    if n > len(TreehighList):
        return "No threshold yields {} clusters!".format(n)
    # n == 1: any threshold at or above the topmost merge height works
    elif n == 1:
        y = top
        return "Threshold range for {} cluster(s): [{} , INF).".format(n, y)
    else:
        # index of the lower bound in the height list
        low = (len(TreehighList)) - n
        # index of the upper bound
        high = low + 1
        # half-open interval: closed on the left, open on the right
        return "Threshold range for {} clusters: [{} , {}).".format(n, TreehighList[low], TreehighList[high])
n = int(input("Enter the number of clusters: "))
# n=46
ThresholdRange(n)
'Threshold range for 9 clusters: [5.356639595890159 , 5.835802656747644).'
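The returned interval can be cross-checked by cutting the tree at a point inside it and counting the resulting clusters. A self-contained sketch on random data, mirroring the TreehighList construction above:

```python
import numpy as np
from scipy.cluster import hierarchy as sch

rng = np.random.default_rng(7)
X = rng.standard_normal((20, 3))
Z = sch.linkage(X, metric='euclidean', method='complete')

# heights[0] = 0, then the merge height of each iteration (as in TreehighList)
heights = np.concatenate(([0.0], Z[:, 2]))

n = 5                                 # desired number of clusters
low = heights[len(heights) - n]       # lower bound of the threshold range
high = heights[len(heights) - n + 1]  # upper bound (exclusive)
mid = (low + high) / 2.0              # any threshold strictly inside the range

k = len(set(sch.cut_tree(Z, height=mid).ravel()))
print(k)  # → 5
```

Cutting anywhere in [low, high) leaves exactly the last n − 1 merges undone, so the cut produces n clusters.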