DBSCAN Clustering Algorithm
—— Basic Principles of the Density Clustering Algorithm and a Python Implementation
Contents
One. Basic principles of the density clustering algorithm (DBSCAN)
Handwritten notes (images in the original post) follow:
(For the concepts: the faintest ink beats the best memory!
For the thinking: algorithm principles are best learned through derivation!)
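(Since the notes themselves are images, here is a brief recap of the definitions they cover and that Part 2 of the example relies on: a point is a core object if its eps-neighborhood contains at least MinPts points, itself included; a point q is directly density-reachable from a core object p if q lies in p's eps-neighborhood; density reachability extends this through chains of core objects; and two points are density-connected if both are density-reachable from a common core object. A cluster is a maximal set of density-connected points, and points that belong to no cluster are noise, i.e. outliers.)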
Two. A worked example: density clustering of a hostel information dataset in Python
Dataset: HostelDataset.xls
Source: https://www.kaggle.com/datasets/lexipetzold/hostels-in-guatemala-hostelworld-data
Note: the dataset has been modified for this exercise
The modified HostelDataset (download link in the original post)
Dataset description
Field | Description |
---|---|
Hostel Name | hostel name |
Hostel_ID | hostel ID |
DESTINATION | destination |
Starting_Price | starting price |
Free_Breakfast | free breakfast offered |
Wifi | wireless network available |
Rating (out of 10) | rating on a 10-point scale |
Total Ratings | total number of ratings |
Traffic situation | traffic situation |
Part 1. Reading the dataset and data preprocessing
'''1. Read the dataset'''
import pandas as pd
# Path to the data file
file_path='D:/MachineLearningDesign/TotalDataset/Guatemalas_Travel/HostelDataset.xls'
# Read the dataset from the file with pandas
origin_data = pd.read_excel(file_path)
# Display the dataset for inspection
origin_data
  | Hostel Name | Hostel_ID | DESTINATION | Starting_Price | Free_Breakfast | Wifi | Rating (out of 10) | Total Ratings | Traffic situation |
---|---|---|---|---|---|---|---|---|---|
0 | Maya Papaya | 1 | Antigua | 15.00 | YES | YES | 9.5 | 897 | A |
1 | Tropicana Hostel | 2 | Antigua | 11.94 | NO | NO | 9.0 | 1411 | B |
2 | Somos | 3 | Antigua | 9.94 | NO | YES | 8.5 | 283 | A |
3 | Selina Antigua | 4 | Antigua | 12.00 | NO | NO | 9.7 | 371 | A |
4 | Ojala | 5 | Antigua | 15.00 | YES | YES | 7.2 | 172 | B |
5 | Amura | 6 | Antigua | 8.00 | NO | YES | 7.2 | 39 | C |
6 | Hostal Antigua | 7 | Antigua | 9.11 | NO | NO | 8.1 | 325 | A |
7 | Hostel Los Tecalotes | 8 | Antigua | 8.61 | NO | YES | 7.0 | 2 | C |
8 | Hostel Antigueno | 9 | Antigua | 9.35 | NO | NO | 8.7 | 162 | A |
9 | Adra Hostel | 10 | Antigua | 16.97 | YES | YES | 9.0 | 95 | C |
10 | Barbara's Boutique Hostel | 11 | Antigua | 10.34 | NO | YES | 9.6 | 461 | A |
11 | The Purpose Hostel | 12 | Antigua | 12.58 | No | NO | 9.6 | 536 | A |
12 | La Iguana Perdida | 13 | Lake Atitlan | 6.85 | NO | YES | 9.1 | 909 | A |
13 | Free Cerveza | 14 | Lake Atitlan | 5.84 | NO | YES | 9.0 | 1293 | A |
14 | Selina Atitlan | 15 | Lake Atitlan | 11.00 | NO | YES | 8.9 | 193 | B |
15 | Hostel San Marcos | 16 | Lake Atitlan | 8.00 | YES | YES | 6.5 | 60 | B |
16 | Eco-Hostel Mayachik | 17 | Lake Atitlan | 8.00 | NO | NO | 9.4 | 45 | B |
17 | Maca Hostel and Micro-Resort | 18 | Lake Atitlan | 27.50 | NO | YES | 9.6 | 2 | B |
18 | Mandala’s Hostal | 19 | Lake Atitlan | 9.94 | NO | YES | 9.5 | 94 | A |
19 | Mr. Mullet's | 20 | Lake Atitlan | 11.26 | NO | NO | 8.8 | 864 | A |
20 | Hotel Amigos | 21 | Lake Atitlan | 9.25 | NO | YES | 7.7 | 21 | B |
21 | Tequila Sunrise | 22 | Guatemala City | 5.50 | YES | NO | 8.4 | 231 | A |
22 | Hostal Guatefriends | 23 | Guatemala City | 18.00 | YES | YES | 9.6 | 252 | B |
23 | Nostalgic Hostel | 24 | Guatemala City | 8.61 | YES | NO | 7.8 | 39 | A |
24 | Kaena Point Hostel | 25 | Guatemala City | 6.18 | NO | YES | 7.4 | 54 | C |
25 | Hostal Los Lagos | 26 | Guatemala City | 15.00 | YES | YES | 9.2 | 189 | C |
26 | Euro Hostel | 27 | Guatemala City | 17.50 | YES | YES | 10.0 | 183 | A |
27 | Life Builders | 28 | Guatemala City | 9.94 | YES | NO | 7.6 | 0 | C |
28 | Driftwood Surfer beach hostel El Paredon Guate... | 29 | El Paredon | 9.97 | NO | YES | 8.8 | 435 | C |
29 | Mellow Hostel El Paredon | 30 | El Paredon | 13.41 | NO | YES | 9.3 | 48 | B |
30 | Cocorí Lodge El Paredon | 31 | El Paredon | 11.50 | NO | NO | 8.5 | 100 | A |
'''2. Encode the non-numeric features and fill in missing values'''
from sklearn.preprocessing import LabelEncoder  # label encoding: converts categorical values to integer codes
# Work on a copy so the original dataframe is left untouched
Numlize_data = origin_data.copy()
# Names of the non-numeric columns (features)
non_numeric_columns = ['DESTINATION', 'Free_Breakfast','Wifi','Traffic situation']
# Encode the non-numeric columns
label_encoders = {}  # stores the LabelEncoder fitted for each feature
# For each non-numeric column in the dataset:
for column in non_numeric_columns:
    # Only encode columns whose dtype is object (strings)
    if Numlize_data[column].dtype == 'object':
        # Create a LabelEncoder for this column and keep it in the
        # label_encoders dict for later reuse (e.g. inverse_transform)
        label_encoders[column] = LabelEncoder()
        # fit_transform() fits the encoder and converts the column's string
        # values to integer codes, written back into the dataframe
        Numlize_data[column] = label_encoders[column].fit_transform(Numlize_data[column])
# Fill missing values in the numeric columns with the column means
numeric_columns = Numlize_data.select_dtypes(include=['number']).columns
Numlize_data[numeric_columns] = Numlize_data[numeric_columns].fillna(Numlize_data[numeric_columns].mean())
# Display the encoded result
Numlize_data
  | Hostel Name | Hostel_ID | DESTINATION | Starting_Price | Free_Breakfast | Wifi | Rating (out of 10) | Total Ratings | Traffic situation |
---|---|---|---|---|---|---|---|---|---|
0 | Maya Papaya | 1 | 0 | 15.00 | 2 | 1 | 9.5 | 897 | 0 |
1 | Tropicana Hostel | 2 | 0 | 11.94 | 0 | 0 | 9.0 | 1411 | 1 |
2 | Somos | 3 | 0 | 9.94 | 0 | 1 | 8.5 | 283 | 0 |
3 | Selina Antigua | 4 | 0 | 12.00 | 0 | 0 | 9.7 | 371 | 0 |
4 | Ojala | 5 | 0 | 15.00 | 2 | 1 | 7.2 | 172 | 1 |
5 | Amura | 6 | 0 | 8.00 | 0 | 1 | 7.2 | 39 | 2 |
6 | Hostal Antigua | 7 | 0 | 9.11 | 0 | 0 | 8.1 | 325 | 0 |
7 | Hostel Los Tecalotes | 8 | 0 | 8.61 | 0 | 1 | 7.0 | 2 | 2 |
8 | Hostel Antigueno | 9 | 0 | 9.35 | 0 | 0 | 8.7 | 162 | 0 |
9 | Adra Hostel | 10 | 0 | 16.97 | 2 | 1 | 9.0 | 95 | 2 |
10 | Barbara's Boutique Hostel | 11 | 0 | 10.34 | 0 | 1 | 9.6 | 461 | 0 |
11 | The Purpose Hostel | 12 | 0 | 12.58 | 1 | 0 | 9.6 | 536 | 0 |
12 | La Iguana Perdida | 13 | 3 | 6.85 | 0 | 1 | 9.1 | 909 | 0 |
13 | Free Cerveza | 14 | 3 | 5.84 | 0 | 1 | 9.0 | 1293 | 0 |
14 | Selina Atitlan | 15 | 3 | 11.00 | 0 | 1 | 8.9 | 193 | 1 |
15 | Hostel San Marcos | 16 | 3 | 8.00 | 2 | 1 | 6.5 | 60 | 1 |
16 | Eco-Hostel Mayachik | 17 | 3 | 8.00 | 0 | 0 | 9.4 | 45 | 1 |
17 | Maca Hostel and Micro-Resort | 18 | 3 | 27.50 | 0 | 1 | 9.6 | 2 | 1 |
18 | Mandala’s Hostal | 19 | 3 | 9.94 | 0 | 1 | 9.5 | 94 | 0 |
19 | Mr. Mullet's | 20 | 3 | 11.26 | 0 | 0 | 8.8 | 864 | 0 |
20 | Hotel Amigos | 21 | 3 | 9.25 | 0 | 1 | 7.7 | 21 | 1 |
21 | Tequila Sunrise | 22 | 2 | 5.50 | 2 | 0 | 8.4 | 231 | 0 |
22 | Hostal Guatefriends | 23 | 2 | 18.00 | 2 | 1 | 9.6 | 252 | 1 |
23 | Nostalgic Hostel | 24 | 2 | 8.61 | 2 | 0 | 7.8 | 39 | 0 |
24 | Kaena Point Hostel | 25 | 2 | 6.18 | 0 | 1 | 7.4 | 54 | 2 |
25 | Hostal Los Lagos | 26 | 2 | 15.00 | 2 | 1 | 9.2 | 189 | 2 |
26 | Euro Hostel | 27 | 2 | 17.50 | 2 | 1 | 10.0 | 183 | 0 |
27 | Life Builders | 28 | 2 | 9.94 | 2 | 0 | 7.6 | 0 | 2 |
28 | Driftwood Surfer beach hostel El Paredon Guate... | 29 | 1 | 9.97 | 0 | 1 | 8.8 | 435 | 2 |
29 | Mellow Hostel El Paredon | 30 | 1 | 13.41 | 0 | 1 | 9.3 | 48 | 1 |
30 | Cocorí Lodge El Paredon | 31 | 1 | 11.50 | 0 | 0 | 8.5 | 100 | 0 |
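Note: after encoding, Free_Breakfast takes three codes (0, 1, 2) rather than the expected two. The source column mixes the spellings 'NO' and 'No' (see row 11, The Purpose Hostel), and LabelEncoder treats them as distinct categories: 'NO' encodes to 0, 'No' to 1, and 'YES' to 2. Normalizing the case beforehand would merge the two; the analysis below proceeds with the data as-is.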
'''3. Standardization'''
from sklearn.preprocessing import StandardScaler
# Take the feature values to be standardized, starting from the third column
scale_data=Numlize_data.iloc[:,2:].values
# Initialize a standardization transformer
transfer0 = StandardScaler()
scale_data=transfer0.fit_transform(scale_data)
print(scale_data,scale_data.shape)
[[-1.1226828 0.82538383 1.42313078 0.74161985 0.92288087 1.56308313
-0.92519568]
[-1.1226828 0.13194738 -0.72892064 -1.34839972 0.3789777 2.94361451
0.32180719]
[-1.1226828 -0.32127907 -0.72892064 0.74161985 -0.16492548 -0.08603412
-0.92519568]
[-1.1226828 0.14554417 -0.72892064 -1.34839972 1.14044214 0.15032145
-0.92519568]
[-1.1226828 0.82538383 1.42313078 0.74161985 -1.57907374 -0.38416444
0.32180719]
[-1.1226828 -0.76090872 -0.72892064 0.74161985 -1.57907374 -0.74138365
1.56881007]
[-1.1226828 -0.50936804 -0.72892064 -1.34839972 -0.60004802 0.02677195
-0.92519568]
[-1.1226828 -0.62267465 -0.72892064 0.74161985 -1.79663501 -0.84076042
1.56881007]
[-1.1226828 -0.45498087 -0.72892064 -1.34839972 0.05263579 -0.41102302
-0.92519568]
[-1.1226828 1.27181188 1.42313078 0.74161985 0.3789777 -0.59097556
1.56881007]
[-1.1226828 -0.23063378 -0.72892064 0.74161985 1.03166151 0.39204873
-0.92519568]
[-1.1226828 0.27697984 0.34710507 -1.34839972 1.03166151 0.59348814
-0.92519568]
[ 1.25026039 -1.02151392 -0.72892064 0.74161985 0.48775833 1.59531344
-0.92519568]
[ 1.25026039 -1.25039327 -0.72892064 0.74161985 0.3789777 2.62668318
-0.92519568]
[ 1.25026039 -0.08106905 -0.72892064 0.74161985 0.27019706 -0.3277614
0.32180719]
[ 1.25026039 -0.76090872 1.42313078 0.74161985 -2.34053819 -0.68498061
0.32180719]
[ 1.25026039 -0.76090872 -0.72892064 -1.34839972 0.81410024 -0.72526849
0.32180719]
[ 1.25026039 3.65804909 -0.72892064 0.74161985 1.03166151 -0.84076042
0.32180719]
[ 1.25026039 -0.32127907 -0.72892064 0.74161985 0.92288087 -0.59366142
-0.92519568]
[ 1.25026039 -0.02214961 -0.72892064 -1.34839972 0.16141643 1.47444979
-0.92519568]
[ 1.25026039 -0.47764219 -0.72892064 0.74161985 -1.03517056 -0.7897291
0.32180719]
[ 0.45927933 -1.32744177 1.42313078 -1.34839972 -0.27370611 -0.22569877
-0.92519568]
[ 0.45927933 1.50522349 1.42313078 0.74161985 1.03166151 -0.16929574
0.32180719]
[ 0.45927933 -0.62267465 1.42313078 -1.34839972 -0.92638993 -0.74138365
-0.92519568]
[ 0.45927933 -1.17334478 -0.72892064 0.74161985 -1.36151247 -0.70109576
1.56881007]
[ 0.45927933 0.82538383 1.42313078 0.74161985 0.59653897 -0.33850484
1.56881007]
[ 0.45927933 1.39191688 1.42313078 0.74161985 1.46678405 -0.35461999
-0.92519568]
[ 0.45927933 -0.32127907 1.42313078 -1.34839972 -1.1439512 -0.84613214
1.56881007]
[-0.33170174 -0.31448067 -0.72892064 0.74161985 0.16141643 0.32221641
1.56881007]
[-0.33170174 0.46506881 -0.72892064 0.74161985 0.7053196 -0.71721092
0.32180719]
[-0.33170174 0.03223756 -0.72892064 -1.34839972 -0.16492548 -0.57754626
-0.92519568]] (31, 7)
Note: the standardized array shows that the features to be clustered span seven dimensions. To make the clustering results easier to visualize, we next reduce them to two dimensions with PCA.
'''4. Dimensionality reduction'''
from sklearn.decomposition import PCA
transfer1 = PCA(n_components = 2)  # reduce to a 2-D space for easier visualization
pca_data = transfer1.fit_transform(scale_data)
pca_data
array([[-1.20875831, 1.62723311],
[-2.17954007, -0.49999884],
[-0.41892282, -0.35878887],
[-1.8327227 , -0.03959969],
[ 1.69512135, 0.41250462],
[ 2.01954599, -1.34181723],
[-0.9805543 , -1.41593127],
[ 2.17906043, -1.35095566],
[-1.06041339, -1.0250797 ],
[ 1.62931985, 1.78510981],
[-1.22269447, 0.31381418],
[-1.74506824, 0.3039498 ],
[-1.49489319, -0.46815072],
[-1.98313424, -0.72149184],
[ 0.36494895, 0.18653765],
[ 2.26083643, -1.01827662],
[-0.41744797, -0.70656468],
[ 0.4389093 , 3.32221946],
[-0.54500345, 0.33179379],
[-1.93418732, -0.59639915],
[ 1.1977873 , -0.76596774],
[-0.44229712, -1.12713865],
[ 0.47138872, 2.33007222],
[ 0.16094183, -0.93674703],
[ 1.96129619, -1.4563731 ],
[ 1.46069336, 1.6395852 ],
[-0.3718697 , 2.45268164],
[ 1.78780896, -0.77514925],
[ 0.7174211 , -0.10764917],
[ 0.30145571, 0.74718192],
[-0.80902817, -0.74060422]])
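As a quick sanity check (not part of the original run), one can ask how much of the total variance the two retained components explain; explained_variance_ratio_ is a standard attribute of a fitted sklearn PCA object:
# Fraction of the total variance carried by each of the two components,
# and their sum (the information retained after reduction)
print(transfer1.explained_variance_ratio_)
print(transfer1.explained_variance_ratio_.sum())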
'''5. Visualize the reduced data points; adjust the axis ranges and index-label positions'''
import matplotlib.pyplot as plt
B = pca_data  # alias the PCA output as B; all later steps operate on B
# Plot the data points row by row
for i in range(B.shape[0]):
    # B.shape[0] is the number of rows (samples); B.shape[1] the number of columns.
    # B[i, 0] and B[i, 1] are the x and y coordinates of the i-th data point
    plt.plot(B[i, 0], B[i, 1], 'bo')  # 'bo': blue circle markers
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize=10,color='g')  # label each point with its index
plt.title('Data Sample Visualization')
plt.show()
# Check the data type of B: numpy.ndarray
type(B)
numpy.ndarray
Part 2. Density clustering of the dataset without calling sklearn library functions
(except when selecting the optimal parameters)
'''1. Select the optimal parameters (eps, MinPts)'''
from sklearn.cluster import DBSCAN
import numpy as np
# Empty list collecting the results of the different parameter combinations
res = []
# Iterate over eps values: from 0.001 up to 1 in steps of 0.05
for eps in np.arange(0.001,1,0.05):
    # Iterate over min_samples values: from 2 to 9
    for min_samples in range(2,10):
        dbscan = DBSCAN(eps = eps, min_samples = min_samples)
        # Fit the model
        dbscan.fit(B)
        # Number of clusters under this parameter combination (-1 marks noise)
        n_clusters = len([i for i in set(dbscan.labels_) if i != -1])
        # Number of outliers
        outliers = np.sum(np.where(dbscan.labels_ == -1, 1,0))
        # Sample count of each cluster
        stats = str(pd.Series([i for i in dbscan.labels_ if i != -1]).value_counts().values)
        res.append({'eps':eps,'min_samples':min_samples,'n_clusters':n_clusters,'outliers':outliers,'stats':stats})
# Store the collected results in a dataframe
df = pd.DataFrame(res)
# Filter for reasonable parameter combinations: require exactly 3 clusters
df.loc[df.n_clusters == 3, :]
  | eps | min_samples | n_clusters | outliers | stats |
---|---|---|---|---|---|
40 | 0.251 | 2 | 3 | 24 | [3 2 2] |
48 | 0.301 | 2 | 3 | 23 | [3 3 2] |
65 | 0.401 | 3 | 3 | 19 | [5 4 3] |
73 | 0.451 | 3 | 3 | 18 | [6 4 3] |
81 | 0.501 | 3 | 3 | 17 | [6 4 4] |
82 | 0.501 | 4 | 3 | 19 | [4 4 4] |
89 | 0.551 | 3 | 3 | 13 | [7 6 5] |
90 | 0.551 | 4 | 3 | 16 | [5 5 5] |
98 | 0.601 | 4 | 3 | 15 | [6 5 5] |
99 | 0.601 | 5 | 3 | 15 | [6 5 5] |
106 | 0.651 | 4 | 3 | 12 | [7 6 6] |
107 | 0.651 | 5 | 3 | 13 | [7 6 5] |
114 | 0.701 | 4 | 3 | 12 | [7 6 6] |
115 | 0.701 | 5 | 3 | 12 | [7 6 6] |
121 | 0.751 | 3 | 3 | 7 | [15 6 3] |
129 | 0.801 | 3 | 3 | 7 | [15 6 3] |
136 | 0.851 | 2 | 3 | 5 | [15 9 2] |
152 | 0.951 | 2 | 3 | 3 | [24 2 2] |
As a clustering rule of thumb, choosing a parameter combination that leaves few outliers tends to improve the density-clustering result.
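For instance, the candidate table built above can be ranked programmatically; a small sketch (the column name outliers follows the code above):
# Among the 3-cluster combinations, list the ones with the fewest outliers first
candidates = df.loc[df.n_clusters == 3, :].sort_values('outliers')
print(candidates.head())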
'''2. Set a suitable parameter combination'''
# Based on the test results above, we choose eps = 0.8 and MinPts = 3
r = 0.8
minPts = 3
'''3. Find the data points inside each point's eps-neighborhood'''
# Function returning the points inside a point's neighborhood
def neighborhoodList(key):  # key is the index of the point
    neighborList = []  # collects the points in key's neighborhood (key itself included, at distance 0)
    for i in range(B.shape[0]):  # iterate over all rows of B
        if(np.linalg.norm(B[i,] - B[key,]) <= r):  # is point i within distance r of point key?
            neighborList.append(i)
        else: pass
    return neighborList  # return the neighborhood list
# Sanity check: find the neighborhood of the point with index 2:
neighborhoodList(2)
[2, 16, 18, 21, 30]
B.shape[0] # the sample size of the dataset == the number of rows
31
# Print the neighborhood list of every sample point
for i in range(B.shape[0]):
    print(neighborhoodList(i),end=' ')
[0] [1, 3, 12, 13, 19] [2, 16, 18, 21, 30] [1, 3, 10, 11, 12, 13, 19] [4] [5, 7, 15, 24, 27] [6, 8, 21, 30] [5, 7, 15, 24, 27] [6, 8, 12, 16, 21, 30] [9, 25] [3, 10, 11, 18] [3, 10, 11] [1, 3, 8, 12, 13, 19, 30] [1, 3, 12, 13, 19] [14, 28, 29] [5, 7, 15, 24, 27] [2, 8, 16, 21, 23, 30] [17] [2, 10, 18] [1, 3, 12, 13, 19] [20, 27] [2, 6, 8, 16, 21, 23, 30] [22] [16, 21, 23] [5, 7, 15, 24, 27] [9, 25] [26] [5, 7, 15, 20, 24, 27] [14, 28] [14, 29] [2, 6, 8, 12, 16, 21, 30]
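As an aside (a sketch, not in the original post), the same neighborhoods can be computed all at once with numpy broadcasting instead of one loop per query point:
# Pairwise Euclidean distance matrix: entry (i, j) is ||B[i] - B[j]||
pairwise = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
# One neighborhood list per row: indices whose distance is at most r
all_neighborhoods = [np.where(row <= r)[0].tolist() for row in pairwise]
print(all_neighborhoods[2])  # expected: [2, 16, 18, 21, 30]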
'''4. Find the core objects (points whose neighborhood meets the core-point requirement)'''
def FindCore(data_array):  # data_array: the 2-D array of data points
    Core=[]  # collects the core objects
    for i in range(data_array.shape[0]):  # iterate over all data points
        if len(neighborhoodList(i))>=minPts:  # the neighborhood list is large enough
            Core.append(i)  # mark the point as a core object
        else:
            pass
    return Core  # return the list of core objects
Core=FindCore(B)
print(Core,'\n',"Number of core objects:",len(Core))
[1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 18, 19, 21, 23, 24, 27, 30]
Number of core objects: 21
The following builds the neighborhood sets of the core objects:
# Build the list of the core objects' neighborhood sets
SetList=[]
for i in Core:
    elem=set(neighborhoodList(i))  # store each core object's neighborhood in the list as a set
    SetList.append(elem)
print(SetList)
[{1, 3, 12, 13, 19}, {2, 16, 18, 21, 30}, {1, 3, 10, 11, 12, 13, 19}, {5, 7, 15, 24, 27}, {8, 21, 6, 30}, {5, 7, 15, 24, 27}, {6, 8, 12, 16, 21, 30}, {11, 18, 10, 3}, {11, 10, 3}, {1, 3, 8, 12, 13, 19, 30}, {1, 3, 12, 13, 19}, {28, 29, 14}, {5, 7, 15, 24, 27}, {2, 8, 16, 21, 23, 30}, {18, 2, 10}, {1, 3, 12, 13, 19}, {2, 6, 8, 16, 21, 23, 30}, {16, 21, 23}, {5, 7, 15, 24, 27}, {5, 7, 15, 20, 24, 27}, {2, 6, 8, 12, 16, 21, 30}]
'''5. Visualize the core objects (highlighted in red, with their neighborhood circles)'''
for i in range(B.shape[0]):
    if i in Core:
        plt.plot(B[i, 0], B[i, 1], 'ro')  # 'ro': red circle markers for core objects
    else:
        plt.plot(B[i, 0], B[i, 1], 'bo')  # 'bo': blue circle markers for the remaining points
    # Label each point with its index
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize = 8,color='g')
# Draw the eps-neighborhood circle around each core object
a = np.arange(0, 2*np.pi, 0.01)
for i in Core:
    plt.plot(r*np.cos(a)+B[i,0], r*np.sin(a)+B[i,1], 'y--')
plt.title('Core Point Visualization')
plt.show()
'''6. Find direct density reachability: do two core points' neighborhoods contain both points?'''
def directly_density_reachable(CoreList):
    directlyList=[]
    for i in CoreList:
        for j in CoreList:
            inter = set(neighborhoodList(i)).intersection(set(neighborhoodList(j)))  # intersection of the two neighborhoods
            if i in inter and j in inter:  # both points fall in each other's neighborhood
                directlyList.append({i,j})
            else: pass
    return directlyList
directlyList=directly_density_reachable(Core)  # list of directly density-reachable core-object pairs
print(directlyList)
[{1}, {1, 3}, {1, 12}, {1, 13}, {1, 19}, {2}, {16, 2}, {2, 18}, {2, 21}, {2, 30}, {1, 3}, {3}, {10, 3}, {11, 3}, {3, 12}, {3, 13}, {19, 3}, {5}, {5, 7}, {5, 15}, {24, 5}, {27, 5}, {6}, {8, 6}, {21, 6}, {6, 30}, {5, 7}, {7}, {15, 7}, {24, 7}, {27, 7}, {8, 6}, {8}, {8, 12}, {8, 16}, {8, 21}, {8, 30}, {10, 3}, {10}, {10, 11}, {10, 18}, {3, 11}, {10, 11}, {11}, {1, 12}, {3, 12}, {8, 12}, {12}, {12, 13}, {19, 12}, {12, 30}, {1, 13}, {3, 13}, {12, 13}, {13}, {19, 13}, {14}, {5, 15}, {7, 15}, {15}, {24, 15}, {27, 15}, {16, 2}, {16, 8}, {16}, {16, 21}, {16, 23}, {16, 30}, {18, 2}, {18, 10}, {18}, {1, 19}, {3, 19}, {19, 12}, {19, 13}, {19}, {2, 21}, {21, 6}, {8, 21}, {16, 21}, {21}, {21, 23}, {21, 30}, {16, 23}, {21, 23}, {23}, {24, 5}, {24, 7}, {24, 15}, {24}, {24, 27}, {27, 5}, {27, 7}, {27, 15}, {24, 27}, {27}, {2, 30}, {30, 6}, {8, 30}, {12, 30}, {16, 30}, {21, 30}, {30}]
A helper function that removes duplicate elements from a list:
def Drop_duplicates(element_list):  # Drop_duplicates: remove duplicate elements
    unique_list=[]  # holds the distinct elements
    for i in range(len(element_list)):  # scan the input list
        isUnique=True  # flag, initialized to True
        for j in range(len(unique_list)):  # scan the output list
            # If the element is already in the output list, clear the flag,
            # break out of the inner loop, and move on to the next input element
            if element_list[i] == unique_list[j]:
                isUnique=False
                break
        # If no duplicate was found, the flag is still True: append the element
        if isUnique==True:
            unique_list.append(element_list[i])
    return unique_list
Deduplicate the direct-reachability list:
# The deduplicated direct-reachability list
directlyList=Drop_duplicates(directlyList)
print(directlyList)
[{1}, {1, 3}, {1, 12}, {1, 13}, {1, 19}, {2}, {16, 2}, {2, 18}, {2, 21}, {2, 30}, {3}, {10, 3}, {11, 3}, {3, 12}, {3, 13}, {19, 3}, {5}, {5, 7}, {5, 15}, {24, 5}, {27, 5}, {6}, {8, 6}, {21, 6}, {6, 30}, {7}, {15, 7}, {24, 7}, {27, 7}, {8}, {8, 12}, {8, 16}, {8, 21}, {8, 30}, {10}, {10, 11}, {10, 18}, {11}, {12}, {12, 13}, {19, 12}, {12, 30}, {13}, {19, 13}, {14}, {15}, {24, 15}, {27, 15}, {16}, {16, 21}, {16, 23}, {16, 30}, {18}, {19}, {21}, {21, 23}, {21, 30}, {23}, {24}, {24, 27}, {27}, {30}]
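An equivalent shortcut for this deduplication step (a sketch): plain sets are unhashable, but frozensets are, so a list of sets can be deduplicated in one pass, at the cost of losing the original order:
# Hash each set as a frozenset to deduplicate in a single pass
unique_pairs = [set(fs) for fs in {frozenset(s) for s in directlyList}]
print(len(unique_pairs) == len(directlyList))  # True here, as directlyList is already deduplicated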
'''7. Find the density-connected (indirectly density-reachable) core objects'''
def density_reachable():
    reachableList=[]  # collects the sets of density-connected core objects
    for i in range(len(directlyList)):  # walk the (deduplicated) direct-reachability list
        pre = directlyList[i]  # current working set
        for j in range(len(directlyList)):
            if pre.intersection(directlyList[j])!=set():  # the two sets overlap
                pre = pre.union(directlyList[j])  # merge the overlapping set into the working set
            else:pass
            # After each update, re-scan the list elements before index j to
            # fold in any subsets that were missed earlier
            for k in range(j):
                if pre.intersection(directlyList[k])!=set():
                    pre=pre.union(directlyList[k])
        reachableList.append(pre)  # append the fully merged set
    reachableList=Drop_duplicates(reachableList)  # deduplicate
    return reachableList
reachableList=density_reachable()
reachableList
[{1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30},
{5, 7, 15, 24, 27},
{14}]
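The pairwise merging above solves what is essentially a connected-components problem. As a cross-check (a sketch, not part of the original post), a breadth-first search over an adjacency map built from the same pairs should reproduce the three groups:
from collections import defaultdict, deque
# Adjacency map: every member of a directly-reachable pair is linked to the others
adjacency = defaultdict(set)
for pair in directlyList:
    for a in pair:
        adjacency[a] |= pair
# Standard BFS collecting connected components
visited, components = set(), []
for start in adjacency:
    if start in visited:
        continue
    queue, comp = deque([start]), set()
    while queue:
        node = queue.popleft()
        if node in comp:
            continue
        comp.add(node)
        queue.extend(adjacency[node] - comp)
    visited |= comp
    components.append(comp)
print(components)  # expected to match reachableList up to ordering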
'''8. Assemble the clustering result: clusters + outliers'''
# ClustersList: the result list; each cluster is stored as a set
# PointClusterSet: the set of all points that belong to any cluster
# PerClusterSet: temporary set holding the points of one cluster
ClustersList=[]
PointClusterSet=set()
for i in range(len(reachableList)):  # 1. each element of reachableList is the set of core objects of one cluster
    PerClusterSet=set()  # fresh temporary set
    for j in list(reachableList[i]):  # 2. iterate over the core objects of this cluster
        for k in neighborhoodList(j):  # 3. iterate over each core object's neighborhood list
            PerClusterSet.add(k)  # 4. add every neighborhood point to the cluster
    # each iteration of the outer loop (i) yields one complete cluster, PerClusterSet
    PointClusterSet=PointClusterSet.union(PerClusterSet)  # 5. accumulate all clustered points, used later for the outliers
    ClustersList.append(PerClusterSet)  # 6. append this cluster to the result list
    print("Cluster",i,": ",PerClusterSet)
    print("Core Points of Cluster",i,": ",reachableList[i])
    BorderSet=PerClusterSet - reachableList[i]  # 7. border points: the cluster set minus its core objects
    if BorderSet != set():
        print("Border Points of Cluster",i,": ",BorderSet)
    else:
        print("Border Points of Cluster",i,": NULL")
    print('-'*40)
# Print the set of all clustered points
# print("All Points in Clusters: ", PointClusterSet)
print('-'*40)
# Compute the outliers:
# 1. build the set of all points
AllPoint=set()
for i in range(B.shape[0]):
    AllPoint.add(i)
# 2. subtract the clustered points from the set of all points
outlier=AllPoint-PointClusterSet
print("Outlier: ", outlier)
Cluster 0 :  {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Core Points of Cluster 0 :  {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Border Points of Cluster 0 : NULL
----------------------------------------
Cluster 1 :  {5, 7, 15, 20, 24, 27}
Core Points of Cluster 1 :  {5, 7, 24, 27, 15}
Border Points of Cluster 1 :  {20}
----------------------------------------
Cluster 2 :  {28, 29, 14}
Core Points of Cluster 2 :  {14}
Border Points of Cluster 2 :  {28, 29}
----------------------------------------
----------------------------------------
Outlier:  {0, 4, 9, 17, 22, 25, 26}
# The list of clusters
ClustersList
[{1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30},
{5, 7, 15, 20, 24, 27},
{14, 28, 29}]
print("密度聚类后类的数目:",len(ClustersList))
密度聚类后类的数目: 3
'''9. Visualize the DBSCAN clustering result'''
import matplotlib.pyplot as plt
# Color scheme
colors = {'Cluster1': 'red', 'Cluster2': 'green', 'Cluster3': 'blue', 'Outliers': 'purple'}
# Draw the scatter plot
for i in range(B.shape[0]):
    if i in ClustersList[0]:
        plt.scatter(B[i, 0], B[i, 1], color='red', marker='o')
    elif i in ClustersList[1]:
        plt.scatter(B[i, 0], B[i, 1], color='green', marker='o')
    elif i in ClustersList[2]:
        plt.scatter(B[i, 0], B[i, 1], color='blue', marker='o')
    else:
        plt.scatter(B[i, 0], B[i, 1], color='purple', marker='o')
    # Label each point with its index
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize=10,color='black')
# Build the legend handles
legend_handles = []
for label, color in colors.items():
    legend_handles.append(plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=10))
# Attach the legend
plt.legend(legend_handles, colors.keys())
# Axis labels
plt.xlabel('X')
plt.ylabel('Y')
# Title
plt.title('DBSCAN Cluster Visualization')
plt.show()
Part 3. Density clustering of the dataset using sklearn library functions
'''1. Train a clustering model on the data (array B) with sklearn's DBSCAN class'''
from sklearn.cluster import DBSCAN
# Run DBSCAN with r=0.8, MinPts=3
db = DBSCAN(eps=0.8, min_samples=3)
db.fit(B)
DBSCAN(eps=0.8, min_samples=3)
'''2. Evaluate the clustering result (silhouette coefficient)'''
from sklearn.metrics import silhouette_score
# Get the cluster labels (-1 marks the outliers); labels is an array
labels = db.labels_
# Compute the silhouette coefficient
score = silhouette_score(B, labels)
print('Silhouette coefficient:', score)
Silhouette coefficient: 0.37453991956763705
Interpreting the silhouette coefficient:
as score(i) approaches -1, sample i would be better assigned to a different cluster;
as score(i) approaches 0, sample i lies on the boundary between two clusters;
as score(i) approaches 1, sample i is clustered reasonably.
A silhouette coefficient of 0.37 is not a particularly strong result, but to allow a comparison with the hand-written implementation we keep the (0.8, 3) parameter combination.
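To see which individual samples are borderline, one can also inspect per-sample values; a sketch using sklearn's silhouette_samples (not part of the original post):
from sklearn.metrics import silhouette_samples
import numpy as np
# Per-sample silhouette values score(i), on the same [-1, 1] scale as above
sample_scores = silhouette_samples(B, labels)
# Indices of the weakest samples, e.g. those with score(i) below 0
print(np.where(sample_scores < 0)[0])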
The choice of eps can be refined with a k-nearest-neighbor distance plot, by locating the elbow of the curve:
'''* Use NearestNeighbors to compute and plot each sample's distance to its 2nd nearest neighbor'''
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
neigh = NearestNeighbors(n_neighbors=2)  # nearest-neighbor searcher with n_neighbors=2
nbrs = neigh.fit(B)  # fit the searcher on the data B
distances, indices = nbrs.kneighbors(B)  # distances from each sample to its 2 nearest neighbors
distances = np.sort(distances, axis=0)  # sort the distances in ascending order
distances = distances[:, 1]  # keep the distance to the 2nd nearest neighbor (the 1st is the point itself)
# Plot the distance curve
plt.plot(distances)
plt.title('Distance to 2nd Nearest Neighbor')  # plot title
plt.show()
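A common convention (an assumption here, not something the original post states) is to plot the distance to the MinPts-th nearest neighbor when tuning eps for a given MinPts; a sketch with the minPts = 3 set earlier:
# Query minPts + 1 neighbors, because the nearest 'neighbor' of a training
# point is the point itself at distance 0
neigh_k = NearestNeighbors(n_neighbors=minPts + 1)
dist_k, _ = neigh_k.fit(B).kneighbors(B)
k_distances = np.sort(dist_k[:, minPts])  # ascending k-distance curve
plt.plot(k_distances)
plt.title('Distance to MinPts-th Nearest Neighbor')
plt.show()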
# Print the cluster-label list
print(labels)
[-1 0 0 0 -1 1 0 1 0 -1 0 0 0 0 2 1 0 -1 0 0 1 0 -1 0
1 -1 -1 1 2 2 0]
# Print the core-point index list and its length
print(db.core_sample_indices_, '\n','Number of core objects:',db.core_sample_indices_.shape[0])
[ 1 2 3 5 6 7 8 10 11 12 13 14 15 16 18 19 21 23 24 27 30]
Number of core objects: 21
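As a quick consistency check (not in the original post), sklearn's core indices can be compared directly with the hand-written Core list from Part 2:
# Both runs should identify the same 21 core objects
print(list(db.core_sample_indices_) == Core)  # expected: True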
'''3. Display the clustering result'''
core = db.core_sample_indices_  # keep the core-object indices in the variable core
# Iterate over the cluster labels
for i in range(len(list(set(labels)))-1):  # the -1 drops the outlier label (-1)
    list1=[]  # temporarily holds the points of the current cluster
    # Check each point's label against the current cluster label
    for j in range(B.shape[0]):
        if labels[j] == i:
            list1.append(j)
        else: pass
    print("Cluster ",i,": ",set(list1))
    list2=[]  # the core objects of this cluster
    list3=[]  # the border points of this cluster
    for k in list1:
        if k in core:list2.append(k)
        else:list3.append(k)
    print("Core Point from Cluster",i,": ",set(list2))
    if list3!=[]:
        print("Border Point from Cluster",i,": ",set(list3))
    else:
        print("Border Point from Cluster",i,": NULL")
    print('-'*40)
list4=[]  # collects the outliers
for k in range(len(labels)):
    if labels[k]== -1: list4.append(k)  # points labeled -1 are outliers
print("Outlier: ",set(list4))
Cluster 0 : {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Core Point from Cluster 0 : {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Border Point from Cluster 0 : NULL
----------------------------------------
Cluster 1 : {5, 7, 15, 20, 24, 27}
Core Point from Cluster 1 : {5, 7, 15, 24, 27}
Border Point from Cluster 1 : {20}
----------------------------------------
Cluster 2 : {28, 29, 14}
Core Point from Cluster 2 : {14}
Border Point from Cluster 2 : {28, 29}
----------------------------------------
Outlier: {0, 4, 9, 17, 22, 25, 26}
'''4. Visualize the clustering result'''
for i in range(B.shape[0]):
    if labels[i] == 0:
        c1 = plt.scatter(B[i,0], B[i,1], c = 'r', marker='+')
    elif labels[i] == 1:
        c2 = plt.scatter(B[i,0], B[i,1], c = 'g', marker='o')
    elif labels[i] == 2:
        c3 = plt.scatter(B[i,0], B[i,1], c = 'b', marker='*')
    elif labels[i] == -1:
        c4 = plt.scatter(B[i,0], B[i,1], c = 'purple', marker='^')
    # Label each point with its index
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize=10,color='black')
plt.legend([c1,c2,c3,c4], ['Cluster 1','Cluster 2','Cluster 3','Outlier'])
plt.title('DBSCAN Clustering Results based on sklearn')
plt.show()
Comparing the results of the two clustering runs above:
with the same parameter combination, the hand-written implementation and the sklearn-based implementation produce identical density-clustering results, which confirms that the hand-written algorithm is sound.
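That agreement can also be verified programmatically; a sketch (not in the original post) comparing the two partitions as unordered collections of clusters:
# Rebuild the cluster sets from the sklearn labels (ignoring the -1 noise label)
# and compare them with the hand-written ClustersList
sk_clusters = [set(np.where(labels == c)[0]) for c in set(labels) if c != -1]
print(set(map(frozenset, sk_clusters)) == set(map(frozenset, ClustersList)))  # expected: True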