DBSCAN Clustering Algorithm
—— Basic Principles of the Density Clustering Algorithm and a Python Implementation
Contents
One. Basic principles of the density clustering algorithm (DBSCAN)
Handwritten notes (images in the original post) follow:
(For the concepts: the faintest ink beats the best memory!
For the thinking: algorithm principles are best learned through derivation!)
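(Since the notes themselves are images, here is a brief recap of the definitions they cover and that Part 2 of the example relies on: a point is a core object if its eps-neighborhood contains at least MinPts points, itself included; a point q is directly density-reachable from a core object p if q lies in p's eps-neighborhood; density reachability extends this through chains of core objects; and two points are density-connected if both are density-reachable from a common core object. A cluster is a maximal set of density-connected points, and points that belong to no cluster are noise, i.e. outliers.)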
Two. A worked example: density clustering of a hostel information dataset in Python
Dataset: HostelDataset.xls
Source: https://www.kaggle.com/datasets/lexipetzold/hostels-in-guatemala-hostelworld-data
Note: the dataset has been modified for this exercise
The modified HostelDataset (download link in the original post)
Dataset description
Field | Description |
---|---|
Hostel Name | hostel name |
Hostel_ID | hostel ID |
DESTINATION | destination |
Starting_Price | starting price |
Free_Breakfast | free breakfast offered |
Wifi | wireless network available |
Rating (out of 10) | rating on a 10-point scale |
Total Ratings | total number of ratings |
Traffic situation | traffic situation |
Part 1. Reading the dataset and data preprocessing
'''1. Read the dataset'''
import pandas as pd
# Path to the data file
file_path='D:/MachineLearningDesign/TotalDataset/Guatemalas_Travel/HostelDataset.xls'
# Read the dataset from the file with pandas
origin_data = pd.read_excel(file_path)
# Display the dataset for inspection
origin_data
  | Hostel Name | Hostel_ID | DESTINATION | Starting_Price | Free_Breakfast | Wifi | Rating (out of 10) | Total Ratings | Traffic situation |
---|---|---|---|---|---|---|---|---|---|
0 | Maya Papaya | 1 | Antigua | 15.00 | YES | YES | 9.5 | 897 | A |
1 | Tropicana Hostel | 2 | Antigua | 11.94 | NO | NO | 9.0 | 1411 | B |
2 | Somos | 3 | Antigua | 9.94 | NO | YES | 8.5 | 283 | A |
3 | Selina Antigua | 4 | Antigua | 12.00 | NO | NO | 9.7 | 371 | A |
4 | Ojala | 5 | Antigua | 15.00 | YES | YES | 7.2 | 172 | B |
5 | Amura | 6 | Antigua | 8.00 | NO | YES | 7.2 | 39 | C |
6 | Hostal Antigua | 7 | Antigua | 9.11 | NO | NO | 8.1 | 325 | A |
7 | Hostel Los Tecalotes | 8 | Antigua | 8.61 | NO | YES | 7.0 | 2 | C |
8 | Hostel Antigueno | 9 | Antigua | 9.35 | NO | NO | 8.7 | 162 | A |
9 | Adra Hostel | 10 | Antigua | 16.97 | YES | YES | 9.0 | 95 | C |
10 | Barbara's Boutique Hostel | 11 | Antigua | 10.34 | NO | YES | 9.6 | 461 | A |
11 | The Purpose Hostel | 12 | Antigua | 12.58 | No | NO | 9.6 | 536 | A |
12 | La Iguana Perdida | 13 | Lake Atitlan | 6.85 | NO | YES | 9.1 | 909 | A |
13 | Free Cerveza | 14 | Lake Atitlan | 5.84 | NO | YES | 9.0 | 1293 | A |
14 | Selina Atitlan | 15 | Lake Atitlan | 11.00 | NO | YES | 8.9 | 193 | B |
15 | Hostel San Marcos | 16 | Lake Atitlan | 8.00 | YES | YES | 6.5 | 60 | B |
16 | Eco-Hostel Mayachik | 17 | Lake Atitlan | 8.00 | NO | NO | 9.4 | 45 | B |
17 | Maca Hostel and Micro-Resort | 18 | Lake Atitlan | 27.50 | NO | YES | 9.6 | 2 | B |
18 | Mandala’s Hostal | 19 | Lake Atitlan | 9.94 | NO | YES | 9.5 | 94 | A |
19 | Mr. Mullet's | 20 | Lake Atitlan | 11.26 | NO | NO | 8.8 | 864 | A |
20 | Hotel Amigos | 21 | Lake Atitlan | 9.25 | NO | YES | 7.7 | 21 | B |
21 | Tequila Sunrise | 22 | Guatemala City | 5.50 | YES | NO | 8.4 | 231 | A |
22 | Hostal Guatefriends | 23 | Guatemala City | 18.00 | YES | YES | 9.6 | 252 | B |
23 | Nostalgic Hostel | 24 | Guatemala City | 8.61 | YES | NO | 7.8 | 39 | A |
24 | Kaena Point Hostel | 25 | Guatemala City | 6.18 | NO | YES | 7.4 | 54 | C |
25 | Hostal Los Lagos | 26 | Guatemala City | 15.00 | YES | YES | 9.2 | 189 | C |
26 | Euro Hostel | 27 | Guatemala City | 17.50 | YES | YES | 10.0 | 183 | A |
27 | Life Builders | 28 | Guatemala City | 9.94 | YES | NO | 7.6 | 0 | C |
28 | Driftwood Surfer beach hostel El Paredon Guate... | 29 | El Paredon | 9.97 | NO | YES | 8.8 | 435 | C |
29 | Mellow Hostel El Paredon | 30 | El Paredon | 13.41 | NO | YES | 9.3 | 48 | B |
30 | Cocorí Lodge El Paredon | 31 | El Paredon | 11.50 | NO | NO | 8.5 | 100 | A |
'''2. Encode the non-numeric features and fill in missing values'''
from sklearn.preprocessing import LabelEncoder  # label encoding: converts categorical values to integer codes
# Work on a copy so the original dataframe is left untouched
Numlize_data = origin_data.copy()
# Names of the non-numeric columns (features)
non_numeric_columns = ['DESTINATION', 'Free_Breakfast','Wifi','Traffic situation']
# Encode the non-numeric columns
label_encoders = {}  # stores the LabelEncoder fitted for each feature
# For each non-numeric column in the dataset:
for column in non_numeric_columns:
    # Only encode columns whose dtype is object (strings)
    if Numlize_data[column].dtype == 'object':
        # Create a LabelEncoder for this column and keep it in the
        # label_encoders dict for later reuse (e.g. inverse_transform)
        label_encoders[column] = LabelEncoder()
        # fit_transform() fits the encoder and converts the column's string
        # values to integer codes, written back into the dataframe
        Numlize_data[column] = label_encoders[column].fit_transform(Numlize_data[column])
# Fill missing values in the numeric columns with the column means
numeric_columns = Numlize_data.select_dtypes(include=['number']).columns
Numlize_data[numeric_columns] = Numlize_data[numeric_columns].fillna(Numlize_data[numeric_columns].mean())
# Display the encoded result
Numlize_data
  | Hostel Name | Hostel_ID | DESTINATION | Starting_Price | Free_Breakfast | Wifi | Rating (out of 10) | Total Ratings | Traffic situation |
---|---|---|---|---|---|---|---|---|---|
0 | Maya Papaya | 1 | 0 | 15.00 | 2 | 1 | 9.5 | 897 | 0 |
1 | Tropicana Hostel | 2 | 0 | 11.94 | 0 | 0 | 9.0 | 1411 | 1 |
2 | Somos | 3 | 0 | 9.94 | 0 | 1 | 8.5 | 283 | 0 |
3 | Selina Antigua | 4 | 0 | 12.00 | 0 | 0 | 9.7 | 371 | 0 |
4 | Ojala | 5 | 0 | 15.00 | 2 | 1 | 7.2 | 172 | 1 |
5 | Amura | 6 | 0 | 8.00 | 0 | 1 | 7.2 | 39 | 2 |
6 | Hostal Antigua | 7 | 0 | 9.11 | 0 | 0 | 8.1 | 325 | 0 |
7 | Hostel Los Tecalotes | 8 | 0 | 8.61 | 0 | 1 | 7.0 | 2 | 2 |
8 | Hostel Antigueno | 9 | 0 | 9.35 | 0 | 0 | 8.7 | 162 | 0 |
9 | Adra Hostel | 10 | 0 | 16.97 | 2 | 1 | 9.0 | 95 | 2 |
10 | Barbara's Boutique Hostel | 11 | 0 | 10.34 | 0 | 1 | 9.6 | 461 | 0 |
11 | The Purpose Hostel | 12 | 0 | 12.58 | 1 | 0 | 9.6 | 536 | 0 |
12 | La Iguana Perdida | 13 | 3 | 6.85 | 0 | 1 | 9.1 | 909 | 0 |
13 | Free Cerveza | 14 | 3 | 5.84 | 0 | 1 | 9.0 | 1293 | 0 |
14 | Selina Atitlan | 15 | 3 | 11.00 | 0 | 1 | 8.9 | 193 | 1 |
15 | Hostel San Marcos | 16 | 3 | 8.00 | 2 | 1 | 6.5 | 60 | 1 |
16 | Eco-Hostel Mayachik | 17 | 3 | 8.00 | 0 | 0 | 9.4 | 45 | 1 |
17 | Maca Hostel and Micro-Resort | 18 | 3 | 27.50 | 0 | 1 | 9.6 | 2 | 1 |
18 | Mandala’s Hostal | 19 | 3 | 9.94 | 0 | 1 | 9.5 | 94 | 0 |
19 | Mr. Mullet's | 20 | 3 | 11.26 | 0 | 0 | 8.8 | 864 | 0 |
20 | Hotel Amigos | 21 | 3 | 9.25 | 0 | 1 | 7.7 | 21 | 1 |
21 | Tequila Sunrise | 22 | 2 | 5.50 | 2 | 0 | 8.4 | 231 | 0 |
22 | Hostal Guatefriends | 23 | 2 | 18.00 | 2 | 1 | 9.6 | 252 | 1 |
23 | Nostalgic Hostel | 24 | 2 | 8.61 | 2 | 0 | 7.8 | 39 | 0 |
24 | Kaena Point Hostel | 25 | 2 | 6.18 | 0 | 1 | 7.4 | 54 | 2 |
25 | Hostal Los Lagos | 26 | 2 | 15.00 | 2 | 1 | 9.2 | 189 | 2 |
26 | Euro Hostel | 27 | 2 | 17.50 | 2 | 1 | 10.0 | 183 | 0 |
27 | Life Builders | 28 | 2 | 9.94 | 2 | 0 | 7.6 | 0 | 2 |
28 | Driftwood Surfer beach hostel El Paredon Guate... | 29 | 1 | 9.97 | 0 | 1 | 8.8 | 435 | 2 |
29 | Mellow Hostel El Paredon | 30 | 1 | 13.41 | 0 | 1 | 9.3 | 48 | 1 |
30 | Cocorí Lodge El Paredon | 31 | 1 | 11.50 | 0 | 0 | 8.5 | 100 | 0 |
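Note: after encoding, Free_Breakfast takes three codes (0, 1, 2) rather than the expected two. The source column mixes the spellings 'NO' and 'No' (see row 11, The Purpose Hostel), and LabelEncoder treats them as distinct categories: 'NO' encodes to 0, 'No' to 1, and 'YES' to 2. Normalizing the case beforehand would merge the two; the analysis below proceeds with the data as-is.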
'''3. Standardization'''
from sklearn.preprocessing import StandardScaler
# Take the feature values to be standardized, starting from the third column
scale_data=Numlize_data.iloc[:,2:].values
# Initialize a standardization transformer
transfer0 = StandardScaler()
scale_data=transfer0.fit_transform(scale_data)
print(scale_data,scale_data.shape)
[[-1.1226828 0.82538383 1.42313078 0.74161985 0.92288087 1.56308313
-0.92519568]
[-1.1226828 0.13194738 -0.72892064 -1.34839972 0.3789777 2.94361451
0.32180719]
[-1.1226828 -0.32127907 -0.72892064 0.74161985 -0.16492548 -0.08603412
-0.92519568]
[-1.1226828 0.14554417 -0.72892064 -1.34839972 1.14044214 0.15032145
-0.92519568]
[-1.1226828 0.82538383 1.42313078 0.74161985 -1.57907374 -0.38416444
0.32180719]
[-1.1226828 -0.76090872 -0.72892064 0.74161985 -1.57907374 -0.74138365
1.56881007]
[-1.1226828 -0.50936804 -0.72892064 -1.34839972 -0.60004802 0.02677195
-0.92519568]
[-1.1226828 -0.62267465 -0.72892064 0.74161985 -1.79663501 -0.84076042
1.56881007]
[-1.1226828 -0.45498087 -0.72892064 -1.34839972 0.05263579 -0.41102302
-0.92519568]
[-1.1226828 1.27181188 1.42313078 0.74161985 0.3789777 -0.59097556
1.56881007]
[-1.1226828 -0.23063378 -0.72892064 0.74161985 1.03166151 0.39204873
-0.92519568]
[-1.1226828 0.27697984 0.34710507 -1.34839972 1.03166151 0.59348814
-0.92519568]
[ 1.25026039 -1.02151392 -0.72892064 0.74161985 0.48775833 1.59531344
-0.92519568]
[ 1.25026039 -1.25039327 -0.72892064 0.74161985 0.3789777 2.62668318
-0.92519568]
[ 1.25026039 -0.08106905 -0.72892064 0.74161985 0.27019706 -0.3277614
0.32180719]
[ 1.25026039 -0.76090872 1.42313078 0.74161985 -2.34053819 -0.68498061
0.32180719]
[ 1.25026039 -0.76090872 -0.72892064 -1.34839972 0.81410024 -0.72526849
0.32180719]
[ 1.25026039 3.65804909 -0.72892064 0.74161985 1.03166151 -0.84076042
0.32180719]
[ 1.25026039 -0.32127907 -0.72892064 0.74161985 0.92288087 -0.59366142
-0.92519568]
[ 1.25026039 -0.02214961 -0.72892064 -1.34839972 0.16141643 1.47444979
-0.92519568]
[ 1.25026039 -0.47764219 -0.72892064 0.74161985 -1.03517056 -0.7897291
0.32180719]
[ 0.45927933 -1.32744177 1.42313078 -1.34839972 -0.27370611 -0.22569877
-0.92519568]
[ 0.45927933 1.50522349 1.42313078 0.74161985 1.03166151 -0.16929574
0.32180719]
[ 0.45927933 -0.62267465 1.42313078 -1.34839972 -0.92638993 -0.74138365
-0.92519568]
[ 0.45927933 -1.17334478 -0.72892064 0.74161985 -1.36151247 -0.70109576
1.56881007]
[ 0.45927933 0.82538383 1.42313078 0.74161985 0.59653897 -0.33850484
1.56881007]
[ 0.45927933 1.39191688 1.42313078 0.74161985 1.46678405 -0.35461999
-0.92519568]
[ 0.45927933 -0.32127907 1.42313078 -1.34839972 -1.1439512 -0.84613214
1.56881007]
[-0.33170174 -0.31448067 -0.72892064 0.74161985 0.16141643 0.32221641
1.56881007]
[-0.33170174 0.46506881 -0.72892064 0.74161985 0.7053196 -0.71721092
0.32180719]
[-0.33170174 0.03223756 -0.72892064 -1.34839972 -0.16492548 -0.57754626
-0.92519568]] (31, 7)
Note: the standardized array shows that the features to be clustered span seven dimensions. To make the clustering results easier to visualize, we next reduce them to two dimensions with PCA.
'''4. Dimensionality reduction'''
from sklearn.decomposition import PCA
transfer1 = PCA(n_components = 2)  # reduce to a 2-D space for easier visualization
pca_data = transfer1.fit_transform(scale_data)
pca_data
array([[-1.20875831, 1.62723311],
[-2.17954007, -0.49999884],
[-0.41892282, -0.35878887],
[-1.8327227 , -0.03959969],
[ 1.69512135, 0.41250462],
[ 2.01954599, -1.34181723],
[-0.9805543 , -1.41593127],
[ 2.17906043, -1.35095566],
[-1.06041339, -1.0250797 ],
[ 1.62931985, 1.78510981],
[-1.22269447, 0.31381418],
[-1.74506824, 0.3039498 ],
[-1.49489319, -0.46815072],
[-1.98313424, -0.72149184],
[ 0.36494895, 0.18653765],
[ 2.26083643, -1.01827662],
[-0.41744797, -0.70656468],
[ 0.4389093 , 3.32221946],
[-0.54500345, 0.33179379],
[-1.93418732, -0.59639915],
[ 1.1977873 , -0.76596774],
[-0.44229712, -1.12713865],
[ 0.47138872, 2.33007222],
[ 0.16094183, -0.93674703],
[ 1.96129619, -1.4563731 ],
[ 1.46069336, 1.6395852 ],
[-0.3718697 , 2.45268164],
[ 1.78780896, -0.77514925],
[ 0.7174211 , -0.10764917],
[ 0.30145571, 0.74718192],
[-0.80902817, -0.74060422]])
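As a quick sanity check (not part of the original run), one can ask how much of the total variance the two retained components explain; explained_variance_ratio_ is a standard attribute of a fitted sklearn PCA object:
# Fraction of the total variance carried by each of the two components,
# and their sum (the information retained after reduction)
print(transfer1.explained_variance_ratio_)
print(transfer1.explained_variance_ratio_.sum())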
'''5. Visualize the reduced data points; adjust the axis ranges and index-label positions'''
import matplotlib.pyplot as plt
B = pca_data  # alias the PCA output as B; all later steps operate on B
# Plot the data points row by row
for i in range(B.shape[0]):
    # B.shape[0] is the number of rows (samples); B.shape[1] the number of columns.
    # B[i, 0] and B[i, 1] are the x and y coordinates of the i-th data point
    plt.plot(B[i, 0], B[i, 1], 'bo')  # 'bo': blue circle markers
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize=10,color='g')  # label each point with its index
plt.title('Data Sample Visualization')
plt.show()
# Check the data type of B: numpy.ndarray
type(B)
numpy.ndarray
Part 2. Density clustering of the dataset without calling sklearn library functions
(except when selecting the optimal parameters)
'''1. Select the optimal parameters (eps, MinPts)'''
from sklearn.cluster import DBSCAN
import numpy as np
# Empty list collecting the results of the different parameter combinations
res = []
# Iterate over eps values: from 0.001 up to 1 in steps of 0.05
for eps in np.arange(0.001,1,0.05):
    # Iterate over min_samples values: from 2 to 9
    for min_samples in range(2,10):
        dbscan = DBSCAN(eps = eps, min_samples = min_samples)
        # Fit the model
        dbscan.fit(B)
        # Number of clusters under this parameter combination (-1 marks noise)
        n_clusters = len([i for i in set(dbscan.labels_) if i != -1])
        # Number of outliers
        outliers = np.sum(np.where(dbscan.labels_ == -1, 1,0))
        # Sample count of each cluster
        stats = str(pd.Series([i for i in dbscan.labels_ if i != -1]).value_counts().values)
        res.append({'eps':eps,'min_samples':min_samples,'n_clusters':n_clusters,'outliers':outliers,'stats':stats})
# Store the collected results in a dataframe
df = pd.DataFrame(res)
# Filter for reasonable parameter combinations: require exactly 3 clusters
df.loc[df.n_clusters == 3, :]
  | eps | min_samples | n_clusters | outliers | stats |
---|---|---|---|---|---|
40 | 0.251 | 2 | 3 | 24 | [3 2 2] |
48 | 0.301 | 2 | 3 | 23 | [3 3 2] |
65 | 0.401 | 3 | 3 | 19 | [5 4 3] |
73 | 0.451 | 3 | 3 | 18 | [6 4 3] |
81 | 0.501 | 3 | 3 | 17 | [6 4 4] |
82 | 0.501 | 4 | 3 | 19 | [4 4 4] |
89 | 0.551 | 3 | 3 | 13 | [7 6 5] |
90 | 0.551 | 4 | 3 | 16 | [5 5 5] |
98 | 0.601 | 4 | 3 | 15 | [6 5 5] |
99 | 0.601 | 5 | 3 | 15 | [6 5 5] |
106 | 0.651 | 4 | 3 | 12 | [7 6 6] |
107 | 0.651 | 5 | 3 | 13 | [7 6 5] |
114 | 0.701 | 4 | 3 | 12 | [7 6 6] |
115 | 0.701 | 5 | 3 | 12 | [7 6 6] |
121 | 0.751 | 3 | 3 | 7 | [15 6 3] |
129 | 0.801 | 3 | 3 | 7 | [15 6 3] |
136 | 0.851 | 2 | 3 | 5 | [15 9 2] |
152 | 0.951 | 2 | 3 | 3 | [24 2 2] |
As a clustering rule of thumb, choosing a parameter combination that leaves few outliers tends to improve the density-clustering result.
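For instance, the candidate table built above can be ranked programmatically; a small sketch (the column name outliers follows the code above):
# Among the 3-cluster combinations, list the ones with the fewest outliers first
candidates = df.loc[df.n_clusters == 3, :].sort_values('outliers')
print(candidates.head())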
'''2. Set a suitable parameter combination'''
# Based on the test results above, we choose eps = 0.8 and MinPts = 3
r = 0.8
minPts = 3
'''3. Find the data points inside each point's eps-neighborhood'''
# Function returning the points inside a point's neighborhood
def neighborhoodList(key):  # key is the index of the point
    neighborList = []  # collects the points in key's neighborhood (key itself included, at distance 0)
    for i in range(B.shape[0]):  # iterate over all rows of B
        if(np.linalg.norm(B[i,] - B[key,]) <= r):  # is point i within distance r of point key?
            neighborList.append(i)
        else: pass
    return neighborList  # return the neighborhood list
# Sanity check: find the neighborhood of the point with index 2:
neighborhoodList(2)
[2, 16, 18, 21, 30]
B.shape[0] # the sample size of the dataset == the number of rows
31
# Print the neighborhood list of every sample point
for i in range(B.shape[0]):
    print(neighborhoodList(i),end=' ')
[0] [1, 3, 12, 13, 19] [2, 16, 18, 21, 30] [1, 3, 10, 11, 12, 13, 19] [4] [5, 7, 15, 24, 27] [6, 8, 21, 30] [5, 7, 15, 24, 27] [6, 8, 12, 16, 21, 30] [9, 25] [3, 10, 11, 18] [3, 10, 11] [1, 3, 8, 12, 13, 19, 30] [1, 3, 12, 13, 19] [14, 28, 29] [5, 7, 15, 24, 27] [2, 8, 16, 21, 23, 30] [17] [2, 10, 18] [1, 3, 12, 13, 19] [20, 27] [2, 6, 8, 16, 21, 23, 30] [22] [16, 21, 23] [5, 7, 15, 24, 27] [9, 25] [26] [5, 7, 15, 20, 24, 27] [14, 28] [14, 29] [2, 6, 8, 12, 16, 21, 30]
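As an aside (a sketch, not in the original post), the same neighborhoods can be computed all at once with numpy broadcasting instead of one loop per query point:
# Pairwise Euclidean distance matrix: entry (i, j) is ||B[i] - B[j]||
pairwise = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
# One neighborhood list per row: indices whose distance is at most r
all_neighborhoods = [np.where(row <= r)[0].tolist() for row in pairwise]
print(all_neighborhoods[2])  # expected: [2, 16, 18, 21, 30]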
'''4. Find the core objects (points whose neighborhood meets the core-point requirement)'''
def FindCore(data_array):  # data_array: the 2-D array of data points
    Core=[]  # collects the core objects
    for i in range(data_array.shape[0]):  # iterate over all data points
        if len(neighborhoodList(i))>=minPts:  # the neighborhood list is large enough
            Core.append(i)  # mark the point as a core object
        else:
            pass
    return Core  # return the list of core objects
Core=FindCore(B)
print(Core,'\n',"Number of core objects:",len(Core))
[1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 18, 19, 21, 23, 24, 27, 30]
Number of core objects: 21
The following builds the neighborhood sets of the core objects:
# Build the list of the core objects' neighborhood sets
SetList=[]
for i in Core:
    elem=set(neighborhoodList(i))  # store each core object's neighborhood in the list as a set
    SetList.append(elem)
print(SetList)
[{1, 3, 12, 13, 19}, {2, 16, 18, 21, 30}, {1, 3, 10, 11, 12, 13, 19}, {5, 7, 15, 24, 27}, {8, 21, 6, 30}, {5, 7, 15, 24, 27}, {6, 8, 12, 16, 21, 30}, {11, 18, 10, 3}, {11, 10, 3}, {1, 3, 8, 12, 13, 19, 30}, {1, 3, 12, 13, 19}, {28, 29, 14}, {5, 7, 15, 24, 27}, {2, 8, 16, 21, 23, 30}, {18, 2, 10}, {1, 3, 12, 13, 19}, {2, 6, 8, 16, 21, 23, 30}, {16, 21, 23}, {5, 7, 15, 24, 27}, {5, 7, 15, 20, 24, 27}, {2, 6, 8, 12, 16, 21, 30}]
'''5. Visualize the core objects (highlighted in red, with their neighborhood circles)'''
for i in range(B.shape[0]):
    if i in Core:
        plt.plot(B[i, 0], B[i, 1], 'ro')  # 'ro': red circle markers for core objects
    else:
        plt.plot(B[i, 0], B[i, 1], 'bo')  # 'bo': blue circle markers for the remaining points
    # Label each point with its index
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize = 8,color='g')
# Draw the eps-neighborhood circle around each core object
a = np.arange(0, 2*np.pi, 0.01)
for i in Core:
    plt.plot(r*np.cos(a)+B[i,0], r*np.sin(a)+B[i,1], 'y--')
plt.title('Core Point Visualization')
plt.show()
'''6. Find direct density reachability: do two core points' neighborhoods contain both points?'''
def directly_density_reachable(CoreList):
    directlyList=[]
    for i in CoreList:
        for j in CoreList:
            inter = set(neighborhoodList(i)).intersection(set(neighborhoodList(j)))  # intersection of the two neighborhoods
            if i in inter and j in inter:  # both points fall in each other's neighborhood
                directlyList.append({i,j})
            else: pass
    return directlyList
directlyList=directly_density_reachable(Core)  # list of directly density-reachable core-object pairs
print(directlyList)
[{1}, {1, 3}, {1, 12}, {1, 13}, {1, 19}, {2}, {16, 2}, {2, 18}, {2, 21}, {2, 30}, {1, 3}, {3}, {10, 3}, {11, 3}, {3, 12}, {3, 13}, {19, 3}, {5}, {5, 7}, {5, 15}, {24, 5}, {27, 5}, {6}, {8, 6}, {21, 6}, {6, 30}, {5, 7}, {7}, {15, 7}, {24, 7}, {27, 7}, {8, 6}, {8}, {8, 12}, {8, 16}, {8, 21}, {8, 30}, {10, 3}, {10}, {10, 11}, {10, 18}, {3, 11}, {10, 11}, {11}, {1, 12}, {3, 12}, {8, 12}, {12}, {12, 13}, {19, 12}, {12, 30}, {1, 13}, {3, 13}, {12, 13}, {13}, {19, 13}, {14}, {5, 15}, {7, 15}, {15}, {24, 15}, {27, 15}, {16, 2}, {16, 8}, {16}, {16, 21}, {16, 23}, {16, 30}, {18, 2}, {18, 10}, {18}, {1, 19}, {3, 19}, {19, 12}, {19, 13}, {19}, {2, 21}, {21, 6}, {8, 21}, {16, 21}, {21}, {21, 23}, {21, 30}, {16, 23}, {21, 23}, {23}, {24, 5}, {24, 7}, {24, 15}, {24}, {24, 27}, {27, 5}, {27, 7}, {27, 15}, {24, 27}, {27}, {2, 30}, {30, 6}, {8, 30}, {12, 30}, {16, 30}, {21, 30}, {30}]
A helper function that removes duplicate elements from a list:
def Drop_duplicates(element_list):  # Drop_duplicates: remove duplicate elements
    unique_list=[]  # holds the distinct elements
    for i in range(len(element_list)):  # scan the input list
        isUnique=True  # flag, initialized to True
        for j in range(len(unique_list)):  # scan the output list
            # If the element is already in the output list, clear the flag,
            # break out of the inner loop, and move on to the next input element
            if element_list[i] == unique_list[j]:
                isUnique=False
                break
        # If no duplicate was found, the flag is still True: append the element
        if isUnique==True:
            unique_list.append(element_list[i])
    return unique_list
Deduplicate the direct-reachability list:
# The deduplicated direct-reachability list
directlyList=Drop_duplicates(directlyList)
print(directlyList)
[{1}, {1, 3}, {1, 12}, {1, 13}, {1, 19}, {2}, {16, 2}, {2, 18}, {2, 21}, {2, 30}, {3}, {10, 3}, {11, 3}, {3, 12}, {3, 13}, {19, 3}, {5}, {5, 7}, {5, 15}, {24, 5}, {27, 5}, {6}, {8, 6}, {21, 6}, {6, 30}, {7}, {15, 7}, {24, 7}, {27, 7}, {8}, {8, 12}, {8, 16}, {8, 21}, {8, 30}, {10}, {10, 11}, {10, 18}, {11}, {12}, {12, 13}, {19, 12}, {12, 30}, {13}, {19, 13}, {14}, {15}, {24, 15}, {27, 15}, {16}, {16, 21}, {16, 23}, {16, 30}, {18}, {19}, {21}, {21, 23}, {21, 30}, {23}, {24}, {24, 27}, {27}, {30}]
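An equivalent shortcut for this deduplication step (a sketch): plain sets are unhashable, but frozensets are, so a list of sets can be deduplicated in one pass, at the cost of losing the original order:
# Hash each set as a frozenset to deduplicate in a single pass
unique_pairs = [set(fs) for fs in {frozenset(s) for s in directlyList}]
print(len(unique_pairs) == len(directlyList))  # True here, as directlyList is already deduplicated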
'''7. Find the density-connected (indirectly density-reachable) core objects'''
def density_reachable():
    reachableList=[]  # collects the sets of density-connected core objects
    for i in range(len(directlyList)):  # walk the (deduplicated) direct-reachability list
        pre = directlyList[i]  # current working set
        for j in range(len(directlyList)):
            if pre.intersection(directlyList[j])!=set():  # the two sets overlap
                pre = pre.union(directlyList[j])  # merge the overlapping set into the working set
            else:pass
            # After each update, re-scan the list elements before index j to
            # fold in any subsets that were missed earlier
            for k in range(j):
                if pre.intersection(directlyList[k])!=set():
                    pre=pre.union(directlyList[k])
        reachableList.append(pre)  # append the fully merged set
    reachableList=Drop_duplicates(reachableList)  # deduplicate
    return reachableList
reachableList=density_reachable()
reachableList
[{1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30},
{5, 7, 15, 24, 27},
{14}]
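The pairwise merging above solves what is essentially a connected-components problem. As a cross-check (a sketch, not part of the original post), a breadth-first search over an adjacency map built from the same pairs should reproduce the three groups:
from collections import defaultdict, deque
# Adjacency map: every member of a directly-reachable pair is linked to the others
adjacency = defaultdict(set)
for pair in directlyList:
    for a in pair:
        adjacency[a] |= pair
# Standard BFS collecting connected components
visited, components = set(), []
for start in adjacency:
    if start in visited:
        continue
    queue, comp = deque([start]), set()
    while queue:
        node = queue.popleft()
        if node in comp:
            continue
        comp.add(node)
        queue.extend(adjacency[node] - comp)
    visited |= comp
    components.append(comp)
print(components)  # expected to match reachableList up to ordering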
'''8. Assemble the clustering result: clusters + outliers'''
# ClustersList: the result list; each cluster is stored as a set
# PointClusterSet: the set of all points that belong to any cluster
# PerClusterSet: temporary set holding the points of one cluster
ClustersList=[]
PointClusterSet=set()
for i in range(len(reachableList)):  # 1. each element of reachableList is the set of core objects of one cluster
    PerClusterSet=set()  # fresh temporary set
    for j in list(reachableList[i]):  # 2. iterate over the core objects of this cluster
        for k in neighborhoodList(j):  # 3. iterate over each core object's neighborhood list
            PerClusterSet.add(k)  # 4. add every neighborhood point to the cluster
    # each iteration of the outer loop (i) yields one complete cluster, PerClusterSet
    PointClusterSet=PointClusterSet.union(PerClusterSet)  # 5. accumulate all clustered points, used later for the outliers
    ClustersList.append(PerClusterSet)  # 6. append this cluster to the result list
    print("Cluster",i,": ",PerClusterSet)
    print("Core Points of Cluster",i,": ",reachableList[i])
    BorderSet=PerClusterSet - reachableList[i]  # 7. border points: the cluster set minus its core objects
    if BorderSet != set():
        print("Border Points of Cluster",i,": ",BorderSet)
    else:
        print("Border Points of Cluster",i,": NULL")
    print('-'*40)
# Print the set of all clustered points
# print("All Points in Clusters: ", PointClusterSet)
print('-'*40)
# Compute the outliers:
# 1. build the set of all points
AllPoint=set()
for i in range(B.shape[0]):
    AllPoint.add(i)
# 2. subtract the clustered points from the set of all points
outlier=AllPoint-PointClusterSet
print("Outlier: ", outlier)
Cluster 0 :  {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Core Points of Cluster 0 :  {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Border Points of Cluster 0 : NULL
----------------------------------------
Cluster 1 :  {5, 7, 15, 20, 24, 27}
Core Points of Cluster 1 :  {5, 7, 24, 27, 15}
Border Points of Cluster 1 :  {20}
----------------------------------------
Cluster 2 :  {28, 29, 14}
Core Points of Cluster 2 :  {14}
Border Points of Cluster 2 :  {28, 29}
----------------------------------------
----------------------------------------
Outlier:  {0, 4, 9, 17, 22, 25, 26}
# The list of clusters
ClustersList
[{1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30},
{5, 7, 15, 20, 24, 27},
{14, 28, 29}]
print("密度聚类后类的数目:",len(ClustersList))
密度聚类后类的数目: 3
'''9. Visualize the DBSCAN clustering result'''
import matplotlib.pyplot as plt
# Color scheme
colors = {'Cluster1': 'red', 'Cluster2': 'green', 'Cluster3': 'blue', 'Outliers': 'purple'}
# Draw the scatter plot
for i in range(B.shape[0]):
    if i in ClustersList[0]:
        plt.scatter(B[i, 0], B[i, 1], color='red', marker='o')
    elif i in ClustersList[1]:
        plt.scatter(B[i, 0], B[i, 1], color='green', marker='o')
    elif i in ClustersList[2]:
        plt.scatter(B[i, 0], B[i, 1], color='blue', marker='o')
    else:
        plt.scatter(B[i, 0], B[i, 1], color='purple', marker='o')
    # Label each point with its index
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize=10,color='black')
# Build the legend handles
legend_handles = []
for label, color in colors.items():
    legend_handles.append(plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=10))
# Attach the legend
plt.legend(legend_handles, colors.keys())
# Axis labels
plt.xlabel('X')
plt.ylabel('Y')
# Title
plt.title('DBSCAN Cluster Visualization')
plt.show()
Part 3. Density clustering of the dataset using sklearn library functions
'''1. Train a clustering model on the data (array B) with sklearn's DBSCAN class'''
from sklearn.cluster import DBSCAN
# Run DBSCAN with r=0.8, MinPts=3
db = DBSCAN(eps=0.8, min_samples=3)
db.fit(B)
DBSCAN(eps=0.8, min_samples=3)
'''2. Evaluate the clustering result (silhouette coefficient)'''
from sklearn.metrics import silhouette_score
# Get the cluster labels (-1 marks the outliers); labels is an array
labels = db.labels_
# Compute the silhouette coefficient
score = silhouette_score(B, labels)
print('Silhouette coefficient:', score)
Silhouette coefficient: 0.37453991956763705
Interpreting the silhouette coefficient:
as score(i) approaches -1, sample i would be better assigned to a different cluster;
as score(i) approaches 0, sample i lies on the boundary between two clusters;
as score(i) approaches 1, sample i is clustered reasonably.
A silhouette coefficient of 0.37 is not a particularly strong result, but to allow a comparison with the hand-written implementation we keep the (0.8, 3) parameter combination.
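To see which individual samples are borderline, one can also inspect per-sample values; a sketch using sklearn's silhouette_samples (not part of the original post):
from sklearn.metrics import silhouette_samples
import numpy as np
# Per-sample silhouette values score(i), on the same [-1, 1] scale as above
sample_scores = silhouette_samples(B, labels)
# Indices of the weakest samples, e.g. those with score(i) below 0
print(np.where(sample_scores < 0)[0])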
The choice of eps can be refined with a k-nearest-neighbor distance plot, by locating the elbow of the curve:
'''* Use NearestNeighbors to compute and plot each sample's distance to its 2nd nearest neighbor'''
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
neigh = NearestNeighbors(n_neighbors=2)  # nearest-neighbor searcher with n_neighbors=2
nbrs = neigh.fit(B)  # fit the searcher on the data B
distances, indices = nbrs.kneighbors(B)  # distances from each sample to its 2 nearest neighbors
distances = np.sort(distances, axis=0)  # sort the distances in ascending order
distances = distances[:, 1]  # keep the distance to the 2nd nearest neighbor (the 1st is the point itself)
# Plot the distance curve
plt.plot(distances)
plt.title('Distance to 2nd Nearest Neighbor')  # plot title
plt.show()
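A common convention (an assumption here, not something the original post states) is to plot the distance to the MinPts-th nearest neighbor when tuning eps for a given MinPts; a sketch with the minPts = 3 set earlier:
# Query minPts + 1 neighbors, because the nearest 'neighbor' of a training
# point is the point itself at distance 0
neigh_k = NearestNeighbors(n_neighbors=minPts + 1)
dist_k, _ = neigh_k.fit(B).kneighbors(B)
k_distances = np.sort(dist_k[:, minPts])  # ascending k-distance curve
plt.plot(k_distances)
plt.title('Distance to MinPts-th Nearest Neighbor')
plt.show()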
# Print the cluster-label list
print(labels)
[-1 0 0 0 -1 1 0 1 0 -1 0 0 0 0 2 1 0 -1 0 0 1 0 -1 0
1 -1 -1 1 2 2 0]
# Print the core-point index list and its length
print(db.core_sample_indices_, '\n','Number of core objects:',db.core_sample_indices_.shape[0])
[ 1 2 3 5 6 7 8 10 11 12 13 14 15 16 18 19 21 23 24 27 30]
Number of core objects: 21
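As a quick consistency check (not in the original post), sklearn's core indices can be compared directly with the hand-written Core list from Part 2:
# Both runs should identify the same 21 core objects
print(list(db.core_sample_indices_) == Core)  # expected: True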
'''3. Display the clustering result'''
core = db.core_sample_indices_  # keep the core-object indices in the variable core
# Iterate over the cluster labels
for i in range(len(list(set(labels)))-1):  # the -1 drops the outlier label (-1)
    list1=[]  # temporarily holds the points of the current cluster
    # Check each point's label against the current cluster label
    for j in range(B.shape[0]):
        if labels[j] == i:
            list1.append(j)
        else: pass
    print("Cluster ",i,": ",set(list1))
    list2=[]  # the core objects of this cluster
    list3=[]  # the border points of this cluster
    for k in list1:
        if k in core:list2.append(k)
        else:list3.append(k)
    print("Core Point from Cluster",i,": ",set(list2))
    if list3!=[]:
        print("Border Point from Cluster",i,": ",set(list3))
    else:
        print("Border Point from Cluster",i,": NULL")
    print('-'*40)
list4=[]  # collects the outliers
for k in range(len(labels)):
    if labels[k]== -1: list4.append(k)  # points labeled -1 are outliers
print("Outlier: ",set(list4))
Cluster 0 : {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Core Point from Cluster 0 : {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Border Point from Cluster 0 : NULL
----------------------------------------
Cluster 1 : {5, 7, 15, 20, 24, 27}
Core Point from Cluster 1 : {5, 7, 15, 24, 27}
Border Point from Cluster 1 : {20}
----------------------------------------
Cluster 2 : {28, 29, 14}
Core Point from Cluster 2 : {14}
Border Point from Cluster 2 : {28, 29}
----------------------------------------
Outlier: {0, 4, 9, 17, 22, 25, 26}
'''4. Visualize the clustering result'''
for i in range(B.shape[0]):
    if labels[i] == 0:
        c1 = plt.scatter(B[i,0], B[i,1], c = 'r', marker='+')
    elif labels[i] == 1:
        c2 = plt.scatter(B[i,0], B[i,1], c = 'g', marker='o')
    elif labels[i] == 2:
        c3 = plt.scatter(B[i,0], B[i,1], c = 'b', marker='*')
    elif labels[i] == -1:
        c4 = plt.scatter(B[i,0], B[i,1], c = 'purple', marker='^')
    # Label each point with its index
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize=10,color='black')
plt.legend([c1,c2,c3,c4], ['Cluster 1','Cluster 2','Cluster 3','Outlier'])
plt.title('DBSCAN Clustering Results based on sklearn')
plt.show()
Comparing the results of the two clustering runs above:
with the same parameter combination, the hand-written implementation and the sklearn-based implementation produce identical density-clustering results, which confirms that the hand-written algorithm is sound.
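That agreement can also be verified programmatically; a sketch (not in the original post) comparing the two partitions as unordered collections of clusters:
# Rebuild the cluster sets from the sklearn labels (ignoring the -1 noise label)
# and compare them with the hand-written ClustersList
sk_clusters = [set(np.where(labels == c)[0]) for c in set(labels) if c != -1]
print(set(map(frozenset, sk_clusters)) == set(map(frozenset, ClustersList)))  # expected: True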