DBSCAN Clustering Algorithm: Fundamentals of Density-Based Clustering and a Python Implementation

One. Fundamentals of Density-Based Clustering (DBSCAN)

Handwritten notes follow.
(For the concepts: the palest ink is better than the best memory!
For the algorithmic thinking: the principles are best mastered by deriving them!)
[Figures: four pages of handwritten notes on the fundamentals of DBSCAN]
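In brief, the key concepts the notes cover are:

- eps (the neighborhood radius r): every point within distance eps of a point p belongs to p's eps-neighborhood;
- MinPts: the minimum number of points an eps-neighborhood must contain for its center to be a core point;
- core, border, and noise points: a core point has at least MinPts points in its eps-neighborhood; a border point lies inside some core point's neighborhood without being core itself; a noise point (outlier) is neither;
- directly density-reachable, density-reachable, and density-connected: the relations that chain core points together; DBSCAN grows each cluster by following these relations outward from its core points.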

Two. A Worked Example: Density-Based Cluster Analysis of a Hostel Dataset in Python

Dataset: HostelDataset.xls

Source: https://www.kaggle.com/datasets/lexipetzold/hostels-in-guatemala-hostelworld-data

Note: the dataset used below is a modified version of the original.

Dataset description

Field: Description
Hostel Name: name of the hostel
Hostel_ID: hostel ID
DESTINATION: destination
Starting_Price: starting price
Free_Breakfast: free breakfast (YES/NO)
Wifi: Wi-Fi available (YES/NO)
Rating (out of 10): guest rating, out of 10
Total Ratings: total number of ratings
Traffic situation: traffic situation (graded A/B/C)
1. Reading the Dataset and Data Preprocessing
'''1. Reading the dataset'''
import pandas as pd

# Path to the data file
file_path='D:/MachineLearningDesign/TotalDataset/Guatemalas_Travel/HostelDataset.xls'

# Read the dataset with pandas
origin_data = pd.read_excel(file_path)

# Display the dataset for inspection
origin_data
   | Hostel Name | Hostel_ID | DESTINATION | Starting_Price | Free_Breakfast | Wifi | Rating (out of 10) | Total Ratings | Traffic situation
0  | Maya Papaya | 1 | Antigua | 15.00 | YES | YES | 9.5 | 897 | A
1  | Tropicana Hostel | 2 | Antigua | 11.94 | NO | NO | 9.0 | 1411 | B
2  | Somos | 3 | Antigua | 9.94 | NO | YES | 8.5 | 283 | A
3  | Selina Antigua | 4 | Antigua | 12.00 | NO | NO | 9.7 | 371 | A
4  | Ojala | 5 | Antigua | 15.00 | YES | YES | 7.2 | 172 | B
5  | Amura | 6 | Antigua | 8.00 | NO | YES | 7.2 | 39 | C
6  | Hostal Antigua | 7 | Antigua | 9.11 | NO | NO | 8.1 | 325 | A
7  | Hostel Los Tecalotes | 8 | Antigua | 8.61 | NO | YES | 7.0 | 2 | C
8  | Hostel Antigueno | 9 | Antigua | 9.35 | NO | NO | 8.7 | 162 | A
9  | Adra Hostel | 10 | Antigua | 16.97 | YES | YES | 9.0 | 95 | C
10 | Barbara's Boutique Hostel | 11 | Antigua | 10.34 | NO | YES | 9.6 | 461 | A
11 | The Purpose Hostel | 12 | Antigua | 12.58 | No | NO | 9.6 | 536 | A
12 | La Iguana Perdida | 13 | Lake Atitlan | 6.85 | NO | YES | 9.1 | 909 | A
13 | Free Cerveza | 14 | Lake Atitlan | 5.84 | NO | YES | 9.0 | 1293 | A
14 | Selina Atitlan | 15 | Lake Atitlan | 11.00 | NO | YES | 8.9 | 193 | B
15 | Hostel San Marcos | 16 | Lake Atitlan | 8.00 | YES | YES | 6.5 | 60 | B
16 | Eco-Hostel Mayachik | 17 | Lake Atitlan | 8.00 | NO | NO | 9.4 | 45 | B
17 | Maca Hostel and Micro-Resort | 18 | Lake Atitlan | 27.50 | NO | YES | 9.6 | 2 | B
18 | Mandala’s Hostal | 19 | Lake Atitlan | 9.94 | NO | YES | 9.5 | 94 | A
19 | Mr. Mullet's | 20 | Lake Atitlan | 11.26 | NO | NO | 8.8 | 864 | A
20 | Hotel Amigos | 21 | Lake Atitlan | 9.25 | NO | YES | 7.7 | 21 | B
21 | Tequila Sunrise | 22 | Guatemala City | 5.50 | YES | NO | 8.4 | 231 | A
22 | Hostal Guatefriends | 23 | Guatemala City | 18.00 | YES | YES | 9.6 | 252 | B
23 | Nostalgic Hostel | 24 | Guatemala City | 8.61 | YES | NO | 7.8 | 39 | A
24 | Kaena Point Hostel | 25 | Guatemala City | 6.18 | NO | YES | 7.4 | 54 | C
25 | Hostal Los Lagos | 26 | Guatemala City | 15.00 | YES | YES | 9.2 | 189 | C
26 | Euro Hostel | 27 | Guatemala City | 17.50 | YES | YES | 10.0 | 183 | A
27 | Life Builders | 28 | Guatemala City | 9.94 | YES | NO | 7.6 | 0 | C
28 | Driftwood Surfer beach hostel El Paredon Guate... | 29 | El Paredon | 9.97 | NO | YES | 8.8 | 435 | C
29 | Mellow Hostel El Paredon | 30 | El Paredon | 13.41 | NO | YES | 9.3 | 48 | B
30 | Cocorí Lodge El Paredon | 31 | El Paredon | 11.50 | NO | NO | 8.5 | 100 | A
'''2. Converting non-numeric features to numeric values and filling missing values'''
from sklearn.preprocessing import LabelEncoder # label encoding: converts categorical data to numeric data

# Work on a copy so origin_data is left unmodified
Numlize_data = origin_data.copy()

# Names of the non-numeric (categorical) columns
non_numeric_columns = ['DESTINATION', 'Free_Breakfast','Wifi','Traffic situation']

# Encode the non-numeric columns
label_encoders = {}  # stores the LabelEncoder of each feature

# For each non-numeric column in the dataset:
for column in non_numeric_columns:

    # Check whether the column's dtype is object (string)
    if Numlize_data[column].dtype == 'object':

        # If so, create a LabelEncoder for this column and keep it in the
        # label_encoders dict for later use
        label_encoders[column] = LabelEncoder()

        # fit_transform() fits the encoder and converts the column's string
        # values to integers, writing the result back into the current column
        Numlize_data[column] = label_encoders[column].fit_transform(Numlize_data[column])

# Fill missing values in the numeric columns with the column mean
numeric_columns = Numlize_data.select_dtypes(include=['number']).columns
Numlize_data[numeric_columns] = Numlize_data[numeric_columns].fillna(Numlize_data[numeric_columns].mean())

# Display the encoded result
Numlize_data

   | Hostel Name | Hostel_ID | DESTINATION | Starting_Price | Free_Breakfast | Wifi | Rating (out of 10) | Total Ratings | Traffic situation
0  | Maya Papaya | 1 | 0 | 15.00 | 2 | 1 | 9.5 | 897 | 0
1  | Tropicana Hostel | 2 | 0 | 11.94 | 0 | 0 | 9.0 | 1411 | 1
2  | Somos | 3 | 0 | 9.94 | 0 | 1 | 8.5 | 283 | 0
3  | Selina Antigua | 4 | 0 | 12.00 | 0 | 0 | 9.7 | 371 | 0
4  | Ojala | 5 | 0 | 15.00 | 2 | 1 | 7.2 | 172 | 1
5  | Amura | 6 | 0 | 8.00 | 0 | 1 | 7.2 | 39 | 2
6  | Hostal Antigua | 7 | 0 | 9.11 | 0 | 0 | 8.1 | 325 | 0
7  | Hostel Los Tecalotes | 8 | 0 | 8.61 | 0 | 1 | 7.0 | 2 | 2
8  | Hostel Antigueno | 9 | 0 | 9.35 | 0 | 0 | 8.7 | 162 | 0
9  | Adra Hostel | 10 | 0 | 16.97 | 2 | 1 | 9.0 | 95 | 2
10 | Barbara's Boutique Hostel | 11 | 0 | 10.34 | 0 | 1 | 9.6 | 461 | 0
11 | The Purpose Hostel | 12 | 0 | 12.58 | 1 | 0 | 9.6 | 536 | 0
12 | La Iguana Perdida | 13 | 3 | 6.85 | 0 | 1 | 9.1 | 909 | 0
13 | Free Cerveza | 14 | 3 | 5.84 | 0 | 1 | 9.0 | 1293 | 0
14 | Selina Atitlan | 15 | 3 | 11.00 | 0 | 1 | 8.9 | 193 | 1
15 | Hostel San Marcos | 16 | 3 | 8.00 | 2 | 1 | 6.5 | 60 | 1
16 | Eco-Hostel Mayachik | 17 | 3 | 8.00 | 0 | 0 | 9.4 | 45 | 1
17 | Maca Hostel and Micro-Resort | 18 | 3 | 27.50 | 0 | 1 | 9.6 | 2 | 1
18 | Mandala’s Hostal | 19 | 3 | 9.94 | 0 | 1 | 9.5 | 94 | 0
19 | Mr. Mullet's | 20 | 3 | 11.26 | 0 | 0 | 8.8 | 864 | 0
20 | Hotel Amigos | 21 | 3 | 9.25 | 0 | 1 | 7.7 | 21 | 1
21 | Tequila Sunrise | 22 | 2 | 5.50 | 2 | 0 | 8.4 | 231 | 0
22 | Hostal Guatefriends | 23 | 2 | 18.00 | 2 | 1 | 9.6 | 252 | 1
23 | Nostalgic Hostel | 24 | 2 | 8.61 | 2 | 0 | 7.8 | 39 | 0
24 | Kaena Point Hostel | 25 | 2 | 6.18 | 0 | 1 | 7.4 | 54 | 2
25 | Hostal Los Lagos | 26 | 2 | 15.00 | 2 | 1 | 9.2 | 189 | 2
26 | Euro Hostel | 27 | 2 | 17.50 | 2 | 1 | 10.0 | 183 | 0
27 | Life Builders | 28 | 2 | 9.94 | 2 | 0 | 7.6 | 0 | 2
28 | Driftwood Surfer beach hostel El Paredon Guate... | 29 | 1 | 9.97 | 0 | 1 | 8.8 | 435 | 2
29 | Mellow Hostel El Paredon | 30 | 1 | 13.41 | 0 | 1 | 9.3 | 48 | 1
30 | Cocorí Lodge El Paredon | 31 | 1 | 11.50 | 0 | 0 | 8.5 | 100 | 0
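The mapping each encoder learned can be inspected after the fact (a diagnostic sketch, not part of the original run). Note in passing that the stray lowercase 'No' in row 11 becomes its own category (1), distinct from 'NO' (0) and 'YES' (2):

```python
# Print each column's category -> integer mapping; LabelEncoder assigns
# integers in sorted order of the distinct values it saw.
for col, le in label_encoders.items():
    print(col, dict(zip(le.classes_, range(len(le.classes_)))))
```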
'''3. Standardization'''
from sklearn.preprocessing import StandardScaler

# Take the feature values to be standardized, starting from the third column
scale_data=Numlize_data.iloc[:,2:].values

# Initialize a standardization transformer
transfer0 = StandardScaler()

scale_data=transfer0.fit_transform(scale_data)
print(scale_data,scale_data.shape)
[[-1.1226828   0.82538383  1.42313078  0.74161985  0.92288087  1.56308313
  -0.92519568]
 [-1.1226828   0.13194738 -0.72892064 -1.34839972  0.3789777   2.94361451
   0.32180719]
 [-1.1226828  -0.32127907 -0.72892064  0.74161985 -0.16492548 -0.08603412
  -0.92519568]
 [-1.1226828   0.14554417 -0.72892064 -1.34839972  1.14044214  0.15032145
  -0.92519568]
 [-1.1226828   0.82538383  1.42313078  0.74161985 -1.57907374 -0.38416444
   0.32180719]
 [-1.1226828  -0.76090872 -0.72892064  0.74161985 -1.57907374 -0.74138365
   1.56881007]
 [-1.1226828  -0.50936804 -0.72892064 -1.34839972 -0.60004802  0.02677195
  -0.92519568]
 [-1.1226828  -0.62267465 -0.72892064  0.74161985 -1.79663501 -0.84076042
   1.56881007]
 [-1.1226828  -0.45498087 -0.72892064 -1.34839972  0.05263579 -0.41102302
  -0.92519568]
 [-1.1226828   1.27181188  1.42313078  0.74161985  0.3789777  -0.59097556
   1.56881007]
 [-1.1226828  -0.23063378 -0.72892064  0.74161985  1.03166151  0.39204873
  -0.92519568]
 [-1.1226828   0.27697984  0.34710507 -1.34839972  1.03166151  0.59348814
  -0.92519568]
 [ 1.25026039 -1.02151392 -0.72892064  0.74161985  0.48775833  1.59531344
  -0.92519568]
 [ 1.25026039 -1.25039327 -0.72892064  0.74161985  0.3789777   2.62668318
  -0.92519568]
 [ 1.25026039 -0.08106905 -0.72892064  0.74161985  0.27019706 -0.3277614
   0.32180719]
 [ 1.25026039 -0.76090872  1.42313078  0.74161985 -2.34053819 -0.68498061
   0.32180719]
 [ 1.25026039 -0.76090872 -0.72892064 -1.34839972  0.81410024 -0.72526849
   0.32180719]
 [ 1.25026039  3.65804909 -0.72892064  0.74161985  1.03166151 -0.84076042
   0.32180719]
 [ 1.25026039 -0.32127907 -0.72892064  0.74161985  0.92288087 -0.59366142
  -0.92519568]
 [ 1.25026039 -0.02214961 -0.72892064 -1.34839972  0.16141643  1.47444979
  -0.92519568]
 [ 1.25026039 -0.47764219 -0.72892064  0.74161985 -1.03517056 -0.7897291
   0.32180719]
 [ 0.45927933 -1.32744177  1.42313078 -1.34839972 -0.27370611 -0.22569877
  -0.92519568]
 [ 0.45927933  1.50522349  1.42313078  0.74161985  1.03166151 -0.16929574
   0.32180719]
 [ 0.45927933 -0.62267465  1.42313078 -1.34839972 -0.92638993 -0.74138365
  -0.92519568]
 [ 0.45927933 -1.17334478 -0.72892064  0.74161985 -1.36151247 -0.70109576
   1.56881007]
 [ 0.45927933  0.82538383  1.42313078  0.74161985  0.59653897 -0.33850484
   1.56881007]
 [ 0.45927933  1.39191688  1.42313078  0.74161985  1.46678405 -0.35461999
  -0.92519568]
 [ 0.45927933 -0.32127907  1.42313078 -1.34839972 -1.1439512  -0.84613214
   1.56881007]
 [-0.33170174 -0.31448067 -0.72892064  0.74161985  0.16141643  0.32221641
   1.56881007]
 [-0.33170174  0.46506881 -0.72892064  0.74161985  0.7053196  -0.71721092
   0.32180719]
 [-0.33170174  0.03223756 -0.72892064 -1.34839972 -0.16492548 -0.57754626
  -0.92519568]] (31, 7)
Note:
As the standardized array shows, the data to be clustered has seven feature dimensions. To make the clustering results easier to visualize, we next reduce it to two dimensions with PCA.
'''4. Dimensionality reduction'''
from sklearn.decomposition import PCA
transfer1 = PCA(n_components = 2) # reduce to a two-dimensional space for easier visualization

pca_data = transfer1.fit_transform(scale_data)

pca_data
array([[-1.20875831,  1.62723311],
       [-2.17954007, -0.49999884],
       [-0.41892282, -0.35878887],
       [-1.8327227 , -0.03959969],
       [ 1.69512135,  0.41250462],
       [ 2.01954599, -1.34181723],
       [-0.9805543 , -1.41593127],
       [ 2.17906043, -1.35095566],
       [-1.06041339, -1.0250797 ],
       [ 1.62931985,  1.78510981],
       [-1.22269447,  0.31381418],
       [-1.74506824,  0.3039498 ],
       [-1.49489319, -0.46815072],
       [-1.98313424, -0.72149184],
       [ 0.36494895,  0.18653765],
       [ 2.26083643, -1.01827662],
       [-0.41744797, -0.70656468],
       [ 0.4389093 ,  3.32221946],
       [-0.54500345,  0.33179379],
       [-1.93418732, -0.59639915],
       [ 1.1977873 , -0.76596774],
       [-0.44229712, -1.12713865],
       [ 0.47138872,  2.33007222],
       [ 0.16094183, -0.93674703],
       [ 1.96129619, -1.4563731 ],
       [ 1.46069336,  1.6395852 ],
       [-0.3718697 ,  2.45268164],
       [ 1.78780896, -0.77514925],
       [ 0.7174211 , -0.10764917],
       [ 0.30145571,  0.74718192],
       [-0.80902817, -0.74060422]])
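Before clustering, it is worth checking how much of the seven-dimensional variance the two components retain (a diagnostic sketch, not part of the original run):

```python
# Variance explained by each principal component, and their total.
print(transfer1.explained_variance_ratio_)
print('Total variance retained:', transfer1.explained_variance_ratio_.sum())
```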
'''5. Visualizing the reduced data points, with adjusted axis ranges and index labels'''
import matplotlib.pyplot as plt
B = pca_data  # assign the PCA output to B; all later steps operate on B

# Plot the points row by row
for i in range(B.shape[0]):
# B.shape[0] is the number of rows of B; B.shape[1] is the number of columns.

    # B[i, 0], B[i, 1] are the x and y coordinates of the i-th data point
    plt.plot(B[i, 0], B[i, 1], 'bo')  # 'bo': blue circular markers
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize=10,color='g') # label each point with its index

plt.title('Data Sample Visualization')

plt.show()

[Figure: scatter plot of the 31 PCA-reduced data samples with index labels]

# Check B's data type: numpy.ndarray
type(B)
numpy.ndarray
2. Density Clustering Without Calling sklearn Library Functions

(except for selecting the optimal parameters)

'''1. Choosing the optimal parameters (eps, MinPts)'''
from sklearn.cluster import DBSCAN
import numpy as np
# Empty list to collect the results of each parameter combination
res = []
# Iterate over eps values: from 0.001 up to 1 in steps of 0.05
for eps in np.arange(0.001,1,0.05):
    # Iterate over min_samples values: from 2 to 9
    for min_samples in range(2,10):
        dbscan = DBSCAN(eps = eps, min_samples = min_samples)

        # Fit the model
        dbscan.fit(B)

        # Number of clusters for this combination (-1 marks outliers)
        n_clusters = len([i for i in set(dbscan.labels_) if i != -1])

        # Number of outliers
        outliners = np.sum(np.where(dbscan.labels_ == -1, 1,0))

        # Sample count of each cluster
        stats = str(pd.Series([i for i in dbscan.labels_ if i != -1]).value_counts().values)
        res.append({'eps':eps,'min_samples':min_samples,'n_clusters':n_clusters,'outliners':outliners,'stats':stats})
# Store the collected results in a data frame
df = pd.DataFrame(res)

# Filter for reasonable combinations: keep those with exactly 3 clusters
df.loc[df.n_clusters == 3, :]
    | eps   | min_samples | n_clusters | outliners | stats
40  | 0.251 | 2 | 3 | 24 | [3 2 2]
48  | 0.301 | 2 | 3 | 23 | [3 3 2]
65  | 0.401 | 3 | 3 | 19 | [5 4 3]
73  | 0.451 | 3 | 3 | 18 | [6 4 3]
81  | 0.501 | 3 | 3 | 17 | [6 4 4]
82  | 0.501 | 4 | 3 | 19 | [4 4 4]
89  | 0.551 | 3 | 3 | 13 | [7 6 5]
90  | 0.551 | 4 | 3 | 16 | [5 5 5]
98  | 0.601 | 4 | 3 | 15 | [6 5 5]
99  | 0.601 | 5 | 3 | 15 | [6 5 5]
106 | 0.651 | 4 | 3 | 12 | [7 6 6]
107 | 0.651 | 5 | 3 | 13 | [7 6 5]
114 | 0.701 | 4 | 3 | 12 | [7 6 6]
115 | 0.701 | 5 | 3 | 12 | [7 6 6]
121 | 0.751 | 3 | 3 | 7 | [15 6 3]
129 | 0.801 | 3 | 3 | 7 | [15 6 3]
136 | 0.851 | 2 | 3 | 5 | [15 9 2]
152 | 0.951 | 2 | 3 | 3 | [24 2 2]
Note:
As a rule of thumb, choosing a parameter combination that leaves few outliers tends to improve the density clustering result; the sketch below ranks the candidates accordingly.
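The filtered table can also be ranked programmatically to surface the combinations with the fewest outliers (a small sketch on top of the grid search above; 'outliners' is the column name used in the code):

```python
# Three-cluster combinations, fewest outliers first.
print(df.loc[df.n_clusters == 3].sort_values('outliners').head())
```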
'''2. Setting a suitable parameter combination'''
# Based on the test results above, we choose the combination eps = 0.8, MinPts = 3
r = 0.8
minPts = 3
'''3. Finding the data points inside each point's neighborhood'''
# Function returning the points inside a point's eps-neighborhood
def neighborhoodList(key): # key is the index of the point
    neighborList = []      # collects the points in key's neighborhood
    for i in range(B.shape[0]): # iterate over all rows of B
        if(np.linalg.norm(B[i,] - B[key,]) <= r): # is the distance from point i to point key <= r?
            neighborList.append(i)
        else: pass

    return neighborList  # return the neighborhood list

# Sanity check: the neighborhood of the point with index 2:
neighborhoodList(2)

[2, 16, 18, 21, 30]
B.shape[0]  # sample size of the dataset == number of rows
31
# Print the neighborhood list of every sample point
for i in range(B.shape[0]):
    print(neighborhoodList(i),end=' ')
[0] [1, 3, 12, 13, 19] [2, 16, 18, 21, 30] [1, 3, 10, 11, 12, 13, 19] [4] [5, 7, 15, 24, 27] [6, 8, 21, 30] [5, 7, 15, 24, 27] [6, 8, 12, 16, 21, 30] [9, 25] [3, 10, 11, 18] [3, 10, 11] [1, 3, 8, 12, 13, 19, 30] [1, 3, 12, 13, 19] [14, 28, 29] [5, 7, 15, 24, 27] [2, 8, 16, 21, 23, 30] [17] [2, 10, 18] [1, 3, 12, 13, 19] [20, 27] [2, 6, 8, 16, 21, 23, 30] [22] [16, 21, 23] [5, 7, 15, 24, 27] [9, 25] [26] [5, 7, 15, 20, 24, 27] [14, 28] [14, 29] [2, 6, 8, 12, 16, 21, 30] 
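For reference, the per-point loop in neighborhoodList can also be written in vectorized form with numpy broadcasting (an equivalent sketch, not the original code):

```python
# Compute the distances from point `key` to all points at once.
def neighborhood_vectorized(key):
    dists = np.linalg.norm(B - B[key], axis=1)  # distance to every point
    return np.where(dists <= r)[0].tolist()

print(neighborhood_vectorized(2))  # expected: [2, 16, 18, 21, 30]
```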
'''4. Finding the core objects (points whose neighborhood meets the MinPts requirement)'''
def FindCore(binary_array): # takes the 2-D array of data points
    Core=[]  # collects the core objects
    for i in range(binary_array.shape[0]): # iterate over all data points
        if len(neighborhoodList(i))>=minPts: # if the neighborhood list is long enough
            Core.append(i) # mark the point as a core object and append it
        else:
            pass

    return Core # return the list of core objects

Core=FindCore(B)
print(Core,'\n',"Number of core objects:",len(Core))
[1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 18, 19, 21, 23, 24, 27, 30] 
 Number of core objects: 21

Next come the operations on the core objects:

# Build the list of neighborhood sets of the core objects
SetList=[]
for i in Core:
    elem=set(neighborhoodList(i)) # store each core object's neighborhood as a set
    SetList.append(elem)

print(SetList)
[{1, 3, 12, 13, 19}, {2, 16, 18, 21, 30}, {1, 3, 10, 11, 12, 13, 19}, {5, 7, 15, 24, 27}, {8, 21, 6, 30}, {5, 7, 15, 24, 27}, {6, 8, 12, 16, 21, 30}, {11, 18, 10, 3}, {11, 10, 3}, {1, 3, 8, 12, 13, 19, 30}, {1, 3, 12, 13, 19}, {28, 29, 14}, {5, 7, 15, 24, 27}, {2, 8, 16, 21, 23, 30}, {18, 2, 10}, {1, 3, 12, 13, 19}, {2, 6, 8, 16, 21, 23, 30}, {16, 21, 23}, {5, 7, 15, 24, 27}, {5, 7, 15, 20, 24, 27}, {2, 6, 8, 12, 16, 21, 30}]
'''5. Visualizing the core objects (marked in red, with their neighborhood circles)'''
for i in range(B.shape[0]):
    if i in Core:
        plt.plot(B[i, 0], B[i, 1], 'ro')  # 'ro': red circular markers for core objects
    else:
        plt.plot(B[i, 0], B[i, 1], 'bo')  # 'bo': blue circular markers for the rest

    # label each point with its index
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize = 8,color='g')

# Draw the neighborhood circle of each core object
a = np.arange(0, 2*np.pi, 0.01)
for i in Core:
    plt.plot(r*np.cos(a)+B[i,0], r*np.sin(a)+B[i,1], 'y--')

plt.title('Core Point Visualization')
plt.show()

[Figure: core points in red, other points in blue, with dashed eps-neighborhood circles around each core point]

'''6. Finding direct density-reachability: do two core points lie in each other's neighborhoods?'''
def directly_density_reachable(CoreList):
    directlyList=[]
    for i in CoreList:
        for j in CoreList:
            inter = set(neighborhoodList(i)).intersection(set(neighborhoodList(j))) # intersection of the two neighborhoods
            if i in inter and j in inter:
                directlyList.append({i,j})
            else: pass

    return directlyList

directlyList=directly_density_reachable(Core) # pairs of directly density-reachable core objects

print(directlyList)

[{1}, {1, 3}, {1, 12}, {1, 13}, {1, 19}, {2}, {16, 2}, {2, 18}, {2, 21}, {2, 30}, {1, 3}, {3}, {10, 3}, {11, 3}, {3, 12}, {3, 13}, {19, 3}, {5}, {5, 7}, {5, 15}, {24, 5}, {27, 5}, {6}, {8, 6}, {21, 6}, {6, 30}, {5, 7}, {7}, {15, 7}, {24, 7}, {27, 7}, {8, 6}, {8}, {8, 12}, {8, 16}, {8, 21}, {8, 30}, {10, 3}, {10}, {10, 11}, {10, 18}, {3, 11}, {10, 11}, {11}, {1, 12}, {3, 12}, {8, 12}, {12}, {12, 13}, {19, 12}, {12, 30}, {1, 13}, {3, 13}, {12, 13}, {13}, {19, 13}, {14}, {5, 15}, {7, 15}, {15}, {24, 15}, {27, 15}, {16, 2}, {16, 8}, {16}, {16, 21}, {16, 23}, {16, 30}, {18, 2}, {18, 10}, {18}, {1, 19}, {3, 19}, {19, 12}, {19, 13}, {19}, {2, 21}, {21, 6}, {8, 21}, {16, 21}, {21}, {21, 23}, {21, 30}, {16, 23}, {21, 23}, {23}, {24, 5}, {24, 7}, {24, 15}, {24}, {24, 27}, {27, 5}, {27, 7}, {27, 15}, {24, 27}, {27}, {2, 30}, {30, 6}, {8, 30}, {12, 30}, {16, 30}, {21, 30}, {30}]

A helper function that removes duplicate elements from a list (Python sets are unhashable, so set() cannot deduplicate a list of sets directly; an alternative sketch follows the function):

def Drop_duplicates(binary_array):  # removes duplicate elements

    unique_list=[]   # holds the unique elements

    for i in range(len(binary_array)):  # iterate over the input list

        isUnique=True # initialize the flag to True

        for j in range(len(unique_list)): # iterate over the new list
            # if the element already exists in the new list, set the flag to
            # False, leave the inner loop, and move on to the next element
            if binary_array[i] == unique_list[j]:
                isUnique=False
                break

        # if no duplicate was found, the flag is still True: append the element
        if isUnique==True:
            unique_list.append(binary_array[i])

    return unique_list
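An equivalent one-liner for the same job (a sketch, not the original code):

```python
# Frozensets are hashable, so dict.fromkeys can drop duplicates
# while preserving insertion order.
def drop_duplicates_frozen(set_list):
    return [set(fs) for fs in dict.fromkeys(map(frozenset, set_list))]
```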

Deduplicating the direct density-reachability list:

# The deduplicated direct density-reachability list
directlyList=Drop_duplicates(directlyList)
print(directlyList)
[{1}, {1, 3}, {1, 12}, {1, 13}, {1, 19}, {2}, {16, 2}, {2, 18}, {2, 21}, {2, 30}, {3}, {10, 3}, {11, 3}, {3, 12}, {3, 13}, {19, 3}, {5}, {5, 7}, {5, 15}, {24, 5}, {27, 5}, {6}, {8, 6}, {21, 6}, {6, 30}, {7}, {15, 7}, {24, 7}, {27, 7}, {8}, {8, 12}, {8, 16}, {8, 21}, {8, 30}, {10}, {10, 11}, {10, 18}, {11}, {12}, {12, 13}, {19, 12}, {12, 30}, {13}, {19, 13}, {14}, {15}, {24, 15}, {27, 15}, {16}, {16, 21}, {16, 23}, {16, 30}, {18}, {19}, {21}, {21, 23}, {21, 30}, {23}, {24}, {24, 27}, {27}, {30}]
'''7. Finding the density-connected (indirectly density-reachable) core objects'''
def density_reachable():
    reachableList=[] # collects the sets of density-connected core objects

    for i in range(len(directlyList)): # iterate over the (deduplicated) direct-reachability list

        pre = directlyList[i] # current element

        for j in range(len(directlyList)):
            if pre.intersection(directlyList[j])!=set(): # the two sets overlap
                pre = pre.union(directlyList[j])         # merge them into pre
            else:pass

            # subset cleanup: after each merge, rescan the elements before
            # index j to pick up any overlaps missed earlier
            for k in range(j):
                if pre.intersection(directlyList[k])!=set():
                    pre=pre.union(directlyList[k])

        reachableList.append(pre) # append the fully merged set to the reachability list

    reachableList=Drop_duplicates(reachableList) # deduplicate

    return reachableList

reachableList=density_reachable()      
reachableList
[{1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30},
 {5, 7, 15, 24, 27},
 {14}]
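The merging step above is, in effect, a connected-components computation over the direct-reachability sets. A compact alternative sketch of the same idea (not the original code):

```python
# Incrementally merge every set that overlaps an existing component.
def merge_components(sets):
    components = []
    for s in map(set, sets):
        overlapping = [c for c in components if c & s]  # components sharing a point with s
        for c in overlapping:
            s |= c
            components.remove(c)
        components.append(s)
    return components

# merge_components(directlyList) should reproduce the three sets in reachableList.
```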
'''8. Assembling the clustering result: clusters + outliers'''
# ClustersList: result list; each cluster is stored in it as a set
# PointClusterSet: the set of all points that belong to any cluster
# PerClusterSet: temporary set for the points of one cluster

ClustersList=[]
PointClusterSet=set()

for i in range(len(reachableList)):   # 1. iterate over the reachability list; each element is the set of core objects of one cluster

    PerClusterSet=set()  # temporary set

    for j in list(reachableList[i]):  # 2. iterate over each member of the set (each core object of the cluster)

        for k in neighborhoodList(j): # 3. compute its neighborhood list and iterate over it

            PerClusterSet.add(k)      # 4. add each neighbor to the set

    # each iteration of the outer loop (i) yields one cluster: PerClusterSet

    PointClusterSet=PointClusterSet.union(PerClusterSet) # 5. accumulate all clustered points, used later to find the outliers
    ClustersList.append(PerClusterSet) # 6. append the cluster set to the result list

    print("Cluster",i,": ",PerClusterSet)
    print("Core Points of Cluster",i,": ",reachableList[i])

    BorderSet=PerClusterSet - reachableList[i] # 7. border points of the cluster = cluster set minus its core objects
    if BorderSet != set():
        print("Border Points of Cluster",i,": ",BorderSet)
    else:
        print("Border Points of Cluster",i,": NULL")

    print('-'*40)
# print the set of all clustered points
# print("All Points in Clusters: ", PointClusterSet)
print('-'*40)

# Finding the outliers:
# 1. build the set of all points
AllPoint=set()
for i in range(B.shape[0]):
    AllPoint.add(i)
# 2. subtract the set of all clustered points from the set of all points
outlier=AllPoint-PointClusterSet
print("Outlier: ", outlier)
Cluster 0 :  {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Core Points of Cluster 0 :  {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Border Points of Cluster 0 : NULL
----------------------------------------
Cluster 1 :  {5, 7, 15, 20, 24, 27}
Core Points of Cluster 1 :  {5, 7, 24, 27, 15}
Border Points of Cluster 1 :  {20}
----------------------------------------
Cluster 2 :  {28, 29, 14}
Core Points of Cluster 2 :  {14}
Border Points of Cluster 2 :  {28, 29}
----------------------------------------
----------------------------------------
Outlier:  {0, 4, 9, 17, 22, 25, 26}
# The list of clusters
ClustersList  
[{1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30},
 {5, 7, 15, 20, 24, 27},
 {14, 28, 29}]
print("Number of clusters after density clustering:",len(ClustersList))
Number of clusters after density clustering: 3
'''9. Visualizing the DBSCAN clustering result'''
import matplotlib.pyplot as plt

# Colors for the legend
colors = {'Cluster1': 'red', 'Cluster2': 'green', 'Cluster3': 'blue', 'Outliers': 'purple'}

# Scatter plot
for i in range(B.shape[0]):
    if i in ClustersList[0]:
        plt.scatter(B[i, 0], B[i, 1], color='red', marker='o')  
    elif i in ClustersList[1]:
        plt.scatter(B[i, 0], B[i, 1], color='green', marker='o')  
    elif i in ClustersList[2]:
        plt.scatter(B[i, 0], B[i, 1], color='blue', marker='o')  
    else:
        plt.scatter(B[i, 0], B[i, 1], color='purple', marker='o') 

    # label each point with its index
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize=10,color='black')

# Build the legend handles
legend_handles = []
for label, color in colors.items():
    legend_handles.append(plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=10))

# Add the legend
plt.legend(legend_handles, colors.keys())

# Axis labels
plt.xlabel('X')
plt.ylabel('Y')

# Title
plt.title('DBSCAN Cluster Visualization')

plt.show()

[Figure: DBSCAN clustering result; Cluster1 red, Cluster2 green, Cluster3 blue, outliers purple]

3. Density Clustering of the Dataset Using sklearn Library Functions
'''1. Training a clustering model on the data (array B) with sklearn's DBSCAN class'''
from sklearn.cluster import DBSCAN

# Cluster with DBSCAN, r=0.8, MinPTS=3
db = DBSCAN(eps=0.8, min_samples=3)
db.fit(B)
DBSCAN(eps=0.8, min_samples=3)
'''2. Evaluating the clustering result (silhouette coefficient)'''
from sklearn.metrics import silhouette_score

# Get the cluster labels (-1 marks outliers); labels is an array
labels = db.labels_

# Compute the silhouette coefficient
score = silhouette_score(B, labels)
print('Silhouette coefficient:', score)
Silhouette coefficient: 0.37453991956763705
Note:
Interpreting the silhouette coefficient:
  score(i) -> -1: sample i should rather be assigned to a different cluster;
  score(i) -> 0: sample i lies on the boundary between two clusters;
  score(i) -> 1: sample i is clustered reasonably.
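For reference, the per-sample silhouette is defined as s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from sample i to the other members of its own cluster and b(i) is the mean distance from sample i to the members of the nearest other cluster.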

A silhouette coefficient of 0.37 is not a particularly good result, but to allow a comparison with the from-scratch implementation we keep the (0.8, 3) parameter combination.
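Note also that silhouette_score above treats the -1 labels as one extra "noise cluster"; a sketch that scores only the points actually assigned to a cluster:

```python
# Silhouette over clustered points only, excluding the -1 (noise) labels.
mask = labels != -1
print('Silhouette (clustered points only):', silhouette_score(B[mask], labels[mask]))
```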

The k-nearest-neighbor distance plot can be used to refine the choice of eps: look for the elbow point of the curve.

'''* Using nearest neighbors to compute and plot each sample's distance to its second nearest neighbor'''
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

neigh = NearestNeighbors(n_neighbors=2) # nearest-neighbor object with 2 neighbors

nbrs = neigh.fit(B) # fit it to the data B

distances, indices = nbrs.kneighbors(B) # distance from each sample to its 2 nearest neighbors (the first is the sample itself)


distances = np.sort(distances, axis=0) # sort the distances

distances = distances[:, 1] # keep each sample's distance to its second nearest neighbor

# Plot the distance curve
plt.plot(distances)
plt.title('Distance to 2nd Nearest Neighbor') # plot title
plt.show()

[Figure: sorted curve of each sample's distance to its 2nd nearest neighbor]
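A common convention ties k to MinPts: plot each point's distance to its MinPts-th nearest neighbor (excluding itself) and read eps off the elbow. A variant of the plot above under that convention (a sketch, reusing the minPts set earlier):

```python
# k-distance plot with k = MinPts. kneighbors counts a point as its own
# nearest neighbor (distance 0), hence n_neighbors = k + 1 and column k.
k = minPts
dist_k, _ = NearestNeighbors(n_neighbors=k + 1).fit(B).kneighbors(B)
plt.plot(np.sort(dist_k[:, k]))
plt.title('Distance to %d-th Nearest Neighbor' % k)
plt.show()
```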

# Print the cluster label array
print(labels)
[-1  0  0  0 -1  1  0  1  0 -1  0  0  0  0  2  1  0 -1  0  0  1  0 -1  0
  1 -1 -1  1  2  2  0]
# Print the core point indices and their count
print(db.core_sample_indices_, '\n','Number of core objects:',db.core_sample_indices_.shape[0])
[ 1  2  3  5  6  7  8 10 11 12 13 14 15 16 18 19 21 23 24 27 30] 
 Number of core objects: 21
'''3. Displaying the clustering results'''
core = db.core_sample_indices_ # the core object indices

# Iterate over the cluster labels
for i in range(len(list(set(labels)))-1): # len-1 drops the outlier label (-1)
    list1=[] # points of the current cluster

    # collect the points whose label matches the current cluster
    for j in range(B.shape[0]):
        if labels[j] == i:
            list1.append(j)
        else: pass
    print("Cluster ",i,": ",set(list1))

    list2=[] # core objects of the current cluster
    list3=[] # border points of the current cluster
    for k in list1:
        if k in core:list2.append(k)
        else:list3.append(k)

    print("Core Point from Cluster",i,": ",set(list2))

    if list3!=[]:
        print("Border Point from Cluster",i,": ",set(list3))
    else:
            print("Border Point from Cluster",i,": NULL")

    print('-'*40)

list4=[] # the outliers
for k in range(len(labels)):
    if labels[k]== -1: list4.append(k) # points labeled -1 are outliers
print("Outlier: ",set(list4))

Cluster  0 :  {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Core Point from Cluster 0 :  {1, 2, 3, 6, 8, 10, 11, 12, 13, 16, 18, 19, 21, 23, 30}
Border Point from Cluster 0 : NULL
----------------------------------------
Cluster  1 :  {5, 7, 15, 20, 24, 27}
Core Point from Cluster 1 :  {5, 7, 15, 24, 27}
Border Point from Cluster 1 :  {20}
----------------------------------------
Cluster  2 :  {28, 29, 14}
Core Point from Cluster 2 :  {14}
Border Point from Cluster 2 :  {28, 29}
----------------------------------------
Outlier:  {0, 4, 9, 17, 22, 25, 26}
'''4. Visualizing the clustering result'''
for i in range(B.shape[0]):
    if labels[i] == 0:
        c1 = plt.scatter(B[i,0], B[i,1], c = 'r', marker='+')
    elif labels[i] == 1:
        c2 = plt.scatter(B[i,0], B[i,1], c = 'g', marker='o')
    elif labels[i] == 2:
        c3 = plt.scatter(B[i,0], B[i,1], c = 'b', marker='*')
    elif labels[i] == -1:
        c4 = plt.scatter(B[i,0], B[i,1], c = 'purple', marker='^')

    # label each point with its index
    plt.text(B[i,0],B[i,1]-0.22,i,fontsize=10,color='black')

plt.legend([c1,c2,c3,c4], ['Cluster 1','Cluster 2','Cluster 3','Outlier'])
plt.title('DBSCAN Clustering Results based on sklearn')
plt.show()

[Figure: sklearn DBSCAN result; Cluster 1 '+', Cluster 2 'o', Cluster 3 '*', outliers '^']


Comparing the two clustering runs above:
With the same parameter combination, the from-scratch implementation and the sklearn-based clustering produce exactly the same result, which confirms that the hand-written algorithm is correct.

Finally, here is a self-contained Python implementation of the DBSCAN clustering algorithm with visualization:

```python
import numpy as np
import matplotlib.pyplot as plt

def dbscan(X, eps, min_pts):
    """
    DBSCAN clustering algorithm.
    :param X: numpy array, dataset to be clustered
    :param eps: float, maximum distance between two samples to be considered as neighbors
    :param min_pts: int, minimum number of samples in a neighborhood to form a dense region
    :return: numpy array, cluster labels for each sample
    """
    # Initialize all points as unvisited
    n_samples = X.shape[0]
    visited = np.zeros(n_samples, dtype=bool)
    # Initialize all points as unlabeled (0); noise will be marked -1
    labels = np.zeros(n_samples, dtype=int)
    # Initialize the cluster label
    cluster_label = 0
    # Iterate over all unvisited points
    for i in range(n_samples):
        if not visited[i]:
            visited[i] = True
            # Find all points in the neighborhood
            neighbors = _region_query(X, i, eps)
            # If the neighborhood is too small, mark the point as noise
            if len(neighbors) < min_pts:
                labels[i] = -1
            else:
                # Expand the cluster
                cluster_label += 1
                labels[i] = cluster_label
                _expand_cluster(X, visited, labels, neighbors, cluster_label, eps, min_pts)
    return labels

def _region_query(X, i, eps):
    """Return the indices of all points in the neighborhood of point i."""
    neighbors = []
    for j in range(X.shape[0]):
        if np.linalg.norm(X[i] - X[j]) < eps:
            neighbors.append(j)
    return neighbors

def _expand_cluster(X, visited, labels, neighbors, cluster_label, eps, min_pts):
    """Grow the current cluster outward from the seed point's neighborhood."""
    # Iterate over the neighborhood; the list grows as new core points are found
    for j in neighbors:
        if not visited[j]:
            visited[j] = True
            # Find all points in the neighborhood of point j
            neighbors_j = _region_query(X, j, eps)
            # If the neighborhood is large enough, add its points to the frontier
            if len(neighbors_j) >= min_pts:
                neighbors += neighbors_j
        # If the point hasn't been assigned to a cluster yet, assign it to the current one
        if labels[j] == 0:
            labels[j] = cluster_label

# Generate sample data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Run the DBSCAN clustering algorithm
labels = dbscan(X, eps=1.5, min_pts=5)

# Visualize the clustering results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')
plt.title('DBSCAN Clustering Results')
plt.show()
```

The output is a scatter plot in which different colors mark the different clusters.

![dbscan_clustering_results](https://i.imgur.com/j1RzLZy.png)