DBSCAN
01 How DBSCAN works
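- Pick a point from the samples, and fix a radius epsilon and a minimum number of neighbours, min_points, that must lie inside the circle of that radius.
- If the neighbourhood circle of radius epsilon around the point contains at least min_points neighbours, move the centre of the circle to the next sample point.
- If a sample point does not satisfy this condition, choose a different sample point. Keep iterating the clustering with the chosen radius epsilon and min_points.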
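The density condition in the steps above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the helper names `neighbours` and `is_core` and the toy points are assumptions, and the sketch follows the same convention as the code later in the post (strict `<` comparison, with the point itself counted among its neighbours).

```python
import math

def neighbours(points, idx, eps):
    """Indices of all points (idx itself included) closer than eps to points[idx]."""
    centre = points[idx]
    return [j for j, q in enumerate(points) if math.dist(centre, q) < eps]

def is_core(points, idx, eps, min_points):
    """True if the eps-neighbourhood of points[idx] holds at least min_points points."""
    return len(neighbours(points, idx, eps)) >= min_points

# Hypothetical toy data: four points close together and one far away
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (5.0, 5.0)]

print(is_core(points, 0, eps=1.0, min_points=4))  # dense corner of the square
print(is_core(points, 4, eps=1.0, min_points=4))  # isolated point
```

Only points that pass this test let the circle keep moving; the others end the expansion at that point.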
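The key to DBSCAN is the choice of the threshold epsilon. If the radius is too large, the clustering is poor: too few clusters are produced and the groups are not separated well.
If the radius is too small, too many clusters are produced.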
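To make the effect of epsilon concrete, here is a small sketch that runs a compact one-dimensional DBSCAN over the same toy data with several radii and counts the resulting clusters. The functions `dbscan_1d` and `count_clusters`, the data, and the chosen radii are all assumptions made for illustration; this is not the implementation discussed later in the post.

```python
UNVISITED, NOISE = 0, None

def dbscan_1d(xs, eps, min_points):
    """Compact DBSCAN for 1-D data; returns one label per point (None = noise)."""
    labels = [UNVISITED] * len(xs)

    def region(i):
        # all points strictly within eps of xs[i], xs[i] itself included
        return [j for j in range(len(xs)) if abs(xs[j] - xs[i]) < eps]

    cluster_id = 0
    for i in range(len(xs)):
        if labels[i] != UNVISITED:
            continue
        seeds = region(i)
        if len(seeds) < min_points:
            labels[i] = NOISE              # may still become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] is NOISE:
                labels[j] = cluster_id     # noise reached from a core point: border point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            nbrs = region(j)
            if len(nbrs) >= min_points:    # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

def count_clusters(labels):
    return len({label for label in labels if label is not NOISE})

# Hypothetical data: four tight groups, arranged as two looser pairs
xs = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 6.0, 6.1, 6.2]

for eps in (0.05, 0.25, 1.5, 10.0):
    print(eps, count_clusters(dbscan_1d(xs, eps, min_points=3)))
```

On this data a growing radius merges the four tight groups into two and then into one, while the smallest radius leaves every point without enough neighbours, so everything is labelled noise.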
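Many articles have pointed out the limitations of the classic DBSCAN algorithm.
Its average complexity is O(n log n), which assumes a spatial index for the neighbourhood queries, and its worst-case complexity is O(n^2). It is also comparatively slow on high-dimensional data.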
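The gap between the two complexity figures comes from the neighbourhood query. A brute-force query compares one point with all n points, so n queries cost O(n^2). The sketch below shows one simple way to cut the number of comparisons: a uniform grid with cells of side eps, so that a query only inspects the 3x3 block of cells around the point. This is an assumed illustration for 2-D data, with hypothetical helper names; production implementations typically use tree-based indexes instead.

```python
import math
import random
from collections import defaultdict
from itertools import product

def build_grid(points, eps):
    """Bucket 2-D points into square cells of side eps."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / eps), math.floor(y / eps))].append(idx)
    return grid

def region_query_grid(points, grid, idx, eps):
    """Neighbours within eps of points[idx], checking only the 3x3 cells around it."""
    x, y = points[idx]
    cx, cy = math.floor(x / eps), math.floor(y / eps)
    found = []
    for dx, dy in product((-1, 0, 1), repeat=2):
        for j in grid.get((cx + dx, cy + dy), ()):
            if math.dist(points[idx], points[j]) < eps:
                found.append(j)
    return sorted(found)

def region_query_brute(points, idx, eps):
    """Reference version: compare against every point."""
    return [j for j in range(len(points)) if math.dist(points[idx], points[j]) < eps]

# Hypothetical random data to check that both queries agree
random.seed(0)
pts = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(200)]
grid = build_grid(pts, 1.0)
same = all(region_query_grid(pts, grid, i, 1.0) == region_query_brute(pts, i, 1.0)
           for i in range(len(pts)))
print(same)
```

The grid only helps while the data is low-dimensional: the number of neighbouring cells grows as 3^d, which is one way to see why density clustering slows down on high-dimensional data.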
02 A DBSCAN example
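- Set Epsilon = 1 and minPoints = 4.
- Pick a core point and iterate from it.
- Once that iteration is complete, run a second and a third iteration over the points that did not satisfy the condition.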
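The example settings can be tried on a handful of points. The sketch below uses Epsilon = 1 and minPoints = 4 as above, but the coordinates are hypothetical (they are not taken from the article's example), and the core / border / noise labels are the usual DBSCAN terminology rather than names used in this post.

```python
import math

EPS, MIN_POINTS = 1.0, 4  # values from the example above

# Hypothetical toy points
points = [
    (0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),  # dense square
    (1.2, 0.5),                                       # just outside the square
    (4.0, 4.0),                                       # isolated
]

def neighbours(i):
    """Indices of points strictly within EPS of points[i], i itself included."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) < EPS]

# A core point has at least MIN_POINTS points in its neighbourhood
core = {i for i in range(len(points)) if len(neighbours(i)) >= MIN_POINTS}

def role(i):
    if i in core:
        return "core"
    if any(j in core for j in neighbours(i)):
        return "border"  # not dense itself, but reachable from a core point
    return "noise"

for i, p in enumerate(points):
    print(p, len(neighbours(i)), role(i))
```

The four corners of the square are dense enough to act as centres, the fifth point is only reachable from them, and the last point has no neighbour but itself.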
03 Implementing DBSCAN
- Import the required libraries. The data is the iris dataset that ships with Sklearn.

    import math

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
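- Define the markers for unclassified points and for outliers.

    UNCLASSIFIED = False  # marker for points that have not been classified yet
    NOISE = None          # marker for outliers (noise points)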
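- Pick the first core point of each pass and run the first iteration with the parameters epsilon and min_points.

    def _region_query(m, point_id, eps):
        """Find the points inside the circle of radius eps centred on a point.

        params:
            m: feature matrix, one column per sample
            point_id: index of the point chosen as the centre
            eps: radius of the circle
        return:
            seeds: indices of the neighbours of that point
        """
        n_points = m.shape[1]
        seeds = []
        for i in range(0, n_points):
            # keep every point whose distance to the centre is below eps
            if _eps_neighbourhood(m[:, point_id], m[:, i], eps):
                seeds.append(i)
        return seeds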
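- The neighbour test and the Euclidean distance used above.

    def __dist(p, q):
        # Euclidean distance between two points
        return math.sqrt(np.power(p - q, 2).sum())

    def _eps_neighbourhood(p, q, eps):
        # True if q lies within radius eps of p
        return __dist(p, q) < eps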
- After the first core point has been processed, keep moving the circle until a sample point fails the threshold condition.

    def _expand_cluster(m, classifications, point_id, cluster_id, eps, min_points):
        # keep moving the circle until the density condition is no longer met
        seeds = _region_query(m, point_id, eps)  # neighbours of the starting point
        if len(seeds) < min_points:
            # too few neighbours: no cluster starts here, mark the point as an outlier
            classifications[point_id] = NOISE
            return False
        else:
            # enough neighbours: the point and all its neighbours join the cluster
            classifications[point_id] = cluster_id
            for seed_id in seeds:
                classifications[seed_id] = cluster_id

            while len(seeds) > 0:
                current_point = seeds[0]  # take the first seed as the new centre
                results = _region_query(m, current_point, eps)
                if len(results) >= min_points:  # the new centre meets the density condition
                    for i in range(0, len(results)):
                        if classifications[results[i]] == UNCLASSIFIED or \
                                classifications[results[i]] == NOISE:
                            if classifications[results[i]] == UNCLASSIFIED:
                                seeds.append(results[i])  # expand from this point later
                            classifications[results[i]] = cluster_id
                seeds = seeds[1:]  # drop the seed that was just processed
            return True
- dbscan(): initialise the classification list with every point marked as unclassified, then call the function above for each point that is still unclassified.

    def dbscan(m, eps, min_points):
        cluster_id = 1
        n_points = m.shape[1]
        classifications = n_points * [UNCLASSIFIED]
        for point_id in range(0, n_points):
            if classifications[point_id] == UNCLASSIFIED:
                if _expand_cluster(m, classifications, point_id, cluster_id, eps, min_points):
                    cluster_id += 1  # a cluster was formed, move on to the next id
        return classifications
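- Load the dataset, run the clustering and plot the result.

    iris = datasets.load_iris()
    target = iris['target']
    data = iris['data'][:, :2]  # first two columns of the dataset
    x = data[:, 0]              # first column
    y = data[:, 1]              # second column
    m = np.stack([x, y])        # shape (2, n_points): one column per sample

    eps = 0.2
    min_points = 5
    classifications = dbscan(m, eps, min_points)
    print(classifications)

    # `styles` was not defined in the original listing; this list of ten
    # matplotlib format strings is an assumed stand-in
    styles = ['r.', 'g.', 'b.', 'c.', 'm.', 'y.', 'rx', 'gx', 'bx', 'cx']

    plt.title("DBSCAN algorithm")
    for i, j in enumerate(classifications):
        if type(j) == int:
            plt.plot(m[0, i], m[1, i], styles[j % 10])  # clustered point
        else:
            plt.plot(m[0, i], m[1, i], 'k.')            # outlier, drawn in black
    plt.show()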
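The loop-based `_region_query` above calls the distance function once per point. As a possible alternative, the same neighbourhood can be computed with a single NumPy broadcast. This is a sketch under the same conventions as the article's code (one column per sample, strict `<` comparison); the function name `region_query_vectorised` and the small test matrix are assumptions.

```python
import numpy as np

def region_query_vectorised(m, point_id, eps):
    """Indices of the columns of m lying strictly within eps of column point_id."""
    diffs = m - m[:, [point_id]]                # broadcast to shape (n_features, n_points)
    dists = np.sqrt((diffs ** 2).sum(axis=0))   # Euclidean distance to every column
    return np.flatnonzero(dists < eps).tolist()

# Hypothetical data: each column is one 2-D point
demo = np.array([[0.0, 0.5, 0.0, 5.0],
                 [0.0, 0.0, 0.5, 5.0]])

print(region_query_vectorised(demo, 0, 1.0))
```

The result is a plain list of indices, so it could stand in for the loop version without touching `_expand_cluster`. The cost per query is still proportional to the number of points; only the constant factor changes.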
References
https://blog.csdn.net/huacha__/article/details/81094891
https://github.com/choffstein/dbscan/blob/master/dbscan/dbscan.py
"DBSCAN++: Towards fast and scalable density clustering"