An Introduction to DBSCAN and Its Use in the sklearn Library

teresa_zz

Category: Clustering algorithms

Article link: https://blog.csdn.net/teresa_zz/article/details/82454907

DBSCAN

Definition

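DBSCAN stands for Density-based spatial clustering of applications with noise. It is a density-based clustering algorithm: within a space, it groups points that lie close together into one cluster and discards points in low-density regions as noise.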

Algorithm

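A point $p$ is called a core point if at least $n$ points (counting $p$ itself) lie within radius $r$ of it. Those points are said to be directly reachable from $p$.
A point $q$ is reachable from $p$ if there is a path $p_1, p_2, p_3, \ldots, p_n$ with $p_1 = p$ and $p_n = q$ in which each point is directly reachable from the one before it, so every point on the path except possibly $q$ must be a core point.
Points that are not reachable from any other point are treated as noise and discarded.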
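
To make the core-point definition concrete, here is a minimal sketch in NumPy. The function name `find_core_points` and the toy data are my own illustration and not part of any library, and I am assuming a point counts itself among its neighbors.

```python
import numpy as np

def find_core_points(points, r, n):
    """Return a boolean mask that is True where a point is a core point."""
    # Pairwise Euclidean distances, shape (num_points, num_points).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count the points within radius r; each point counts itself (distance 0).
    neighbor_counts = (dists <= r).sum(axis=1)
    return neighbor_counts >= n

points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],  # a small dense square
    [5.0, 5.0],                                      # an isolated point
])
core_mask = find_core_points(points, r=1.0, n=3)
print(core_mask)
```

With `r=1.0` and `n=3`, each corner of the square sees itself plus its two adjacent corners, so the four corners are core points, while the isolated point only sees itself.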

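DBSCAN takes two parameters: ε (eps), which is the radius $r$ above, and the minimum number of points needed to form a dense region (minPts), which is the $n$ above. It starts from an arbitrary unvisited point and explores that point's ε-neighborhood. If the ε-neighborhood contains enough points, a new cluster is started; otherwise the point is labeled as noise. Note that this point may later turn up in the ε-neighborhood of another point, and that neighborhood may contain enough points, in which case the point is added to that cluster.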

# Pseudocode for DBSCAN
DBSCAN(D, eps, MinPts) {
   C = 0
   for each point P in dataset D {
      if P is visited
         continue next point
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else {
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)
      }
   }
}
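# Expand the cluster by adding new points; if a newly visited point is itself a core point, its whole neighborhood is merged into the set still to be processed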
expandCluster(P, NeighborPts, C, eps, MinPts) {
   add P to cluster C
   for each point P' in NeighborPts { 
      if P' is not visited {
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      }
      if P' is not yet member of any cluster
         add P' to cluster C
   }
}
# Find all the points that lie within the given radius of a point
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
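
The pseudocode above can be transcribed into plain Python fairly directly. What follows is only a sketch under a few assumptions of my own: points are tuples, distance is Euclidean (via `math.dist`, Python 3.8+), noise is labeled `-1`, and `region_query` is a brute-force scan. The function names mirror the pseudocode and are not any library's API.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def expand_cluster(points, labels, visited, i, neighbors, cluster, eps, min_pts):
    labels[i] = cluster
    queued = set(neighbors)          # avoids queueing the same index twice
    k = 0
    while k < len(neighbors):        # the list may grow while we walk it
        j = neighbors[k]
        if not visited[j]:
            visited[j] = True
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # j is a core point: merge its neighborhood into the work list.
                for m in j_neighbors:
                    if m not in queued:
                        queued.add(m)
                        neighbors.append(m)
        if labels[j] is None or labels[j] == NOISE:
            # Not yet a member of any cluster (noise can become a border point).
            labels[j] = cluster
        k += 1

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)    # None means "not assigned yet"
    visited = [False] * len(points)
    cluster = -1
    for i in range(len(points)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
        else:
            cluster += 1             # next cluster id
            expand_cluster(points, labels, visited, i, neighbors,
                           cluster, eps, min_pts)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),   # first dense group
          (5, 5),                           # isolated point
          (10, 10), (10, 11), (11, 10)]     # second dense group
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

On this toy data the two dense groups come out as clusters 0 and 1, and the isolated point keeps the noise label.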

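The part of the algorithm that matters most for performance is regionQuery(P, eps), which runs once for every point. Here you can trade space for time, for example by building a spatial index over the data. The worst-case time complexity is $O(n^2)$, and the average case is $O(n \log n)$.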
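
As a rough illustration of that trade-off, the sketch below builds a KD-tree once (extra memory) and then answers every neighborhood query through it, instead of scanning all points each time. It uses `sklearn.neighbors.KDTree` and its `query_radius` method; the random data, the value of `eps`, and the helper `region_query_brute` are assumptions made up for the example.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((200, 2))   # 200 random points in the unit square
eps = 0.1

def region_query_brute(X, i, eps):
    """Brute force: compare point i against every point, O(n) per query."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero(dists <= eps)

# Build the index once, then query the eps-neighborhood of every point.
tree = KDTree(X, leaf_size=30)
indexed_neighbors = tree.query_radius(X, r=eps)

# Both approaches should find the same neighbors for every point.
same = all(
    set(indexed_neighbors[i]) == set(region_query_brute(X, i, eps))
    for i in range(len(X))
)
print(same)
```

The index costs memory and build time up front, but each query then only has to look at a small part of the data rather than all of it.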

Usage in Python's sklearn library

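DBSCAN is a very useful algorithm, and Python, being a language that has a ready-made wheel for just about everything, naturally has a library that includes it: sklearn.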

Step one, import the library:

from sklearn.cluster import DBSCAN

Its signature and the meaning of its parameters are as follows:

class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=1)

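Parameter	Meaning
eps	The maximum distance between two samples for them to be considered as in the same neighborhood. This is the radius described earlier.
min_samples	The number of samples that must lie within that radius for a point to count as a core point (the point itself is included).
metric	The distance metric to use. The default is Euclidean distance, $\sqrt{\sum\limits_{i=1}^{n}(x_i-y_i)^2}$; other metrics such as Manhattan and Chebyshev are also available.
metric_params	Additional keyword arguments for the metric function. Some metrics need extra parameters; Euclidean distance needs none, so this stays None.
algorithm	{'auto', 'ball_tree', 'kd_tree', 'brute'}. DBSCAN has to find the neighbors of each point, and there are three algorithms for doing so; 'auto' tries to pick the most suitable one for you. For sparse data it is generally brute.
leaf_size	Leaf size passed to BallTree or cKDTree. It only matters for the tree-based algorithms, where it mainly affects memory usage and query time.
p	The power of the Minkowski metric to be used to calculate distance between points.
n_jobs	The number of parallel jobs to run. If -1, then the number of jobs is set to the number of CPU cores. The default in the signature above is 1; scikit-learn 0.20 and later use None as the default.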

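By now it should be clear that the two parameters that really matter are the first two, eps and min_samples. Those are the ones you will mainly tune; the rest can usually be left at their defaults.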
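
Putting it together, here is a small usage sketch. The toy data and the parameter values are made up for illustration; `labels_` and `core_sample_indices_` are attributes of the fitted estimator, and sklearn marks noise points with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],          # first dense group
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],  # second dense group
    [50.0, 50.0],                                            # far-away outlier
])

# Only eps and min_samples are set; everything else keeps its default.
model = DBSCAN(eps=1.5, min_samples=3).fit(X)

labels = model.labels_   # one cluster label per sample, -1 means noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(labels)
print(n_clusters)
print(model.core_sample_indices_)
```

Here the two groups come out as two clusters and the far-away point is labeled as noise. Making `eps` smaller or `min_samples` larger makes the algorithm stricter about what counts as dense, so more points end up as noise.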
