

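DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a classic density-based clustering algorithm. K-Means and BIRCH generally work well only on convex sample sets, whereas DBSCAN handles both convex and non-convex sample sets. This post summarizes how the DBSCAN algorithm works.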


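Input:
  • The data set to cluster, $D$
  • The neighborhood radius $\epsilon$
  • MinPts, the minimum number of points required inside a neighborhood, which acts as the density threshold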


Output:
  • A cluster label for each point


  • $\epsilon$-neighborhood: for $x_j \in D$, the $\epsilon$-neighborhood is the subset of samples in $D$ whose distance to $x_j$ is at most $\epsilon$, that is, $N_\epsilon(x_j) = \{x_i \in D \mid \mathrm{distance}(x_i, x_j) \le \epsilon\}$. The number of samples in this subset is written $|N_\epsilon(x_j)|$.

  • Core object: for any sample $x_j \in D$, if its $\epsilon$-neighborhood $N_\epsilon(x_j)$ contains at least MinPts samples, that is, if $|N_\epsilon(x_j)| \ge \mathrm{MinPts}$, then $x_j$ is a core object.

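  • Directly density-reachable: if $x_i$ lies in the $\epsilon$-neighborhood of $x_j$ and $x_j$ is a core object, then $x_i$ is directly density-reachable from $x_j$. The converse does not necessarily hold unless $x_i$ is also a core object.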

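  • Density-reachable: a chain of direct density-reachability steps. For $x_i$ and $x_j$, if there is a sequence of samples $p_1, p_2, \ldots, p_T$ with $p_1 = x_i$, $p_T = x_j$, and $p_{t+1}$ directly density-reachable from $p_t$ for every $t = 1, \ldots, T-1$, then $x_j$ is density-reachable from $x_i$. Density-reachability is therefore transitive. The intermediate samples $p_1, p_2, \ldots, p_{T-1}$ are all core objects, because only a core object can make another sample directly density-reachable. Density-reachability is not symmetric, which follows from the asymmetry of direct density-reachability.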

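  • Density-connected: for $x_i$ and $x_j$, if there is a core object $x_k$ such that both $x_i$ and $x_j$ are density-reachable from $x_k$, then $x_i$ and $x_j$ are density-connected. Density-connectedness is symmetric. Put simply, the two points can be linked through one or more core objects.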
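The definitions above can be checked on a toy data set. The following is a minimal sketch, assuming Euclidean distance and a data set small enough for a full pairwise distance matrix. The function name `classify_points` and the labels "border" and "noise" are illustrative choices: a border point here means a non-core point that is directly density-reachable from some core object, and a noise point is one that is neither.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # in_eps[i, j] is True when x_j lies in the eps-neighborhood of x_i.
    # Every point is in its own neighborhood (distance 0).
    in_eps = dist <= eps
    # Core object: the eps-neighborhood holds at least min_pts samples.
    is_core = in_eps.sum(axis=1) >= min_pts
    # A non-core point is a border point if at least one core object
    # has it inside its eps-neighborhood (directly density-reachable).
    near_core = (in_eps & is_core[None, :]).any(axis=1)
    kinds = np.where(is_core, "core", np.where(near_core, "border", "noise"))
    return kinds.tolist()

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [5.0, 5.0]]
print(classify_points(X, eps=1.0, min_pts=3))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point (2, 1) has only two samples in its neighborhood (itself and (1, 1)), so it is not a core object, but it is directly density-reachable from the core object (1, 1). The reverse does not hold, which illustrates the asymmetry.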



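Iterate over every point $P$. If $P$ is a core point, start a new cluster, find the points in its neighborhood, and use them as the seed queue of that cluster. Then visit each neighbor $P_n$ in the queue: assign it to the current cluster and, if $P_n$ is also a core point, append the points in its neighborhood to the seed queue. The cluster is complete once the queue is exhausted. Points that are not core points and are never reached from one are labeled as noise.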
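The procedure can be sketched compactly as follows. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, precomputes all neighborhoods from a full pairwise distance matrix (practical only for small data sets), and uses a `collections.deque` as the seed queue. The name `dbscan_sketch` is hypothetical.

```python
from collections import deque
import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Return one label per row of X: -1 for noise, 1..k for clusters."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Index lists of every point's eps-neighborhood (the point itself included).
    neighbors = [np.flatnonzero(row <= eps) for row in dist]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.zeros(len(X), dtype=int)  # 0 means "not assigned yet"
    cluster = 0
    for p in range(len(X)):
        # Only unassigned core points may start a new cluster.
        if labels[p] != 0 or not is_core[p]:
            continue
        cluster += 1
        labels[p] = cluster
        queue = deque(neighbors[p])
        while queue:
            q = queue.popleft()
            if labels[q] != 0:
                continue  # already claimed by a cluster
            labels[q] = cluster
            # Only core points extend the cluster further.
            if is_core[q]:
                queue.extend(neighbors[q])
    # Anything never reached from a core point is noise.
    labels[labels == 0] = -1
    return labels.tolist()

X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [5, 5]]
print(dbscan_sketch(X, eps=1.5, min_pts=3))
# [1, 1, 1, 2, 2, 2, -1]
```

Because neighborhoods are computed up front, noise is assigned in a single pass at the end rather than being relabeled during cluster growth.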



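Advantages:
  • The number of clusters does not have to be specified in advance.
  • It can cluster dense data sets of arbitrary shape, whereas algorithms such as K-Means are generally suitable only for convex data sets.
  • It detects outliers while clustering (they are labeled as noise), so the clusters themselves are insensitive to outliers in the data set.
  • The result does not depend on initial values, whereas for algorithms such as K-Means the initialization has a large effect on the result.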


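Disadvantages:
  • When the density of the sample set is uneven and the distances between clusters vary widely, clustering quality is poor, because a single pair of $\epsilon$ and MinPts is applied to the whole data set. DBSCAN is generally not a good fit in that case.
  • When the sample set is large, clustering takes a long time. This can be mitigated by limiting the size of the KD-tree or ball tree built for the nearest-neighbor search.
  • Parameter tuning is somewhat more involved than for traditional algorithms such as K-Means. The distance threshold $\epsilon$ and the neighborhood sample threshold MinPts have to be tuned jointly, and different combinations can change the final clustering considerably.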
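One common heuristic for the joint tuning problem, which is not part of the text above, is to fix MinPts first and then inspect the sorted distance from each point to its k-th nearest neighbor: a sharp jump in that curve suggests a value for $\epsilon$. The sketch below assumes Euclidean distance and a small data set. The function name `k_distances` is hypothetical, and the choice k = MinPts − 1 assumes that a point counts as a member of its own neighborhood.

```python
import numpy as np

def k_distances(X, k):
    """Return the sorted distance from each point to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest other point.
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)

X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [5, 5]]
# MinPts = 3 with the point itself counted, so look at the 2nd nearest neighbor.
print(np.round(k_distances(X, k=2), 3))
```

In this toy set the six clustered points have k-distances of at most about 1.414, while the isolated point (5, 5) sits at about 6.403, so any $\epsilon$ between those two values separates the clusters from the outlier. On real data the jump is rarely this clean and the choice remains a judgment call.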



import numpy

def MyDBSCAN(D, eps, MinPts):
    """
    Cluster the dataset `D` using the DBSCAN algorithm.

    MyDBSCAN takes a dataset `D` (a list of NumPy vectors, or a 2-D NumPy
    array), a threshold distance `eps`, and a required number of points
    `MinPts`.

    It will return a list of cluster labels. The label -1 means noise, and then
    the clusters are numbered starting from 1.
    """
    # This list will hold the final cluster assignment for each point in D.
    # There are two reserved values:
    #    -1 - Indicates a noise point
    #     0 - Means the point hasn't been considered yet.
    # Initially all labels are 0.    
    labels = [0]*len(D)

    # C is the ID of the current cluster.    
    C = 0
    # This outer loop is just responsible for picking new seed points--a point
    # from which to grow a new cluster.
    # Once a valid seed point is found, a new cluster is created, and the 
    # cluster growth is all handled by the 'expandCluster' routine.
    # For each point P in the Dataset D...
    # ('P' is the index of the datapoint, rather than the datapoint itself.)
    for P in range(0, len(D)):
        # Only points that have not already been claimed can be picked as new 
        # seed points.    
        # If the point's label is not 0, continue to the next point.
        if labels[P] != 0:
            continue

        # Find all of P's neighboring points.
        NeighborPts = regionQuery(D, P, eps)
        # If the number is below MinPts, this point is noise. 
        # This is the only condition under which a point is labeled 
        # NOISE--when it's not a valid seed point. A NOISE point may later 
        # be picked up by another cluster as a boundary point (this is the only
        # condition under which a cluster label can change--from NOISE to 
        # something else).
        if len(NeighborPts) < MinPts:
            labels[P] = -1
        # Otherwise, if there are at least MinPts nearby, use this point as the
        # seed for a new cluster.
        else:
            C += 1
            growCluster(D, labels, P, NeighborPts, C, eps, MinPts)

    # All data has been clustered!
    return labels

def growCluster(D, labels, P, NeighborPts, C, eps, MinPts):
    """
    Grow a new cluster with label `C` from the seed point `P`.

    This function searches through the dataset to find all points that belong
    to this new cluster. When this function returns, cluster `C` is complete.

    Parameters:
      `D`      - The dataset (a list of vectors)
      `labels` - List storing the cluster labels for all dataset points
      `P`      - Index of the seed point for this new cluster
      `NeighborPts` - All of the neighbors of `P`
      `C`      - The label for this new cluster.
      `eps`    - Threshold distance
      `MinPts` - Minimum required number of neighbors
    """

    # Assign the cluster label to the seed point.
    labels[P] = C
    # Look at each neighbor of P (neighbors are referred to as Pn). 
    # NeighborPts will be used as a FIFO queue of points to search--that is, it
    # will grow as we discover new branch points for the cluster. The FIFO
    # behavior is accomplished by using a while-loop rather than a for-loop.
    # In NeighborPts, the points are represented by their index in the original
    # dataset.
    i = 0
    while i < len(NeighborPts):    
        # Get the next point from the queue.        
        Pn = NeighborPts[i]
        # If Pn was labelled NOISE during the seed search, then we
        # know it's not a branch point (it doesn't have enough neighbors), so
        # make it a leaf point of cluster C and move on.
        if labels[Pn] == -1:
            labels[Pn] = C
        # Otherwise, if Pn isn't already claimed, claim it as part of C.
        elif labels[Pn] == 0:
            # Add Pn to cluster C (Assign cluster label C).
            labels[Pn] = C
            # Find all the neighbors of Pn
            PnNeighborPts = regionQuery(D, Pn, eps)
            # If Pn has at least MinPts neighbors, it's a branch point!
            # Add all of its neighbors to the FIFO queue to be searched. 
            if len(PnNeighborPts) >= MinPts:
                NeighborPts = NeighborPts + PnNeighborPts
            # If Pn *doesn't* have enough neighbors, then it's a leaf point.
            # Its neighbors are not queued as expansion points, so there is
            # nothing more to do for it.
        # Advance to the next point in the FIFO queue.
        i += 1        
    # We've finished growing cluster C!

def regionQuery(D, P, eps):
    """
    Find all points in dataset `D` within distance `eps` of point `P`.

    This function calculates the distance between a point P and every other
    point in the dataset, and then returns only those points which are within a
    threshold distance `eps`. The result holds indices into `D` and includes
    `P` itself.
    """
    neighbors = []
    # For each point in the dataset...
    for Pn in range(0, len(D)):
        # If the distance is within the threshold, add it to the neighbors list.
        # (<= matches the epsilon-neighborhood definition: distance <= eps.)
        if numpy.linalg.norm(D[P] - D[Pn]) <= eps:
            neighbors.append(Pn)

    return neighbors
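The region query loops over the data set in Python, which is usually the slowest part of this implementation. If the data is held in a 2-D NumPy array, the same query can be done in a single vectorized call. The following is a sketch under that assumption. The name `region_query_vectorized` is hypothetical, and it uses `<=` so that the result matches the $\epsilon$-neighborhood definition.

```python
import numpy as np

def region_query_vectorized(D, P, eps):
    """Return the indices of all points in D within distance eps of point P."""
    D = np.asarray(D, dtype=float)
    # Distance from point P to every row of D, computed in one call.
    dists = np.linalg.norm(D - D[P], axis=1)
    # Indices (including P itself) whose distance is at most eps.
    return np.flatnonzero(dists <= eps).tolist()

D = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
print(region_query_vectorized(D, 0, eps=1.0))  # [0, 1]
print(region_query_vectorized(D, 0, eps=5.0))  # [0, 1, 2]
```

Each query still touches every point, so the overall cost stays quadratic in the number of samples. A spatial index such as a KD-tree or ball tree is the usual way to go further.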





