【DBSCAN】Clustering Method and Code Implementation

Introduction

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a classic density-based clustering algorithm. Unlike K-Means, BIRCH, and similar methods, which are generally suitable only for convex sample sets, DBSCAN works on both convex and non-convex sample sets. Below we summarize how the algorithm works.

DBSCAN, like density-based clustering algorithms in general, assumes that cluster membership is determined by how tightly the samples are packed: samples of the same cluster are closely connected, meaning that near any sample of a cluster there are other samples of the same cluster. Grouping one set of closely connected samples together yields a single cluster; grouping every such set into its own class yields the final clustering result.

Algorithm Inputs

  • The dataset to be clustered, $D$
  • The neighborhood radius $\epsilon$
  • The threshold $MinPts$ on the number of points within a neighborhood, which represents density

Algorithm Output

  • A cluster label for each point

Basic Concepts

  • $\epsilon$-neighborhood: for $x_j \in D$, its $\epsilon$-neighborhood is the subset of samples in $D$ whose distance to $x_j$ is at most $\epsilon$, i.e. $N_\epsilon(x_j) = \{x_i \in D \mid \mathrm{distance}(x_i, x_j) \le \epsilon\}$. The number of samples in this subset is written $|N_\epsilon(x_j)|$.

  • Core point: for any sample $x_j \in D$, if its $\epsilon$-neighborhood $N_\epsilon(x_j)$ contains at least $MinPts$ samples, i.e. $|N_\epsilon(x_j)| \ge MinPts$, then $x_j$ is a core point.

  • Directly density-reachable: if $x_i$ lies in the $\epsilon$-neighborhood of $x_j$ and $x_j$ is a core point, then $x_i$ is said to be directly density-reachable from $x_j$. The reverse does not necessarily hold unless $x_i$ is also a core point.

  • Density-reachable: chaining multiple directly density-reachable steps yields density-reachability, which is likewise not reversible. For $x_i$ and $x_j$, if there exists a sequence of samples $p_1, p_2, \ldots, p_T$ with $p_1 = x_i$ and $p_T = x_j$ such that each $p_{t+1}$ is directly density-reachable from $p_t$, then $x_j$ is density-reachable from $x_i$; that is, density-reachability is transitive. The intermediate samples $p_1, p_2, \ldots, p_{T-1}$ must all be core points, because only a core point can make another sample directly density-reachable. Note that density-reachability is not symmetric either, which follows from the asymmetry of direct density-reachability.

  • Density-connected: for $x_i$ and $x_j$, if there exists a core point $x_k$ from which both $x_i$ and $x_j$ are density-reachable, then $x_i$ and $x_j$ are said to be density-connected. Note that density-connectedness is symmetric. Intuitively, two points are density-connected if they can be linked through one or more core points.

The figure below illustrates this: red points are core points, black points are non-core points, and arrows denote direct density-reachability. Chains of arrows form density-reachable sequences, indicating density-reachability. All the red points on the left, together with every point in their neighborhoods, are mutually density-connected.
[Figure: DBSCAN core points, direct density-reachability, and density-connectedness]
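To make the definitions concrete, here is a minimal sketch (the function names are illustrative, not part of the original article) that computes an $\epsilon$-neighborhood and tests the core-point condition, assuming the data is a NumPy array with one sample per row:

```python
import numpy as np

def eps_neighborhood(D, j, eps):
    """Indices of all samples within distance eps of D[j], i.e. N_eps(x_j).

    Note that j itself is included, since distance(x_j, x_j) = 0 <= eps."""
    return [i for i in range(len(D)) if np.linalg.norm(D[i] - D[j]) <= eps]

def is_core_point(D, j, eps, min_pts):
    """D[j] is a core point iff |N_eps(x_j)| >= MinPts."""
    return len(eps_neighborhood(D, j, eps)) >= min_pts
```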

How the Algorithm Works

The rough flow is: iterate over every point $P$; if it is a core point, find its neighborhood and use those points as the seed queue of the current cluster. Then process each seed point $P_n$ in turn; if $P_n$ is also a core point, append $P_n$'s neighborhood to the seed queue for the current label, so the cluster keeps growing until the queue is exhausted.

Pros and Cons

Advantages

  • The number of clusters does not need to be supplied as an input.
  • It can cluster dense datasets of arbitrary shape (see the sketch after this list); by contrast, algorithms such as K-Means are generally suitable only for convex datasets.
  • It discovers outliers while clustering and is insensitive to outliers in the dataset.
  • Its results carry no initialization bias; by contrast, the results of algorithms such as K-Means depend heavily on the initial values.
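As a quick illustration of the non-convex case, the following sketch (assuming scikit-learn is available; the parameter values are illustrative) clusters the two-moons dataset, which K-Means cannot separate but DBSCAN handles cleanly:

```python
from sklearn import datasets
from sklearn.cluster import DBSCAN, KMeans

# Two interleaving half-circles: dense, but decidedly non-convex.
X, _ = datasets.make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-Means cuts the moons with a straight boundary, mixing the two shapes;
# DBSCAN recovers one cluster per moon (labels 0 and 1, with -1 for noise).
print(set(km_labels), set(db_labels))
```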

Disadvantages

  • If the density of the sample set is uneven and the gaps between clusters vary greatly, clustering quality suffers; DBSCAN is generally not suitable in that case.
  • On large sample sets, clustering takes a long time to converge; this can be improved by limiting the size of the KD-tree or ball tree built for the nearest-neighbor search (a KD-tree-backed variant of the neighbor query is sketched after the implementation below).
  • Parameter tuning is somewhat more involved than for traditional algorithms such as K-Means: the distance threshold $\epsilon$ and the neighborhood-count threshold $MinPts$ must be tuned jointly, and different combinations strongly affect the final clustering. A common heuristic for choosing $\epsilon$ is sketched below.
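One widely used heuristic (not from the original article; a sketch assuming scikit-learn's NearestNeighbors) is the k-distance graph: fix $MinPts = k$, sort every point's distance to its $k$-th nearest neighbor, and read $\epsilon$ off the "elbow" of the curve:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(X, k):
    """Sorted distance from each point to its k-th nearest neighbor.

    The elbow of this curve is a common choice of eps when MinPts = k."""
    # k + 1 neighbors because each point counts as its own nearest neighbor.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return np.sort(dists[:, -1])

# Usage: plot k_distance_curve(X, k=5) and pick eps at the bend of the curve.
```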

Implementation

Reference: https://github.com/chrisjmccormick/dbscan/blob/master/dbscan.py

import numpy

def MyDBSCAN(D, eps, MinPts):
    """
    Cluster the dataset `D` using the DBSCAN algorithm.
    
    MyDBSCAN takes a dataset `D` (a list of vectors), a threshold distance
    `eps`, and a required number of points `MinPts`.
    
    It will return a list of cluster labels. The label -1 means noise, and then
    the clusters are numbered starting from 1.
    """
 
    # This list will hold the final cluster assignment for each point in D.
    # There are two reserved values:
    #    -1 - Indicates a noise point
    #     0 - Means the point hasn't been considered yet.
    # Initially all labels are 0.    
    labels = [0]*len(D)

    # C is the ID of the current cluster.    
    C = 0
    
    # This outer loop is just responsible for picking new seed points--a point
    # from which to grow a new cluster.
    # Once a valid seed point is found, a new cluster is created, and the 
    # cluster growth is all handled by the 'expandCluster' routine.
    
    # For each point P in the Dataset D...
    # ('P' is the index of the datapoint, rather than the datapoint itself.)
    for P in range(0, len(D)):
    
        # Only points that have not already been claimed can be picked as new 
        # seed points.    
        # If the point's label is not 0, continue to the next point.
        if labels[P] != 0:
            continue
        
        # Find all of P's neighboring points.
        NeighborPts = regionQuery(D, P, eps)
        
        # If the number is below MinPts, this point is noise. 
        # This is the only condition under which a point is labeled 
        # NOISE--when it's not a valid seed point. A NOISE point may later 
        # be picked up by another cluster as a boundary point (this is the only
        # condition under which a cluster label can change--from NOISE to 
        # something else).
        if len(NeighborPts) < MinPts:
            labels[P] = -1
        # Otherwise, if there are at least MinPts nearby, use this point as the 
        # seed for a new cluster.    
        else: 
           C += 1
           growCluster(D, labels, P, NeighborPts, C, eps, MinPts)
    
    # All data has been clustered!
    return labels


def growCluster(D, labels, P, NeighborPts, C, eps, MinPts):
    """
    Grow a new cluster with label `C` from the seed point `P`.
    
    This function searches through the dataset to find all points that belong
    to this new cluster. When this function returns, cluster `C` is complete.
    
    Parameters:
      `D`      - The dataset (a list of vectors)
      `labels` - List storing the cluster labels for all dataset points
      `P`      - Index of the seed point for this new cluster
      `NeighborPts` - All of the neighbors of `P`
      `C`      - The label for this new cluster.  
      `eps`    - Threshold distance
      `MinPts` - Minimum required number of neighbors
    """

    # Assign the cluster label to the seed point.
    labels[P] = C
    
    # Look at each neighbor of P (neighbors are referred to as Pn). 
    # NeighborPts will be used as a FIFO queue of points to search--that is, it
    # will grow as we discover new branch points for the cluster. The FIFO
    # behavior is accomplished by using a while-loop rather than a for-loop.
    # In NeighborPts, the points are represented by their index in the original
    # dataset.
    i = 0
    while i < len(NeighborPts):    
        
        # Get the next point from the queue.        
        Pn = NeighborPts[i]
       
        # If Pn was labelled NOISE during the seed search, then we
        # know it's not a branch point (it doesn't have enough neighbors), so
        # make it a leaf point of cluster C and move on.
        if labels[Pn] == -1:
           labels[Pn] = C
        
        # Otherwise, if Pn isn't already claimed, claim it as part of C.
        elif labels[Pn] == 0:
            # Add Pn to cluster C (Assign cluster label C).
            labels[Pn] = C
            
            # Find all the neighbors of Pn
            PnNeighborPts = regionQuery(D, Pn, eps)
            
            # If Pn has at least MinPts neighbors, it's a branch point!
            # Add all of its neighbors to the FIFO queue to be searched. 
            if len(PnNeighborPts) >= MinPts:
                NeighborPts = NeighborPts + PnNeighborPts
            # If Pn *doesn't* have enough neighbors, then it's a leaf point.
            # Don't queue up its neighbors as expansion points.
            #else:
                # Do nothing                
                #NeighborPts = NeighborPts               
        
        # Advance to the next point in the FIFO queue.
        i += 1        
    
    # We've finished growing cluster C!


def regionQuery(D, P, eps):
    """
    Find all points in dataset `D` within distance `eps` of point `P`.
    
    This function calculates the distance between a point P and every other 
    point in the dataset, and then returns only those points which are within a
    threshold distance `eps`.
    """
    neighbors = []
    
    # For each point in the dataset...
    for Pn in range(0, len(D)):
        
        # If the distance is at most eps, add the point to the neighbors list
        # (<= matches the ≤ in the ϵ-neighborhood definition above).
        if numpy.linalg.norm(D[P] - D[Pn]) <= eps:
           neighbors.append(Pn)
            
    return neighbors
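As noted in the disadvantages above, regionQuery scans the whole dataset on every call, so the overall cost is quadratic. A drop-in variant backed by a KD-tree (a sketch assuming SciPy's cKDTree; it is not part of the referenced implementation) makes each neighbor query much cheaper on low-dimensional data:

```python
from scipy.spatial import cKDTree

def regionQueryKDTree(tree, D, P, eps):
    """KD-tree-backed neighbor search; build `tree = cKDTree(D)` once up front.

    Returns the indices of all points within distance eps of D[P],
    matching regionQuery's <= eps semantics."""
    return tree.query_ball_point(D[P], r=eps)
```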

Results

Data was generated with scikit-learn and the clustering results were compared against scikit-learn's built-in DBSCAN; the two agree.
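The original comparison script is not included in the post; the following is a minimal reconstruction (assuming MyDBSCAN from the listing above is in scope, with illustrative parameter values):

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

X, _ = datasets.make_moons(n_samples=300, noise=0.05, random_state=0)
eps, min_pts = 0.2, 5

my_labels = np.array(MyDBSCAN(X, eps, min_pts))                  # clusters from 1, noise = -1
sk_labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)  # clusters from 0, noise = -1

# The adjusted Rand index is invariant to how clusters are numbered; 1.0 means
# identical partitions. Border points reachable from two clusters may occasionally
# be attached differently depending on visit order, so a score just below 1.0
# is also consistent with correct behavior.
print("ARI:", adjusted_rand_score(my_labels, sk_labels))
```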
[Figures: clustering results of MyDBSCAN and scikit-learn's DBSCAN on the generated data]

References

DBSCAN密度聚类算法 ("DBSCAN density clustering algorithm", in Chinese)

