[DBSCAN] Clustering Method and Code Implementation

Tech沉思录

Article link: https://blog.csdn.net/suyunzzz/article/details/107695346

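Algorithm Overview

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a typical density-based clustering algorithm. Unlike K-Means and BIRCH, which generally suit only convex sample sets, DBSCAN works on both convex and non-convex sample sets. This article summarizes how the algorithm works.
Density-based clustering algorithms generally assume that cluster membership is determined by how tightly the samples are distributed. Samples of the same cluster are closely connected: for any sample in a cluster, another sample of the same cluster lies a short distance away.
Grouping each set of closely connected samples into one cluster, and doing so for every such set, yields the final clustering result.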

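Algorithm Input

The dataset to cluster, $D$
The neighborhood radius, $\epsilon$
MinPts, the threshold on the number of points inside a neighborhood, which expresses the required density

Algorithm Output

A cluster label for each point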

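Basic Concepts

$\epsilon$-neighborhood: for $x_j \in D$, the $\epsilon$-neighborhood is the subset of samples in $D$ whose distance to $x_j$ is at most $\epsilon$, i.e. $N_\epsilon(x_j) = \{x_i \in D \mid \mathrm{distance}(x_i, x_j) \le \epsilon\}$. The number of samples in this subset is written $|N_\epsilon(x_j)|$. Note that $x_j$ itself belongs to $N_\epsilon(x_j)$.
Core object: for any sample $x_j \in D$, if its $\epsilon$-neighborhood $N_\epsilon(x_j)$ contains at least MinPts samples, i.e. $|N_\epsilon(x_j)| \ge \mathrm{MinPts}$, then $x_j$ is a core object.
Directly density-reachable: if $x_i$ lies in the $\epsilon$-neighborhood of $x_j$ and $x_j$ is a core object, then $x_i$ is directly density-reachable from $x_j$. The converse does not hold unless $x_i$ is also a core object.
Density-reachable: chaining direct density-reachability steps gives density-reachability. For $x_i$ and $x_j$, if there is a sequence of samples $p_1, p_2, ..., p_T$ with $p_1 = x_i$, $p_T = x_j$, and each $p_{t+1}$ directly density-reachable from $p_t$, then $x_j$ is density-reachable from $x_i$. Density-reachability is therefore transitive. The samples $p_1, p_2, ..., p_{T-1}$ in the sequence must all be core objects, because only a core object can make another sample directly density-reachable. Density-reachability is not symmetric, which follows from the asymmetry of direct density-reachability.
Density-connected: for $x_i$ and $x_j$, if there is a core object $x_k$ such that both $x_i$ and $x_j$ are density-reachable from $x_k$, then $x_i$ and $x_j$ are density-connected. Density-connectedness is symmetric. Put simply, the two points can be linked through one or more core objects.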
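
The definitions above can be checked on a tiny example. The following is a minimal sketch using only the Python standard library; the data values, `eps`, `min_pts`, and the helper names `eps_neighborhood`, `is_core`, and `directly_reachable` are illustrative assumptions, not part of the referenced implementation.

```python
import math

# Toy 1-D dataset (hypothetical values chosen for illustration).
points = [(0.0,), (1.0,), (2.0,), (3.5,), (10.0,)]
eps, min_pts = 1.6, 3


def eps_neighborhood(points, j, eps):
    """Indices i with distance(points[i], points[j]) <= eps (includes j itself)."""
    return [i for i, p in enumerate(points) if math.dist(p, points[j]) <= eps]


def is_core(points, j, eps, min_pts):
    """A point is a core object if its eps-neighborhood holds >= min_pts samples."""
    return len(eps_neighborhood(points, j, eps)) >= min_pts


def directly_reachable(points, i, j, eps, min_pts):
    """True if point i is directly density-reachable from point j."""
    return is_core(points, j, eps, min_pts) and i in eps_neighborhood(points, j, eps)


core = [j for j in range(len(points)) if is_core(points, j, eps, min_pts)]
print(core)
print(directly_reachable(points, 3, 2, eps, min_pts))
print(directly_reachable(points, 2, 3, eps, min_pts))
```

Here only the points at indices 1 and 2 are core objects. Index 3 is directly density-reachable from index 2, but not the other way round, because index 3 is not a core object. Indices 0 and 3 are both density-reachable from index 1, so they are density-connected even though neither is a core object.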

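In the figure below, red points are core points and black points are non-core points. Each arrow denotes direct density-reachability, and a chain of arrows forms a density-reachable sequence. All the red points on the left, together with every point in their neighborhoods, are density-connected to one another.
[Figure: DBSCAN]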

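How the Algorithm Works

The overall flow is roughly:
Visit each point $P$. If $P$ is a core point, find its neighborhood points and use them as the seed queue of the current cluster. Then visit each neighbor $P_n$ in the queue: assign it to the current cluster, and if $P_n$ is itself a core point, append the neighborhood of $P_n$ to the seed queue of the current cluster. If an unvisited $P$ is not a core point, it is provisionally marked as noise; it may later be absorbed into a cluster as a border point.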

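Advantages and Disadvantages

Advantages

The number of clusters does not need to be specified.
It can cluster dense datasets of arbitrary shape, whereas algorithms such as K-Means generally suit only convex datasets.
It finds outliers while clustering and is insensitive to outliers in the dataset.
The clustering result is not biased by initialization, whereas for algorithms such as K-Means the initial values strongly affect the result. (Only border points that are reachable from more than one cluster can be assigned differently depending on the visiting order.)

Disadvantages

If the density of the sample set is uneven and the spacing between clusters varies widely, clustering quality is poor, and DBSCAN is generally not a good fit.
On large sample sets the clustering takes a long time; this can be mitigated by limiting the size of the KD-tree or ball tree built for the nearest-neighbor search.
Tuning is somewhat more involved than for traditional algorithms such as K-Means: the distance threshold $\epsilon$ and the neighborhood sample threshold MinPts must be tuned jointly, and different combinations can change the final result considerably.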
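
One common aid for the joint tuning mentioned above is the sorted k-distance heuristic, which the original text does not cover. It is sketched here under stated assumptions: the helper name `kth_neighbor_distances` and the toy data are hypothetical. Because the $\epsilon$-neighborhood counts the point itself, a point is a core object exactly when the distance to its (MinPts - 1)-th nearest other point is at most $\epsilon$, so a sharp jump in the sorted distances suggests a candidate $\epsilon$.

```python
import math


def kth_neighbor_distances(points, min_pts):
    """Sorted distance from each point to its (min_pts - 1)-th nearest other point.

    Assumes min_pts >= 2 and len(points) >= min_pts. A point is a core
    object for a given eps exactly when this distance is <= eps, because
    the eps-neighborhood also counts the point itself.
    """
    k = min_pts - 1
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in ascending order.
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)


# Hypothetical toy data: two tight groups plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
dists = kth_neighbor_distances(points, min_pts=3)
print([round(d, 2) for d in dists])
```

On this data the sorted list contains eight values of 1.0 followed by one much larger value, so any $\epsilon$ slightly above 1.0 with MinPts = 3 would make the eight grouped points core objects and leave the isolated point as noise. On real data the jump is rarely this clean, so the result is a starting point for tuning rather than a final answer.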

Implementation

Reference: https://github.com/chrisjmccormick/dbscan/blob/master/dbscan.py

import numpy

def MyDBSCAN(D, eps, MinPts):
    """
    Cluster the dataset `D` using the DBSCAN algorithm.
    
    MyDBSCAN takes a dataset `D` (a sequence of NumPy vectors, e.g. a 2-D
    array), a threshold distance `eps`, and a required number of points
    `MinPts`.
    
    It will return a list of cluster labels. The label -1 means noise, and then
    the clusters are numbered starting from 1.
    """
 
    # This list will hold the final cluster assignment for each point in D.
    # There are two reserved values:
    #    -1 - Indicates a noise point
    #     0 - Means the point hasn't been considered yet.
    # Initially all labels are 0.    
    labels = [0]*len(D)

    # C is the ID of the current cluster.    
    C = 0
    
    # This outer loop is just responsible for picking new seed points--a point
    # from which to grow a new cluster.
    # Once a valid seed point is found, a new cluster is created, and the 
    # cluster growth is all handled by the 'growCluster' routine.
    
    # For each point P in the Dataset D...
    # ('P' is the index of the datapoint, rather than the datapoint itself.)
    for P in range(0, len(D)):
    
        # Only points that have not already been claimed can be picked as new 
        # seed points.    
        # If the point's label is not 0, continue to the next point.
        if labels[P] != 0:
            continue
        
        # Find all of P's neighboring points.
        NeighborPts = regionQuery(D, P, eps)
        
        # If the number is below MinPts, this point is noise. 
        # This is the only condition under which a point is labeled 
        # NOISE--when it's not a valid seed point. A NOISE point may later 
        # be picked up by another cluster as a boundary point (this is the only
        # condition under which a cluster label can change--from NOISE to 
        # something else).
        if len(NeighborPts) < MinPts:
            labels[P] = -1
        # Otherwise, if there are at least MinPts nearby, use this point as the 
        # seed for a new cluster.    
        else:
            C += 1
            growCluster(D, labels, P, NeighborPts, C, eps, MinPts)
    
    # All data has been clustered!
    return labels


def growCluster(D, labels, P, NeighborPts, C, eps, MinPts):
    """
    Grow a new cluster with label `C` from the seed point `P`.
    
    This function searches through the dataset to find all points that belong
    to this new cluster. When this function returns, cluster `C` is complete.
    
    Parameters:
      `D`      - The dataset (a list of vectors)
      `labels` - List storing the cluster labels for all dataset points
      `P`      - Index of the seed point for this new cluster
      `NeighborPts` - All of the neighbors of `P`
      `C`      - The label for this new cluster.  
      `eps`    - Threshold distance
      `MinPts` - Minimum required number of neighbors
    """

    # Assign the cluster label to the seed point.
    labels[P] = C
    
    # Look at each neighbor of P (neighbors are referred to as Pn). 
    # NeighborPts will be used as a FIFO queue of points to search--that is, it
    # will grow as we discover new branch points for the cluster. The FIFO
    # behavior is accomplished by using a while-loop rather than a for-loop.
    # In NeighborPts, the points are represented by their index in the original
    # dataset.
    i = 0
    while i < len(NeighborPts):    
        
        # Get the next point from the queue.        
        Pn = NeighborPts[i]
       
        # If Pn was labelled NOISE during the seed search, then we
        # know it's not a branch point (it doesn't have enough neighbors), so
        # make it a leaf point of cluster C and move on.
        if labels[Pn] == -1:
            labels[Pn] = C
        
        # Otherwise, if Pn isn't already claimed, claim it as part of C.
        elif labels[Pn] == 0:
            # Add Pn to cluster C (Assign cluster label C).
            labels[Pn] = C
            
            # Find all the neighbors of Pn
            PnNeighborPts = regionQuery(D, Pn, eps)
            
            # If Pn has at least MinPts neighbors, it's a branch point!
            # Add all of its neighbors to the FIFO queue to be searched. 
            if len(PnNeighborPts) >= MinPts:
                NeighborPts = NeighborPts + PnNeighborPts
            # If Pn *doesn't* have enough neighbors, then it's a leaf point.
            # Don't queue up its neighbors as expansion points.
        
        # Advance to the next point in the FIFO queue.
        i += 1        
    
    # We've finished growing cluster C!


def regionQuery(D, P, eps):
    """
    Find all points in dataset `D` within distance `eps` of point `P`.
    
    This function calculates the distance between a point P and every other 
    point in the dataset, and then returns only those points which are within a
    threshold distance `eps`.
    """
    neighbors = []
    
    # For each point in the dataset...
    for Pn in range(0, len(D)):
        
        # If the distance is within the threshold (<= eps, matching the
        # eps-neighborhood definition), add it to the neighbors list.
        if numpy.linalg.norm(D[P] - D[Pn]) <= eps:
            neighbors.append(Pn)
            
    return neighbors

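Results

Test data was generated with scikit-learn and clustered both with the implementation above and with scikit-learn's built-in DBSCAN; the results agree.
[Figure]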
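
The original does not say which generator or parameters were used, so the following is only a sketch of how such a comparison could be set up. The choice of `make_moons`, the values of `eps` and `min_samples`, and the helper `same_partition` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Assumed test data: two interleaving half-moons, a non-convex shape.
X, _ = make_moons(n_samples=300, noise=0.03, random_state=0)

# scikit-learn numbers clusters from 0 and uses -1 for noise;
# min_samples counts the point itself, like MinPts above.
sk_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(sk_labels) - {-1})
n_noise = int(np.sum(sk_labels == -1))
print(n_clusters, n_noise)


def same_partition(a, b):
    """True if two label sequences describe the same grouping.

    Cluster IDs may differ (one implementation starts at 1, the other at 0),
    so this checks for a one-to-one mapping between the two label sets and
    requires noise (-1) to fall on the same points.
    """
    a, b = [int(x) for x in a], [int(y) for y in b]
    if [x == -1 for x in a] != [y == -1 for y in b]:
        return False
    pairs = set(zip(a, b))
    return len(pairs) == len(set(a)) == len(set(b))

# With MyDBSCAN from the implementation section in scope, one could run:
#   same_partition(MyDBSCAN(X, eps=0.2, MinPts=5), sk_labels)
```

Comparing partitions rather than raw labels avoids a spurious mismatch caused by the different numbering schemes. A remaining difference can still come from border points that are reachable from two clusters, since their assignment depends on the visiting order.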

References

DBSCAN密度聚类算法 (DBSCAN density clustering algorithm)
