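Algorithm Overview
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a typical density-based clustering algorithm. K-Means and BIRCH generally suit only convex sample sets, whereas DBSCAN works on both convex and non-convex sample sets. This post summarizes how the DBSCAN algorithm works.
Density-based clustering algorithms generally assume that cluster membership is determined by how tightly the samples are packed. Samples of the same cluster are tightly connected to one another; that is, close to any sample of a cluster there must be another sample of the same cluster.
Grouping a set of tightly connected samples into one class gives one cluster. Doing this for every group of tightly connected samples gives the final clustering result.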
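Algorithm Input
- The dataset to cluster, $D$
- The neighborhood radius $\epsilon$
- MinPts, the threshold on the number of points inside a neighborhood, which expresses density
Algorithm Output
- A label for every point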
Basic Concepts
- $\epsilon$-neighborhood: for $x_j \in D$, its $\epsilon$-neighborhood is the subset of samples in $D$ whose distance to $x_j$ is at most $\epsilon$, i.e. $N_\epsilon(x_j) = \{x_i \in D \mid \mathrm{distance}(x_i, x_j) \le \epsilon\}$. The number of samples in this subset is written $|N_\epsilon(x_j)|$.
- Core object: for any sample $x_j \in D$, if its $\epsilon$-neighborhood $N_\epsilon(x_j)$ contains at least MinPts samples, i.e. $|N_\epsilon(x_j)| \ge MinPts$, then $x_j$ is a core object.
- Directly density-reachable: if $x_i$ lies in the $\epsilon$-neighborhood of $x_j$ and $x_j$ is a core object, then $x_i$ is directly density-reachable from $x_j$. The converse does not necessarily hold unless $x_i$ is also a core object.
- Density-reachable: a chain of directly density-reachable steps. For $x_i$ and $x_j$, if there is a sample sequence $p_1, p_2, \ldots, p_T$ with $p_1 = x_i$, $p_T = x_j$, and each $p_{t+1}$ directly density-reachable from $p_t$, then $x_j$ is density-reachable from $x_i$. Density-reachability is therefore transitive. The samples $p_1, p_2, \ldots, p_{T-1}$ in the sequence are all core objects, because only a core object can make another sample directly density-reachable. Density-reachability is not symmetric, which follows from the asymmetry of direct density-reachability.
- Density-connected: for $x_i$ and $x_j$, if there is a core object $x_k$ such that both $x_i$ and $x_j$ are density-reachable from $x_k$, then $x_i$ and $x_j$ are density-connected. Density-connectedness is symmetric. Put simply, the two points can be linked through one or more core objects.
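In the figure, red points are core points, black points are non-core points, and each arrow is a directly density-reachable step. Several arrows in a row form a density-reachable sequence. All red points on the left, together with every point in their neighborhoods, are mutually density-connected.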
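The definitions above can be checked on a toy example. The following is a minimal sketch, not the implementation used later in this post: the helper names (`eps_neighborhood`, `is_core`, `directly_reachable`) and the four sample points are made up for illustration, and Euclidean distance is assumed.

```python
import math

def eps_neighborhood(points, j, eps):
    """Indices i with distance(points[i], points[j]) <= eps (includes j itself)."""
    return [i for i, p in enumerate(points) if math.dist(p, points[j]) <= eps]

def is_core(points, j, eps, min_pts):
    """x_j is a core object if its eps-neighborhood has at least min_pts members."""
    return len(eps_neighborhood(points, j, eps)) >= min_pts

def directly_reachable(points, i, j, eps, min_pts):
    """True if x_i is directly density-reachable from x_j."""
    return is_core(points, j, eps, min_pts) and i in eps_neighborhood(points, j, eps)

# Hypothetical toy data: three points close together and one slightly apart.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.9, 0.0)]
eps, min_pts = 1.0, 3

core = [j for j in range(len(points)) if is_core(points, j, eps, min_pts)]
print("core objects:", core)
print("3 from 2:", directly_reachable(points, 3, 2, eps, min_pts))
print("2 from 3:", directly_reachable(points, 2, 3, eps, min_pts))
```

Point 3 has only two samples in its neighborhood, so it is not a core object. It is directly density-reachable from point 2, but point 2 is not directly density-reachable from point 3, which illustrates the asymmetry.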
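How the Algorithm Works
The general flow is as follows.
Iterate over every point $P$. If $P$ is a core point, find the points in its neighborhood and use them as the seed queue of the current cluster. Then iterate over every neighbor $P_n$ in that queue: if $P_n$ is also a core point, append the neighborhood of $P_n$ to the seed queue of the current cluster.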
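Pros and Cons
Pros
- The number of clusters does not have to be given as input.
- It can cluster dense datasets of arbitrary shape, whereas K-Means and similar algorithms generally suit only convex datasets.
- It finds outliers while clustering and is insensitive to outliers in the dataset.
- The clustering result is not biased by initialization, whereas the initial values strongly affect the result of K-Means and similar algorithms.
Cons
- When the density of the sample set is uneven and the distances between clusters differ greatly, clustering quality is poor, and DBSCAN is generally not a good fit.
- On large sample sets clustering takes a long time. This can be mitigated by limiting the size of the KD-tree or ball tree built for the nearest-neighbor search.
- Tuning is somewhat more involved than for traditional algorithms such as K-Means. The distance threshold $\epsilon$ and the neighborhood sample threshold MinPts have to be tuned jointly, and different parameter combinations have a large effect on the final result.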
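To get a feel for how strongly $\epsilon$ and MinPts interact, one option is to count the core objects for a small grid of parameter values. The sketch below is only an illustration: `count_core` and the 1-D toy data are hypothetical, and Euclidean distance is assumed.

```python
import math

def count_core(points, eps, min_pts):
    """Count points whose eps-neighborhood (the point itself included) has >= min_pts members."""
    total = 0
    for p in points:
        size = sum(1 for q in points if math.dist(p, q) <= eps)
        if size >= min_pts:
            total += 1
    return total

# Hypothetical 1-D toy data: a dense group, a sparser group, and one isolated point.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (2.0,), (2.5,), (3.0,), (10.0,)]

for eps in (0.15, 0.6, 1.2):
    for min_pts in (2, 3, 4):
        n_core = count_core(points, eps, min_pts)
        print(f"eps={eps} MinPts={min_pts} core objects={n_core}")
```

With a small radius only the dense group produces core objects; with a larger radius the sparser group joins in. This is the uneven-density situation described above, where no single parameter pair fits both groups well.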
Implementation
Reference: https://github.com/chrisjmccormick/dbscan/blob/master/dbscan.py
import numpy


def MyDBSCAN(D, eps, MinPts):
    """
    Cluster the dataset `D` using the DBSCAN algorithm.
    MyDBSCAN takes a dataset `D` (a list of vectors), a threshold distance
    `eps`, and a required number of points `MinPts`.
    It will return a list of cluster labels. The label -1 means noise, and then
    the clusters are numbered starting from 1.
    """
    # This list will hold the final cluster assignment for each point in D.
    # There are two reserved values:
    #   -1 - Indicates a noise point
    #    0 - Means the point hasn't been considered yet.
    # Initially all labels are 0.
    labels = [0] * len(D)
    # C is the ID of the current cluster.
    C = 0
    # This outer loop is just responsible for picking new seed points, i.e.
    # points from which to grow a new cluster.
    # Once a valid seed point is found, a new cluster is created, and the
    # cluster growth is all handled by the 'growCluster' routine.
    # For each point P in the dataset D...
    # ('P' is the index of the datapoint, rather than the datapoint itself.)
    for P in range(0, len(D)):
        # Only points that have not already been claimed can be picked as new
        # seed points.
        # If the point's label is not 0, continue to the next point.
        if not (labels[P] == 0):
            continue
        # Find all of P's neighboring points.
        NeighborPts = regionQuery(D, P, eps)
        # If the number is below MinPts, this point is noise.
        # This is the only condition under which a point is labeled NOISE:
        # when it's not a valid seed point. A NOISE point may later be picked
        # up by another cluster as a boundary point (this is the only
        # condition under which a cluster label can change, from NOISE to
        # something else).
        if len(NeighborPts) < MinPts:
            labels[P] = -1
        # Otherwise, if there are at least MinPts nearby, use this point as the
        # seed for a new cluster.
        else:
            C += 1
            growCluster(D, labels, P, NeighborPts, C, eps, MinPts)
    # All data has been clustered!
    return labels

def growCluster(D, labels, P, NeighborPts, C, eps, MinPts):
    """
    Grow a new cluster with label `C` from the seed point `P`.
    This function searches through the dataset to find all points that belong
    to this new cluster. When this function returns, cluster `C` is complete.
    Parameters:
      `D`           - The dataset (a list of vectors)
      `labels`      - List storing the cluster labels for all dataset points
      `P`           - Index of the seed point for this new cluster
      `NeighborPts` - All of the neighbors of `P`
      `C`           - The label for this new cluster.
      `eps`         - Threshold distance
      `MinPts`      - Minimum required number of neighbors
    """
    # Assign the cluster label to the seed point.
    labels[P] = C
    # Look at each neighbor of P (neighbors are referred to as Pn).
    # NeighborPts will be used as a FIFO queue of points to search; that is, it
    # will grow as we discover new branch points for the cluster. The FIFO
    # behavior is accomplished by using a while-loop rather than a for-loop.
    # In NeighborPts, the points are represented by their index in the original
    # dataset.
    i = 0
    while i < len(NeighborPts):
        # Get the next point from the queue.
        Pn = NeighborPts[i]
        # If Pn was labelled NOISE during the seed search, then we
        # know it's not a branch point (it doesn't have enough neighbors), so
        # make it a leaf point of cluster C and move on.
        if labels[Pn] == -1:
            labels[Pn] = C
        # Otherwise, if Pn isn't already claimed, claim it as part of C.
        elif labels[Pn] == 0:
            # Add Pn to cluster C (assign cluster label C).
            labels[Pn] = C
            # Find all the neighbors of Pn.
            PnNeighborPts = regionQuery(D, Pn, eps)
            # If Pn has at least MinPts neighbors, it's a branch point!
            # Add all of its neighbors to the FIFO queue to be searched.
            if len(PnNeighborPts) >= MinPts:
                NeighborPts = NeighborPts + PnNeighborPts
            # If Pn *doesn't* have enough neighbors, then it's a leaf point.
            # Don't queue up its neighbors as expansion points, so there is
            # nothing to do in that case.
        # Advance to the next point in the FIFO queue.
        i += 1
    # We've finished growing cluster C!

def regionQuery(D, P, eps):
    """
    Find all points in dataset `D` within distance `eps` of point `P`.
    This function calculates the distance between a point P and every
    point in the dataset (P itself included), and then returns the indices of
    those points which are within the threshold distance `eps`.
    """
    neighbors = []
    # For each point in the dataset...
    for Pn in range(0, len(D)):
        # If the distance is at most the threshold, add it to the neighbors
        # list. `<=` matches the eps-neighborhood definition above; asarray
        # lets D be a list of plain Python lists as well as NumPy arrays.
        if numpy.linalg.norm(numpy.asarray(D[P]) - numpy.asarray(D[Pn])) <= eps:
            neighbors.append(Pn)
    return neighbors
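Results
The data was generated with scikit-learn, and the output was compared against scikit-learn's built-in DBSCAN implementation. The results agree.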
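One detail matters when comparing the two outputs: `MyDBSCAN` numbers clusters from 1, while scikit-learn numbers them from 0, so the label lists cannot be compared element by element. Below is a minimal sketch of a comparison that ignores cluster numbering. `same_partition` and `sklearn_labels` are hypothetical helper names introduced here, not the code behind the result above. scikit-learn is imported lazily so that the comparison helper itself has no dependencies.

```python
def same_partition(labels_a, labels_b, noise=-1):
    """True if two labelings agree up to renaming of cluster IDs.

    Noise points (label `noise`) must be noise in both labelings.
    """
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        if (a == noise) != (b == noise):
            return False
        if a == noise:
            continue
        # Each cluster ID in one labeling must map to exactly one ID in the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


def sklearn_labels(X, eps, min_pts):
    """Labels from scikit-learn's DBSCAN (requires scikit-learn to be installed)."""
    from sklearn.cluster import DBSCAN
    return DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X).tolist()


# Same grouping, different numbering: clusters {1, 2} versus {0, 1}.
print(same_partition([1, 1, 2, -1], [0, 0, 1, -1]))
```

A check along the lines of `same_partition(MyDBSCAN(X, eps, MinPts), sklearn_labels(X, eps, MinPts))` would then compare the two implementations on the same data. A border point that lies within `eps` of core points from two different clusters may be assigned to either one, depending on processing order, so a mismatch on such points would not necessarily indicate a bug.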