DBSCAN, a Density-Based Clustering Algorithm

journey_TripleP

Article link: https://blog.csdn.net/journey_TripleP/article/details/78604654

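Distance-based clustering algorithms (k-means, for example) tend to produce spherical clusters.

Density-based clustering algorithms can find clusters of arbitrary shape. They also cope well with noisy data, because points in sparse regions are labeled as noise instead of being forced into a cluster.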

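DBSCAN concepts

Neighborhood radius eps of point P: the radius of the region centered at P
eps-neighborhood of point P: the set of all points whose distance to P is <= eps (P itself included)
Density threshold minPts: a user-specified minimum number of points; it defines the lowest density that still counts as dense and filters out points in sparse regions
Core point: P is a core point if its eps-neighborhood contains at least minPts points
Border point: Q is a border point if its eps-neighborhood contains fewer than minPts points but Q lies in the eps-neighborhood of some core point P
Noise point: R is a noise point if it is neither a core point nor a border point
Directly density-reachable: q is directly density-reachable from p if q lies in the eps-neighborhood of p and p is a core point
Density-reachable: for a chain of points P1, P2, ..., Pn, if each Pi+1 is directly density-reachable from Pi (i = 1, ..., n-1), then Pn is density-reachable from P1 (direct reachability extended transitively)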
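
The core, border, and noise definitions above can be checked on a handful of points. The following is a minimal sketch, not the post's own code: the helper names `neighbors` and `classify` and the five sample points are made up for illustration, and points are stored as plain tuples rather than as columns of a NumPy array.

```python
import math

def neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    hoods = [neighbors(points, i, eps) for i in range(len(points))]
    core = {i for i, hood in enumerate(hoods) if len(hood) >= min_pts}
    labels = []
    for i, hood in enumerate(hoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in hood):
            # distance is symmetric, so "a core point is in my neighborhood"
            # is the same as "I am in a core point's neighborhood"
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical sample data: a short chain of points plus one far-away outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
labels = classify(points, eps=1.0, min_pts=3)
print(labels)  # ['core', 'border', 'core', 'border', 'noise']
```

With eps = 1 and minPts = 3, the points (0, 0) and (1, 0) each have three points in their eps-neighborhood and are core points. (0, 1) and (2, 0) have only two but sit next to a core point, so they are border points, and the outlier is noise.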

How the DBSCAN algorithm works

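Check whether one point lies within eps of another (the distance test behind direct density reachability)

import numpy as np  # dist() was never defined; Euclidean distance is assumed here

def eps_neighborhood(a, b, eps):
    return np.linalg.norm(a - b) <= eps  # "<=" matches the eps-neighborhood definition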

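Find the eps-neighborhood of a point

def region_query(dataSet, point_id, eps):
    n_points = dataSet.shape[1]  # each column of dataSet is one point, so shape[1] is the number of points
    seeds = []
    for i in range(0, n_points):
        # the scan includes point_id itself, so a point always counts toward its own neighborhood
        if eps_neighborhood(dataSet[:, point_id], dataSet[:, i], eps):
            seeds.append(i)
    return seeds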

Grow a cluster from a core point and merge

Merge the sets of points that are density-connected to each other

UNCLASSIFIED = 0  # label values are assumed here; the original never defines them
NOISE = -1

def expand_cluster(dataSet, clusterResults, point_id, cluster_id, eps, minPts):
    seeds = region_query(dataSet, point_id, eps)
    if len(seeds) < minPts:
        clusterResults[point_id] = NOISE  # provisionally noise; may become a border point later
        return False
    else:
        clusterResults[point_id] = cluster_id  # point_id is a core point: it starts the cluster
        for seed_id in seeds:
            clusterResults[seed_id] = cluster_id  # its whole eps-neighborhood joins the cluster

        while len(seeds) > 0:  # keep expanding: every seed is density-reachable from point_id
            current_point = seeds[0]
            expand_seeds = region_query(dataSet, current_point, eps)
            if len(expand_seeds) >= minPts:  # current_point is a core point
                for expand_id in range(0, len(expand_seeds)):
                    result_point = expand_seeds[expand_id]
                    if clusterResults[result_point] == UNCLASSIFIED:  # unvisited: claim it and expand from it later
                        seeds.append(result_point)
                        clusterResults[result_point] = cluster_id  # was "==", a comparison that assigned nothing
                    elif clusterResults[result_point] == NOISE:  # former noise: it is a border point of this cluster
                        clusterResults[result_point] = cluster_id  # was "==", a comparison that assigned nothing
            seeds = seeds[1:]
        return True

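Run DBSCAN and return the clustering result

def dbscan(dataSet, eps, minPts):
    # dataSet: array of shape (n_features, n_points), one point per column
    cluster_id = 1
    n_points = dataSet.shape[1]
    clusterResults = [UNCLASSIFIED] * n_points
    for point_id in range(0, n_points):
        if clusterResults[point_id] == UNCLASSIFIED:
            # try to grow a new cluster from this point; this succeeds only if it is a core point
            if expand_cluster(dataSet, clusterResults, point_id, cluster_id, eps, minPts):
                cluster_id = cluster_id + 1
    return clusterResults, cluster_id - 1  # per-point labels and the number of clusters found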
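
To see the whole procedure run end to end, here is a compact, self-contained sketch of the same idea using only the standard library. It is an illustration under stated assumptions rather than the code above: the name `dbscan_sketch`, the label values 0 and -1, and the sample points are all made up for this example. Points are tuples in a list instead of NumPy columns, and a `deque` replaces the `seeds = seeds[1:]` slicing. One behavioral difference: this version never reassigns a point that already belongs to another cluster.

```python
import math
from collections import deque

UNCLASSIFIED = 0   # assumed label values, chosen for this sketch
NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan_sketch(points, eps, min_pts):
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNCLASSIFIED:
            continue
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may be relabeled as a border point later
            continue
        cluster_id += 1                  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # former noise becomes a border point
            if labels[j] != UNCLASSIFIED:
                continue                 # already assigned, nothing to expand
            labels[j] = cluster_id
            hood = region_query(points, j, eps)
            if len(hood) >= min_pts:     # j is a core point: keep expanding from it
                queue.extend(hood)
    return labels, cluster_id

# Hypothetical sample data: two unit squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (10, 0)]
labels, n_clusters = dbscan_sketch(points, eps=1.0, min_pts=3)
print(labels, n_clusters)  # [1, 1, 1, 1, 2, 2, 2, 2, -1] 2
```

Each corner of a square has itself and two neighbors within distance 1, so every corner is a core point and each square becomes one cluster. The isolated point has only itself in its neighborhood and stays labeled as noise.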