Clustering Algorithms: A Summary of the OPTICS Algorithm

DBSCAN has some shortcomings, and the OPTICS algorithm was introduced to address them.

Background:

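DBSCAN requires the user to choose the neighborhood radius ϵ and the density threshold M by hand, and its results are very sensitive to these two hyperparameters: different initial settings can produce completely different clusterings. To address this, researchers proposed a new clustering algorithm, OPTICS. It is also density-based but, unlike DBSCAN, it is designed to be much less sensitive to the initial hyperparameter settings.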

Basic concepts:
core_distance: the core distance
reach_distance: the reachability distance
For the details of these concepts, see this blog post: (link)
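
As a rough illustration of the two quantities, here is a minimal NumPy sketch. It assumes Euclidean distance and the common convention that a point counts as its own first neighbor, so the core distance is the distance to the min_pts-th nearest point including itself. The function names simply reuse the labels above and are illustrative only; they are not part of sklearn, and the four points are a hand-made toy example.

```python
import numpy as np


def core_distance(X, p, min_pts, eps=np.inf):
    """Distance from X[p] to its min_pts-th nearest neighbor (X[p] itself included).

    Returns inf when that neighbor lies outside eps, i.e. X[p] is not a core point.
    """
    d = np.sort(np.linalg.norm(X - X[p], axis=1))
    cd = d[min_pts - 1]
    return cd if cd <= eps else np.inf


def reach_distance(X, o, p, min_pts, eps=np.inf):
    """Reachability distance of X[o] with respect to X[p]:
    the larger of p's core distance and the plain distance between o and p."""
    return max(core_distance(X, p, min_pts, eps), np.linalg.norm(X[o] - X[p]))


# Toy data (illustrative only): distances from point 0 are 0, 1, 2 and 5.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 0.0]])

cd0 = core_distance(X_toy, 0, min_pts=3)        # 2.0: third-nearest point, counting itself
rd_near = reach_distance(X_toy, 1, 0, min_pts=3)  # 2.0: point 1 is closer than the core distance
rd_far = reach_distance(X_toy, 3, 0, min_pts=3)   # 5.0: the plain distance dominates
```

Note how a point that lies inside the core radius (point 1) is "pushed out" to the core distance, while a point outside it keeps its ordinary distance.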

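Core ideas of OPTICS
Objects in denser clusters lie close to one another in the cluster ordering.
The smallest reachability distance of an object gives the shortest path connecting it to a dense cluster (this is why a sample's reachability distance is defined as the smallest of its reachability distances with respect to the core objects).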
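
The two ideas above can be sketched in a few lines of NumPy. This is a simplified O(n²) illustration, not sklearn's implementation: it assumes Euclidean distance, an unbounded radius (eps = infinity), and a core distance that counts the point itself. The function name optics_order and the six toy points are hypothetical.

```python
import numpy as np


def optics_order(X, min_pts):
    """Return (ordering, reachability) from a simple OPTICS-style pass with eps = inf."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    core = np.sort(dist, axis=1)[:, min_pts - 1]  # core distance of every point
    reach = np.full(n, np.inf)                    # smallest reachability found so far
    processed = np.zeros(n, dtype=bool)
    ordering = []
    for _ in range(n):
        # Visit the unprocessed point with the smallest reachability so far.
        candidates = np.flatnonzero(~processed)
        p = candidates[np.argmin(reach[candidates])]
        processed[p] = True
        ordering.append(int(p))
        # Each remaining point keeps the minimum over all processed points.
        rest = np.flatnonzero(~processed)
        reach[rest] = np.minimum(reach[rest], np.maximum(core[p], dist[p, rest]))
    return np.array(ordering), reach


# Two tight groups of three points, deliberately interleaved in the input.
X_demo = np.array([[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]], dtype=float)
order, reach = optics_order(X_demo, min_pts=2)
# order groups the first cluster (indices 0, 2, 4) before the second (1, 3, 5);
# reach[order] is small inside each group and jumps where the second group starts.
```

The jump in reach[order] at the boundary between the groups is exactly the "valley and peak" structure that a reachability plot visualizes.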

The code is given below:

Clustering is performed with the OPTICS class and the cluster_optics_dbscan function from the sklearn library.

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.cluster import DBSCAN
import numpy as np

G = gridspec.GridSpec(3, 2)
# --------------------------------- data -------------------------------
np.random.seed(0)  # fix the seed so that every run draws the same points
n_points_per_cluster = 250
C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)  # randn returns an (n, 2) array of standard-normal samples
C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * np.random.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * np.random.randn(n_points_per_cluster, 2)
C5 = [3, -2] + 1.6 * np.random.randn(n_points_per_cluster, 2)
C6 = [5, 6] + 2 * np.random.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, C6))  # stack the six clusters row-wise into one (1500, 2) array
# ----------------------- plot the original data ----------------------------------------
ax = plt.subplot(G[0, 0])
ax.scatter(X[:, 0], X[:, 1])
ax.set_title('Scatter Picture for Original Data')
# ------------------DBSCAN------------------------------
clustering_module_dbs = DBSCAN(eps=.6, min_samples=20).fit(X)  # fit() returns the fitted estimator
clustering_classif_dbs = clustering_module_dbs.labels_  # per-sample labels; same values as fit_predict(X), without refitting
ax2 = plt.subplot(G[1, 0])
ax2.scatter(X[:, 0], X[:, 1], c=clustering_module_dbs.labels_)
ax2.set_title('DBSCAN Scatter Picture for Eps=.6 Min_sample=20')
# -----------------OPTICS--------------------------------
clustering_module_opt = OPTICS(min_samples=50, min_cluster_size=.05, xi=.05).fit(X)  # fit the model once
clustering_classif_opt = clustering_module_opt.labels_  # per-sample labels; same values as fit_predict(X), without refitting

space = np.arange(len(X))
reachability = clustering_module_opt.reachability_[clustering_module_opt.ordering_]
labels = clustering_module_opt.labels_[clustering_module_opt.ordering_]
colors = ['g.', 'r.', 'b.', 'y.', 'c.']

ax3 = plt.subplot(G[0, 1])
# draw the reachability plot, one color per cluster
for klass, color in zip(range(0, 5), colors):
    # zip() pairs up the corresponding elements of its iterables into tuples.
    # In Python 3 it returns a lazy iterator (wrap it in list() to see the tuples),
    # it stops at the shortest input, and zip(*pairs) undoes the pairing.
    # >>> a, b, c = [1, 2, 3], [4, 5, 6], [4, 5, 6, 7, 8]
    # >>> list(zip(a, b))
    # [(1, 4), (2, 5), (3, 6)]
    # >>> list(zip(a, c))  # truncated to the shorter list
    # [(1, 4), (2, 5), (3, 6)]
    # >>> list(zip(*zip(a, b)))  # "unzip"
    # [(1, 2, 3), (4, 5, 6)]
    Xk = space[labels == klass]
    Rk = reachability[labels == klass]
    ax3.plot(Xk, Rk, color, alpha=0.3)
ax3.plot(space[labels == -1], reachability[labels == -1], 'k.', alpha=0.3)  # noise # labels == -1
ax3.plot(space, np.full_like(space, 2., dtype=float), linestyle='-.', color='b', alpha=.5)  # cut at eps = 2
ax3.plot(space, np.full_like(space, .5, dtype=float), linestyle='-', color='b', alpha=.5)  # cut at eps = 0.5
ax3.set_ylabel('Reachability (epsilon distance)')
ax3.set_title('Reachability Plot')
# ------------ remove noise points ------------------------------------
print("OPTICS found {} cluster(s)".format(clustering_module_opt.labels_.max() + 1))
ax4 = plt.subplot(G[1, 1])  # G[1, :] would overlap the DBSCAN panel at G[1, 0]
ax4.scatter(X[:, 0][clustering_classif_opt != -1], X[:, 1][clustering_classif_opt != -1],
            c=clustering_classif_opt[clustering_classif_opt != -1])  # drop the noise points
ax4.set_title('OPTICS Scatter Picture without Noise Points')

# --------------------- eps=0.5 ------ eps=2 ----------------------------------------------------
clustering_05 = cluster_optics_dbscan(reachability=clustering_module_opt.reachability_,
                                      core_distances=clustering_module_opt.core_distances_,
                                      ordering=clustering_module_opt.ordering_,
                                      eps=.5)  # returns DBSCAN-style labels, one per sample (-1 = noise)
clustering_20 = cluster_optics_dbscan(reachability=clustering_module_opt.reachability_,
                                      core_distances=clustering_module_opt.core_distances_,
                                      ordering=clustering_module_opt.ordering_,
                                      eps=2)
colors1 = ['g', 'm', 'y', 'c']  # for eps=2
colors2 = ['g', 'greenyellow', 'olive', 'r', 'b', 'c']  # for eps=.5

ax5 = plt.subplot(G[2, 0])
for c, color in zip(range(6), colors2):
    Xi = X[clustering_05 == c]
    ax5.plot(Xi[:, 0], Xi[:, 1], '+', alpha=0.9, color=color)
ax5.plot(X[:, 0][clustering_05 == -1], X[:, 1][clustering_05 == -1], 'k+', alpha=0.1)  # noise points with marker '+'
ax5.set_title('OPTICS for eps=.5 with noise points')

ax6 = plt.subplot(G[2, 1])
for c, color in zip(range(4), colors1):
    Xi = X[clustering_20 == c]
    ax6.plot(Xi[:, 0], Xi[:, 1], '+', alpha=0.9, color=color)
ax6.plot(X[:, 0][clustering_20 == -1], X[:, 1][clustering_20 == -1], 'k+', alpha=0.1)  # noise points with marker '+'
ax6.set_title('OPTICS for eps=2 with noise points')
# --------show fig-----------------------------------
plt.show()

Results:
[Figure: OPTICS results]
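
The script calls cluster_optics_dbscan twice on a single OPTICS fit to obtain DBSCAN-like labels for two different eps values. The idea behind such a cut can be sketched with NumPy alone: walk the points in cluster order and start a new cluster whenever a point's reachability exceeds eps while the point is still a core point at that eps. The helper name labels_at_eps is hypothetical, the seven values are hand-made toy numbers rather than output of the script above, and this is an illustration of the idea rather than a statement of the library's exact code.

```python
import numpy as np


def labels_at_eps(reachability, core_distances, ordering, eps):
    """DBSCAN-like labels obtained by cutting an OPTICS ordering at a given eps."""
    far_reach = reachability > eps      # not reachable from earlier points at this eps
    near_core = core_distances <= eps   # dense enough to be a core point at this eps
    labels = np.empty(len(ordering), dtype=int)
    # In cluster order, every "far but core" point opens a new cluster.
    labels[ordering] = np.cumsum(far_reach[ordering] & near_core[ordering]) - 1
    # Points that are neither reachable nor core are noise.
    labels[far_reach & ~near_core] = -1
    return labels


# Toy values, indexed by sample: two groups (0, 2, 4) and (1, 3, 5) plus one outlier (6).
toy_reach = np.array([np.inf, 13.45, 1.0, 1.0, 1.0, 1.0, 50.0])
toy_core = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 50.0])
toy_order = np.array([0, 2, 4, 1, 3, 5, 6])

labels_small = labels_at_eps(toy_reach, toy_core, toy_order, eps=2.0)   # two clusters + noise
labels_large = labels_at_eps(toy_reach, toy_core, toy_order, eps=20.0)  # one cluster + noise
```

With the small eps the jump of 13.45 separates the two groups; with the large eps the same jump falls below the cut and the groups merge, which is why one OPTICS run can stand in for many DBSCAN runs.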
