Implementing the OPTICS Clustering Algorithm

ZHW_AI Research Group

Last modified 2024-06-19 14:53:26

First published 2024-06-13 15:59:09

Article link: https://blog.csdn.net/m0_37758063/article/details/139656117

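1. About the Authors

Li Zheng (李政), male, graduate student (class of 2023), School of Electronics and Information, Xi'an Polytechnic University
Research interests: machine vision and artificial intelligence
Email: m17855081621@163.com

Xu Da (徐达), male, graduate student (class of 2023), School of Electronics and Information, Xi'an Polytechnic University, Zhang Hongwei Artificial Intelligence Research Group
Research interests: machine vision and artificial intelligence
Email: 1374455905@qq.com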

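2. The OPTICS Clustering Algorithm

2.1 Basic Concepts

1. Core point
A point is a core point if at least MinPts neighbor points lie within a given radius ε (epsilon) of it.
2. ε-neighborhood
The ε-neighborhood of a point is the set of all points within radius ε of it.
3. Core distance
The core distance of a core point is the distance to its MinPts-th nearest neighbor, that is, the smallest radius at which the point still qualifies as a core point. A point that is not a core point has no core distance.
4. Reachability distance
For a core point p and a point o in its neighborhood, the reachability distance of o with respect to p is max(core-distance(p), distance(p, o)). If p is not a core point, the reachability distance is undefined.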
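
To make the two distance definitions concrete, here is a minimal sketch in plain Python. The function names, the sample points, and the values of eps and min_pts are all illustrative assumptions. It follows the convention used above, where a point does not count as its own neighbor; some libraries count the point itself, which shifts the result by one neighbor.

```python
import math

def neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i], excluding i itself."""
    return [j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) <= eps]

def core_distance(points, i, eps, min_pts):
    """Distance from points[i] to its min_pts-th nearest neighbor inside the
    eps-neighborhood, or None if points[i] is not a core point."""
    dists = sorted(math.dist(points[i], points[j])
                   for j in neighbors(points, i, eps))
    if len(dists) < min_pts:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, p, o, eps, min_pts):
    """Reachability distance of points[o] with respect to points[p]:
    max(core_distance(p), dist(p, o)), or None if p is not a core point."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[o]))

# Hypothetical toy data: four nearby points and one far-away point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (10.0, 10.0)]

print(core_distance(points, 0, eps=4.0, min_pts=2))             # 2.0
print(reachability_distance(points, 0, 1, eps=4.0, min_pts=2))  # 2.0
print(reachability_distance(points, 0, 3, eps=4.0, min_pts=2))  # 3.0
print(core_distance(points, 4, eps=4.0, min_pts=2))             # None
```

Point 1 is closer to point 0 than the core distance, so its reachability distance is lifted to the core distance; point 3 is farther away, so its own distance wins.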

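2.2 Algorithm Flow

1. Initialization: mark every point as unprocessed. Allocate two arrays, one for core distances and one for reachability distances.
2. Select an unprocessed point: pick any unprocessed point (for example at random), mark it as processed, append it to the output order, and compute its core distance. Its reachability distance stays undefined because it has no predecessor.
3. Update the priority queue (only if the current point is a core point):
(1) For each neighbor of the current point, compute its reachability distance with respect to the current point.
(2) If the neighbor is unprocessed, put it into a priority queue (usually a min-heap) ordered by reachability distance; if it is already queued with a larger value, lower its value.
4. Process the priority queue:
Take the point with the smallest reachability distance from the queue, mark it as processed, and append it to the output order.
Repeat step 3 for that point, until the queue is empty.
5. Generate the cluster ordering:
Return to step 2 and repeat until every point has been processed.
The order in which the points were processed is the cluster ordering produced by OPTICS.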
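
The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the scikit-learn implementation: the function name optics_order and the toy data are assumptions, neighbors are found by brute force, start points are taken in index order instead of at random, and a stale-entry check stands in for a decrease-key operation on the heap.

```python
import heapq
import math

def optics_order(points, eps, min_pts):
    """Return (order, reach, core) for a list of points (brute-force sketch)."""
    n = len(points)
    processed = [False] * n
    reach = [math.inf] * n   # reachability distance per point (inf = undefined)
    core = [math.inf] * n    # core distance per point (inf = not a core point)
    order = []

    def neighbors(i):
        return [j for j in range(n)
                if j != i and math.dist(points[i], points[j]) <= eps]

    def core_dist(i, nbrs):
        d = sorted(math.dist(points[i], points[j]) for j in nbrs)
        return d[min_pts - 1] if len(d) >= min_pts else math.inf

    def update(i, nbrs, heap):
        # Step 3: push improved reachability distances of unprocessed neighbors.
        for j in nbrs:
            if processed[j]:
                continue
            new_reach = max(core[i], math.dist(points[i], points[j]))
            if new_reach < reach[j]:
                reach[j] = new_reach
                heapq.heappush(heap, (new_reach, j))

    def visit(i, heap):
        # Mark the point, record it in the ordering, and expand it if it is core.
        nbrs = neighbors(i)
        processed[i] = True
        order.append(i)
        core[i] = core_dist(i, nbrs)
        if core[i] != math.inf:
            update(i, nbrs, heap)

    for start in range(n):            # step 2 (index order instead of random)
        if processed[start]:
            continue
        heap = []
        visit(start, heap)
        while heap:                   # step 4
            _, j = heapq.heappop(heap)
            if not processed[j]:      # skip stale heap entries
                visit(j, heap)
    return order, reach, core

# Hypothetical toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
order, reach, core = optics_order(points, eps=3.0, min_pts=2)
print(order)   # [0, 1, 2, 3, 4, 5, 6]
print(reach)   # [inf, 1.0, 1.0, inf, 1.0, 1.0, inf]
```

Each group starts with an undefined (infinite) reachability distance and continues with small values, which is the peak-then-valley pattern that a reachability plot shows.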

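Algorithm Output
The output of OPTICS is an ordered sequence of points, each with a core distance and a reachability distance. Plotting the reachability distances in that order gives the reachability plot: valleys correspond to dense regions (clusters) and peaks separate them, so clusters of different densities can be read off from how the distances change.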
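
One simple way to turn such an ordering into cluster labels is to cut the reachability plot at a fixed height, as in the DBSCAN-style extraction described in the original OPTICS paper. The sketch below illustrates that idea only; it is not the xi method that the code in this article uses, and the function name, the threshold, and the hand-written input values are assumptions.

```python
import math

def extract_clusters(order, reach, core, threshold):
    """Label points by cutting the reachability plot at `threshold`.
    Returns one label per point; -1 marks noise."""
    labels = [-1] * len(order)
    cluster_id = -1
    for p in order:
        if reach[p] > threshold:
            # A peak: p is not reachable from the previous points at this
            # threshold. It opens a new cluster only if it is dense enough itself.
            if core[p] <= threshold:
                cluster_id += 1
                labels[p] = cluster_id
        else:
            # A valley: p continues the current cluster.
            labels[p] = cluster_id
    return labels

inf = math.inf
# Hand-written example values (assumption): two valleys and one isolated point.
order = [0, 1, 2, 3, 4, 5, 6]
reach = [inf, 1.0, 1.0, inf, 1.0, 1.0, inf]
core = [1.0, 1.4, 1.4, 1.0, 1.4, 1.4, inf]

labels = extract_clusters(order, reach, core, threshold=2.0)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Lowering the threshold splits valleys into smaller clusters and turns more points into noise, which is the same trade-off the parameter discussion later in this article deals with.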

3. Implementing OPTICS Clustering

3.1 Dataset

The cluster dataset consists of random two-dimensional points that are roughly grouped into clusters.
Dataset download link:
Link: https://pan.baidu.com/s/130fEMAdKUwy2eUV8maLSnw
Extraction code: 2024

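3.2 Program Overview

1. Import the required libraries
2. Read the CSV dataset
3. Plot the two-dimensional dataset
4. Plot the reachability bar chart and print the reachability distances
5. Main program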

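3.3 Experimental Results

1. Reachability bar chart
The valleys in the chart suggest that the data splits into roughly 3 or 4 clusters, which leads to two candidate parameter sets: min_samples=10, xi=0.01, min_cluster_size=0.03, or min_samples=50, xi=0.01, min_cluster_size=0.15.
[Figure: reachability bar chart]
2. Reachability distances printed in cluster order

3. Clustering result after tuning the parameters from the bar chart
[Figure: clustering result for min_samples=10, xi=0.01, min_cluster_size=0.0]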
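
A note on reading these parameters: the scikit-learn documentation describes a float min_cluster_size (and min_samples) as a fraction of the number of samples, rounded so that it is at least 2. The helper below mirrors that rule as a rough sketch; the name effective_size and the sample count of 1000 are hypothetical, since the size of the article's dataset is not stated.

```python
def effective_size(value, n_samples):
    """Convert a min_samples / min_cluster_size setting into a point count.
    Values <= 1 are read as a fraction of n_samples, with a floor of 2;
    larger values are read as an absolute count."""
    if value <= 1:
        return max(2, int(value * n_samples))
    return int(value)

n_samples = 1000  # hypothetical dataset size, for illustration only
for setting in (0.03, 0.15, 50):
    print(setting, "->", effective_size(setting, n_samples), "points")
```

With a dataset of that hypothetical size, 0.03 would allow clusters of about 30 points while 0.15 would require about 150, which is why the second parameter set yields fewer, larger clusters.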

3.4 Complete Code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import OPTICS

# Read the CSV file
def read_csv(file_path):
    try:
        data = pd.read_csv(file_path)
        return data
    except FileNotFoundError:
        print("Error: The file was not found.")
        return None
    except pd.errors.EmptyDataError:
        print("Error: The file is empty.")
        return None
    except pd.errors.ParserError:
        print("Error: The file could not be parsed.")
        return None

# Draw a 2D scatter plot of the raw data
def plot_data(data, x_column, y_column):
    if x_column not in data.columns or y_column not in data.columns:
        print(f"Error: Columns {x_column} and/or {y_column} not found in the data.")
        return

    plt.figure(figsize=(10, 6))
    plt.scatter(data[x_column], data[y_column], color='blue', marker='o')
    plt.title('Scatter Plot')
    plt.xlabel(x_column)
    plt.ylabel(y_column)
    plt.grid(True)
    plt.show()

# Draw the reachability bar chart (reachability plot)
def plot_reachability(optics_model):
    # reachability_ is indexed by sample; reorder it into cluster order
    reachability = optics_model.reachability_[optics_model.ordering_]
    space = np.arange(len(reachability))
    plt.figure(figsize=(10, 6))
    plt.bar(space, reachability, color='b')
    plt.title('Reachability Plot')
    plt.xlabel('Cluster Order')
    plt.ylabel('Reachability Distance')
    plt.show()

# Print each point's reachability distance in cluster order
def print_reachability_path(optics_model):
    reachability = optics_model.reachability_
    ordering = optics_model.ordering_
    for order, reach in zip(ordering, reachability[ordering]):
        print(f"Point {order}: Reachability Distance = {reach}")

# Main program
def main():
    file_path = r'C:\Users\t\Desktop\cluster2.csv'  # path to your CSV file
    data = read_csv(file_path)
    if data is None:
        return
    x_column = data.columns[0]
    y_column = data.columns[1]
    plot_data(data, x_column, y_column)
    # Extract the feature matrix
    X = data[[x_column, y_column]].values
    # Create the OPTICS model and fit it to the data
    optics = OPTICS(min_samples=50, xi=0.01, min_cluster_size=0.15)
    optics.fit(X)
    # Get the cluster labels (-1 marks noise points)
    labels = optics.labels_

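    # Plot the clustering result
    plt.figure(figsize=(10, 6))
    unique_labels = set(labels)
    for label in unique_labels:
        cluster_points = X[labels == label]
        # scikit-learn assigns the label -1 to noise points
        name = 'Noise' if label == -1 else f'Cluster {label}'
        plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=name)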
    plt.title('OPTICS Clustering')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()

    # Print the reachability distances in cluster order
    print("Reachability Path:")
    print_reachability_path(optics)
    # Draw the reachability bar chart
    plot_reachability(optics)

if __name__ == "__main__":
    main()

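4. Problems and Analysis

Parameter selection
As noted above, the reachability distances suggest two candidate parameter sets, so the question is how to choose between them. In the reachability chart most points sit in valleys, where the reachability distance is small, while the peaks are points with a large reachability distance, meaning they lie far from any cluster; this is why the data can be read as three or four clusters. The third valley, however, contains only a few points. Choosing the parameters that match it produces too many clusters and too many outliers, as in the following figure:
[Figure: clustering result with too many clusters and outliers]
We therefore discard that valley and use the parameters that correspond to the valleys with more points, which gives a better result.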
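
The comparison can also be put into numbers by counting clusters and noise points for each candidate parameter set. The sketch below does this on synthetic data, since the article's CSV file is not reproduced here: the make_blobs settings and the helper name summarize are assumptions, and the counts it prints will differ from those of the original dataset.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

def summarize(labels):
    """Return (number of clusters, number of noise points) for a label array."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_noise

# Synthetic stand-in for the article's dataset (assumption: three blobs).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# The two candidate parameter sets discussed in the article.
candidates = [
    dict(min_samples=10, xi=0.01, min_cluster_size=0.03),
    dict(min_samples=50, xi=0.01, min_cluster_size=0.15),
]

results = []
for params in candidates:
    labels = OPTICS(**params).fit(X).labels_
    results.append((params, summarize(labels)))
    print(params, "-> (clusters, noise points):", results[-1][1])
```

A parameter set that reports many clusters together with a large noise count is showing the over-segmentation described above, and is a reasonable candidate to reject.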