Data Science Fundamentals Lab Assignment (3): Nanjing Tech University, Spring 2023

Latest recommended article published on 2024-07-14 16:05:43

Melody_0v0

Article link: https://blog.csdn.net/qq_41723563/article/details/140418178

Data Science lab assignment, April 20

——made by njtech_计2104 Melody

Task 4: Cluster Analysis

4.1 Principle and Workflow

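K-means is a widely used clustering algorithm that partitions a dataset into K clusters so that objects in the same cluster are highly similar and objects in different clusters are dissimilar. It works by minimizing the within-cluster sum of squared errors, that is, by making the squared distance from each data point to the center of its assigned cluster as small as possible. The basic procedure is as follows: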

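1. Choose K: decide how many clusters the data should be split into. The choice of K directly affects the clustering result, but there is no universally optimal way to pick it. It is usually chosen from the problem background, domain knowledge, and practical requirements.
2. Initialize the cluster centers: randomly select K data points from the dataset as the initial centers. Other initialization methods, such as K-means++, can be used instead to improve convergence speed and stability.
3. Assign data points: compute the distance (usually Euclidean) from each data point to every cluster center and assign the point to the nearest center, forming K clusters.
4. Update the cluster centers: for each cluster, take the mean of its data points as the new center.
5. Check convergence: if the new centers are the same as those of the previous iteration (or have changed very little), the algorithm has converged and iteration stops; otherwise return to step 3 and continue.
6. Output the result: after convergence, output the K clusters and their centers.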
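
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used by sklearn: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example, and it uses plain random initialization instead of K-means++.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means with random initialization and Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        # Step 5: stop once the centers have (almost) stopped moving
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    # Within-cluster sum of squared errors, the quantity K-means minimizes
    sse = float(((X - centers[labels]) ** 2).sum())
    # Step 6: return the cluster labels, the centers, and the objective value
    return labels, centers, sse

# Hypothetical toy data: two 2-D blobs
toy_rng = np.random.default_rng(1)
toy = np.vstack([toy_rng.normal(0, 0.3, (20, 2)), toy_rng.normal(5, 0.3, (20, 2))])
toy_labels, toy_centers, toy_sse = kmeans_sketch(toy, k=2)
print(toy_centers, toy_sse)
```

Because the result depends on the random initial centers, different seeds can give different clusterings; library implementations usually run several initializations and keep the one with the lowest sum of squared errors.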

4.2-3 Iris Data Processing

Cluster the iris dataset with the K-means algorithm, using the preprocessed dataset built into sklearn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from matplotlib import gridspec

# 1. Load the dataset
iris = datasets.load_iris()
X = iris.data  # 2-D array of 150 samples (50 per species); each sample has four features: sepal length, sepal width, petal length, petal width
y = iris.target  # 1-D array of the 150 class labels (0, 1, or 2): 0 = Iris setosa, 1 = Iris versicolor, 2 = Iris virginica

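# 2. Create the K-means model with K = 3 (the iris dataset has three species).
# n_init is set explicitly because its default differs between sklearn versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# 3. Fit the model and predict the cluster labels
kmeans.fit(X)
y_kmeans = kmeans.predict(X)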

# Visualize the clustering result (only the first two features are plotted; the clustering itself used all four)
label_names = ["setosa", "versicolor", "virginica"]
colors = ['purple', 'green', 'yellow']
# Build a custom subplot layout with GridSpec
fig = plt.figure(figsize=(12, 8))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])

# Plot 1: Original data distribution
ax0 = plt.subplot(gs[:, 0])
for i, (color, label_name) in enumerate(zip(colors, label_names)):
    indices = y == i
    # cmap is dropped: it is ignored when c is a single named color
    scatter0 = ax0.scatter(X[indices, 0], X[indices, 1], c=color, edgecolor='k', s=50, label=label_name)
ax0.set_title('Original data distribution')
ax0.set_xlabel('Sepal length')
ax0.set_ylabel('Sepal width')
ax0.legend()

# Plot 2: Clusters by K-means, colored by predicted cluster.
# Cluster indices are arbitrary, so the colors need not match the species colors in the other plots.
ax1 = plt.subplot(gs[0, 1])
for i, color in enumerate(colors):
    indices = y_kmeans == i
    scatter1 = ax1.scatter(X[indices, 0], X[indices, 1], c=color, edgecolor='k', s=50, label=f'Cluster {i}')
centers = ax1.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='*', s=200, label='Centroids')
ax1.set_title('Clusters by K-means')
ax1.set_xlabel('Sepal length')
ax1.set_ylabel('Sepal width')
ax1.legend()

# Plot 3: Ground truth
ax2 = plt.subplot(gs[1, 1])
for i, (color, label_name) in enumerate(zip(colors, label_names)):
    indices = y == i
    scatter2 = ax2.scatter(X[indices, 0], X[indices, 1], c=color, edgecolor='k', s=50, label=label_name)
ax2.set_title('Ground truth')
ax2.set_xlabel('Sepal length')
ax2.set_ylabel('Sepal width')
ax2.legend(title='Species')

plt.tight_layout()
plt.show()
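
The script above fixes K at 3 because the number of species is known in advance. When K is not known, one common heuristic is to compare the inertia (within-cluster sum of squared errors) and the silhouette score over a range of K values. The sketch below is one way to do that; the range 2 to 6 and the variable names are assumptions made for this example, and neither measure gives a definitive answer.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_iris = datasets.load_iris().data

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_iris)
    inertias[k] = model.inertia_  # within-cluster sum of squared errors
    silhouettes[k] = silhouette_score(X_iris, model.labels_)  # in [-1, 1], higher is better
    print(f"K={k}: inertia={inertias[k]:.2f}, silhouette={silhouettes[k]:.3f}")
```

Inertia generally keeps falling as K grows, so it is the bend in the curve that matters rather than the minimum; the silhouette score can be compared directly across K values.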

4.4 Computing Accuracy

Compare the clustering result with the ground-truth labels and compute the clustering accuracy.

import numpy as np
from sklearn.metrics import accuracy_score
from scipy.optimize import linear_sum_assignment

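# Find the best cluster-to-class label mapping with the Hungarian algorithm
def label_mapping(y_true, y_pred):
    n_clusters = len(np.unique(y_true))
    # Rows are predicted clusters, columns are true classes
    cost_matrix = np.zeros((n_clusters, n_clusters))

    for i in range(n_clusters):
        for j in range(n_clusters):
            cost_matrix[i, j] = np.sum((y_pred == i) & (y_true == j))

    row_ind, col_ind = linear_sum_assignment(cost_matrix, maximize=True)
    # col_ind[i] is the true class matched to predicted cluster i
    return col_ind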

# Compute the clustering accuracy
def clustering_accuracy(y_true, y_pred):
    mapping = label_mapping(y_true, y_pred)
    y_pred_mapped = np.array([mapping[label] for label in y_pred])
    return accuracy_score(y_true, y_pred_mapped)

accuracy = clustering_accuracy(y, y_kmeans)
print(f"Clustering accuracy: {accuracy:.4f}")
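
To see what the label mapping does, here is a small self-contained check on hypothetical toy labels (the arrays and variable names are made up for this example). The clusters are a relabeling of the true classes with one point misassigned, so the mapped accuracy should come out at 8/9. The mapping goes from cluster label to class label, which matters whenever the relabeling is not its own inverse. The adjusted Rand index from sklearn is printed alongside as a permutation-invariant measure that needs no mapping at all.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Hypothetical toy labels: true class 0 -> cluster 1, 1 -> cluster 2, 2 -> cluster 0,
# with one class-1 point wrongly placed in cluster 0
toy_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
toy_pred = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

# Contingency table: rows are predicted clusters, columns are true classes
n = 3
table = np.zeros((n, n), dtype=int)
for p, t in zip(toy_pred, toy_true):
    table[p, t] += 1

# Match each cluster (row) to a class (column) so that the total overlap is maximal
clusters, classes = linear_sum_assignment(table, maximize=True)
cluster_to_class = dict(zip(clusters, classes))
toy_mapped = np.array([cluster_to_class[p] for p in toy_pred])

toy_acc = accuracy_score(toy_true, toy_mapped)
toy_ari = adjusted_rand_score(toy_true, toy_pred)
print(f"mapped accuracy: {toy_acc:.4f}, ARI: {toy_ari:.4f}")
```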