[Notes] [Machine Learning Basics] k-Means Clustering

Latest recommended article published on 2024-03-09 20:15:17

'VeNus

Category column: Reading Notes | Tags: clustering, machine learning, mean algorithm

Article link: https://blog.csdn.net/qq_47809408/article/details/125121829

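This article walks through how the k-means clustering algorithm works, covering its initialization, cluster assignment, and center recomputation steps, and uses code examples to apply k-means to synthetic data. It also discusses the limitations of k-means, such as its difficulty with clusters of uneven density, non-spherical clusters, and clusters with complex shapes. Finally, it looks at the role of k-means in vector quantization and data encoding, and at how increasing the number of clusters makes the representation more expressive.

Summary generated by CSDN using AI technology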

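Clustering: partitioning a dataset into groups, called clusters.
Points within a cluster are similar to each other, while points in different clusters are very different.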

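I. k-means clustering

Goal: find cluster centers that are representative of certain regions of the data
Algorithm: alternate between the following two steps until the cluster assignments no longer change
1. Assign each data point to the closest cluster center
2. Set each cluster center to the mean of all the data points assigned to it
(1) Input data and the three steps of the k-means algorithm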

import mglearn
mglearn.plots.plot_kmeans_algorithm()

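[Figure: input data and three steps of the k-means algorithm]
△: cluster centers
○: data points
Color: cluster membership
The algorithm first initializes by declaring three randomly chosen data points to be the cluster centers, then iterates (panel 2, "Initialization").
1. Assign points: each data point is assigned to its closest cluster center (panel 3, "Assign Points")
2. Recompute centers: each cluster center is set to the mean of all the data points assigned to it (panels 4, 6, and 8, "Recompute Centers")
3. Repeat steps 1 and 2 until the cluster assignments no longer change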
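
The loop described above can be sketched in plain NumPy. This is only a minimal illustration, not how scikit-learn implements KMeans: the function name kmeans_sketch, the choice to keep an old center when a cluster becomes empty, and the synthetic demo data are all assumptions made for this example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate assignment and center updates until labels stop changing."""
    rng = np.random.RandomState(seed)
    # Initialization: pick k distinct data points as the starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 1: assign every point to its nearest center (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments no longer change
        labels = new_labels
        # Step 2: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, labels


# Three synthetic blobs (made-up data for illustration only).
rng = np.random.RandomState(42)
X_demo = np.vstack(
    [rng.normal(loc, 0.5, size=(50, 2)) for loc in ([0, 0], [5, 5], [0, 5])]
)
centers, labels = kmeans_sketch(X_demo, k=3)
print(centers)
print(np.bincount(labels, minlength=3))
```

Because the starting centers are random, a run like this can settle on a poor local solution; that is one reason library implementations usually offer smarter initialization and several restarts.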

(2) Cluster centers and cluster boundaries found by the k-means algorithm

mglearn.plots.plot_kmeans_boundaries()

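[Figure: cluster centers and cluster boundaries found by the k-means algorithm]
(3) Generating synthetic data and building the clustering model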

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# generate synthetic two-dimensional data
X, y = make_blobs(random_state=1)

# build the clustering model
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

(4) Assigning cluster labels

print("Cluster memberships:\n{}".format(kmeans.labels_))

(5) Assigning cluster labels with the predict method

print(kmeans.predict(X))

On the training data, the result is the same as labels_.
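
The difference between labels_ and predict only shows up on points that were not part of the training data. The sketch below assumes two arbitrary new points (X_new is made up for illustration) and sets n_init and random_state explicitly so the run is repeatable; it checks that predict picks the nearest cluster center.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# On the training data, predict reproduces labels_.
print(np.array_equal(kmeans.predict(X), kmeans.labels_))

# For unseen points, predict returns the index of the nearest cluster center.
X_new = np.array([[0.0, 0.0], [-10.0, -5.0]])  # arbitrary illustrative points
nearest = np.linalg.norm(
    X_new[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
).argmin(axis=1)
print(kmeans.predict(X_new), nearest)
```

Note that predict does not change the fitted model; it only assigns points to the existing centers.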

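(6) Cluster assignments and cluster centers found by k-means with three clusters
The cluster centers are stored in the cluster_centers_ attribute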

mglearn.discrete_scatter(X[:, 0], X[:, 1], kmeans.labels_, markers='o')
mglearn.discrete_scatter(
    kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], [0, 1, 2],
    markers='^', markeredgewidth=2)

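[Figure: cluster assignments and cluster centers found by k-means with three clusters]

(7) Using fewer or more cluster centers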

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
# using two cluster centers
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
assignments = kmeans.labels_

mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[0])
# using five cluster centers
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
assignments = kmeans.labels_

mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[1])

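[Figure: cluster assignments found by k-means using two clusters (left) and five clusters (right)]

1. Failure cases of k-means

Even if you give the algorithm the correct number of clusters, k-means is not guaranteed to find the correct clusters.
(1) Cluster assignments when clusters have different densities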

X_varied, y_varied = make_blobs(n_samples=200,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=170)
y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X_varied)

mglearn.discrete_scatter(X_varied[:, 0], X_varied[:, 1], y_pred)
plt.legend(["cluster 0", "cluster 1", "cluster 2"], loc='best')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

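[Figure: cluster assignments found by k-means when clusters have different densities]
Some clusters include points that lie far away from the other points in the same cluster.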

(2) k-means fails to identify non-spherical clusters

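import numpy as np

# generate some random cluster data
X, y = make_blobs(random_state=170, n_samples=600)
rng = np.random.RandomState(74)

# transform the data so that the clusters are stretched
transformation = rng.normal(size=(2, 2))
X = np.dot(X, transformation)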

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_pred = kmeans.predict(X)

mglearn.discrete_scatter(X[:, 0], X[:, 1], kmeans.labels_, markers='o')
mglearn.discrete_scatter(
    kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], [0, 1, 2],
    markers='^', markeredgewidth=2)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

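[Figure: k-means fails to identify non-spherical clusters]

(3) Clusters with complex shapes (k-means fails to identify clusters with complex shapes)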

from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
y_pred = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm2, s=60, edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='^', c=[mglearn.cm2(0), mglearn.cm2(1)], s=100, linewidth=2,
            edgecolor='k')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

[Figure: k-means fails to identify clusters with complex shapes]
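
The failure on the two_moons data can also be put into a number. The original notes only judge it by eye; using the adjusted Rand index (ARI) from scikit-learn to compare the k-means assignment with the true half-moon labels is an assumption of this sketch, as are the explicit n_init and random_state settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ARI is 1.0 for a perfect match with the true grouping
# and close to 0 for a random assignment.
ari = adjusted_rand_score(y, y_pred)
print("ARI on two_moons: {:.2f}".format(ari))
```

A score well below 1 confirms what the plot suggests: the two clusters found by k-means do not correspond to the two half-moons.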

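2. Vector quantization, or seeing k-means as decomposition

Although k-means is a clustering algorithm, there are some parallels between k-means and decomposition methods.
PCA: finds the directions of maximum variance in the data
NMF: finds additive components

PCA / NMF: each data point is expressed as a sum of components
k-means: each data point is represented by a single component, its cluster center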
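
Seen this way, k-means encodes each point as one integer and decodes it as the corresponding cluster center. A minimal sketch on synthetic blobs (the dataset, the number of clusters, and the explicit n_init setting are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Encode: one integer (the cluster index) per data point.
codes = kmeans.predict(X)
# Decode: replace every point by the center it was assigned to.
X_reconstructed = kmeans.cluster_centers_[codes]

# Mean squared distance between the points and their reconstruction.
error = np.mean(np.sum((X - X_reconstructed) ** 2, axis=1))
print(codes[:10])
print("reconstruction error: {:.3f}".format(error))
```

The reconstruction contains at most as many distinct rows as there are cluster centers, which is exactly the "single component per data point" idea.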

(1) Comparing the k-means cluster centers with the components found by PCA and NMF

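from sklearn.model_selection import train_test_split
from sklearn.decomposition import NMF, PCA

# X_people, y_people and image_shape are assumed to be defined earlier
# (the face image data and the shape of a single image)
X_train, X_test, y_train, y_test = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0)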
nmf = NMF(n_components=100, random_state=0)
nmf.fit(X_train)
pca = PCA(n_components=100, random_state=0)
pca.fit(X_train)
kmeans = KMeans(n_clusters=100, random_state=0)
kmeans.fit(X_train)

X_reconstructed_pca = pca.inverse_transform(pca.transform(X_test))
X_reconstructed_kmeans = kmeans.cluster_centers_[kmeans.predict(X_test)]
X_reconstructed_nmf = np.dot(nmf.transform(X_test), nmf.components_)

fig, axes = plt.subplots(3, 5, figsize=(8, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})
fig.suptitle("Extracted Components")
for ax, comp_kmeans, comp_pca, comp_nmf in zip(
        axes.T, kmeans.cluster_centers_, pca.components_, nmf.components_):
    ax[0].imshow(comp_kmeans.reshape(image_shape))
    ax[1].imshow(comp_pca.reshape(image_shape), cmap='viridis')
    ax[2].imshow(comp_nmf.reshape(image_shape))

axes[0, 0].set_ylabel("kmeans")
axes[1, 0].set_ylabel("pca")
axes[2, 0].set_ylabel("nmf")

fig, axes = plt.subplots(4, 5, subplot_kw={'xticks': (), 'yticks': ()},
                         figsize=(8, 8))
fig.suptitle("Reconstructions")
for ax, orig, rec_kmeans, rec_pca, rec_nmf in zip(
        axes.T, X_test, X_reconstructed_kmeans, X_reconstructed_pca,
        X_reconstructed_nmf):

    ax[0].imshow(orig.reshape(image_shape))
    ax[1].imshow(rec_kmeans.reshape(image_shape))
    ax[2].imshow(rec_pca.reshape(image_shape))
    ax[3].imshow(rec_nmf.reshape(image_shape))

axes[0, 0].set_ylabel("original")
axes[1, 0].set_ylabel("kmeans")
axes[2, 0].set_ylabel("pca")
axes[3, 0].set_ylabel("nmf")

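[Figure: extracted components and image reconstructions for k-means, PCA, and NMF]

Comparison of image reconstructions using k-means, PCA, and NMF with 100 components (or cluster centers); k-means uses only a single cluster center per image.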

(2) Encoding data with more clusters than input dimensions
On the two_moons data, using more cluster centers lets k-means find a more expressive representation.

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X)
y_pred = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=60, cmap='Paired')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=60,
            marker='^', c=range(kmeans.n_clusters), linewidth=2, cmap='Paired')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
print("Cluster memberships:\n{}".format(y_pred))

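[Figure: using many k-means clusters to cover the variation in the two_moons data]
We now use 10 cluster centers, so each point is assigned a number between 0 and 9. We can treat this as 10 new features: all of them are 0 except the one corresponding to the cluster center the point is assigned to. Using this 10-dimensional representation, a linear model can now separate the two half-moon shapes, which is not possible with the original two features.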
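
A rough way to check this claim is to build the 10 one-hot features explicitly and fit the same linear classifier on both representations. The choice of LogisticRegression, the use of training accuracy as the yardstick, and the explicit n_init setting are assumptions of this sketch, not part of the original notes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# One-hot encode the cluster index: row i has a single 1 in column labels_[i].
X_onehot = np.eye(kmeans.n_clusters)[kmeans.labels_]

# Training accuracy of the same linear classifier on both representations.
acc_raw = LogisticRegression().fit(X, y).score(X, y)
acc_onehot = LogisticRegression().fit(X_onehot, y).score(X_onehot, y)
print("raw features: {:.2f}, one-hot cluster features: {:.2f}".format(
    acc_raw, acc_onehot))
```

The true labels y are used here only to measure how separable each representation is; k-means itself never sees them.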

(3) Using the distances to each cluster center as features

distance_features = kmeans.transform(X)
print("Distance feature shape: {}".format(distance_features.shape))
print("Distance features:\n{}".format(distance_features))
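
To see what transform returns, the distances can be recomputed by hand. The sketch below refits the same 10-cluster model (with n_init set explicitly, an assumption for repeatability) and compares transform against Euclidean distances computed with NumPy.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

distance_features = kmeans.transform(X)  # shape (n_samples, n_clusters)

# Recompute the same Euclidean distances by hand for comparison.
manual = np.linalg.norm(
    X[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
print(np.allclose(distance_features, manual))

# The closest center (smallest distance) is the predicted cluster.
print(np.array_equal(distance_features.argmin(axis=1), kmeans.predict(X)))
```

Unlike the one-hot encoding, these features are continuous, so they keep information about how close a point is to every center, not just the nearest one.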