Machine Learning: Clustering Implementations

Article link: https://blog.csdn.net/gwpjiayou/article/details/104873150

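Definition of clustering: given a large dataset without labels, partition it into several classes according to the intrinsic similarity of the data, so that similarity within a class is high and similarity between classes is low.
Clustering therefore has to answer two questions:

How to define similarity
How to choose the number of classes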

I. Kmeans

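Given input samples $S = \{x_1, x_2, x_3, \ldots, x_m\}$, the algorithm proceeds as follows:

Choose $k$ initial cluster centers $\mu_1, \mu_2, \mu_3, \ldots, \mu_k$.
For each sample $x_i$, assign it the label of the nearest cluster center, i.e.:
$label_i = \arg\min_{1 \le j \le k} \lVert x_i - \mu_j \rVert$
Update each cluster center to the mean of all samples assigned to that cluster:
$\mu_j = \frac{1}{\lvert c_j \rvert} \sum_{i \in c_j} x_i$
Repeat the last two steps until the change in the cluster centers falls below a threshold.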
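Intuitively, the clustering works like this: first decide how many clusters there are; the algorithm then places that many cluster centers, picked at random. The position of these initial centers matters a great deal, and Kmeans++ is used later to choose them. Each point is assigned to the cluster whose center is nearest, and the mean of the points in each cluster becomes the new center. This is iterated until the stopping condition is met.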
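The steps above can be sketched directly in NumPy. This is a minimal illustration rather than the scikit-learn implementation: the function name `kmeans`, its parameters (`n_iter`, `tol`, `seed`) and the toy data are assumptions made for this example, and the initial centers are simply drawn at random from the samples.

```python
import numpy as np


def kmeans(x, k, n_iter=100, tol=1e-6, seed=0):
    """Plain K-means sketch: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: label each sample with the index of its nearest center.
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: move each center to the mean of its samples
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers move less than the threshold.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return labels, centers


# Two well-separated synthetic groups (toy data, not from the article).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
labels, centers = kmeans(x, k=2)
print(centers)
```

On data this well separated the random starting centers hardly matter; on harder data they can, which is the motivation for Kmeans++ later in the post.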

Kmeans implementation:
Cluster the following three datasets (blobs, moons and circles):

import matplotlib.pyplot as plt
from sklearn import cluster
# sklearn.datasets.samples_generator was removed in scikit-learn 0.24;
# the generators are imported from sklearn.datasets directly.
from sklearn.datasets import make_blobs, make_circles, make_moons

# Three toy datasets: Gaussian blobs, two half-moons, two concentric circles.
x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.60, random_state=0)
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0, factor=0.4)
# K-means with k=2; the same estimator is refit on each dataset.
kmeans = cluster.KMeans(n_clusters=2)
label = kmeans.fit_predict(x)
label2 = kmeans.fit_predict(x2)
label3 = kmeans.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

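Clustering result: on the first dataset (blobs) the clusters match the original labels, so Kmeans works well there; on the other two datasets the result is poor. Kmeans assigns each point to its nearest center, so it can only cut the plane into convex regions, which cannot follow the moon and ring shapes.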

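Drawbacks of the Kmeans algorithm

The number of cluster centers K must be given in advance, but in practice K is very hard to estimate; often it is not known beforehand how many classes best suit a given dataset.
Kmeans needs initial cluster centers to be chosen, and different initial centers can lead to completely different clustering results (the Kmeans++ algorithm can be used to mitigate this).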
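One common heuristic for the first drawback, which the post itself does not cover, is to run Kmeans for several values of K and compare a quality score. The sketch below uses the silhouette score from `sklearn.metrics`; the blob centers and the range of K are hypothetical values chosen for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs with hand-picked (hypothetical) centers.
x, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    # Silhouette is in [-1, 1]; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(x, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("K with the highest silhouette score:", best_k)
```

The score is only a guide: on data without clear cluster structure several values of K may score almost equally.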

II. Kmeans++

As noted above, Kmeans needs K in advance and is sensitive to the initial cluster centers: different initial centers can lead to completely different results. Kmeans++ targets the choice of initial centers.

The idea behind K-Means++

The basic idea of k-means++ when choosing the initial seeds is that the initial cluster centers should be as far from one another as possible.

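1. Randomly choose one point from the input data as the first cluster center.
2. For every point x in the dataset, compute the distance D(x) to the nearest cluster center chosen so far.
3. Choose a new data point as the next cluster center, such that points with a larger D(x) are more likely to be chosen.
4. Repeat steps 2 and 3 until k cluster centers have been chosen.
For example, to choose two initial cluster centers that are as far apart as possible: in the original figure, if point 6 is the initial point, then point 2, the point farthest from it, is chosen as the center of the second class. The process continues in this way until the required number of centers has been selected.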
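The seeding steps can be sketched in NumPy as follows. The function name `kmeans_pp_init` and the toy data are assumptions for this example; the sketch samples the next center with probability proportional to D(x)^2, which is the weighting commonly used for k-means++, whereas the text only requires that a larger D(x) gives a larger probability.

```python
import numpy as np


def kmeans_pp_init(x, k, seed=0):
    """Return k initial centers chosen from the rows of x (k-means++ style)."""
    rng = np.random.default_rng(seed)
    # Step 1: the first center is a uniformly random sample.
    centers = [x[rng.integers(len(x))]]
    while len(centers) < k:
        # Step 2: D(x) = distance from each sample to its nearest chosen center.
        diff = x[:, None, :] - np.array(centers)[None, :, :]
        d = np.linalg.norm(diff, axis=2).min(axis=1)
        # Step 3: sample the next center with probability proportional to D(x)^2.
        probs = d ** 2 / np.sum(d ** 2)
        centers.append(x[rng.choice(len(x), p=probs)])
    # Step 4 is the loop itself: repeat until k centers are chosen.
    return np.array(centers)


# Three tight, far-apart groups (toy data, not from the article).
rng = np.random.default_rng(1)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
x = np.vstack([rng.normal(c, 0.1, (50, 2)) for c in true_centers])
init_centers = kmeans_pp_init(x, k=3)
print(init_centers)
```

In scikit-learn the same idea is available through `KMeans(init="k-means++")`, which is the default initialisation.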

III. MeanShift

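MeanShift does not need initial cluster centers; only the size of the window (the "circle", i.e. the bandwidth) has to be set. The window size is an important factor in the clustering result.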
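To illustrate how much the window size matters, the following sketch runs scikit-learn's `MeanShift` with two hand-picked bandwidths on the same data. The blob centers and the bandwidth values 2.0 and 20.0 are assumptions chosen for this example; `estimate_bandwidth` is the helper scikit-learn provides for picking a bandwidth from the data.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Two blobs about 10 units apart (hypothetical centers).
x, _ = make_blobs(n_samples=200, centers=[[-5, 0], [5, 0]],
                  cluster_std=0.5, random_state=0)

n_clusters = {}
for bandwidth in (2.0, 20.0):
    labels = MeanShift(bandwidth=bandwidth).fit_predict(x)
    n_clusters[bandwidth] = len(set(labels))
    print("bandwidth", bandwidth, "->", n_clusters[bandwidth], "cluster(s)")

# A data-driven bandwidth; quantile=0.3 is the library default.
auto_bandwidth = estimate_bandwidth(x, quantile=0.3)
print("estimated bandwidth:", auto_bandwidth)
```

A window that is wide enough to cover both blobs merges them into a single cluster, while a window on the scale of one blob keeps them apart.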
MeanShift implementation:

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
# x2 is plotted under the "moons" title and x3 under "circles" below,
# so the generators are assigned in that order.
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0)
# MeanShift with the default, automatically estimated bandwidth.
mean_shift = cluster.MeanShift()
label = mean_shift.fit_predict(x)
label2 = mean_shift.fit_predict(x2)
label3 = mean_shift.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

In the resulting plot, MeanShift also performs poorly on the latter two datasets.

IV. Hierarchical clustering

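Hierarchical clustering methods:

Agglomerative hierarchical clustering: the AGNES algorithm (bottom-up)
Divisive hierarchical clustering: the DIANA algorithm (top-down)
Advantages:

Distances and linkage rules are easy to define, with few restrictions;
The number of clusters does not have to be fixed in advance;
The hierarchical relationship between classes can be discovered;
Clusters of other (non-spherical) shapes can be found.

Disadvantages:

1. The computational complexity is high;
2. Outliers can have a large influence;
3. The algorithm may well produce chain-shaped clusters.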
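The bottom-up variant and the point about not fixing the number of clusters in advance can be illustrated with SciPy's `scipy.cluster.hierarchy` module. This is a sketch under assumptions: the toy data and the choice of Ward linkage are made for this example only, and SciPy is used here instead of the scikit-learn estimator that the post uses.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated synthetic groups (toy data, not from the article).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])

# Bottom-up merging: start from single points and repeatedly merge the two
# closest clusters. z has one row per merge, i.e. len(x) - 1 rows.
z = linkage(x, method="ward")

# The merge tree is built once and can then be cut at different levels.
labels_2 = fcluster(z, t=2, criterion="maxclust")
labels_4 = fcluster(z, t=4, criterion="maxclust")
print(len(set(labels_2)), len(set(labels_4)))
```

Passing `z` to `scipy.cluster.hierarchy.dendrogram` draws the full merge tree, which is how the class hierarchy is usually inspected.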

Hierarchical clustering implementation:

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
# x2 is plotted under the "moons" title and x3 under "circles" below,
# so the generators are assigned in that order.
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0)
# Agglomerative (bottom-up) clustering, cut at two clusters.
agglo = cluster.AgglomerativeClustering(n_clusters=2)
label = agglo.fit_predict(x)
label2 = agglo.fit_predict(x2)
label3 = agglo.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

V. Density-based clustering

Implementation:

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
# x2 is plotted under the "moons" title and x3 under "circles" below,
# so the generators are assigned in that order.
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0)
# DBSCAN with default parameters (eps=0.5, min_samples=5).
dbscan = cluster.DBSCAN()
label = dbscan.fit_predict(x)
label2 = dbscan.fit_predict(x2)
label3 = dbscan.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

Density-based clustering result: once the parameters are reset as follows, DBSCAN can handle the latter two datasets well:

# eps is the neighborhood radius; min_samples=5 means a neighborhood must contain at least 5 samples
gmm = cluster.DBSCAN(eps=0.3, min_samples=5)

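The effect of these parameters can be checked numerically rather than by eye. The sketch below is an illustration under assumptions: it scores the DBSCAN labels against the generator's true labels with the adjusted Rand index, and it generates the circles with `factor=0.4` as in the Kmeans example. With the default `factor=0.8` the two rings are only about 0.2 apart, closer than `eps=0.3`, so DBSCAN would be expected to merge them.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles, make_moons
from sklearn.metrics import adjusted_rand_score

x_circles, y_circles = make_circles(n_samples=200, noise=0.05,
                                    random_state=0, factor=0.4)
x_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)

results = {}
datasets = {"circles": (x_circles, y_circles), "moons": (x_moons, y_moons)}
for name, (x, y_true) in datasets.items():
    labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(x)
    # DBSCAN marks noise points with the label -1; do not count it as a cluster.
    n_found = len(set(labels)) - (1 if -1 in labels else 0)
    # Adjusted Rand index: 1.0 means the labels match the true grouping exactly.
    results[name] = (n_found, adjusted_rand_score(y_true, labels))
    print(name, results[name])
```

The rule of thumb this suggests is that `eps` should be larger than the spacing of points within a cluster but smaller than the gap between clusters.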

VI. AP (Affinity Propagation) clustering

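The idea behind AP clustering:

All samples are treated as nodes of a network, and the cluster center of each sample is computed by passing messages along the edges of the network. Two kinds of messages are passed between nodes during clustering: responsibility and availability. The AP algorithm iteratively updates the responsibility and availability of every point until m high-quality exemplars (similar to centroids) emerge, and the remaining data points are assigned to the corresponding clusters.

Exemplar: a cluster center, the counterpart of the centroid in K-Means. The AP algorithm does not need the number of clusters in advance; instead it treats every data point as a potential cluster center.

Similarity: the similarity between data points i and j is written s(i,j) and expresses how well suited point j is to be the cluster center of point i. It is usually computed from the Euclidean distance and negated, so the similarities between points are all negative; a larger similarity therefore means the points are closer, which simplifies the later comparisons.

Preference: the preference of data point i, written p(i) or s(i,i), expresses how suitable point i is as a cluster center. The diagonal entry s(k,k) of the matrix S is the criterion for whether point k can become a cluster center: the larger the value, the more likely the point is to become one. It is usually set to the median of the similarities (this is the default in Scikit-learn). The number of clusters is influenced by the preference p. If every data point is considered equally likely to be a cluster center, p should take the same value for all points. Taking the median of the input similarities as p gives a moderate number of clusters; taking the minimum gives fewer clusters.

Responsibility: r(i,k) describes how well suited point k is to serve as the cluster center of data point i.

Availability: a(i,k) describes how appropriate it is for point i to choose point k as its cluster center.

Damping factor: damps the message updates, mainly to help the algorithm converge.

In practice the two most important parameters (which also have to be set by hand) are the preference and the damping factor. The former determines the number of clusters, with a larger value giving more clusters; the latter controls how well the algorithm converges.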
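For reference, the message updates usually given for affinity propagation can be written as follows. They are not spelled out in this post; the damping factor is denoted $\lambda$ here and $m$ stands for either message, both of which are notation assumptions made for this sketch.

```latex
% Responsibility: how strongly point i favours k over all other candidates k'.
r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \bigl[ a(i,k') + s(i,k') \bigr]

% Availability: accumulated evidence from other points i' that k is a good exemplar.
a(i,k) \leftarrow \min\Bigl( 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0, r(i',k)\bigr) \Bigr), \quad i \neq k

% Self-availability of a candidate exemplar k.
a(k,k) \leftarrow \sum_{i' \neq k} \max\bigl(0, r(i',k)\bigr)

% Damped update applied to each message m (either r or a).
m^{\text{new}} = \lambda \, m^{\text{old}} + (1 - \lambda) \, m^{\text{computed}}

% After convergence, the exemplar of point i.
\operatorname{exemplar}(i) = \arg\max_{k} \bigl[ a(i,k) + r(i,k) \bigr]
```

Since s(k,k) is the preference, raising it increases r(k,k) for every point, which is why a larger preference tends to produce more exemplars.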

AP clustering implementation:

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
# x2 is plotted under the "moons" title and x3 under "circles" below,
# so the generators are assigned in that order.
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0, factor=0.4)
# A lower preference yields fewer clusters, for example:
# ap = cluster.AffinityPropagation(preference=-30)
ap = cluster.AffinityPropagation()
label = ap.fit_predict(x)
label2 = ap.fit_predict(x2)
label3 = ap.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

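Note: the mathematical principles of spectral clustering and of the Gaussian mixture model, together with the EM algorithm used to fit the latter, will be derived in detail in a later post.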

VII. Spectral clustering


import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
# x2 is plotted under the "moons" title and x3 under "circles" below,
# so the generators are assigned in that order.
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0, factor=0.4)
# Spectral clustering into two clusters on a nearest-neighbors affinity graph.
spectral = cluster.SpectralClustering(n_clusters=2, affinity="nearest_neighbors")
label = spectral.fit_predict(x)
label2 = spectral.fit_predict(x2)
label3 = spectral.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

VIII. Gaussian mixture model


The clustering result is produced by the following code:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_circles, make_moons
from sklearn.mixture import GaussianMixture

x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
# x2 is plotted under the "moons" title and x3 under "circles" below,
# so the generators are assigned in that order.
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0, factor=0.4)
# Gaussian mixture with two components, fitted by EM.
gmm = GaussianMixture(n_components=2)
label = gmm.fit_predict(x)
label2 = gmm.fit_predict(x2)
label3 = gmm.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()