Using Clustering Algorithms

Latest recommended article published on 2024-07-02 10:45:45

可怜又无助的迪迪迪

Column: Machine Learning sklearn; Tags: clustering algorithms, machine learning

Article link: https://blog.csdn.net/z1139269312/article/details/121888503

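KMeans partitions the feature matrix X of N samples into K disjoint clusters.

Centroid: the mean of all samples in a cluster.

Procedure: 1. Randomly pick K samples as the initial centroids and start iterating.

2. Assign each sample to its nearest centroid, producing K clusters.

3. For each cluster, compute the mean of all samples assigned to it and use it as the new centroid.

4. When the centroids no longer move, the iteration stops and clustering is complete.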
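The four steps above can be sketched in plain NumPy. This is only an illustrative sketch, not how sklearn implements KMeans; the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Hypothetical helper: a bare-bones version of the four steps."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the new centroid is the mean of the samples assigned to it
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)
```

In practice sklearn's KMeans adds a smarter initialization and several restarts, so this sketch is for understanding the loop only.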

Euclidean distance: d(x, $\mu$ ) = $\sqrt{\sum_{i=1}^{n} \left ( x_{i} -\mu _{i}\right )^{2}}$

Manhattan distance: d(x, $\mu$ ) = $\sum_{i=1}^{n} \left | x_{i} -\mu _{i} \right |$

Cosine distance: based on the cosine similarity cos $\theta$ = $\frac{\sum _{i=1}^{n} x _{i}\mu _{i}}{\sqrt{\sum _{i=1}^{n} x _{i}^{2}}\,\sqrt{\sum _{i=1}^{n} \mu _{i}^{2}}}$ ; the distance itself is 1 - cos $\theta$
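As a quick check of the three formulas, here is a small NumPy sketch. The vectors `x` and `mu` are made-up values chosen for illustration.

```python
import numpy as np

# Hypothetical sample x and centroid mu
x = np.array([1.0, 2.0, 3.0])
mu = np.array([2.0, 4.0, 6.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((x - mu) ** 2))

# Manhattan distance: sum of the absolute differences
manhattan = np.sum(np.abs(x - mu))

# Cosine similarity, and the cosine distance derived from it
cos_sim = x @ mu / (np.linalg.norm(x) * np.linalg.norm(mu))
cos_dist = 1 - cos_sim

print(euclidean, manhattan, cos_sim, cos_dist)
```

Because `mu` is a multiple of `x` here, the two vectors point in the same direction and the cosine distance is essentially zero even though the Euclidean and Manhattan distances are not.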

sklearn.cluster.KMeans

Key parameter: n_clusters tells the model how many clusters to form; the default is 8

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Create a synthetic dataset
x,y = make_blobs(n_samples=500,n_features=2,centers=4,random_state=1)
fig,ax1 = plt.subplots(1)  # number of subplots; fig is the figure (canvas), ax1 is the Axes object
ax1.scatter(x[:,0],x[:,1],marker='o',s=8)
plt.show()

color = ['red','pink','orange','gray']
fig,ax1 = plt.subplots(1)
for i in range(4):
    ax1.scatter(x[y==i,0],x[y==i,1],marker='o',s=8,c=color[i])
plt.show()

from sklearn.cluster import KMeans
n_clusters = 3
cluster = KMeans(n_clusters=n_clusters,random_state=0).fit(x)   # fit runs the clustering and computes the centroids
y_pred = cluster.labels_    # labels_ holds the cluster index assigned to each sample
y_pred

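# You can also call predict; on the training data it gives the same result as labels_ after fit
# predict takes the centroids that were already found and assigns each point to the nearest one
# Why use predict: when the dataset is very large, fit on a slice of it to find the centroids, then predict on all the data to save time. The result is usually somewhat worse than fitting on the full data, but for very large datasets the difference is small
# cluster_smallsub = KMeans(n_clusters=n_clusters,random_state=0).fit(x[:200])
# y_pred_ = cluster_smallsub.predict(x)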
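The subset idea in the comments above can be tried out as follows. This is a sketch under assumptions: the names `full`, `sub` and `y_all` are made up here, `n_clusters=4` is chosen to match the generated data, and `n_init=10` is passed explicitly only to keep behavior the same across sklearn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

# Fit on all the data: predict on the training data reproduces labels_
full = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x)
agreement = np.mean(full.predict(x) == full.labels_)
print("predict vs labels_ agreement:", agreement)

# Fit on the first 200 samples only, then assign all 500 samples
sub = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x[:200])
y_all = sub.predict(x)
print(y_all.shape)
```

Note that the cluster indices from `sub` and `full` need not match one to one, since the numbering of clusters is arbitrary.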

centroid = cluster.cluster_centers_    # cluster_centers_ holds the centroids
centroid

cluster.inertia_    # within-cluster sum of squared distances from each sample to its centroid
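To make the meaning of `inertia_` concrete, it can be recomputed by hand from the labels and centroids. A minimal sketch, assuming the same synthetic data as above; `km` and `manual_inertia` are names introduced for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)

# For every sample, look up the centroid of its assigned cluster,
# then sum the squared Euclidean distances
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((x - assigned_centroids) ** 2)

print(manual_inertia, km.inertia_)
```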

# Plot the clustered data

color = ["red","green","orange","gray"]
fig,ax1 = plt.subplots(1)
for i in range(n_clusters):
    ax1.scatter(x[y_pred==i,0],x[y_pred==i,1],marker='o',s=8,c=color[i])
ax1.scatter(centroid[:,0],centroid[:,1],marker="x",s=15,c="black")
plt.show()

# Guessing the number of clusters

n_clusters = 4
cluster4 = KMeans(n_clusters,random_state=0).fit(x)
inertia_ = cluster4.inertia_
inertia_
n_clusters = 5
cluster5 = KMeans(n_clusters,random_state=0).fit(x)
inertia_ = cluster5.inertia_
inertia_

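# Lower inertia is better, but inertia keeps decreasing as n_clusters grows, so n_clusters cannot be chosen by minimizing inertia
# Inertia on its own is therefore not a reliable evaluation metric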
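The dependence of inertia on the number of clusters is easy to see by fitting several values in a loop. A small sketch, with `ks` and `inertias` as illustrative names and `n_init=10` set explicitly as an assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

ks = [2, 3, 4, 5, 6]
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    inertias.append(km.inertia_)
    print(k, km.inertia_)
```

Inertia goes down as more clusters are allowed, even past the four clusters the data was generated with, which is why it cannot pick the number of clusters by itself.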

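# Evaluation metrics for clustering models

Clustering quality is measured by comparing the differences within clusters and between clusters.

When the true labels are known: mutual information score (this rarely happens in practice)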
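For the known-label case, sklearn offers several mutual-information-based scores; the sketch below uses `adjusted_mutual_info_score` as one example of that family (the choice of this particular variant is an assumption, the text does not say which one is meant). The synthetic data comes with true labels `y`, so it can serve as a demonstration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score

x, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(x)

# Compares the partition found by KMeans with the true partition;
# the score does not depend on how the clusters are numbered
ami = adjusted_mutual_info_score(y, labels)

# Renumbering the predicted clusters leaves the score unchanged
ami_renumbered = adjusted_mutual_info_score(y, (labels + 1) % 4)
print(ami, ami_renumbered)
```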

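When the true labels are unknown:

Silhouette coefficient: uses two quantities, a and b, to evaluate how compact each cluster is and how well separated the clusters are

a: the sample's similarity to the other samples in its own cluster (the mean distance between the sample and all other points in the same cluster)

b: the sample's similarity to the samples in other clusters (the mean distance between the sample and all points in the nearest neighboring cluster)

Silhouette coefficient of a single sample: s = $\frac{b-a}{\max\left ( a,b \right )}$ , which takes values between -1 and 1; the closer to 1 the better

metrics.silhouette_score returns the mean silhouette coefficient over all samples in a dataset

metrics.silhouette_samples returns the silhouette coefficient of each sample in the dataset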
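The definitions of a, b and s can be checked by hand for one sample and compared with `silhouette_samples`. This is an illustrative sketch; the helper `silhouette_of` is a made-up name and assumes Euclidean distance and clusters with more than one sample.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(x)

def silhouette_of(idx):
    """Hypothetical helper: silhouette coefficient of sample idx."""
    dists = np.linalg.norm(x - x[idx], axis=1)
    own = labels[idx]
    same = labels == own
    # a: mean distance to the other points of the same cluster
    # (the sample itself contributes distance 0, so divide by size - 1)
    a = dists[same].sum() / (same.sum() - 1)
    # b: smallest mean distance to the points of any other cluster
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != own)
    return (b - a) / max(a, b)

s0 = silhouette_of(0)
print(s0, silhouette_samples(x, labels)[0])
```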

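Calinski-Harabasz index. Advantage: fast to compute

s(k) = $\frac{Tr(B_{k})}{Tr(W_{k})} \cdot \frac{N-k}{k-1}$

N is the number of samples, k is the number of clusters, $B_{k}$ is the between-cluster dispersion matrix (the covariance matrix between different clusters)

Tr is the trace of a matrix (the sum of its diagonal elements), $W_{k}$ is the within-cluster dispersion matrix (the covariance matrix of the data inside a cluster)

The more dispersed the data, the larger the trace of the covariance matrix. Good clustering has clusters that are far apart (large $Tr(B_{k})$ ) and tight internally (small $Tr(W_{k})$ ), so a higher Calinski-Harabasz index is better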
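The formula can be reproduced with a few lines of NumPy and compared against `calinski_harabasz_score`. A sketch under the assumption that the traces are computed as sums of squared distances: Tr(B_k) from cluster means to the overall mean (weighted by cluster size), Tr(W_k) from samples to their own cluster mean. Variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
k = 4
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)

N = x.shape[0]
overall_mean = x.mean(axis=0)
tr_b = 0.0  # trace of the between-cluster dispersion matrix
tr_w = 0.0  # trace of the within-cluster dispersion matrix
for j in range(k):
    members = x[labels == j]
    center = members.mean(axis=0)
    tr_b += len(members) * np.sum((center - overall_mean) ** 2)
    tr_w += np.sum((members - center) ** 2)

ch_manual = (tr_b / tr_w) * (N - k) / (k - 1)
print(ch_manual, calinski_harabasz_score(x, labels))
```

No pairwise distances between samples are needed here, which is consistent with the index being fast to compute.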

Silhouette coefficient

from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples
silhouette_score(x,y_pred)

silhouette_score(x,cluster4.labels_)  # 4 clusters scores higher than 3 clusters

silhouette_score(x,cluster5.labels_)  # 5 clusters

silhouette_samples(x,y_pred)

Calinski-Harabasz index

from sklearn.metrics import calinski_harabasz_score
calinski_harabasz_score(x,y_pred)

calinski_harabasz_score(x,cluster4.labels_)

calinski_harabasz_score(x,cluster5.labels_)

Comparing the time the two metrics take

from time import time
t0 = time()
calinski_harabasz_score(x,cluster4.labels_)
time()-t0

t0 = time()
silhouette_score(x,cluster4.labels_)
time()-t0
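A single `time()` measurement can be noisy, so one possible refinement is to average over several calls with `timeit`. This is only a sketch; the repeat count of 20 and the variable names are arbitrary choices for illustration, and no particular timings are claimed.

```python
from timeit import timeit

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(x)

repeats = 20
# Average seconds per call for each metric
t_ch = timeit(lambda: calinski_harabasz_score(x, labels), number=repeats) / repeats
t_sil = timeit(lambda: silhouette_score(x, labels), number=repeats) / repeats

print(f"Calinski-Harabasz: {t_ch:.6f} s per call")
print(f"Silhouette:        {t_sil:.6f} s per call")
```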

Case study: choosing n_clusters based on the silhouette coefficient

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score,silhouette_samples
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
from sklearn.datasets import make_blobs

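# Create a synthetic dataset
x,y = make_blobs(n_samples=500,n_features=2,centers=4,random_state=1)
for n_clusters in [2,3,4,5,6,7]:
    # We want the silhouette coefficients of each cluster, plus a comparison across clusters
    # We also want to see how the data is distributed after clustering
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The first plot is the silhouette plot: horizontal bars built from each cluster's silhouette coefficients
    # Its x-axis is the silhouette coefficient value; its y-axis is the individual samples

    # Set the x-axis range
    # Silhouette coefficients lie in [-1, 1]; an overly long x-axis hurts readability, so limit it to [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # Set the y-axis range: minimum 0, maximum x.shape[0] plus padding
    # Samples of the same cluster are stacked together, with gaps between clusters
    # Add (n_clusters + 1) * 10 to x.shape[0] to leave room for those gaps
    ax1.set_ylim([0, x.shape[0] + (n_clusters + 1) * 10])
    # Fit the model
    clusterer = KMeans(n_clusters=n_clusters, random_state=10).fit(x)
    cluster_labels = clusterer.labels_
    # silhouette_score returns the mean silhouette coefficient
    # It takes the feature matrix x and the cluster labels cluster_labels
    silhouette_avg = silhouette_score(x, cluster_labels)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
    # silhouette_samples returns each sample's silhouette coefficient; these are the x-axis values
    sample_silhouette_values = silhouette_samples(x, cluster_labels)
    # Starting position on the y-axis (leave some space above the x-axis)
    y_lower = 10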
    for i in range(n_clusters):
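        # Pull out the silhouette coefficients of cluster i and sort them (so each cluster's samples are drawn together)
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        # .sort() sorts the array in place
        # Ascending order makes the bars widen as y increases
        ith_cluster_silhouette_values.sort()
        # Number of samples in this cluster
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        # Colormap functions in matplotlib.cm take a float and return a color
        # e.g. nipy_spectral(<some float>) returns one color
        # Each iteration must produce a different float so every cluster gets its own color
        color = cm.nipy_spectral(float(i) / n_clusters)
        # Fill in subplot 1
        # fill_between and fill_betweenx fill a region with a single color
        # fill_betweenx fills horizontally between two x-curves at each y; fill_between fills vertically between two y-curves at each x
        # fill_betweenx(y values, x1, x2=0, color): here x1 is the sorted silhouette values and x2 keeps its default of 0
        ax1.fill_betweenx(np.arange(y_lower, y_upper), ith_cluster_silhouette_values, facecolor=color, alpha=0.7)
        # Label each cluster with its index, centered vertically on its band
        # text(x position of the label, y position of the label, label text)
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Starting y for the next cluster: the previous upper bound plus a gap of 10
        y_lower = y_upper + 10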
    # Title and axis labels for plot 1
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # Draw the mean silhouette coefficient of the whole dataset as a dashed line
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    # Hide the y-axis ticks
    ax1.set_yticks([])
    # Use a fixed list of x-axis ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    # Second plot: scatter plot of the clustered data
    # There is no loop here, so generate all the floats at once to get one color per sample
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)  # cast the integer labels to float to map them to colors
    ax2.scatter(x[:, 0], x[:, 1], marker="o", s=8, c=colors)
    # Plot the centroids
    centers = clusterer.cluster_centers_
    ax2.scatter(centers[:, 0], centers[:, 1], marker="x", c="red", alpha=1, s=200)
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    # Title for the whole figure
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data " "with n_clusters = %d" % n_clusters)
                 , fontsize=14, fontweight='bold')  # fontsize: font size; fontweight: bold text
    plt.show()
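If only the numbers are needed and not the plots, the same comparison can be condensed into a short loop that records the average silhouette score per candidate and picks the largest. A sketch with illustrative names (`scores`, `best_k`); the average score is only one criterion, and the per-cluster plots above may still suggest a different choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

scores = {}
for n_clusters in [2, 3, 4, 5, 6, 7]:
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=10).fit_predict(x)
    scores[n_clusters] = silhouette_score(x, labels)
    print(n_clusters, scores[n_clusters])

# Candidate with the highest average silhouette coefficient
best_k = max(scores, key=scores.get)
print("highest average silhouette at n_clusters =", best_k)
```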