(sklearn Machine Learning) Chapter 6: Clustering Algorithms (1)

爱听许嵩歌

Last modified 2022-02-25 08:38:21


First published 2021-03-16 16:05:15

Article link: https://blog.csdn.net/weixin_45092662/article/details/114885164


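Previous chapter: (sklearn Machine Learning) Chapter 5: Logistic Regression (1)
https://blog.csdn.net/weixin_45092662/article/details/114537578
Next chapter: (sklearn Machine Learning) Chapter 7: Support Vector Machines (1)
https://blog.csdn.net/weixin_45092662/article/details/115009065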

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Create a synthetic dataset
x, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

fig, ax1 = plt.subplots(1)
ax1.scatter(x[:, 0], x[:, 1]
            ,marker='o'  # marker shape
            ,s=8  # marker size
            )

plt.show()
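To make clear what make_blobs hands back, here is a minimal sketch that inspects the two returned arrays. It reuses the same arguments as above; the bincount check is only an illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as in the post: 500 points, 2 features, 4 generating centers
x, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

print(x.shape)         # feature matrix: one row per sample, one column per feature
print(y.shape)         # one integer label per sample (index of the generating center)
print(np.bincount(y))  # number of samples drawn around each center
```

With an integer n_samples, the samples are split evenly across the centers, so each of the four blobs holds 125 points.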


color = ["red","pink","orange","gray"]
fig, ax1 = plt.subplots(1)

for i in range(4):
    ax1.scatter(x[y==i,0],x[y==i,1],marker='o',s=8,c=color[i])

plt.show()


from sklearn.cluster import KMeans

n_clusters = 4

cluster = KMeans(n_clusters=n_clusters, random_state=0).fit(x)

Check the important attribute labels_, which holds the cluster assigned to each sample

y_pred = cluster.labels_
# y_pred

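KMeans does not need a separate prediction step to cluster the training data, so calling fit alone is enough to obtain the clustering result

KMeans also provides the predict and fit_predict interfaces. fit_predict learns from x and returns the predicted cluster of each sample in x; predict assigns samples to the nearest learned centroid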

pre = cluster.fit_predict(x)

# pre == y_pred
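The difference between labels_, predict and fit_predict can be sketched as follows. The two new_points are hypothetical coordinates chosen only for illustration, and n_init=10 is set explicitly as an assumption so that the call does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

# n_init=10 is an explicit assumption; the original call relies on the default
cluster = KMeans(n_clusters=4, random_state=0, n_init=10).fit(x)

# On the training data, predict returns the same assignment as labels_
same_as_labels = np.array_equal(cluster.predict(x), cluster.labels_)
print(same_as_labels)

# predict is mainly useful for samples that were not seen during fit
new_points = np.array([[-10.0, -4.0], [-1.5, 4.5]])  # hypothetical points
new_labels = cluster.predict(new_points)
print(new_labels)
```

In other words, fit followed by labels_ and fit_predict give the same result on x, while predict is the call to use for data that arrives later.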

Important attribute cluster_centers_, which holds the centroids

centroid = cluster.cluster_centers_
centroid

array([[-10.00969056,  -3.84944007],
       [ -1.54234022,   4.43517599],
       [ -6.08459039,  -3.17305983],
       [ -7.09306648,  -8.10994454]])

Important attribute inertia_, the total sum of squared distances from each sample to its nearest centroid

inertia = cluster.inertia_
inertia

908.3855684760603
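As a sanity check on what inertia_ measures, the value can be recomputed by hand from labels_ and cluster_centers_. This is a minimal sketch; n_init=10 is an explicit assumption and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
cluster = KMeans(n_clusters=4, random_state=0, n_init=10).fit(x)

# Look up the centroid assigned to every sample, then sum the squared distances
assigned_centroids = cluster.cluster_centers_[cluster.labels_]
manual_inertia = ((x - assigned_centroids) ** 2).sum()

print(manual_inertia, cluster.inertia_)
```

The two numbers should agree up to floating-point rounding, which confirms that inertia_ is the within-cluster sum of squared distances.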

color = ["red","pink","orange","gray"]
fig, ax1 = plt.subplots(1)

for i in range(n_clusters):
    ax1.scatter(x[y_pred==i,0],x[y_pred==i,1],marker='o',s=8,c=color[i])

# Plot the centroids
ax1.scatter(centroid[:,0],centroid[:,1],marker='x',s=15,c="black")
plt.show()
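The cluster numbers produced by KMeans are arbitrary, so cluster 0 need not correspond to generating center 0 in y. One way to see how the two labelings line up, beyond comparing the two scatter plots, is a contingency table. The sketch below uses pandas.crosstab; it is an illustration and not part of the original notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

# n_init=10 is an explicit assumption; the original call relies on the default
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(x)

# Rows: generating center from make_blobs; columns: cluster id found by KMeans
table = pd.crosstab(y, labels, rownames=["true center"], colnames=["KMeans label"])
print(table)
```

Each row shows how the samples of one generating center were distributed over the KMeans clusters; a large count concentrated in a single column means that blob was recovered almost intact.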


from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples

x.shape

(500, 2)

y_pred.shape

(500,)

silhouette_score(x,y_pred)

0.6505186632729437

cluster.labels_.shape

(500,)

silhouette_score(x,cluster.labels_)

0.6505186632729437

silhouette_samples(x,y_pred).shape

(500,)

silhouette_samples(x,y_pred).mean()

0.6505186632729437
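To show what a single entry of silhouette_samples contains, the sketch below recomputes the silhouette coefficient of one sample by hand: a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the coefficient is (b - a) / max(a, b). The sample index i = 0 and n_init=10 are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(x)

i = 0  # sample to inspect (arbitrary choice)
dist = np.linalg.norm(x - x[i], axis=1)  # Euclidean distance from sample i to every sample

own = labels == labels[i]
# Mean distance to the OTHER members of the same cluster (the self-distance is 0)
a = dist[own].sum() / (own.sum() - 1)
# Smallest mean distance to the members of any other cluster
b = min(dist[labels == k].mean() for k in np.unique(labels) if k != labels[i])

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(x, labels)[i]
print(s_manual, s_sklearn)
```

Averaging this quantity over all 500 samples gives the value returned by silhouette_score, which is why the three outputs above are identical.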

from sklearn.metrics import calinski_harabasz_score

calinski_harabasz_score(x,y_pred)

2704.4858735121097
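The Calinski-Harabasz index is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom (k - 1 and n - k). A minimal hand computation, with illustrative variable names and n_init=10 as an explicit assumption, looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
k = 4
labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(x)

n = x.shape[0]
overall_mean = x.mean(axis=0)
between = 0.0  # dispersion of the cluster means around the overall mean
within = 0.0   # dispersion of the samples around their own cluster mean
for j in range(k):
    members = x[labels == j]
    center = members.mean(axis=0)
    between += len(members) * ((center - overall_mean) ** 2).sum()
    within += ((members - center) ** 2).sum()

ch_manual = (between / (k - 1)) / (within / (n - k))
ch_sklearn = calinski_harabasz_score(x, labels)
print(ch_manual, ch_sklearn)
```

Only cluster means and one pass over the samples are needed, with no pairwise distances between samples, which is consistent with the index being cheap to compute.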

A timestamp can be converted into a readable time format with the fromtimestamp function in datetime

from time import time
# time(): returns the timestamp at the moment this call runs
# A timestamp is a number (seconds since the epoch) that records the current time
t0 = time()
calinski_harabasz_score(x,y_pred)
time() - t0

0.0009961128234863281

t0 = time()
silhouette_score(x,y_pred)
time() - t0

0.007996320724487305

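As the timings show, the Calinski-Harabasz index was computed far more than twice as fast as the silhouette coefficient: about 0.001 s versus 0.008 s in this run, roughly an eightfold difference. Keep in mind how small this dataset is. With tens of thousands of samples, the silhouette coefficient would slow the workflow down considerably.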
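A single call timed with time() is quite noisy at the millisecond scale. A slightly more robust comparison, sketched below, repeats each call several times with time.perf_counter and keeps the fastest run. The helper best_time and the choice of 5 repeats are assumptions made for illustration; the absolute numbers depend on the machine.

```python
from time import perf_counter

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(x)


def best_time(func, *args, repeats=5):
    """Hypothetical helper: run func(*args) several times, return the fastest run in seconds."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        func(*args)
        timings.append(perf_counter() - start)
    return min(timings)


t_ch = best_time(calinski_harabasz_score, x, labels)
t_sil = best_time(silhouette_score, x, labels)
print("calinski_harabasz_score:", t_ch)
print("silhouette_score:       ", t_sil)
```

The gap is expected to widen as the sample count grows, because the silhouette coefficient needs distances between all pairs of samples.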

import datetime
datetime.datetime.fromtimestamp(t0).strftime("%Y-%m-%d %H:%M:%S")

'2021-03-16 14:32:41'
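The conversion above uses the machine's local time zone. As a small sketch of the same call with an explicit time zone, the example below formats both the current timestamp and timestamp 0 (the Unix epoch); passing tz=timezone.utc is an addition for illustration and is not used in the original.

```python
from datetime import datetime, timezone
from time import time

fmt = "%Y-%m-%d %H:%M:%S"

t0 = time()  # seconds since the Unix epoch, as a float
local_str = datetime.fromtimestamp(t0).strftime(fmt)  # local time zone
epoch_str = datetime.fromtimestamp(0, tz=timezone.utc).strftime(fmt)  # timestamp 0 in UTC

print(local_str)
print(epoch_str)
```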

Choosing n_clusters based on the silhouette coefficient

silhouette_score returns the mean silhouette coefficient over all samples

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm 
import numpy as np 
import pandas as pd 

for n_clusters in [2,3,4,5,6,7]:
    # Create one figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1,2)
    fig.set_size_inches(18,7)
    ax1.set_xlim([-0.1,1])
    ax1.set_ylim([0,x.shape[0]+(n_clusters+1)*10])
    clusterer = KMeans(n_clusters=n_clusters, random_state=10).fit(x)
    cluster_labels = clusterer.labels_
    silhouette_avg = silhouette_score(x,cluster_labels)
    print("For n_clusters =", n_clusters,"The average silhouette_score is :",silhouette_avg)
    
    # silhouette_samples returns the silhouette coefficient of every sample; these are the x values of the plot
    sample_silhouette_values = silhouette_samples(x,cluster_labels)
    
    # Starting position on the y axis
    y_lower = 10

    # Loop over the clusters
    for i in range(n_clusters):
        # Extract the silhouette coefficients of the samples in cluster i, then sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        # sort() sorts the array in place
        ith_cluster_silhouette_values.sort()

        # Number of samples in this cluster
        size_cluster_i = ith_cluster_silhouette_values.shape[0]

        # On the y axis this cluster spans from y_lower to y_lower plus its number of samples (y_upper)
        y_upper = y_lower + size_cluster_i

        # A colormap maps a float in [0, 1] to a color: nipy_spectral(value)
        # Dividing i by n_clusters gives each cluster a different value, and therefore a different color
        color = cm.nipy_spectral(float(i)/n_clusters)

        # Fill subplot 1
        # fill_betweenx fills the horizontal region between two curves x1 and x2 (x2 defaults to 0)
        # Arguments here: (y coordinates, x1 = sorted silhouette values, fill color, transparency)
        ax1.fill_betweenx(np.arange(y_lower,y_upper), ith_cluster_silhouette_values, facecolor=color, alpha=0.7)

        # Write the cluster number next to each cluster's block, at the vertical middle of the block
        # text arguments: (x position of the label, y position of the label, label text)
        ax1.text(-0.05, y_lower + 0.5*size_cluster_i, str(i)) 

        # The next cluster starts 10 units above this cluster's upper bound, which leaves a gap between blocks
        y_lower = y_upper + 10
        
    # Add a title and axis labels to subplot 1
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    
    # Draw the mean silhouette coefficient of the whole dataset as a dashed line
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    
    # Hide the y-axis ticks
    ax1.set_yticks([])
    
    # Use a fixed list of ticks on the x axis
    ax1.set_xticks([-0.1,0,0.2,0.4,0.6,0.8,1])
    
    # Subplot 2: map all labels to floats at once to get one color per sample
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)

    ax2.scatter(x[:,0],x[:,1],marker='o',s=8,c=colors)
    
    # Add the fitted centroids to the plot
    centers = clusterer.cluster_centers_

    ax2.scatter(centers[:,0],centers[:,1],marker='x',c="red",alpha=1,s=200)
    
    # Add a title and axis labels to subplot 2
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature") 
    
    # Add a title for the whole figure
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data with n_clusters = %d" % n_clusters),fontsize=14, fontweight='bold')
    # plt.savefig("n_clusters = %d" % n_clusters, dpi=500, bbox_inches='tight') # fixes blurry or cropped images
    plt.show()

For n_clusters = 2 The average silhouette_score is : 0.7049787496083261


For n_clusters = 3 The average silhouette_score is : 0.5882004012129721


For n_clusters = 4 The average silhouette_score is : 0.6505186632729437


For n_clusters = 5 The average silhouette_score is : 0.5745566973301872


For n_clusters = 6 The average silhouette_score is : 0.4387644975296138


For n_clusters = 7 The average silhouette_score is : 0.3728615111052894
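When only the numbers are needed, the same search can be written without any plotting. The sketch below collects the average silhouette score for each candidate n_clusters and reports the highest one; the dictionary scores, the name best_k and the explicit n_init=10 are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

x, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

scores = {}
for k in range(2, 8):
    # n_init=10 is an explicit assumption; the loop above relies on the default
    labels = KMeans(n_clusters=k, random_state=10, n_init=10).fit_predict(x)
    scores[k] = silhouette_score(x, labels)

best_k = max(scores, key=scores.get)  # candidate with the highest average score
for k, score in scores.items():
    print(k, score)
print("highest average silhouette score at n_clusters =", best_k)
```

In the printed results above, n_clusters = 2 has the highest average score and n_clusters = 4 comes second, even though the data was generated from 4 centers. The average alone is therefore a rough guide, and the per-cluster silhouette plots are worth checking before settling on a value.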

The original code is in my Gitee repository: https://gitee.com/rengarwang/sklearn-machine-learning-code/blob/master/（第六章）聚类算法/聚类算法k_means（1）.ipynb

