Machine Learning - KMeans Clustering (Elbow and Silhouette Coefficients)

This article introduces the basic principle of the KMeans clustering algorithm, including how the initial centroids are chosen and how the iteration proceeds. To choose the number of clusters, model performance is evaluated with the Elbow method and Silhouette analysis. The Elbow method looks for the point where the distortion drops most sharply, while the Silhouette coefficient measures how tightly a sample fits within its own cluster and how well it is separated from the other clusters. The results show that the clustering performs best with three clusters, giving an average silhouette coefficient close to 0.72, consistent with the Elbow method.


Section I: Brief Introduction to KMeans Clustering

The K-Means algorithm belongs to the category of prototype-based clustering. Prototype-based clustering means that each cluster is represented by a prototype, which can be either the centroid (average) of similar points with continuous features, or the medoid (the most frequently occurring point) in the case of categorical features. While K-Means is very good at identifying clusters with a spherical shape, one of its drawbacks is that the number of clusters must be specified in advance. An inappropriate choice of the cluster number can result in poor clustering performance, so two model-performance indexes, the elbow method and the silhouette coefficient, are useful techniques for evaluating the quality of the clustering and determining the optimal number of clusters.
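As a small illustration of the two kinds of prototypes (a hedged NumPy sketch, not code from the original article): the centroid is the feature-wise mean of the points in a cluster, while for categorical features the prototype can be taken as the most frequently occurring point.

import numpy as np
from collections import Counter

# Continuous features: the prototype is the centroid (feature-wise mean)
continuous_cluster = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])
centroid = continuous_cluster.mean(axis=0)                        # [1.0, 2.0]

# Categorical features: the prototype is the most frequently occurring point
categorical_cluster = [('red', 'S'), ('red', 'S'), ('blue', 'M')]
medoid = Counter(categorical_cluster).most_common(1)[0][0]        # ('red', 'S')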
The flow of the K-Means algorithm can be summarized in the following four steps (a minimal from-scratch sketch of these steps is given after the list):

  • Step 1: Randomly pick k centroids from the sample points as the initial cluster centers
  • Step 2: Assign each sample to the nearest centroid according to its distance
  • Step 3: Move each centroid to the center of the samples that were assigned to it
  • Step 4: Repeat steps 2 and 3 until the cluster assignments no longer change, or a user-defined tolerance or the maximum number of iterations is reached
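Below is a minimal from-scratch sketch of these four steps, written with NumPy for illustration only. It is not the scikit-learn implementation used later in this article, the helper name kmeans_sketch is hypothetical, and the empty-cluster edge case is ignored; X is assumed to be an (n_samples, n_features) array.

import numpy as np

def kmeans_sketch(X, k, max_iter=300, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k sample points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the samples assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids move less than the tolerance
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids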

FROM
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.

Part 1: Basic KMeans Clustering Algorithm
Code

from sklearn import datasets

import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'weight': 'light'}
plt.rc("font", **font)

#Section 1: Load Blobs from datasets and visualize it
X,y=datasets.make_blobs(n_samples=150,
                        n_features=2,
                        centers=3,
                        cluster_std=0.5,
                        shuffle=True,
                        random_state=0)
plt.scatter(X[:,0],X[:,1],c='white',marker='o',edgecolors='black',s=50)
plt.grid()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.savefig('./fig1.png')
plt.show()

#Section 2: Use KMeans algorithm to visualize data points and centroids
from sklearn.cluster import KMeans

#Set n_init=10 to run k-means clustering algorithm 10 times independently
#with different centroids to choose the final model with the lowest SSE
km=KMeans(n_clusters=3,
          init='random',
          n_init=10,
          max_iter=300,
          tol=1e-4,
          random_state=0)

y_km=km.fit_predict(X)
plt.scatter(X[y_km==0,0],
            X[y_km==0,1],
            s=50,
            c='lightgreen',
            marker='s',
            edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_km==1,0],
            X[y_km==1,1],
            s=50,
            c='orange',
            marker='o',
            edgecolor='black',
            label='Cluster 2')
plt.scatter(X[y_km==2,0],
            X[y_km==2,1],
            s=50,
            c='lightblue',
            marker='v',
            edgecolor='black',
            label='Cluster 3')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],
            s=250,
            marker='*',
            c='red',
            edgecolor='black',
            label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid()
plt.show()
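The elbow and silhouette indexes mentioned in the introduction can be computed directly with scikit-learn. The following is a minimal sketch, reusing the X and y_km variables and the KMeans parameters from above: it records the distortion (within-cluster SSE, exposed by scikit-learn as inertia_) for increasing cluster numbers to locate the elbow, and prints the average silhouette coefficient of the 3-cluster model, which according to the abstract should be close to 0.72.

#Section 3: Evaluate the choice of cluster number with the elbow and silhouette indexes
from sklearn.metrics import silhouette_score

#Elbow method: record the distortion (SSE, km.inertia_) for k=1..10
distortions=[]
for k in range(1,11):
    km_k=KMeans(n_clusters=k,
                init='random',
                n_init=10,
                max_iter=300,
                tol=1e-4,
                random_state=0)
    km_k.fit(X)
    distortions.append(km_k.inertia_)
plt.plot(range(1,11),distortions,marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion (SSE)')
plt.grid()
plt.show()

#Average silhouette coefficient over all samples for the 3-cluster model fitted above
print('Average silhouette coefficient: %.3f'%silhouette_score(X,y_km,metric='euclidean'))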