1. Basic code
sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init=10,
max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True,
algorithm='auto')
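Parameters:
(1) n_clusters: the number of clusters to form, which is also the number of centroids to generate; the default is 8.
(2) max_iter: the maximum number of iterations for a single run of the algorithm; the default is 300.
(3) n_init: the number of times the algorithm is run with different initial centroids. With n_init=10, ten initializations are tried and the run with the lowest inertia is kept as the final model.
(4) init: the method used to choose the initial centroids. 'k-means++' picks starting centroids that are spread out, which usually speeds up convergence. It does not choose n_clusters; the number of clusters must always be supplied by the caller.
(5) tol: float, default 1e-4. The relative tolerance on the change in cluster centers between two consecutive iterations (measured by the Frobenius norm), used to declare convergence.
(6) n_jobs: the number of parallel jobs used for the computation in older scikit-learn releases. It does not appear in the signature above because it was deprecated in version 0.23 and later removed.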
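To make the role of init and n_init concrete, here is a minimal sketch that fits the same data with init='k-means++' and init='random'. The toy data set and the names X_demo, common, km_pp and km_rand are assumptions made for illustration; only the KMeans parameters come from the signature above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration): 300 points around 3 centers.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# n_clusters is always given explicitly; init only decides how the
# starting centroids are picked, and n_init how many starts are tried.
common = dict(n_clusters=3, n_init=10, max_iter=300, tol=1e-4, random_state=0)
km_pp = KMeans(init='k-means++', **common).fit(X_demo)
km_rand = KMeans(init='random', **common).fit(X_demo)

# Compare the final inertia and the iteration count of the best run.
print('k-means++ inertia:', km_pp.inertia_, 'iterations:', km_pp.n_iter_)
print('random    inertia:', km_rand.inertia_, 'iterations:', km_rand.n_iter_)
```

On well-separated data such as this, both initializations are likely to end at a similar inertia; the difference tends to matter more on harder data or when n_init is small.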
Attributes:
(1) cluster_centers_: ndarray of shape (n_clusters, n_features). Coordinates of the cluster centers.
(2) labels_: ndarray of shape (n_samples,). Label of each point.
(3) inertia_: float. Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.
(4) n_iter_: int. Number of iterations run.
(5) n_features_in_: int. Number of features seen during fit.
(6) feature_names_in_: ndarray of shape (n_features_in_,). Names of features seen during fit.
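The attributes listed above can be inspected after fitting. The following is a small sketch on made-up data (X_demo, km, assigned and manual_inertia are illustrative names, not part of the original text); it also recomputes the inertia by hand as a check of the definition, assuming no sample weights are passed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration): 200 samples, 2 features.
X_demo, _ = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

print(km.cluster_centers_.shape)  # (3, 2)
print(km.labels_.shape)           # (200,)
print(km.n_iter_, km.n_features_in_)

# Inertia by hand: squared distance from each sample to its assigned center.
assigned = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_demo - assigned) ** 2)
print(manual_inertia, km.inertia_)
```

The two inertia values should agree up to floating-point error.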
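Methods:
(1) fit(X[, y, sample_weight]): computes the k-means clustering. X: {array-like, sparse matrix} of shape (n_samples, n_features).
(2) fit_predict(X, y=None, sample_weight=None): computes the cluster centers and predicts the cluster index of each sample.
Returns: labels, ndarray of shape (n_samples,). Index of the cluster each sample belongs to.
(3) fit_transform(X, y=None, sample_weight=None): computes the clustering and transforms X to cluster-distance space.
Returns: X_new, ndarray of shape (n_samples, n_clusters).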
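The difference between the return values of fit_predict and fit_transform is easiest to see side by side. This is a minimal sketch on made-up data (X_demo, labels, distances and agreement are illustrative names); both estimators use the same random_state so that they arrive at the same clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration): 200 samples, 2 features.
X_demo, _ = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)

# fit_predict returns one cluster index per sample.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
# fit_transform returns the distance from each sample to every cluster center.
distances = KMeans(n_clusters=3, n_init=10, random_state=0).fit_transform(X_demo)

print(labels.shape)     # (200,)
print(distances.shape)  # (200, 3)

# The column with the smallest distance should correspond to the assigned label.
agreement = np.mean(distances.argmin(axis=1) == labels)
print('fraction of samples where argmin matches the label:', agreement)
```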
Example implementation:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn import metrics
# Generate sample features X and true cluster labels y: 1000 samples with 2 features each,
# drawn from 4 clusters centered at [-1,-1], [0,0], [1,1], [2,2]
# with standard deviations [0.4, 0.2, 0.2, 0.2]
X, y = make_blobs(n_samples=1000,
                  n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2],
                  random_state=9)
# Draw a scatter plot, colored by the true cluster label y
plt.figure(figsize=(8, 6))  # set the figure size
plt.scatter(X[:, 0], X[:, 1], c=y, marker='o', cmap='viridis')
plt.colorbar()  # add a color bar showing the cluster labels
plt.title('Scatter plot of the blobs with true labels')  # add a title
plt.xlabel('Feature 1')  # x-axis label
plt.ylabel('Feature 2')  # y-axis label
plt.show()
# Run KMeans for several values of k and compute the silhouette score of each result
plt.figure(figsize=(12, 10))  # set the figure size
for index, k in enumerate((2, 3, 4, 5)):
    plt.subplot(2, 2, index + 1)  # select one panel of the 2x2 grid
    # KMeans clustering
    y_pred = KMeans(n_clusters=k, random_state=9).fit_predict(X)
    # Compute the silhouette score
    score = metrics.silhouette_score(X, y_pred, metric='euclidean')
    # Scatter plot of the clustering result
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
    plt.title(f'KMeans with k={k}')  # add a title
    plt.xlabel('Feature 1')  # x-axis label
    plt.ylabel('Feature 2')  # y-axis label
    # Write the silhouette score in the lower-right corner of the panel
    plt.text(.99, .01, ('score: %.2f' % score), transform=plt.gca().transAxes,
             size=10, horizontalalignment='right')
plt.tight_layout()  # adjust the spacing between subplots
plt.show()
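The silhouette scores printed in the panels can also be collected and compared directly, without plotting. The following is a sketch that regenerates the same data so it runs on its own; the names scores and best_k are introduced here for illustration, and n_init=10 is set explicitly to match the signature shown earlier.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as in the example above.
X, y = make_blobs(n_samples=1000,
                  n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2],
                  random_state=9)

# Silhouette score for each candidate number of clusters.
scores = {}
for k in (2, 3, 4, 5):
    y_pred = KMeans(n_clusters=k, n_init=10, random_state=9).fit_predict(X)
    scores[k] = silhouette_score(X, y_pred, metric='euclidean')

for k, s in scores.items():
    print(f'k={k}: silhouette score {s:.3f}')

# A higher silhouette score indicates tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print('k with the highest silhouette score:', best_k)
```

The silhouette score is only one heuristic for choosing k, so it is worth checking the result against the scatter plots.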