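Overview
1. Randomly select K samples as the initial centroids.
2. Start the loop: assign each sample to its nearest centroid, forming K clusters.
3. For each cluster, compute the mean of all samples assigned to it and use that mean as the new centroid.
4. When the centroids no longer change, the iteration stops and clustering is complete.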
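
The four steps above can be sketched in plain NumPy. This is only a minimal illustration and not how scikit-learn implements the algorithm; the helper name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Hypothetical helper: plain K-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the new centroid is the mean of the samples in each cluster.
        # An empty cluster keeps its old centroid (a simplification).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (made up for illustration): two well-separated groups of points.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(labels)
print(centroids)
```

Which of the two groups receives label 0 depends on the random initial choice, so only the grouping itself is meaningful, not the label numbers.
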
Basic modeling
from sklearn.cluster import KMeans
# x: feature matrix of shape (n_samples, n_features)
kmean = KMeans(n_clusters=10, init='random', random_state=420).fit(x)
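Common parameters
Parameter | Meaning |
---|---|
n_clusters | Number of clusters to form. Default: 8 |
init | Centroid initialization method. Default: 'k-means++', a scheme for choosing the initial cluster centers of K-means. Also accepts 'random' or an array of shape (n_clusters, n_features) holding the initial centroids |
n_init | Number of times the K-means algorithm is run with different random centroid seeds. The final result is the best of the n_init consecutive runs, as measured by inertia |
max_iter | Maximum number of iterations for a single run |
n_jobs | Number of threads (parallel jobs) to use. -1 means all available processors, -2 means all but one, and so on. Note: this parameter was deprecated in scikit-learn 0.23 and removed in 1.0 |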
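
As a rough illustration of the parameters in the table, the sketch below passes an explicit array of starting centroids to `init` and caps the iterations with `max_iter`. The toy data from `make_blobs` and the names `X_demo`, `start`, and `km` are assumptions for this example only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (assumed for illustration): 300 points around 3 centers.
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# init as an array of shape (n_clusters, n_features): here the first three
# samples serve as starting centroids. With a fixed start, repeated runs would
# be identical, so n_init is set to 1.
start = X_demo[:3].copy()
km = KMeans(n_clusters=3, init=start, n_init=1, max_iter=50).fit(X_demo)

print(km.n_iter_)           # iterations actually used, at most max_iter
print(km.cluster_centers_)  # one row per cluster
```
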
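
Attribute | Description |
---|---|
cluster_centers_ | Centroids the algorithm converged to |
labels_ | Cluster label of each sample |
inertia_ | Sum of squared distances from each sample to its nearest cluster center (the "within-cluster sum of squares") |
n_iter_ | Number of iterations actually run |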
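
A short sketch of reading these attributes after fitting. The toy data and variable names are assumptions for illustration; the last lines recompute `inertia_` by hand to show that it is a sum of squared distances to the assigned centroids.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (assumed for illustration): 200 points around 4 centers.
X_demo, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=1)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

print(km.cluster_centers_.shape)  # (4, 2): one centroid per cluster
print(km.labels_[:10])            # cluster index of the first 10 samples
print(km.n_iter_)                 # iterations used by the best run

# Recompute inertia_: squared distance of each sample to its own centroid.
manual_inertia = ((X_demo - km.cluster_centers_[km.labels_]) ** 2).sum()
print(km.inertia_, manual_inertia)
```
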
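
Method | Description |
---|---|
fit_predict | Fits the model and returns the cluster index of each sample |
fit_transform | Fits the model and returns the distance from each sample to every cluster center |
get_params | Gets the parameters of this estimator |
set_params | Resets the parameters |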
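
The methods in the table can be tried out as follows. This is a minimal sketch on made-up data; `X_demo`, `labels`, and `distances` are names chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (assumed for illustration): 150 points around 3 centers.
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=2)

km = KMeans(n_clusters=3, n_init=10, random_state=0)

labels = km.fit_predict(X_demo)       # fit, then cluster index per sample
print(labels.shape)                   # (150,)

distances = km.fit_transform(X_demo)  # fit, then distance to each centroid
print(distances.shape)                # (150, 3)

print(km.get_params()["n_clusters"])  # 3
km.set_params(n_clusters=5)           # takes effect at the next fit
km.fit(X_demo)
print(km.cluster_centers_.shape)      # (5, 2)
```
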
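Formulas
For a given cluster, the smaller the sum of distances from its samples to the centroid, the more similar the samples in that cluster are and the smaller the within-cluster variation. Three distance measures are commonly used: Euclidean distance, Manhattan distance, and cosine distance.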
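
The three distance measures named above can be written out in NumPy as a small sketch. The function names and sample vectors are assumptions for illustration, and cosine distance is taken here as 1 minus the cosine similarity, which is one common convention. Note that scikit-learn's `KMeans` itself works with Euclidean distance.

```python
import numpy as np


def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((a - b) ** 2))


def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))


def cosine_distance(a, b):
    # 1 - cos(angle between a and b); assumes neither vector is all zeros.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


sample = np.array([1.0, 2.0])    # a sample point (made up)
centroid = np.array([4.0, 6.0])  # a centroid (made up)

print(euclidean(sample, centroid))        # 5.0
print(manhattan(sample, centroid))        # 7.0
print(cosine_distance(sample, centroid))  # small, the vectors nearly align
```
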
Silhouette coefficient
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

# X is assumed to be an existing feature matrix of shape (n_samples, n_features)
# with at least two columns; it is not defined in this snippet.
for n_clusters in [2, 3, 4, 5, 6, 7]:
    # Left panel: silhouette plot. Right panel: scatter of the clustered data.
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    ax1.set_xlim([-0.1, 1])
    # Reserve a gap of 10 rows between the silhouette bands of the clusters.
    ax1.set_ylim([0, X.shape[0] + (n_clusters + 1) * 10])

    clusterer = KMeans(n_clusters=n_clusters, random_state=10).fit(X)
    cluster_labels = clusterer.labels_
    # Mean silhouette coefficient over all samples.
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    # Silhouette coefficient of each individual sample.
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Sorted silhouette values of the samples in cluster i.
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          ith_cluster_silhouette_values,
                          facecolor=color,
                          alpha=0.7)
        # Label the band with its cluster number at mid-height.
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # Vertical line at the average silhouette score.
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # Scatter plot of the first two features, colored by cluster.
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='o', s=8, c=colors)
    centers = clusterer.cluster_centers_
    # Mark the cluster centers with red crosses.
    ax2.scatter(centers[:, 0], centers[:, 1], marker='x',
                c="red", alpha=1, s=200)
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
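
For reference, the per-sample values plotted above follow the standard silhouette definition s = (b - a) / max(a, b), where a is the mean distance from a sample to the other members of its own cluster and b is the smallest mean distance from that sample to the members of any other cluster. The sketch below recomputes this by hand on made-up data and compares the result with `silhouette_samples`. The helper name `silhouette_by_hand` and the toy arrays are assumptions, and the helper assumes every cluster contains at least two samples.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def silhouette_by_hand(X, labels):
    """Hypothetical helper: s = (b - a) / max(a, b) for every sample."""
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        # a: mean distance to the other members of the same cluster
        # (the sample itself contributes distance 0, hence the "- 1").
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of another cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Toy data (made up for illustration): two groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_demo = np.array([0, 0, 0, 1, 1, 1])

manual = silhouette_by_hand(X_demo, labels_demo)
reference = silhouette_samples(X_demo, labels_demo)
print(np.allclose(manual, reference))
```

Values close to 1 indicate a sample that sits well inside its own cluster, values near 0 a sample on the boundary between clusters, and negative values a sample that is on average closer to another cluster.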