sklearn: K-means and MiniBatchKMeans

K-means:
Caveats for K-means: it does not work well when features are on very different scales (which produces elongated, "flat" clusters) or when the clusters are non-convex; preprocessing such as PCA (together with feature scaling) should be applied first.
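K-means relies on Euclidean distance, so a feature on a much larger scale dominates the objective. As a minimal sketch of the scaling point (the data and pipeline below are illustrative, not from the original example), features can be standardized before clustering:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy data: the second feature is on a ~100x larger scale than the first.
rng = np.random.RandomState(0)
X = np.c_[rng.randn(300), 100 * rng.randn(300)]

# Rescale each feature to zero mean / unit variance before K-means,
# so that no single feature dominates the Euclidean distances.
model = make_pipeline(StandardScaler(), KMeans(n_clusters=2, random_state=0))
labels = model.fit_predict(X)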

Estimating the covariance matrix shows that make_blobs generates each cluster from an isotropic (identity-covariance) Gaussian; cluster_std is the standard deviation of each cluster.
The "Anisotropicly Distributed Blobs" panel below applies a strong linear transformation (no noise is added) that induces a strong negative correlation: after the transformation the correlation coefficient is about -0.9507, so the points are no longer isotropically distributed and the two features are far from independent.
K-means is of course a poor fit in this situation, which is exactly why PCA preprocessing matters.
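As a rough sketch of how to check this (it reuses the make_blobs data and transformation matrix from the example below; np.corrcoef and PCA whitening are illustrative choices here, not part of the original example):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=1500, random_state=170)
# Within a single blob the two features are roughly uncorrelated,
# since make_blobs draws each cluster from an isotropic Gaussian.
print(np.corrcoef(X[y == 0].T))

# Apply the same strong linear transformation as in the example below.
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.852553229]]
X_aniso = np.dot(X, transformation)
print(np.corrcoef(X_aniso.T))  # off-diagonal element is strongly negative (about -0.95)

# PCA with whitening rotates and rescales the data back towards an
# isotropic shape, after which K-means is a much more reasonable model.
X_white = PCA(n_components=2, whiten=True).fit_transform(X_aniso)
labels = KMeans(n_clusters=3, random_state=170).fit_predict(X_white)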

Below are some data shapes for which K-means is a poor choice (although the last two cases do not actually turn out too badly):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))
n_samples = 1500
random_state = 170

# Three well-separated isotropic blobs.
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# 1. Wrong number of clusters: ask for 2 clusters when there are 3.
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)
plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# 2. Anisotropic blobs: a strong linear transformation makes the clusters
#    elongated and strongly negatively correlated.
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.852553229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)
plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropicly Distributed Blobs")

# 3. Unequal variance: each blob has a different cluster_std.
X_varied, y_varied = make_blobs(n_samples=n_samples,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)
plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# 4. Unevenly sized blobs: keep 500 / 100 / 10 points per cluster.
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_filtered)
plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()
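The first panel ("Incorrect Number of Blobs") is simply a matter of choosing n_clusters. A small sketch of the usual elbow-style check using inertia_ (the within-cluster sum of squared distances); this check is an illustrative addition, not part of the original example:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1500, random_state=170)

# inertia_ always decreases as k grows; look for the "elbow" where the
# improvement flattens out (for this data it is around k = 3).
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=170).fit(X)
    print(k, km.inertia_)

The elbow is only a heuristic; silhouette or other cluster-quality scores are a common alternative.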

MiniBatchKMeans differs from ordinary K-means in that, instead of recomputing the centroids from the full dataset on every iteration, it updates them using small random subsets (mini-batches) of the data. It is faster than K-means, at the cost of somewhat lower clustering quality. Its main parameters (a small sketch follows below):
batch_size sets the number of samples drawn for each centroid update.
init='k-means++' makes the solver pick well-spread initial centroids automatically (it does not choose n_clusters; the number of clusters is still set by the user).
n_init sets how many initializations are tried (default 10 for KMeans); the run with the smallest within-cluster sum of squared distances (inertia) is kept.
max_no_improvement stops the solver early once that many consecutive mini-batches have failed to produce a clear improvement in the (smoothed) inertia.
verbose controls how much of the solving process is printed; larger values print more detail.
subplots_adjust (matplotlib) adjusts the figure margins; positions are given as fractions of the figure size, with default margins of roughly 0.1.
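A minimal sketch of the MiniBatchKMeans parameters above on toy data (the values are illustrative, not tuned; verbose=1 is only there to show the progress output):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=3, cluster_std=0.7, random_state=0)

mbk = MiniBatchKMeans(
    n_clusters=3,           # the number of clusters is still chosen by the user
    init='k-means++',       # well-spread initial centroids
    batch_size=100,         # samples drawn for each centroid update
    n_init=10,              # initializations to try; best (lowest inertia) is kept
    max_no_improvement=10,  # early stop after 10 mini-batches with no clear gain
    verbose=1,              # print progress of the solver
    random_state=0,
)
mbk.fit(X)
print(mbk.inertia_)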

sklearn.metrics.pairwise_distances_argmin:
 given two 2-D arrays, computes distances row by row and returns, for each row of the first array, the index of the closest row in the second array.

np.logical_not: element-wise logical NOT of an array.
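A tiny illustration of both helpers (the arrays are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances_argmin

A = np.array([[0.0, 0.0], [5.0, 5.0]])
B = np.array([[4.9, 5.1], [0.2, -0.1], [10.0, 10.0]])

# For each row of A, the index of the closest row of B:
print(pairwise_distances_argmin(A, B))                 # [1 0]

# Element-wise logical NOT:
print(np.logical_not(np.array([True, False, True])))   # [False  True False]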

Below is an example comparing K-means with MiniBatchKMeans:
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator is deprecated/removed

np.random.seed(0)
batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
# 3000 samples drawn around the three centers above.
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

# Standard (full-batch) K-means, timed.
k_means = KMeans(init="k-means++", n_clusters=3, n_init=10)
t0 = time.time()
k_means.fit(X)
t_batch = time.time() - t0
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)

# Mini-batch K-means with the same number of clusters, also timed.
mbk = MiniBatchKMeans(init='k-means++', n_clusters=3, batch_size=batch_size,
                      n_init=10, max_no_improvement=10, verbose=0)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0
mbk_means_labels = mbk.labels_
mbk_means_cluster_centers = mbk.cluster_centers_
mbk_means_labels_unique = np.unique(mbk_means_labels)

fig = plt.figure(figsize=(8, 3))
fig.subplots_adjust(left=0.02, right=0.98, bottom=0.05, top=0.9)
colors = ['#4EACC5', '#FF9C34', '#4E9A06']

# The two algorithms may assign different numeric labels to the same cluster;
# match each K-means center with its nearest MiniBatchKMeans center.
order = pairwise_distances_argmin(k_means_cluster_centers, mbk_means_cluster_centers)

# Left panel: K-means clustering.
ax = fig.add_subplot(1, 3, 1)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, 'train time: %.2fs\ninertia: %f' % (t_batch, k_means.inertia_))

# Middle panel: MiniBatchKMeans clustering, with clusters re-ordered so that
# colors match the K-means panel.
ax = fig.add_subplot(1, 3, 2)
for k, col in zip(range(n_clusters), colors):
    my_members = mbk_means_labels == order[k]
    cluster_center = mbk_means_cluster_centers[order[k]]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title("MiniBatchKMeans")
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, 'train time: %.2fs\ninertia: %f' % (t_mini_batch, mbk.inertia_))

# Right panel: points on which the two algorithms disagree.
# Labels are 0..2, so (mbk_means_labels == 4) is an all-False boolean array
# used purely to initialise the mask.
different = (mbk_means_labels == 4)
ax = fig.add_subplot(1, 3, 3)
for k in range(n_clusters):
    different += ((k_means_labels == k) != (mbk_means_labels == order[k]))

identic = np.logical_not(different)
ax.plot(X[identic, 0], X[identic, 1], 'w', markerfacecolor='#bbbbbb', marker='.')
ax.plot(X[different, 0], X[different, 1], 'w', markerfacecolor='m', marker='.')
ax.set_title('Difference')
ax.set_xticks(())
ax.set_yticks(())

plt.show()
