1. Introduction to KMeans
KMeans is a simple clustering method that assigns each sample to a cluster based on its distance to the cluster centers, where $k$ is the user-specified number of clusters. Initially, $k$ points are chosen at random as the cluster centers (centroids), and the centers are then updated iteratively until the clustering converges. Because the mean of each cluster serves as its centroid, the method is also called K-means.
2. The KMeans Model
2.1 KMeans
The KMeans algorithm itself is fairly simple. Denote the $k$ cluster centers by $\mu_1, \mu_2, \dots, \mu_k$ and the number of samples in each cluster by $N_1, N_2, \dots, N_k$. KMeans uses the sum of squared errors as its objective function:

$$J(\mu_1, \mu_2, \dots, \mu_k) = \frac{1}{2}\sum_{j=1}^{k}\sum_{i=1}^{N_j}\left(x_i - \mu_j\right)^2$$

Taking the partial derivative of the loss with respect to $\mu_j$ gives

$$\frac{\partial J}{\partial \mu_j} = -\sum_{i=1}^{N_j}\left(x_i - \mu_j\right)$$

Setting the derivative to zero and solving yields

$$\mu_j = \frac{1}{N_j}\sum_{i=1}^{N_j} x_i$$

that is, the centroid of each cluster is the mean of its samples.
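A quick numerical check of this result: for a toy one-dimensional cluster, the SSE evaluated at the cluster mean is no larger than at any nearby candidate centroid, matching the closed-form solution derived above.

```python
import numpy as np

# Toy 1-D cluster: verify that the mean minimizes the sum of squared errors.
x = np.array([1.0, 2.0, 4.0, 7.0])
mu = x.mean()  # closed-form centroid from setting the derivative to zero

def sse(c):
    # sum of squared errors of the cluster about a candidate centroid c
    return np.sum((x - c) ** 2)

# SSE at the mean is no larger than at any perturbed candidate
candidates = mu + np.linspace(-2.0, 2.0, 41)
assert sse(mu) <= min(sse(c) for c in candidates) + 1e-9
```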
The KMeans code is as follows:
```python
def kmeans(self, train_data, k):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])  # (cluster index, distance)
    centers = self.createCenter(train_data)
    # pass the k given to this call (biKmeans invokes this method with k=2)
    centers, distances = self.adjustCluster(centers, distances, train_data, k)
    return centers, distances
```
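The class relies on two helpers, `calculateDistance()` and `createCenter()`, whose implementations are not shown in the post. The sketch below is an assumption about their behavior (Euclidean distance; random distinct samples as initial centroids); in the class, `createCenter()` presumably reads the cluster count from `self.k`.

```python
import numpy as np

def calculateDistance(data, center):
    # Euclidean distance from every sample to one center -> shape (n,)
    return np.sqrt(np.sum((data - center) ** 2, axis=1))

def createCenter(data, k):
    # pick k distinct samples at random as the initial centroids
    idx = np.random.choice(len(data), size=k, replace=False)
    return data[idx].copy()
```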
The `adjustCluster()` function performs the refinement step after the initial centroids have been chosen; it minimizes the loss function $J$. Its code is:
```python
def adjustCluster(self, centers, distances, train_data, k):
    sample_num = len(train_data)
    flag = True  # if True, keep updating the cluster centers
    while flag:
        flag = False
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            # calculate the distance between each sample and each cluster center
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # assign each sample to its nearest cluster center
        old_label = distances[:, 0].copy()
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # np.any() detects any changed assignment; a summed difference could
        # cancel out and stop the loop too early
        if np.any(old_label != distances[:, 0]):
            flag = True
            # update each cluster center as the mean of its samples
            for j in range(k):
                current_cluster = train_data[distances[:, 0] == j]  # samples in the j-th cluster
                if len(current_cluster) != 0:
                    centers[j, :] = np.mean(current_cluster, axis=0)
    return centers, distances
```
2.2 Bisecting KMeans
Because KMeans can converge to a local minimum, bisecting KMeans was introduced to address this problem. The idea is to first treat all samples as one large cluster and split it in two; then one of the resulting clusters is chosen and split again, until the number of clusters reaches the specified $k$. How is the cluster to split chosen? The sum of squared errors (SSE) is used as the criterion. Suppose there are currently $m$ clusters, denoted

$$C = \{c_1, c_2, \dots, c_m\}$$

The selection procedure works as follows: split a cluster $c_i$ of $C$ into two parts $c_{i1}$ and $c_{i2}$ (using ordinary KMeans); the resulting SSE is

$$SSE_i = SSE(c_{i1}, c_{i2}) + SSE(C - c_i)$$

and the cluster chosen for splitting is

$$c = \underset{c_i}{\arg\min}\; SSE_i$$

This is repeated until the number of centroids equals the specified $k$.
The code is as follows:
```python
def biKmeans(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])  # (cluster index, distance)
    initial_center = np.mean(train_data, axis=0)  # initial centroid, shape (feature_dim,)
    centers = [initial_center]  # cluster center list
    # clustering with the initial cluster center; store plain distances,
    # consistent with what kmeans()/adjustCluster() return
    distances[:, 1] = self.calculateDistance(train_data, initial_center)
    # generate cluster centers
    while len(centers) < self.k:
        min_SSE = np.inf
        best_index = None
        best_centers = None
        best_distances = None
        # find the best split
        for j in range(len(centers)):
            centerj_data = train_data[distances[:, 0] == j]  # samples in the j-th cluster
            split_centers, split_distances = self.kmeans(centerj_data, 2)
            # SSE is the sum of squared distances, not the squared sum
            split_SSE = np.sum(split_distances[:, 1] ** 2)  # SSE of the split cluster
            other_distances = distances[distances[:, 0] != j]  # samples outside the j-th cluster
            other_SSE = np.sum(other_distances[:, 1] ** 2)  # SSE of the remaining clusters
            # save the best split result
            if (split_SSE + other_SSE) < min_SSE:
                best_index = j
                best_centers = split_centers
                best_distances = split_distances
                min_SSE = split_SSE + other_SSE
        # save the split assignment
        best_distances[best_distances[:, 0] == 1, 0] = len(centers)
        best_distances[best_distances[:, 0] == 0, 0] = best_index
        centers[best_index] = best_centers[0, :]
        centers.append(best_centers[1, :])
        distances[distances[:, 0] == best_index, :] = best_distances
    centers = np.array(centers)  # transform from list to array
    return centers, distances
```
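The split-selection rule $c = \arg\min_{c_i} SSE_i$ can be illustrated in isolation. The toy sketch below replaces the inner 2-means run with a crude median split, purely an assumption for brevity; the bookkeeping (SSE of the split plus the SSE of the untouched clusters) mirrors the loop above. As expected, the spread-out cluster is the one selected for splitting.

```python
import numpy as np

def sse(points):
    # SSE of a set of points about its own mean (the cluster centroid)
    return np.sum((points - points.mean(axis=0)) ** 2)

def split(points):
    # crude 2-way split along the first feature's median
    # (stands in for the inner 2-means run of the real algorithm)
    mask = points[:, 0] <= np.median(points[:, 0])
    return points[mask], points[~mask]

clusters = [
    np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]),  # tight
    np.array([[5.0, 5.0], [9.0, 9.0], [5.0, 9.0], [9.0, 5.0]]),  # spread out
]

totals = []
for i, c in enumerate(clusters):
    a, b = split(c)
    split_SSE = sse(a) + sse(b)
    other_SSE = sum(sse(o) for j, o in enumerate(clusters) if j != i)
    totals.append(split_SSE + other_SSE)

best = int(np.argmin(totals))  # index of the cluster chosen for splitting
```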
2.3 KMeans++
Because the initial choice of centroids has a large influence on the KMeans clustering, the KMeans++ algorithm was introduced. Its idea: suppose $n$ cluster centers ($n < k$) have already been chosen,

$$\mu_1, \mu_2, \dots, \mu_n$$

Then, when selecting the $(n+1)$-th cluster center, points farther from the current $n$ centers have a higher probability of being chosen as the $(n+1)$-th center. This also matches intuition: cluster centers should of course be as far apart from each other as possible. First compute the shortest distance from each sample $x$ to the existing cluster centers,

$$D(x) = \min_{1 \le i \le n} \lVert x - \mu_i \rVert$$

then compute the probability of each sample being selected as the next cluster center,

$$P(x) = \frac{D(x)^2}{\sum_{x'} D(x')^2}$$

and pick the next center by roulette wheel selection. Once all $k$ centroids have been chosen, `adjustCluster()` is run to refine them. The KMeans++ code is as follows:
```python
def kmeansplusplus(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])  # (cluster index, distance)
    # randomly select a sample as the initial cluster center;
    # np.random.randint draws from the half-open interval [0, sample_num)
    initial_center = train_data[np.random.randint(0, sample_num)]
    centers = [initial_center]
    while len(centers) < self.k:
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            # calculate the distance between each sample and each cluster center
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # find the shortest distance from each sample to the existing centers
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # roulette wheel selection
        prob = np.power(distances[:, 1], 2) / np.sum(np.power(distances[:, 1], 2))
        index = self.rouletteWheelSelection(prob, sample_num)
        new_center = train_data[index, :]
        centers.append(new_center)
    # adjust the clusters with the chosen centers
    centers = np.array(centers)  # transform from list to array
    centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
    return centers, distances
```
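The `rouletteWheelSelection()` helper is not shown in the post. A minimal sketch of the standard technique (cumulative probabilities plus a single uniform draw) might look like the following; the `sample_num` argument the class passes is not needed for a single draw, so it is accepted but unused here.

```python
import numpy as np

def rouletteWheelSelection(prob, sample_num=None):
    # prob: 1-D array of selection probabilities summing to 1.
    # Returns one index i, chosen with probability prob[i].
    # sample_num is unused in this single-draw sketch.
    cum = np.cumsum(prob)
    r = np.random.rand()  # uniform draw in [0, 1)
    # side='right' maps r in [cum[i-1], cum[i]) to index i
    return int(np.searchsorted(cum, r, side='right'))
```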
3. Summary and Analysis
After the clustering converges, the centroids can be further refined with various techniques. There are also methods for choosing the value of $k$, such as the silhouette coefficient (Silhouette Coefficient). Finally, let us compare the results of the three clustering methods.

KMeans++ produces the best clustering results among the three methods, and their running times are roughly the same.
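As an aside on choosing $k$: the silhouette coefficient mentioned above can be computed with a small numpy-only sketch like the one below; in practice a library routine such as `sklearn.metrics.silhouette_score` does the same job.

```python
import numpy as np

def silhouette(data, labels):
    # Mean silhouette coefficient s(i) = (b - a) / max(a, b), where a is the
    # mean distance of sample i to its own cluster and b is the smallest mean
    # distance to any other cluster. data: 1-D or (n, d) array.
    n = len(data)
    scores = np.zeros(n)
    for i in range(n):
        if data.ndim == 1:
            d = np.abs(data - data[i])
        else:
            d = np.linalg.norm(data - data[i], axis=1)
        own = labels == labels[i]
        a = d[own & (np.arange(n) != i)].mean()
        b = min(d[labels == l].mean() for l in set(labels) if l != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# two well-separated 1-D clusters give a score close to 1
data = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
labels = np.array([0, 0, 0, 1, 1, 1])
score = silhouette(data, labels)
```

To pick $k$, one would run the clustering for several candidate values and keep the $k$ with the highest mean silhouette score.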
The code and dataset for this article are available at: https://github.com/Ryuk17/MachineLearning