The previous articles all covered supervised learning algorithms; this chapter introduces our first unsupervised learning algorithm: K-means clustering.
Supervised learning covers classification and regression, and in both cases we know the target variable in advance. In unsupervised learning there is no such target. Unsupervised learning typically answers questions such as:
- What classes can data set S be divided into? (Note that here we do not know in advance which classes exist in S, whereas in supervised learning we do.)
- Which features occur most frequently in S?
- Which features in S are correlated with each other?
This chapter introduces the K-means clustering algorithm, which can partition a data set efficiently. It covers:
- The K-means algorithm
- Analysis of the local-convergence problem
- The bisecting K-means algorithm

Some material is adapted from *Machine Learning in Action*.
The K-means algorithm
The basic idea of K-means is as follows. Given a data set S that we want to partition into k clusters, first pick k points at random as the initial centroids. Then, for every point in S, find the nearest centroid and the distance to it. After each full pass, update every centroid to the mean of the points currently assigned to it. Repeat these two steps until no point changes its cluster assignment.
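To make the assign/update loop concrete before looking at the full module, here is a minimal, self-contained sketch of one-dimensional K-means on a made-up six-point data set (the data and initial centroids are illustrative only, not from the module below):

```python
import numpy as np

# Hypothetical 1-D toy data: two obvious groups, around 0 and around 10.
points = np.array([0.0, 1.0, 2.0, 9.0, 10.0, 11.0])
centroids = np.array([1.0, 10.0])  # assumed initial centroids

for _ in range(10):  # iterate until assignments stabilize (10 is plenty here)
    # Assignment step: each point goes to its nearest centroid.
    labels = np.array([np.argmin(np.abs(centroids - p)) for p in points])
    # Update step: each centroid moves to the mean of its assigned points.
    new_centroids = np.array([points[labels == j].mean() for j in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # no centroid moved, so no assignment can change either
    centroids = new_centroids

print(centroids)  # -> [ 1. 10.]
```

Here the initial centroids already sit at the group means, so the loop converges in a single check; with worse seeds, several assign/update rounds would run before the assignments stop changing.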
The implementation follows. Create a module k_mean.py and enter the following code:
```python
import numpy as np
import matplotlib.pyplot as plt


def load_data_set(file_name):
    data_mat = []
    with open(file_name) as f:
        for line in f.readlines():
            cur_line = line.strip().split('\t')
            flt_line = list(map(float, cur_line))
            data_mat.append(flt_line)
    return data_mat


def dist_eclud(vec_A, vec_B):
    return np.sqrt(np.sum(np.power(vec_A - vec_B, 2)))


def rand_cent(data_set, k):
    n = np.shape(data_set)[1]
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        min_j = np.min(data_set[:, j])
        range_j = np.max(data_set[:, j]) - min_j
        centroids[:, j] = min_j + range_j * np.random.rand(k, 1)
    return centroids


def k_mean(data_set, k, dist_meas=dist_eclud, create_cent=rand_cent):
    m = np.shape(data_set)[0]
    # cluster_assment column 0: index of the closest centroid for each point
    # cluster_assment column 1: squared distance to that centroid
    cluster_assment = np.mat(np.zeros((m, 2)))
    centroids = create_cent(data_set, k)
    cluster_changed = True
    while cluster_changed:
        cluster_changed = False
        for i in range(m):
            min_dist = np.inf
            min_index = -1
            for j in range(k):
                dist_JI = dist_meas(centroids[j, :], data_set[i, :])
                if dist_JI < min_dist:
                    min_dist = dist_JI
                    min_index = j
            if cluster_assment[i, 0] != min_index:
                cluster_changed = True
            cluster_assment[i, :] = min_index, min_dist ** 2
        # print(centroids)
        # Recalculate centroids: the mean of each cluster's points
        # becomes the centroid's new location.
        for cent in range(k):
            pts_in_cluster = data_set[np.nonzero(cluster_assment[:, 0].A == cent)[0]]
            centroids[cent, :] = np.mean(pts_in_cluster, axis=0)
    return centroids, cluster_assment


def plot_result(data_set, centroids, cluster_assment, title='K Mean'):
    for k in range(len(centroids)):
        k_data_set = data_set[np.nonzero(cluster_assment[:, 0].A == k)[0]]
        x = np.array(k_data_set[:, 0])
        y = np.array(k_data_set[:, 1])
        plt.plot(x, y, 'o', label='Data set %s' % k)
    cent_x = np.array(centroids[:, 0])
    cent_y = np.array(centroids[:, 1])
    plt.plot(cent_x, cent_y, 'd', label='Centroid Values')
    plt.title(title)
    plt.legend()
    plt.show()


if __name__ == '__main__':
    data_set = np.mat(load_data_set('testSet.txt'))
    centroids, cluster_assment = k_mean(data_set, 4)
    print("Centroids:")
    print(centroids)
    plot_result(data_set, centroids, cluster_assment)
```
Output:

```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/kMean/k_mean.py
Centroids:
[[ 2.6265299   3.10868015]
 [-3.53973889 -2.89384326]
 [-2.46154315  2.78737555]
 [ 2.65077367 -2.79019029]]
```
Figure:
As the figure shows, the original data set is correctly partitioned into four clusters; the printed output gives the positions of the four cluster means, i.e. the centroids.
Analysis of the local-convergence problem
The K-means algorithm above places its initial centroids at random, so it may converge to a local rather than the global optimum, as shown in the figure below.
With the same data set and the same algorithm, the result above appears occasionally over repeated runs. The cause is that all k random centroids are created in one shot, so some of them may start out very close to each other, and the algorithm then fails to separate the clusters correctly.
The bisecting K-means algorithm addresses this problem.
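One way to see the effect numerically is to compare the final SSE of runs started from different centroids. The sketch below is purely illustrative (toy 1-D data and a simplified k-means written just for this demonstration, not the module above): a run seeded with two centroids inside the same group gets stuck with a much larger SSE than a well-seeded run.

```python
import numpy as np

def kmeans_1d(points, centroids, n_iter=20):
    """Toy 1-D k-means for illustration; returns final centroids and total SSE."""
    for _ in range(n_iter):
        labels = np.array([np.argmin(np.abs(centroids - p)) for p in points])
        centroids = np.array([points[labels == j].mean() for j in range(len(centroids))])
    sse = float(sum((p - centroids[l]) ** 2 for p, l in zip(points, labels)))
    return centroids, sse

points = np.array([0.0, 1.0, 10.0, 11.0, 20.0, 21.0])  # three clear groups

good_init = np.array([0.5, 10.5, 20.5])  # one centroid per group
bad_init = np.array([0.0, 1.0, 15.0])    # two centroids inside the first group

sse_good = kmeans_1d(points, good_init)[1]
sse_bad = kmeans_1d(points, bad_init)[1]
print(sse_good, sse_bad)  # -> 1.5 101.0
```

With the bad seeding, two centroids permanently share the first group while a single centroid is left covering the other two groups; no assign/update step can ever move it out, which is exactly the local optimum described above.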
The bisecting K-means algorithm
To overcome the local-convergence problem of K-means, the bisecting K-means algorithm was proposed. It starts by treating all points as a single cluster and splits that cluster in two. It then repeatedly picks one cluster and splits it in two, choosing the cluster whose split most reduces the overall error, until k clusters have been produced. The error here is measured by the SSE (Sum of Squared Errors).
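The split-selection rule can be sketched on a toy 1-D example. Everything below is illustrative: the midpoint cut stands in for the 2-means split the real algorithm would run. The point is only the bookkeeping: total error = SSE of the candidate's two halves + SSE of all untouched clusters, and we split whichever cluster minimizes that total.

```python
import numpy as np

def sse(points, center):
    """Sum of squared distances from points to a single center."""
    return float(np.sum((points - center) ** 2))

def split_in_two(points):
    """Deterministic stand-in for a 2-means split: cut at the mid-range."""
    mid = (points.min() + points.max()) / 2
    return points[points <= mid], points[points > mid]

# Two current clusters; cluster_b clearly still contains two groups.
cluster_a = np.array([0.0, 1.0, 2.0])
cluster_b = np.array([10.0, 11.0, 20.0, 21.0])
clusters = [cluster_a, cluster_b]

best_i, best_total = None, np.inf
for i, c in enumerate(clusters):
    left, right = split_in_two(c)
    sse_split = sse(left, left.mean()) + sse(right, right.mean())
    sse_not_split = sum(sse(o, o.mean()) for j, o in enumerate(clusters) if j != i)
    total = sse_split + sse_not_split
    if total < best_total:
        best_i, best_total = i, total

print(best_i)  # -> 1 (splitting cluster_b reduces the total SSE the most)
```

Splitting cluster_a barely helps (its points are already tight), while splitting cluster_b separates its two groups and collapses most of the error, so cluster_b is chosen.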
The implementation follows. Create a module bisecting_k_mean.py and enter the following code:
```python
import numpy as np

import kMean.k_mean as kmean


def bin_k_mean(data_set, k, dist_meas=kmean.dist_eclud):
    m = np.shape(data_set)[0]
    cluster_assment = np.mat(np.zeros((m, 2)))
    # Start with a single cluster whose centroid is the mean of all points.
    centroid0 = np.mean(data_set, axis=0).tolist()[0]
    cent_list = [centroid0]
    for j in range(m):  # initial SSE contribution of every point
        cluster_assment[j, 1] = dist_meas(np.mat(centroid0), data_set[j, :]) ** 2
    while len(cent_list) < k:
        lowest_SSE = np.inf
        for i in range(len(cent_list)):
            pts_in_curr_cluster = data_set[np.nonzero(cluster_assment[:, 0].A == i)[0], :]
            centroid_mat, split_clust_ass = kmean.k_mean(pts_in_curr_cluster, 2, dist_meas)
            sse_split = sum(split_clust_ass[:, 1])
            sse_not_split = sum(cluster_assment[np.nonzero(cluster_assment[:, 0].A != i)[0], 1])
            # print("sse_split, and notSplit: %r, %r" % (sse_split, sse_not_split))
            if (sse_split + sse_not_split) < lowest_SSE:
                best_cent_to_split = i
                best_new_cents = centroid_mat
                best_clust_ass = split_clust_ass.copy()
                lowest_SSE = sse_split + sse_not_split
        # Relabel the two halves: one keeps the split cluster's index,
        # the other gets the next free index.
        best_clust_ass[np.nonzero(best_clust_ass[:, 0].A == 1)[0], 0] = len(cent_list)
        best_clust_ass[np.nonzero(best_clust_ass[:, 0].A == 0)[0], 0] = best_cent_to_split
        # print('the best_cent_to_split is: %r' % best_cent_to_split)
        # print('the len of best_clust_ass is: %r' % len(best_clust_ass))
        cent_list[best_cent_to_split] = best_new_cents[0, :].tolist()[0]
        cent_list.append(best_new_cents[1, :].tolist()[0])
        cluster_assment[np.nonzero(cluster_assment[:, 0].A == best_cent_to_split)[0], :] = best_clust_ass
    return np.mat(cent_list), cluster_assment


if __name__ == '__main__':
    data_set = np.mat(kmean.load_data_set('testSet.txt'))
    centroids, cluster_assment = bin_k_mean(data_set, 4)
    print("Centroids:")
    print(centroids)
    kmean.plot_result(data_set, centroids, cluster_assment, 'Bisecting K Mean')
```
Output:

```
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/kMean/bisecting_k_mean.py
Centroids:
[[-2.46154315  2.78737555]
 [ 2.80293085 -2.7315146 ]
 [-3.38237045 -2.9473363 ]
 [ 2.6265299   3.10868015]]
```
Figure:
Because bisecting K-means builds on K-means, its results still carry some randomness. But since each step performs only a single two-way split and keeps the split that most reduces the SSE, it is far less likely to get stuck in a poor local optimum. This is why bisecting K-means generally clusters better than plain K-means.