Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses. (a) Original dataset. (b) Random initial cluster centroids (in this instance, not chosen to be equal to twotrainingexamples). (c-f) Illustration of running two iterations of k-means. In each iteration, we assign each training example to the closest cluster centroid(shown by “painting” the training examples the same color as the cluster centroid to which is assigned); then we move each cluster centroid to the mean of the points assigned to it. (Best viewed in color.) Images courtesy Michael Jordan.
注意:
问题1:K值选取问题
K的选取通常是我们的目标,也就是说,我们要将这队数据分为几类。因此,是相对明确的。
问题2:初始值的选取问题
初始值的选取对于迭代的结果有较大的影响,选取不当,会出现所有点都归为一类的情况。一个通常的解决方案是:随机选取多组初始值进行分类,选取损失函数最小的分类结果。
编程举例:
将如下三维空间的点进行k-means分类:
[input.txt]
1.0 , 5.7 , 2.8
4.5 , 5.2 , -0.3
-0.9 , 8.1 , 1.4