Title: Efficient Clustering Based On A Unified View Of K-means And Ratio-cut
Summary
- The paper shows that k-means and spectral clustering can be expressed in a single unified formulation. The two emphasize different goals: k-means tries to make samples in the same cluster as close to each other as possible, while spectral clustering tries to make samples in the same cluster as similar to each other as possible. Replacing one term of this shared formulation yields a unified clustering method that minimizes the sum of pairwise distances among samples within each cluster. Combined with an accelerated optimization algorithm, it outperforms plain k-means and spectral clustering in accuracy, time complexity, and a range of clustering metrics.
Problem Statement
- The two dominant traditional clustering methods are k-means and spectral clustering, but both have notable shortcomings:
- The time complexity of spectral clustering grows with the square of the number of samples, so its cost explodes as the dataset grows. Moreover, spectral clustering splits feature extraction and clustering into two separate stages, which loses information and lets the final clustering deviate from the solution of the original problem.
- k-means cannot separate clusters that are not linearly separable in the input space. In addition, on larger datasets it is sensitive to the initialization of the cluster centers: different initializations lead to different clustering results.
- Overcoming these problems with a new clustering algorithm is therefore pressing.
Method
- Spectral clustering algorithm:
- Compute the n×n similarity matrix.
- Compute the graph Laplacian from the similarity matrix.
- Eigendecompose the Laplacian and take the eigenvectors corresponding to its c smallest eigenvalues.
- Run k-means on the rows of these eigenvectors.
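The four steps above can be sketched as follows. This is a minimal illustration, not the paper's method: the Gaussian kernel width `sigma`, the unnormalized Laplacian, and the deterministic farthest-point k-means initialization are all assumed choices for the sketch.

```python
import numpy as np

def spectral_clustering(X, c, sigma=1.0):
    """Sketch of the spectral clustering pipeline described above."""
    # 1. n x n similarity matrix (Gaussian kernel, an assumed choice)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # 2. unnormalized graph Laplacian L = D - W
    L = np.diag(W.sum(axis=1)) - W
    # 3. eigenvectors of the c smallest eigenvalues (eigh sorts ascending)
    _, vecs = np.linalg.eigh(L)
    F = vecs[:, :c]  # n x c spectral embedding
    # 4. k-means on the rows of F (deterministic farthest-point init)
    centers = F[[0]]
    for _ in range(c - 1):
        d = ((F[:, None] - centers[None]) ** 2).sum(-1).min(axis=1)
        centers = np.vstack([centers, F[np.argmax(d)]])
    for _ in range(100):
        labels = ((F[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        new = np.array([F[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(c)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

Note how the quadratic cost criticized in the Problem Statement is visible here: both the similarity matrix and the eigendecomposition operate on n×n arrays.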
- k-means clustering algorithm:
- Randomly initialize the cluster centers.
- Assign each sample to the cluster of its nearest center.
- Recompute each center from the samples assigned to it; iterate until the centers stop moving.
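The three steps above are Lloyd's algorithm, which can be sketched as follows. The seeded RNG and the choice of initial centers from the data points are illustrative assumptions, not details fixed by the notes.

```python
import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm following the three steps above."""
    rng = np.random.default_rng(seed)
    # 1. randomly initialize the centers from c distinct data points
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        # 2. assign each sample to the cluster of its nearest center
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        # 3. recompute each center; stop once the centers no longer move
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(c)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

The sensitivity to initialization mentioned in the Problem Statement shows up directly here: changing `seed` changes the starting centers and can change the final partition.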
- Unified formulation:
- Objective function to be optimized:
- Accelerated optimization algorithm:
Evaluation
- Comparison of time complexity.
- Comparison on a variety of clustering metrics.
Conclusion
- Combining the two clustering methods yields an efficient clustering method with linear time complexity.
Notes
- Spectral clustering and k-means, the two major traditional clustering methods, still attract a lot of attention, even though a variety of novel clustering algorithms have been proposed in recent years.
- Given a set of input patterns, the purpose of clustering is to group the data into a certain number of clusters so that the samples in the same cluster are similar to each other, and the samples in different clusters are not.
- A series of algorithms have been proposed for cluster analysis and applied successfully to various areas, such as document clustering, image segmentation, and social networks.
- To obtain the final solution, most SC algorithms follow a two-stage approach, which may result in a poor clustering structure and deviation from the solution of the original problem.
References
- k-means++: The advantages of careful seeding.
- Fast and provably good seedings for k-means.
- Fast spectral clustering with anchor graph for large hyperspectral images.
- Cross-Pose LFW: A database for studying cross-pose face recognition in unconstrained environments.
- A review of the k-means algorithm.
- Scalable kernel k-means clustering with Nyström approximation: relative-error bounds.