聚类算法总结

最新推荐文章于 2024-08-15 09:49:25 发布

大魁

最新推荐文章于 2024-08-15 09:49:25 发布

阅读量1w

点赞数 1

分类专栏：机器学习

本文链接：https://blog.csdn.net/lzt1983/article/details/39236107

版权

机器学习专栏收录该内容

6 篇文章 0 订阅

订阅专栏

最近要在spark上做一个聚类的项目，数据量和类的个数都比较大。KMeans效果尚可，但是有点慢，因而重新看了下常用的算法。最终选用mini-batch kmeans，使用类似kmeans++的方法来初始化类中心。

kMeans
attention: init centers (randomize vs kMeans++)

mini-batch kMeans
loops: random select samples; find closest for all samples; update centers for each sample

mean shift
init: get centers by bandwidth
loops: find neighbors of centers; update centers; de-duplicate

DBSCAN
init: get densest core samples
loops: get more core samples nearby old samples

Ward hierarchical
init: each sample as center
loops: merge to minimize RMSE within clusters

Spectral clustering
steps: similarity matrix S; S=UV; kmeans of U

AP cluster
init: get S; Rik=0, Aik=0
loops: Rik = Sik - Max_k'!=k(Aik' + Sik'); Aik = min(0, Rkk + Sum_i'!=i,k Max(0, Ri'k)); Akk = Sum_i'!=k Max(0, Ri’k)
end: for any i, Max_k Rik + Aik as it’s exemplar

Topic Model (LDA)

...

其实scikit-learn实现了很多算法，也有现成的数据集可以做做实验。例如：http://scikit-learn.org/stable/modules/clustering.html 上有一些效果图，和算法扩展性的说明。

A comparison of the clustering algorithms in scikit-learn

Method name	Parameters	Scalability	Usecase	Geometry (metric used)
K-Means	number of clusters	Very large `n_samples`, medium `n_clusters`with MiniBatch code	General-purpose, even cluster size, flat geometry, not too many clusters	Distances between points
Affinity propagation	damping, sample preference	Not scalable with n_samples	Many clusters, uneven cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Mean-shift	bandwidth	Not scalable with`n_samples`	Many clusters, uneven cluster size, non-flat geometry	Distances between points
Spectral clustering	number of clusters	Medium `n_samples`, small `n_clusters`	Few clusters, even cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Ward hierarchical clustering	number of clusters	Large `n_samples` and`n_clusters`	Many clusters, possibly connectivity constraints	Distances between points
Agglomerative clustering	number of clusters, linkage type, distance	Large `n_samples` and`n_clusters`	Many clusters, possibly connectivity constraints, non Euclidean distances	Any pairwise distance
DBSCAN	neighborhood size	Very large `n_samples`, medium `n_clusters`	Non-flat geometry, uneven cluster sizes	Distances between nearest points
Gaussian mixtures	many	Not scalable	Flat geometry, good for density estimation	Mahalanobis distances to centers