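Canopy Clustering
Preface
A couple of months ago, while working on a project, I came across the Canopy algorithm and noticed that there are few direct Python implementations of it online. That is because Mahout already includes the algorithm: when you need it, you only have to run a few Mahout commands, and it is mostly used together with MapReduce and the Hadoop distributed framework. If you are interested, you can look it up online. Out of curiosity and a desire to learn, though, I would rather try implementing some of these low-level algorithms myself in Python.
Introduction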
The canopy clustering algorithm is an unsupervised pre-clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000.[1]
It is often used as a preprocessing step for the K-means algorithm or the hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set.
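The above is quoted from Wikipedia.
The Canopy algorithm was proposed in 2000 by Andrew McCallum, Kamal Nigam and Lyle Ungar as a preprocessing step for k-means clustering and hierarchical clustering. As is well known, one weakness of k-means is that the value of k has to be tuned by hand. The elbow method and the silhouette coefficient can be used afterwards to settle on a final k, but these are all "after-the-fact" judgments: they require running the clustering first. The role of the Canopy algorithm is instead to perform a rough clustering in advance, which gives k-means both the number of initial cluster centers and the center points themselves.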
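To make the idea of "rough clustering in advance" concrete, here is a minimal pure-Python sketch. It is an illustration, not a reference implementation, and it makes several assumptions: Euclidean distance, a loose threshold `t1` and a tight threshold `t2` with `t1 > t2`, a randomly chosen starting point for each canopy, and that starting point used as the canopy center. The function name `canopy` and the toy data are hypothetical.

```python
import math
import random


def canopy(points, t1, t2, seed=0):
    """Rough pre-clustering sketch: returns a list of (center, members) canopies.

    Assumes Euclidean distance and t1 (loose) > t2 (tight).
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        # Remove a point from the set; it starts a new canopy.
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                # Within the loose distance: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Not within the tight distance: the point stays a candidate
                # and may still start or join another canopy.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.4),
          (10.0, 10.0), (10.2, 9.7), (9.8, 10.4)]
canopies = canopy(points, t1=3.0, t2=1.5)

k = len(canopies)                            # candidate value of k for k-means
initial_centers = [c for c, _ in canopies]   # candidate initial centers
print(k, initial_centers)
```

The number of canopies can then be passed to k-means as k, and the canopy centers as its initial centers. Because points between the two thresholds remain candidates, canopies may overlap; the result also depends on how `t1` and `t2` are chosen, which this sketch leaves to the caller.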
The Canopy algorithm procedure:
The algorithm proceeds as follows, using two thresholds
T1 (the loose distance) and T2 (the tight distance), where T1 > T2.[1][2]
1. Begin with the set of data points to be clustered.
2. Remove a point from the set, beginning a new 'canopy'.
3. For each point left in