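J-linkage is a multi-structure clustering algorithm that automatically fits a model with suitable parameters to each structure in the data, as shown in the figure below: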
Definitions
consensus set (CS) of a model: the set of points whose distance from the model is less than a threshold ε
preference set (PS) of a point: the set of models that the point prefers, i.e., the models whose CS contains the point
PS of a cluster: the intersection of the preference sets of its points
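Jaccard distance: for two sets A and B, $d_J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$, which is 0 for identical sets and 1 for disjoint sets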
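The definitions above can be made concrete with a small sketch using Python sets. The function names `jaccard_distance` and `cluster_preference_set` are illustrative and do not come from the original paper, and treating two empty sets as being at distance 1 is an assumption made here so that points preferring no model are never merged.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: (|A u B| - |A n B|) / |A u B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty preference sets count as maximally distant.
        return 1.0
    return (len(union) - len(a & b)) / len(union)


def cluster_preference_set(point_preference_sets):
    """PS of a cluster: the intersection of the PSs of its points."""
    sets = [set(ps) for ps in point_preference_sets]
    return set.intersection(*sets) if sets else set()


# Two points that prefer models {0, 1, 2} and {1, 2, 3} respectively.
ps_p, ps_q = {0, 1, 2}, {1, 2, 3}
print(jaccard_distance(ps_p, ps_q))          # 2 shared models out of 4 in the union
print(cluster_preference_set([ps_p, ps_q]))  # models preferred by both points
```

Here models are identified by their integer index, so a preference set is simply a set of column indices.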
Method
- M model hypotheses are generated by drawing M minimal sets of data points, where a minimal set contains just enough points to estimate the model
- The consensus set (CS) of each model is computed by measuring the distance from each of the N points to each of the M models. This yields an N × M matrix whose entry (i, j) is 1 if point i belongs to the CS of model j and 0 otherwise. Each column of the matrix is the characteristic function of the CS of one model hypothesis. Each row indicates which models a point has given consensus to, i.e., which models it prefers. The PS of each cluster (the intersection of the preference sets of its points) can be computed from this matrix.
- Each cluster is represented by its PS, and the distance between two clusters is the Jaccard distance between their PSs.
- Clusters are formed by agglomerative clustering, starting with every point in its own cluster:
(1) Among all current clusters, pick the two clusters with the smallest Jaccard distance between their respective PSs.
(2) Replace these two clusters with their union.
Repeat these two steps until the smallest Jaccard distance is 1, i.e., until no two clusters have overlapping PSs.
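The steps above can be sketched end to end for the simple case of fitting 2D lines. This is a minimal illustration, not the reference implementation: the line model, the helper names (`fit_line`, `preference_sets`, `j_linkage`), the default parameter values, and uniform random sampling of minimal sets are all assumptions made for this example. Each row of the N × M matrix is stored as a `frozenset` of model indices, and the closest pair is found by a naive search over all pairs, which is only practical for small inputs.

```python
import itertools
import math
import random


def fit_line(p, q):
    """Line through two points as (a, b, c), a*x + b*y + c = 0, a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    return a / norm, b / norm, -(a * x1 + b * y1) / norm


def preference_sets(points, models, eps):
    """For each point, the set of model indices whose residual is below eps."""
    return [
        frozenset(j for j, (a, b, c) in enumerate(models)
                  if abs(a * x + b * y + c) < eps)
        for x, y in points
    ]


def jaccard_distance(a, b):
    union = a | b
    # Assumption: two empty preference sets count as maximally distant.
    return 1.0 if not union else (len(union) - len(a & b)) / len(union)


def j_linkage(points, num_hypotheses=200, eps=0.1, seed=0):
    rng = random.Random(seed)
    # Draw minimal sets (two points define a line) and fit one model per set.
    models = []
    while len(models) < num_hypotheses:
        p, q = rng.sample(points, 2)
        if p != q:  # skip duplicate points, which do not define a line
            models.append(fit_line(p, q))
    ps = preference_sets(points, models, eps)
    # Start with one cluster per point: (member indices, cluster PS).
    clusters = [([i], ps[i]) for i in range(len(points))]
    while len(clusters) > 1:
        # (1) Find the pair of clusters with the smallest Jaccard distance.
        (i, j), d = min(
            (((i, j), jaccard_distance(clusters[i][1], clusters[j][1]))
             for i, j in itertools.combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if d >= 1.0:
            break  # no two clusters share a preferred model
        # (2) Replace the pair with its union; the new PS is the intersection.
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] & clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for members, _ in clusters]


# Toy data: ten points on the line y = 0 and ten points on the line y = 5.
pts = [(float(x), 0.0) for x in range(10)] + [(float(x), 5.0) for x in range(10)]
for members in j_linkage(pts):
    print(members)
```

On this noise-free toy input the two lines should end up in separate clusters, since points on the same line share every hypothesis drawn from that line while points on different lines share almost none. In practice, small clusters left at the end are usually treated as outliers; that post-processing is omitted here.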
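T-linkage replaces the PS with a PF (a soft preference): in a PS each preference is either 0 or 1, whereas in a PF each preference is a number between 0 and 1. Similarity is accordingly measured with the Tanimoto distance: for preference vectors p and q, $d_T(p, q) = 1 - \frac{\langle p, q \rangle}{\|p\|^2 + \|q\|^2 - \langle p, q \rangle}$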
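As a rough illustration of how the Tanimoto distance generalizes the Jaccard distance, the sketch below computes it for preference vectors given as plain Python lists. The function name `tanimoto_distance` is hypothetical, and returning 1 for two all-zero vectors is an assumption made here to avoid dividing by zero.

```python
def tanimoto_distance(p, q):
    """Tanimoto distance: 1 - <p, q> / (|p|^2 + |q|^2 - <p, q>)."""
    dot = sum(a * b for a, b in zip(p, q))
    denom = sum(a * a for a in p) + sum(b * b for b in q) - dot
    if denom == 0:
        # Assumption: two all-zero preference vectors are maximally distant.
        return 1.0
    return 1.0 - dot / denom


# With binary preferences the value matches the Jaccard distance of the
# corresponding sets {0, 1, 2} and {1, 2, 3}.
print(tanimoto_distance([1, 1, 1, 0], [0, 1, 1, 1]))

# With soft preferences the same formula applies unchanged.
print(tanimoto_distance([1.0, 0.5, 0.0], [0.5, 1.0, 0.0]))
```

For 0/1 vectors the inner product counts the shared models and the denominator counts the union, which is why the binary case reduces to the Jaccard distance.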