Local density adaptive similarity measurement for spectral clustering

2. Overview of spectral clustering

Given a set of $n$ data points $X=\{x_{1}, x_{2}, \ldots, x_{n}\}$, the objective of clustering is to divide the data points into different clusters, where data points in the same cluster are similar to each other. According to a specific similarity measure, we have the affinity matrix $S \in \mathfrak{R}^{n \times n}$. From $S$, we can construct an undirected graph $G=(V, E)$ in which each vertex $v_{i} \in V$ corresponds to the data point $x_{i}$ and each edge $e(i, j) \in E$ carries a weight $S_{ij}$ that represents the similarity between points $x_{i}$ and $x_{j}$. The clustering problem is equivalent to choosing a partition $C_{1}, C_{2}, \ldots, C_{k}$ of $G$ that minimizes a specific objective function such as RatioCut (Hagen and Kahng, 1992), MinMaxCut (Ding et al., 2001), or Ncut (Shi and Malik, 2000). The performance of these three objective functions is input dependent: if clusters are well separated, all three give very similar and accurate results; when clusters are marginally separated, Ncut and MinMaxCut give better results; when clusters overlap significantly, MinMaxCut tends to give more compact and balanced clusters (Ding, 2004). In this paper, we use the most often adopted Ncut. It was shown in (Wagner and Wagner, 1993) that the minimization of Ncut is NP-hard. According to Rayleigh-Ritz theory (Lütkepohl, 1997), it is possible to find an approximate solution. To do so, we define the normalized Laplacian matrix $L=D^{-\frac{1}{2}} S D^{-\frac{1}{2}}$, where $D$ is a diagonal matrix with $D_{ii}=\sum_{j=1}^{n} S_{ij}$. The approximate solution can then be derived from the leading eigenvectors of $L$.
The use of Laplacian matrix eigenvectors for approximating the graph minimum cut is called spectral clustering. A complete overview of spectral clustering can be found in (Luxburg, 2007).
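The construction of $L$ and its leading eigenvectors can be illustrated with a minimal sketch. It assumes NumPy is available and that every point has a positive degree $D_{ii}$; the function name `leading_eigenvectors` and the toy affinity matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def leading_eigenvectors(S, k):
    """Return the k leading eigenvectors of L = D^{-1/2} S D^{-1/2} as columns."""
    S = np.asarray(S, dtype=float)
    degrees = S.sum(axis=1)                 # D_ii = sum_j S_ij (assumed > 0)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    L = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues come back in ascending order
    return eigvecs[:, -k:]                  # last k columns = largest eigenvalues

# Hypothetical affinity matrix: points {0, 1} and {2, 3} form two tight groups.
S_demo = np.array([
    [0.00, 0.90, 0.01, 0.01],
    [0.90, 0.00, 0.01, 0.01],
    [0.01, 0.01, 0.00, 0.90],
    [0.01, 0.01, 0.90, 0.00],
])
U_demo = leading_eigenvectors(S_demo, k=2)
```

In this toy case the rows of `U_demo` coincide for points in the same group and differ across groups, which is what makes the eigenvectors useful for recovering the partition.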

The most commonly used similarity measure, the Gaussian kernel function, is defined as $S_{G}(x_{i}, x_{j})=\exp\left(-d(x_{i}, x_{j})^{2} / (2 \sigma^{2})\right)$, where $d(x_{i}, x_{j})$ is the Euclidean distance between data points $x_{i}$ and $x_{j}$, and $\sigma$ is the kernel parameter (Shi and Malik, 2000). The obvious drawback of $S_{G}$ is that the scaling parameter $\sigma$ is fixed, so the similarity between two points is determined only by their Euclidean distance and does not vary with their surroundings. An example is given in Fig. 1 (Fig. 2 in Zelnik-Manor et al., 2004). In Fig. 1(a), supposing $d(a, b)=d(a, c)$, the Gaussian kernel function gives $S_{G}(a, b)=S_{G}(a, c)$. Thus the clustering algorithm tends to cluster $a, b, c$ together. However, $a$ and $c$ are in fact in the background cluster, which is relatively sparse, while $b$ is in the tight cluster in the center.
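The drawback can be checked numerically with a minimal sketch. The coordinates and the value of `sigma` below are hypothetical; they only reproduce the situation $d(a, b)=d(a, c)$.

```python
import math

def gaussian_similarity(x, y, sigma):
    """S_G(x, y) = exp(-d(x, y)^2 / (2 sigma^2)) with Euclidean distance d."""
    d = math.dist(x, y)
    return math.exp(-d * d / (2.0 * sigma ** 2))

# Hypothetical points with d(a, b) = d(a, c) = 1.
a, b, c = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
s_ab = gaussian_similarity(a, b, sigma=0.5)
s_ac = gaussian_similarity(a, c, sigma=0.5)
```

Whatever surrounds `b` and `c`, the two similarities are identical, because only the distance and the fixed `sigma` enter the formula.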

Zelnik-Manor et al. proposed a local scale similarity measure $S_{T}(x_{i}, x_{j})=\exp\left(-d(x_{i}, x_{j})^{2} / (\sigma_{i} \sigma_{j})\right)$, where $\sigma_{i}$ is the distance between point $x_{i}$ and its $k$th nearest neighbor (Zelnik-Manor et al., 2004). With $S_{T}$, in Fig. 1, we have $\sigma_{c}>\sigma_{b}$, so $\sigma_{a} \sigma_{c}>\sigma_{a} \sigma_{b}$. Since $d(a, b)=d(a, c)$, this gives $S_{T}(a, c)>S_{T}(a, b)$: point $a$ gets closer to point $c$ than to point $b$. This is just the information required for separation. The effect of local scaling can be seen from the comparison of Fig. 1(b) with Fig. 1(c).
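A minimal sketch of the local scale measure follows. The point layout and the choice $k=2$ are hypothetical: `b` sits in a tight group, `c` in a sparse one, and both lie at distance 1 from `a`.

```python
import math

def kth_neighbor_distance(points, i, k):
    """sigma_i: distance from points[i] to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], p) for m, p in enumerate(points) if m != i)
    return dists[k - 1]

def local_scale_similarity(points, i, j, k):
    """S_T(x_i, x_j) = exp(-d(x_i, x_j)^2 / (sigma_i * sigma_j))."""
    d = math.dist(points[i], points[j])
    sigma_i = kth_neighbor_distance(points, i, k)
    sigma_j = kth_neighbor_distance(points, j, k)
    return math.exp(-d * d / (sigma_i * sigma_j))

# Hypothetical layout: index 0 = a, 1 = b (tight group), 4 = c (sparse group).
pts = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.0), (1.0, 0.1), (-1.0, 0.0), (-2.0, 0.0)]
s_ab = local_scale_similarity(pts, 0, 1, k=2)
s_ac = local_scale_similarity(pts, 0, 4, k=2)
```

Here $\sigma_{b}$ is small and $\sigma_{c}$ is large, so `s_ac` exceeds `s_ab` even though the two distances are equal.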

The corresponding spectral clustering based on this local scale similarity measure is called self-tuning spectral clustering (SC-ST) (Zelnik-Manor et al., 2004). Since the adaptive local scale parameter reflects the local information properly, SC-ST works well on data with multiple scales, e.g., Fig. 1(c). This shows that the surroundings of two data points have a strong impact on their similarity.

However, the local scale parameter in SC-ST, the distance to a nearby neighbor, is still a Euclidean distance factor and does not help in many cases. For example, on the toy data set in Fig. 2, consider three data points $a, b, c$ with $d(a, b)=d(a, c)$ in the Euclidean space (see Fig. 3). Following the clustering assumption, the similarity between $a$ and $b$ should be higher than the similarity between $a$ and $c$, since point $a$ is in the same cluster as point $b$ rather than point $c$. Unfortunately, SC-ST does not work here: the local statistics surrounding points $b$ and $c$ are similar, leading to $S_{T}(a, b) \approx S_{T}(a, c)$. It therefore contributes no more to the clustering than $S_{G}$ does, and SC-ST fails to produce the correct clustering result (see Fig. 3).

The path-based similarity used in path-based clustering (Fischer and Buhmann, 2003) is defined as follows: $S_{P}(x_{i}, x_{j})=\max_{p \in \wp_{ij}}\left\{\min_{1 \leqslant h<|p|} d\left(x_{p[h]}, x_{p[h+1]}\right)\right\}$, where $\wp_{ij}$ denotes the set of all paths from $x_{i}$ to $x_{j}$ and $p[h]$ denotes the $h$th point along the path $p$ from $x_{i}$ to $x_{j}$. This means that, regardless of the distance between two points, they should be considered to be in one cluster if they are connected by a set of successive points in dense regions. This is intuitively reasonable. However, the measure is not robust enough against noise and outliers (Chang and Yeung, 2008).

4. Local density adaptive similarity measure

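A good similarity measure should be adaptive to the neighborhood of the points concerned. Our proposed method is based on the following observation: if two points are distributed in the same cluster, they lie in the same region of relatively high density. In other words, two points fall into the same cluster because there are many points between them that "glue" them together. To reflect this "gluing" effect between two data points, we define the Common-Near-Neighbor (CNN) measure.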

Definition 1. $CNN(a, b)$ is the number of points in the joint region (the intersection) of the $\varepsilon$-neighborhoods around points $a$ and $b$, where the $\varepsilon$-neighborhood of a point is the spherical region of specified radius $\varepsilon$ around that point.

$CNN(a, b)$ shows the local density between points $a$ and $b$, which can be used to distinguish points within the same cluster from points in different clusters. For example, as shown in Fig. 4, points $a$ and $b$ have more shared neighbors than points $a$ and $c$, so we have $CNN(a, b)>CNN(a, c)$, which carries strong information for data partition.
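Definition 1 can be sketched as a simple count. Two details are assumptions made here, since the definition leaves them open: the neighborhoods are treated as open balls (strict inequality), and the two points themselves are not counted. The coordinates and `eps` are hypothetical.

```python
import math

def cnn(points, i, j, eps):
    """Count points lying strictly within eps of both points[i] and points[j].

    Assumptions: open eps-balls, and points i and j are not counted themselves.
    """
    return sum(
        1
        for m, p in enumerate(points)
        if m not in (i, j)
        and math.dist(p, points[i]) < eps
        and math.dist(p, points[j]) < eps
    )

# Hypothetical layout: index 0 = a, 1 = b, 2 = c; three points "glue" a and b.
pts = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.5, 0.0), (0.4, 0.1), (0.6, -0.1)]
cnn_ab = cnn(pts, 0, 1, eps=0.8)
cnn_ac = cnn(pts, 0, 2, eps=0.8)
```

With this layout `cnn_ab` is 3 and `cnn_ac` is 0, although $d(a, b)=d(a, c)$.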

The local density adaptive similarity measure between a pair of points is thus defined as:

$$S_{L}(x_{i}, x_{j})= \begin{cases}\exp \left(-\dfrac{d(x_{i}, x_{j})^{2}}{2 \sigma^{2}\left(CNN(x_{i}, x_{j})+1\right)}\right), & i \neq j \\ 0, & i=j\end{cases}$$

The proposed similarity measure has the following properties:


(1) If $d(x_{i}, x_{j}) \geqslant 2 \varepsilon$, then $S_{L}(x_{i}, x_{j})=S_{G}(x_{i}, x_{j})$, showing that the scale factor is a local one and does not affect far away data points.

(2) For two pairs of points $x_{i}, x_{j}$ and $x_{m}, x_{n}$, supposing $d(x_{i}, x_{j})=d(x_{m}, x_{n})<2 \varepsilon$, but in fact $x_{i}, x_{j}$ are in the same dense region while $x_{m}, x_{n}$ are in different dense regions, then with very high probability we have $S_{L}(x_{i}, x_{j})>S_{L}(x_{m}, x_{n})$. For example, in Fig. 4, $S_{L}(a, b)>S_{L}(a, c)$.

Since clusters are dense regions of the data set, it can be seen from (2) that the proposed similarity amplifies the intra-cluster similarity. This is desirable for high quality clustering.
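Both properties can be checked on a small example. Property (1) follows from the triangle inequality: if $d(x_{i}, x_{j}) \geqslant 2 \varepsilon$, no point can lie strictly within $\varepsilon$ of both, so $CNN=0$ and $S_{L}$ reduces to $S_{G}$. The sketch below assumes open $\varepsilon$-balls and does not count the two points themselves; the coordinates, `sigma` and `eps` are hypothetical.

```python
import math

def cnn(points, i, j, eps):
    """Points strictly within eps of both points[i] and points[j] (i, j excluded)."""
    return sum(
        1
        for m, p in enumerate(points)
        if m not in (i, j)
        and math.dist(p, points[i]) < eps
        and math.dist(p, points[j]) < eps
    )

def local_density_similarity(points, i, j, sigma, eps):
    """S_L: Gaussian kernel whose squared scale is multiplied by CNN + 1."""
    if i == j:
        return 0.0
    d = math.dist(points[i], points[j])
    scale = 2.0 * sigma ** 2 * (cnn(points, i, j, eps) + 1)
    return math.exp(-d * d / scale)

# Hypothetical layout: 0 = a, 1 = b, 2 = c, 3..5 glue a and b, 6 = far point f.
pts = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0),
       (0.5, 0.0), (0.4, 0.1), (0.6, -0.1), (5.0, 0.0)]
s_ab = local_density_similarity(pts, 0, 1, sigma=0.5, eps=0.8)  # CNN = 3
s_ac = local_density_similarity(pts, 0, 2, sigma=0.5, eps=0.8)  # CNN = 0
s_af = local_density_similarity(pts, 0, 6, sigma=0.5, eps=0.8)  # d >= 2 * eps
```

The pair `a, b` is amplified relative to `a, c` at equal distance, while the far pair `a, f` keeps its plain Gaussian similarity.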


To the best of our knowledge, there is so far no quantitative metric for how good a similarity measure is. However, it is believed that a good similarity function should make the affinity matrix as block diagonal as possible (and Poland, 2005). This is usually evaluated through visualizations of the affinity matrices. For the two-moon dataset, visualizations of the affinity matrices using the three different similarity functions are given in Fig. 5. Note that with our local density adaptive similarity measure, points in the same cluster obtain the highest similarities. The resulting matrix is the closest to the ideal block diagonal matrix (in which the value is 1 when two points are in the same cluster and 0 otherwise).

The local density adaptive spectral clustering algorithm (SC-DA) is simply the ordinary spectral clustering algorithm using the new similarity measure $S_{L}$ in place of $S_{G}$. This seemingly minor modification is in fact very significant, as the clustering results can be improved effectively. The clustering result on the two-moon dataset with our method is shown in Fig. 6. The groups now match the real solution well (recall the result of SC-ST in Fig. 3 to see the superiority of our similarity for this dataset).
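For the special case of two clusters, the whole pipeline might be sketched as follows, assuming NumPy. Splitting by the sign of the second leading eigenvector is a simplification used here in place of the usual grouping step on the eigenvectors, not the procedure of the paper; the function name, the open-ball convention for the $\varepsilon$-neighborhoods, the toy data and the parameter values are all hypothetical.

```python
import numpy as np

def sc_da_two_clusters(X, sigma, eps):
    """Two-way split using S_L and the second leading eigenvector of L."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))        # pairwise Euclidean distances
    near = dist < eps                              # open eps-neighborhoods (assumed)
    np.fill_diagonal(near, False)                  # a point is not its own neighbor
    cnn = near.astype(int) @ near.astype(int).T    # shared-neighbor counts
    S = np.exp(-dist ** 2 / (2.0 * sigma ** 2 * (cnn + 1)))
    np.fill_diagonal(S, 0.0)                       # S_L is 0 for i = j
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)                 # ascending eigenvalues
    second = eigvecs[:, -2]                        # second leading eigenvector
    return (second > 0).astype(int)                # sign split (simplification)

# Hypothetical 1-D toy data: two groups of four points each.
X_demo = [[0.0], [0.1], [0.2], [0.3], [2.0], [2.1], [2.2], [2.3]]
labels = sc_da_two_clusters(X_demo, sigma=0.5, eps=0.25)
```

Which group receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful.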

5. Experiments

5.1. Parameter setting

5.2. Experiments on synthetic data sets

Fig. 7. Clustering results on 2D synthetic data. (a) Results of SC, (b) results of SC-ST, (c) results of SC-PB and (d) results of SC-DA.
Fig. 8. Clustering results on 3D synthetic data. (a) Results of SC, (b) results of SC-ST, (c) results of SC-PB and (d) results of SC-DA.

5.3. Experiments on real world data sets

The four clustering methods above (SC, SC-ST, SC-PB and SC-DA) are tested on the UCI data sets and the USPS data set. Two evaluation metrics are used: Clustering Error (CE) and Normalized Mutual Information (NMI).
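The two metrics can be sketched under commonly used definitions, which are assumptions here because the text does not spell them out: CE is taken as the fraction of misassigned points under the best one-to-one matching of cluster labels to class labels, and NMI as the mutual information divided by the geometric mean of the two entropies. The brute-force matching is only practical for a handful of clusters and assumes no more predicted clusters than true classes; both label sets must contain at least two distinct labels for the NMI to be defined.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_error(true_labels, pred_labels):
    """1 - accuracy under the best one-to-one mapping of predicted to true labels."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    best_hits = 0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_hits = max(best_hits, hits)
    return 1.0 - best_hits / len(true_labels)

def normalized_mutual_information(true_labels, pred_labels):
    """NMI = I(T; P) / sqrt(H(T) * H(P)), natural logarithms throughout."""
    n = len(true_labels)
    count_t = Counter(true_labels)
    count_p = Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    mi = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    h_t = -sum((c / n) * math.log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * math.log(c / n) for c in count_p.values())
    return mi / math.sqrt(h_t * h_p)

# Hypothetical labelings: a perfect clustering up to relabeling, and one mistake.
ce_perfect = clustering_error([0, 0, 1, 1], [1, 1, 0, 0])
nmi_perfect = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
ce_one_error = clustering_error([0, 0, 1, 1], [0, 0, 0, 1])
```

Both metrics are invariant to how the clusters are numbered, which is why the relabeled perfect clustering scores a CE of 0 and an NMI of 1.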

5.3.2. Results on UCI data sets

Experiments were conducted on four real data sets from the UCI data repository (Asuncion and Newman, http://archive.ics.uci.edu/ml/).


5.3.3. Results on USPS data sets


