Clustering By Fast Search And Find Of Density Peaks -- a clustering algorithm published in Science (2014)

AceMa


This post is about a new clustering algorithm published by Alex Rodriguez and Alessandro Laio in the latest issue of Science. The method is short and efficient; I implemented it in only about 100 lines of C++ code.

BASIC METHOD

There are two leading criteria in this method: the local density of each point, and its minimum distance to any point of higher density.

The local density rho of point i is defined as

$$\rho_i = \sum_{j} \chi(d_{ij} - d_c)$$

in which

$$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & \text{otherwise} \end{cases}$$

d_{ij} is the distance between points i and j, and dc (d_c) is a cutoff distance. Rho is basically equal to the number of points that are closer than dc to point i. The algorithm is sensitive only to the relative magnitude of rho in different points, implying that, for large data sets, the results of the analysis are robust with respect to the choice of dc.

The authors say that we can choose dc so that the average number of neighbors is around 1 to 2% of the total number of points in the data set (I used 1.6% in my code).

Delta is measured by computing the minimum distance between point i and any other point with higher density:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}$$

and for the point with the highest density, we simply let:

$$\delta_i = \max_{j} d_{ij}$$
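To make the two definitions concrete, here is a minimal sketch that computes rho and delta for a handful of points. The names `Pt`, `dist`, `localDensity` and `minDistToHigher` are made up for this illustration (`Pt` is a stand-in for a 3D point), and the strict `<` comparison against dc is an assumption that follows the description of rho above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y, z; };  // hypothetical point type for this sketch

double dist(const Pt &a, const Pt &b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// rho_i: number of other points strictly closer than dc to point i.
std::vector<int> localDensity(const std::vector<Pt> &p, double dc) {
    std::vector<int> rho(p.size(), 0);
    for (std::size_t i = 0; i < p.size(); ++i)
        for (std::size_t j = 0; j < p.size(); ++j)
            if (i != j && dist(p[i], p[j]) < dc) ++rho[i];
    return rho;
}

// delta_i: distance to the nearest point of higher density, or the largest
// distance from point i when no other point has higher density.
std::vector<double> minDistToHigher(const std::vector<Pt> &p,
                                    const std::vector<int> &rho) {
    assert(p.size() == rho.size());
    std::vector<double> delta(p.size(), 0.0);
    for (std::size_t i = 0; i < p.size(); ++i) {
        double nearestHigher = -1.0;  // -1 means "none found yet"
        double farthest = 0.0;
        for (std::size_t j = 0; j < p.size(); ++j) {
            if (i == j) continue;
            double d = dist(p[i], p[j]);
            if (d > farthest) farthest = d;
            if (rho[j] > rho[i] && (nearestHigher < 0.0 || d < nearestHigher))
                nearestHigher = d;
        }
        delta[i] = (nearestHigher < 0.0) ? farthest : nearestHigher;
    }
    return delta;
}
```

For the four collinear points x = 0, 1, 2, 10 (y = z = 0) and dc = 1.5, this gives rho = {1, 2, 1, 0} and delta = {1, 9, 1, 8}: the point at x = 1 has both the highest rho and the highest delta, while the isolated point at x = 10 has a high delta but the lowest rho.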

The following figure illustrates the basic algorithm:

[Figure: the example dataset (left) and its decision graph (right)]

The left-hand panel shows the dataset we use (28 points in a 2D space). Most of the points belong to two clusters, blue and red, and there are three outliers drawn as black circles. After calculating rho and delta for each point in the dataset, we show the result in the right-hand panel (the decision graph): the x-axis is the local density, and the y-axis is the minimum distance between a point and any other point with higher density.

Our target is to find the points that have both high rho and high delta (points 1 and 10 in this example); these are the cluster centers. We can see that the three outliers (26, 27, 28) have high delta but very low rho.
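One simple way to turn "both high rho and high delta" into code is a pair of thresholds read off the decision graph. The sketch below assumes exactly that; `pickCenters`, `rhoMin` and `deltaMin` are hypothetical names, and the threshold values are hand-picked rather than prescribed by the method.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the indices of points whose rho and delta both exceed the given
// thresholds. rhoMin and deltaMin are assumed to be chosen by eye from the
// decision graph.
std::vector<int> pickCenters(const std::vector<int> &rho,
                             const std::vector<double> &delta,
                             int rhoMin, double deltaMin) {
    assert(rho.size() == delta.size());
    std::vector<int> centers;
    for (std::size_t i = 0; i < rho.size(); ++i)
        if (rho[i] > rhoMin && delta[i] > deltaMin)
            centers.push_back(static_cast<int>(i));
    return centers;
}
```

With rho = {1, 2, 1, 0}, delta = {1, 9, 1, 8}, rhoMin = 1 and deltaMin = 5, only index 1 is returned. Index 3 has a large delta but is rejected for its low rho, which is how an outlier shows up in the decision graph.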

CODE

    // Clustering by fast search and find of density peaks
    // Science 27 June 2014:
    // Vol. 344 no. 6191 pp. 1492-1496
    // DOI: 10.1126/science.1242072
    // http://www.sciencemag.org/content/344/6191/1492.full
    //
    // Code Author: Eric Yuan
    // Blog: http://eric-yuan.me
    // You are FREE to use the following code for ANY purpose.
    //
    // Have fun with it

    #include <cmath>
    #include <cstdio>
    #include <ctime>
    #include <iostream>
    #include <vector>

    using namespace std;

    #define DIM 3
    #define elif else if

    struct Point3d {
        double x;
        double y;
        double z;
        Point3d(double xin, double yin, double zin) : x(xin), y(yin), z(zin) {}
    };

    // Convert raw rows of DIM coordinates into Point3d values.
    void dataPro(vector<vector<double> > &src, vector<Point3d> &dst){
        for(int i = 0; i < (int)src.size(); i++){
            Point3d pt(src[i][0], src[i][1], src[i][2]);
            dst.push_back(pt);
        }
    }

    // Euclidean distance between two points.
    double getDistance(Point3d &pt1, Point3d &pt2){
        double tmp = pow(pt1.x - pt2.x, 2) + pow(pt1.y - pt2.y, 2) + pow(pt1.z - pt2.z, 2);
        return pow(tmp, 0.5);
    }

    // Increase dc in steps of 0.03 until the number of point pairs within dc
    // reaches nLow. Each unordered pair (i, j) is counted once, so a rate r
    // corresponds to an average of about 2 * r * nSamples neighbors per point.
    double getdc(vector<Point3d> &data, double neighborRateLow, double neighborRateHigh){
        int nSamples = data.size();
        int nLow = neighborRateLow * nSamples * nSamples;
        int nHigh = neighborRateHigh * nSamples * nSamples;
        double dc = 0.0;
        int neighbors = 0;
        cout<<"nLow = "<<nLow<<", nHigh = "<<nHigh<<endl;
        while(true){
            neighbors = 0;
            for(int i = 0; i < nSamples - 1; i++){
                for(int j = i + 1; j < nSamples; j++){
                    if(getDistance(data[i], data[j]) <= dc) ++neighbors;
                }
            }
            cout<<"dc = "<<dc<<", neighbors = "<<neighbors<<endl;
            // The count only grows with dc, so stop as soon as it reaches nLow.
            // If a single step jumps past nHigh, a larger dc cannot bring it back.
            if(neighbors >= nLow){
                if(neighbors > nHigh) cout<<"neighbors exceeds nHigh; consider a smaller step"<<endl;
                break;
            }
            dc += 0.03;
        }
        return dc;
    }

    // rho[i] = number of points closer than dc to point i.
    vector<int> getLocalDensity(vector<Point3d> &points, double dc){
        int nSamples = points.size();
        vector<int> rho(nSamples, 0);
        for(int i = 0; i < nSamples - 1; i++){
            for(int j = i + 1; j < nSamples; j++){
                if(getDistance(points[i], points[j]) < dc){
                    ++rho[i];
                    ++rho[j];
                }
            }
            //cout<<"getting rho. Processing point No."<<i<<endl;
        }
        return rho;
    }

    // delta[i] = distance from point i to the nearest point with higher rho,
    // or the largest distance from point i if no point has higher rho.
    vector<double> getDistanceToHigherDensity(vector<Point3d> &points, vector<int> &rho){
        int nSamples = points.size();
        vector<double> delta(nSamples, 0.0);

        for(int i = 0; i < nSamples; i++){
            double dist = 0.0;
            bool flag = false;
            for(int j = 0; j < nSamples; j++){
                if(i == j) continue;
                if(rho[j] > rho[i]){
                    double tmp = getDistance(points[i], points[j]);
                    if(!flag){
                        dist = tmp;
                        flag = true;
                    }else dist = tmp < dist ? tmp : dist;
                }
            }
            if(!flag){
                for(int j = 0; j < nSamples; j++){
                    double tmp = getDistance(points[i], points[j]);
                    dist = tmp > dist ? tmp : dist;
                }
            }
            delta[i] = dist;
            //cout<<"getting delta. Processing point No."<<i<<endl;
        }
        return delta;
    }

    int main(int argc, char** argv)
    {
        clock_t start, end;
        FILE *input;
        input = fopen("dataset.txt", "r");
        if(input == NULL){
            cout<<"cannot open dataset.txt"<<endl;
            return 1;
        }
        vector<vector<double> > data;
        double tpdouble;
        int counter = 0;
        while(1){
            // stop at end of file or at the first token that is not a number
            if(fscanf(input, "%lf", &tpdouble) != 1) break;
            if(counter / DIM >= (int)data.size()){
                vector<double> tpvec;
                data.push_back(tpvec);
            }
            data[counter / DIM].push_back(tpdouble);
            ++ counter;
        }
        fclose(input);
        // drop a trailing row that has fewer than DIM values
        if(!data.empty() && (int)data.back().size() < DIM) data.pop_back();
        //random_shuffle(data.begin(), data.end());

        start = clock();
        cout<<"********"<<endl;
        vector<Point3d> points;
        dataPro(data, points);
        double dc = getdc(points, 0.016, 0.020);
        vector<int> rho = getLocalDensity(points, dc);
        vector<double> delta = getDistanceToHigherDensity(points, rho);
    //    saveToTxt(rho, delta);
        // rho and delta are now available; the cluster centers are the points
        // where both are large
        end = clock();
        cout<<"used time: "<<((double)(end - start)) / CLOCKS_PER_SEC<<endl;
        return 0;
    }
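The listing above stops once rho and delta are known. In the paper, after the centers have been chosen, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. The following is a rough sketch of that step, not part of the original code: `Pt`, `dist` and `assignClusters` are hypothetical names, `centers` is assumed to hold the indices of the chosen centers, and ties in rho are broken by index, which is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct Pt { double x, y, z; };  // stand-in for the post's Point3d

double dist(const Pt &a, const Pt &b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// labels[i] is the position in `centers` of the cluster that point i joins,
// or -1 if no label could be propagated to it.
std::vector<int> assignClusters(const std::vector<Pt> &pts,
                                const std::vector<int> &rho,
                                const std::vector<int> &centers) {
    const std::size_t n = pts.size();
    assert(rho.size() == n);
    std::vector<int> labels(n, -1);
    for (std::size_t c = 0; c < centers.size(); ++c)
        labels[centers[c]] = static_cast<int>(c);

    // Visit points from highest to lowest density; equal densities keep
    // their index order because the sort is stable.
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&rho](int a, int b) { return rho[a] > rho[b]; });

    for (std::size_t k = 0; k < n; ++k) {
        const int i = order[k];
        if (labels[i] != -1) continue;  // centers are already labeled
        // Nearest point among those visited earlier (density >= rho[i]).
        int best = -1;
        double bestD = 0.0;
        for (std::size_t m = 0; m < k; ++m) {
            const int j = order[m];
            const double d = dist(pts[i], pts[j]);
            if (best == -1 || d < bestD) { best = j; bestD = d; }
        }
        if (best != -1) labels[i] = labels[best];
    }
    return labels;
}
```

For six collinear points at x = 0, 1, 2, 10, 11, 12 with rho = {1, 2, 1, 1, 2, 1} and centers = {1, 4}, the labels come out as {0, 0, 0, 1, 1, 1}. Processing in order of decreasing density means that, by the time a point is visited, its denser neighbor already carries a label.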
   

PostScript: I found that the results differ a lot for different values of dc. If you have any idea how to calculate dc precisely, please let me know.
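One possible way to pick dc without stepping it by a fixed increment is to take it directly as a quantile of the pairwise distances. If a fraction f of all pairs lies within dc, each point has on average about f * (N - 1) neighbors, so f between 0.01 and 0.02 matches the 1 to 2% rule of thumb. The sketch below assumes this approach; `Pt`, `dist` and `dcFromQuantile` are hypothetical names, and it stores all N(N-1)/2 distances, so it only suits data sets that fit in memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y, z; };  // stand-in for the post's Point3d

double dist(const Pt &a, const Pt &b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// Choose dc so that roughly `fraction` of all pairwise distances are below it.
double dcFromQuantile(const std::vector<Pt> &pts, double fraction) {
    assert(pts.size() >= 2 && fraction > 0.0 && fraction <= 1.0);
    std::vector<double> d;
    d.reserve(pts.size() * (pts.size() - 1) / 2);
    for (std::size_t i = 0; i + 1 < pts.size(); ++i)
        for (std::size_t j = i + 1; j < pts.size(); ++j)
            d.push_back(dist(pts[i], pts[j]));

    // Position of the requested quantile among the sorted distances.
    std::size_t k = static_cast<std::size_t>(fraction * d.size());
    if (k >= d.size()) k = d.size() - 1;
    // nth_element places the k-th smallest distance at index k without
    // fully sorting the vector.
    std::nth_element(d.begin(), d.begin() + k, d.end());
    return d[k];
}
```

For the four collinear points x = 0, 1, 3, 7 the pairwise distances are 1, 2, 3, 4, 6, 7, so `dcFromQuantile(pts, 0.5)` returns 4. This does not remove the sensitivity to dc; it only replaces the step size with a single fraction that has a direct interpretation.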

Update (Jul. 1, 9:14 PM):

Ryan raised an interesting idea yesterday and I tried it.

[Image: Ryan's comment]

Here’s the new function to calculate “Local Density”:

    // Local density as the negated mean distance to the M nearest neighbors.
    // This returns vector<double>, so getDistanceToHigherDensity must take
    // vector<double> &rho in order to use it.
    vector<double> getAverageKnnDistance(vector<Point3d> &points){
        double ratio = 0.015;
        int nSamples = points.size();
        int M = nSamples * ratio;
        vector<double> rho(nSamples, 0.0);
        for(int i = 0; i < nSamples; i++){
            // tmp holds the smallest distances seen so far (at most M of them,
            // or one if M is 0), sorted in descending order
            vector<double> tmp;
            for(int j = 0; j < nSamples; j++){
                if(i == j) continue;
                double dis = getDistance(points[i], points[j]);
                if(tmp.empty()){
                    tmp.push_back(dis);
                }elif((int)tmp.size() < M){
                    // not full yet: insert dis at its sorted position
                    if(dis <= tmp[tmp.size() - 1]) tmp.push_back(dis);
                    else{
                        for(int k = 0; k < (int)tmp.size(); k++){
                            if(tmp[k] <= dis){
                                tmp.insert(tmp.begin() + k, dis);
                                break;
                            }
                        }
                    }
                }else{
                    if(dis >= tmp[0]){
                        ; // do nothing
                    }elif(dis <= tmp[tmp.size() - 1]){
                        // new smallest: drop the largest, append dis
                        tmp.erase(tmp.begin());
                        tmp.push_back(dis);
                    }else{
                        // insert dis at its sorted position, then drop the largest
                        for(int k = 0; k < (int)tmp.size(); k++){
                            if(tmp[k] <= dis){
                                tmp.insert(tmp.begin() + k, dis);
                                tmp.erase(tmp.begin());
                                break;
                            }
                        }
                    }
                }
            }
            double res = 0.0;
            for(int j = 0; j < (int)tmp.size(); j++){
                res += tmp[j];
            }
            rho[i] = 0 - res / tmp.size();
        }
        return rho;
    }
   

Because the mean distance to the nearest M neighbors gets smaller as the density gets higher, I apply a “y = -x” at the end so that a larger value still means a higher density.
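The same quantity can be computed more compactly with the standard library. The sketch below is an alternative formulation, not the function above: `Pt`, `dist` and `knnDensity` are hypothetical names, and M is passed in explicitly instead of being derived from a fixed ratio.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct Pt { double x, y, z; };  // stand-in for the post's Point3d

double dist(const Pt &a, const Pt &b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// rho[i] = -(mean distance from point i to its M nearest neighbors).
std::vector<double> knnDensity(const std::vector<Pt> &pts, std::size_t M) {
    const std::size_t n = pts.size();
    assert(n >= 2 && M >= 1);
    if (M > n - 1) M = n - 1;  // a point has at most n - 1 neighbors
    std::vector<double> rho(n, 0.0);
    std::vector<double> d;
    d.reserve(n - 1);
    for (std::size_t i = 0; i < n; ++i) {
        d.clear();
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) d.push_back(dist(pts[i], pts[j]));
        // Move the M smallest distances to the front of d.
        std::partial_sort(d.begin(), d.begin() + M, d.end());
        const double sum = std::accumulate(d.begin(), d.begin() + M, 0.0);
        rho[i] = -sum / static_cast<double>(M);
    }
    return rho;
}
```

For the four collinear points x = 0, 1, 3, 7 and M = 2, this gives rho = {-2, -1.5, -2.5, -5}: the point at x = 1 is the densest and the point at x = 7 the sparsest.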

I tested this idea using four datasets:

Aggregation: 788 vectors, 2 dimensions, 7 clusters.
Spiral: 312 vectors, 2 dimensions, 3 clusters.
Jain: 373 vectors, 2 dimensions, 2 clusters.
Flame: 240 vectors, 2 dimensions, 2 clusters.

The results are as follows (Ryan’s idea in red circles, the paper’s method in blue circles):

Jain:

Spiral:

Flame:

Aggregation:

Ryan’s method actually works well, but as I replied to him, how to choose a proper “M” is still like black magic.

If anyone has any ideas, please let me know.

Thanks, Ryan.

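Original post: http://eric-yuan.me/clustering-fast-search-find-density-peaks/

Notes:

In the authors' original paper, the setting of the hyperparameter d_c affects the results; nevertheless, the algorithm can handle clusters of different densities.

For Ryan's method, M = (number of samples) * 0.015 is likewise an empirical value that has to be set or tuned. Still, this method seems to work better.

Sources:

http://www.52ml.net/16351.html

http://www.52ml.net/16296.html (in Chinese)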
