Clustering: The Density Peaks Algorithm

Sunning_001

Tags: data mining, clustering, density peaks

Article link: https://blog.csdn.net/qq_17073497/article/details/81320837

1. Introduction

With the rapid spread of information technology, large amounts of data are generated in every field, from the Internet to daily life and from financial records to medical images. How to process such volumes of data effectively has therefore attracted wide attention from researchers, and data mining has become a popular topic. This post introduces one of its core tasks, clustering.

Clustering is an important unsupervised method in data mining. It aims to divide a set of unlabeled objects into clusters so that objects in the same cluster are more similar to each other than to objects in other clusters.

2. The density peaks algorithm

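The density peaks algorithm (Clustering by fast search and find of density peaks) was proposed by Alex Rodriguez and Alessandro Laio in 2014 and published in Science. The paper describes a density-based clustering method. The main idea of density-based clustering is to find high-density regions that are separated by low-density regions. Similarly, the density peaks clustering algorithm (DPCA) rests on two assumptions: (1) a cluster center has a higher density than its neighboring points; (2) a cluster center lies at a relatively large distance from any point of higher density. DPCA therefore computes two quantities for each point: first, the local density $\rho _i$ ; second, the distance to points of higher density, $\delta _i$ .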

2.1. Distance metric

Let the universe $U=\left \{x_{1},\ldots,x_{n},\ldots,x_{N}\right \}$ contain N data objects, each with D attributes, written $x_{i}=\left \{ x_{i}^{1},\ldots,x_{i}^{d},\ldots,x_{i}^{D} \right \}$ . The distance between any two objects $x_{i}$ and $x_{j}$ in U is the Euclidean distance:

$dist\left ( x_{i}, x_{j} \right ) = \sqrt{\sum_{d=1}^{D}\left ( x_{i}^{d}- x_{j}^{d}\right )^{2}}$
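
As a minimal sketch of the distance formula, the following computes the full matrix of pairwise Euclidean distances using only the standard library. The function name `pairwise_dist` and the sample points are illustrative assumptions, not part of the original paper.

```python
import math

def pairwise_dist(points):
    """Return the N x N matrix of Euclidean distances between points."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Square root of the sum of squared differences over the D attributes.
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
            dist[i][j] = dist[j][i] = d  # the distance is symmetric
    return dist

# Hypothetical sample data: three 2-D points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = pairwise_dist(points)
print(dist[0][1])  # 5.0
```

Storing the whole matrix costs memory quadratic in N, which is acceptable for small examples like this one.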

2.2. Local density

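The local density $\rho _i$ of data object $x_{i}$ is defined as:

$\rho _{i}=\sum_{x_{j}\in U}\chi \left ( dist\left ( x_{i},x_{j} \right )-dist_{cutoff} \right )$

where $dist_{cutoff}$ is the cutoff distance and $\chi \left ( x \right )=\left\{\begin{matrix} 1 & x< 0 \\ 0 & x\geq 0 \end{matrix}\right.$ . In other words, $\rho _i$ is the number of data points whose distance to the $i$-th data point is less than the cutoff distance $dist_{cutoff}$ , and this count is taken as the density of the $i$-th data point.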
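
The density count can be sketched as follows. The function `local_density` and the toy data are assumptions made for illustration. The sketch uses a strict "less than" comparison, and it excludes the point itself from its own count, which is also an assumption: including it would add 1 to every density and leave their ordering unchanged.

```python
def local_density(dist, cutoff):
    """For each point, count the other points closer than `cutoff`."""
    n = len(dist)
    return [
        sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
        for i in range(n)
    ]

# Hypothetical toy data: four points on a line at 0, 1, 2 and 10.
coords = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in coords] for a in coords]

rho = local_density(dist, cutoff=1.5)
print(rho)  # [1, 2, 1, 0]
```

The middle point of the group has the highest density, and the isolated point at 10 has density 0.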

2.3. Distance to points of higher density

The distance from data object $x_{i}$ to the nearest data object of higher local density is defined as

$\delta _{i}=\underset{j:\rho _{j}>\rho _{i} }{min}\left ( dist\left ( x_{i},x_{j} \right ) \right )$

When $x_{i}$ has the largest local density, $\delta _{i}=\underset{x_{j}\in U,j\neq i}{max}\left ( dist\left ( x_{i},x_{j} \right ) \right )$ .
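
A minimal sketch of both cases of the definition is shown below. The function name `delta_distance` and the toy data are illustrative assumptions. Note one consequence of the strict comparison: if several points tie for the largest density, each of them falls into the second case.

```python
def delta_distance(dist, rho):
    """Distance from each point to its nearest point of strictly higher density."""
    n = len(rho)
    delta = [0.0] * n
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta[i] = min(higher)
        else:
            # No denser point exists: use the largest distance to any other point.
            delta[i] = max(dist[i][j] for j in range(n) if j != i)
    return delta

# Hypothetical toy data: four points on a line at 0, 1, 2 and 10.
coords = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in coords] for a in coords]
rho = [1, 2, 1, 0]  # assumed densities for a cutoff of 1.5, not counting the point itself

delta = delta_distance(dist, rho)
print(delta)  # [1.0, 9.0, 1.0, 8.0]
```

The densest point (index 1) gets a large $\delta$ together with a large $\rho$, while the isolated point (index 3) gets a large $\delta$ with a small $\rho$.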

3. How to cluster

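A clustering method has to answer two questions: what the cluster centers are, and how each data point is assigned to a cluster. DPCA defines the cluster centers as the points that have both a large distance $\delta _i$ and a large local density $\rho _i$ .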

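As shown in panel B of the figure above, data points 1 and 10 both have a relatively large distance and a relatively large local density, so they are cluster centers. Data points 26, 27, and 28 have a relatively large distance but a small local density, and are called outliers. How are the remaining, non-outlier points clustered? DPCA assigns each of them to the same cluster as its nearest point of higher density, so cluster labels propagate outward from the cluster centers. This concludes the explanation of the basic idea of the algorithm.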
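
The whole procedure can be sketched end to end as below. This is a sketch under stated assumptions, not the authors' implementation: the function `density_peaks`, its parameters, and the sample points are hypothetical. The centers are picked automatically as the `n_centers` points with the largest product $\rho_i \delta_i$, whereas the text selects them by inspecting the figure. Density ties are broken by index so that every point except the densest one has a well-defined nearest denser point.

```python
import math

def density_peaks(points, cutoff, n_centers):
    """Return (rho, delta, labels) for a small data set."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    # Indices by decreasing density (stable sort, so ties keep index order).
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    parent = [-1] * n  # nearest point that comes earlier in `order`
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # densest point: largest distance to any point
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j
    # Assumed heuristic: centers are the points with the largest rho * delta.
    centers = sorted(order, key=lambda i: rho[i] * delta[i], reverse=True)[:n_centers]
    labels = [-1] * n
    for label, i in enumerate(centers):
        labels[i] = label
    # Visit points from dense to sparse so each parent is labeled first.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return rho, delta, labels

# Hypothetical sample data: two well-separated groups of five points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta, labels = density_peaks(points, cutoff=1.0, n_centers=2)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Here the two middle points (indices 4 and 9) are chosen as centers, and every other point inherits the label of its nearest denser point. Outlier detection is left out of this sketch.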

References

Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 2014.