DBSCAN: A walkthrough of a density-based clustering method

weixin_26746401

Original article: https://medium.com/devtorq/dbscan-a-walkthrough-of-a-density-based-clustering-method-b5e74ca9fcfa

This article gives an accessible introduction to DBSCAN, a density-based clustering method, and explains how it works and how it is applied.

The arrival of newer algorithms doesn’t make the older ones ‘completely redundant’. The British statistician George E. P. Box famously wrote that “All models are wrong, but some are useful”, meaning that no model is exact enough to be certified as 100 percent accurate. Claiming otherwise can only lead to a loss of generalization. The sensible goal is to find the model that approximates the data best.

Clustering is an unsupervised learning technique whose aim is to group similar objects together. We live in a world where our past and present choices have become a dataset that can be clustered to identify patterns in our searches, shopping carts, the books we read, and so on, so that an algorithm can recommend things to us. It is fascinating that these algorithms can pick up on more about us than we ourselves recognize!

As already discussed in the previous blog, K-means makes use of Euclidean distance as a metric to form the clusters. This leads to a variety of drawbacks as mentioned. Please refer to the blog to read about the K-means algorithm, implementation, and drawbacks: Clustering — Diving deep into K-means algorithm

Real-life data contains outliers and is often irregular in shape. K-means handles neither well, which makes it unsuitable for arbitrarily shaped, noisy data. In this blog, we are going to learn about an interesting density-based clustering approach: DBSCAN.

Density-based spatial clustering of applications with noise (DBSCAN)

DBSCAN is a density-based clustering approach that separates regions with a high density of data points from regions with a lower density. Its fundamental definition is that a cluster is a contiguous region of dense data points, separated from other such regions by regions of low density. Unlike K-means clustering, the number of clusters is not supplied in advance; it is determined by the algorithm. Two important concepts are density reachability and density connectivity, which can be understood as follows:

“A point is considered to be density reachable from another point if it is situated within a particular distance range of it. This is the criterion for calling two points neighbors. Similarly, if two points A and B are density reachable (neighbors), and B and C are also density reachable (neighbors), then by chaining, A and C belong to the same cluster. This concept is called density connectivity. By this approach, the algorithm performs cluster propagation.” Note that the chain only propagates through core points (defined below): if B is not a core point, A and C are not guaranteed to end up in the same cluster.
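
The chaining idea can be seen on a tiny example. The sketch below uses scikit-learn's `DBSCAN` on four hand-picked points; the coordinates and the values `eps=0.5` and `min_samples=2` are assumptions chosen for illustration, not values from the original article. A and C are 0.8 apart, so they are not neighbors, but both are neighbors of B.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A, B and C lie on a line, 0.4 apart; D is far away from all of them.
points = np.array([
    [0.0, 0.0],  # A
    [0.4, 0.0],  # B
    [0.8, 0.0],  # C
    [5.0, 5.0],  # D
])

# With eps=0.5, A-B and B-C are neighbors, but A-C (distance 0.8) are not.
# min_samples=2 counts the point itself, so A, B and C are all core points.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)

print(labels)  # A, B, C share one cluster label; D is labelled -1 (noise)
```

A and C land in the same cluster only because B links them, which is the cluster propagation the quote describes.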

The key constructs of the DBSCAN algorithm that help it determine the ‘concept of density’ are as follows:

Epsilon ε (measure): ε is the threshold radius distance which determines the neighborhood of a point. If a point is located at a distance less than or equal to ε from another point, it becomes its neighbor, that is, it becomes density reachable to it.
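
As a minimal sketch of the ε-neighborhood, the helper below returns the indices of all points within ε of a chosen point. The function name `region_query` and the sample coordinates are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def region_query(points, index, eps):
    """Return indices of all points within eps of points[index].

    The query point itself is included (its distance is 0).
    """
    distances = np.linalg.norm(points - points[index], axis=1)
    return np.flatnonzero(distances <= eps)

points = np.array([[0.0, 0.0], [0.3, 0.0], [0.0, 0.4], [2.0, 2.0]])

# Points 1 and 2 are 0.3 and 0.4 away from point 0; point 3 is about 2.83 away.
neighbors = region_query(points, 0, eps=0.5)
print(neighbors)
```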

Choice of ε: ε should be chosen so that the clusters and the outliers are separated as cleanly as possible. Too large a value of ε can merge the entire dataset into a single cluster, while too small a value can classify every point as noise. In layman’s terms, the average distance of each point from its k nearest neighbors is computed, sorted, and plotted. The point of maximum change in the curve (the bend of the elbow) suggests a good value of ε.
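
The k-distance curve described above could be computed roughly as follows. This is a sketch using scikit-learn's `NearestNeighbors` on synthetic data; the dataset, the choice `k = 4`, and the variable names are assumptions for illustration. Plotting is left to whichever plotting library is at hand.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data, used only to have something to measure.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

k = 4  # assumed value; often tied to the Min points setting

# Ask for k + 1 neighbors because each point is returned as its own
# nearest neighbor (distance 0) when the training data is queried.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Average distance to the k nearest neighbors, sorted in ascending order.
avg_k_dist = np.sort(distances[:, 1:].mean(axis=1))

# Plotting avg_k_dist against its index gives the elbow curve; the bend
# is read off by eye as a candidate for eps.
print(avg_k_dist[:5], avg_k_dist[-5:])
```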

Min points m (measure): It is a threshold number of points present in the ε distance of a data point that dictates the category of that data point. It is driven by the number of dimensions present.

Choice of Min points: The value of Min points should be at least 3. Denser and higher-dimensional data call for a larger value. The rule of thumb for assigning a value to Min points is: Min points ≥ Dimensionality + 1
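
The two constraints above (at least 3, and at least dimensionality + 1) can be combined into a lower bound. The helper name `suggest_min_points` is hypothetical, and the result is only a starting point, since denser data may call for more.

```python
def suggest_min_points(n_features):
    """Lower bound for Min points: at least 3 and at least n_features + 1."""
    return max(3, n_features + 1)

for d in (1, 2, 10):
    print(d, suggest_min_points(d))
```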

Core points (data points): A point is a core point if it has at least m points within a radius of ε around it.

Border points (data points): A point that doesn’t qualify as a core point but is a neighbor of a core point.

Noise points (data points): An outlier point that doesn’t fulfill any of the above-given criteria.
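
The three categories can be sketched directly from their definitions. The function `classify_points` and the sample data are hypothetical. The sketch assumes Euclidean distance and counts a point as its own neighbor, which is the convention scikit-learn uses for `min_samples`.

```python
import numpy as np

def classify_points(points, eps, min_points):
    """Label each point as 'core', 'border' or 'noise'."""
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    within = dist <= eps                      # includes the point itself
    is_core = within.sum(axis=1) >= min_points

    # A border point is not core but lies within eps of some core point.
    near_core = (within & is_core[None, :]).any(axis=1)

    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense square
    [0.5, 0.0],                                      # on the edge of the square
    [3.0, 3.0],                                      # isolated
])
kinds = classify_points(points, eps=0.45, min_points=4)
print(kinds)
```

The fifth point has only three points within ε (itself and two corners of the square), so it is not a core point, but two of those neighbors are core points, which makes it a border point.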

Algorithm:

1. Select a value for ε and m.
2. Mark all points as outliers.
3. For each point, if at least m points are present within its ε distance range:

   a. Identify it as a core point and mark the point as visited.
   b. Assign the core point and its density reachable points to one cluster and remove them from the outlier list.

4. Check whether any of the clusters are density connected. If so, merge them into one.

5. For points remaining in the outlier list, identify them as noise.
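
The steps above can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the reference implementation: it uses Euclidean distance, counts a point as its own neighbor, and builds a full pairwise distance matrix (so it needs O(n²) memory). Instead of forming clusters and merging them afterwards, it grows each cluster through a queue of core points, which arrives at the same merged clusters. The names `dbscan` and `NOISE` are hypothetical.

```python
from collections import deque

import numpy as np

NOISE = -1

def dbscan(points, eps, min_points):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    # Epsilon-neighborhood of every point (the point itself included).
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_points for nb in neighbors])

    labels = np.full(n, NOISE)  # every point starts as an outlier
    cluster_id = 0
    for i in range(n):
        # Only an unassigned core point can start a new cluster.
        if labels[i] != NOISE or not is_core[i]:
            continue
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            p = queue.popleft()  # p is always a core point
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster_id  # remove q from the outlier list
                    if is_core[q]:
                        queue.append(q)     # keep propagating through cores
        cluster_id += 1
    return labels  # points still labelled NOISE are the noise points

points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],  # first dense group
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],  # second dense group
    [10.0, 0.0],                                     # isolated point
])
labels = dbscan(points, eps=0.2, min_points=3)
print(labels)
```

Border points receive a label but are never added to the queue, so a cluster cannot spread through them.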

The time complexity of DBSCAN ranges from O(n log n) (best case) to O(n²) (worst case), depending on the indexing structure and the values of ε and m chosen.
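
In scikit-learn the indexing structure is selected with the `algorithm` parameter of `DBSCAN`. The sketch below runs the same clustering with a k-d tree and with brute-force search on synthetic data; the dataset and parameter values are assumptions for illustration. The two runs are expected to agree on the labels, with only the cost of the neighbor queries differing.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

# Tree-based index: neighbor queries can skip most of the dataset.
labels_tree = DBSCAN(eps=0.5, min_samples=5, algorithm="kd_tree").fit_predict(X)

# Brute force: every neighbor query compares against all n points.
labels_brute = DBSCAN(eps=0.5, min_samples=5, algorithm="brute").fit_predict(X)

print((labels_tree == labels_brute).all())
```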

Python code:

DBSCAN is available in the scikit-learn library. Below is its class signature, showing some of the hyperparameters with their default values:

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean')

eps is the epsilon value as already explained.

min_samples is the Min points value.

metric is the measure used to calculate the distance between points. By default it is the Euclidean distance; it can also be a user-defined distance function, or ‘precomputed’, in which case the input is treated as a distance matrix.
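
A short usage sketch of the three parameters follows. The two-moons dataset and the values `eps=0.3` and `min_samples=5` are assumptions chosen for illustration. The second call shows the ‘precomputed’ option, where a square distance matrix is passed in place of the feature array.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import pairwise_distances

# Two interleaving half-circles: a shape K-means typically struggles with.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

model = DBSCAN(eps=0.3, min_samples=5, metric="euclidean")
labels = model.fit_predict(X)  # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise, len(model.core_sample_indices_))

# Alternative: compute the distance matrix up front and pass it directly.
D = pairwise_distances(X, metric="euclidean")
labels_pre = DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit_predict(D)
```

Passing a precomputed matrix is convenient when the distance is expensive to compute or not one of the built-in metrics.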

There are some advanced hyperparameters which will be best discussed in future projects.

Drawbacks:

When clusters differ widely in density, DBSCAN can give unimpressive results. Such a dataset may need different ε and ‘Min points’ values for different regions, which is not possible with DBSCAN because a single pair of values applies to the whole dataset.
DBSCAN can give different results on the same dataset depending on the order in which the points are processed, because a border point that lies within ε of core points from two clusters is assigned to whichever cluster reaches it first. This is rare, but it is why DBSCAN is described as not fully deterministic.
DBSCAN suffers from the curse of dimensionality. It doesn’t work as expected on high-dimensional datasets.

To overcome these, other advanced algorithms have been designed which will be discussed in future blogs.

Stay tuned. Happy learning :)
