Clustering Algorithms with Hyperparameter Optimization
Table of Contents:
(i) Article Agenda
(ii) Data Processing
(iii) K-means Clustering with Hyperparameter Optimization
(iv) Hierarchical Clustering
(v) DBSCAN Clustering with Hyperparameter Optimization
(vi) Conclusion
(vii) References
(i) Article Agenda:
This article focuses on the practical implementation of clustering algorithms on a data set, along with hyperparameter optimization.
Prerequisites: a basic understanding of K-means, hierarchical, and DBSCAN clustering
Throughout this article, I use the Mall_Customers.csv data set from https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python
Please download the data set from the link above.
(ii) Data Processing:
We read the CSV file using pandas:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reading the csv file
df = pd.read_csv('Customers.csv')
df.head()
The data set contains 5 features.
Problem statement: we need to cluster people based on their Annual Income (k$) and how much they spend (Spending Score (1–100)).
So our features for clustering are Annual Income (k$) and Spending Score (1–100).
The spending score is simply a score assigned based on how much a customer spends.
f1: Annual Income (k$)
f2: Spending Score (1–100)
Now we need to create an array with f1 (x) and f2 (y) from the data frame df.
# converting features f1 and f2 into an array
X = df.iloc[:, [3, 4]].values
Now that the features are in array form, we can proceed to the implementation steps.
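Before moving on, it can help to take a quick look at the two features. The following is a minimal sketch (not part of the original walkthrough) that plots the array X built above, reusing the matplotlib import from the data-processing step:

# Optional: visualize Annual Income vs Spending Score before clustering
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customers before clustering')
plt.show()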
(iii) K-means Clustering:
K-means clustering is a centroid-based algorithm.
K = number of clusters = hyperparameter
We find the K value using the Elbow method.
The K-means objective function is argmin_c Σ ‖x − c‖²
where x = a data point in the cluster
c = the centroid of the cluster
Objective: we need to minimize the squared distance between each data point and its centroid.
If we have K clusters, then we have K centroids.
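To make the objective concrete, here is a small NumPy sketch (an illustration, not from the original article) of the quantity scikit-learn later reports as inertia_: the sum of squared distances from each point to its nearest centroid.

# Sketch: K-means objective for a given set of centroids
# (equivalent to scikit-learn's inertia_, up to floating-point error)
def kmeans_objective(X, centroids):
    # distance from every point to every centroid, shape (n_points, n_centroids)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # each point contributes the squared distance to its closest centroid
    return np.sum(dists.min(axis=1) ** 2)

For a fitted model, kmeans_objective(X, model.cluster_centers_) should match model.inertia_.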
Intracluster distance: the distance between data points within the same cluster
Intercluster distance: the distance between different clusters
Our main aim is to choose clusters with a small intracluster distance and a large intercluster distance.
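As an illustration of these two quantities, here is a hypothetical helper (not from the article) that computes the average intracluster distance and the smallest centroid-to-centroid intercluster distance for a fitted clustering:

# Sketch: intracluster vs intercluster distances for fitted labels/centroids
def cluster_distances(X, labels, centroids):
    # average distance of each point to its own cluster's centroid
    intra = np.mean(np.linalg.norm(X - centroids[labels], axis=1))
    # smallest pairwise distance between distinct centroids
    pairwise = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    inter = pairwise[np.triu_indices_from(pairwise, k=1)].min()
    return intra, inter

A good clustering tends to give a small intra value and a large inter value.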
We use K-means++ initialization (a probabilistic approach).
from sklearn.cluster import KMeans

# objective function is argmin over c of sum(||x - c||^2)
# c: centroid, x: point in the data set
objective_function = []
for i in range(1, 11):
    clustering = KMeans(n_clusters=i, init='k-means++')
    clustering.fit(X)
    # inertia_ is the sum of squared intra-cluster distances
    objective_function.append(clustering.inertia_)

# objective_function now holds the minimized intra-cluster distances
objective_function
objective_function: the minimum intracluster distances for K = 1 to 10
[269981.28000000014,
183116.4295463669,
106348.37306211119,
73679.78903948837,
44448.45544793369,
37233.81451071002,
31599.13139461115,
25012.917069885472,
21850.16528258562,
19701.35225128174]
We tried K values from 1 to 10, but from these numbers alone we cannot be sure which K is best.
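A common next step for the Elbow method is to plot these values against K and look for the "elbow", the point where the objective stops dropping sharply. A minimal sketch using the objective_function list computed above:

# Elbow plot: objective (inertia) vs number of clusters K
plt.plot(range(1, 11), objective_function, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Objective (inertia)')
plt.title('Elbow method')
plt.show()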