DBSCAN clustering of high-dimensional data: a practical implementation of K-means, hierarchical, and DBSCAN clustering on a data set

Clustering Algorithms with Hyperparameter optimization

Table of contents:

(i) Article Agenda

(ii) Data Processing

(iii) K-means Clustering with Hyperparameter Optimization

(iv) Hierarchical Clustering

(v) DBSCAN Clustering with Hyperparameter Optimization

(vi) Conclusion

(vii) References

(i) Article Agenda:

This article covers the implementation of clustering algorithms on a data set, together with hyperparameter optimization.

Prerequisites: Basic understanding of K-means, Hierarchical, and DBSCAN clustering.

Throughout this article, I use the Mall_Customers.csv data set from https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python

Please download the data set from the above link.

(ii) Data Processing:

We read the CSV file using pandas.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reading the CSV file (adjust the name if you saved the download as Mall_Customers.csv)
df = pd.read_csv('Customers.csv')
df.head()
[Figure: top 5 rows of df]

The data set contains 5 features.

Problem statement: we need to cluster people based on their Annual Income (k$) and how much they spend (Spending Score (1–100)).

So our features for clustering are Annual Income (k$) and Spending Score (1–100).

The spending score is simply a score assigned on the basis of how much a customer spends.

f1: Annual Income (k$)

f2: Spending Score (1–100)

Now we need to create an array with f1 (x) and f2 (y) from the data frame df.

# converting features f1 and f2 into an array 
X=df.iloc[:,[3,4]].values
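
The positional selection above depends on the column order of the file. As a minimal sketch, the same array can be built by column name instead. The column names and the small stand-in frame below are assumptions made for illustration (the rows are made up, not taken from the real file), so check them against df.columns before relying on them.

```python
import pandas as pd

# Made-up stand-in frame; the column layout is an assumption about Mall_Customers.csv
df_demo = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
    "Age": [30, 40, 50],
    "Annual Income (k$)": [20, 40, 60],
    "Spending Score (1-100)": [10, 50, 90],
})

# Assumed names of the two clustering features
feature_columns = ["Annual Income (k$)", "Spending Score (1-100)"]

# Selecting by name gives the same array as selecting positions 3 and 4
X_by_name = df_demo[feature_columns].values
X_by_position = df_demo.iloc[:, [3, 4]].values

print(X_by_name.shape)
```

Selecting by name fails loudly with a KeyError if a column is missing or renamed, whereas a positional index would silently pick up whatever column sits in that slot.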

We now have the features in array form, so we can proceed to the implementation steps.

(iii) K-means Clustering:

K-means clustering is a centroid-based algorithm.

K = number of clusters = hyperparameter

We find the value of K using the Elbow method.

The K-means objective function is $\arg\min_{c} \sum_{x} \lVert x - c \rVert^{2}$

where x = a data point in the cluster

c = the centroid of that cluster

Objective: minimize the squared distance between each data point and the centroid of its cluster.
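
To make the objective concrete, here is a small sketch that evaluates the sum of squared distances by hand. The four points and their cluster assignments are hypothetical toy values chosen so the arithmetic is easy to follow, and the function name is made up for this example.

```python
import numpy as np

# Hypothetical toy data: four 2-D points already assigned to two clusters
points = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]])
assignments = np.array([0, 0, 1, 1])

def within_cluster_sum_of_squares(points, assignments):
    """Sum of ||x - c||^2 over every point x, where c is the mean of x's cluster."""
    total = 0.0
    for cluster_id in np.unique(assignments):
        members = points[assignments == cluster_id]
        centroid = members.mean(axis=0)
        total += float(((members - centroid) ** 2).sum())
    return total

wcss = within_cluster_sum_of_squares(points, assignments)
print(wcss)  # 4.0
```

Each cluster contributes 2.0 here (two points, each at distance 1 from their centroid), giving 4.0 in total. scikit-learn's KMeans reports this same kind of quantity, the sum of squared distances of samples to their closest centroid, through its inertia_ attribute.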

If we have K clusters, then we have K centroids.

Intracluster distance: distances between data points in the same cluster.

Intercluster distance: distances between different clusters.

Our main aim is to choose clusters that have a small intracluster distance and a large intercluster distance.
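
The two distances can be measured in several ways. The sketch below assumes one simple choice for each: the mean pairwise distance between points of a cluster for the intracluster distance, and the distance between two centroids for the intercluster distance. The toy points, labels, and function name are hypothetical.

```python
import numpy as np
from itertools import combinations

# Hypothetical toy data: two small, well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]])
labels_toy = np.array([0, 0, 1, 1])

def mean_intracluster_distance(cluster_points):
    """Average Euclidean distance over all pairs of points in one cluster."""
    pairs = list(combinations(range(len(cluster_points)), 2))
    if not pairs:
        return 0.0  # a single-point cluster has no pairs
    dists = [np.linalg.norm(cluster_points[i] - cluster_points[j]) for i, j in pairs]
    return float(np.mean(dists))

cluster_ids = np.unique(labels_toy)

# Intracluster distance per cluster (assumed definition: mean pairwise distance)
intra = {int(k): mean_intracluster_distance(X_toy[labels_toy == k]) for k in cluster_ids}

# Intercluster distance (assumed definition: distance between the two centroids)
centroids_toy = np.array([X_toy[labels_toy == k].mean(axis=0) for k in cluster_ids])
inter = float(np.linalg.norm(centroids_toy[0] - centroids_toy[1]))

print(intra, inter)
```

In this toy case each cluster has an intracluster distance of 2.0 while the centroids are about 5.83 apart, which is the pattern a good clustering aims for.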

We use K-means++ initialization (a probabilistic approach).
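
As a rough illustration of why K-means++ is called probabilistic, the sketch below picks the first centroid uniformly at random and each later centroid with probability proportional to its squared distance from the nearest centroid chosen so far. The function name and toy data are hypothetical, the sketch assumes the data holds at least k distinct points, and scikit-learn's own implementation may differ in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids from the rows of X using K-means++ style seeding."""
    n = X.shape[0]
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        chosen = np.array(centroids)
        # Squared distance from every point to its nearest already-chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to be picked; chosen points have probability 0
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# Hypothetical toy data: three loose groups of two points each
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0],
                 [10.0, 11.0], [20.0, 0.0], [20.0, 1.0]])
seeds = kmeans_pp_init(data, 3, np.random.default_rng(0))
print(seeds)
```

Because distant points get higher probability, the initial centroids tend to be spread across the data rather than bunched together, which usually gives K-means a better starting point than purely uniform sampling.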

from sklearn.cluster import KMeans

# Objective: argmin over c of sum(||x - c||^2); c: centroid, x: point in data set
objective_function = []
for i in range(1, 11):
    clustering = KMeans(n_clusters=i, init='k-means++')
    clustering.fit(X)
    # inertia_ is the within-cluster sum of squared distances
    objective_function.append(clustering.inertia_)

# objective_function holds the inertia for K = 1 to 10
objective_function

objective_function: minimum intracluster distances (inertia values) for K = 1 to 10

[269981.28000000014,
183116.4295463669,
106348.37306211119,
73679.78903948837,
44448.45544793369,
37233.81451071002,
31599.13139461115,
25012.917069885472,
21850.16528258562,
19701.35225128174]

We tried K values from 1 to 10, but we do not yet know for certain which K is best.
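
One way to read the inertia values listed above is to look at how much the inertia falls each time K increases by one. The sketch below copies those ten values and prints the successive drops; the variable names are made up for this example, and where exactly the bend lies remains a judgment call.

```python
# Inertia values for K = 1 to 10, copied from the output above
inertia_values = [
    269981.28000000014,
    183116.4295463669,
    106348.37306211119,
    73679.78903948837,
    44448.45544793369,
    37233.81451071002,
    31599.13139461115,
    25012.917069885472,
    21850.16528258562,
    19701.35225128174,
]
k_values = list(range(1, 11))

# drops[i] is the fall in inertia when moving from K = i + 1 to K = i + 2
drops = [inertia_values[i] - inertia_values[i + 1]
         for i in range(len(inertia_values) - 1)]

for k, drop in zip(k_values[1:], drops):
    print(f"K={k - 1} -> K={k}: inertia fell by {drop:.2f}")
```

In these numbers the fall from K=5 to K=6 is much smaller than the fall from K=4 to K=5, which is the kind of flattening the Elbow method looks for. Plotting inertia_values against k_values with matplotlib shows the same shape visually.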
