Clustering Algorithms with Hyperparameter Optimization
Table of Contents:
(i) Article Agenda
(ii) Data Processing
(iii) K-means Clustering with Hyperparameter Optimization
(iv) Hierarchical Clustering
(v) DBSCAN Clustering with Hyperparameter Optimization
(vi) Conclusion
(vii) References
(i) Article Agenda:
This article focuses on the practical implementation of clustering algorithms on a data set, along with hyperparameter optimization.
Prerequisites: a basic understanding of K-means, hierarchical, and DBSCAN clustering
Throughout this article, I use the Mall_Customers.csv data set from https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python
Please download the data set from the link above.
(ii) Data Processing:
We read the CSV file using pandas:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reading the csv file
df = pd.read_csv('Customers.csv')
df.head()
The data set contains 5 features.
Problem statement: we need to cluster people based on their Annual Income (k$) and how much they spend (Spending Score (1–100)).
So our features for clustering are Annual Income (k$) and Spending Score (1–100).
The spending score is simply a score assigned based on how much a customer spends.
f1: Annual Income (k$)
f2: Spending Score (1–100)
Now we need to create an array with f1 (x) and f2 (y) from the data frame df.
# converting features f1 and f2 into an array
X = df.iloc[:, [3, 4]].values
Now that the features are in array form, we can proceed to the implementation steps.
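Before moving on, it can help to take a quick look at the two features. The following is a minimal sketch (not part of the original walkthrough) that plots the array X built above, reusing the matplotlib import from the data-processing step:

# Optional: visualize Annual Income vs Spending Score before clustering
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customers before clustering')
plt.show()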
(iii) K-means Clustering:
K-means clustering is a centroid-based algorithm.
K = number of clusters = hyperparameter
We find the K value using the Elbow method.
The K-means objective function is argmin_c Σ ‖x − c‖²
where x = a data point in the cluster
c = the centroid of the cluster
Objective: we need to minimize the squared distance between each data point and its centroid.
If we have K clusters, then we have K centroids.
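To make the objective concrete, here is a small NumPy sketch (an illustration, not from the original article) of the quantity scikit-learn later reports as inertia_: the sum of squared distances from each point to its nearest centroid.

# Sketch: K-means objective for a given set of centroids
# (equivalent to scikit-learn's inertia_, up to floating-point error)
def kmeans_objective(X, centroids):
    # distance from every point to every centroid, shape (n_points, n_centroids)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # each point contributes the squared distance to its closest centroid
    return np.sum(dists.min(axis=1) ** 2)

For a fitted model, kmeans_objective(X, model.cluster_centers_) should match model.inertia_.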
Intracluster distance: the distance between data points within the same cluster
Intercluster distance: the distance between different clusters
Our main aim is to choose clusters with a small intracluster distance and a large intercluster distance.
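As an illustration of these two quantities, here is a hypothetical helper (not from the article) that computes the average intracluster distance and the smallest centroid-to-centroid intercluster distance for a fitted clustering:

# Sketch: intracluster vs intercluster distances for fitted labels/centroids
def cluster_distances(X, labels, centroids):
    # average distance of each point to its own cluster's centroid
    intra = np.mean(np.linalg.norm(X - centroids[labels], axis=1))
    # smallest pairwise distance between distinct centroids
    pairwise = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    inter = pairwise[np.triu_indices_from(pairwise, k=1)].min()
    return intra, inter

A good clustering tends to give a small intra value and a large inter value.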
We use K-means++ initialization (a probabilistic approach).
from sklearn.cluster import KMeans

# objective function is argmin over c of sum(||x - c||^2)
# c: centroid, x: point in the data set
objective_function = []
for i in range(1, 11):
    clustering = KMeans(n_clusters=i, init='k-means++')
    clustering.fit(X)
    # inertia_ is the sum of squared intra-cluster distances
    objective_function.append(clustering.inertia_)

# objective_function now holds the minimized intra-cluster distances
objective_function
objective_function: the minimum intracluster distances for K = 1 to 10
[269981.28000000014,
183116.4295463669,
106348.37306211119,
73679.78903948837,
44448.45544793369,
37233.81451071002,
31599.13139461115,
25012.917069885472,
21850.16528258562,
19701.35225128174]
We tried K values from 1 to 10, but from these numbers alone we cannot be sure which K is best.
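A common next step for the Elbow method is to plot these values against K and look for the "elbow", the point where the objective stops dropping sharply. A minimal sketch using the objective_function list computed above:

# Elbow plot: objective (inertia) vs number of clusters K
plt.plot(range(1, 11), objective_function, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Objective (inertia)')
plt.title('Elbow method')
plt.show()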