Clustering Techniques: Hierarchical and Non-Hierarchical


Clustering falls under the unsupervised learning technique. In this technique, the data is not labelled and there is no defined dependent variable. This type of learning is usually done to identify patterns in the data and/or to group similar data.

In this post, a detailed explanation of the types of clustering techniques and a code walk-through are provided.

WHAT IS CLUSTERING?

Clustering is a method of grouping similar objects. The objective of clustering is to create homogeneous groups out of heterogeneous observations. The assumption is that the data comes from multiple populations; for example, there could be people from different walks of life requesting loans from a bank for different purposes. If the person is a student, he/she could ask for an education loan, someone who is looking to buy a house can ask for a home loan, and so on. Clustering helps to identify similar groups and cater to their needs better.

WHY CLUSTERING?

Clustering is a distance-based algorithm. The purpose of clustering is to minimize the intra-cluster distance and maximize the inter-cluster distance.

Clustered data (Image by author)

Clustering as a tool can be used to gain insight into the data. A huge amount of information can be obtained by visualizing the data. The output of the clustering can also be used as a pre-processing step for other algorithms. This technique is widely used; some of the important use cases are market segmentation, customer segmentation, and image processing.

Before proceeding further, let us understand the core of clustering.

MEASURE OF DISTANCE

Clustering is all about the distance between two points and the distance between two clusters. Distance cannot be negative. There are a few common measures of distance that the algorithm uses for the clustering problem.

EUCLIDEAN DISTANCE

It is the default distance used by the algorithm. It is best explained as the straight-line distance between two points. If the distance between two points p and q is to be measured, then the Euclidean distance is

Euclidean distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$

MANHATTAN DISTANCE

It is the distance between two points measured along the axes at right angles. It is also called taxicab distance, as it represents how vehicles drive in the city of Manhattan, where the streets intersect at right angles.

Manhattan distance: $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$

MINKOWSKI DISTANCE

In an n-dimensional space, the generalized distance between two points is called the Minkowski distance.

Minkowski distance: $d(p, q) = \left( \sum_{i=1}^{n} |q_i - p_i|^p \right)^{1/p}$

It is a generalization of the Euclidean and Manhattan distances: if the value of p is 2, it becomes the Euclidean distance, and if the value of p is 1, it becomes the Manhattan distance.

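To make these measures concrete, here is a minimal sketch using scipy.spatial.distance on two made-up points; it also confirms that the Minkowski distance with p=2 and p=1 reduces to the Euclidean and Manhattan distances respectively.

#A minimal sketch of the three distance measures (u and v are made-up points)
from scipy.spatial.distance import euclidean, cityblock, minkowski

u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 8.0]

#Euclidean distance: sqrt(3^2 + 4^2 + 5^2) = 7.0711
print(euclidean(u, v))

#Manhattan (taxicab) distance, called cityblock in scipy: 3 + 4 + 5 = 12
print(cityblock(u, v))

#Minkowski distance with p=2 and p=1
print(minkowski(u, v, p=2))    #same as the Euclidean distance
print(minkowski(u, v, p=1))    #same as the Manhattan distance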

TYPES OF CLUSTERING

There are two major types of clustering techniques

  1. Hierarchical or Agglomerative
  2. k-means

Let us look at each type along with a code walk-through.

HIERARCHICAL CLUSTERING

It is a bottom-up approach. Records in the data set are grouped sequentially to form clusters, based on the distance between the records and also the distance between the clusters. Here is a step-wise approach to this method -

  1. Start with n clusters, where each row is considered as a cluster

  2. Using a distance-based approach, the two records that are closest to each other are merged into a cluster. In the figure below, for the given five records, assuming A and C are closest in distance, they form a cluster; likewise, B and E form another cluster, and so on
Clustering of two closest records (Image by author)

3. At every step, the two closest clusters are merged. Either a single record (singleton) is added to an existing cluster or two clusters are combined. After at least one multiple-element cluster is formed, a scenario arises where the distance needs to be computed between a singleton and a set of observations, and that is where the concept of linkages comes into the picture. There are five major types of linkages, and the clustering happens by using one of the concepts below -

  • Single linkage: It is the shortest distance between any two points in the two clusters

  • Complete linkage: It is the opposite of single linkage. It is the longest distance between any two points in the two clusters

  • Average linkage: It is the average distance between each point in one cluster and every point in the other cluster

  • Centroid linkage: The distance between the center point of one cluster and the center point of the other cluster

  • Ward’s linkage: A combination of the average and centroid methods. The within-cluster variance is calculated by determining the center point of the cluster and the distance of the observations from the center. While trying to merge two clusters, the variance of each candidate merge is found, and the pair of clusters whose merge yields the smallest variance is combined.

A point to note is that each linkage method produces a unique result. When each of these methods is applied to the same data set, it may be clustered differently, as the sketch after this list illustrates.

4. Repeat the steps until there is a single cluster with all the records

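As the note above mentions, here is a small sketch of how the choice of linkage can change the outcome: it runs scipy's linkage on a made-up 2-D array with each of the five methods and cuts each tree into two clusters, so the resulting label arrays can be compared.

#A small sketch comparing the five linkage methods on made-up data
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
              [5.5, 5.2], [3.0, 4.0], [3.2, 3.8]])

for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(X, method=method)
    #Cut the tree into two clusters and show the assignments
    labels = fcluster(Z, 2, criterion='maxclust')
    print(method, labels)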

VISUALIZATION

To visualize the clustering, there is a concept called a dendrogram. The dendrogram is a tree diagram summarizing the clustering process. The records are on the x-axis. Similar records are joined by lines whose vertical length reflects the distance between the records. The greater the difference in height, the more the dissimilarity. A sample dendrogram is shown -

Dendrogram (Image by author)

HIERARCHICAL CLUSTERING CODE WALK-THROUGH

The code for hierarchical clustering is written in Python 3.x using a Jupyter notebook. Let’s begin by importing the necessary libraries.

#Import the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

Next, load the data set. Here, a data set on the food menu of Starbucks is used.

#Read the dataset
df = pd.read_csv('starbucks_menu.csv')

#Look at the top 5 rows
df.head()

Do the necessary Exploratory Data Analysis, like looking at the descriptive statistics and checking for null values and duplicate values. Perform uni-variate and bi-variate analysis, and treat outliers (if any). Since this is a distance-based algorithm, it is necessary to perform normalization wherever applicable so that all the variables are free of any units of measurement. This enables the model to perform at its best.

from scipy.stats import zscore
df.iloc[:,1:6] = df.iloc[:,1:6].apply(zscore)

#Check the head after scaling
df.head()

Once the data is ready, let us work towards building the model. A label list needs to be assigned, which is a list of the unique values of a categorical variable. Here, the label list is created from the Food variable.

#Before clustering, set up the label list from the Food variable
labelList = list(df.Food.unique())
labelList

The next step is to form a linkage to cluster a singleton and another cluster. In this case, Ward’s method is preferred.

#Create the linkage using Ward's method
link_method = linkage(df.iloc[:,1:6], method='ward')

Visualize the clustering with the help of a dendrogram. In this case, a truncated dendrogram is drawn by specifying the p value, so that only the last p merged clusters are displayed.

#Generate the dendrogram
dend = dendrogram(link_method,
                  labels=labelList,
                  truncate_mode='lastp',
                  p=10)
Truncated dendrogram (Image by author)

Once the dendrogram is created, it is necessary to cut the tree to determine the optimum number of clusters. It can be done in one of two ways (shown in the code below). In this case, 3 clusters are chosen. The clusters can be attached to the data frame as a new column for gaining insights.

#Method 1: criterion='maxclust', where a cut is defined based on the number of clusters
clusters = fcluster(link_method, 3, criterion='maxclust')
clusters

#Method 2: criterion='distance', where a cut is defined based on distance on the y-axis
#clusters = fcluster(link_method, 800, criterion='distance')

#Apply the clusters back to the dataset
df['HCluster'] = clusters
df.head()

The last step is to do cluster profiling to extract information and insights from the algorithm to help with effective decision making. The cluster profiling is done by grouping on the cluster, taking the mean, and appending the frequency of each cluster.

aggdata = df.iloc[:,1:8].groupby('HCluster').mean()
aggdata['Frequency'] = df.HCluster.value_counts().sort_index()
aggdata
Cluster profiling (Image by author)

A quick insight is that the first cluster has the food items that are usually low in calories, and hence the macro nutrients are on the lower side. The second cluster has the food items with the largest amount of calories, and hence more macro nutrients. In between clusters 1 and 2 there is a mid range, the third cluster, with a good amount of calories and macro nutrients. Overall, in brief, this model has clustered well.

Let’s move on to the next method.

K-MEANS CLUSTERING

K-Means is a non-hierarchical approach. The idea is to specify the number of clusters beforehand. Each record is then assigned to one of the clusters based on its distance from each cluster. This approach is preferred when the data set is large. The word 'means' in k-means refers to averaging the data, which is also referred to as finding the centroid. Here is a step-wise approach -

  1. Specify the k value beforehand

  2. Assign each record to the cluster whose centroid is closest. K-Means by default uses the Euclidean distance

  3. Re-calculate the centroid for the newly formed clusters. There is a chance that some of the data points may move around based on the distance.

  4. Re-assignment may happen on an iterative basis, and new centroids are formed. This process continues until there is no jumping of observations from cluster to cluster.

  5. If there is any re-assignment, go back to step 3 and continue the steps. If not, the clusters are finalized (a bare-bones sketch of this loop follows the list).
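
To make the loop above concrete, here is the bare-bones sketch referenced in step 5, written with NumPy on made-up 2-D data with k=2 and centroids initialized from random records; the KMeans class used in the walk-through below does this far more robustly.

#A bare-bones sketch of the k-means loop on made-up 2-D data
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

#Step 1: specify the k value beforehand
k = 2
centroids = X[rng.choice(len(X), size=k, replace=False)]

while True:
    #Step 2: assign each record to the closest centroid (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    #Step 3: re-calculate the centroid of each newly formed cluster
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    #Steps 4-5: stop once no observation jumps from cluster to cluster
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)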

Although we have determined the number of clusters beforehand, it may not always be right, and it is necessary to determine the optimum number of clusters. There is no solid solution to determine the number of clusters; however, there is a common method in place. For each value of k, a Within Sum of Squares (WSS) value can be identified. A single cluster cannot be chosen, so it is important to find the k value after which there is no significant difference in the WSS value. To make this more efficient, an elbow plot can be drawn with WSS scores on the y-axis and the number of clusters on the x-axis to visualize the optimum number of clusters.

There is a way to understand how well the model has performed. It is done by checking two metrics, namely the Silhouette Width and the Silhouette Score. This helps us to analyse whether each and every observation mapped to the clusters is correct or not, based on distance criteria. The Silhouette Width is calculated as

Silhouette Width: $s = \frac{b - a}{\max(a, b)}$

where b is the distance between the observation and the neighboring cluster’s centroid, and a is the distance between the observation and its own cluster’s centroid.

The Silhouette Width can have a value in the range of -1 to 1. If the value of the Silhouette Width is positive, then the mapping of the observation to the current cluster is correct. When a > b, the Silhouette Width will return a negative value. The average of all the Silhouette Widths is called the Silhouette Score. If the final score is a positive value close to +1, the clusters are well separated on average. If it is close to 0, they are not separated well enough. If it is a negative value, the model has made a blunder in clustering.

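For a hands-on view of these two metrics, sklearn exposes both the per-observation widths and the averaged score; the sketch below, on made-up data and labels, shows that averaging silhouette_samples gives exactly silhouette_score.

#A small sketch of the per-observation Silhouette Width vs the Silhouette Score
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

#Made-up data: two well-separated blobs and their assumed cluster labels
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

widths = silhouette_samples(X, labels)   #one width per observation, in [-1, 1]
print(widths)
print(widths.mean())                     #equals silhouette_score(X, labels)
print(silhouette_score(X, labels))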

K-MEANS CLUSTERING CODE WALK-THROUGH

Let’s begin by importing the necessary libraries

#Import the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Next, load the data set. The same data set used for Hierarchical clustering is used here.

Do the necessary Exploratory Data Analysis, like looking at the descriptive statistics and checking for null values and duplicate values. Perform uni-variate and bi-variate analysis, and treat outliers (if any). K-means clustering demands scaling. It is done so that all the variables are free of any units of measurement. This enables the model to perform at its best. In this case, the StandardScaler method is put to use.

#Import the standard scaler module and apply it on continuous variables
from sklearn.preprocessing import StandardScaler
X = StandardScaler()
scaled_df = X.fit_transform(df.iloc[:,1:6])
scaled_df

Next, invoke the KMeans method, defining the number of clusters beforehand. Then fit the scaled data set to the model.

# Create the K-Means cluster object and store the result in k_means
k_means = KMeans(n_clusters=2)

# Fit K-Means on scaled_df
k_means.fit(scaled_df)

# Get the labels
k_means.labels_

Now is the time to find the optimum number of clusters by analyzing the values of the Within Sum of Squares (WSS) for a given range of k.

#To determine the optimum number of clusters, check the WSS score for a given range of k
wss = []
for i in range(1,11):
    KM = KMeans(n_clusters=i)
    KM.fit(scaled_df)
    wss.append(KM.inertia_)

wss
WSS scores (Image by author)

It is seen that there is a dip in the WSS score after k=2, hence let’s keep an eye on k=3. The same can be visualized using an elbow plot.

#Draw the elbow plot
plt.plot(range(1,11), wss, marker='*')
Elbow plot (Image by author)

Another helping hand in deciding the number of clusters could be the value of the Silhouette Score. As discussed before, the better the score, the better the clustering. Let’s check the score.

#Checking for n_clusters=3
k_means_three = KMeans(n_clusters=3)
k_means_three.fit(scaled_df)
print('WSS for K=3:', k_means_three.inertia_)
labels_three = k_means_three.labels_
print(labels_three)

#Calculate the silhouette_score for k=3
print(silhouette_score(scaled_df, labels_three))

The WSS for k=3 is 261.67 and the Silhouette score for those labels is 0.3054. Since the score is positive, it is a sign that good clustering has happened.

The final step is to do cluster profiling to understand how the cluster has happened and gain more insights.

#Attach the K-Means labels back to the dataset before profiling
df['KMCluster'] = labels_three

clust_profile = df.iloc[:,1:8].groupby('KMCluster').mean()
clust_profile['KMFrequency'] = df.KMCluster.value_counts().sort_index()
clust_profile
Cluster profiling for KMeans (Image by author)

Just like with Hierarchical clustering, these three clusters indicate three levels of foods with different calorie and macro nutrient ranges.

Translated from: https://towardsdatascience.com/clustering-techniques-hierarchical-and-non-hierarchical-b520b5d6a022
