The Intuition Behind Clustering in Unsupervised Machine Learning

Unsupervised Learning in Machine Learning

When it comes to analyzing and making sense of data from the past, and understanding the future world based on that data, we rely on machine learning methodologies. As I have discussed in my past articles on machine learning fundamentals, this field is broadly categorized into:

  • Supervised Machine Learning
  • Unsupervised Machine Learning

To understand supervised ML, please visit:

Clustering: The World of Unsupervised Machine Learning

Today, we will dig deeper into the world of unsupervised learning. To help you grasp the concept, let me take the example of e-commerce portals like Flipkart, Amazon, etc.

Have you ever wondered how these e-commerce giants that you use every day manage to segment a huge list of products into various categories, with an intelligence that customizes your browsing experience based on how you navigate their portals?

This tailor-made intelligence for categorizing products is made possible by one of the most popular unsupervised learning techniques, called clustering: a set of customers is grouped based on their behavior, and the data points generated by those user segments are analyzed to offer tailor-made services.

So, some popular examples are:

  • Market segmentation
  • Product segmentation
  • User segmentation
  • Organizing system files into groups of folders
  • Organizing emails into different folder categories, etc.

Why Is It Called Unsupervised?

Because in this branch of machine learning, the dataset provided to train the model does not come with any pre-defined set of labels or outcomes defined within the data, so the model itself has to perform the prediction or segmentation, grouping the set of people, products, or data points into clusters.

For example:

Consider a problem where you are given a set of past data from a bank, containing a list of user attributes along with one target column that labels each user as:

  • Defaulter
  • Non-Defaulter

A model trained on this data with a known target to achieve, namely predicting whether any user entering the loan disbursal system will default or not, is a supervised machine learning model.

But what if your data has no such target column available, and your model has to group the customers into sets of defaulters and non-defaulters on its own? A model trained to perform this kind of segmentation is known as an unsupervised learning model.
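To make the contrast concrete, here is a minimal sketch of the two setups in scikit-learn, assuming a hypothetical DataFrame `bank` with made-up numeric attributes and, in the supervised case, a hypothetical `defaulter` label column:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Hypothetical past data from the bank: numeric user attributes only.
bank = pd.DataFrame({
    "income":      [35_000, 82_000, 21_000, 64_000, 18_000, 95_000],
    "loan_amount": [20_000, 15_000, 25_000, 10_000, 30_000, 12_000],
})

# Supervised: a known target column labels each user (1 = defaulter).
labels = pd.Series([1, 0, 1, 0, 1, 0], name="defaulter")
clf = LogisticRegression().fit(bank, labels)   # learns from the labels
print(clf.predict(bank.iloc[:2]))              # predicts the learned classes

# Unsupervised: no target column; the model groups users on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(bank)
print(km.labels_)  # cluster assignments discovered from the data alone
```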

So, with this basic understanding of unsupervised learning, it's time to get into the fundamentals of clustering, which is a kind of unsupervised learning. Here we will cover:

  • What Is Clustering In Unsupervised ML?
  • What Are The Types Of Clustering?
  • What Is K-Means Clustering?

What Is Clustering?

It is a mechanism for grouping a given set of data into segments based on the concept of similarity among the data points. The intuition behind the concept of similarity comes from the notion of distance.

What Is a Cluster?

It is a collection of data objects which are similar.

So, it is important here to understand the two highlighted words in the definition above:

  • Similarity
  • Distance

The Concept of Similarity in Clustering

In cluster analysis, we stress the concept of data-point similarity, where similarity is a measure of the distance between the given data points.

These distances, measuring how close the given data points are, are used to infer how similar those data points are. Some popular distance measures are listed below (a short sketch after the list shows how to compute each):

  • Manhattan Distance
  • Euclidean Distance
  • Chebyshev Distance
  • Minkowski Distance
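All four metrics are available in `scipy.spatial.distance`; a quick sketch, using a made-up pair of points:

```python
from scipy.spatial import distance

x = (1.0, 2.0, 3.0)
y = (4.0, 0.0, 7.0)

print(distance.cityblock(x, y))       # Manhattan: |1-4| + |2-0| + |3-7| = 9
print(distance.euclidean(x, y))       # sqrt(9 + 4 + 16) ≈ 5.385
print(distance.chebyshev(x, y))       # max(3, 2, 4) = 4
print(distance.minkowski(x, y, p=3))  # p=1 gives Manhattan, p=2 Euclidean
```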

Euclidean Distance

This is probably the most common measure of distance, one we are all very familiar with from the data science and mathematical worlds.

As per Wikipedia,

In the field of mathematics, the Euclidean distance or Euclidean metric is the “ordinary” straight-line distance between two points in Euclidean space.

The Euclidean distance between points X and Y is the length of the line segment connecting them. In Cartesian coordinates, the Euclidean distance d, from X to Y or from Y to X, is given by the Pythagorean formula:

$$d(X, Y) = d(Y, X) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}$$

Euclidean Distance: 2-Dimensional, 3-Dimensional & N-Dimensional

As discussed, Euclidean distance uses the popular Pythagorean theorem to calculate the distance between a given pair of vectors/points in n-dimensional space.

Below are the formulas in 2-, 3-, and n-dimensional space:

In 2 dimensions:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$

In 3 dimensions:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$$

In n dimensions:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
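As a sketch, the n-dimensional formula translates directly into NumPy:

```python
import numpy as np

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    """Square root of the sum of squared coordinate differences."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 7.0])
print(euclidean(x, y))  # sqrt(29) ≈ 5.385
```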

Manhattan Distance

Unlike Euclidean distance, which is computed from the sum of the squares of the coordinate differences, here the distance between two points is the sum of the absolute differences of their Cartesian coordinates.

This distance metric is also known as snake distance, city block distance, or Manhattan length. These names take inspiration from the grid layout of most streets on the island of Manhattan, which causes the shortest path a car can take between two intersections in the borough to have a length equal to the intersections' distance in taxicab geometry.

Manhattan distance, which is also called taxicab distance, can be defined by the formula below:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
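The same formula in NumPy, as a minimal sketch:

```python
import numpy as np

def manhattan(x: np.ndarray, y: np.ndarray) -> float:
    """Sum of the absolute coordinate differences."""
    return float(np.sum(np.abs(x - y)))

print(manhattan(np.array([1.0, 2.0]), np.array([4.0, 6.0])))  # 3 + 4 = 7
```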

Chebyshev Distance

Also popularly called the chessboard distance. Where Manhattan distance takes the sum of the absolute coordinate differences, Chebyshev distance takes their maximum.

As per Wikipedia,

In mathematics, Chebyshev distance (or Tchebychev distance), maximum metric, is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. It is named after Pafnuty Chebyshev.

It is also known as chessboard distance, since in the game of chess the minimum number of moves needed by a king to go from one square on a chessboard to another equals the Chebyshev distance between the centers of the squares, if the squares have side length one, as represented in 2-D spatial coordinates with axes aligned to the edges of the board.

So, for two vectors or points x and y, with standard coordinates xi and yi respectively, the formula is given below, along with its form in the 2-dimensional plane.

$$d(x, y) = \max_{i} |x_i - y_i|$$

In the 2-dimensional plane:

$$d(x, y) = \max(|x_1 - y_1|, |x_2 - y_2|)$$
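The chessboard interpretation is easy to check in code: the minimum number of king moves between two squares equals the Chebyshev distance between their (file, rank) coordinates. A small sketch:

```python
import numpy as np

def chebyshev(x: np.ndarray, y: np.ndarray) -> float:
    """Maximum of the absolute coordinate differences."""
    return float(np.max(np.abs(x - y)))

# A king going from a1 = (1, 1) to h8 = (8, 8) needs
# max(|8-1|, |8-1|) = 7 moves, travelling diagonally.
a1, h8 = np.array([1, 1]), np.array([8, 8])
print(chebyshev(a1, h8))  # 7.0
```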

So, now that we have understood the fundamentals of similarity based on distance measures, it's time to learn what the types of clustering are and how they make use of the distance metrics discussed above to cluster the given data vectors or objects.

Types of Clustering in Unsupervised Learning

There are basically two major categories of clustering in the field of unsupervised learning:

  • Connectivity-based clustering: also known as hierarchical clustering
  • Centroid-based clustering: K-Means being the most popular kind

Connectivity-Based Clustering

For a tabular dataframe with N rows, if we calculate the distance between every pair of row objects to find which of them are closely related or similar enough to be clustered together, we call this expensive mechanism of clustering connectivity-based clustering. The intuition behind this exhaustive approach is:

that objects are more related to nearby objects than to objects which are farther away.

When the size of the dataset is not very large, this kind of clustering is very effective, but if the dataset is too big, it can be really resource-intensive. For example, a dataset with 1,000 rows leads to about half a million pairs (1000 × 999 / 2 = 499,500) to be analysed for similarity, which can be extremely costly to process. Imagine if the number of rows becomes 10,000: that is nearly 50 million pairs.

So, to sum up:

These connectivity-based algorithms connect “objects” to form “clusters” based on their distance. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, and this can be represented using a dendrogram, which explains where the common name “hierarchical clustering” comes from. These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances.
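As an illustrative sketch of this idea, SciPy's hierarchical clustering tools build exactly such a hierarchy: `linkage` computes the merge tree from pairwise distances, and cutting the dendrogram at a chosen distance yields flat clusters (the points here are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two loose groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# Build the merge hierarchy from Euclidean pairwise distances.
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram at distance 2.0 to obtain flat clusters.
print(fcluster(Z, t=2.0, criterion="distance"))  # e.g. [1 1 1 2 2 2]
```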

I have covered hierarchy-based connectivity clustering in detail in one of my articles, linked below; do take some time to understand it in more depth.

Centroid-Based Clustering

Unlike hierarchical/connectivity-based clustering, centroid-based clustering organizes the data into non-hierarchical clusters.

Intuition Behind Centroid-Based Clustering

Here, we fix a pre-defined number of clusters at the outset. So, instead of visiting each and every pair of objects among the n rows to calculate distances, this algorithm requires you to define how many clusters you want to obtain; based on that, the centroids of those clusters are identified, and the distance of each data point is calculated with respect to those identified centroids.

This algorithm is very cheap compared to hierarchical clustering, which can be understood with an example: if you have 1,000 rows and 5 clusters defined at the outset, the algorithm has to process only 5 × 1000 = 5,000 distance computations per pass, versus roughly half a million pairs in the case of a connectivity-based clustering algorithm.
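A minimal scikit-learn sketch of centroid-based clustering on made-up data; `n_clusters` is the number of clusters fixed at the outset:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# Fix the number of clusters up front; each iteration computes only
# k * n point-to-centroid distances rather than all pairwise distances.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the identified centroids
print(km.labels_)           # each point's nearest-centroid assignment
```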

How Do We Come Up With the Number of Clusters?

We will answer this question when we uncover K-Means clustering, but as something to ponder: it is related to a popular method known as the Elbow Method.
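As a preview, one common form of the Elbow Method plots the within-cluster sum of squares (exposed as `inertia_` in scikit-learn) against k and looks for the point where the improvement flattens out; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

# Within-cluster sum of squares for each candidate k; the "elbow"
# (around k = 3 for this data) suggests a reasonable cluster count.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```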

K-Means Clustering

k-means is the most widely used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. We will get into the details of K-Means clustering in the next part of this series on unsupervised learning, where we will cover:

  • What is K-Means clustering?
  • How does it work?
  • Implementing the k-means clustering algorithm in a hands-on Python lab

Translated from: https://medium.com/predict/intuition-behind-clustering-in-unsupervised-machinelearning-ff8567fb7841
