ML: K-Means Clustering

K-means is a partitional clustering method: it partitions n data points into k partitions. The term may sound odd, since any clustering partitions the data. The distinction is that partitional clustering works through the whole dataset from the beginning to find the k partitions, whereas hierarchical clustering starts from single points. Now, let's look at K-means.

K-means

K-means just finds the k centroids of the clusters. A centroid is the average of each coordinate of the data points in the cluster. The initialization of the centroids is really important; I will explain it later, and for now we will just start with random initialization. We pick the initial centroids randomly and assign each data point to the centroid closest to it, giving us k groups. We then calculate the new centroids by averaging coordinates, reassign the data points, and repeat the process until the centroids do not change.

  1. Initialization: randomly pick k data points as the initial centroids.
  2. Assign each data point to the centroid closest to it.
  3. Calculate the new centroids by averaging coordinates (statistics other than the mean can also be used, as in K-medoids).
  4. Repeat steps 2 and 3 until convergence, that is, until the centroids no longer change (these steps are sketched in code below).
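
A minimal NumPy sketch of these four steps (the function name kmeans, the iteration cap, and the seed are illustrative, and it assumes no cluster goes empty during an update):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage: two obvious groups in 2-D.
centroids, labels = kmeans(np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]]), k=2)
```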

Note: Using K-means with categorical values is not recommended, because the distance and the centroid are not easy to define for categories. K-medoids/PAM can be used instead, which makes finding a representative center easy.

Tip: I recommend scaling the inputs before running K-means, because if the feature scales differ, one feature can dominate the distances and the result can be bad.
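
For example, with scikit-learn you can put a scaler in front of the clusterer (a sketch with made-up two-feature data; the feature meanings in the comments are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: income (dollars) and age (years).
X = np.array([[48_000, 23], [52_000, 25], [51_000, 61], [49_000, 59]], dtype=float)

# StandardScaler gives every feature zero mean and unit variance, so the
# income axis no longer dominates every distance computation.
model = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(X)
```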

Types

There are two types of assignment in K-means clustering: hard clustering and soft clustering. Hard clustering assigns each data point to its nearest centroid. Soft clustering gives each point a score with respect to all centroids; the score can be anything, such as a similarity score, a distance, or an affinity.

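In scikit-learn terms, predict gives the hard assignment, while transform gives a soft view as the distance to every centroid (a small sketch with toy one-dimensional data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [0.2], [5.0], [5.3]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.predict([[0.1]]))    # hard: index of the single nearest centroid
print(km.transform([[0.1]]))  # soft: distance to each of the k centroids
```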

Initialization

K-means clustering can produce significantly different clusters depending on how you initialize the centroids, because it can converge to a local minimum rather than the global minimum. There is no perfect way to avoid this, but I want to suggest a few ways.

  • If you already know proper centroids (for example, from a previous clustering), you can choose those predefined centroids.
  • You can run the algorithm multiple times and compare the models by inertia, which is the sum of the squared distances between each data point and its nearest centroid. The smaller the inertia, the better the model.
  • K-means++ selects initial centroids that are distant from one another: it builds a probability distribution over distances so that points far from the already-chosen centroids are more likely to become the next centroid. This is the default in scikit-learn. (Both of the last two ideas are sketched in code below.)
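
A minimal scikit-learn sketch of the last two bullets, using a synthetic dataset from make_blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# init="k-means++" spreads the initial centroids apart (scikit-learn's
# default), and n_init=10 repeats the whole run ten times from different
# starting centroids, keeping the model with the lowest inertia.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)  # sum of squared distances to the nearest centroid
```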

Finding K, the optimal number of clusters

If you think we can simply choose the k with the lowest inertia, it is not that easy, because a bigger k always gives a smaller inertia. Inertia is calculated from the distances between data points and their centroids. We can still use inertia in a different manner: we call it the elbow method.

[Figure: elbow method]

You can check where the inertia stops dropping rapidly and becomes steady for the next values of k; that bend in the curve looks like an elbow.

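A sketch of how such an elbow curve could be produced (make_blobs and the range of candidate k are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit one model per candidate k and record its inertia.
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()  # the bend (the "elbow") in the curve suggests a good k
```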

A silhouette score is another way to find a proper k. It is calculated as (b - a) / max(a, b), where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances of the next closest cluster. Its range is [-1, 1]: 1 means the instance is well inside the correct cluster, 0 means it lies on a boundary, and -1 means it may have been assigned to the wrong cluster.

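A sketch of computing the mean silhouette score for several candidate k with scikit-learn (make_blobs is a stand-in dataset; note the score requires k >= 2):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Mean silhouette coefficient over all instances for each candidate k.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # closer to 1 is better
```

Beyond this single averaged score, we normally visualize the per-instance coefficients in a silhouette diagram: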

[Figure: silhouette diagram]

The red dotted line is the mean silhouette coefficient. Each horizontal line represents the silhouette coefficient of one instance.

The limitations of K-means

K-means clustering is actually a specific type of GMM. A GMM can have full covariances, but K-means in effect allows only a single variance for the cluster itself. Therefore, K-means can only fit spherical clusters.

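A sketch of that comparison from the GMM side, fitting full covariances with scikit-learn on deliberately stretched blobs (the linear transform exists only to make the clusters elliptical):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Stretch spherical blobs into ellipses, a shape K-means handles poorly.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# covariance_type="full" lets each component learn its own ellipse;
# K-means has no such freedom, so it can only carve out spherical clusters.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)
```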

This post was published on 9/23/2020.

Translated from: https://medium.com/swlh/ml-k-means-clustering-8b7e3d420b89
