Unsupervised Clustering: KMeans


Kelly Fu


Category: Machine Learning

Article link: https://blog.csdn.net/kelly_fumiao/article/details/106415463


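This article introduces the KMeans algorithm for unsupervised clustering: how it works, its strengths and weaknesses, and how to choose a suitable K with the elbow method. It also covers Bisecting KMeans, which was designed to address KMeans getting stuck in a local optimum. Finally, it shows how to apply these methods to data in code.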

Summary generated automatically by CSDN.

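1. Understanding

Put simply, unsupervised clustering also groups data items into classes, but no label tells you which class an item should belong to; the grouping criterion is the similarity between the items themselves. The similarity measure is most commonly a distance (e.g. Euclidean distance or correlation-based distance).

KMeans is the most typical unsupervised clustering algorithm. K is the number of clusters, which the user has to specify. Equivalently, there are K cluster centers (centroids), and each center is the mean of the values in its cluster. KMeans can be used for customer segmentation: users with similar behavior or attributes end up in the same cluster. (Applications: k-means is very popular and is used in market segmentation, document clustering, image segmentation, image compression, etc.)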
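To make the two distance families concrete, here is a minimal sketch using NumPy. The helper names `euclidean_distance` and `correlation_distance` are illustrative choices, and "1 minus Pearson correlation" is one common way to define a correlation-based distance, assumed here for the example.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def correlation_distance(a, b):
    """1 - Pearson correlation: small when two vectors rise and fall together."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - np.corrcoef(a, b)[0, 1])

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]  # same shape as x, very different magnitude

d_euc = euclidean_distance(x, y)
d_cor = correlation_distance(x, y)
print(d_euc, d_cor)
```

The two vectors are far apart in Euclidean terms but almost identical under the correlation-based measure, so the choice of measure changes which items count as "similar".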

2. Methods

2.1 The k-means algorithm

K-means is an iterative algorithm.

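Choose k
Randomly initialize k distinct positions as the cluster centers
Assign each data point to its nearest center
Recompute the center of each cluster
Repeat the previous two steps until the centers no longer change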
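The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation the article refers to: the function name `kmeans`, the `max_iter` cap, the seed, and the synthetic two-blob data are all assumptions made for the example. Initial centers are drawn as k distinct data points.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialization: pick k distinct data points as the starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each center becomes the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

# Synthetic data (assumed for illustration): two well-separated blobs.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 0.5, size=(50, 2)),
               data_rng.normal(5.0, 0.5, size=(50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```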


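The result can be evaluated with SSE (sum of squared errors). Different numbers of centers K give different SSE values after convergence, which can guide the choice of K. For a given K, a smaller SSE means the data points lie closer to their centroids and the clusters are tighter. Because SSE generally keeps decreasing as K grows, compare across K by looking at where the decrease levels off, not by simply taking the smallest value.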
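As a small worked example of the quantity being described, the sketch below computes SSE by hand for four points under K = 1 and K = 2. The helper name `sse` and the toy points, centers and labels are assumptions chosen so the arithmetic is easy to check.

```python
import numpy as np

def sse(X, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = X - centers[np.asarray(labels)]
    return float((diffs ** 2).sum())

X = [[0, 0], [0, 2], [10, 0], [10, 2]]

# K = 1: one center at the overall mean; each point is sqrt(26) away.
sse_k1 = sse(X, centers=[[5, 1]], labels=[0, 0, 0, 0])
# K = 2: one center per pair of points; each point is 1 away.
sse_k2 = sse(X, centers=[[0, 1], [10, 1]], labels=[0, 0, 1, 1])
print(sse_k1, sse_k2)  # 104.0 4.0
```

In scikit-learn, a fitted `KMeans` object exposes the same quantity as its `inertia_` attribute.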

Reference implementation of the algorithm

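K-means summary:

The algorithm works on numeric values and is sensitive to the initial center positions.
Advantage: simple to implement.
Disadvantage: it can get stuck in a local minimum, as shown in the figure below.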

[Figure: k-means converging to a local minimum]
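The sensitivity to initialization can be reproduced with a tiny hand-built example. The sketch below runs the same iteration from two different sets of starting centers on four points at the corners of a 4 x 1 rectangle; the function name `lloyd` and the data are assumptions for illustration.

```python
import numpy as np

def lloyd(X, init_centers, max_iter=100):
    """Run k-means iterations from the given starting centers; return (centers, SSE)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float)
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, float(d2.min(axis=1).sum())

# Four corners of a wide rectangle: the natural split is left pair vs right pair.
X = [[0, 0], [0, 1], [4, 0], [4, 1]]

_, sse_good = lloyd(X, init_centers=[[0, 0], [4, 1]])  # ends as left vs right
_, sse_bad = lloyd(X, init_centers=[[2, 0], [2, 1]])   # stays as bottom vs top
print(sse_good, sse_bad)  # 1.0 16.0
```

The second start is already a fixed point of the iteration, so the algorithm stops there even though its SSE is much higher. A common mitigation is to run k-means several times from different starts and keep the run with the lowest SSE, which is what the `n_init` parameter of scikit-learn's `KMeans` does.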

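2.2 The bisecting k-means algorithm

This algorithm was introduced to mitigate the problem of k-means getting stuck in a local minimum. Initially, all the data is treated as a single cluster, and k-means (with k = 2) is applied to split it in two. Then one cluster is chosen and split in two again; the choice is the one that gives the lowest SSE. This process repeats until the user-specified number of clusters K is reached.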
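The procedure can be sketched as follows. This is one possible reading of the description, under stated assumptions: at each round every cluster is tentatively split with 2-means, and the split that leaves the lowest total SSE is kept. The function names (`sse`, `two_means`, `bisecting_kmeans`), the seed, and the synthetic three-blob data are all illustrative.

```python
import numpy as np

def sse(points):
    """Sum of squared distances from the points to their own mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def two_means(points, rng, max_iter=50):
    """Split one cluster in two with plain k-means (k=2); returns a boolean mask."""
    centers = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(max_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        mask = d2[:, 1] < d2[:, 0]
        if mask.all() or not mask.any():
            break  # degenerate split, e.g. duplicate points
        new_centers = np.array([points[~mask].mean(axis=0),
                                points[mask].mean(axis=0)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return mask

def bisecting_kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    clusters = [np.arange(len(X))]  # start with all points in one cluster
    while len(clusters) < k:
        best = None
        for i, idx in enumerate(clusters):
            if len(idx) < 2:
                continue
            mask = two_means(X[idx], rng)
            if mask.all() or not mask.any():
                continue
            left, right = idx[~mask], idx[mask]
            others = sum(sse(X[c]) for j, c in enumerate(clusters) if j != i)
            total = others + sse(X[left]) + sse(X[right])
            if best is None or total < best[0]:
                best = (total, i, left, right)
        if best is None:
            break  # no cluster can be split further
        _, i, left, right = best
        clusters[i:i + 1] = [left, right]
    labels = np.empty(len(X), dtype=int)
    for j, idx in enumerate(clusters):
        labels[idx] = j
    return labels

# Synthetic data (assumed for illustration): three blobs along a diagonal.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 5.0, 10.0)])
labels = bisecting_kmeans(X, k=3)
print(np.bincount(labels))
```

Recent scikit-learn versions also ship a `BisectingKMeans` estimator in `sklearn.cluster`, which may be preferable to a hand-written loop in practice.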

3. Code

3.1 Choosing a suitable K with an elbow plot

The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters’ centroids. We pick k at the point where the SSE starts to flatten out and forms an elbow. We’ll use the geyser dataset, evaluate the SSE for different values of k, and see where the curve forms an elbow and flattens out.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans

# Read the data. The CSV file has no header row, so the default call
# below is wrong: it would treat the first data row as column names.
# RogersGirolami_kMeans = pd.read_csv( "Data/RogersGirolami_kMeans.csv")
RogersGirolami_kMeans = pd.read_csv( "Data/RogersGirolami_kMeans.csv"