K-Means Clustering

Latest recommended article published on 2023-11-30 10:46:53

Better-1

Category: Machine Learning

Article link: https://blog.csdn.net/caihuanqia/article/details/112641587

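The basic idea of K-means clustering is to iteratively search for a partition of the data into K clusters that minimizes a cost function. Specifically, the cost function is the sum of squared errors (SSE) between each sample and the center of the cluster it belongs to.

In other words, the cost adds up, over all points, the squared distance from each point to the center of its assigned cluster.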
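As a small illustration of that cost function, the sketch below computes the SSE for a given assignment. The function name `sum_of_squared_errors` and the sample data are made up for this example and are not part of the original code.

```python
def sum_of_squared_errors(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Hypothetical toy data: two clusters of two points each
points = [[1.0, 1.0], [1.0, 3.0], [8.0, 8.0], [10.0, 8.0]]
centers = [[1.0, 2.0], [9.0, 8.0]]
labels = [0, 0, 1, 1]  # index of the center each point belongs to
print(sum_of_squared_errors(points, centers, labels))  # 4.0
```

Every point here lies exactly one unit from its center, so each contributes 1 to the total.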

Drawback: the result is sensitive to the initial centers and to outliers, so the data needs preprocessing.

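Tuning K-means:
1. Normalize the data and handle outliers, which is why preprocessing is needed.
Why normalize: without normalization, dimensions with a large mean and variance dominate the clustering result. Data whose features have different units and have not been normalized cannot be clustered meaningfully.
2. Choose K sensibly: try several values of K, plot the sum of squared errors against K, and pick the elbow. In general the SSE keeps shrinking as K grows; it drops sharply at first and then levels off, and the elbow is where that change happens.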

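3. Use a kernel function
Measuring the loss with Euclidean distance rests on the assumption that the clusters share the same prior distribution and are roughly spherical. For non-convex data distributions, introduce a kernel function: **a nonlinear mapping sends the points of the input space into a high-dimensional feature space, and clustering is done in that new feature space.** Where standard K-means fails, a kernel function can be effective.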
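To make the normalization point in item 1 concrete, here is a minimal sketch of z-score standardization (one common choice; min-max scaling would be another). The helper `standardize` and the height/income data are hypothetical and only serve the illustration.

```python
import math


def standardize(items):
    """Rescale every column to zero mean and unit variance (z-score)."""
    n = len(items)
    dims = len(items[0])
    means = [sum(row[d] for row in items) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((row[d] - means[d]) ** 2 for row in items) / n
        stds.append(math.sqrt(var) or 1.0)  # guard against constant columns
    return [[(row[d] - means[d]) / stds[d] for d in range(dims)] for row in items]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical data: [height in metres, income in currency units]
items = [[1.60, 30000.0], [1.90, 30500.0], [1.61, 60000.0]]
scaled = standardize(items)

# Before scaling, income alone decides which points look close
print(euclidean(items[0], items[1]), euclidean(items[0], items[2]))
# After scaling, the height difference carries comparable weight
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

On the raw data the first two rows look like near neighbours only because the income column is thousands of times larger than the height column; after standardization both features contribute on the same scale.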

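K-means can be explained with the EM algorithm, which solves parameter estimation in probabilistic models that contain hidden variables that cannot be observed.
EM first fixes one set of variables so that the objective becomes a convex optimization problem, takes derivatives to find the optimum, then uses the optimal parameters to update the variables that were fixed, and enters the next round.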

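K-Means is the most intuitive way to see the idea behind EM. In K-Means the hidden data is which cluster each sample belongs to, and the K centroids are the parameters being estimated. We start by assuming K initial centroids. Finding the nearest centroid for each sample and assigning the sample to it corresponds to the E step of EM; recomputing each centroid as the mean of the samples assigned to it corresponds to the M step. Repeating the E step and the M step until the centroids no longer change completes the K-Means clustering.

Put differently: initialize the cluster centers, then assign every point to the nearest center, which minimizes the distance loss for the current centers. Then update the centers, reassign the points, and iterate until no point changes cluster.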
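The alternation described above can be sketched as a batch K-means loop in which every point is assigned first and the centers are recomputed afterwards. This is a minimal illustration, not the code the article quotes: the names `kmeans` and `squared_distance`, the `seed` parameter, and seeding from randomly chosen data points are all assumptions made for this example.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(items, k, max_iterations=100, seed=0):
    """Batch K-means: assign all points (E step), then recompute centers (M step)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct data points picked at random
    centers = [list(p) for p in rng.sample(items, k)]
    labels = [-1] * len(items)

    for _ in range(max_iterations):
        # E step: assign every point to its nearest center
        new_labels = [
            min(range(k), key=lambda j: squared_distance(item, centers[j]))
            for item in items
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers would not move either
        labels = new_labels

        # M step: move each center to the mean of the points assigned to it
        for j in range(k):
            members = [item for item, lab in zip(items, labels) if lab == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]

    sse = sum(squared_distance(item, centers[lab]) for item, lab in zip(items, labels))
    return centers, labels, sse


# Hypothetical toy data: two well-separated groups
items = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
centers, labels, sse = kmeans(items, k=2)
print(labels, round(sse, 3))
```

Because the function also returns the SSE, calling it for several values of K and plotting the results gives the elbow curve used to choose K.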

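An improved version is K-Means++:
Plain K-Means picks its initial points at random, which affects the result. The improvement is in the seeding: once n initial centers have been chosen, the (n+1)-th should preferably be far from those n. K-Means++ does this by sampling the next center with probability proportional to its squared distance from the nearest center already chosen.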
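A minimal sketch of this seeding rule follows. The function name `kmeans_plus_plus_init` and the `seed` parameter are assumptions for the example, and it presumes the data contains at least k distinct points.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_plus_plus_init(items, k, seed=0):
    """Pick k initial centers, favouring points far from those already chosen."""
    rng = random.Random(seed)
    centers = [list(rng.choice(items))]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center
        weights = [min(squared_distance(item, c) for c in centers) for item in items]
        # Sample the next center with probability proportional to that distance;
        # already chosen points have weight 0 and cannot be drawn again
        next_center = rng.choices(items, weights=weights, k=1)[0]
        centers.append(list(next_center))
    return centers


# Hypothetical toy data
items = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
print(kmeans_plus_plus_init(items, k=3))
```

The returned centers can replace the purely random initial means before the assignment and update loop starts.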

https://www.geeksforgeeks.org/k-means-clustering-introduction/

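import math
from random import uniform


def UpdateMean(n, mean, item):
    # Running-mean update: fold `item` in as the n-th member of the cluster
    for i in range(len(mean)):
        m = mean[i]
        m = (m * (n - 1) + item[i]) / float(n)
        mean[i] = round(m, 3)
    return mean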

def FindColMinMax(items):
    # Per-column minimum and maximum over all items
    n = len(items[0])
    minima = [float("inf") for i in range(n)]   # start above any value
    maxima = [float("-inf") for i in range(n)]  # start below any value

    for item in items:
        for f in range(len(item)):
            if item[f] < minima[f]:
                minima[f] = item[f]

            if item[f] > maxima[f]:
                maxima[f] = item[f]

    return minima, maxima

def InitializeMeans(items, k, cMin, cMax):

    # Initialize means to random numbers between
    # the min and max of each column/feature
    f = len(items[0])  # number of features
    means = [[0 for i in range(f)] for j in range(k)]

    for mean in means:
        for i in range(len(mean)):

            # Set value to a random float
            # (adding +-1 to avoid a wide placement of a mean;
            # this assumes each feature spans more than 2 units)
            mean[i] = uniform(cMin[i] + 1, cMax[i] - 1)

    return means

def EuclideanDistance(x, y):
    S = 0  # The sum of the squared differences of the elements
    for i in range(len(x)):
        S += math.pow(x[i] - y[i], 2)

    # The square root of the sum
    return math.sqrt(S)

def Classify(means, item):  # assign the point to the nearest cluster center
    # Classify item to the mean with minimum distance
    minimum = float("inf")
    index = -1

    for i in range(len(means)):

        # Find distance from item to mean
        dis = EuclideanDistance(item, means[i])

        if dis < minimum:
            minimum = dis
            index = i

    return index
     
     
def CalculateMeans(k, items, maxIterations=100000):

    # Find the minima and maxima for columns
    cMin, cMax = FindColMinMax(items)

    # Initialize means at random points
    means = InitializeMeans(items, k, cMin, cMax)

    # Initialize clusters, the array to hold
    # the number of items in a class
    clusterSizes = [0 for i in range(len(means))]

    # An array to hold the cluster an item is in
    belongsTo = [0 for i in range(len(items))]

    # Calculate means
    for e in range(maxIterations):
        # If no change of cluster occurs, halt
        noChange = True
        for i in range(len(items)):

            item = items[i]

            # Classify item into a cluster and update the corresponding means.
            index = Classify(means, item)

            clusterSizes[index] += 1
            cSize = clusterSizes[index]
            # Note: this moves the mean as soon as one item is classified.
            # Standard K-means updates the means only after every point has
            # been assigned to its new cluster. clusterSizes is also never
            # reset, so items seen in later passes shift the mean less.
            means[index] = UpdateMean(cSize, means[index], item)

            # Item changed cluster
            if index != belongsTo[i]:
                noChange = False

            belongsTo[i] = index

        # Nothing changed, return
        if noChange:
            break

    return means

Finally, assign each point to its cluster.

def FindClusters(means, items):
    clusters = [[] for i in range(len(means))]  # Init clusters

    for item in items:

        # Classify item into a cluster
        index = Classify(means, item)

        # Add item to cluster
        clusters[index].append(item)

    return clusters
