Bisecting k-means: a Python implementation

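The bisecting k-means algorithm:

Idea:

Start with all points in a single cluster and split that cluster in two. Then repeatedly pick the cluster whose split most reduces the clustering cost function (the sum of squared errors, SSE) and split it in two. Continue until the number of clusters reaches the user-specified k.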
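To make the cost function concrete, here is a minimal sketch of the SSE computation on a toy data set. The helper name `sse` and the four sample points are made up for illustration and are not part of the implementation in this post.

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Toy data (hypothetical): two tight pairs of points, far apart on the x axis.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# One cluster: every point is assigned to the global mean (5, 1).
one = sse(points, np.zeros(4, dtype=int), points.mean(axis=0, keepdims=True))

# Two clusters: the left pair and the right pair, each with its own mean.
two = sse(points, np.array([0, 0, 1, 1]), np.array([[0.0, 1.0], [10.0, 1.0]]))

print(one, two)  # 104.0 4.0
```

Splitting the single cluster into the two pairs drops the SSE from 104 to 4, which is the kind of reduction the algorithm looks for at every step.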

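Pseudocode:

*************************************************************
Treat all data points as one cluster
While the number of clusters is less than k
    For each cluster
        Run k-means clustering (k=2) on that cluster
        Compute the total error after the split
    Split the cluster whose split gives the lowest total error
*************************************************************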
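The pseudocode above can also be sketched compactly with plain NumPy arrays. This is only an illustration under a few assumptions: the names `two_means` and `bisecting_kmeans`, the `seed` and `max_iter` parameters, and the synthetic blobs are all invented here, and clusters with fewer than two points are simply skipped.

```python
import numpy as np

def two_means(points, rng, max_iter=100):
    """Lloyd iterations with k=2; returns (centroids, labels, squared errors)."""
    # Start from two distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=2, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Squared distance from every point to both centroids, shape (n, 2).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(2):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if one side is empty
                centroids[j] = members.mean(axis=0)
    sq_err = ((points - centroids[labels]) ** 2).sum(axis=1)
    return centroids, labels, sq_err

def bisecting_kmeans(points, k, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.zeros(len(points), dtype=int)
    centroids = [points.mean(axis=0)]
    sq_err = ((points - centroids[0]) ** 2).sum(axis=1)
    while len(centroids) < k:
        best = None
        for i in range(len(centroids)):
            idx = np.flatnonzero(labels == i)
            if len(idx) < 2:
                continue  # a single point cannot be split
            sub_c, sub_l, sub_e = two_means(points[idx], rng)
            # Total SSE if cluster i were replaced by its two halves.
            total = sub_e.sum() + sq_err[labels != i].sum()
            if best is None or total < best[0]:
                best = (total, i, idx, sub_c, sub_l, sub_e)
        if best is None:
            break  # nothing left to split
        _, i, idx, sub_c, sub_l, sub_e = best
        labels[idx[sub_l == 1]] = len(centroids)  # second half gets a new id
        centroids[i] = sub_c[0]                   # first half keeps the old id
        centroids.append(sub_c[1])
        sq_err[idx] = sub_e
    return np.vstack(centroids), labels, sq_err

# Synthetic data (hypothetical): three Gaussian blobs of 30 points each.
data_rng = np.random.default_rng(42)
blobs = np.vstack([data_rng.normal(loc=c, scale=0.3, size=(30, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])
cents, labels, sq_err = bisecting_kmeans(blobs, k=3)
print(cents.shape, np.bincount(labels), sq_err.sum())
```

Unlike the matrix-based code in this post, the sketch keeps labels and errors in separate 1-D arrays, which makes the relabeling after each split a one-line masked assignment.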

Python implementation:

from numpy import *
import pdb
import matplotlib.pyplot as plt

def createCenter(dataSet,k):
    n = shape(dataSet)[0]
    d = shape(dataSet)[1]
    centroids = zeros((k,d))
    for i in range(k):
        c = random.randint(0,n)  # random row index in [0, n); the upper bound is exclusive
        centroids[i,:] = dataSet[c,:]
    return centroids
    
def getDist(vec1,vec2):
    return sqrt(sum(power(vec1 - vec2,2)))
    
def kmeans(dataSet,k):
    n = shape(dataSet)[0]
    # column 0: index of the assigned cluster; column 1: squared distance to its centroid
    clusterAssment = asmatrix(zeros((n,2)))
    centroids = createCenter(dataSet,k)
    
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        
        for i in range(n):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = getDist(dataSet[i,:],centroids[j,:])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i,0] != minIndex:  #Convergence condition: assignments no longer change
                clusterChanged = True
            # always refresh the stored error: centroids move even when the assignment stays the same
            clusterAssment[i,:] = minIndex,minDist**2
        
        #update centroids
        for i in range(k):
            ptsdataSet = dataSet[nonzero(clusterAssment[:,0].A == i)[0]]
            if len(ptsdataSet) > 0:  # skip empty clusters to avoid a NaN centroid
                centroids[i,:] = mean(ptsdataSet,axis = 0)
    return centroids,clusterAssment
    
def print_result(dataSet,k,centroids,clusterAssment):
    n,d = dataSet.shape
    if d != 2:
        print("Cannot draw!")  # only 2-D data can be plotted
        return 1
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry your k is too large")
        return 1
        
    for i in range(n):
        markIndex = int(clusterAssment[i,0])
        plt.plot(dataSet[i, 0],dataSet[i, 1],mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']  
    # draw the centroids  
    for i in range(k):  
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize = 12)  
    plt.show()     

def biKmeans(dataSet, k):  
    numSamples = dataSet.shape[0]  
    # first column stores which cluster this sample belongs to,  
    # second column stores the error between this sample and its centroid  
    clusterAssment = asmatrix(zeros((numSamples, 2)))  
  
    # step 1: the init cluster is the whole data set  
    centroid = mean(dataSet, axis = 0).tolist()[0]  
    centList = [centroid]  
    for i in range(numSamples):  
        clusterAssment[i, 1] = getDist(asmatrix(centroid), dataSet[i, :])**2  
  
    while (len(centList) < k):  
        # min sum of square error  
        minSSE = inf  
        numCurrCluster = len(centList)  
        # for each cluster  
        for i in range(numCurrCluster):  
            # step 2: get samples in cluster i  
            pointsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]  
  
            # step 3: cluster it to 2 sub-clusters using k-means  
            centroids, splitClusterAssment = kmeans(pointsInCurrCluster, 2)  
  
            # step 4: calculate the sum of square error after split this cluster  
            splitSSE = sum(splitClusterAssment[:, 1])  
            notSplitSSE = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])  
            currSplitSSE = splitSSE + notSplitSSE  
  
            # step 5: find the best split cluster which has the min sum of square error  
            if currSplitSSE < minSSE:  
                minSSE = currSplitSSE  
                bestCentroidToSplit = i  
                bestNewCentroids = centroids.copy()  
                bestClusterAssment = splitClusterAssment.copy()  
  
        # step 6: modify the cluster index for adding new cluster  
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 1)[0], 0] = numCurrCluster  
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit  
  
        # step 7: update and append the centroids of the new 2 sub-cluster  
        centList[bestCentroidToSplit] = bestNewCentroids[0, :]  
        centList.append(bestNewCentroids[1, :])  
  
        # step 8: update the index and error of the samples whose cluster have been changed  
        # nonzero returns a (rows, cols) tuple; only the row indices are needed here
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentroidToSplit)[0], :] = bestClusterAssment
        plt.figure()
        print_result(dataSet,len(centList),asmatrix(centList),clusterAssment)        
  
    print('Congratulations, cluster using bi-kmeans complete!')  
    return asmatrix(centList), clusterAssment  

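Here biKmeans(dataSet,k) is the main body of the bisecting algorithm. It works roughly as follows:

1. Initialize the centroid and set up the required data structures.

2. Try bisecting every cluster and keep the best split.

3. Update the centroid list, and the cluster index and error of the samples in the cluster that was split.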
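The update in the last step is mostly index bookkeeping: the 2-means run labels the members of the chosen cluster 0 or 1, and those local labels have to be mapped back to global cluster ids. A small sketch with made-up values (the arrays `assign` and `local` are hypothetical and stand in for the first column of clusterAssment and bestClusterAssment):

```python
import numpy as np

# Global assignment before the split: 6 samples in clusters 0 and 1.
assign = np.array([0, 1, 1, 0, 1, 1])
best_to_split = 1   # cluster chosen for splitting
num_clusters = 2    # number of clusters that exist before the split

# Hypothetical local 2-means labels for the four members of cluster 1, in row order.
local = np.array([0, 1, 1, 0])

# Rows of the samples that belong to the cluster being split: 1, 2, 4, 5.
members = np.flatnonzero(assign == best_to_split)

# Local label 1 becomes a brand-new cluster id; local label 0 keeps the old id.
assign[members] = np.where(local == 1, num_clusters, best_to_split)

print(assign)  # [0 1 2 0 2 1]
```

Reusing the old id for one half and appending a new id for the other keeps the cluster ids contiguous, which is why the centroid list can be updated with one overwrite and one append.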

Clustering result:


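Advantages of bisecting:

  • Bisecting k-means can run faster than standard k-means, because each 2-means run only compares the points of one cluster against two centroids, so fewer similarity computations are needed.
  • It is less sensitive to initialization, because only two random centroids are drawn per split and each step keeps the split with the lowest total SSE. This is a greedy choice, so the final result is still not guaranteed to be the global optimum.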

Result of plain k-means:


Bisecting k-means is an improvement on the k-means algorithm and can give better results when used for image segmentation. The steps are:

1. Flatten the image into an array of pixels.
2. Choose K initial cluster centers (at random or by some other method), compute the distance from every pixel to each center, and assign the pixel to the nearest one.
3. Use the mean of each cluster as its new center.
4. Repeat steps 2 and 3 until the centers stop changing or the iteration limit is reached.
5. For bisecting k-means, put all samples into one initial cluster and split it into two sub-clusters. Repeat steps 1-4 on the sub-clusters until the maximum number of clusters or some other stopping condition is reached.
6. Finally, assign every pixel to the cluster of its nearest center to obtain the segmentation.

Python implementation:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def binary_kmeans(image_path, k_max=10, max_iter=20):
    # Load the image as an (n_pixels, 3) float array
    image = Image.open(image_path).convert("RGB")
    pixels = np.array(image, dtype=float).reshape(-1, 3)
    # Initialize with one cluster holding every pixel
    clusters = [pixels]
    # Bisecting k-means loop
    while len(clusters) < k_max:
        # Choose the cluster with maximum SSE
        sse_list = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        max_sse_idx = int(np.argmax(sse_list))
        target = clusters[max_sse_idx]
        # Split the cluster into two sub-clusters
        kmeans = KMeans(n_clusters=2, max_iter=max_iter).fit(target)
        sub_clusters = [target[kmeans.labels_ == i] for i in range(2)]
        # Replace the original cluster with the sub-clusters
        clusters = clusters[:max_sse_idx] + sub_clusters + clusters[max_sse_idx + 1:]
    # Assign each pixel to the closest cluster center
    centers = np.array([c.mean(axis=0) for c in clusters])
    dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Reshape the labels back to (height, width)
    return labels.reshape(image.size[1], image.size[0])
```

Here `image_path` is the path of the image to segment, `k_max` is the maximum number of clusters, and `max_iter` is the iteration limit for each 2-means split. The function returns the cluster label of every pixel, which can be mapped to different colors to visualize the segmentation.