k-means

 

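k-means is an unsupervised, iterative clustering algorithm. To partition the data into k clusters, it first picks k points at random as the initial cluster centers, computes the distance from every data point to each center, and assigns each point to its nearest center. After each assignment the cluster centers are recomputed and the points are reassigned. This repeats until a stopping condition is met (for example, the cluster centers no longer change, or no point needs to be reassigned).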
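To make the two steps concrete, here is a minimal sketch of a single iteration on a toy data set. The six points and the two hand-picked starting centers are made up for illustration (the algorithm itself would pick the centers at random), and the names `labels` and `new_centers` are not part of the code in this post.

```python
import numpy as np

# Six hypothetical 2-D points that form two visible groups.
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
# Hand-picked starting centers (assumption: normally these are random).
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# Assignment step: each point gets the index of its nearest center.
labels = np.array([
    np.argmin([np.linalg.norm(p - c) for c in centers])
    for p in points
])

# Update step: each center moves to the mean of the points assigned to it.
new_centers = np.array([
    points[labels == j].mean(axis=0) for j in range(len(centers))
])

print(labels)
print(new_centers)
```

The first three points end up with center 0 and the last three with center 1, and each center moves to the mean of its group. Repeating these two steps until the centers stop moving is the whole algorithm.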

Create num data points with integer coordinates in the range [x, y)

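import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def create_data(num, x, y):
    # num random 2-D points; every coordinate is an integer in [x, y)
    a = np.random.randint(x, y, size=[num, 2])
    xs = a[:, 0]  # first coordinate of every point
    ys = a[:, 1]  # second coordinate of every point
    return xs, ys, a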

Generate the initial centroids

def initCent(points, k):
    numSamples, dim = points.shape
    centroids = np.zeros((k, dim))
    # pick k distinct row indices so the same data point is not chosen twice
    for i, index in enumerate(random.sample(range(numSamples), k)):
        centroids[i] = points[index]
    return centroids

Compute the distance from each data point to every centroid

def distance(points, centroids, k):
    clalist = []
    for data in points:
        # np.tile(data, (k, 1)) stacks k copies of the point as rows,
        # e.g. np.tile([0, 1, 2], (2, 1)) gives [[0, 1, 2], [0, 1, 2]]
        diff = np.tile(data, (k, 1)) - centroids  # difference to every centroid
        squaredDiff = diff ** 2  # square
        squaredDist = np.sum(squaredDiff, axis=1)  # sum along each row (axis=1)
        dist = squaredDist ** 0.5  # square root
        clalist.append(dist)
    clalist = np.array(clalist)  # array of shape (len(points), k): distance from each point to each centroid
    return clalist
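The same distance matrix can also be computed without the Python loop and without `np.tile`, by letting NumPy broadcasting form all point/centroid differences at once. The sketch below is an alternative, not the version used in the rest of this post, and the name `pairwise_distance` is made up for illustration.

```python
import numpy as np

def pairwise_distance(points, centroids):
    """Euclidean distance from every point to every centroid, shape (n, k)."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # points[:, None, :] has shape (n, 1, dim), centroids[None, :, :] has
    # shape (1, k, dim); subtracting broadcasts them to (n, k, dim).
    diff = points[:, None, :] - centroids[None, :, :]
    # Norm over the last axis collapses the coordinates, leaving (n, k).
    return np.linalg.norm(diff, axis=2)

demo = pairwise_distance([[0, 0], [3, 4]], [[0, 0], [3, 0]])
print(demo)
```

Row i of the result holds the distances from point i to each centroid, which is the same layout the loop version returns, so `np.argmin(..., axis=1)` works on it unchanged.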

Recompute the centroids

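def rezhi(points, cent, k):
    dis = distance(points, cent, k)
    minDistIndices = np.argmin(dis, axis=1)  # index of the nearest centroid for each point
    # mean of the points assigned to each centroid; reindex so an empty cluster still gets a row
    means = pd.DataFrame(points).groupby(minDistIndices).mean().reindex(range(k))
    # an empty cluster (NaN row) keeps its previous centroid
    newCentroids = np.where(means.isna().values, cent, means.values)
    change = newCentroids - cent
    return change, newCentroids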

Finally:

def k_mean(points, k, cent):
    chang, newcen = rezhi(points, cent, k)
    while np.any(chang != 0):  # iterate until no centroid moves
        chang, newcen = rezhi(points, newcen, k)
    centroids = newcen.tolist()  # left unsorted so that centroids[i] belongs to cluster[i]
    # final assignment of every point to its nearest centroid
    dis = distance(points, newcen, k)
    minDistIndices = np.argmin(dis, axis=1)
    cluster = [[] for _ in range(k)]
    for i, j in enumerate(minDistIndices):  # enumerate() yields the index and the element together
        cluster[j].append(points[i])
    return centroids, cluster, minDistIndices
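The loop above stops only when the centroids are exactly unchanged. A common variation is to stop once no centroid moves by more than a small tolerance, with a cap on the number of rounds as a safety net. The following is a sketch under those assumptions: the function name `kmeans_with_tolerance` and the parameters `tol` and `max_iter` are made up here and do not appear in the original code.

```python
import numpy as np

def kmeans_with_tolerance(points, init_centroids, tol=1e-6, max_iter=100):
    """Hypothetical variant: stop when no centroid moves more than tol."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()

    def nearest(cents):
        # Distance matrix of shape (n, k), then the nearest centroid per point.
        d = np.linalg.norm(points[:, None, :] - cents[None, :, :], axis=2)
        return d.argmin(axis=1)

    for _ in range(max_iter):
        labels = nearest(centroids)
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Largest distance any single centroid moved in this round.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift <= tol:
            break
    # Recompute labels so they match the returned centroids.
    return centroids, nearest(centroids)

pts = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
final_centroids, final_labels = kmeans_with_tolerance(pts, [[0, 0], [10, 10]])
print(final_centroids)
print(final_labels)
```

On this toy input the centroids settle after the second round, because the assignment no longer changes. The tolerance mainly matters on larger data sets where the centroids can keep shifting by tiny amounts.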

if __name__ == '__main__':
    x, y, points = create_data(100, 1, 100)
    k = 3
    print(points)
    p = initCent(points, k)
    d = distance(points, p, k)
    print(d)
    c, n = rezhi(points, p, k)  # a single manual update step (result not used below)
    centroids, cluster, t = k_mean(points, k, p)
    print('Centroids: %s' % centroids)
    print('Clusters: %s' % cluster)
    plt.scatter(x, y, label='First')
    # mark the final centroids with red crosses
    for j in range(len(centroids)):
        plt.scatter(centroids[j][0], centroids[j][1], marker='x', color='red', s=50, label='centroid')
    color = ['maroon', 'blue', 'black', 'green', 'yellow', 'gray']
    # draw every cluster in its own color
    for j in range(len(cluster)):
        for i in range(len(cluster[j])):
            plt.scatter(cluster[j][i][0], cluster[j][i][1], marker='o', color=color[j], s=50, label='cluster')
    plt.show()

 
