K-means Clustering

SyGoing

Category: Machine Learning

Article link: https://blog.csdn.net/ouyangfushu/article/details/84987392

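K-means is a basic clustering algorithm and a commonly used unsupervised learning method in machine learning. Although the algorithm is simple, it holds an important place in the field and is a fundamental algorithm worth mastering.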

1. Algorithm flow

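1) Randomly choose k center points.

2) Iterate over all the data and assign each point to its nearest center. Euclidean distance is the usual measure of the distance from a point to a center, but other metrics are used in some settings; for example, anchor clustering in YOLOv2 and YOLOv3 uses an IOU-based distance.

3) Compute the mean of the coordinates of all points in each cluster and use it as the new center.

4) Repeat steps 2-3 until the k centers no longer change (the algorithm has converged) or enough iterations have been run.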
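Step 2 mentions an IOU-based distance for anchor clustering. A minimal sketch of what such a metric can look like, assuming each box is described only by (width, height) and that the two boxes share a common corner, so the distance is 1 - IOU. The names `iou_distance` and `nearest_center` are hypothetical helpers for illustration, not part of any library or of the code in this post:

```python
def iou_distance(box, center):
    """Return 1 - IOU for two (width, height) boxes aligned at a common corner."""
    w1, h1 = box
    w2, h2 = center
    inter = min(w1, w2) * min(h1, h2)          # overlap of the two aligned boxes
    union = w1 * h1 + w2 * h2 - inter          # total covered area
    return 1.0 - inter / union


def nearest_center(box, centers):
    """Index of the center with the smallest IOU distance to the box (step 2)."""
    dists = [iou_distance(box, c) for c in centers]
    return dists.index(min(dists))


d = iou_distance((1, 1), (2, 2))
idx = nearest_center((1.2, 1.0), [(1, 1), (4, 4)])
print(d)    # 0.75
print(idx)  # 0
```

With this metric, a small box and a large box of the same shape are far apart, whereas Euclidean distance on raw width and height would let large boxes dominate the clustering.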

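Time complexity: O(I*n*k*m)

Space complexity: O(n*m)

Here m is the feature dimension of each sample (that is, the coordinate dimension), n is the number of samples, k is the number of clusters, and I is the number of iterations. When I, k, and m are small relative to n they can be treated as constants, so the time and space complexity both simplify to O(n), i.e. linear in the amount of data.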
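To make the n*k*m factor concrete, here is a small sketch of a single assignment step written with NumPy broadcasting. The function name `assign_step` and the sizes n=1000, k=4, m=2 are illustrative assumptions. The intermediate array has one entry per (sample, center, coordinate) triple, which is the work done in each of the I iterations:

```python
import numpy as np


def assign_step(points, centers):
    """One assignment step; also returns how many coordinate differences were computed."""
    diffs = points[:, None, :] - centers[None, :, :]   # shape (n, k, m)
    dists = np.sqrt((diffs ** 2).sum(axis=2))          # shape (n, k)
    labels = dists.argmin(axis=1)                      # nearest center per sample
    return labels, diffs.size


rng = np.random.default_rng(0)
n, k, m = 1000, 4, 2
points = rng.random((n, m))
centers = points[:k]                                   # first k samples as centers
labels, work = assign_step(points, centers)
print(work)  # 8000, i.e. n * k * m
```

Note that this vectorized form holds an (n, k, m) temporary in memory, so it trades memory for speed; a loop that handles one sample at a time stays within the O(n*m) space bound quoted above.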

2. Python implementation

# -*- coding: utf-8 -*-

# Approach 1: call the KMeans class from scikit-learn
# from sklearn.cluster import KMeans
# import matplotlib.pyplot as plt
#
# x = [2.273, 27.89, 30.519, 62.049, 29.263, 62.657, 75.735, 24.344, 17.667, 68.816, 69.076, 85.691]
# y = [68.367, 83.127, 61.07, 69.343, 68.748, 90.094, 62.761, 43.816, 86.765, 76.874, 57.829, 88.114]
# plt.plot(x, y, 'b.')
# plt.show()
#
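# points = [[i,j] for i,j in zip(x,y)]      # list comprehension: pair up x and y into a list of points
# y_pred = KMeans(n_clusters=2).fit_predict(points)  # cluster the data into 2 groups
# print('Clustering result:', y_pred)               # print the cluster label of each point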
# plt.scatter(x, y, c=y_pred, marker='*')
# plt.show()

# Approach 2: implement it by hand
import numpy as np
from numpy import *
import random
import matplotlib.pyplot as plt

# Data loader taken from machine-learning reference code
def loadDataSet(fileName):                     # parse a file of tab-delimited floats
    dataMat = []                                # one list of floats per line
    with open(fileName) as fr:                  # the with-block closes the file for us
        for line in fr:
            if not line.strip():                # skip blank lines
                continue
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine)) # map() is lazy in Python 3, so materialize it
            dataMat.append(fltLine)
    return dataMat

# Euclidean (L2) distance between two points of any dimension
def distL2(vecA, vecB):
    nA = np.squeeze(np.asarray(vecA, dtype=float))
    nB = np.squeeze(np.asarray(vecB, dtype=float))
    if nA.shape != nB.shape:
        raise ValueError("distL2 expects two vectors of the same shape")
    return np.sqrt(np.sum(np.square(nA - nB)))  # equivalent to np.linalg.norm(nA - nB)



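def genCenters(dataset,k):
    # Pick k distinct row indices, then use those rows as the initial centers.
    # (The index list must be tracked separately from the chosen points,
    # otherwise the same row can be picked twice.)
    indices = np.random.choice(len(dataset), size=k, replace=False)
    return [list(dataset[i]) for i in indices]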

def clusterMean(dataset):
    # per-coordinate mean of the points in one cluster
    return np.mean(np.array(dataset), axis=0)

def kmeans(dataset,initCen,k,max_iter=100):
    centroids=initCen
    class_ids=np.zeros(len(dataset))
    for _ in range(max_iter):                   # cap the iterations in case convergence is slow
        all_kinds=[[] for _ in range(k)]        # points assigned to each cluster
        for index in range(len(dataset)):
            item=dataset[index]
            # distance from this point to every center
            temp=[distL2(item, cen) for cen in centroids]
            index_class=temp.index(min(temp))
            class_ids[index]=index_class
            all_kinds[index_class].append(item)

        # update each center to the mean of its assigned points
        center_=[]
        for j, members in enumerate(all_kinds):
            if members:
                center_.append(np.mean(np.array(members),0).tolist())
            else:
                center_.append(list(centroids[j]))  # empty cluster: keep the old center

        if center_ == centroids:                # centers unchanged, so the algorithm has converged
            break
        centroids=center_
    return centroids,class_ids

def show(dataSet, k, centroids, clusterAssment):
    numSamples, dim = dataSet.shape
    # marker styles for the points of each cluster
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    for i in range(numSamples):                 # range replaces Python 2's xrange
        markIndex = int(clusterAssment[i])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    # marker styles for the cluster centers
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize = 12)
    plt.show()

dataset=loadDataSet("./testSet_kmeans.txt")
initCens=genCenters(dataset,4)
centers,classindexs=kmeans(dataset,initCens,4)
print(centers)
print("finished")

# plain arrays are enough for 2-D indexing in show(); np.mat is not available in NumPy 2.x
np_dataSet=np.array(dataset)
np_centers=np.array(centers)
show(np_dataSet,4,np_centers,classindexs)
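The script reads ./testSet_kmeans.txt, which is not included with the post. Below is a minimal sketch for producing a stand-in file in the same tab-delimited format. It assumes four 2-D Gaussian blobs; the blob centers, spread, seed, and the helper name `make_test_file` are illustrative choices, not the original data. The demo writes to a temporary directory so it does not overwrite anything:

```python
import os
import tempfile

import numpy as np


def make_test_file(path, points_per_blob=20, seed=0):
    """Write tab-delimited 2-D points drawn around four assumed blob centers."""
    rng = np.random.default_rng(seed)
    blob_centers = np.array([[-3.0, -3.0], [-3.0, 3.0],
                             [3.0, -3.0], [3.0, 3.0]])   # assumed, not the original data
    points = np.vstack([
        c + rng.normal(scale=0.8, size=(points_per_blob, 2))
        for c in blob_centers
    ])
    np.savetxt(path, points, delimiter='\t', fmt='%.6f')  # one "x<TAB>y" pair per line
    return points


demo_path = os.path.join(tempfile.mkdtemp(), "testSet_kmeans.txt")
written = make_test_file(demo_path)
with open(demo_path) as f:
    first_line = f.readline()
print(written.shape)                        # (80, 2)
print(len(first_line.strip().split('\t')))  # 2
```

To feed the script, call make_test_file("./testSet_kmeans.txt") once before running it.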
