Clustering Algorithms: K-means in Practice

weixin_30920597

Original link: http://www.cnblogs.com/tosouth/p/4746219.html

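Summary of Clustering Algorithms

1. Hierarchical Methods

Hierarchical methods create a hierarchical decomposition of a given set of data objects. Depending on how the decomposition is formed, they are divided into agglomerative and divisive methods.

Agglomerative: bottom-up. Each object starts as its own group; similar groups are then merged level by level until all groups have merged into one or a termination condition is met.

Divisive: top-down. All objects start in a single cluster; in each iteration a cluster is split into smaller clusters, until every object is in its own cluster or a termination condition is met.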
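
To make the bottom-up procedure concrete, here is a minimal sketch of agglomerative clustering. It assumes single-linkage (closest pair) Euclidean distance and uses a target number of clusters as the termination condition. The function name `agglomerative`, the demo points, and those choices are illustrative assumptions, not taken from the original post.

```python
import numpy as np

def agglomerative(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain.

    Assumptions for this sketch: single-linkage distance, Euclidean metric.
    """
    points = np.asarray(points, dtype=float)
    # Start with every object in its own cluster (lists of row indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                diffs = points[clusters[a]][:, None, :] - points[clusters[b]][None, :, :]
                d = np.sqrt((diffs ** 2).sum(axis=2)).min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters

# Made-up demo data: two tight pairs that are far apart from each other.
demo_points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
demo_clusters = agglomerative(demo_points, 2)
print(demo_clusters)
```

A divisive method would run the other way, starting from one cluster holding all indices and splitting it repeatedly.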

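2. Partitioning Methods

Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition is a cluster and k ≤ n. Given the number of partitions k to construct, the method first creates an initial partition. It then applies an iterative relocation technique, trying to improve the partition by moving objects between groups.

Typical partitioning methods are K-means and K-medoids.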

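2.1 The K-means idea

Input:

K, the number of clusters
D, a data set containing n objects

Output:

A set of K clusters

Process:

Arbitrarily choose K objects from D as the initial cluster centers
repeat:

      (re)assign each object to the most similar cluster, based on the current cluster means;

      update the cluster means

   until the clusters no longer change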
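
The loop above can be read as reducing the within-cluster sum of squared distances. The notation below is introduced here and does not appear in the original post: $C_j$ is the j-th cluster, $\mu_j$ is its mean, and $x$ is an object in D. It assumes squared Euclidean distance as the similarity measure.

```latex
J = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

Reassigning each object to its nearest mean cannot increase J, and neither can replacing each mean by the average of its cluster, which is why the loop settles once the clusters stop changing.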

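2.2 K-means in practice

This post works with a file named blogdata.txt. Its first row lists common words that appear in blogs, its first column lists the blog names, and each remaining row gives how often those words occur in that blog. The goal is to cluster the blogs.

Download link for blogdata.txt: http://pan.baidu.com/s/1nt7r7a5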

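2.2.1 Read the data and return a data set of n objects

import random
import numpy as np

def readfile(filename):
    # filename is the path of a tab-separated text file holding the data set.
    # open() replaces the Python 2-only file() builtin; the with block closes the file.
    with open(filename) as f:
        lines = f.readlines()

    # First row: column headers (the words); its first cell is a label and is skipped.
    colnames = lines[0].strip().split('\t')[1:]
    rownames = []
    data = []
    for line in lines[1:]:
        p = line.strip().split('\t')
        rownames.append(p[0])                    # first column: blog name
        data.append([float(x) for x in p[1:]])   # remaining columns: word frequencies

    return rownames, colnames, data  # data is the part used for clustering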
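
To show the input layout that this parsing expects, the following sketch writes a tiny tab-separated file and reads it back. The sample blog names, words, and counts are made up for illustration, and `parse_table` is a hypothetical stand-in that mirrors `readfile` so that the block runs on its own.

```python
import os
import tempfile

def parse_table(filename):
    """Stand-in mirroring readfile: header row of words, first column of names."""
    with open(filename) as f:
        lines = f.readlines()
    colnames = lines[0].strip().split('\t')[1:]
    rownames, data = [], []
    for line in lines[1:]:
        p = line.strip().split('\t')
        rownames.append(p[0])
        data.append([float(x) for x in p[1:]])
    return rownames, colnames, data

# Made-up sample in the same layout as blogdata.txt.
sample = "Blog\tpython\tdata\nBlog A\t3\t0\nBlog B\t1\t7\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

rownames, colnames, data = parse_table(path)
os.remove(path)  # clean up the temporary file
print(rownames)  # ['Blog A', 'Blog B']
print(colnames)  # ['python', 'data']
print(data)      # [[3.0, 0.0], [1.0, 7.0]]
```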

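2.2.2 Initialize the k cluster means

import random

# rows is the data list returned by readfile; k is the number of clusters.
# For each column, find the (min, max) range of values across all rows.
ranges = [(min(row[i] for row in rows), max(row[i] for row in rows)) for i in range(len(rows[0]))]
# Draw each coordinate of each initial mean uniformly from that column's range.
cluster = [[random.random()*(ranges[j][1]-ranges[j][0])+ranges[j][0] for j in range(len(rows[0]))] for i in range(k)]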
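
The pseudocode in 2.1 picks K objects from D as the initial centers, whereas the snippet above draws random points inside the bounding box of the data. Below is a minimal sketch of the first variant, assuming the data is a list of equal-length rows. The helper name `init_from_data`, its `seed` parameter, and the demo rows are assumptions made for illustration.

```python
import random
import numpy as np

def init_from_data(rows, k, seed=None):
    """Pick k distinct rows of the data set as the initial cluster means."""
    rng = random.Random(seed)                 # local generator, optionally seeded
    chosen = rng.sample(range(len(rows)), k)  # k distinct row indices
    return np.array([rows[i] for i in chosen], dtype=float)

# Made-up demo data.
demo_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
demo_centers = init_from_data(demo_rows, 2, seed=0)
print(demo_centers)
```

Starting from actual data points means each initial center coincides with at least one object, so, barring duplicate rows, no cluster is empty after the first assignment. Box-uniform initialization gives no such guarantee.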

2.2.3 Find the k clusters from the k cluster means

    import numpy as np

    rows = np.array(rows)
    cluster = np.array(cluster)
    m, n = np.shape(rows)

    lastmatches = None

    for t in range(500):

        # bestmatches holds k lists; the i-th list stores the indices of all points assigned to the i-th cluster mean
        bestmatches = [[] for i in range(k)]
        for i in range(m):
            # squared Euclidean distance from point i to each of the k cluster means
            distance = np.sum((np.tile(rows[i], (k, 1)) - cluster)**2, 1)
            distancesort = np.argsort(distance)
            bestmatches[distancesort[0]].append(i)

        # two stopping conditions: 500 iterations have run, or the assignments no longer change
        if bestmatches == lastmatches: break
        lastmatches = bestmatches
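
The per-point loop with `np.tile` can also be written with NumPy broadcasting. The following is a sketch of the same assignment step under that approach. `assign_points` is a hypothetical helper name, the demo values are made up, and the function returns one cluster index per row instead of the list of lists used above.

```python
import numpy as np

def assign_points(rows, cluster):
    """Return, for each row, the index of the nearest cluster mean."""
    rows = np.asarray(rows, dtype=float)        # shape (m, n)
    cluster = np.asarray(cluster, dtype=float)  # shape (k, n)
    # Broadcasting yields an (m, k, n) array of coordinate differences.
    diffs = rows[:, None, :] - cluster[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)          # squared Euclidean distances, shape (m, k)
    return sq_dist.argmin(axis=1)

demo_labels = assign_points([[0, 0], [1, 0], [9, 9]], [[0, 0], [10, 10]])
print(demo_labels.tolist())  # [0, 0, 1]
```

The intermediate array has m * k * n elements, so for very large data sets the explicit loop may use less memory.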

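2.2.4 Update the cluster means once the k clusters are found

for i in range(k):
    if len(bestmatches[i]) > 0:  # skip empty clusters and keep their previous mean
        temp = []
        for j in range(len(bestmatches[i])):
            temp.append(rows[bestmatches[i][j]])
        # stack the member rows once, after the loop, and average them column-wise
        com = np.vstack(tuple(temp))
        cluster[i] = np.mean(np.array(com), 0)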
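
Because each entry of `bestmatches` is a list of row indices, NumPy fancy indexing can replace the inner loop. Here is a sketch, assuming `rows` and `cluster` can be converted to NumPy arrays as in 2.2.3. The name `update_means` and the demo values are assumptions made for illustration.

```python
import numpy as np

def update_means(rows, cluster, bestmatches):
    """Recompute each cluster mean from its member rows; empty clusters keep their old mean."""
    rows = np.asarray(rows, dtype=float)
    new_cluster = np.array(cluster, dtype=float)  # copy, so the input is not modified
    for i, members in enumerate(bestmatches):
        if len(members) > 0:
            # rows[members] selects all member rows at once; average them column-wise.
            new_cluster[i] = rows[members].mean(axis=0)
    return new_cluster

demo_means = update_means(
    rows=[[0, 0], [2, 0], [9, 9]],
    cluster=[[1, 1], [8, 8], [5, 5]],
    bestmatches=[[0, 1], [2], []],
)
print(demo_means.tolist())  # [[1.0, 0.0], [9.0, 9.0], [5.0, 5.0]]
```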

Steps 2.2.3 and 2.2.4 together form the repeat loop, which runs until the assignments are stable.

2.2.5 Complete code

import random
import numpy as np

def readfile(filename):
    # Read a tab-separated file: header row of words, first column of blog names.
    with open(filename) as f:
        lines = f.readlines()

    colnames = lines[0].strip().split('\t')[1:]
    rownames = []
    data = []
    for line in lines[1:]:
        p = line.strip().split('\t')
        rownames.append(p[0])
        data.append([float(x) for x in p[1:]])

    return rownames, colnames, data

def Kmeans(rows, k):

    # Initialize the k cluster means at random within each column's (min, max) range
    ranges = [(min(row[i] for row in rows), max(row[i] for row in rows)) for i in range(len(rows[0]))]
    cluster = [[random.random()*(ranges[j][1]-ranges[j][0])+ranges[j][0] for j in range(len(rows[0]))] for i in range(k)]

    rows = np.array(rows)
    cluster = np.array(cluster)
    m, n = np.shape(rows)

    lastmatches = None
    for t in range(500):  # at most 500 iterations; t is kept distinct from the inner loop index i

        # Assignment step: put each point in the cluster with the nearest mean
        bestmatches = [[] for i in range(k)]
        for i in range(m):
            distance = np.sum((np.tile(rows[i], (k, 1)) - cluster)**2, 1)
            distancesort = np.argsort(distance)
            bestmatches[distancesort[0]].append(i)

        # Stop once the assignments no longer change
        if bestmatches == lastmatches: break
        lastmatches = bestmatches

        # Update step: recompute the mean of each non-empty cluster
        for i in range(k):

            if len(bestmatches[i]) > 0:
                temp = []
                for j in range(len(bestmatches[i])):
                    temp.append(rows[bestmatches[i][j]])
                com = np.vstack(tuple(temp))
                cluster[i] = np.mean(np.array(com), 0)

    return bestmatches


if __name__ == '__main__':
    k = 10
    filename = 'blogdata.txt'
    rownames, colnames, data = readfile(filename)

    print(Kmeans(data, k))
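
`Kmeans` returns lists of row indices, which are hard to read on their own. Below is a small sketch for turning them into blog names using the `rownames` returned by `readfile`. The helper name `names_by_cluster` and the sample values are illustrative assumptions, not output from the real data set.

```python
def names_by_cluster(bestmatches, rownames):
    """Replace row indices with the corresponding row names, cluster by cluster."""
    return [[rownames[i] for i in members] for members in bestmatches]

# Made-up values standing in for real Kmeans output and blog names.
demo_matches = [[0, 2], [1], []]
demo_names = ['Blog A', 'Blog B', 'Blog C']
for idx, names in enumerate(names_by_cluster(demo_matches, demo_names)):
    print(idx, names)
```

An empty list in the result marks a cluster that received no blogs, which can happen with the random initialization used in `Kmeans`.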