After reading an article on cnblogs (博客园),
I was inspired to write a Python implementation of k-means. The code is as follows:
import random
from math import sqrt
sample=[
[1,1,0.5],
[0.3,0,0.19],
[0,0.15,0.13],
[0.24,0.76,0.25],
[0.3,0.76,0.06],
[1,1,0],
[1,0.76,0.5],
[1,0.76,0.5],
[0.7,0.76,0.25],
[1,1,0.5],
[1,1,0.25],
[1,1,0.5],
[0.7,0.76,0.5],
[0.7,0.68,0.5],
[1,1,0.5]
]
samplename=['China','Japan','South Korea','Iran','Saudi Arabia','Iraq','Qatar','UAE','Uzbekistan','Thailand','Vietnam','Oman','Bahrain','North Korea','Indonesia']
# Euclidean distance between two vectors of equal length
def EDistance(v1,v2):
    tmp=sum([pow(v1[i]-v2[i],2) for i in range(len(v1))])
    return sqrt(tmp)
class kcluster:
    k=3
    # staticmethod keeps Python from passing self as the first argument
    distance=staticmethod(EDistance)
    rows=sample
    # Return the per-dimension mean of the given rows (the seed itself if the cluster is empty)
    def getavg(self,rows,seed):
        n=len(rows)
        if n==0:
            return seed
        rs=[]
        for i in range(len(rows[0])):
            rs.append(sum([row[i] for row in rows])/n)
        return rs
    # Group the rows by the seed each one is closest to
    def getbestmatch(self,rows,seeds):
        bestmatch={}
        for i in range(self.k):
            bestmatch.setdefault(i,[])
        # Find the best-matching seed for each row
        for row in rows:
            d=float('inf')
            whichseed=0
            i=0
            for seed in seeds:
                tmp=self.distance(row,seed)
                if tmp<d:
                    d=tmp
                    whichseed=i
                i+=1
            bestmatch[whichseed].append(row)
        return bestmatch
    # Generate random seeds
    def getseeds(self):
        # (min, max) tuple for each dimension
        minandmax=[]
        for i in range(len(self.rows[0])):
            minandmax.append((min([row[i] for row in self.rows]),max([row[i] for row in self.rows])))
        seeds=[]
        for i in range(self.k):
            # One random point inside the bounding box of the data
            seeds.append([random.random()*(row[1]-row[0])+row[0] for row in minandmax])
        return seeds
    # Main k-means routine
    def kcluster(self):
        # Generate the seeds
        seeds=self.getseeds()
        lastseeds=seeds[:]
        while True:
            # Build the best clustering for the current seeds
            bestmatch=self.getbestmatch(self.rows,seeds)
            #print(seeds)
            #print(bestmatch)
            # Move each seed to the mean of the rows matched to it
            for i in range(self.k):
                seeds[i]=self.getavg(bestmatch[i],seeds[i])
            #print(seeds)
            #print(lastseeds)
            if lastseeds==seeds:
                # Converged: check whether any cluster ended up empty
                NullType = False
                for j in range(self.k):
                    if len(bestmatch[j]) == 0:
                        NullType = True
                if not NullType:
                    break
                else:
                    # An empty cluster means the seeds were poor, so start over with new ones
                    return self.kcluster()
            else: lastseeds=seeds[:]
        return bestmatch
obj=kcluster()
rs=obj.kcluster()
print(rs)
# Print the names in each cluster, one cluster per line
for j in range(obj.k):
    for i in range(len(sample)):
        if sample[i] in rs[j]:
            print(samplename[i],end=' ')
    print('')
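Run it and you will notice two things:
1. The choice of seeds has a large effect on the clustering result.
2. But no matter how the seeds are chosen, Chinese football is always third-rate.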
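A common way to soften the first point is to run the algorithm several times from different seeds and keep the run with the smallest total within-cluster sum of squares. The following is only a minimal sketch of that idea, not the class above: `kmeans_once` and `kmeans_restarts` are hypothetical names, the seeds are drawn from the sample rows instead of from the bounding box, and the toy rows are made up for the demonstration. It needs Python 3.8 or later for `math.dist`.

```python
import random
from math import dist

def kmeans_once(rows, k, rng):
    """One run of the assign/update loop, seeded with k rows picked at random."""
    centers = [list(r) for r in rng.sample(rows, k)]
    while True:
        # Assignment step: attach each row to its nearest center.
        clusters = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda c: dist(row, centers[c]))
            clusters[nearest].append(row)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else center
            for members, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Total within-cluster sum of squared distances (lower means tighter clusters).
    wss = sum(dist(row, c) ** 2
              for c, members in zip(centers, clusters) for row in members)
    return wss, centers, clusters

def kmeans_restarts(rows, k, restarts=20, seed=0):
    """Repeat kmeans_once and keep the run with the smallest wss."""
    rng = random.Random(seed)
    runs = (kmeans_once(rows, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[0])

# Made-up toy data with two obvious groups.
rows = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
best_wss, best_centers, best_clusters = kmeans_restarts(rows, 2)
print(best_wss, best_centers)
```

Passing `sample` and `k=3` instead of the toy rows would apply the same restart strategy to the football data from this post.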
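R ships with a built-in kmeans function. It always uses Euclidean distance; its algorithm parameter selects the algorithm variant (Hartigan-Wong by default), not the distance function.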
To evaluate how many clusters work best:
library(ggplot2)  # provides qplot
numofc <- c()
bssp <- c()
# train is the data frame being clustered; its type column is dropped first
for (i in 2:15) {
  fit <- kmeans(na.omit(subset(train, select = -type)), i)
  numofc[i-1] <- i
  bssp[i-1] <- fit$betweenss / fit$totss
}
result <- data.frame(numofc, bssp)
qplot(numofc, bssp, data = result, geom = c("point", "smooth"))
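betweenss/totss means the same thing as R-squared when evaluating a linear model: it is the proportion of variance explained once every point in a cluster is represented by its cluster center (in clustering, variance becomes the sum of squared distances from each point to its center). The higher this value, the better. Since it tends to keep rising as the number of clusters grows, look for the point where the curve flattens out rather than for the maximum.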
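To make the ratio concrete, here is a small Python sketch that computes it by hand, using the relation betweenss = totss - tot.withinss from R's kmeans output. The helper names `sum_sq`, `mean` and `explained_ratio` are hypothetical, and the four points are made up for illustration.

```python
def mean(rows):
    """Per-dimension mean of a non-empty list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sum_sq(rows, center):
    """Sum of squared Euclidean distances from each row to center."""
    return sum(sum((x - c) ** 2 for x, c in zip(row, center)) for row in rows)

def explained_ratio(clusters):
    """betweenss / totss for a list of clusters (each a list of rows)."""
    all_rows = [row for members in clusters for row in members]
    # totss: squared distances to the overall mean.
    totss = sum_sq(all_rows, mean(all_rows))
    # tot.withinss: squared distances to each cluster's own mean.
    withinss = sum(sum_sq(members, mean(members)) for members in clusters if members)
    return (totss - withinss) / totss

# Two tight clusters that are far apart (made-up data).
clusters = [[[0, 0], [0, 2]], [[10, 0], [10, 2]]]
ratio = explained_ratio(clusters)
print(ratio)  # 100/104, roughly 0.96
```

Here totss is 104 and the within-cluster part is only 4, so almost all of the variance is explained by the two centers.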
Another article covers data structures for making the k-means algorithm more efficient, the kd tree and the ball tree:
http://blog.csdn.net/skyline0623/article/details/8154911
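For a rough feel of what a kd tree buys, here is a minimal nearest-neighbor sketch. It is not taken from the linked article: `build_kdtree` and `nearest` are hypothetical names, the tree is stored as plain nested tuples, and the centers are made up. In a k-means setting one could, for example, build the tree over the centers and look up the nearest center for each row without scanning all of them, which only starts to pay off when k is large.

```python
from math import dist

def build_kdtree(points, depth=0):
    """Split the points on one axis per level; returns nested tuples or None."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    # Node layout: (point, split axis, left subtree, right subtree).
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return the stored point closest to target, skipping subtrees that cannot win."""
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or dist(target, point) < dist(target, best):
        best = point
    # Search the side of the splitting plane that contains the target first.
    near, far = (left, right) if target[axis] < point[axis] else (right, left)
    best = nearest(near, target, best)
    # Cross the plane only if it is closer than the best match found so far.
    if abs(target[axis] - point[axis]) < dist(target, best):
        best = nearest(far, target, best)
    return best

# Made-up centers and a query row.
centers = [[0.1, 0.1, 0.1], [0.5, 0.7, 0.3], [1.0, 1.0, 0.5], [0.9, 0.2, 0.4]]
tree = build_kdtree(centers)
print(nearest(tree, [1, 0.76, 0.5]))
```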