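The assignment provides a 2-D dataset, Aggregation.txt. Plotting it shows that it is one of the figures in the paper. This is a first attempt; as a follow-up, an automated script could tune the hyperparameter until the data splits into the 7 clusters shown in the paper.
Step 1: import the necessary packages. The first two handle the distance arithmetic that clustering cannot avoid and ship with Python; the last is a third-party library used to visualize the decision graph and the clustering result.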
from decimal import Decimal
from math import pow
import matplotlib.pyplot as plt
Step 2: load the data and count the number of points to cluster.
# import dataset (raw string, so the backslashes in the Windows path are kept as-is)
file_path = r"E:\a_school\books\挖掘\homework\Aggregation.txt"
# transform into calculable numbers
x1 = []
x2 = []
obj_num = 0
with open(file_path, mode='r') as file:
    for line in file:
        v = line.split(',')
        x1.append(Decimal(v[0]))
        x2.append(Decimal(v[1]))
        obj_num += 1
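The loop above assumes comma-separated fields. If the file turns out to use tabs or spaces instead, a slightly more tolerant parser may help. The sketch below is only an illustration: `parse_points` is a hypothetical helper that is not part of the original notebook, and the sample text is made-up toy input.

```python
import re
from decimal import Decimal

def parse_points(lines):
    """Parse rows whose first two fields are x and y, separated by commas or whitespace."""
    xs, ys = [], []
    for line in lines:
        line = line.strip()
        if not line:                      # skip blank lines
            continue
        fields = re.split(r"[,\s]+", line)
        xs.append(Decimal(fields[0]))
        ys.append(Decimal(fields[1]))
    return xs, ys

# toy input mixing both delimiters (not taken from Aggregation.txt)
sample = "1.5,2.5\n3.0\t4.0\n\n"
xs, ys = parse_points(sample.splitlines())
```

The number of points is then simply `len(xs)`, so a separate counter is not needed.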
Step 3: compute the pairwise distances first, building a distance lookup table.
# calculate the pairwise distances d_ij, rounded to two decimal places
d = []
for i in range(obj_num):
    di = []
    for j in range(obj_num):
        dij = Decimal(pow((x1[i]-x1[j])**2 + (x2[i]-x2[j])**2, 0.5)).quantize(Decimal('.01'))
        di.append(dij)
    d.append(di)
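Because the distance is symmetric, each pair only needs to be computed once. The following is a minimal sketch of that idea, assuming plain floats are acceptable; unlike the loop above, it does not round to two decimal places. `distance_table` is a hypothetical name.

```python
from math import hypot

def distance_table(xs, ys):
    """Return a symmetric n x n table of Euclidean distances (as floats)."""
    n = len(xs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):         # each unordered pair is computed once
            dij = hypot(float(xs[i] - xs[j]), float(ys[i] - ys[j]))
            d[i][j] = d[j][i] = dij
    return d

# toy points: (0, 0), (3, 4), (0, 1)
toy_d = distance_table([0, 3, 0], [0, 4, 1])
```

This roughly halves the number of square roots, which matters because the table grows quadratically with the number of points.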
Step 4: set the hyperparameter.
# set hyperparameter d_c (the cutoff distance)
d_c=Decimal(0.170)
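The value 0.170 above was picked by hand. The density-peaks paper gives a rule of thumb: choose d_c so that the average number of neighbors is around 1 to 2% of the total number of points. One way to approximate this, which is an assumption here and not what this notebook does, is to take a low percentile of the sorted pairwise distances. `pick_dc` and its `percent` parameter are hypothetical names.

```python
def pick_dc(d, percent=2.0):
    """Return the pairwise distance at the given percentile (needs at least two points)."""
    n = len(d)
    # collect each unordered pair once
    pairs = sorted(d[i][j] for i in range(n) for j in range(i + 1, n))
    # position of the requested percentile, clamped to a valid index
    k = min(len(pairs) - 1, max(0, round(len(pairs) * percent / 100) - 1))
    return pairs[k]

# toy distance table for four points on a line at 0, 1, 2, 3
toy = [[0, 1, 2, 3],
       [1, 0, 1, 2],
       [2, 1, 0, 1],
       [3, 2, 1, 0]]
dc = pick_dc(toy, percent=50)
```

A helper like this could also serve as the starting point for the automated hyperparameter search mentioned at the top.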
Step 5: compute the local density.
# calculate p_i: the number of points closer than d_c (the point itself included)
p = []
for i in range(obj_num):
    pi = 0
    for j in range(obj_num):
        if d[i][j] - d_c < 0:
            pi += 1
    p.append(pi)
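The loop above is the cutoff-kernel density: p_i counts the points j with d_ij < d_c. Since d[i][i] is 0, every point also counts itself. A compact sketch under the same convention follows; `local_density` is a hypothetical helper and the table holds toy values.

```python
def local_density(d, d_c):
    """Count, for each point, how many points (itself included) lie closer than d_c."""
    return [sum(1 for dij in row if dij < d_c) for row in d]

# toy distance table for three points
toy = [[0.0, 0.1, 0.5],
       [0.1, 0.0, 0.45],
       [0.5, 0.45, 0.0]]
rho = local_density(toy, d_c=0.17)
```

Here the first two points are within the cutoff of each other, while the third is isolated and only counts itself.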
Step 6: for each point, identify the nearest neighbor with higher local density and compute the distance to it.
# calculate delta_i on each point
delta = []
nearest_neighbor = []
for i in range(obj_num):
    di = []
    ni = []
    for j in range(obj_num):
        if p[i] < p[j]:
            di.append(d[i][j])
            ni.append(j)
    if len(di) == 0:
        # no point is denser: use the largest distance and mark the neighbor as -1
        delta.append(max(d[i]))
        nearest_neighbor.append(-1)
    else:
        delta.append(min(di))
        nearest_neighbor.append(ni[di.index(min(di))])
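The same step can be written without the two temporary lists. This is a sketch that follows the conventions of the loop above (strictly higher density, -1 when no denser point exists); `delta_and_neighbor` is a hypothetical name and the inputs are toy values.

```python
def delta_and_neighbor(d, rho):
    """For each point, find the closest point of strictly higher density."""
    n = len(d)
    delta, neighbor = [], []
    for i in range(n):
        higher = [j for j in range(n) if rho[j] > rho[i]]
        if higher:
            j_best = min(higher, key=lambda j: d[i][j])
            delta.append(d[i][j_best])
            neighbor.append(j_best)
        else:
            # densest point(s): fall back to the largest distance in the row
            delta.append(max(d[i]))
            neighbor.append(-1)
    return delta, neighbor

toy_d = [[0, 1, 4],
         [1, 0, 3],
         [4, 3, 0]]
toy_rho = [3, 2, 1]
delta, neighbor = delta_and_neighbor(toy_d, toy_rho)
```

Note that every point tied for the highest density ends up with neighbor -1, which matters later when labels are propagated along these links.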
Step 7: since the dataset turned out to be one of the figures in the paper, I plotted the decision graph directly to look for outliers and cluster centers.
plt.scatter(delta, p, s=0.1)  # x axis: delta, y axis: local density p
Step 8: with the hyperparameter set to 0.17, pick as many cluster centers as the decision graph suggests.
# identify cluster center points: the center_num points with the largest delta
center_num = 4
# sort indices rather than values, so tied delta values still yield distinct points
cluster_center = sorted(range(obj_num), key=lambda i: delta[i], reverse=True)[:center_num]
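Ranking by delta alone ignores density. The density-peaks paper also suggests looking at gamma_i = rho_i * delta_i, sorted in decreasing order, as a hint for how many centers to pick. The sketch below ranks by gamma; it is an alternative to the selection above, not what this notebook does, and `top_centers` is a hypothetical name.

```python
def top_centers(rho, delta, k):
    """Return the indices of the k points with the largest rho * delta."""
    gamma = [r * dl for r, dl in zip(rho, delta)]
    order = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
    return order[:k]

# toy values: point 1 has a large delta but low density, so it ranks below point 2
toy_rho = [5, 1, 4, 2]
toy_delta = [3.0, 2.5, 2.0, 0.1]
centers = top_centers(toy_rho, toy_delta, k=2)
```

With this ranking, an isolated low-density point (a likely outlier) is less prone to being chosen as a center.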
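Step 9: assign clusters directly, since there are no outliers. I ran into a problem here: result.keys() should contain 1, but in the end none of the points in the same cluster as 1 could be assigned, so I use break to leave the loop and assign them manually.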
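# initial labels, keyed by point index (presumably the four centers found above)
result = {672: 1,
          671: 2,
          80: 3,
          79: 4}
rest_obj = set(range(obj_num))
for i in range(center_num):
    rest_obj.remove(cluster_center[i])
# assign the rest to each cluster
while len(result) < obj_num:
    remove = []
    for i in rest_obj:
        if nearest_neighbor[i] in result:
            result[i] = result[nearest_neighbor[i]]
            remove.append(i)
    for i in remove:
        rest_obj.remove(i)
    if len(remove) == 0:
        # no progress in this pass: stop and handle the leftovers manually
        break
# manual fallback: leftover points take the label stored for point 1
for i in rest_obj:
    result[i] = result[1]
# sort the dict by key (sorted(result) alone would return only the list of keys)
result = dict(sorted(result.items()))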
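One possible reason the loop above stalls, which is a guess and has not been verified on this dataset, is that several points tie for the highest density. Each of them gets nearest_neighbor -1, so any that is not a chosen center can never be labeled, and neither can the points that chain to it. The sketch below labels points in a single pass in order of decreasing density; sending such stranded points to the nearest center is an assumption of this sketch, and `assign_clusters` is a hypothetical name.

```python
def assign_clusters(d, rho, neighbor, centers):
    """Label every point in one pass over points sorted by decreasing density."""
    labels = {c: k + 1 for k, c in enumerate(centers)}
    order = sorted(range(len(rho)), key=lambda i: rho[i], reverse=True)
    for i in order:
        if i in labels:                   # cluster centers are labeled already
            continue
        if neighbor[i] in labels:
            # the denser neighbor was visited earlier, so it carries a label
            labels[i] = labels[neighbor[i]]
        else:
            # assumption: a point without a usable neighbor joins the nearest center
            nearest = min(centers, key=lambda c: d[i][c])
            labels[i] = labels[nearest]
    return [labels[i] for i in range(len(rho))]

toy_d = [[0, 1, 5, 6],
         [1, 0, 4, 5],
         [5, 4, 0, 1],
         [6, 5, 1, 0]]
toy_rho = [3, 2, 3, 1]                    # points 0 and 2 tie for the highest density
toy_neighbor = [-1, 0, -1, 2]
two_clusters = assign_clusters(toy_d, toy_rho, toy_neighbor, centers=[0, 2])
one_cluster = assign_clusters(toy_d, toy_rho, toy_neighbor, centers=[0])
```

Because each point's neighbor has strictly higher density, it always appears earlier in the sorted order, so no repeated passes or manual clean-up are needed.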
Step 10: visualization. The figure below shows the result with 4 clusters.
# visualize the clustering result
cluster_result = []
for i in range(obj_num):
    cluster_result.append(result[i])
plt.scatter(x1, x2, c=cluster_result, s=0.1)
PS: the IDE is JupyterLab, one of the applications bundled with Anaconda.