from math import sqrt
from random import shuffle


def k_means(data_pts, k=None):
    """ Return k (x,y) pairs where:
            k = number of clusters
        and each
            (x,y) pair = centroid of cluster

        data_pts should be a list of (x,y) tuples, e.g.,
            data_pts=[ (0,0), (0,5), (1,3) ]
    """

    # Helper functions

    def lists_are_same(la, lb):
        # see if two collections have the same elements, ignoring order
        return set(la) == set(lb)

    def distance(a, b):
        # Euclidean distance between (x,y) points a and b
        return sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

    def average(a):
        # return the average of a one-dimensional list (e.g., [1, 2, 3])
        return sum(a) / float(len(a))

    # Set up some initial values

    if k is None:
        # the user didn't supply a number of means to look for, so estimate it
        n = len(data_pts)  # number of points in the dataset
        # number of clusters - see
        # http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#Rule_of_thumb
        k = int(sqrt(n / 2.0))
        if k < 1:  # make sure there's at least one cluster
            k = 1

    # Pick k random data points to serve as the initial cluster centers.

    init_clusters = data_pts[:]  # copy, so the caller's list is not reordered
    shuffle(init_clusters)  # put the data points in random order
    init_clusters = init_clusters[0:k]  # only keep the first k random points

    # every cluster center maps to its list of associated points, initially empty
    old_clusters, new_clusters = {}, {}
    for item in init_clusters:
        old_clusters[item] = []

    while True:  # keep going until the break condition below is met
        # fresh, empty point lists keyed by the current centers
        tmp = {}
        for center in old_clusters:
            tmp[center] = []

        # Associate each point with the closest cluster center.
        for point in data_pts:  # for each (x,y) data point
            min_clust = None
            min_dist = float("inf")  # larger than any real distance
            for pc in tmp:  # for every possible closest cluster
                pc_dist = distance(point, pc)
                if pc_dist < min_dist:
                    min_dist = pc_dist
                    min_clust = pc
            # add each point to its closest cluster's list of associated points
            tmp[min_clust].append(point)

        # Recompute the new cluster centers.
        for center in tmp:
            associated = tmp[center]
            if not associated:
                # no points were assigned here; keep the old center
                # rather than averaging an empty list
                new_clusters[center] = associated
                continue
            xs = [pt[0] for pt in associated]  # build up a list of x's
            ys = [pt[1] for pt in associated]  # build up a list of y's
            x = average(xs)  # x coordinate of new cluster
            y = average(ys)  # y coordinate of new cluster
            # these are the points the center was built from
            new_clusters[(x, y)] = associated

        if lists_are_same(old_clusters.keys(), new_clusters.keys()):
            # the centers did not move, so we've reached equilibrium
            return list(old_clusters.keys())
        else:
            # otherwise, go another round with the new centers
            old_clusters = new_clusters
            new_clusters = {}