python使用k-means算法代码案例-Python实现的Kmeans++算法实例

最新推荐文章于 2020-12-08 20:43:26 发布

weixin_37988176

最新推荐文章于 2020-12-08 20:43:26 发布

阅读量1.8k

点赞数 1

本文介绍了KMeans++算法在Python和MATLAB中的实现细节，包括Python中的kpp和lloyd函数，以及MATLAB中的kmeanspp函数。这些函数用于二维数据点的聚类，通过选取初始中心点并迭代更新来达到最佳聚类效果。Python代码实现了从生成随机数据点到绘制聚类结果的完整流程，MATLAB代码则展示了如何高效地实现KMeans++初始化和KMeans迭代过程。

摘要由CSDN通过智能技术生成

from math import pi, sin, cos

from collections import namedtuple

from random import random, choice

from copy import copy

try:

import psyco

psyco.full()

except ImportError:

pass

FLOAT_MAX = 1e100

class Point:

__slots__ = ["x", "y", "group"]

def __init__(self, x=0.0, y=0.0, group=0):

self.x, self.y, self.group = x, y, group

def generate_points(npoints, radius):

points = [Point() for _ in xrange(npoints)]

# note: this is not a uniform 2-d distribution

for p in points:

r = random() * radius

ang = random() * 2 * pi

p.x = r * cos(ang)

p.y = r * sin(ang)

return points

def nearest_cluster_center(point, cluster_centers):

"""Distance and index of the closest cluster center"""

def sqr_distance_2D(a, b):

return (a.x - b.x) ** 2 + (a.y - b.y) ** 2

min_index = point.group

min_dist = FLOAT_MAX

for i, cc in enumerate(cluster_centers):

d = sqr_distance_2D(cc, point)

if min_dist > d:

min_dist = d

min_index = i

return (min_index, min_dist)

'''

points是数据点，nclusters是给定的簇类数目

cluster_centers包含初始化的nclusters个中心点，开始都是对象->(0,0,0)

'''

def kpp(points, cluster_centers):

cluster_centers[0] = copy(choice(points)) #随机选取第一个中心点

d = [0.0 for _ in xrange(len(points))] #列表，长度为len(points)，保存每个点离最近的中心点的距离

for i in xrange(1, len(cluster_centers)): # i=1...len(c_c)-1

sum = 0

for j, p in enumerate(points):

d[j] = nearest_cluster_center(p, cluster_centers[:i])[1] #第j个数据点p与各个中心点距离的最小值

sum += d[j]

sum *= random()

for j, di in enumerate(d):

sum -= di

if sum > 0:

continue

cluster_centers[i] = copy(points[j])

break

for p in points:

p.group = nearest_cluster_center(p, cluster_centers)[0]

'''

points是数据点，nclusters是给定的簇类数目

'''

def lloyd(points, nclusters):

cluster_centers = [Point() for _ in xrange(nclusters)] #根据指定的中心点个数，初始化中心点，均为(0,0,0)

# call k++ init

kpp(points, cluster_centers) #选择初始种子点

# 下面是kmeans

lenpts10 = len(points) >> 10

changed = 0

while True:

# group element for centroids are used as counters

for cc in cluster_centers:

cc.x = 0

cc.y = 0

cc.group = 0

for p in points:

cluster_centers[p.group].group += 1 #与该种子点在同一簇的数据点的个数

cluster_centers[p.group].x += p.x

cluster_centers[p.group].y += p.y

for cc in cluster_centers: #生成新的中心点

cc.x /= cc.group

cc.y /= cc.group

# find closest centroid of each PointPtr

changed = 0 #记录所属簇发生变化的数据点的个数

for p in points:

min_i = nearest_cluster_center(p, cluster_centers)[0]

if min_i != p.group:

changed += 1

p.group = min_i

# stop when 99.9% of points are good

if changed <= lenpts10:

break

for i, cc in enumerate(cluster_centers):

cc.group = i

return cluster_centers

def print_eps(points, cluster_centers, W=400, H=400):

Color = namedtuple("Color", "r g b");

colors = []

for i in xrange(len(cluster_centers)):

colors.append(Color((3 * (i + 1) % 11) / 11.0,

(7 * i % 11) / 11.0,

(9 * i % 11) / 11.0))

max_x = max_y = -FLOAT_MAX

min_x = min_y = FLOAT_MAX

for p in points:

if max_x < p.x: max_x = p.x

if min_x > p.x: min_x = p.x

if max_y < p.y: max_y = p.y

if min_y > p.y: min_y = p.y

scale = min(W / (max_x - min_x),

H / (max_y - min_y))

cx = (max_x + min_x) / 2

cy = (max_y + min_y) / 2

print "%%!PS-Adobe-3.0 %%%%BoundingBox: -5 -5 %d %d" % (W + 10, H + 10)

print ("/l {rlineto} def /m {rmoveto} def " +

"/c { .25 sub exch .25 sub exch .5 0 360 arc fill } def " +

"/s { moveto -2 0 m 2 2 l 2 -2 l -2 -2 l closepath " +

" gsave 1 setgray fill grestore gsave 3 setlinewidth" +

" 1 setgray stroke grestore 0 setgray stroke }def")

for i, cc in enumerate(cluster_centers):

print ("%g %g %g setrgbcolor" %

(colors[i].r, colors[i].g, colors[i].b))

for p in points:

if p.group != i:

continue

print ("%.3f %.3f c" % ((p.x - cx) * scale + W / 2,

(p.y - cy) * scale + H / 2))

print (" 0 setgray %g %g s" % ((cc.x - cx) * scale + W / 2,

(cc.y - cy) * scale + H / 2))

print " %%%%EOF"

def main():

npoints = 30000

k = 7 # # clusters

points = generate_points(npoints, 10)

cluster_centers = lloyd(points, k)

print_eps(points, cluster_centers)

main()

上述代码实现的算法是针对二维数据的，所以Point对象有三个属性，分别是在x轴上的值、在y轴上的值、以及所属的簇的标识。函数lloyd是kmeans++算法的整体实现，其先是通过kpp函数选取合适的种子点，然后对数据集实行kmeans算法进行聚类。kpp函数的实现完全符合上述kmeans++的基本思路的2、3、4步。

4、matlab版本的kmeans++

复制代码代码如下:

function [L,C] = kmeanspp(X,k)

%KMEANS Cluster multivariate data using the k-means++ algorithm.

% [L,C] = kmeans_pp(X,k) produces a 1-by-size(X,2) vector L with one class

% label per column in X and a size(X,1)-by-k matrix C containing the

% centers corresponding to each class.

% Version: 2013-02-08

% Authors: Laurent Sorber (Laurent.Sorber@cs.kuleuven.be)

L = [];

L1 = 0;

while length(unique(L)) ~= k

% The k-means++ initialization.

C = X(:,1+round(rand*(size(X,2)-1))); %size(X,2)是数据集合X的数据点的数目，C是中心点的集合

L = ones(1,size(X,2));

for i = 2:k

D = X-C(:,L); %-1

D = cumsum(sqrt(dot(D,D,1))); %将每个数据点与中心点的距离，依次累加

if D(end) == 0, C(:,i:k) = X(:,ones(1,k-i+1)); return; end

C(:,i) = X(:,find(rand < D/D(end),1)); %find的第二个参数表示返回的索引的数目

[~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).')); %碉堡了，这句，将每个数据点进行分类。

end

% The k-means algorithm.

while any(L ~= L1)

L1 = L;

for i = 1:k, l = L==i; C(:,i) = sum(X(:,l),2)/sum(l); end

[~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).'),[],1);

end

这个函数的实现有些特殊，参数X是数据集，但是是将每一列看做一个数据点，参数k是指定的聚类数。返回值L标记了每个数据点的所属分类，返回值C保存了最终形成的中心点（一列代表一个中心点）。测试一下：

复制代码代码如下:

>> x=[randn(3,2)*.4;randn(4,2)*.5+ones(4,1)*[4 4]]

x =

-0.0497 0.5669

0.5959 0.2686

0.5636 -0.4830

4.3586 4.3634

4.8151 3.8483

4.2444 4.1469

4.5173 3.6064

>> [L, C] = kmeanspp(x',2)

L =

2 2 2 1 1 1 1

C =

4.4839 0.3699

3.9913 0.1175

好了，现在开始一点点理解这个实现，顺便巩固一下matlab知识。

unique函数用来获取一个矩阵中的不同的值，示例：

复制代码代码如下:

>> unique([1 3 3 4 4 5])

ans =

1 3 4 5

>> unique([1 3 3 ; 4 4 5])

ans =

所以循环 while length(unique(L)) ~= k 以得到了k个聚类为结束条件，不过一般情况下，这个循环一次就结束了，因为重点在这个循环中。

rand是返回在(0,1)这个区间的一个随机数。在注释%-1所在行，C被扩充了，被扩充的方法类似于下面：

复制代码代码如下:

>> C =[];

>> C(1,1) = 1

C =

>> C(2,1) = 2

C =

>> C(:,[1 1 1 1])

ans =

1 1 1 1

2 2 2 2

>> C(:,[1 1 1 1 2])

Index exceeds matrix dimensions.

C中第二个参数的元素1，其实是代表C的第一列数据，之所以在值2时候出现Index exceeds matrix dimensions.的错误，是因为C本身没有第二列。如果C有第二列了：

复制代码代码如下:

>> C(2,2) = 3;

>> C(2,2) = 4;

>> C(:,[1 1 1 1 2])

ans =

1 1 1 1 3

2 2 2 2 4

dot函数是将两个矩阵点乘，然后把结果在某一维度相加：

复制代码代码如下:

>> TT = [1 2 3 ; 4 5 6];

>> dot(TT,TT)

ans =

17 29 45

>> dot(TT,TT,1 )

ans =

17 29 45

cumsum是累加函数：

复制代码代码如下:

>> cumsum([1 2 3])

ans =

1 3 6

>> cumsum([1 2 3; 4 5 6])

ans =

1 2 3

5 7 9

max函数可以返回两个值，第二个代表的是max数的索引位置：

复制代码代码如下:

>> [~, L] = max([1 2 3])

L =

>> [~,L] = max([1 2 3;2 3 4])

L =

2 2 2

其中~是占位符。

关于bsxfun函数，官方文档指出：

复制代码代码如下:

C = bsxfun(fun,A,B) applies the element-by-element binary operation specified by the function handle fun to arrays A and B, with singleton expansion enabled

其中参数fun是函数句柄，关于函数句柄见资料[9]。下面是bsxfun的一个示例：

复制代码代码如下:

>> A= [1 2 3;2 3 4]

A =

1 2 3

2 3 4

>> B=[6;7]

B =

>> bsxfun(@minus,A,B)

ans =

-5 -4 -3

对于：

复制代码代码如下:

[~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).'));

max的参数是这样一个矩阵，矩阵有n列，n也是数据点的个数，每一列代表着对应的数据点与各个中心点之间的距离的相反数。不过这个距离有些与众不同，算是欧几里得距离的变形。

假定数据点是2维的，某个数据点为(x1,y1)，某个中心点为(c1,d1)，那么通过bsxfun(@minus,2real(C'X),dot(C,C,1).')的计算，数据点与中心点的距离为2c1x1 + 2d1y1 -c1.^2 - c2.^2，可以变换为x1.^2 + y1.^2 - (c1-x1).^2 - (d1-y1).^2。对于每一列而言，由于是数据点与各个中心点之间的计算，所以可以忽略x1.^2 + y1.^2，最终计算结果是欧几里得距离的平方的相反数。这也说明了使用max的合理性，因为一个数据点的所属簇取决于与其距离最近的中心点，若将距离取相反数，则应该是值最大的那个点。

本文原创发布php中文网，转载请注明出处，感谢您的尊重！