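This is a small lab exercise from my machine learning course a few days ago.
Principle
This experiment uses Euclidean distance as the distance measure and the K-means algorithm for clustering. The basic idea of K-means is as follows:
(1) Pick any K objects as the initial cluster centers
(2) repeat
(3) Using the current mean of each cluster, reassign every object to the cluster it is most similar to (the one with the nearest mean)
(4) Update the mean of each cluster
(5) until the assignments no longer change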
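Steps (3) and (4) can be sketched in MATLAB as follows. This is a minimal illustration on made-up toy data, not the code used in the experiment; the names P, C, D2 and label are assumptions introduced here, and bsxfun is used so that the sketch also runs on older releases such as R2014a.

```matlab
% Toy data: two obvious groups in the plane (made up for illustration)
P = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];   % N-by-2 points, one per row
C = [0 0; 10 10];                            % K-by-2 initial centers
K = size(C, 1);

% Step (3): squared distance from every point to every center (N-by-K)
D2 = zeros(size(P, 1), K);
for k = 1:K
    diffs = bsxfun(@minus, P, C(k, :));      % subtract center k from each row
    D2(:, k) = sum(diffs.^2, 2);
end
[~, label] = min(D2, [], 2);                 % index of the nearest center

% Step (4): replace each center by the mean of its assigned points
for k = 1:K
    members = (label == k);
    if any(members)                          % an empty cluster keeps its old center
        C(k, :) = mean(P(members, :), 1);
    end
end

disp(label');
disp(C);
```

Comparing squared distances gives the same nearest center as comparing distances, so the square root can be skipped in the assignment step.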
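Intuitively, if the number of points to cluster is finite, neither step can increase the total squared distance from the points to the means of the clusters they are assigned to, so this total shrinks step by step until it stops changing. Because a finite set of points can only be partitioned in finitely many ways, the procedure must stop after a finite number of steps. The value it stops at is a local minimum, however: it depends on the initial centers and is not guaranteed to be the unique global minimum.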
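The quantity that the two steps drive down can be written out explicitly. The notation below ($x_i$ for the points, $c_i$ for the cluster index of point $i$, $\mu_k$ for the mean of cluster $k$, $S_k$ for the set of points in cluster $k$) is introduced here for illustration and does not come from the course material.

```latex
\begin{align}
J(c, \mu) &= \sum_{i=1}^{N} \lVert x_i - \mu_{c_i} \rVert^2 \\
c_i &\leftarrow \arg\min_{k} \lVert x_i - \mu_k \rVert^2
  && \text{(assignment: cannot increase } J \text{ for fixed } \mu\text{)} \\
\mu_k &\leftarrow \frac{1}{|S_k|} \sum_{i \in S_k} x_i
  && \text{(update: minimizes } J \text{ for fixed } c\text{)}
\end{align}
```

Since $J \ge 0$ and never increases, the sequence of values converges. The guarantee applies to the sum of squared distances; the plain sum of distances usually falls along with it but is not covered by this argument.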
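Environment
MATLAB is the recommended software for this experiment. I used MATLAB 2014a on Windows 10 Pro.
Requirements
The instructor gave us two data sets and asked us to cluster both with K-means: one is two-dimensional with K = 5, the other is three-dimensional with K = 7.
Solving the 2D problem
First plot the data as a scatter plot to get a rough feel for it. The code is as follows: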
load('2d-data.mat');
x = r(:,1);
y = r(:,2);
scatter(x,y)
The resulting scatter plot is shown below:
Then apply the K-means algorithm.
The code is as follows:
load('2d-data.mat');
x = r(:,1);
y = r(:,2);
% Initial cluster centers (x and y coordinates)
kx = [0.1832 0.5979 9.1695 11.3244 5.6329];
ky = [0.0799 9.6767 8.9553 0.7174 5.3970];
kind = zeros(1,500); % cluster index assigned to each point
a = zeros(1,5);      % distances from one point to each of the five centers
s = zeros(1,500);    % distance from each point to the center of its own cluster
b = 0;
ssum = 0;
kindnum = 1;
% Initial assignment: give every point to its nearest center
for i=1:500
    for j = 1:5
        a(j) = sqrt((kx(j)-x(i))^2+(ky(j)-y(i))^2);
    end
    b = a(1);
    for m = 2:5
        if b>a(m)
            b = a(m);
            kindnum = m;
        end
    end
    s(i) = b;
    kind(i) = kindnum;
    kindnum = 1;
end
% Compute the error sum
for i=1:500
    ssum = ssum + s(i);
end
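kjnum = 0;  % number of points in the current cluster
kjxsum = 0; % running sum of x over the current cluster
kjysum = 0; % running sum of y over the current cluster
% Iterate to refine the result (the iteration count is tuned by hand)
for forir=1:6
    % Mean of all points belonging to each cluster
    for j = 1:5
        for m = 1:500
            if kind(m)==j
                kjnum = kjnum+1;
                kjxsum = kjxsum + x(m);
                kjysum = kjysum + y(m);
            end
        end
        % New provisional center
        kx(j) = kjxsum/kjnum;
        ky(j) = kjysum/kjnum;
        kjnum = 0;
        kjxsum = 0;
        kjysum = 0;
    end
    % Reassign every point to its nearest center
    for i=1:500
        for j = 1:5
            a(j) = sqrt((kx(j)-x(i))^2+(ky(j)-y(i))^2);
        end
        b = a(1);
        for m = 2:5
            if b>a(m)
                b = a(m);
                kindnum = m;
            end
        end
        s(i) = b;
        kind(i) = kindnum;
        kindnum = 1;
    end
end
ssum = 0;
% Compute the error sum
for i=1:500
    ssum = ssum + s(i);
end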
% Split the points by cluster for plotting
x1i = 1;
x2i = 1;
x3i = 1;
x4i = 1;
x5i = 1;
for i=1:500
    if kind(i)==1
        x1(x1i) = x(i);
        y1(x1i) = y(i);
        x1i = x1i + 1;
    elseif kind(i) == 2
        x2(x2i) = x(i);
        y2(x2i) = y(i);
        x2i = x2i+1;
    elseif kind(i) == 3
        x3(x3i) = x(i);
        y3(x3i) = y(i);
        x3i = x3i+1;
    elseif kind(i) == 4
        x4(x4i) = x(i);
        y4(x4i) = y(i);
        x4i = x4i+1;
    else
        x5(x5i) = x(i);
        y5(x5i) = y(i);
        x5i = x5i+1;
    end
end
% One scatter call per cluster; a single "hold on" keeps them on the same axes
scatter(x1,y1,3)
hold on;
scatter(x2,y2,3)
scatter(x3,y3,3)
scatter(x4,y4,3)
scatter(x5,y5,3)
scatter(kx,ky,90,'p') % cluster means drawn as pentagrams
ssum % display the final error sum
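Here I used a "semi-automatic" approach: I changed the number of loop iterations by hand until the error sum stopped changing (which was also its smallest value). The result is shown below, where the pentagrams mark the mean of each cluster. The five means are (-0.1108, -0.0647), (-0.2540, 10.1273), (10.0842, 10.0162), (9.9627, -0.0961) and (5.3281, 5.0506). The sum of Euclidean distances at that point is 612.5909.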
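The manual tuning of the loop count described above can be automated by stopping as soon as no point changes cluster. The following is only a sketch on made-up toy data, not the code that produced the numbers in this post; P, C, label, changed and the safety cap maxIter are names and choices assumed here for illustration.

```matlab
% Toy data standing in for the real data set (made up for illustration)
P = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10; 5 5];   % N-by-2 points
C = [0 0; 2 2];                                   % K-by-2 initial centers
K = size(C, 1);
N = size(P, 1);
label = zeros(N, 1);     % 0 means "not assigned yet"
maxIter = 100;           % safety cap on the number of passes (assumed value)
iter = 0;
changed = true;

while changed && iter < maxIter
    iter = iter + 1;
    % Assignment step: Euclidean distance to every center, then nearest one
    D = zeros(N, K);
    for k = 1:K
        diffs = bsxfun(@minus, P, C(k, :));
        D(:, k) = sqrt(sum(diffs.^2, 2));
    end
    [dmin, newLabel] = min(D, [], 2);
    changed = any(newLabel ~= label);   % did any point switch cluster?
    label = newLabel;
    % Update step: mean of each non-empty cluster
    for k = 1:K
        members = (label == k);
        if any(members)
            C(k, :) = mean(P(members, :), 1);
        end
    end
end

ssum = sum(dmin);        % sum of distances to the assigned centers
fprintf('stopped after %d passes, distance sum %.4f\n', iter, ssum);
```

When the loop exits because nothing changed, the last update step reproduces the same centers, so dmin is measured against the final centers. If the cap maxIter is hit first, that is no longer guaranteed.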
Solving the 3D problem
First plot the data as a scatter plot to see how it is distributed.
The code is as follows:
load('3d-data.mat')
x = r(:,1);
y = r(:,2);
z = r(:,3);
scatter3(x,y,z)
The plot looks like this:
Cluster the data with the K-means algorithm.
The code is as follows:
load('3d-data.mat');
x = r(:,1);
y = r(:,2);
z = r(:,3);
% Initial cluster centers (x, y and z coordinates)
kx = [9.7230 1.0262 1.5833 0.4717 -1.6163 10.4582 10.8928];
ky = [9.5779 0.4131 0.9143 9.1354 10.9840 0.6157 0.3589];
kz = [11.5846 0.1446 9.3184 0.7930 9.0882 0.7019 10.9018];
kind = zeros(1,700); % cluster index assigned to each point
a = zeros(1,7);      % distances from one point to each of the seven centers
s = zeros(1,700);    % distance from each point to the center of its own cluster
b = 0;
ssum = 0;            % error sum
kindnum = 1;
% Initial assignment: give every point to its nearest center
for i=1:700
    for j = 1:7
        a(j) = sqrt((kx(j)-x(i))^2+(ky(j)-y(i))^2+(kz(j)-z(i))^2);
    end
    b = a(1);
    for m = 2:7
        if b>a(m)
            b = a(m);
            kindnum = m;
        end
    end
    s(i) = b;
    kind(i) = kindnum;
    kindnum = 1;
end
% Compute the error sum
for i=1:700
    ssum = ssum + s(i);
end
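kjnum = 0;  % number of points in the current cluster
kjxsum = 0; % running sum of x over the current cluster
kjysum = 0; % running sum of y over the current cluster
kjzsum = 0; % running sum of z over the current cluster
% Iterate to refine the result (the iteration count is tuned by hand)
for forir=1:8
    % Mean of all points belonging to each cluster
    for j = 1:7
        for m = 1:700
            if kind(m)==j
                kjnum = kjnum + 1;
                kjxsum = kjxsum + x(m);
                kjysum = kjysum + y(m);
                kjzsum = kjzsum + z(m);
            end
        end
        % New provisional center
        kx(j) = kjxsum/kjnum;
        ky(j) = kjysum/kjnum;
        kz(j) = kjzsum/kjnum;
        kjnum = 0;
        kjxsum = 0;
        kjysum = 0;
        kjzsum = 0;
    end
    % Reassign every point to its nearest center
    for i=1:700
        for j = 1:7
            a(j) = sqrt((kx(j)-x(i))^2+(ky(j)-y(i))^2+(kz(j)-z(i))^2);
        end
        b = a(1);
        for m = 2:7
            if b>a(m)
                b = a(m);
                kindnum = m;
            end
        end
        s(i) = b;
        kind(i) = kindnum;
        kindnum = 1;
    end
end
ssum = 0;
% Compute the error sum over all 700 points
% (an upper bound of 500 here would drop the last 200 points and understate the total)
for i=1:700
    ssum = ssum + s(i);
end
ssum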
% Split the points by cluster for plotting
x1i = 1;
x2i = 1;
x3i = 1;
x4i = 1;
x5i = 1;
x6i = 1;
x7i = 1;
for i=1:700
    if kind(i)==1
        x1(x1i) = x(i);
        y1(x1i) = y(i);
        z1(x1i) = z(i);
        x1i = x1i + 1;
    elseif kind(i) == 2
        x2(x2i) = x(i);
        y2(x2i) = y(i);
        z2(x2i) = z(i);
        x2i = x2i+1;
    elseif kind(i) == 3
        x3(x3i) = x(i);
        y3(x3i) = y(i);
        z3(x3i) = z(i);
        x3i = x3i+1;
    elseif kind(i) == 4
        x4(x4i) = x(i);
        y4(x4i) = y(i);
        z4(x4i) = z(i);
        x4i = x4i+1;
    elseif kind(i) == 5
        x5(x5i) = x(i);
        y5(x5i) = y(i);
        z5(x5i) = z(i);
        x5i = x5i+1;
    elseif kind(i) == 6
        x6(x6i) = x(i);
        y6(x6i) = y(i);
        z6(x6i) = z(i);
        x6i = x6i+1;
    else
        x7(x7i) = x(i);
        y7(x7i) = y(i);
        z7(x7i) = z(i);
        x7i = x7i+1;
    end
end
% One scatter3 call per cluster; a single "hold on" keeps them on the same axes
scatter3(x1,y1,z1,3)
hold on;
scatter3(x2,y2,z2,3)
scatter3(x3,y3,z3,3)
scatter3(x4,y4,z4,3)
scatter3(x5,y5,z5,3)
scatter3(x6,y6,z6,3)
scatter3(x7,y7,z7,3)
scatter3(kx,ky,kz,90,'p') % cluster centers drawn as pentagrams
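The basic idea here is the same as in the 2D case. The resulting plot is shown below. The seven cluster centers are (9.9641, 10.0875, 9.9957), (-0.1625, -0.0271, 0.0249), (-0.0105, -0.0366, 10.1016), (0.1838, 10.0791, -0.0820), (-0.0374, 9.9986, 10.0079), (10.1295, 0.0617, 0.0603) and (9.9811, -0.1085, 10.0376). The sum of Euclidean distances reported by the script is 790.5695.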
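Because the idea is identical in both cases, the distance computation does not have to be written separately for two and three dimensions. The sketch below stores each point as a row of an N-by-D matrix and computes distances for any D; the helper name rowDist and the sample points are made up for illustration.

```matlab
% Distances from every row of P (N-by-D) to one center c (1-by-D), for any D
rowDist = @(P, c) sqrt(sum(bsxfun(@minus, P, c).^2, 2));

% The same helper applied to made-up 2D and 3D points
P2 = [3 4; 0 0];
P3 = [1 2 2; 0 0 0];
d2 = rowDist(P2, [0 0]);      % 2D: distances to the origin
d3 = rowDist(P3, [0 0 0]);    % 3D: distances to the origin

disp(d2');
disp(d3');
```

With the data from this post, P would presumably be the loaded matrix r itself (its columns are the coordinates used above), so only K and the initial centers would differ between the two problems.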
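Takeaways
K-means was mentioned at the very end of the Database System Concepts course I took earlier, but I had never applied it to real data myself. This first hands-on attempt went reasonably well.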