The K-Means Algorithm

This is a small lab exercise from my machine learning course a few days ago.

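Experimental Principle

This experiment uses Euclidean distance as the distance measure and K-means as the clustering algorithm. The basic idea of K-means is as follows: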

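(1) Pick any K objects as the initial cluster centers

(2) repeat

(3) Using the current mean of each cluster, reassign every object to the most similar (nearest) cluster

(4) Update the mean of each cluster

(5) until the assignments no longer change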
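The steps above can be sketched in a few lines of MATLAB. This is only a minimal illustration on a made-up six-point data set, not the code used for the experiment; the variable names `X`, `C`, and `labels` are my own, and `bsxfun` is used so that the sketch also runs on older releases such as MATLAB 2014a.

```matlab
% Minimal K-means sketch on synthetic data (not the course data).
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];   % N-by-D points, one per row
C = X([1 4], :);                             % step (1): K initial centers
K = size(C, 1);
labels = zeros(size(X, 1), 1);
while true                                   % step (2): repeat
    % step (3): squared distance from every point to every center (N-by-K)
    D2 = zeros(size(X, 1), K);
    for k = 1:K
        D2(:, k) = sum(bsxfun(@minus, X, C(k, :)).^2, 2);
    end
    [~, newLabels] = min(D2, [], 2);         % nearest center for each point
    if isequal(newLabels, labels)            % step (5): nothing changed
        break;
    end
    labels = newLabels;
    for k = 1:K                              % step (4): update the means
        if any(labels == k)                  % an empty cluster keeps its old center
            C(k, :) = mean(X(labels == k, :), 1);
        end
    end
end
disp(C);
```

Keeping an empty cluster's old center is one simple way to avoid dividing by zero; other strategies, such as re-seeding the center, are equally valid.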

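As long as the set of points to cluster is finite, each step can only lower (or leave unchanged) the sum of squared distances between the points and their nearest cluster means, so the procedure must stop changing after finitely many steps. The value it stops at is only a local minimum, though: it depends on the initial centers and is not guaranteed to be unique or globally optimal.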
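A toy example makes the dependence on the initial centers concrete. The four points and the two sets of starting centers below are hand-picked for illustration (they are not from the course data): both starts are already stable under the K-means steps, yet they end with different sums of squared distances.

```matlab
% Four corners of a 4-by-1 rectangle, K = 2 (toy data, not the course data).
X = [0 0; 0 1; 4 0; 4 1];
inits = {[0 0.5; 4 0.5], [2 0; 2 1]};   % two hand-picked sets of starting centers
sse = zeros(1, 2);
for t = 1:2
    C = inits{t};
    labels = zeros(4, 1);
    while true
        % squared distance from every point to each of the two centers
        D2 = [sum(bsxfun(@minus, X, C(1, :)).^2, 2), ...
              sum(bsxfun(@minus, X, C(2, :)).^2, 2)];
        [d2min, newLabels] = min(D2, [], 2);
        if isequal(newLabels, labels), break; end
        labels = newLabels;
        for k = 1:2
            C(k, :) = mean(X(labels == k, :), 1);
        end
    end
    sse(t) = sum(d2min);   % sum of squared distances once nothing changes
end
disp(sse);
```

The first start splits the points into left and right pairs, the second into bottom and top pairs, so in practice it is common to try several initializations and keep the best result.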

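Experimental Environment

The recommended software for this experiment is MATLAB. I used MATLAB 2014a on Windows 10 Pro.

Requirements

The instructor gave us two data sets and asked us to cluster them with K-means: one is 2-D data with K = 5, the other is 3-D data with K = 7.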

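Solving the 2-D Problem

First, plot the data as a scatter plot to get a rough feel for it. The code is as follows:

load('2d-data.mat');
x = r(:,1);
y = r(:,2);
scatter(x,y)

The resulting scatter plot is shown below: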



Then run the K-means algorithm on the data.

The code is as follows:

load('2d-data.mat');
x = r(:,1);
y = r(:,2);
kx = [0.1832 0.5979 9.1695 11.3244 5.6329];
ky = [0.0799 9.6767 8.9553 0.7174 5.3970];
kind = zeros(1,500); % cluster index of each point
a = zeros(1,5); % distances from the current point to the five centers
s = zeros(1,500); % distance from each point to its cluster center
b = 0;
ssum = 0;
kindnum = 1;
for i=1:500
    for j = 1:5
        a(j) = sqrt((kx(j)-x(i))^2+(ky(j)-y(i))^2);
    end
    b = a(1);
    for m = 2:5;
       if b>a(m)
           b = a(m);
           kindnum = m;
       else
       end
    end
    s(i) = b;
    kind(i) = kindnum;
    kindnum = 1;
end

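% compute the error sum
for i=1:500
    ssum = ssum + s(i);
end

kjnum = 0; % number of points in the current cluster
kjxsum = 0;
kjysum = 0;
% iterate to refine the result
for forir=1:6
    
    % mean of all points belonging to each cluster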
    for j = 1:5
        for m = 1:500
            if kind(m)==j
                kjnum = kjnum+1;
                kjxsum = kjxsum + x(m);
                kjysum = kjysum + y(m);
            else
            end
        end
        % new provisional center
        kx(j) = kjxsum/kjnum;
        ky(j) = kjysum/kjnum;
        kjnum = 0;
        kjxsum = 0;
        kjysum = 0;
    end
    
    for i=1:500
        for j = 1:5
            a(j) = sqrt((kx(j)-x(i))^2+(ky(j)-y(i))^2);
        end
        b = a(1);
        for m = 2:5;
            if b>a(m)
                b = a(m);
                kindnum = m;
            else
            end
        end
        s(i) = b;
        kind(i) = kindnum;
        kindnum = 1;
    end
end

ssum = 0;
% compute the error sum
for i=1:500
    ssum = ssum + s(i);
end

x1i = 1;
x2i = 1;
x3i = 1;
x4i = 1;
x5i = 1;

for i=1:500
    if kind(i)==1
        x1(x1i) = x(i);
        y1(x1i) = y(i);
        x1i = x1i + 1;
    else if kind(i) == 2
            x2(x2i) = x(i);
            y2(x2i) = y(i);
            x2i = x2i+1;
        else if kind(i) == 3
                x3(x3i) = x(i);
                y3(x3i) = y(i);
                x3i = x3i+1;
            else if kind(i) == 4
                    x4(x4i) = x(i);
                    y4(x4i) = y(i);
                    x4i = x4i+1;
                else
                    x5(x5i) = x(i);
                    y5(x5i) = y(i);
                    x5i = x5i+1;
                end
            end
        end
    end
end

scatter(x1,y1,3)
hold on;
scatter(x2,y2,3)
hold on;
scatter(x3,y3,3)
hold on;
scatter(x4,y4,3)
hold on;
scatter(x5,y5,3)
hold on;
scatter(kx,ky,90,'p')
hold on;
ssum

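Here I used a "semi-automatic" approach: I changed the number of loop iterations by hand until the error sum stopped changing (which was also the smallest value it reached). The result is shown below; the pentagrams mark the mean of each cluster. The five means are (-0.1108, -0.0647), (-0.2540, 10.1273), (10.0842, 10.0162), (9.9627, -0.0961), and (5.3281, 5.0506). The sum of Euclidean distances at this point is 612.5909.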
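The hand-tuned loop count could also be replaced by an automatic stopping test. The sketch below is one possible way to do it, under several assumptions: the eight points stand in for the course data, the starting centers are arbitrary, and `maxIter`, `tol`, and `history` are hypothetical names I introduced for illustration.

```matlab
% Synthetic stand-in for the course data (assumption): two tight groups.
x = [0 1 0 1 9 10 9 10];
y = [0 0 1 1 9 9 10 10];
kx = [0 1];  ky = [0 0];          % hypothetical starting centers
maxIter = 100;  tol = 1e-9;       % hypothetical safety limit and tolerance
history = zeros(1, 0);            % error sum recorded after each pass
prev = Inf;
for iter = 1:maxIter
    % distance from every center (rows) to every point (columns)
    d = sqrt(bsxfun(@minus, kx', x).^2 + bsxfun(@minus, ky', y).^2);
    [s, kind] = min(d, [], 1);    % nearest center and its distance per point
    ssum = sum(s);
    history(end + 1) = ssum;
    if abs(prev - ssum) < tol     % error sum stopped changing: done
        break;
    end
    prev = ssum;
    for j = 1:numel(kx)           % move each center to the mean of its points
        if any(kind == j)
            kx(j) = mean(x(kind == j));
            ky(j) = mean(y(kind == j));
        end
    end
end
disp(history);
```

Printing `history` shows how the error sum evolves from pass to pass, which is the quantity that was being watched by hand.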



Solving the 3-D Problem

First, plot the data as a scatter plot to look for patterns.

The code is as follows:

load('3d-data.mat')
x = r(:,1);
y = r(:,2);
z = r(:,3);
scatter3(x,y,z)

The plot is shown below:



Cluster the data with the K-means algorithm.

The code is as follows:

load('3d-data.mat');
x = r(:,1);
y = r(:,2);
z = r(:,3);
kx = [9.7230 1.0262 1.5833 0.4717 -1.6163 10.4582 10.8928];
ky = [9.5779 0.4131 0.9143 9.1354 10.9840 0.6157 0.3589];
kz = [11.5846 0.1446 9.3184 0.7930 9.0882 0.7019 10.9018];
kind = zeros(1,700); % cluster index of each point
a = zeros(1,7); % distances from the current point to the seven centers
s = zeros(1,700); % distance from each point to its cluster center
b = 0;
ssum = 0; % error sum
kindnum = 1;
for i=1:700
    for j = 1:7
        a(j) = sqrt((kx(j)-x(i))^2+(ky(j)-y(i))^2+(kz(j)-z(i))^2);
    end
    b = a(1);
    for m = 2:7;
       if b>a(m)
           b = a(m);
           kindnum = m;
       else
       end
    end
    s(i) = b;
    kind(i) = kindnum;
    kindnum = 1;
end

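% compute the error sum
for i=1:700
    ssum = ssum + s(i);
end

kjnum = 0; % number of points in the current cluster
kjxsum = 0;
kjysum = 0;
kjzsum = 0;
% iterate to refine the result
for forir=1:8
    
    % mean of all points belonging to each cluster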
    for j = 1:7
        for m = 1:700
            if kind(m)==j
                kjnum = kjnum + 1;
                kjxsum = kjxsum + x(m);
                kjysum = kjysum + y(m);
                kjzsum = kjzsum + z(m);
            else
            end
        end
        % new provisional center
        kx(j) = kjxsum/kjnum;
        ky(j) = kjysum/kjnum;
        kz(j) = kjzsum/kjnum;
        kjnum = 0;
        kjxsum = 0;
        kjysum = 0;
        kjzsum = 0;
    end
    
    % reassign every point to its nearest center
    for i=1:700
        for j = 1:7
            a(j) = sqrt((kx(j)-x(i))^2+(ky(j)-y(i))^2+(kz(j)-z(i))^2);
        end
        b = a(1);
        for m = 2:7;
            if b>a(m)
                b = a(m);
                kindnum = m;
            else
            end
        end
        s(i) = b;
        kind(i) = kindnum;
        kindnum = 1;
    end
end

ssum = 0;
% compute the error sum over all 700 points
for i=1:700
    ssum = ssum + s(i);
end
ssum
% split the points by cluster for plotting
x1i = 1;
x2i = 1;
x3i = 1;
x4i = 1;
x5i = 1;
x6i = 1;
x7i = 1;

for i=1:700
    if kind(i)==1
        x1(x1i) = x(i);
        y1(x1i) = y(i);
        z1(x1i) = z(i);
        x1i = x1i + 1;
    else if kind(i) == 2
            x2(x2i) = x(i);
            y2(x2i) = y(i);
            z2(x2i) = z(i);
            x2i = x2i+1;
        else if kind(i) == 3
                x3(x3i) = x(i);
                y3(x3i) = y(i);
                z3(x3i) = z(i);
                x3i = x3i+1;
            else if kind(i) == 4
                    x4(x4i) = x(i);
                    y4(x4i) = y(i);
                    z4(x4i) = z(i);
                    x4i = x4i+1;
                else if kind(i) == 5
                    x5(x5i) = x(i);
                    y5(x5i) = y(i);
                    z5(x5i) = z(i);
                    x5i = x5i+1;
                    else if kind(i) == 6
                            x6(x6i) = x(i);
                            y6(x6i) = y(i);
                            z6(x6i) = z(i);
                            x6i = x6i+1;
                        else
                            x7(x7i) = x(i);
                            y7(x7i) = y(i);
                            z7(x7i) = z(i);
                            x7i = x7i+1;
                        end
                    end
                end
            end
        end
    end
end

scatter3(x1,y1,z1,3)
hold on;
scatter3(x2,y2,z2,3)
hold on;
scatter3(x3,y3,z3,3)
hold on;
scatter3(x4,y4,z4,3)
hold on;
scatter3(x5,y5,z5,3)
hold on;
scatter3(x6,y6,z6,3)
hold on;
scatter3(x7,y7,z7,3)
hold on;
scatter3(kx,ky,kz,90,'p')
hold on;

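The basic idea here is the same as in the 2-D case. The resulting plot is shown below. The seven cluster centers are (9.9641, 10.0875, 9.9957), (-0.1625, -0.0271, 0.0249), (-0.0105, -0.0366, 10.1016), (0.1838, 10.0791, -0.0820), (-0.0374, 9.9986, 10.0079), (10.1295, 0.0617, 0.0603), and (9.9811, -0.1085, 10.0376). The sum of Euclidean distances reported for this run is 790.5695; it only covers all 700 points if the final summation loop runs to 700 rather than 500.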
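Because the idea is identical in both cases, the separate x, y, and z variables could be replaced by one N-by-D matrix so that a single assignment step covers any dimension. The following is a minimal sketch of that, with tiny made-up inputs; the names `datasets`, `centers`, and `labels` are hypothetical.

```matlab
% The same assignment step for any dimension D (toy inputs, not the course data).
datasets = {[0 0; 9 9; 1 0], [0 0 0; 9 9 9; 0 1 0]};   % N-by-D points
centers  = {[0 0; 10 10],    [0 0 0; 10 10 10]};       % K-by-D centers
labels   = cell(1, 2);
for t = 1:2
    X = datasets{t};
    C = centers{t};
    D = zeros(size(X, 1), size(C, 1));
    for k = 1:size(C, 1)
        % Euclidean distance; sum(..., 2) adds up however many coordinates exist
        D(:, k) = sqrt(sum(bsxfun(@minus, X, C(k, :)).^2, 2));
    end
    [~, labels{t}] = min(D, [], 2);   % index of the nearest center per point
end
disp(labels{1}');
disp(labels{2}');
```

With the data stored this way, only the plotting call (`scatter` versus `scatter3`) needs to differ between the 2-D and 3-D experiments.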



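Takeaways

K-means came up at the very end of the Database System Concepts course I took earlier, but I had never applied it to real data myself. This first hands-on attempt went reasonably well.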


