Purpose
Partition the points of a data matrix (one point per row) into k clusters.
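How it works
First, pick K points from the data at random to act as centers, and compute the distance from every point to each of them.
Say k=3 and the data contains 100 points.
If the first point is closest to k1, the first point belongs to cluster k1.
Repeat this for all 100 points, and the 100 points are divided into 3 clusters.
Next, compute the mean point of each of the 3 clusters, using the mean function.
That gives 3 mean points. Go back to the step above: compute the distance from these 3 points to the 100 data points and assign the points to clusters again.
Keep computing the mean points, the distances, and the assignments, over and over.
After a few rounds the 3 points stop moving, and the data has been divided into 3 clusters.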
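One round of the two steps described above can be sketched on a tiny example. The data and the starting centers here are made up for illustration, and the variable names (points, centers, labels) are not taken from the function listed later in this post.

```matlab
% Toy data: four 2-D points forming two obvious groups (illustrative values)
points  = [0 0; 0 2; 10 0; 10 2];
centers = [0 0; 10 0];              % two starting centers taken from the data

% Assignment step: label each point with the index of its nearest center
labels = zeros(size(points, 1), 1);
for i = 1:size(points, 1)
    diffs = bsxfun(@minus, centers, points(i, :));   % center minus point, per row
    [~, labels(i)] = min(sum(diffs .^ 2, 2));        % smallest squared distance
end

% Update step: move each center to the mean of the points assigned to it
for j = 1:size(centers, 1)
    centers(j, :) = mean(points(labels == j, :), 1);
end

disp(labels');    % cluster index of each point
disp(centers);    % updated centers
```

Repeating these two steps until centers stops changing is the whole algorithm. Squared distances are compared here because the square root does not change which center is nearest.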
Figures
Code
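Note: if k is too large for the data, a cluster can end up with no points assigned to it. Dividing by that zero count is what leaves non-finite values (Inf or NaN) in the returned centers.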
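The following is a minimal sketch of where those non-finite values come from and one possible way to avoid them, assuming it is acceptable to leave an empty cluster's center where it was. The values and the names sums, counts and old_centers are hypothetical and chosen only for illustration.

```matlab
% Dividing by a zero count: nonzero/0 gives Inf, 0/0 gives NaN
demo = [5 0] / 0;

% Hypothetical per-cluster coordinate sums and point counts for k = 3,
% where no point was assigned to cluster 3
sums        = [4 6; 9 3; 0 0];
counts      = [2 3 0];
old_centers = [1 1; 2 2; 5 5];

% Unguarded update: row 3 becomes NaN because it is 0/0
naive = bsxfun(@rdivide, sums, counts');

% Guarded update (an assumption, not part of the original function):
% only clusters that received points are recomputed, the rest keep their center
new_centers = old_centers;
nonEmpty = counts > 0;
new_centers(nonEmpty, :) = bsxfun(@rdivide, sums(nonEmpty, :), counts(nonEmpty)');

disp(naive);
disp(new_centers);
```

Keeping the old center is only one option. Re-seeding the empty cluster from a random data point, or simply choosing a smaller k, are common alternatives.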
function [ centerPoints ] = my_k_mean( data_in, k )
% MY_K_MEAN  Cluster the rows of data_in into k groups and return the k centers.
    [data_instance, data_dimension] = size(data_in);
    if data_instance < k
        error('Data instance less than k');
    end
    % Working copy of the data with one extra column for the cluster label
    data_run = [data_in zeros(data_instance, 1)];
    % Use k distinct, randomly chosen rows as the initial centers
    centerPoints = data_in(randperm(data_instance, k), :);
    fClusterChanged = 1;
    iteration = 0;
    while fClusterChanged == 1
        iteration = iteration + 1;
        disp(['--------------------Iteration: ' num2str(iteration)]);
        % Assignment step: label each point with its nearest center
        for i = 1:data_instance
            pointX = data_in(i,:);
            minDist = 0;
            for j = 1:k
                % Euclidean distance (norm needs no toolbox, unlike pdist)
                dist = norm(centerPoints(j,:) - pointX);
                if minDist > dist || j == 1
                    minDist = dist;
                    % record the cluster label for this point
                    data_run(i, data_dimension + 1) = j;
                end
            end
        end
        old_center_points = centerPoints;
        % Update step: recompute each center as the mean of its points.
        % Reset the sums first so the old centers are not added into the mean.
        centerPoints = zeros(k, data_dimension);
        classes_count = zeros(1,k);
        for i = 1:data_instance
            classifier = data_run(i, data_dimension + 1);
            centerPoints(classifier,:) = centerPoints(classifier,:) + data_run(i,1:data_dimension);
            classes_count(classifier) = classes_count(classifier) + 1;
        end
        for i = 1:k
            % A cluster with no points has classes_count(i) == 0, so this gives NaN
            centerPoints(i,:) = centerPoints(i,:) / classes_count(i);
        end
        % Stop once the centers no longer move, or after more than 200 iterations
        if isequal(old_center_points, centerPoints) || iteration > 200
            fClusterChanged = 0;
        else
            fClusterChanged = 1;
        end
    end
end