前言

kmeans是最简单的聚类算法之一,但是运用十分广泛。最近在工作中也经常遇到这个算法。kmeans一般在数据分析前期使用,选取适当的k,将数据分类后,然后分类研究不同聚类下数据的特点。

本文记录学习kmeans算法相关的内容,包括算法原理,收敛性,效果评估聚,最后带上R语言的例子,作为备忘。

算法原理

kmeans的计算方法如下:

1 随机选取k个中心点

2 遍历所有数据,将每个数据划分到最近的中心点中

3 计算每个聚类的平均值,并作为新的中心点

4 重复2-3,直到这k个中线点不再变化(收敛了),或执行了足够多的迭代

时间复杂度:O(I*n*k*m)

空间复杂度:O(n*m)

其中m为每个元素字段个数,n为数据量,I为跌打个数。一般I,k,m均可认为是常量,所以时间和空间复杂度可以简化为O(n),即线性的。

算法收敛

从kmeans的算法可以发现,SSE其实是一个严格的坐标下降(Coordinate Decendet)过程。设目标函数SSE如下:

SSE(

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_matlab

,

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_布局优化_02

,…,

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_matlab_03

) =

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_matlab_04

采用欧式距离作为变量之间的聚类函数。每次朝一个变量

的方向找到最优解,也就是求偏倒数,然后等于0,可得

c_i=

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_布局优化_05

其中m是c_i所在的簇的元素的个数

也就是当前聚类的均值就是当前方向的最优解(最小值),这与kmeans的每一次迭代过程一样。所以,这样保证SSE每一次迭代时,都会减小,最终使SSE收敛。

由于SSE是一个非凸函数(non-convex function),所以SSE不能保证找到全局最优解,只能确保局部最优解。但是可以重复执行几次kmeans,选取SSE最小的一次作为最终的聚类结果。

0-1规格化

由于数据之间量纲的不相同,不方便比较。举个例子,比如游戏用户的在线时长和活跃天数,前者单位是秒,数值一般都是几千,而后者单位是天,数值一般在个位或十位,如果用这两个变量来表征用户的活跃情况,显然活跃天数的作用基本上可以忽略。所以,需要将数据统一放到0~1的范围,将其转化为无量纲的纯数值,便于不同单位或量级的指标能够进行比较和加权。具体计算方法如下:

其中【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_matlab_06

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_matlab_07

属于A。

轮廓系数

轮廓系数(Silhouette Coefficient)结合了聚类的凝聚度(Cohesion)和分离度(Separation),用于评估聚类的效果。该值处于-1~1之间,值越大,表示聚类效果越好。具体计算方法如下:

  1. 对于第i个元素x_i,计算x_i与其同一个簇内的所有其他元素距离的平均值,记作a_i,用于量化簇内的凝聚度。
  2. 选取x_i外的一个簇b,计算x_i与b中所有点的平均距离,遍历所有其他簇,找到最近的这个平均距离,记作b_i,用于量化簇之间分离度。
  3. 对于元素x_i,轮廓系数s_i = (b_i – a_i)/max(a_i,b_i)
  4. 计算所有x的轮廓系数,求出平均值即为当前聚类的整体轮廓系数

从上面的公式,不难发现若s_i小于0,说明x_i与其簇内元素的平均距离小于最近的其他簇,表示聚类效果不好。如果a_i趋于0,或者b_i足够大,那么s_i趋近与1,说明聚类效果比较好。

K值选取

在实际应用中,由于Kmean一般作为数据预处理,或者用于辅助分类贴标签。所以k一般不会设置很大。可以通过枚举,令k从2到一个固定值如10,在每个k值上重复运行数次kmeans(避免局部最优解),并计算当前k的平均轮廓系数,最后选取轮廓系数最大的值对应的k作为最终的集群数目。

 
  • 1.
%%%%%%%%%%%Add cell breathing

%The UAV coverage radius is R_m = 100m

%The area is 100*100m

clear;clc;

lengthOfArea =400;

R_m = 100;

n_users = 50;

users_dist = rand(n_users,2) * lengthOfArea; % The coordinate of every user

min_users_inaUAV = 5;

max_users_in_aUAV = 10;

power_Trans = 40;

R_cov_all = 0;

iteration = 10;

rate = zeros(iteration,1);

SINR_threshold = -5;% The threshold is dm

Rate2_low = [];

powers_collect = [];

%plot(users_dist(:,1),users_dist(:,2)3'r*');

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%===================Kmeans===========

stop=0;

k_index_old = 0;

powers_all_old = 0;

%%%%%%% Kmeans tryting

for k_cluster = 3:40;

change = 1;

while change == 1;

height_all = zeros(k_cluster,1);

[k_index,k_center] = kmeans(users_dist,k_cluster);

if k_index == k_index_old

change = 0;

end

k_index_old = k_index;





%===================================

%%%%%%%%%%%%%%%%%%%

%----------In order to avoid too much overlapped area, I modify Kmeans by

%-----------------selecting the centroid from the middle of farthest users in

%-----------------one cluster

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%=========find the farthest distance in one cluster

max_k_all = [];

k_means_center = [];

rate = [];

users_data=[];

for k_k = 1:k_cluster

max_all = [];

users_in_acluster = zeros(size(users_dist(k_index==k_k,:),1),2);

users_in_acluster = users_dist(k_index==k_k,:);

for user_index = 1:size(users_in_acluster,1)

[max_dista, max_index]= max((users_in_acluster(user_index,1) - users_in_acluster(:,1)).^2+(users_in_acluster(user_index,2) - users_in_acluster(:,2)).^2);

max_all = [max_all;[sqrt(max_dista), user_index, max_index]];

end

[max_k,max_index]= max(max_all(:,1));

max_index_in_acluster = max_all(max_index,2:3);

coor_center = sum(users_in_acluster(max_index_in_acluster,:),1)/2;

radiusOfCenter = norm(users_in_acluster(max_index_in_acluster(1),:)-users_in_acluster(max_index_in_acluster(2),:))/2;

k_means_center = [k_means_center; coor_center radiusOfCenter];%store the locations and radius of each cluster

end

%---------------------------------The locations of clusters have been fixed----------------------

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%------------------The overlapped area---------------------------------

all_overlap_area = 0;

all_area = 0;

for k = 1:length(k_means_center)-1

for k_left = k+1:length(k_means_center)

radius_two_cluster = sqrt(sum((k_means_center(k,1:2)-k_means_center(k_left,1:2)).^2));

if radius_two_cluster < (k_means_center(k,3)+k_means_center(k_left,3))

r_1 = k_means_center(k,3); r_2 = k_means_center(k_left,3); D = radius_two_cluster;

theta_1 = acos((r_1^2+D^2-r_2^2)/(2*r_1*D));

theta_2 = acos((r_2^2+D^2-r_1^2)/(2*r_2*D));

overlap_area = theta_1*r_1^2 + theta_2*r_2^2-r_1*D*sin(theta_1);

all_overlap_area = all_overlap_area+overlap_area;

end

end

end

%---------------------------------------------------------------------

%r_distance = zeros(length(users_dist),length(k_means_center));

users_sum = 0;

%%%%%%%%%%%%%%%%%%%%%%

%--------Cell breathing to constrain the number of users.

users_overlap = 0;

for k = 1:k_cluster

r_distance = 0;

users_in_cell = users_dist(k_index==k,:);

for k_users = 1: size(users_in_cell,1)

r_distance = norm(users_in_cell(k_users,:)-k_means_center(k,1:2));

if r_distance>k_means_center(k,3)

k_means_center(k,3) = r_distance;

end


end

end

%--------------cell breathing

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%计算所有重叠部分的用户数

for k = 1:k_cluster

users_in_cell = users_dist(k_index==k,:);

for k_users = 1: size(users_in_cell,1)

for k_cluster_else = 1:k_cluster

if k_cluster_else~=k && norm(users_in_cell(k_users,:)-k_means_center(k_cluster_else,1:2))<k_means_center(k_cluster_else,3)

users_overlap = users_overlap+1;

end

end

end

end


% if length(users_in_cluster)>= max_users_in_aUAV

% [~,dis_label] = sort(r_distance(:,k));

% k_means_center(k,3) = r_distance(dis_label(max_users_in_aUAV),k);

% end

% users_sum = users_sum + sum(r_distance(:,k)<=k_means_center(k,3));


%--------------This function has been closed.

%%%%%%%%%%%%%%%%%%%%%%



%--------------For Power Allocation

index_users = users_data(:,1);

for cluster_for_p = 1:k_cluster

users_PL_cluster_p = users_data(index_users==cluster_for_p,2:end);

SINR_cluster = [];

for user_in_c_p = 1:size(users_PL_cluster_p,1)

SINR_interf = 0;

SINR_signal = powers_all(cluster_for_p) - users_PL_cluster_p(user_in_c_p,1);

count = 1;

for user_intf = 1:k_cluster

if user_intf~=cluster_for_p

count = count+1;

SINR_interf = SINR_interf + 10.^((powers_all(cluster_for_p)-users_PL_cluster_p(user_in_c_p,count))/10);

end

end

SINR_user = SINR_signal- 10*log10(SINR_interf);

SINR_cluster = [SINR_cluster; SINR_user 10*log10(SINR_interf)];


end

[min_SINR, min_user] = min(SINR_cluster(:,1));

powers_all(cluster_for_p) = SINR_threshold + SINR_cluster(min_user,2) + users_PL_cluster_p(min_user,1);

end

powers_collect = [powers_collect; powers_all'];

SINR_iter = powerAllocation(k_cluster, users_dist,k_index,powers_all,k_means_center,height_all);

end

%--------------End

%%%%%%%%%%%%%%%%

break

end

end

end %While end


if stop == 1

break

end

end

%rate(iter) = rate(iter)/k_cluster;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%---------------Calculate the new SINR

Rate2_low = powerAllocation(k_cluster, users_dist,k_index,powers_collect(size(powers_collect,1)-1,:),k_means_center,height_all);

figure; hold on; plot(rate,'r*');plot(Rate2_low,'ko');

figure; hold on;

for power_iter = 1:size(powers_collect)

plot(powers_collect(power_iter,:),'.','MarkerSize',18)

end
  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79.
  • 80.
  • 81.
  • 82.
  • 83.
  • 84.
  • 85.
  • 86.
  • 87.
  • 88.
  • 89.
  • 90.
  • 91.
  • 92.
  • 93.
  • 94.
  • 95.
  • 96.
  • 97.
  • 98.
  • 99.
  • 100.
  • 101.
  • 102.
  • 103.
  • 104.
  • 105.
  • 106.
  • 107.
  • 108.
  • 109.
  • 110.
  • 111.
  • 112.
  • 113.
  • 114.
  • 115.
  • 116.
  • 117.
  • 118.
  • 119.
  • 120.
  • 121.
  • 122.
  • 123.
  • 124.
  • 125.
  • 126.
  • 127.
  • 128.
  • 129.
  • 130.
  • 131.
  • 132.
  • 133.
  • 134.
  • 135.
  • 136.
  • 137.
  • 138.
  • 139.
  • 140.
  • 141.
  • 142.
  • 143.
  • 144.
  • 145.
  • 146.
  • 147.
  • 148.
  • 149.
  • 150.
  • 151.
  • 152.
  • 153.
  • 154.
  • 155.
  • 156.
  • 157.
  • 158.
  • 159.
  • 160.
  • 161.
  • 162.
  • 163.
  • 164.
  • 165.
  • 166.
  • 167.
  • 168.
  • 169.
  • 170.
  • 171.
  • 172.
  • 173.
  • 174.
  • 175.
  • 176.
  • 177.
  • 178.
  • 179.
  • 180.
  • 181.
  • 182.
  • 183.
  • 184.
  • 185.
  • 186.
  • 187.
  • 188.
  • 189.
  • 190.
  • 191.
  • 192.
  • 193.
  • 194.
  • 195.
  • 196.
  • 197.
  • 198.
  • 199.
  • 200.
  • 201.
  • 202.
  • 203.
  • 204.
  • 205.
  • 206.
  • 207.
  • 208.
  • 209.
  • 210.
  • 211.
  • 212.
  • 213.
  • 214.
  • 215.
  • 216.
  • 217.
  • 218.
  • 219.
  • 220.
  • 221.
  • 222.
  • 223.
  • 224.
  • 225.
  • 226.
  • 227.
  • 228.
  • 229.
  • 230.
  • 231.
  • 232.
  • 233.
  • 234.
  • 235.
  • 236.
  • 237.
  • 238.
  • 239.
  • 240.
  • 241.
  • 242.
  • 243.
  • 244.
  • 245.
  • 246.
  • 247.
  • 248.
  • 249.
  • 250.
  • 251.
  • 252.
  • 253.
  • 254.
  • 255.
  • 256.
  • 257.
  • 258.
  • 259.
  • 260.
  • 261.
  • 262.
  • 263.
  • 264.
  • 265.
  • 266.
  • 267.
  • 268.
  • 269.
  • 270.
  • 271.
  • 272.
  • 273.
  • 274.
  • 275.
  • 276.
  • 277.
  • 278.
  • 279.
  • 280.
  • 281.
  • 282.
  • 283.
  • 284.
  • 285.
  • 286.
  • 287.
  • 288.
  • 289.
  • 290.
  • 291.
  • 292.
  • 293.
  • 294.
  • 295.
  • 296.
  • 297.
  • 298.
  • 299.
  • 300.
  • 301.
  • 302.
  • 303.
  • 304.
  • 305.
  • 306.
  • 307.
  • 308.
  • 309.
  • 310.
  • 311.
  • 312.
  • 313.
  • 314.
  • 315.
  • 316.
  • 317.
  • 318.

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_布局优化_08