看了西关书的聚类算法,算法原理很容易明白,接下来就是整理成自己的理解思路,然后一步一步来实现算法,那么就来做吧。
hierarchyClustering算法
- 思想
不同层次对数据集进行划分,AGENS是一种自底而上的聚合策略。首先将数据集中的每个样本看作初始聚类簇,算法运行中找出距离最近的两个聚类簇进行合并,该过程不断重复直到预设的聚类簇个数。
- 算法步骤
输入:样本集D(m,n),聚类簇个数k
输出:聚类簇集C
Step1 初始化聚类簇集合,距离矩阵 init_hierarchy()
输入:样本集D; 输出:聚类簇集,距离矩阵
Step1.1 聚类簇集: 直接将原始数据集赋值给聚类簇集合。(聚类簇类型:结构体,包括样本集合,索引)
Step1.2计算距离矩阵(m,m-1),第一行表示第一个样本与其他样本的距离值(子函数:聚类簇距离度量函数d, get_dist(C1_data,C2_data))
Step2 聚类, hierarchy_clustering()
输入:聚类簇个数k,聚类簇集C,距离矩阵;输出:k个聚类簇集
While 当前聚类集中个数q>k
Step2.1 根据距离函数寻找最近的两个聚类簇Ci, Cj,合并Ci = Ci∪Cj(set_)
Stepg2.2更新聚类簇集合:删除第j个聚类簇,j:end的聚类簇索引依次-1
Step2.3更新距离矩阵M:删除M的第j行第j列,更新i行j(1:q-1)列,j(1:q-1)行i列
q=q-1
end while
层次聚类代码下载链接
ok啦,接下来不废话上代码(Matlab发布形式)
function Main()
clc;clear;close all
melon_data = load('melon4.0.txt');
melon_data(:,1)=[];k=4;
%step1
[dist_matrix,cluster_set]= init_hierarchy(melon_data);
%step2
cluster_ret = hierarchy_clustering(dist_matrix,cluster_set,k);
show(cluster_ret);
show_cluster_tree(melon_data);
end
subfunction
%step1
function [dist_matrix,cluster_set]= init_hierarchy(melon_data)
cluster_set=[];
[m,~] =size(melon_data);
dist_matrix = zeros(m,m);
for i=1:m
cluster.ind = i;
cluster.data = melon_data(i,:);
cluster_set = [cluster_set;cluster];
end
for i = 1:m
for j = i+1:m
dist_matrix(i,j) =get_dist(cluster_set(i),cluster_set(j));
dist_matrix(j,i) =dist_matrix(i,j);
end
end
end
%get two cluster dis
function dist =get_dist(cluster1,cluster2)
[m1,~]=size(cluster1.data);%2x2
dist = 10000;
for i =1:m1
tm_dist = pdist2(cluster2.data,cluster1.data(i,:));
[min_value,~]=min(tm_dist);
if min_value < dist
dist = min_value;
end
end
end
%step2
function cluster = hierarchy_clustering(dist_matrix,cluster_set,k)
while cluster_set(end).ind >k
[Ci,Cj]= find_closest_cluster(dist_matrix); %Ci<Cj
cluster_set = update_cluster_set(Ci,Cj,cluster_set);
dist_matrix = update_dist_matrix(Ci,Cj,cluster_set,dist_matrix);
end
cluster = cluster_set;
end
function [Ci,Cj]= find_closest_cluster(dist_matrix)
[m,n] = size(dist_matrix);
min_dist = 10000;
start=2;Ci=0;Cj=0;
for i = 1:m-1
[tem_dist,min_ind] = min(dist_matrix(i,start:n));
if tem_dist<min_dist
min_dist=tem_dist;
Ci =i;
Cj = min_ind+start-1;
end
start = start+1;
end
if Ci>Cj
tem_c = Cj;
Cj =Ci;
Ci =tem_c;
end
end
function cluster = update_cluster_set(Ci,Cj,cluster_set)
for i =Cj+1:length(cluster_set)
cluster_set(i).ind = cluster_set(i).ind -1;%update
end
% 合并聚类簇,序号小的 合并 序号大的 Ci < Cj
cluster_set(Ci).data = [cluster_set(Ci).data;cluster_set(Cj).data];
cluster_set(Cj)=[]; %delete
cluster = cluster_set;
end
function dist_m = update_dist_matrix(Ci,Cj,cluster_set,dist_matrix)
%step2.3
dist_matrix(:,Cj)=[];
dist_matrix(Cj,:)=[];
%updat Ci行,1:cluster_set(end).ind 列
for j = 1:cluster_set(end).ind
dist_matrix(Ci,j) = get_dist(cluster_set(Ci),cluster_set(j));
dist_matrix(j,Ci) = dist_matrix(Ci,j);
end
dist_m = dist_matrix;
end
function show(cluster_ret)
plot(cluster_ret(1).data(:,1),cluster_ret(1).data(:,2),'or');hold on
plot(cluster_ret(2).data(:,1),cluster_ret(2).data(:,2),'og');
plot(cluster_ret(3).data(:,1),cluster_ret(3).data(:,2),'sr');
plot(cluster_ret(4).data(:,1),cluster_ret(4).data(:,2),'sg');
xlabel('density');ylabel('sugar rate');
end
function show_cluster_tree(melon_data)
Y=pdist(melon_data);
Z = linkage(Y,'single');
dendrogram(Z)
xlabel('sample number');ylabel('cluster distance');
end