Adaptive Spectral Clustering
Spectral clustering commonly builds its similarity matrix from a K-nearest-neighbor graph. The implementation below is adapted from reference code. First, compute the KNN matrix from the input data and the value K: get_knn_distance.m
function [A]=get_knn_distance(data,K)
% Input : data : N-by-D data matrix, where N is the number of points
%                and D is the number of dimensions
%         K    : number of nearest neighbors
%
% Output : A : sparse symmetric KNN distance matrix
dataT = data';
[n,d] = size(data);
dist_matrix = dist(dataT); % pairwise Euclidean distance matrix
% Find the K nearest neighbors
[value, index] = sort(dist_matrix, 1); % sort each column ascending: value = sorted distances, index = original row numbers
KNNindex = index(2:K+1, :); % drop the self-distance (0); indices of the K nearest neighbors
rowindex = reshape(KNNindex, size(KNNindex, 1)*size(KNNindex, 2), 1); % neighbor indices as an n*K-by-1 vector
tempindex = repmat(1:n, K, 1); % K-by-n matrix of column numbers
columnindex = reshape(tempindex, size(tempindex, 1)*size(tempindex, 2), 1); % column numbers as an n*K-by-1 vector
tempvalue = value(2:K+1, :); % drop the self-distance (0)
value1 = reshape(tempvalue, size(tempvalue, 1)*size(tempvalue, 2), 1); % neighbor distances as an n*K-by-1 vector
% Assemble the sparse KNN matrix
A= sparse(rowindex, columnindex, double(value1),n,n);
A1 = triu(A); % upper triangular part
A1 = A1 + A1';
A2 = tril(A); % lower triangular part
A2 = A2 + A2';
A = max(A1, A2); % symmetrize: keep an edge if it exists in either direction
%A = full(A);
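For readers outside MATLAB, here is a NumPy sketch of the same symmetric KNN distance graph (the function name knn_distance_graph is mine):

```python
import numpy as np

def knn_distance_graph(data, k):
    """Symmetric KNN distance matrix: A[i, j] is the Euclidean distance
    between i and j if either is among the other's k nearest neighbors."""
    n = data.shape[0]
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))      # pairwise distances
    order = np.argsort(dist, axis=1)
    knn_idx = order[:, 1:k + 1]                  # drop self (distance 0)
    A = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    A[rows, knn_idx.ravel()] = dist[rows, knn_idx.ravel()]
    return np.maximum(A, A.T)                    # symmetrize, like max(A1, A2)
```

The final `np.maximum(A, A.T)` plays the role of the triu/tril dance in the MATLAB code: the KNN relation is not symmetric, so an edge is kept whenever it appears in either direction.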
sc.m computes the eigenvectors associated with the num_clusters smallest eigenvalues of the normalized Laplacian (equivalently, the largest eigenvalues of D^(-1/2) * S * D^(-1/2)). For spectral clustering, sigma is a very sensitive hyperparameter; a large body of work derives it adaptively from local density, and interested readers can look those papers up.
function [cluster_labels, evd_time, kmeans_time, total_time] = sc(A, sigma_matrix, num_clusters)
%SC Spectral clustering using a sparse similarity matrix (t-nearest-neighbor).
%
% Input  : A : N-by-N sparse distance matrix, where
%              N is the number of data points
% sigma_matrix : sigma value used in computing similarity,
% num_clusters : number of clusters
%
% Output : cluster_labels : N-by-1 vector containing cluster labels
% evd_time : running time for eigendecomposition
% kmeans_time : running time for k-means
% total_time : total running time
%
% Convert the sparse distance matrix to a sparse similarity matrix,
% where S = exp(-A.^2 / (2*sigma^2)).
% Note: This step can be skipped if A is already a sparse similarity matrix.
%
disp('Converting distance matrix to similarity matrix...');
tic;
A = A.*A;
%A = -A/(2*sigma*sigma); % negated, so we seek the largest eigenvalues
%disp(A)
A = -A./sigma_matrix; % sigma_matrix holds the locally adaptive scaling parameters
%disp(A)
n=size(A,1);
% Compute the similarity matrix (exp applied to nonzero entries only)
S= spfun(@exp, A);
%disp(S)
clear A;
toc;
%
% Do laplacian, L = D^(-1/2) * S * D^(-1/2)
%
disp('Doing Laplacian...');
D = sum(S, 2) + (1e-10);
D = sqrt(1./D); % D^(-1/2)
D = spdiags(D, 0, n, n);
%disp(D)
L = D * S * D;
clear D S;
time1 = toc;
%
% Do eigendecomposition, if L =
% D^(-1/2) * S * D(-1/2) : set 'LM' (Largest Magnitude), or
% I - D^(-1/2) * S * D(-1/2): set 'SM' (Smallest Magnitude).
%L = eye(n)-(D^(-1/2) * S * D^(-1/2)); % normalized Laplacian
%
disp('Performing eigendecomposition...');
OPTS.disp = 0;
[V, val] = eigs(L, num_clusters, 'LM', OPTS); % V: eigenvectors, val: diagonal matrix of eigenvalues
time2 = toc;
%
% Do k-means
%
disp('Performing kmeans...');
% Normalize each row to be of unit length
sq_sum = sqrt(sum(V.*V, 2)) + 1e-20;
U = V ./ repmat(sq_sum, 1, num_clusters);
clear sq_sum V;
cluster_labels = k_means(U, [], num_clusters);
total_time = toc;
%
% Calculate and show time statistics
%
evd_time = time2 - time1;
kmeans_time = total_time - time2;
disp('Finished!');
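The pipeline in sc.m can be sketched in Python with NumPy and SciPy (spectral_cluster is a hypothetical name; a dense `numpy.linalg.eigh` stands in for `eigs`, and `scipy.cluster.vq.kmeans2` for the custom k-means):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(S, num_clusters, rng_seed=0):
    """Normalized spectral clustering on a precomputed similarity matrix S.

    Mirrors sc.m: L = D^(-1/2) * S * D^(-1/2), take the top eigenvectors,
    row-normalize the embedding, then run k-means on it.
    """
    d = S.sum(axis=1) + 1e-10                    # degrees, guarded against zero
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    V = vecs[:, -num_clusters:]                  # top num_clusters eigenvectors
    U = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-20)
    np.random.seed(rng_seed)
    _, labels = kmeans2(U, num_clusters, minit='++')
    return labels
```

Because the sign is flipped via `-A./sigma_matrix` before the exponential in sc.m, the largest eigenvalues of L are the ones wanted, which here correspond to the last columns returned by `eigh`.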
As is well known, k-means is sensitive to the choice of initial points; the k_means routine used here simply selects its initialization scheme based on the value of centers.
function cluster_labels = k_means(data, centers, num_clusters)
%K_MEANS Euclidean k-means clustering algorithm.
%
% Input : data : N-by-D data matrix, where N is the number of data,
% D is the number of dimensions
% centers : K-by-D matrix, where K is num_clusters, or
% 'random', random initialization, or
% [], empty matrix, orthogonal initialization
% num_clusters : Number of clusters
%
% Output : cluster_labels : N-by-1 vector of cluster assignment
%
% Reference: Dimitrios Zeimpekis, Efstratios Gallopoulos, 2006.
% http://scgroup.hpclab.ceid.upatras.gr/scgroup/Projects/TMG/
%
% Parameter setting
%
iter = 0;
qold = inf;
threshold = 0.001;
%
% Check if with initial centers
%
if strcmp(centers, 'random')
disp('Random initialization...');
centers = random_init(data, num_clusters);
elseif isempty(centers)
disp('Orthogonal initialization...');
centers = orth_init(data, num_clusters);
end
%
% Double type is required for sparse matrix multiply
%
data = double(data);
centers = double(centers);
%
% Calculate the distance (square) between data and centers
%
n = size(data, 1);
x = sum(data.*data, 2)'; % 1-by-n
X = x(ones(num_clusters, 1), :); % num_clusters-by-n
y = sum(centers.*centers, 2); % num_clusters-by-1
Y = y(:, ones(n, 1)); % num_clusters-by-n
P = X + Y - 2*centers*data'; % squared distances: ||x||^2 + ||c||^2 - 2*c.x
%
% Main program
%
while 1
iter = iter + 1;
% Find the closest cluster for each data point
[val, ind] = min(P, [], 1); % row vector: for each column, the row index of the minimum
% Sum up data points within each cluster
P = sparse(ind, 1:n, 1, num_clusters, n); % assignment matrix: cluster index vs. data index
centers = P*data; % per-cluster sums of the assigned points
% Size of each cluster, for cluster whose size is 0 we keep it empty
cluster_size = P*ones(n, 1); % number of points in each cluster
% For empty clusters, initialize again
zero_cluster = find(cluster_size==0); % indices of empty clusters
if length(zero_cluster) > 0 % re-initialize any empty cluster centers
disp('Zero centroid. Initialize again...');
centers(zero_cluster, :)= random_init(data, length(zero_cluster));
cluster_size(zero_cluster) = 1;
end
% Update centers
centers = spdiags(1./cluster_size, 0, num_clusters, num_clusters)*centers;
% Update distance (square) to new centers
y = sum(centers.*centers, 2);
Y = y(:, ones(n, 1));
P = X + Y - 2*centers*data';
% Calculate objective function value
qnew = sum(sum(sparse(ind, 1:n, 1, size(P, 1), size(P, 2)).*P));
mesg = sprintf('Iteration %d:\n\tQold=%g\t\tQnew=%g', iter, full(qold), full(qnew));
disp(mesg);
% Check if objective function value is less than/equal to threshold
if threshold >= abs((qnew-qold)/qold)
mesg = sprintf('\nkmeans converged!');
disp(mesg);
break;
end
qold = qnew;
end
cluster_labels = ind';
%----------------------------------------------------------
function init_centers = random_init(data, num_clusters)
%RANDOM_INIT Initialize centroids choosing num_clusters rows of data at random
%
% Input : data : N-by-D data matrix, where N is the number of data,
% D is the number of dimensions
% num_clusters : Number of clusters
%
% Output: init_centers : K-by-D matrix, where K is num_clusters
rand('twister', sum(100*clock));
init_centers = data(ceil(size(data, 1)*rand(1, num_clusters)), :);
function init_centers = orth_init(data, num_clusters)
%ORTH_INIT Initialize orthogonal centers for k-means clustering algorithm.
%
% Input : data : N-by-D data matrix, where N is the number of data,
% D is the number of dimensions
% num_clusters : Number of clusters
%
% Output: init_centers : K-by-D matrix, where K is num_clusters
%
% Find the num_clusters centers which are orthogonal to each other
%
Uniq = unique(data, 'rows'); % Avoid duplicate centers
num = size(Uniq, 1);
first = ceil(rand(1)*num); % Randomly select the first center
init_centers = zeros(num_clusters, size(data, 2)); % Storage for centers
init_centers(1, :) = Uniq(first, :);
Uniq(first, :) = [];
c = zeros(num-1, 1); % Accumulated orthogonal values to existing centers for non-centers
% Find the rest num_clusters-1 centers
for j = 2:num_clusters
c = c + abs(Uniq*init_centers(j-1, :)');
[minimum, i] = min(c); % Select the most orthogonal one as next center
init_centers(j, :) = Uniq(i, :);
Uniq(i, :) = [];
c(i) = [];
end
clear c Uniq;
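The distance update in the main loop relies on the expansion ||x - c||^2 = ||x||^2 + ||c||^2 - 2*c.x, which replaces an explicit double loop with matrix products. A NumPy sketch of the same trick (pairwise_sq_dists is my name):

```python
import numpy as np

def pairwise_sq_dists(data, centers):
    """(num_clusters, n) matrix of squared Euclidean distances, computed as
    ||c||^2 + ||x||^2 - 2 * c . x, matching P = X + Y - 2*centers*data'
    in k_means.m."""
    x = (data ** 2).sum(axis=1)      # (n,)  squared norms of the data points
    y = (centers ** 2).sum(axis=1)   # (k,)  squared norms of the centers
    return y[:, None] + x[None, :] - 2.0 * centers @ data.T
```

Note that the term x (the squared norms of the data points) never changes across iterations, which is why k_means.m computes X once before the loop and only refreshes Y and the cross term.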
This section implements adaptive spectral clustering by replacing the 2*sigma*sigma term of the Gaussian kernel, following references [3][4]. Reference [3] uses each point's distance to its sigma_k-th nearest neighbor as its own sigma; reference [4] builds on [3] by also counting shared nearest neighbors:
Sim(x_i, x_j) = exp(-d(x_i, x_j)^2 / (2*sigma^2)),
Sim(x_i, x_j) = exp(-d(x_i, x_j)^2 / (2*sigma_i*sigma_j)),
Sim(x_i, x_j) = exp(-d(x_i, x_j)^2 / (2*sigma_i*sigma_j*(SNN+1))).
function [sigma_matrix]=density_adaptive_sigma(data,sigma_k,sigma_type)
%
% by wangboajia 2017.4.14
% input : data       : N-by-D data matrix
%         sigma_k    : number of nearest neighbors
%         sigma_type : string  'KNN' --> distance to the sigma_k-th nearest neighbor
%                              'SNN' --> sigma_k-th neighbor + shared nearest neighbors
%                      numeric       --> constant sigma assigned directly
% output: sigma_matrix : matrix of adaptive scaling parameters
dataT = data';
[n,d] = size(data);
dist_matrix = dist(dataT); % pairwise Euclidean distance matrix
% Find the sigma_k nearest neighbors
[value, index] = sort(dist_matrix, 1); % sort each column ascending: value = sorted distances, index = original row numbers
KNNindex = index(2:sigma_k+1, :); % drop the self-distance (0); indices of the sigma_k nearest neighbors
KNNvalue = value(2:sigma_k+1, :); % drop the self-distance (0); the corresponding distances
if (ischar(sigma_type))
if strcmp(sigma_type,'KNN')
sigma_matrix=sigma_knn(KNNvalue,sigma_k,n);
elseif strcmp(sigma_type,'SNN')
sigma_matrix=sigma_snn(KNNvalue,KNNindex,sigma_k,n);
end
elseif ( isnumeric(sigma_type))
sigma_matrix=sigma_num(sigma_type,n);
end
%----------------------------------------------------------
% sigma_i = distance from point i to its sigma_k-th nearest neighbor
function sigma_matrix=sigma_knn(KNNvalue,sigma_k,n)
sigma_knn_value = KNNvalue(sigma_k,:);
sigma_matrix=sigma_knn_value'*sigma_knn_value*2;
%----------------------------------------------------------
% shared-neighbor count among the sigma_k nearest neighbors of each pair
function sigma_matrix=sigma_snn(KNNvalue,KNNindex,sigma_k,n)
%% number of shared nearest neighbors
sigma_snn=zeros(n,n);
for i = 1:n-1
for j = i+1:n
sigma_snn(i,j) = size(intersect( KNNindex(:,i),KNNindex(:,j)),1);
end
end
temp = triu(sigma_snn); % upper triangular part
sigma_snn = temp + temp';
sigma_knn_value = KNNvalue(sigma_k,:);
sigma_matrix=sigma_knn_value'*sigma_knn_value*2.*(sigma_snn+1);
%----------------------------------------------------------
%% constant sigma for every pair
function sigma_matrix=sigma_num(sigma_type,n)
sigma_matrix=ones(n,n);
sigma_matrix=sigma_matrix*sigma_type*sigma_type*2; % 2*sigma^2, consistent with exp(-d^2/(2*sigma^2))
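A NumPy sketch of the three sigma_type branches (adaptive_sigma_matrix is my name; the numeric branch also includes the factor of 2, so every variant divides d^2 by a 2*sigma^2-style denominator):

```python
import numpy as np

def adaptive_sigma_matrix(data, sigma_k, sigma_type='KNN'):
    """Denominator matrix for the Gaussian kernel, selected by sigma_type."""
    n = data.shape[0]
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    order = np.argsort(dist, axis=1)
    # sigma_i: distance from point i to its sigma_k-th nearest neighbor
    sigma = dist[np.arange(n), order[:, sigma_k]]
    if sigma_type == 'KNN':
        return 2.0 * np.outer(sigma, sigma)              # 2*sigma_i*sigma_j
    if sigma_type == 'SNN':
        knn_idx = order[:, 1:sigma_k + 1]                # neighbor lists, self excluded
        snn = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                shared = len(set(knn_idx[i]) & set(knn_idx[j]))
                snn[i, j] = snn[j, i] = shared
        return 2.0 * np.outer(sigma, sigma) * (snn + 1.0)
    # numeric sigma: constant 2*sigma^2 for every pair
    return np.full((n, n), 2.0 * float(sigma_type) ** 2)
```

The outer product reproduces the MATLAB `sigma_knn_value'*sigma_knn_value*2` in one step, and the SNN branch mirrors the pairwise `intersect` loop.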
Test script: testspectral.m
load('two.txt');
dataSet=two;
num_clusters=2;
[n,d]=size(dataSet);
K=10; % K can range from 1 to n-1; K=n-1 degenerates to plain spectral clustering
[A]=get_knn_distance(dataSet,K); % KNN distance matrix
sigma_k=7; % neighbor count for the adaptive sigma; the paper recommends 7
sigma_type='SNN'; % 'KNN' omits the shared-neighbor count, 'SNN' includes it; a numeric value such as 0.1 sets a constant sigma
%sigma_type='KNN';
%sigma_type=0.1;
[sigma_matrix]=density_adaptive_sigma(dataSet,sigma_k,sigma_type); % adaptive sigma matrix
[cluster_labels, evd_time, kmeans_time, total_time] = sc(A, sigma_matrix, num_clusters); % spectral clustering
disp(['Number of data samples: ',num2str(n)]);
disp(['Time for the top K eigenvalues and eigenvectors: ',num2str(evd_time)]);
disp(['K-means clustering time: ',num2str(kmeans_time)]);
disp(['Total spectral clustering time: ',num2str(total_time)]);
plot(dataSet(cluster_labels==1,1),dataSet(cluster_labels==1,2),'r+', dataSet(cluster_labels==2,1),dataSet(cluster_labels==2,2),'b.');
Data and code download: here
References:
[1] Related MATLAB functions
[2] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Chang. Parallel Spectral Clustering in Distributed Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 3, pp. 568-586, March 2011. (source code download)
[3] Zelnik-Manor, L., Perona, P., 2004. Self-tuning spectral clustering. In Proceedings of NIPS'04, pp. 1601-1608.
[4] Local density adaptive similarity measurement for spectral clustering. Pattern Recognition Letters 32 (2011) 352-358.