I recently came across the drtoolbox while working on dimensionality reduction. Thanks to http://blog.csdn.net/xiaowei_cqu for the original post; I have added some notes of my own.
1. Introduction to the MATLAB drtoolbox
The Matlab Toolbox for Dimensionality Reduction contains Matlab implementations of 38 techniques for dimensionality reduction and metric learning.
Official website:
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
This Matlab toolbox implements 32 techniques for dimensionality reduction. These techniques are all available through the COMPUTE_MAPPING function or through the GUI. The following techniques are available:
- Principal Component Analysis ('PCA')
- Linear Discriminant Analysis ('LDA')
- Multidimensional scaling ('MDS')
- Probabilistic PCA ('ProbPCA')
- Factor analysis ('FactorAnalysis')
- Sammon mapping ('Sammon')
- Isomap ('Isomap')
- Landmark Isomap ('LandmarkIsomap')
- Locally Linear Embedding ('LLE')
- Laplacian Eigenmaps ('Laplacian')
- Hessian LLE ('HessianLLE')
- Local Tangent Space Alignment ('LTSA')
- Diffusion maps ('DiffusionMaps')
- Kernel PCA ('KernelPCA')
- Generalized Discriminant Analysis ('KernelLDA')
- Stochastic Neighbor Embedding ('SNE')
- Symmetric Stochastic Neighbor Embedding ('SymSNE')
- t-Distributed Stochastic Neighbor Embedding ('tSNE')
- Neighborhood Preserving Embedding ('NPE')
- Locality Preserving Projection ('LPP')
- Stochastic Proximity Embedding ('SPE')
- Linear Local Tangent Space Alignment ('LLTSA')
- Conformal Eigenmaps ('CCA', implemented as an extension of LLE)
- Maximum Variance Unfolding ('MVU', implemented as an extension of LLE)
- Landmark Maximum Variance Unfolding ('LandmarkMVU')
- Fast Maximum Variance Unfolding ('FastMVU')
- Locally Linear Coordination ('LLC')
- Manifold charting ('ManifoldChart')
- Coordinated Factor Analysis ('CFA')
- Gaussian Process Latent Variable Model ('GPLVM')
- Autoencoders using stack-of-RBMs pretraining ('AutoEncoderRBM')
- Autoencoders using evolutionary optimization ('AutoEncoderEA')
Furthermore, the toolbox contains 6 techniques for intrinsic dimensionality estimation. These techniques are available through the function INTRINSIC_DIM. The following techniques are available:
- Eigenvalue-based estimation ('EigValue')
- Maximum Likelihood Estimator ('MLE')
- Estimator based on correlation dimension ('CorrDim')
- Estimator based on nearest neighbor evaluation ('NearNb')
- Estimator based on packing numbers ('PackingNumbers')
- Estimator based on geodesic minimum spanning tree ('GMST')
In addition to these techniques, the toolbox contains functions for prewhitening of data (the function PREWHITEN), exact and estimated out-of-sample extension (the functions OUT_OF_SAMPLE and OUT_OF_SAMPLE_EST), and a function that generates toy datasets (the function GENERATE_DATA).
The graphical user interface of the toolbox is accessible through the DRGUI function.
2. Installation
Extract the downloaded drtoolbox package into the target directory: D:\MATLAB\R2012b\toolbox
Find the file 'D:\MATLAB\R2012b\toolbox\local\pathdef.m', open it, add the toolbox path to it, and save.
Run the rehash toolboxcache command to finish loading the toolbox:
>> rehash toolboxcache
Test:
>> what drtoolbox
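If editing pathdef.m is inconvenient, the toolbox can also be put on the search path from the command window. This is a standard MATLAB alternative rather than the method from the original post, and the folder name drtoolbox is assumed to match whatever the archive extracts to:

```matlab
% Add the toolbox and all its subdirectories to the search path for this session
addpath(genpath('D:\MATLAB\R2012b\toolbox\drtoolbox'));
% Persist the change across sessions (requires write access to pathdef.m)
savepath;
```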
3. Toolbox Overview
The basic principle of dimensionality reduction is to map sample points from the input space into a low-dimensional space via a linear or nonlinear transformation, thereby obtaining a compact low-dimensional representation of the original dataset.
Basic classification of the algorithms:
Linear / Nonlinear
Linear dimensionality reduction means that the low-dimensional data obtained preserves the linear relationships among the high-dimensional data points. Linear methods mainly include PCA, LDA, and LPP (LPP is in fact a linear approximation of Laplacian Eigenmaps). Nonlinear methods fall into two classes: kernel-based ones such as KPCA, which are not discussed here, and what is usually called manifold learning: recovering a low-dimensional manifold structure from high-dimensional sampled data (assuming the data are uniformly sampled from a low-dimensional manifold embedded in a high-dimensional Euclidean space), i.e. finding the low-dimensional manifold hidden in the high-dimensional space together with the corresponding embedding map. Nonlinear manifold learning methods include Isomap, LLE, Laplacian Eigenmaps, LTSA, and MVU.
Overall, linear methods are fast to compute and low in complexity, but they perform poorly when reducing complex data.
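The linear/nonlinear contrast can be seen directly with the toolbox on a manifold-shaped dataset. The sketch below uses the toolbox's generate_data and compute_mapping functions; the dataset name 'swiss' and the neighborhood size k = 12 are my own illustrative choices:

```matlab
% Generate a swiss roll: a 2-D manifold curled up in 3-D space
[X, labels] = generate_data('swiss', 2000);
% Linear method: PCA can only find a flat linear 2-D projection
mappedPCA = compute_mapping(X, 'PCA', 2);
% Nonlinear manifold method: LLE tries to unroll the manifold (k = 12 neighbors)
mappedLLE = compute_mapping(X, 'LLE', 2, 12);
```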
Supervised / Unsupervised
The main difference between supervised and unsupervised learning is whether the data samples carry class labels. Unsupervised reduction methods aim to minimize the loss of information during reduction, e.g. PCA, LPP, Isomap, LLE, Laplacian Eigenmaps, LTSA, and MVU; supervised methods aim to maximize the discriminative information between classes, e.g. LDA. In fact, for each unsupervised algorithm there is corresponding research on supervised or semi-supervised variants.
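For supervised techniques such as LDA, drtoolbox reads the class labels from the first column of the data matrix (this convention is described in the toolbox's Readme; the quantization of labels below is my own illustrative step, since LDA needs discrete classes):

```matlab
% Supervised reduction with LDA: class labels go in the first column of X
[X, labels] = generate_data('helix', 2000);
classes = round(labels);                      % quantize to discrete class labels
mappedX = compute_mapping([classes X], 'LDA', 1);
```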
Global / Local
Local methods consider only local information about the sample set, i.e. the relationship between each data point and its neighbors. LLE is the representative local method; others include Laplacian Eigenmaps, LPP, and LTSA.
Global methods consider not only the local geometry of the samples but also global information about the sample set, i.e. the relationships between each point and its non-neighboring points. Global algorithms include PCA, LDA, Isomap, and MVU.
Because local methods do not consider relationships between samples that lie far apart on the data manifold, they cannot ensure that samples far apart on the manifold also end up far apart in the reduced space.
4. Using the Toolbox
The interface functions intended for users sit in the same directory as the toolbox's Readme file, chiefly compute_mapping, intrinsic_dim, out_of_sample, out_of_sample_est, prewhiten, and generate_data.
Usage example:
clc
clear
close all
% Generate test data
[X, labels] = generate_data('helix', 2000);
figure
scatter3(X(:,1), X(:,2), X(:,3), 5, labels)
title('Original dataset')
drawnow
% Estimate the intrinsic dimensionality
no_dims = round(intrinsic_dim(X, 'MLE'));
disp(['MLE estimate of intrinsic dimensionality: ' num2str(no_dims)]);
% Reduce dimensionality with PCA
[mappedX, mapping] = compute_mapping(X, 'PCA', no_dims);
figure
scatter(mappedX(:,1), mappedX(:,2), 5, labels)
title('Result of PCA')
% Reduce dimensionality with Laplacian Eigenmaps
[mappedX, mapping] = compute_mapping(X, 'Laplacian', no_dims, 7);
figure
scatter(mappedX(:,1), mappedX(:,2), 5, labels(mapping.conn_comp))
title('Result of Laplacian Eigenmaps')
drawnow
% Reduce dimensionality with Isomap
[mappedX, mapping] = compute_mapping(X, 'Isomap', no_dims);
figure
scatter(mappedX(:,1), mappedX(:,2), 5, labels(mapping.conn_comp))
title('Result of Isomap')
drawnow
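A mapping returned by compute_mapping can also be applied to points that were not in the training set, via the out-of-sample functions mentioned in section 1. The following sketch assumes the out_of_sample(points, mapping) and out_of_sample_est(points, X, mappedX) signatures from the toolbox Readme; the sample sizes are my own choices:

```matlab
% Train a mapping, then project fresh points from the same manifold
[X, labels] = generate_data('helix', 2000);
[mappedX, mapping] = compute_mapping(X, 'PCA', 2);
Xnew = generate_data('helix', 200);
% Exact out-of-sample extension (available for linear techniques such as PCA)
mappedNew = out_of_sample(Xnew, mapping);
% For techniques without an exact extension, estimate it from the training pair
mappedNewEst = out_of_sample_est(Xnew, X, mappedX);
```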