How do you run PCA on data with more than 20,000 feature dimensions?
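Conventional PCA algorithms run into limits when applied to very high-dimensional data sets, such as gene expression data in bioinformatics. Usually, though, we only need the first two or three principal components to visualize the data. To extract just the first k components, we can use PPCA (probabilistic PCA), which is based on sensible principal components analysis. One option is the modified PCA Matlab script [ppca.m], which can also be applied to incomplete data sets.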
data = rand(100,10); % artificial data set of 100 variables (e.g., genes) and 10 samples
[pc,W,data_mean,xr,evals,percentVar] = ppca(data,3); % extract the first 3 components; download: ppca.m
plot(pc(1,:),pc(2,:),'.'); % scatter of PC 1 scores against PC 2 scores
title('{\bf PCA}');
xlabel(['PC 1 (',num2str(round(percentVar(1)*10)/10),'%)']); % variance explained, rounded to 0.1%
ylabel(['PC 2 (',num2str(round(percentVar(2)*10)/10),'%)']);
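To illustrate why extracting only the first k components stays cheap in high dimensions, here is a minimal sketch of the EM iteration for PCA, the idea that sensible principal components analysis builds on. It is not the contents of ppca.m, and it assumes complete data with no missing values. The variable names, the fixed iteration count, and the random seed are illustrative choices, not taken from the script above.

```matlab
% EM iterations for PCA: a sketch under the assumptions stated above.
rng(0);                                   % fixed seed so the run is repeatable
data = rand(2000, 10);                    % variables (rows) x samples (columns)
k = 3;                                    % number of leading components wanted
[d, n] = size(data);

Y = bsxfun(@minus, data, mean(data, 2));  % center each variable across samples
C = randn(d, k);                          % random initial d x k basis

% Each pass costs O(d*n*k); the d x d covariance matrix is never formed.
% A fixed count is used for brevity; a convergence check is better in practice.
for iter = 1:200
    X = (C' * C) \ (C' * Y);              % E-step: k x n latent coordinates
    C = (Y * X') / (X * X');              % M-step: updated d x k basis
end

% C spans the estimated principal subspace but is not orthonormal or ordered.
[Q, ~] = qr(C, 0);                        % orthonormal basis of that subspace
[U, S, ~] = svd(Q' * Y, 'econ');          % small k x n problem inside the subspace
W = Q * U;                                % d x k directions, ordered by variance
pc = W' * Y;                              % k x n scores, one column per sample
evals = diag(S).^2 / (n - 1);             % variance captured by each component
percentVar = 100 * evals / (sum(Y(:).^2) / (n - 1));
```

The resulting pc and percentVar can be plotted the same way as in the snippet above. With slowly converging data (eigenvalues close together), more iterations may be needed before W matches the leading components from a full decomposition.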
Update: A new Matlab package by Alexander Ilin provides a collection of PCA algorithms for high-dimensional data, including data with missing values (Ilin and Raiko, 2010).
GNU R: For probabilistic PCA (PPCA) in GNU R, see the Bioconductor package pcaMethods, also published in Bioinformatics by W. Stacklies et al.
1. References:
http://www.nlpca.org/pca-principal-component-analysis-matlab.html