Matlab's built-in PCA function princomp: training & testing, and handling high-dimensional data


For an introduction to PCA and how to use the program, see the following article:

http://blog.csdn.net/watkinsong/article/details/8234766


 [COEFF,SCORE,latent] = princomp(X) returns latent, a vector containing the eigenvalues of the covariance matrix of X.

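princomp performs principal component analysis on an n-by-p data set X and returns the principal component coefficients. Each row of X is one sample (observation), and each column is one feature variable.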

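The returned COEFF is a p-by-p matrix. Each column holds the coefficients of one principal component, and the columns are ordered by decreasing component variance.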

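The returned SCORE holds the principal component scores, that is, the representation of the original X in principal component space. Each row of SCORE corresponds to one sample and each column to one principal component. Without the 'econ' option, it has the same number of rows and columns as X.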
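The relationship between X, COEFF, SCORE and latent can be checked on a small example. This is only a sketch: the data are synthetic, the sizes (50 samples, 4 features) are arbitrary assumptions, and it presumes princomp is available in your installation (newer MATLAB releases recommend pca instead).

```matlab
% Synthetic data: 50 samples (rows), 4 features (columns). Sizes are arbitrary.
rng(0);                                   % fixed seed so the run is repeatable
X = randn(50, 4) * [2 0 0 0; 0.5 1 0 0; 0 0.3 0.5 0; 0 0 0 0.1];

[COEFF, SCORE, latent] = princomp(X);

% princomp centers X by its column means before projecting.
Xc = bsxfun(@minus, X, mean(X, 1));
max_diff = max(max(abs(Xc * COEFF - SCORE)));   % difference should be rounding error only

size_coeff = size(COEFF);                 % p-by-p
size_score = size(SCORE);                 % n-by-p, same shape as X

% The variance of each SCORE column equals the matching entry of latent.
var_diff = max(abs(var(SCORE)' - latent));
```

Both max_diff and var_diff should come out at the level of floating-point rounding, which shows that SCORE is simply the mean-centered X multiplied by COEFF.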


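The returned latent is a vector of the eigenvalues in descending order. From it you can decide by hand how many leading columns to keep after dimensionality reduction, for example by looking at the cumulative fraction of variance: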

    cumsum(latent)./sum(latent)
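The cumulative ratio can be turned into a component count with find. The eigenvalues and the 0.90 threshold below are made-up values for illustration, not results from real data.

```matlab
% Hypothetical eigenvalues, sorted in descending order as princomp returns them.
latent = [60; 25; 10; 3; 2];

explained = cumsum(latent) ./ sum(latent);   % cumulative fraction of variance
% explained = [0.60; 0.85; 0.95; 0.98; 1.00]

threshold = 0.90;                            % assumed target, choose your own
k = find(explained >= threshold, 1);         % smallest k that reaches the threshold
% k = 3, so the first three columns of SCORE would be kept
```

With a real SCORE matrix the reduced data would then be SCORE(:,1:k).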


(1) Handling high-dimensional data: out of memory

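Note that when the dimension p exceeds the number of samples n, you should call [...] = princomp(X,'econ'). This speeds up the computation considerably.

Otherwise, for large p, the call fails with an out of memory error.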


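In general, if the feature dimension of each sample is far larger than the number of samples, for example a 30*1000000 data matrix (30 samples with 1,000,000 features each), add 'econ' and call princomp(x,'econ'). Without it COEFF is p-by-p, which at p = 1,000,000 would take about 8 TB in double precision, so the call often runs out of memory. With 'econ' the computation is much faster and no longer overflows memory.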


[...] = PRINCOMP(X,'econ') returns only the elements of LATENT that are
    not necessarily zero, i.e., when N <= P, only the first N-1, and the
    corresponding columns of COEFF and SCORE.  This can be significantly
    faster when P >> N.
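The effect of 'econ' on the output sizes can be seen with a scaled-down case. The sizes below (30 samples, 2000 features) are assumptions chosen so the example runs quickly; the data are random.

```matlab
n = 30;  p = 2000;            % assumed sizes, with p much larger than n
X = randn(n, p);

[coeff, score, latent] = princomp(X, 'econ');

size_coeff = size(coeff);     % p-by-(n-1): 2000-by-29 rather than 2000-by-2000
size_score = size(score);     % n-by-(n-1): 30-by-29
num_latent = numel(latent);   % n-1 = 29 eigenvalues that are not necessarily zero
```

Only n-1 components are returned because centering n samples leaves at most n-1 directions with nonzero variance.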


(2) Training and testing

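If you need to reduce the dimension of test samples with Matlab's built-in function, you must first subtract the training mean from each test sample. The training samples were mean-centered when they were reduced, so the test samples have to be centered with the same mean. Then multiply by the coeff matrix to get the reduced test data. For example, if a test sample is 1*1000000, multiplying it by a 1000000*29 projection matrix gives a 1*29 reduced test sample.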
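The recipe above can be sketched on synthetic data. All sizes and variable names here (train, test, mu) are hypothetical and chosen only for illustration.

```matlab
rng(1);                                    % fixed seed so the run is repeatable
n_train = 30;  n_test = 5;  p = 500;       % assumed sizes for illustration
train = randn(n_train, p);
test  = randn(n_test, p);

mu = mean(train, 1);                       % training mean, reused for the test set
[coeff, score_train] = princomp(train, 'econ');

% Center the test samples with the TRAINING mean, then project.
test_centered = bsxfun(@minus, test, mu);
score_test = test_centered * coeff;        % n_test-by-(n_train-1), here 5-by-29

size_test = size(score_test);
```

Using the test set's own mean instead of mu would shift the test samples relative to the training scores, so the two sets would no longer live in the same coordinates.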


My PCA code is as follows:

    %% PCA using princomp, fitted on the training data only
    % Assumes trainData, testData and numOfColumns already exist in the
    % workspace; the last column holds the label.
    % training
    train_features = trainData(:,1:numOfColumns-1);
    train_labels = trainData(:,numOfColumns);

    % PCA is applied only to columns 108 onward (the *_tfidf block).
    train_features_tfidf = train_features(:,108:size(train_features,2));
    mean_value_train_tfidf = mean(train_features_tfidf,1);  % will be used on testing data
    [coeff, score, latent, Tsquared] = princomp(train_features_tfidf,'econ');

    %% select the number of principal components
    latent = latent/sum(latent);   % fraction of variance per component

    sumRate = 0;
    selection_index = [];
    for k = 1:length(latent)
        sumRate = sumRate + latent(k);
        selection_index(k) = k;
        if sumRate > 0.95   % stop once more than 95% of the variance is covered
            break;
        end
    end

    % Through the following test, we can see that score_test and score
    % are the same (up to rounding):
    % training_features_mean = train_features_tfidf - ones(size(train_features_tfidf,1),1) * mean_value_train_tfidf;
    % score_test = training_features_mean * coeff;

    % Keeps every component; use score(:,selection_index) to keep only the selected ones.
    train_features = [train_features(:,1:107), score];

    % testing
    test_features = testData(:,1:numOfColumns-1);
    test_labels = testData(:,numOfColumns);

    test_features_tfidf = test_features(:,108:size(test_features,2));
    % Center with the TRAINING mean, then project with the training coeff.
    test_features_mean = test_features_tfidf - ones(size(test_features_tfidf,1),1) * mean_value_train_tfidf;
    test_features_score = test_features_mean * coeff;   % or test_features_mean * coeff(:,selection_index)
    test_features = [test_features(:,1:107), test_features_score];

