Explaining "total variance" in PCA

While reading the LMNN paper recently, I noticed that in the first step of their experiments the authors apply PCA to reduce the dimensionality of the high-dimensional sample features. On how to choose the target (lower) dimensionality, the method they give is: "account for 95% of its total variance."


So what does "total variance" mean here? A quick Google search turned up the following article, which explains it well:

http://support.sas.com/publishing/pubcat/chaps/55129.pdf


It contains this passage:

What is meant by “total variance” in the data set?  To understand the meaning of “total
variance” as it is used in a principal component analysis, remember that the observed
variables are standardized in the course of the analysis.  This means that each variable is
transformed so that it has a mean of zero and a variance of one.  The “total variance” in the
data set is simply the sum of the variances of these observed variables.  Because they have
been standardized to have a variance of one, each observed variable contributes one unit of
variance to the “total variance” in the data set.  Because of this, the total variance in a
principal component analysis will always be equal to the number of observed variables
being analyzed.  For example, if seven variables are being analyzed, the total variance will
equal seven.  The components that are extracted in the analysis will partition this variance: 
perhaps the first component will account for 3.2 units of total variance; perhaps the second
component will account for 2.1 units.  The analysis continues in this way until all of the
variance in the data set has been accounted for.
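The claim in the quoted passage is easy to verify numerically: after standardization every variable has variance one, so the sum of the per-variable variances equals the number of variables. A minimal sketch with synthetic data (the array shapes and values are illustrative, not from the original analysis):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 200 samples, 7 variables with deliberately unequal scales
X = rng.normal(size=(200, 7)) * [1, 2, 3, 4, 5, 6, 7]

X_std = StandardScaler().fit_transform(X)  # each column: mean 0, variance 1
total_variance = X_std.var(axis=0).sum()   # sum of per-variable variances
print(total_variance)                      # ≈ 7.0, the number of variables
```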


The key point: "the total variance in a principal component analysis will always be equal to the number of observed variables." The example that follows also shows that what matters is the ranking of the eigenvalues: say the original feature space has 20 dimensions. PCA yields all 20 eigenvalues (sorted in descending order); find the smallest N such that the sum of the first N eigenvalues just exceeds 95% of the sum of all 20 eigenvalues. That N is the target dimensionality for the reduction.
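The selection rule above can be sketched in a few lines of scikit-learn. The data here is synthetic and the 95% threshold is the one from the paper; note that `PCA(n_components=0.95)` implements the same rule automatically when given a float between 0 and 1:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 500 samples of 20 correlated features (synthetic stand-in data)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)  # keep all 20 components first

# cumulative share of total variance explained by the first k components
cum = np.cumsum(pca.explained_variance_ratio_)
# smallest N whose cumulative ratio reaches 95%
n = int(np.searchsorted(cum, 0.95) + 1)
print(n, cum[n - 1])
```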

A reader question with code (comments translated; the bug the question asks about is left in place):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('H:/analysis_results/mean_HN.csv')
data.head()
x = data.iloc[:, 1:7]
y = data.iloc[:, 6]
scaler = StandardScaler()
scaler.fit(x)
x_scaler = scaler.transform(x)
print(x_scaler.shape)
pca = PCA(n_components=3)
x_pca = pca.fit_transform(x_scaler)
print(x_pca.shape)
# Inspect each principal component's variance and its share of the total variance
# The first 2 components already explain 90% of the variation in the samples
print('explained_variance_:', pca.explained_variance_)
print('explained_variance_ratio_:', pca.explained_variance_ratio_)
print('total explained variance ratio of first 6 principal components:', sum(pca.explained_variance_ratio_))
# Save the analysis results as a dictionary
result = {
    'explained_variance_:', pca.explained_variance_,
    'explained_variance_ratio_:', pca.explained_variance_ratio_,
    'total explained variance ratio:', np.sum(pca.explained_variance_ratio_)}
df = pd.DataFrame.from_dict(result, orient='index', columns=['value'])
df.to_csv('H:/analysis_results/Cluster analysis/pca_explained_variance_HN.csv')
# Visualize the variance contributed by each principal component
#fig1 = plt.figure(figsize=(10, 10))
#plt.rcParams['figure.dpi'] = 300  # set the figure DPI
plt.rcParams['path.simplify'] = False  # disable path simplification
plt.figure()
plt.plot(np.arange(1, 4), pca.explained_variance_, color='blue', linestyle='-', linewidth=2)
plt.xticks(np.arange(1, 4, 1))  # set the x-axis tick interval to 1
plt.title('PCA_plot_HN')
plt.xlabel('components_n', fontsize=16)
plt.ylabel('explained_variance_', fontsize=16)
#plt.savefig('H:/analysis_results/Cluster analysis/pca_explained_variance_HN.png')
plt.show()

Running this raises "'numpy.float64' object is not iterable". How should it be fixed?
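Two problems stand out in the `result = {...}` block (an assumption, since the full traceback isn't shown): the braces with commas throughout create a set literal rather than a dict, and `DataFrame.from_dict(orient='index')` needs each value to be row-like (iterable), so a bare scalar such as `np.sum(...)` cannot be turned into a row. A sketch of a fix, using illustrative stand-in values in place of the original CSV data:

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for pca.explained_variance_ etc.
explained_variance = np.array([2.1, 1.3, 0.6])
explained_variance_ratio = np.array([0.52, 0.33, 0.15])

# Use key: value pairs (a dict), not comma-separated items (a set),
# and wrap each value in a list so every row is iterable.
result = {
    'explained_variance_': [explained_variance.tolist()],
    'explained_variance_ratio_': [explained_variance_ratio.tolist()],
    'total explained variance ratio': [float(np.sum(explained_variance_ratio))],
}
df = pd.DataFrame.from_dict(result, orient='index', columns=['value'])
print(df)
df.to_csv('pca_explained_variance_HN.csv')  # illustrative output path
```

With the dict syntax and list-wrapped values, `from_dict(orient='index')` produces one labeled row per key and a single 'value' column, which is what the original `columns=['value']` call expects.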