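While reading the LMNN paper recently, I noticed that the authors' experiments begin by using PCA to reduce the dimensionality of the high-dimensional sample features. On how to choose the target dimensionality, their rule is: "account for 95% of its total variance."
What does "total variance" mean here? I googled it, and the following article explains it well: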
http://support.sas.com/publishing/pubcat/chaps/55129.pdf
It contains the following passage:
What is meant by “total variance” in the data set? To understand the meaning of “total
variance” as it is used in a principal component analysis, remember that the observed
variables are standardized in the course of the analysis. This means that each variable is
transformed so that it has a mean of zero and a variance of one. The “total variance” in the
data set is simply the sum of the variances of these observed variables. Because they have
been standardized to have a variance of one, each observed variable contributes one unit of
variance to the “total variance” in the data set. Because of this, the total variance in a
principal component analysis will always be equal to the number of observed variables
being analyzed. For example, if seven variables are being analyzed, the total variance will
equal seven. The components that are extracted in the analysis will partition this variance:
perhaps the first component will account for 3.2 units of total variance; perhaps the second
component will account for 2.1 units. The analysis continues in this way until all of the
variance in the data set has been accounted for.
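The key point is that the total variance in a principal component analysis will always be equal to the number of observed variables (this holds because the variables are standardized; in general the total variance is the sum of all the eigenvalues). The examples later in the article also show that what matters is the relative size of the sorted eigenvalues. For example, if the original input feature space has 20 dimensions, PCA yields 20 eigenvalues. Sort them in descending order and find the smallest N for which the sum of the first N eigenvalues just exceeds 95% of the sum of all 20. That N is the target dimensionality for the reduction.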
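
The selection rule described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the LMNN authors' code: the function name choose_n_components, the threshold and standardize parameters, and the random demo data are all my own assumptions. With standardize=True it follows the SAS text (every variable is scaled to unit variance, so the eigenvalues sum to the number of variables). With standardize=False it uses the plain covariance matrix, where the total variance is simply the sum of the eigenvalues.

```python
import numpy as np


def choose_n_components(X, threshold=0.95, standardize=True):
    """Return (n, eigvals): the smallest n whose top-n eigenvalues reach
    `threshold` of the total variance, plus the eigenvalues in descending order.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                      # center every variable
    if standardize:
        # Scale to unit variance (assumes no column is constant).
        X = X / X.std(axis=0, ddof=1)
    cov = np.cov(X, rowvar=False)               # correlation matrix if standardized
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # eigvalsh is ascending, so reverse
    ratio = np.cumsum(eigvals) / eigvals.sum()  # cumulative share of total variance
    # First index where the cumulative share reaches the threshold, as a count.
    n = int(np.searchsorted(ratio, threshold)) + 1
    return min(n, eigvals.size), eigvals


# Demo on made-up data: 200 samples with 20 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))

n, eigvals = choose_n_components(X)
print("total variance (sum of eigenvalues):", eigvals.sum())
print("components needed for 95%:", n)
```

If scikit-learn is available, passing a float between 0 and 1 as n_components to sklearn.decomposition.PCA (for example PCA(n_components=0.95)) applies the same kind of cumulative-variance rule, although it does not standardize the input for you.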