主成分分析-CSDN博客

理解主成分分析

主成分分析（PCA）是一种数据降维技巧，它能将大量相关变量转化为一组很少的不相关变量，这些无关变量称为主成分。例如，使用PCA可将30个相关（很可能冗余）的环境变量转化为5个无关的成分变量，并且尽可能地保留原始数据集的信息。

相对而言，探索性因子分析（EFA）是一系列用来发现一组变量的潜在结构的方法。它通过寻找一组更小的、潜在的或隐藏的结构来解释已观测到的、显式的变量间的关系。例如，Harman74.cor包含了24个心理测验间的相互关系，受试对象为145个七年级或八年级的学生。假使应用EFA来探索该数据，结果表明276个测验间的相互关系可用四个学生能力的潜在因子（语言能力、反应速度、推理能力和记忆能力）来进行解释。

文中相关数据集可以在公众号后台回复：“主成分分析”，获取

数据准备和探索

library(ggplot2)	
library(psych)	
train = read.csv("~/mydata/NHLtrain.csv")	
head(train)

##       Team  ppg Goals_For Goals_Against Shots_For Shots_Against PP_perc	
## 1  Anaheim 1.26      2.62          2.29      30.3          27.5    23.0	
## 2  Arizona 0.95      2.54          2.98      27.6          31.0    17.7	
## 3   Boston 1.13      2.88          2.78      32.0          30.4    20.5	
## 4  Buffalo 0.99      2.43          2.62      29.5          30.6    18.9	
## 5  Calgary 0.94      2.79          3.13      29.2          29.0    17.0	
## 6 Carolina 1.05      2.39          2.70      29.9          27.6    16.8	
##   PK_perc CF60_pp CA60_sh OZFOperc_pp Give Take  hits  blks	
## 1    87.2  111.64   94.07     78.4141 9.78 5.22 27.24 14.37	
## 2    77.3   97.65   96.06     72.5217 5.67 5.89 22.13 14.02	
## 3    82.2  118.33   94.43     79.4393 8.60 6.11 26.39 14.44	
## 4    82.6   97.39  100.57     76.2105 6.34 5.26 23.37 13.35	
## 5    75.6   94.00  102.79     77.0624 9.80 6.99 20.73 16.10	
## 6    84.3  102.95   80.65     81.3953 8.00 9.22 18.90 11.95

这个数据集包含30支大联盟球队的统计数据，目标是建立一个模型来预测一支队伍的总积分。其中各字段的含义如下：

Team:球队所在城市
ppg:平均每场得分
Goals_For:平均每场进球数
Goals_Against:平均每场失球数
Shots_For:平均每场射中球门次数
Shots_Against:平均每场被射中球门次数
PP_perc:球队以多打少时进球百分比
PK_perc:球队以多打少球门不失的时间百分比
CF60_pp:球队在每60分钟以多打少时间内获得的Corsi分值
CA60_sh:对手在每60分钟以多打少时间内获得的Corsi分值
OZFOperc_pp:球队以多打少时，在进攻区域发生的争球次数百分比
Give:平均每场的丢球次数
Take:平均每场抢断次数
hits:平均每场身体冲撞次数
blks:平均每场封堵对方射门次数

library(corrplot)	
train_scale = scale(train[,-1:-2])	
train_scale_m = cor(train_scale)	
corrplot(train_scale_m, type = "lower", order = "hclust",	
         col = c("black", "white"), bg = "lightblue")

640?wx_fmt=png

从图上我们可以至少发现Shots_Against与Goals_Against正相关，PP_perc及PK_perc和Goals_Against负相关。由此可知，这个数据集非常适合提取主成分。

判断主成分个数

fa.parallel(train_scale, fa="pc", n.iter=100, show.legend = FALSE, 	
            main="Scree plot with parallel analysis")

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  3

abline(h=1, lwd=0.8, col="green")

640?wx_fmt=png

上图展示了基于观测值特征值的碎石检验，根据100个随机数据矩阵推导出来的特征值均线（虚线），以及大于1的特征值准则表明保留4个主成分即可,5个主成分是很令人信服的。

经验表明：主成分解释的方差累加起来，应该能够解释70%的方差。

正交旋转与解释

pca_rotate = principal(train_scale, nfactors = 5, rotate = "varimax")	
pca_rotate

## Principal Components Analysis	
## Call: principal(r = train_scale, nfactors = 5, rotate = "varimax")	
## Standardized loadings (pattern matrix) based upon correlation matrix	
##                 RC1   RC2   RC5   RC3   RC4   h2   u2 com	
## Goals_For     -0.21  0.82  0.21  0.05 -0.11 0.78 0.22 1.3	
## Goals_Against  0.88 -0.02 -0.05  0.21  0.00 0.82 0.18 1.1	
## Shots_For     -0.22  0.43  0.76 -0.02 -0.10 0.81 0.19 1.8	
## Shots_Against  0.73 -0.02 -0.20 -0.29  0.20 0.70 0.30 1.7	
## PP_perc       -0.73  0.46 -0.04 -0.15  0.04 0.77 0.23 1.8	
## PK_perc       -0.73 -0.21  0.22 -0.03  0.10 0.64 0.36 1.4	
## CF60_pp       -0.20  0.12  0.71  0.24  0.29 0.69 0.31 1.9	
## CA60_sh        0.35  0.66 -0.25 -0.48 -0.03 0.85 0.15 2.8	
## OZFOperc_pp   -0.02 -0.18  0.70 -0.01  0.11 0.53 0.47 1.2	
## Give          -0.02  0.58  0.17  0.52  0.10 0.65 0.35 2.2	
## Take           0.16  0.02  0.01  0.90 -0.05 0.83 0.17 1.1	
## hits          -0.02 -0.01  0.27 -0.06  0.87 0.83 0.17 1.2	
## blks           0.19  0.63 -0.18  0.14  0.47 0.70 0.30 2.4	
## 	
##                        RC1  RC2  RC5  RC3  RC4	
## SS loadings           2.69 2.33 1.89 1.55 1.16	
## Proportion Var        0.21 0.18 0.15 0.12 0.09	
## Cumulative Var        0.21 0.39 0.53 0.65 0.74	
## Proportion Explained  0.28 0.24 0.20 0.16 0.12	
## Cumulative Proportion 0.28 0.52 0.72 0.88 1.00	
## 	
## Mean item complexity =  1.7	
## Test of the hypothesis that 5 components are sufficient.	
## 	
## The root mean square of the residuals (RMSR) is  0.08 	
##  with the empirical chi square  28.59  with prob <  0.19 	
## 	
## Fit based upon off diagonal values = 0.91

SS loadings行包含了与主成分相关联的特征值。
Proportion Var行表示的是每个主成分对整个数据集的解释程度。
Cumulative Var行表示的是主成分对整个数据集的累计解释程度。
Proportion Explained表示解释贡献度占比。
Cumulative Proportion表示解释累计贡献度。

对于第一个主成分，变量Goals_Against和Shots_Against具有非常高的正载荷，而PP_perc和PK_perc具有非常高的负载荷。

第一个主成分解释了21%的方差, 第二个主成分解释了18%的方差，……，五个主成分累计解释了74%的方差。

根据主成分建立因子得分

`1`	`round(unclass(pca_rotate$weights),2)`

##                 RC1   RC2   RC5   RC3   RC4	
## Goals_For     -0.02  0.36  0.10 -0.01 -0.18	
## Goals_Against  0.36  0.01  0.12  0.10 -0.06	
## Shots_For      0.09  0.17  0.49 -0.14 -0.26	
## Shots_Against  0.28  0.00  0.03 -0.21  0.15	
## PP_perc       -0.32  0.19 -0.18 -0.06  0.08	
## PK_perc       -0.29 -0.12 -0.01  0.00  0.13	
## CF60_pp        0.03  0.01  0.34  0.07  0.15	
## CA60_sh        0.15  0.31 -0.03 -0.33 -0.07	
## OZFOperc_pp    0.13 -0.10  0.46 -0.12 -0.02	
## Give           0.00  0.23 -0.01  0.32  0.04	
## Take           0.02  0.00 -0.10  0.60 -0.03	
## hits          -0.03 -0.07  0.01 -0.05  0.76	
## blks           0.00  0.25 -0.22  0.12  0.42

利用如下公式可得到主成分得分：

RC1 = -0.02*Goals_For + 0.36*Goals_Against + …… + 0.00*blks

需要注意的是注意：这里的Goals_For等变量是train_scale数据中的变量，即标准化之后的变量。这里我们就不手动计算了，直接用代码生成结果

pca_scores = data.frame(pca_rotate$scores)	
head(pca_scores)

##           RC1          RC2        RC5        RC3        RC4	
## 1 -2.21526408  0.002821488  0.3161588 -0.1572320  1.5278033	
## 2  0.88147630 -0.569239044 -1.2361419 -0.2703150 -0.0113224	
## 3  0.10321189  0.481754024  1.8135052 -0.1606672  0.7346531	
## 4 -0.06630166 -0.630676083 -0.2121434 -1.3086231  0.1541255	
## 5  1.49662977  1.156905747 -0.3222194  0.9647145 -0.6564827	
## 6 -0.48902169 -2.119952370  1.0456190  2.7375097 -1.3735777

回归分析

pca_scores$ppg = train$ppg  # 将响应变量作为一列加入数据	
fit1 = lm(ppg~., data = pca_scores)	
summary(fit1)

## 	
## Call:	
## lm(formula = ppg ~ ., data = pca_scores)	
## 	
## Residuals:	
##       Min        1Q    Median        3Q       Max 	
## -0.163274 -0.048189  0.003718  0.038723  0.165905 	
## 	
## Coefficients:	
##              Estimate Std. Error t value Pr(>|t|)    	
## (Intercept)  1.111333   0.015752  70.551  < 2e-16 ***	
## RC1         -0.112201   0.016022  -7.003 3.06e-07 ***	
## RC2          0.070991   0.016022   4.431 0.000177 ***	
## RC5          0.022945   0.016022   1.432 0.164996    	
## RC3         -0.017782   0.016022  -1.110 0.278044    	
## RC4         -0.005314   0.016022  -0.332 0.743003    	
## ---	
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1	
## 	
## Residual standard error: 0.08628 on 24 degrees of freedom	
## Multiple R-squared:  0.7502,  Adjusted R-squared:  0.6981 	
## F-statistic: 14.41 on 5 and 24 DF,  p-value: 1.446e-06

回归结果表明，模型在整体上是显著的，但是有3个主成分是不显著的，可以简单处理，选择保留，但是我们不妨只保留RC1和RC2看看结果如何。

fit2 = lm(ppg ~ RC1 + RC2, data=pca_scores)	
summary(fit2)

## 	
## Call:	
## lm(formula = ppg ~ RC1 + RC2, data = pca_scores)	
## 	
## Residuals:	
##      Min       1Q   Median       3Q      Max 	
## -0.18914 -0.04430  0.01438  0.05645  0.16469 	
## 	
## Coefficients:	
##             Estimate Std. Error t value Pr(>|t|)    	
## (Intercept)  1.11133    0.01587  70.043  < 2e-16 ***	
## RC1         -0.11220    0.01614  -6.953  1.8e-07 ***	
## RC2          0.07099    0.01614   4.399 0.000153 ***	
## ---	
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1	
## 	
## Residual standard error: 0.0869 on 27 degrees of freedom	
## Multiple R-squared:  0.7149,  Adjusted R-squared:  0.6937 	
## F-statistic: 33.85 on 2 and 27 DF,  p-value: 4.397e-08

模型2拟合优度几乎没有多大变化，但是模型更见简练。因此我们选择模型2（我们可以用其他准则来选择模型，结果也是模型2比模型1更优）.

train$pred = round(fit2$fitted.values, 2)	
ggplot(train, aes(pred, ppg, label = Team)) +	
  geom_point() +	
  geom_text(size = 3.5, hjust = 0.1, vjust = -0.5, angle = 0) +	
  xlim(0.8,1.4) +  ylim(0.8, 1.5) +	
  geom_smooth(method = "lm", se = FALSE)

640?wx_fmt=png

从上图我们可以看出，主成分和球队积分之间存在强烈的线性关系。

另一项分析内容是绘制出球队及其因子得分之间的关系。

pca_scores$Team = train$Team	
ggplot(pca_scores, aes(RC1, RC2, label = Team)) + 	
  geom_point() + 	
  geom_text(size = 2.75, hjust = 0.2, vjust = -0.75, angle = 0)+	
  xlim(-2.5, 2.5) + ylim(-3.0, 2.5)

640?wx_fmt=png