Principles of Principal Component Analysis
Reading the data
library('ggplot2')
# First code snippet
prices <- read.csv(file.path('data', 'stock_prices.csv'),
                   stringsAsFactors = FALSE)
prices[1, ]
# Date Stock Close
#1 2011-05-25 DTE 51.12
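Converting the date format
The ymd function from the lubridate package converts the Date column from character strings to date objects.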
# Second code snippet
library('lubridate')
prices <- transform(prices, Date = ymd(Date))
#Date Stock Close
#1 2011-05-25 DTE 51.12
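To illustrate what the conversion buys, here is a minimal sketch in base R. It uses as.Date as a stand-in for ymd, on the assumption that the dates are written as YYYY-MM-DD like the row printed above; the vector raw is made up for illustration and is not part of the data set.

```r
# Hypothetical sample dates, stored as plain character strings.
raw <- c('2011-05-25', '2011-05-26')
class(raw)

# Parse the strings into Date objects (base R stand-in for lubridate's ymd).
dates <- as.Date(raw, format = '%Y-%m-%d')
class(dates)

# Once parsed, date arithmetic works: the gap between the two days.
gap <- as.numeric(dates[2] - dates[1])
gap
```

Having real date objects is what makes the later comparison Date != ymd('2002-02-01') a comparison between dates rather than between strings.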
Reshaping the data with cast
# Third code snippet
library('reshape')
date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')
# Fourth code snippet
prices <- subset(prices, Date != ymd('2002-02-01'))
prices <- subset(prices, Stock != 'DDR')
date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')
After reshaping, the data looks like this:
> date.stock.matrix[1,]
Date ADC AFL ARKR AZPN CLFD DTE ENDP FLWS FR GMXR GPC
1 2002-01-02 17.7 23.78 8.15 17.1 3.19 42.37 11.54 15.77 31.16 4.5 36.09
HE ISSC ISSI KSS MTSC NWN ODFL PARL RELV SIGM STT TRIB UTR
1 40.41 7.82 12.78 70.23 10.03 26.2 13.4 1.92 1.3 1.75 52.11 1.5 39.34
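In the cast formula, the column named to the left of the tilde (here Date) supplies the rows of the output matrix, and the column named to the right of the tilde (here Stock) supplies its columns. The value argument names the column whose entries fill the cells.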
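The long-to-wide idea can be sketched on a toy table without the reshape package. The example below uses base R's tapply as a rough stand-in for cast, and the data frame toy with its tickers AAA and BBB is invented for illustration.

```r
# Hypothetical long-format data: one row per (Date, Stock) pair.
toy <- data.frame(
  Date  = c('2011-05-24', '2011-05-24', '2011-05-25', '2011-05-25'),
  Stock = c('AAA', 'BBB', 'AAA', 'BBB'),
  Close = c(10.0, 20.0, 11.0, 21.0),
  stringsAsFactors = FALSE
)

# Rows come from Date, columns from Stock, cells from Close.
# Each (Date, Stock) pair occurs once, so mean() just returns that value.
wide <- with(toy, tapply(Close, list(Date = Date, Stock = Stock), FUN = mean))
wide
```

Each distinct Date becomes one row and each distinct Stock becomes one column, which is the same layout that Date ~ Stock asks cast to produce.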
PCA
> pca <- princomp(date.stock.matrix[, 2:ncol(date.stock.matrix)])
> pca
Call:
princomp(x = date.stock.matrix[, 2:ncol(date.stock.matrix)])
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
29.1001249 20.4403404 12.6726924 11.4636450 8.4963820 8.1969345
Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
5.5438308 5.1300931 4.7786752 4.2575099 3.3050931 2.6197715
Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18
2.4986181 2.1746125 1.9469475 1.8706240 1.6984043 1.6344116
Comp.19 Comp.20 Comp.21 Comp.22 Comp.23 Comp.24
1.2327471 1.1280913 0.9877634 0.8583681 0.7390626 0.4347983
24 variables and 2366 observations.
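The standard deviations are easier to read as shares of total variance. A minimal sketch, assuming the values are copied by hand from the output above (with a fitted object one would use pca$sdev instead); the names sdev, variance.share and cumulative.share are introduced here for illustration.

```r
# Standard deviations of the 24 components, copied from the printed output.
sdev <- c(29.1001249, 20.4403404, 12.6726924, 11.4636450, 8.4963820, 8.1969345,
          5.5438308, 5.1300931, 4.7786752, 4.2575099, 3.3050931, 2.6197715,
          2.4986181, 2.1746125, 1.9469475, 1.8706240, 1.6984043, 1.6344116,
          1.2327471, 1.1280913, 0.9877634, 0.8583681, 0.7390626, 0.4347983)

# Variance is the square of the standard deviation; normalise to get shares.
variance.share <- sdev^2 / sum(sdev^2)

# Running total: how much variance the first k components explain together.
cumulative.share <- cumsum(variance.share)

round(variance.share[1:3], 2)
round(cumulative.share[1:3], 2)
```

Because the shares are computed from squared standard deviations, the first component accounts for a little under half of the total variance, which is the reason a single column can serve as a reasonable summary of all 24 stocks.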
Inspect the loadings of the first principal component, then use them to summarize the data as a single column
# Eighth code snippet
principal.component <- pca$loadings[, 1]
# Ninth code snippet
loadings <- as.numeric(principal.component)
ggplot(data.frame(Loading = loadings),
       aes(x = Loading, fill = 1)) +
  geom_density() +
  theme(legend.position = 'none')
# Tenth code snippet
market.index <- predict(pca)[, 1]
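What predict(pca)[, 1] returns can be checked by hand: each score is the centered data row multiplied by the first column of loadings. The sketch below verifies this on synthetic data rather than the stock prices; x, toy.pca, scores and manual are illustrative names.

```r
# Synthetic data: 50 observations of 4 variables (not the stock data).
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)

toy.pca <- princomp(x)

# Scores on the first component, as returned by predict().
scores <- predict(toy.pca)[, 1]

# The same scores computed manually: subtract the column means stored in
# the fitted object, then project onto the first loading vector.
centered <- scale(x, center = toy.pca$center, scale = FALSE)
manual <- centered %*% toy.pca$loadings[, 1]

all.equal(as.numeric(scores), as.numeric(manual))
```

In other words, the single summary column is a weighted sum of the centered original columns, with the first-component loadings as the weights.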