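Reposted from: http://site.douban.com/182577/widget/notes/12866356/note/267793990/
Thanks to the Captain's kind support, I have the chance to write some short pieces about quantitative topics on this site. They will most likely be summary notes from the graduate courses I am taking, written to consolidate what I have learned and to practice my English writing. The style will probably be unpredictable, so comments are very welcome!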
- 戌
How to Draw Informative and Decent Pair Plots in R
To summarize ways to draw pair plots in R, I use the dataset “Capm” from the “Ecdat” package as an example. Its first three rows are shown below.
> data(Capm,package="Ecdat")
> head(Capm,3)
  rfood  rdur  rcon  rmrf   rf
1 -4.59  0.87 -6.84 -6.99 0.33
2  2.62  3.46  2.78  0.99 0.29
3 -1.67 -2.28 -0.48 -1.46 0.35
The dataset “Capm” contains monthly stock market observations from January 1960 to December 2002, 516 observations in total. The columns “rfood”, “rdur”, and “rcon” are the excess returns of the food, durables, and construction industries, “rmrf” is the excess return of the market portfolio, and “rf” is the risk-free return.
Generally, people draw pair plots when they wonder whether two groups of data are linearly related. Here we use the plot() function to draw a scatter plot of the food industry excess returns against the market portfolio excess returns, expecting to see a linear relationship between them.
> # Plot food sector excess returns versus market excess returns using plot function
> plot(Capm[,"rmrf"]/100,Capm[,"rfood"]/100,ylab="Food industry excess return",
+ xlab="Market excess return")
From the plot above, we can tentatively conclude that there is a strong positive linear relationship between the market portfolio excess returns and the food industry excess returns. To make the plot more informative, we can pass one more argument, col=rgb(0,0,100,50,maxColorValue=255), to the plot() function.
> # Plot food sector excess returns versus market excess returns using plot function with col=rgb()
> plot(Capm[,"rmrf"]/100,Capm[,"rfood"]/100,ylab="Food industry excess return",
+ xlab="Market excess return",pch=19,col=rgb(0,0,100,50,maxColorValue=255))
rgb(), the RGB Color Specification function, creates colors corresponding to the given intensities (between 0 and max) of the red, green, and blue primaries. The color specification refers to the standard sRGB colorspace (IEC standard 61966). The fourth argument, 50 here, is the alpha (opacity) value on the same 0 to 255 scale, so every point is drawn semitransparent. Regions where many points overlap then look darker, which helps us see where the observations cluster over the period.
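As a quick check of what the rgb() call above actually produces, here is a minimal sketch that builds the same color and decodes it again with col2rgb(). It uses only base R, and the variable names are mine, chosen for illustration.

```r
# Build the semitransparent blue used in the plots above.
# Arguments are red, green, blue, alpha, each on a 0-255 scale.
pt.col <- rgb(0, 0, 100, 50, maxColorValue = 255)
print(pt.col)

# Decode the color string back into its four channels.
channels <- col2rgb(pt.col, alpha = TRUE)
print(channels)

# Alpha expressed as a fraction of full opacity.
opacity <- channels["alpha", 1] / 255
print(round(opacity, 2))
```

An alpha of 50 out of 255 is roughly 20% opacity, so about five overlapping points are needed before a spot looks close to solid.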
Sometimes we need the correlations among more than two groups of data. Instead of specifying the x axis and the y axis in the plot() function, we can simply pass the whole dataset to plot(), which then draws a scatter plot for every pair of columns. For reference, the correlations of the dataset are computed first.
> cor(Capm/100)
            rfood        rdur        rcon        rmrf          rf
rfood  1.00000000  0.66885253  0.72093253  0.77307668 -0.02691534
rdur   0.66885253  1.00000000  0.78501333  0.85989534 -0.09965299
rcon   0.72093253  0.78501333  1.00000000  0.89613950 -0.07210179
rmrf   0.77307668  0.85989534  0.89613950  1.00000000 -0.07680539
rf    -0.02691534 -0.09965299 -0.07210179 -0.07680539  1.00000000
> # Plot pair plots of returns
> plot(Capm/100,pch=19,col=rgb(0,0,100,50,maxColorValue=255))
The pair plots suggest linear relationships between rfood and rdur, rfood and rcon, rfood and rmrf, rdur and rcon, rdur and rmrf, and rcon and rmrf, that is, among all four excess-return series but not between any of them and rf.
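The visual impression can be cross-checked against the numbers. The sketch below re-enters the correlation matrix printed above by hand (so it runs without the Ecdat package) and lists each pair whose absolute correlation exceeds a cutoff. The cutoff of 0.5 is my own illustrative choice, not something from the data or the packages.

```r
# Correlation matrix copied from the cor(Capm/100) output above.
vars <- c("rfood", "rdur", "rcon", "rmrf", "rf")
cor.capm <- matrix(c(
   1.00000000,  0.66885253,  0.72093253,  0.77307668, -0.02691534,
   0.66885253,  1.00000000,  0.78501333,  0.85989534, -0.09965299,
   0.72093253,  0.78501333,  1.00000000,  0.89613950, -0.07210179,
   0.77307668,  0.85989534,  0.89613950,  1.00000000, -0.07680539,
  -0.02691534, -0.09965299, -0.07210179, -0.07680539,  1.00000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Illustrative cutoff for calling a relationship "strong" (an assumption).
cutoff <- 0.5

# Look only at the upper triangle so that each pair is reported once.
idx <- which(upper.tri(cor.capm) & abs(cor.capm) > cutoff, arr.ind = TRUE)
strong <- data.frame(x = vars[idx[, "row"]],
                     y = vars[idx[, "col"]],
                     r = cor.capm[idx],
                     stringsAsFactors = FALSE)
print(strong)
```

With this cutoff the table contains exactly the six pairs named above, and rf does not appear in it.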
Moreover, setting col=rainbow(n, end=0.9) in a scatter plot lets one see how the relationship changes over time, because the points are colored in row order, which is chronological here. The rainbow() function creates a vector of n contiguous colors.
> # Plot pair plots of returns with col=rainbow(), one color per observation
> plot(Capm/100,col=rainbow(nrow(Capm),end=0.9),pch=19)
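For the colors to track time cleanly, the color vector should have one entry per observation. If it is shorter, R recycles it and the last points reuse the first colors. The sketch below illustrates this on simulated stand-in series (not the Capm data), with the series names and parameters made up for the example, and adds a small legend showing which colors mark the start, middle, and end of the sample.

```r
# Number of monthly observations stated for the Capm dataset.
n.obs <- 516

# One color per observation; end = 0.9 stops before the hue wraps back to red.
cols <- rainbow(n.obs, end = 0.9)

# Simulated stand-in returns (hypothetical data, for illustration only).
set.seed(42)
market <- rnorm(n.obs, sd = 0.04)
food <- 0.8 * market + rnorm(n.obs, sd = 0.03)

# Points are colored in row order, i.e. chronologically.
plot(market, food, col = cols, pch = 19,
     xlab = "Market excess return (simulated)",
     ylab = "Food industry excess return (simulated)")

# Legend entries pick the first, middle, and last color of the ramp.
legend("topleft", legend = c("first", "middle", "last"),
       col = cols[c(1, n.obs %/% 2, n.obs)], pch = 19)
```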
As open source software, R is friendly in that it usually offers alternatives when people get tired of one specific function :) For example, if you no longer want to type p-l-o-t when drawing pair plots, you are welcome to type s-p-l-o-m instead, provided the lattice package has been installed and loaded beforehand.
> # Plot pair plots of returns using the splom function from the lattice package
> library(lattice)
> splom(Capm/100,pch=19,col=rgb(0,0,100,50,maxColorValue=255))
To be honest, I do not think splom() is a better choice than plot(). plot() is a function I use every day in R, and I could say its name aloud and clearly even while dreaming at night, whereas nine times out of ten I cannot remember whether the other one is spelled s-p-o-x-x or s-p-l-x-x. On top of that, splom() brings no improvement, unless making a plot messier and harder to read counts as one. But keep your chin up, my friend: better alternatives do exist. For example, you can simply use another base plotting function, pairs(), which is called in almost the same way as plot().
> # Plot pair plots of returns using the pairs function with custom labels
> pairs(Capm/100,labels=c("Food","Durables","Construction","Market","Risk-free"),
+ pch=19,col=rgb(0,0,100,50,maxColorValue=255))
Note that this example uses a new argument, labels, which relabels the panels with more intuitive names without changing the column names of the raw data.
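pairs() also accepts custom panel functions, which is one way to get closer to the packages discussed next without leaving base R. The sketch below follows the pattern from the examples in the pairs() help page: the upper panels print the correlation and the lower panels keep the scatter plots. The data are simulated and the names returns and panel.cor are mine, so treat it as an illustration only.

```r
# Simulated stand-in data (hypothetical, not the Capm dataset).
set.seed(1)
returns <- data.frame(a = rnorm(60))
returns$b <- 0.8 * returns$a + rnorm(60, sd = 0.5)
returns$c <- rnorm(60)

# Panel function that writes the correlation in the middle of the panel.
panel.cor <- function(x, y, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))        # restore the panel's coordinate system
  par(usr = c(0, 1, 0, 1))       # switch to unit coordinates for the text
  text(0.5, 0.5, format(cor(x, y), digits = 2), cex = 1.5)
}

# labels only changes what is drawn on the diagonal.
pairs(returns, labels = c("Series A", "Series B", "Series C"),
      upper.panel = panel.cor,
      pch = 19, col = rgb(0, 0, 100, 50, maxColorValue = 255))

# The column names of the data are left untouched.
print(names(returns))
```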
We can also turn to the plotcov() function in the pcaPP package and the plotcorr() function in the ellipse package.
The plotcov() function shows both the values and the shapes of the correlations in a single plot, which is informative, concise, and space-saving.
> # Plot pair plots of returns using the plotcov function from the pcaPP package
> library(pcaPP)
> plotcov(cor(Capm/100),method1="correlation")
Sometimes people push conciseness even further, and the result is a plot like the one plotcorr() produces below.
> # Plot pair plots of returns using the plotcorr function from the ellipse package
> library(ellipse)
> plotcorr(cor(Capm/100),num=T,diag=F,type="upper")
I am a Muji-style addict, so I will happily embrace the plotcorr style as long as I do not have to care about the “big” differences between -0.02691534 and 0; between -0.09965299, -0.07210179, -0.07680539 and -1 (meaning -0.1); between 0.66885253, 0.72093253 and 7 (meaning 0.7); or between 0.77307668, 0.78501333 and 8 (meaning 0.8). Most of the time we use pair plots simply to spot linear relationships, so the gap between 0.77307668 and 0.78501333 hardly matters. The plotcorr() function also lets people draw the upper triangle, the lower triangle, or the full correlation matrix by setting the type argument to “upper”, “lower”, or “full”.
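The single digits quoted above are consistent with each correlation being multiplied by ten and rounded to a whole number. The sketch below reproduces that mapping in base R. It is my assumption about how the displayed digits arise, inferred from the values quoted above, not code taken from the ellipse package.

```r
# Off-diagonal correlations quoted above.
r <- c(-0.02691534, -0.09965299, -0.07210179, -0.07680539,
        0.66885253,  0.72093253,  0.77307668,  0.78501333)

# Assumed display rule: tenths, rounded to the nearest whole number.
shown <- round(10 * r)

print(data.frame(r = r, shown = shown))
```

So 0.77307668 and 0.78501333 both collapse to 8, which is exactly the loss of detail the minimalist style accepts.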
Sometimes geeks (not me) feel they have to do something advanced to distinguish themselves from people with average IQ (like me), so they make plotcorr() show whether each sample correlation coefficient is significant. Under the null hypothesis of zero correlation, t = r*sqrt(n-2)/sqrt(1-r^2) follows a t distribution with n-2 degrees of freedom, so a correlation is significant at the two-sided 5% level when its absolute value exceeds t/sqrt(t^2+n-2), where t is the 0.975 quantile of that distribution.
That is, if a correlation lies between the lower and upper bounds (it is not significantly different from zero), its ellipse is filled in blue; it is red if the correlation is above the upper bound and yellow if it is below the lower bound. With n = 516 observations the bound is only about 0.086, so the strongly positive correlations among the four excess-return series come out red, while the near-zero correlations with rf stay blue, except for rdur and rf (-0.0997), which falls just below the lower bound and turns yellow.
> # Plot pair plots of returns, colored by the significance of each correlation
> sig.r <- function(p,n)
+ {
+ df <- n-2 # degrees of freedom
+ t.stat <- qt(p,df) # critical t value
+ t.stat/sqrt(t.stat^2+df) # critical correlation
+ }
> r.threshold <- sig.r(0.975,nrow(Capm))
> cor.capm <- cor(Capm/100)
> col <- ifelse(cor.capm>r.threshold,"red",ifelse(cor.capm< -r.threshold,"yellow","blue"))
> plotcorr(cor.capm,col=col,diag=F,cex.lab=0.75,type="upper",numbers=F)
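The threshold depends heavily on the sample size passed as n, which is worth seeing in numbers. The sketch below is a self-contained base R version of the same formula. The function name critical.r and the sample sizes 4 and 30 are mine, chosen for illustration. It also works the test backwards for one correlation from the matrix above, assuming the 516 monthly observations stated earlier.

```r
# Critical |r| for a two-sided test of zero correlation at level 2 * (1 - p).
critical.r <- function(n, p = 0.975) {
  df <- n - 2                    # degrees of freedom
  t.crit <- qt(p, df)            # critical t value
  t.crit / sqrt(t.crit^2 + df)   # corresponding critical correlation
}

# The bound shrinks quickly as the sample grows.
for (n in c(4, 30, 516)) {
  cat("n =", n, " critical |r| =", round(critical.r(n), 4), "\n")
}

# Equivalent check for one coefficient: rdur versus rf from the matrix above.
r.obs <- -0.09965299
n <- 516
t.obs <- r.obs * sqrt(n - 2) / sqrt(1 - r.obs^2)
p.value <- 2 * pt(-abs(t.obs), df = n - 2)
cat("t =", round(t.obs, 3), " p-value =", round(p.value, 4), "\n")
```

With only 4 observations a correlation has to exceed 0.95 in absolute value to count as significant, whereas with 516 observations about 0.086 is enough.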
Finally, since the plotcov() function was originally designed to compare two estimates of a covariance matrix directly in one plot, we use it to compare the sample correlations and the robust correlations of the Capm dataset as follows.
> # Compare sample correlation matrix with robust correlation matrix
> library(robust)
> cor.sample <- cor(Capm/100)
> cor.robust <- covRob(Capm/100,corr=TRUE)$cov # extract the robust correlation matrix
> plotcov(cov1=cor.sample,cov2=cor.robust,method1="sample",method2="robust")
In most cases, the sample correlations are close to the robust correlations in the Capm example. The nearly overlapping ellipses in the upper triangle also suggest that we can be confident in our correlation values.
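Why compare against a robust estimate at all? The toy sketch below shows how a single outlier can wreck the sample (Pearson) correlation. To stay within base R it swaps in Spearman's rank correlation as a simple outlier-resistant measure. This is not the estimator that covRob() computes, and the data are made up for illustration.

```r
# A perfectly linear series, except for one corrupted final observation.
x <- 1:20
y <- x
y[20] <- -100   # single outlier (hypothetical data)

# Sample correlation: dominated by the outlier.
r.pearson <- cor(x, y)

# Rank correlation: the outlier only moves one rank to the bottom.
r.spearman <- cor(x, y, method = "spearman")

cat("Pearson :", round(r.pearson, 3), "\n")
cat("Spearman:", round(r.spearman, 3), "\n")
```

When the two kinds of estimate agree, as they do for Capm, outliers are unlikely to be driving the sample correlations.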
Last but not least, the Captain recommended the corrgram() function in the corrgram package to me. Specialized in correlation plots, it displays the relationships between variables in various forms, depending on which panel functions are chosen.
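> # Plot a correlogram of returns using the corrgram function from the corrgram package
> library(corrgram)
> corrgram(Capm/100,labels=c("Food","Durables","Construction","Market","Risk-free"),
+ lower.panel=panel.shade,upper.panel=panel.pie,text.panel=panel.txt)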
In the plot above, the direction of the diagonal hatching in the lower panels separates the relationships into two categories, positive and negative. In addition, blue denotes positive relationships and pink denotes negative ones. The darker the color and the larger the filled area, the stronger the relationship.
> corrgram(Capm/100,labels=c("Food","Durables","Construction","Market","Risk-free"),
+ lower.panel=panel.pts, upper.panel=panel.conf, diag.panel=panel.density)
> corrgram(Capm/100,labels=c("Food","Durables","Construction","Market","Risk-free"),
+ panel=panel.ellipse, text.panel=panel.txt, diag.panel=panel.minmax)
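PS: I finished the first draft in one go on the 10th, but finals dragged on, and only today, after knocking out the last exam, did I have time to make a few small additions and edits and upload it = = I have hung around Douban for years, yet this is the first time I have written this kind of article. Mathematical finance really does have great charm! I will try to write another short piece on hypothesis testing before spring break ends. Comments are very welcome!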