Ch7基本统计方法
7.1描述性统计分析
7.1.1方法云集
1、基础安装:summary()函数可获取描述性统计量
#7-1 summary()函数
> vars <- c("mpg", "hp", "wt")
> head(mtcars[vars])
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460
> summary(mtcars[vars])
mpg hp wt
Min. :10.40 Min. : 52.0 Min. :1.513
1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581
Median :19.20 Median :123.0 Median :3.325
Mean :20.09 Mean :146.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610
Max. :33.90 Max. :335.0 Max. :5.424
2、自定义函数,并用sapply将自定义函数应用到每列上
#7-2利用sapply()计算,自行添加偏度峰度计算函数
> mystats <- function(x, na.omit=FALSE){
+ if (na.omit)
+ x <- x[!is.na(x)]
+ m <- mean(x)
+ n <- length(x)
+ s <- sd(x)
+ skew <- sum((x-m)^3/s^3)/n
+ kurt <- sum((x-m)^4/s^4)/n - 3
+ return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt))
+ }
> sapply(mtcars[vars], mystats)
mpg hp wt
n 32.000000 32.0000000 32.00000000
mean 20.090625 146.6875000 3.21725000
stdev 6.026948 68.5628685 0.97845744
skew 0.610655 0.7260237 0.42314646
kurtosis -0.372766 -0.1355511 -0.02271075
3、扩展
利用Hmisc、pastecs、psych包中的describe函数等也可以进行类似的描述性计算
7.1.2分组计算描述性统计量
1、使用aggregate()分组
#7-6aggregate()函数分组
> aggregate(mtcars[vars], by=list(am=mtcars$am), mean)
am mpg hp wt
1 0 17.14737 160.2632 3.768895
2 1 24.39231 126.8462 2.411000
> aggregate(mtcars[vars], by=list(am=mtcars$am), sd)
am mpg hp wt
1 0 3.833966 53.90820 0.7774001
2 1 6.166504 84.06232 0.6169816
注:
aggregate()函数只能在调用的时候使用mean、sd这样的单返回值函数。若需返回多个统计量,可以使用by()函数
7.2频数表和列联表
示例使用数据:
> library(vcd)
> head(Arthritis)
ID Treatment Sex Age Improved
1 57 Treated Male 27 Some
2 46 Treated Male 29 None
3 77 Treated Male 30 None
4 17 Treated Male 32 Marked
5 36 Treated Male 46 Marked
6 23 Treated Male 58 Marked
7.2.1生成频数表
1、一维列联表
使用函数table()可生成简单频数表
#一维列联表
> options(digits = 3)
> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
None Some Marked
42 14 28
#转化为比例值
> prop.table(mytable)
Improved
None Some Marked
0.500 0.167 0.333
#转化为百分比
> prop.table(mytable)*100
Improved
None Some Marked
50.0 16.7 33.3
2、二维列联表
方法一:table()函数
mytable <- table(A, B)
其中,A是行变量,B是列变量
方法二:xtabs()函数
mytable <- xtabs(~ A + B, data=mydata)
其中,mydata是一个矩阵或数据框。
需要进行交叉分类的变量为B,在右侧
频数向量为A,在左侧
#二维列联表
> mytable <- xtabs(~ Treatment+Improved, data = Arthritis)
> #以Treatment为频数向量,以Improved为进行交叉分类的变量
> mytable
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
#生成边际频数和比例,下标指代是第几个变量
#计算行和与行比例
> margin.table(mytable, 1)
Treatment
Placebo Treated
43 41
> prop.table(mytable, 1)
Improved
Treatment None Some Marked
Placebo 0.674 0.163 0.163
Treated 0.317 0.171 0.512
#计算列和与列比例
> margin.table(mytable, 2)
Improved
None Some Marked
42 14 28
> prop.table(mytable, 2)
Improved
Treatment None Some Marked
Placebo 0.69 0.50 0.25
Treated 0.31 0.50 0.75
#计算各单元格所占比例
> prop.table(mytable)
Improved
Treatment None Some Marked
Placebo 0.3452 0.0833 0.0833
Treated 0.1548 0.0833 0.2500
使用addmargins()函数为表格添加边际和
#添加边际和
> addmargins(mytable) #默认为表中的所有变量添加边际和
Improved
Treatment None Some Marked Sum
Placebo 29 7 7