R——统计描述与基础统计分析方法

Annaaphq

已于 2022-08-13 08:09:58 修改

阅读量1.9k

点赞数 1

文章标签： r语言 python 算法

于 2022-08-11 15:32:07 首次发布

本文链接：https://blog.csdn.net/Annaaphq/article/details/126286373

版权

统计描述与基础统计分析方法

一、描述性统计分析

1、单组数据汇总统计量

ISwR包中的 juul 数据集：

$[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-zHl8IqCf-1660202889328)(C:\Users\HP\AppData\Roaming\Typora\typora-user-images\1660102444436.png)]$

1.1 均数、标准差、中位数、分位数计算

下面模拟一个正态分布数据：

x <- rnorm(50)
mean(x)
sd(x)
var(x) #方差
median(x)
quantile(x) #四分位数间距

pvec <- seq(0,1,0.1) #指把0-1十等分
pvec
quantile(x,pvec)
#输出结果
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> quantile(x,pvec)
          0%          10%          20% 
-2.071125496 -1.552765163 -0.808384396 
         30%          40%          50% 
-0.653572352 -0.508537771 -0.259518767 
         60%          70%          80% 
-0.001803851  0.386756225  0.770844689 
         90%         100% 
 1.316231949  3.179026002

1.2 统计描述过程中的缺失值处理

library(ISwR)
attach(juul)
mean(igf1)
mean(igf1,na.rm=T) #剔除缺失值
sum(!is.na(igf1))
summary(igf1)
summary(juul)
detach(juul)

juul$sex <- factor(juul$sex,labels=c("M","F")) #因子化
juul$menarche <- factor(juul$menarche,labels=c("No","Yes"))
juul$tanner <- factor(juul$tanner,
                      labels=c("I","II","III","IV","V"))
attach(juul)
summary(juul)
detach(juul)

#以下代码与上述5行代码等价
juul <- transform(juul,
                  sex=factor(sex,labels=c("M","F")),
                  menarche=factor(menarche,labels=c("No","Yes")),
                  tanner=factor(tanner,labels=c("I","II","III","IV","V")))
attach(juul)
summary(juul)

2、数据分布类型的图形描述

2.1 直方图

以上方x为例：

hist(x)

另一个例子，模拟数据：

mid.age <- c(2.5,7.5,13,16.5,17.5,19,22.5,44.5,70.5)
acc.count <- c(28,46,58,20,31,64,149,316,103)#指2.5岁的有28个这样
age.acc <- rep(mid.age,acc.count) #对应起来
brk <- c(0,5,10,16,17,18,20,25,60,80) #设置分隔
hist(age.acc,breaks=brk)

2.2 经验累积分布图形

以上方x为例：

n <- length(x)
plot(sort(x),(1:n)/n,type="s",ylim=c(0,1))
#按照x排序，小于等于x的数据占总数据的比例，type="s"指绘图的类型阶梯函数，ylim轴刻度

在这里插入图片描述

2.3 QQ图

qqnorm(x)

在这里插入图片描述

2.4 箱式图

par(mfrow=c(1,2))
boxplot(IgM)
boxplot(log(IgM))
par(mfrow=c(1,1))

在这里插入图片描述

3、分组数据汇总统计量

3.1 tapply: 分组计算统计量

attach(red.cell.folate) #自带的
tapply(folate,ventilation,mean) #分组计算
tapply(folate,ventilation,sd)
tapply(folate,ventilation,length)

#或
xbar <- tapply(folate, ventilation, mean)
s <- tapply(folate, ventilation, sd)
n <- tapply(folate, ventilation, length)
cbind(mean=xbar, std.dev=s, n=n) #按照列进行结合

tapply(igf1, tanner, mean)
tapply(igf1, tanner, mean, na.rm=T) #剔除缺失值

3.2 aggreate和by函数：分组计算统计量

aggregate(juul[c("age","igf1")],
          list(sex=juul$sex), mean, na.rm=T) #引用age和igf1两列，并按照sex分组计算均值
aggregate(juul[c("age","igf1")], juul["sex"], mean, na.rm=T) #另一个编写代码方式
by(juul, juul["sex"], summary)#第三种编写代码方式

4、分组数据的图形描述

4.1 直方图

attach(energy) #自带的数据集
expend.lean <- expend[stature=="lean"] #expend不是函数，是数据集中的变量
expend.obese <- expend[stature=="obese"] # energy 数据集中expend变量被分割成两个向量

par(mfrow=c(2,1))
hist(expend.lean,breaks=10,xlim=c(5,13),ylim=c(0,4),col="white")
hist(expend.obese,breaks=10,xlim=c(5,13),ylim=c(0,4),col="grey")
par(mfrow=c(1,1))

在这里插入图片描述

4.2 并联箱式图

以energy数据集为例

boxplot(expend ~ stature)

在这里插入图片描述

4.3 带状图

opar <- par(mfrow=c(2,2), mex=0.8, mar=c(3,3,2,1)+.1)
stripchart(expend ~ stature)
stripchart(expend ~ stature, method="stack") #堆积
stripchart(expend ~ stature, method="jitter") #可以help中搜索stripchart自己看一下
stripchart(expend ~ stature, method="jitter", jitter=.03)
par(opar)

stripchart(list(lean=expend.lean, obese=expend.obese))
# stripchart(expend.lean, expend.obese) #错误用法，这么写会报错的

有点像之前提到的点图

在这里插入图片描述

5、表格

#是一个不同婚姻状态下孕妇咖啡因摄入情况的数据
caff.marital <- matrix(c(652,1537,598,242,36,46,38,21,218
                         ,327,106,67),
                       nrow=3,byrow=T) #按行生成矩阵
caff.marital
colnames(caff.marital) <- c("0","1-150","151-300",">300") #添加标签名
rownames(caff.marital) <- c("Married","Prev.married","Single")
caff.marital
names(dimnames(caff.marital)) <- c("marital","consumption") #添加分类名
caff.marital
as.data.frame(as.table(caff.marital)) #生成表格形式

分析中，建议生成表格形式，方便后续分析

attach(juul)
table(sex) #juul dataset
table(sex,menarche)
table(menarche,tanner)

xtabs(~ tanner + sex, data=juul)
xtabs(~ dgn + diab + coma, data=stroke) #文章中用的不多

ftable(coma + diab ~ dgn, data=stroke)#~前面是行，后面是列

caff.marital
t(caff.marital) #转置

5.1 编辑表格和频数

attach(juul)
tanner.sex <- table(tanner,sex)
tanner.sex
margin.table(tanner.sex,1) #1代表行，按行计算频数
margin.table(tanner.sex,2) #2代表列
prop.table(tanner.sex,1) #按行来自算比例
tanner.sex/sum(tanner.sex) #对上述结果计算百分比，每个值除以所有的总和

6、表格的图形展示

6.1 条形图

total.caff <- margin.table(caff.marital,2)
total.caff
barplot(total.caff, col="white") #按列的频数画直方图

par(mfrow=c(2,2))
barplot(caff.marital, col="white") #自动有截断
barplot(t(caff.marital), col="white")
barplot(t(caff.marital), col="white", beside=T) #分开画
barplot(prop.table(t(caff.marital),2), col="white", beside=T) #按照比例来画
par(mfrow=c(1,1)) #记得还原画布

在这里插入图片描述

修饰一下：

barplot(prop.table(t(caff.marital),2),beside=T,
        legend.text=colnames(caff.marital),
        col=c("white","grey80","grey50","black"))

6.2 点图

dotchart(t(caff.marital), lcolor="black")

在这里插入图片描述

6.3 饼图

opar <- par(mfrow=c(2,2),mex=0.8, mar=c(1,1,2,1))
slices <- c("white","grey80","grey50","black")
pie(caff.marital["Married",], main="Married", col=slices)
pie(caff.marital["Prev.married",],
    main="Previously married", col=slices)
pie(caff.marital["Single",], main="Single", col=slices)
par(opar)

二、线性相关与秩相关

1、三种相关系数

cor()函数可以计算以下三种相关系数，而cov()函数可用来计算协方差。两个函数的参数有很多，其中与相关系数的计算有关的参数可以简化为：

cor(x, use= ,method= )

$[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-zj7tpoNV-1660202889337)(C:\Users\HP\AppData\Roaming\Typora\typora-user-images\1660198306910.png)]$

1.1 Pearson相关系数

先看一个数据集：

$[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-NXuQO4x4-1660202889338)(C:\Users\HP\AppData\Roaming\Typora\typora-user-images\1660198323917.png)]$

检查一下，有缺失数据，缺失数据处理见上表，默认采用Pearson相关系数

library(ISwR)
attach(thuesen)
cor(blood.glucose,short.velocity) # WRONG
cor(blood.glucose,short.velocity,use="complete.obs")
cor(thuesen,use="complete.obs") #计算两两间的相关系数

1.2 Spearman秩相关

cor(blood.glucose,short.velocity,use="complete.obs",method = ("spearman"))

1.3 Kendall相关系数（非参数检验）

cor(blood.glucose,short.velocity,use="complete.obs",method = ("kendall"))

2、相关系数的显著性检验

可以使用cor.test()函数对单个的Pearson、 Spearman和Kendall相关系数进行检验。简化后的使用格式为：

cor.test(x, y, alternative = , method = )

其中的x和y为要检验相关性的变量， alternative则用来指定进行双侧检验或单侧检验（取值为"two.side"、 “less"或"greater”），而method用以指定要计算的相关类型（“pearson”、 “kendall” 或"spearman" ）。当研究的假设为总体的相关系数小于0 时，请使用alternative=“less” 。在研究的假设为总体的相关系数大于0 时，应使用alternative=“greater”。在默认情况下，假设为alternative=“two.side”（总体相关系数不等于0）

#Pearson相关系数检验
cor.test(blood.glucose,short.velocity)

#结果输出
	Pearson's product-moment correlation

data:  blood.glucose and short.velocity
t = 2.101, df = 21, p-value = 0.0479 #p值
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.005496682 0.707429479
sample estimates:
      cor 
0.4167546 #Pearson相关系数

#Spearman秩相关
cor.test(blood.glucose,short.velocity,method="spearman")

#Kendall相关系数
cor.test(blood.glucose,short.velocity,method="kendall")

3、偏相关

多因素回归分析中，系数是偏回归系数，控制其他混杂因素的影响，判断两个因素是否相关，偏相关是指在控制一个或多个定量变量时，另外两个定量变量之间的相互关系。你可以使用ggm包中的pcor()函数计算偏相关系数。 ggm包没有被默认安装，在第一次使用之前需要先进行安装。函数调用格式为： pcor(u, S) 其中的u是一个数值向量，前两个数值表示要计算相关系数的变量下标，其余的数值为条件变量（即要排除影响的变量）的下标。 S为变量的协方差阵

states<- state.x77[,1:6] #state.x77 数据集
summary(states)#都是连续性数据，可以算
cov(states)
cor(states)
cor(states, method="spearman")
x <- states[,c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[,c("Life Exp", "Murder")]
cor(x,y)

# partial correlations
library(ggm)
# partial correlation of population and murder rate, controlling
# for income, illiteracy rate, and HS graduation rate
pcor(c(1,5,2,3,6), cov(states)) #控制2，3，6列，看1，5之间偏相关系数

对偏相关系数进行假设检验：

pcor.test(pcor(c(1,5,2,3,6), cov(states)),5,50) #5个变量，样本量为50

#或者这样写
pcor.test(0.3462724,5,50) #把偏相关系数结果带进去

三、组间差异t检验

1、单样本t检验

#数据集
Here is an example concerning daily energy intake in kJ for 11 women(Altman, 1991, p. 183). First, the values are placed in a data vector:
<daily.intake <- c(5260,5470,5640,6180,6390,6515,

- 6805,7515,7515,8230,8770)

daily.intake <- c(5260,5470,5640,6180,6390,6515,
                  6805,7515,7515,8230,8770)
mean(daily.intake)
sd(daily.intake)
quantile(daily.intake)

t.test(daily.intake,mu=7725) #mu总体均数

#结果输出
data:  daily.intake
t = -2.8208, df = 10, p-value = 0.01814 #p值与t值
alternative hypothesis: true mean is not equal to 7725
95 percent confidence interval: #样本均数的可信区间
 5986.348 7520.925
sample estimates:
mean of x 
 6753.636

2、两样本t检验

#数据集
We return to the daily energy expenditure data and consider the problemof comparing energy expenditures between lean and obese women.
The factor stature contains the group and the numeric variable expend
the energy expenditure in mega-Joules #胖瘦的能量消耗

t.test(expend~stature) #给到的结果是两组均值差值的95%CI
t.test(expend~stature, var.equal=T) #假定方差齐，更稳健一点

#99%CI
t.test(vital.capacity~group,conf=0.99,data=vitcap)

3、方差齐性检验

Even though it is possible in R to perform the two-sample t test withou the assumption that the variances are the same, you may still be
interested in testing that assumption, and R provides the var.test function for that purpose, implementing an F test on the ratio of the group variances. It is called the same way as t.test

检验：

var.test(expend~stature)

#结果输出
data:  expend by stature
F = 0.78445, num df = 12, denom df = 8,
p-value = 0.6797 #说明方差齐
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1867876 2.7547991
sample estimates:
ratio of variances 
          0.784446

4、配对样本t检验

#数据集
The data on pre- and postmenstrual energy intake in a group of women.
There data are entered from the command line, but they are also
available as a data set in the ISwR package #比较绝经前与绝经后有无差别

attach(intake)
intake
post - pre #算前后差值
t.test(pre, post, paired=T)

t.test(pre, post) #WRONG!不行，这么算是两样本t检验

四、组间差异秩和检验

数据集和t检验是对应的

1、单样本Wilcoxon符号秩检验

The t tests are fairly robust against departures from the normal distribution especially in larger samples, but sometimes you wish to avoid making that assumption. To this end, the distribution-free methods are
convenient. These are generally obtained by replacing data with corresponding order statistics.

wilcox.test(daily.intake, mu=7725)

2、两样本Wilcoxon符号秩检验

You might prefer a nonparametric test if you doubt the normal distribution assumptions of the t test. The two-sample Wilcoxon test is based on replacing the data by their rank (without regard to grouping) and calculating the sum of the ranks in one group, thus reducing the problem to one of sampling n1 values without replacement from the numbers 1 to n1 +n2. This is done using wilcox.test, which behaves similarly to t.test

wilcox.test(expend~stature)

wilcox.test(react)
wilcox.test(vital.capacity~group, data=vitcap)

能用t检验就可以用非参数检验，反过来却未必

3、配对Wilcoxon符号秩检验

The paired Wilcoxon test is the same as a one-sample Wilcoxon signedrank test on the differences. The call is completely analogous to t.test

wilcox.test(pre, post, paired=T)

the problem to one of sampling n1 values without replacement from the numbers 1 to n1 +n2. This is done using wilcox.test, which behaves similarly to t.test

wilcox.test(expend~stature)

wilcox.test(react)
wilcox.test(vital.capacity~group, data=vitcap)

能用t检验就可以用非参数检验，反过来却未必

3、配对Wilcoxon符号秩检验

The paired Wilcoxon test is the same as a one-sample Wilcoxon signedrank test on the differences. The call is completely analogous to t.test

wilcox.test(pre, post, paired=T)

Annaaphq

关注

1
点赞
踩
25

收藏

觉得还不错? 一键收藏
0
评论
R——统计描述与基础统计分析方法

R统计描述、相关分析、t检验与非参数检验
复制链接

扫一扫

R——统计描述与基础统计分析方法

一、描述性统计分析

1、单组数据汇总统计量

1.1 均数、标准差、中位数、分位数计算

1.2 统计描述过程中的缺失值处理

2、数据分布类型的图形描述

2.1 直方图

2.2 经验累积分布图形

2.3 QQ图

2.4 箱式图

3、分组数据汇总统计量

3.1 tapply: 分组计算统计量

3.2 aggreate和by函数：分组计算统计量

4、分组数据的图形描述

4.1 直方图

4.2 并联箱式图

4.3 带状图

5、表格

5.1 编辑表格和频数

6、表格的图形展示

6.1 条形图

6.2 点图

6.3 饼图

二、线性相关与秩相关

1、三种相关系数

1.1 Pearson相关系数

1.2 Spearman秩相关

1.3 Kendall相关系数 （非参数检验）

2、相关系数的显著性检验

3、偏相关

三、组间差异t检验

1、单样本t检验

2、两样本t检验

3、方差齐性检验

4、配对样本t检验

四、组间差异秩和检验

1、单样本Wilcoxon符号秩检验

2、两样本Wilcoxon符号秩检验

3、配对Wilcoxon符号秩检验

3、配对Wilcoxon符号秩检验

1.3 Kendall相关系数（非参数检验）