Go Over a Complete Example of Parametric Bootstrap Test with R

本文提供了一个使用R进行参数化引导检验的完整案例,以测试20-29岁和30-39岁人群平均脉搏率是否不同。通过陈述假设、导入数据、探索性分析、参数化、执行引导测试和得出统计推断,得出结论为样本差异的观察值对应的p值较大,未能拒绝原假设,即两个年龄段的平均脉搏率无显著差异。
摘要由CSDN通过智能技术生成


Below I did a complete example of a parametric test to illustrate some important concepts.

Question

Suppose we obtained a random sample (which is representative of the US population), and we grouped the individuals into different groups according to which age decade they are in (the AgeDecade column, whose values are strings). The data is about the individuals’ heartbeat per minute (pulse rate), which corresponds to the Pulse column in the data below. In this problem, we want to test if the collected data provide evidence that the average pulse rate of people aged 20-29 is different from the average pulse rate of people aged 30-39. You may assume pulse rates follow a normal distribution (for the entire population as well as for each age category). We run 10000 repetitions.

Solution

1. State the hypotheses

  • H 0 : μ 30 − 39 − μ 20 − 29 = 0 H_0:\mu_{30-39}-\mu_{20-29}=0 H0:μ3039μ2029=0
  • H a : μ 30 − 39 − μ 20 − 29 ≠ 0 H_a: \mu_{30-39}-\mu_{20-29}\ne 0 Ha:μ3039μ2029=0
    where μ \mu μ stands for the mean pulse rate for the corresponding US population, and the test statistic of interest is thus the difference x ˉ 30 − 39 − x ˉ 20 − 29 \bar x_{30-39}-\bar x_{20-29} xˉ3039xˉ2029
  • Equivalently, the hypotheses may be stated as:
    • Null hypothesis: the average pulse rate of people aged 20-29 is not different from the average pulse rate of people aged 30-39
    • Alternative hypothesis: the average pulse rate of people aged 20-29 is different from the average pulse rate of people aged 30-39.

2. Importing and examining the data

- You can refer to the code and outputs below
# load the data (.csv file located in the same file/working directory)
nhanes <- read.csv("nhanes.csv")

# Examine the data
head(nhanes)
summary(nhanes)
Fig.1 head(nhanes)
Fig.2 summary(nhanes)

3. Exploratory analysis of the data

  • From the boxplot below, we can infer that the test statistic is negative i.e. there is at least a difference in the means of the observed sample
  • -0.7768498 is this difference
# side-by-side boxplot our sample
boxplot(Pulse ~ AgeDecade, data = nhanes, horizontal = TRUE) 
boxplot(Pulse ~ AgeDecade, data = nhanes, horizontal = TRUE) Fig.2 boxplot()
# Calculating the test statistic of the sample
elder <- nhanes$Pulse[nhanes$AgeDecade=="30-39"]
elder_mean_pulse <- mean(elder)

younger <- nhanes$Pulse[nhanes$AgeDecade=="20-29"]
younger_mean_pulse <- mean(younger)

obs_diff <- elder_mean_pulse - younger_mean_pulse
obs_diff #  -0.7768498 is the result

4. Paramatrizing

  • Parametrizing the population under the null hypothesis of no difference using the obtained data
  • Because we assumed both groups of sample are from the same population, we can pool them together to estimate that hypothetical population. Moreover, since the sample is a random sample, we can assume the sample is representative of the “null population”, so the sample mean is going to estimate the population mean, and the sample standard deviation is going to estimate the population standard deviation.
mean <- mean(nhanes$Pulse)
mean # 75.15403

sd <- sd(nhanes$Pulse)
sd # 11.78723
# N(mean = 75.15403, sd = 11.78723) is the null population 

5. Executing the bootstrap test

  • The general implementation is straightforward. However, I must mention one big thing to notice in this step (I’ve made this mistake in the past):
    Because we have a two-tailed test, the order of your subtraction does not really matter (i.e. you can calculate either x ˉ 20 − 29 − x ˉ 30 − 39 \bar x_{20-29}-\bar x_{30-39} xˉ2029xˉ3039 or vice versa. Here, we assume we stritcly follow the hypotheses to use x ˉ 30 − 39 − x ˉ 20 − 29 \bar x_{30-39}-\bar x_{20-29} xˉ3039xˉ2029, and we denote our observation to be o b s _ d i f f = − 0.7768498 obs\_diff = -0.7768498 obs_diff=0.7768498. But one thing should always be implemented: you always want to make sure that when you are calculating the empirical p p p-value, you are looking for the probability that, under the null hypothesis, a random sample produces a difference in sample means that is at least as extreme or more extreme than your observed sample’s difference, which means

    P ( ( x ˉ 30 − 39 − x ˉ 20 − 29 = o b s _ d i f f ) ⏟ a s   e x t r e m e   a s   o u r   o b s e r v a t i o n ∨ ( x ˉ 30 − 39 − x ˉ 20 − 29 < o b s _ d i f f ) ∨ ( x ˉ 30 − 39 − x ˉ 20 − 29 > o b s _ d i f f ⏟ m o r e   e x t r e m e   t h a n   o u r   o b s e r v a t i o n ) ∣ H 0 ) P(\underbrace {(\bar x_{30-39}-\bar x_{20-29}=obs\_diff)}_{as ~extreme ~as ~our ~observation} \lor \underbrace{(\bar x_{30-39}-\bar x_{20-29}<obs\_diff) \lor (\bar x_{30-39}-\bar x_{20-29}>obs\_diff}_{more~extreme~than~our ~observation})|H_0) P(as extreme as our observation (xˉ3039xˉ2029=obs_diff)more extreme than our observation (xˉ3039xˉ2029<obs_diff)(xˉ3039xˉ2029>obs_diff)H0). Thus, we have two abs() function used in the line empirical_p <- mean(abs(diffs) >= abs(obs_diff)) below.
    Moreover, we have to round() the sampled values because the pulse rate should be integers

diffs <- rep(NA,10000)
for(i in seq_along(diffs)){
  group1 <- round(rnorm(length(elder),  mean = mean,sd = sd))
  group2 <- round(rnorm(length(younger),mean = mean,sd = sd))
  diff <- mean(group1) - mean(group2)
  diffs[i] <- diff
}
empirical_p <- mean(abs(diffs) >= abs(obs_diff)) 
# notice the two abs() above
# Moreover, recall that mean() applied 
# to a logical vector computes the proportion
empirical_p # this is  0.241

6. Draw conclusion (statistical inference)

  • Given the assumption that random sampling is implemented, we do a parametric bootstrap test. The empirical p-value, obtained by drawing random samples from a normal distribution N ( 75.15403 , 11.78723 ) N(75.15403, 11.78723) N(75.15403,11.78723) 10000 times, is 0.241, which is large. This is weak evidence against the null hypothesis of no difference, so we fail to reject the null hypothesis and may suggest that there is no difference between the average pulse rate of people aged 20-29 and the average pulse rate of people aged 30-39.

Full Code

nhanes <- read.csv("nhanes.csv")


head(nhanes)
summary(nhanes)


boxplot(Pulse ~ AgeDecade, data = nhanes, horizontal = TRUE) 


elder <- nhanes$Pulse[nhanes$AgeDecade=="30-39"]
elder_mean_pulse <- mean(elder)
younger <- nhanes$Pulse[nhanes$AgeDecade=="20-29"]
younger_mean_pulse <- mean(younger)
obs_diff <- elder_mean_pulse - younger_mean_pulse
obs_diff


diffs <- rep(NA,10000)
for(i in seq_along(diffs)){
  group1 <- round(rnorm(length(elder),  mean = mean,sd = sd))
  group2 <- round(rnorm(length(younger),mean = mean,sd = sd))
  diff <- mean(group1) - mean(group2)
  diffs[i] <- diff
}
empirical_p <- mean(abs(diffs) >= abs(obs_diff))
empirical_p
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值