Chapter 5: Exercise 9
Applying the Bootstrap: The Boston Housing Data Set
1. Load the data set
library(MASS)
summary(Boston)
##       crim             zn             indus            chas
##  Min.   : 0.01   Min.   :  0.0   Min.   : 0.46   Min.   :0.0000
##  1st Qu.: 0.08   1st Qu.:  0.0   1st Qu.: 5.19   1st Qu.:0.0000
##  Median : 0.26   Median :  0.0   Median : 9.69   Median :0.0000
##  Mean   : 3.61   Mean   : 11.4   Mean   :11.14   Mean   :0.0692
##  3rd Qu.: 3.68   3rd Qu.: 12.5   3rd Qu.:18.10   3rd Qu.:0.0000
##  Max.   :88.98   Max.   :100.0   Max.   :27.74   Max.   :1.0000
##       nox              rm             age              dis
##  Min.   :0.385   Min.   :3.56   Min.   :  2.9   Min.   : 1.13
##  1st Qu.:0.449   1st Qu.:5.89   1st Qu.: 45.0   1st Qu.: 2.10
##  Median :0.538   Median :6.21   Median : 77.5   Median : 3.21
##  Mean   :0.555   Mean   :6.29   Mean   : 68.6   Mean   : 3.79
##  3rd Qu.:0.624   3rd Qu.:6.62   3rd Qu.: 94.1   3rd Qu.: 5.19
##  Max.   :0.871   Max.   :8.78   Max.   :100.0   Max.   :12.13
##       rad             tax         ptratio         black
##  Min.   : 1.00   Min.   :187   Min.   :12.6   Min.   :  0.3
##  1st Qu.: 4.00   1st Qu.:279   1st Qu.:17.4   1st Qu.:375.4
##  Median : 5.00   Median :330   Median :19.1   Median :391.4
##  Mean   : 9.55   Mean   :408   Mean   :18.5   Mean   :356.7
##  3rd Qu.:24.00   3rd Qu.:666   3rd Qu.:20.2   3rd Qu.:396.2
##  Max.   :24.00   Max.   :711   Max.   :22.0   Max.   :396.9
##      lstat            medv
##  Min.   : 1.73   Min.   : 5.0
##  1st Qu.: 6.95   1st Qu.:17.0
##  Median :11.36   Median :21.2
##  Mean   :12.65   Mean   :22.5
##  3rd Qu.:16.95   3rd Qu.:25.0
##  Max.   :37.97   Max.   :50.0
2. Set the random seed so that the results are reproducible
set.seed(1)
attach(Boston)
3. Solutions
(a) Estimate of the population mean of medv, $\hat{\mu}$
medv.mean = mean(medv)
medv.mean
## [1] 22.53281
$\hat{\mu} = 22.53281$
(b) Estimate of the standard error $SE_{\hat{\mu}}$, with an interpretation of the result
medv.err = sd(medv)/sqrt(length(medv))
medv.err
## [1] 0.4088611
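$\widehat{SE}_{\hat{\mu}} = 0.4088611$, computed as the sample standard deviation divided by the square root of the number of observations. Across repeated samples of this size, the sample mean would typically deviate from the population mean by roughly 0.41.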
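As a worked check of the formula used in (b), the calculation can be written out as below. The value $s \approx 9.1971$ is not printed anywhere in the original; it is back-calculated here from the reported standard error and $n = 506$ (the t-test output in (d) shows df = 505), so treat it as an illustrative intermediate value.

```latex
$$
\widehat{SE}_{\hat{\mu}} \;=\; \frac{s}{\sqrt{n}}
\;\approx\; \frac{9.1971}{\sqrt{506}}
\;\approx\; \frac{9.1971}{22.4944}
\;\approx\; 0.4089
$$
```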
(c) Estimate $SE_{\hat{\mu}}$ using the bootstrap
boot.fn = function(data, index) return(mean(data[index]))
library(boot)
bstrap = boot(medv, boot.fn, 1000)
bstrap
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.02520692 0.4049032
The bootstrap estimate (0.4049) is nearly identical to the formula-based estimate from (b) (0.4089); the two differ by only about 0.004.
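To make explicit what boot() is doing in (c), here is a minimal base-R sketch of the same idea: resample the observations with replacement, recompute the mean each time, and take the standard deviation of those means. The names medv_vec, n_boot, boot_means and boot_se are illustrative and not part of the original solution. Because the random draws are consumed differently than in boot(), the result should land near 0.405 but will not match it digit for digit.

```r
library(MASS)  # provides the Boston data frame

set.seed(1)
medv_vec <- Boston$medv
n <- length(medv_vec)

# Number of bootstrap resamples (same as R = 1000 in the boot() call)
n_boot <- 1000

# Each replicate draws n observations with replacement and records the mean
boot_means <- replicate(n_boot, mean(sample(medv_vec, size = n, replace = TRUE)))

# The bootstrap standard error is the standard deviation of the resampled means
boot_se <- sd(boot_means)
boot_se
```

Writing the loop by hand is mainly useful for understanding; for real work the boot package also keeps the replicates and supports several interval types.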
(d) A 95% confidence interval for the mean of medv: compare the bootstrap interval with the one from t.test(Boston$medv).
way1: t.test(Boston$medv)
t.test(Boston$medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
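way2: bootstrap
# Use the bootstrap standard error from (c) rather than a hard-coded value
boot.se = sd(bstrap$t)
c(bstrap$t0 - 2 * boot.se, bstrap$t0 + 2 * boot.se)
## [1] 21.72300 23.34261
t.test: 21.73 to 23.34;
bootstrap: 21.72 to 23.34;
The two intervals agree closely: each endpoint differs by less than 0.01.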
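The interval in (d) is built by hand as the estimate plus or minus two standard errors. As an alternative sketch, the boot package's boot.ci() can produce a normal-approximation interval and a percentile interval from the same kind of bootstrap object. The names mean_fn, bs and ci are illustrative, the bootstrap is re-run here so the block is self-contained, and the endpoints will therefore differ slightly from those shown above. As far as I know, the normal interval from boot.ci() also subtracts the estimated bias, so it need not be centred exactly on the sample mean.

```r
library(MASS)  # Boston data
library(boot)  # boot() and boot.ci()

set.seed(1)

# Statistic in the (data, index) form that boot() expects
mean_fn <- function(data, index) mean(data[index])
bs <- boot(Boston$medv, mean_fn, R = 1000)

# Normal-approximation and percentile intervals at the 95% level
ci <- boot.ci(bs, conf = 0.95, type = c("norm", "perc"))

ci$normal   # columns: level, lower, upper
ci$percent  # columns: level, two order-statistic indices, lower, upper
```

The percentile interval reads its endpoints directly from the empirical distribution of the bootstrap means, so it does not rely on the factor of 2 used in the hand-built interval.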
(e) Estimate of the population median of medv, $\hat{\mu}_{med}$
medv.med = median(medv)
medv.med
## [1] 21.2
$\hat{\mu}_{med} = 21.2$
(f) Bootstrap estimate of $SE_{\hat{\mu}_{med}}$
boot.fn = function(data, index) return(median(data[index]))
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.02395 0.3820469
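The median is 21.2 with a bootstrap standard error of 0.382, slightly smaller than the bootstrap standard error of the mean (0.405) from (c).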
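(g) Estimate of the 10th percentile of medv across Boston suburbs, $\hat{\mu}_{0.1}$
medv.tenth = quantile(medv, c(0.1))
medv.tenth
## 10%
## 12.75
# Caution: sd/sqrt(n) is the standard error of the sample mean, not of a quantile.
# It is kept here only as a reference value and equals the result in (b).
medv.tenth.err = sd(medv) / sqrt(length(medv))
medv.tenth.err
## [1] 0.4088611
$\hat{\mu}_{0.1} = 12.75$. The value 0.409 is the standard error of the sample mean from (b); there is no comparably simple formula for the standard error of a percentile, so it should not be read as $\widehat{SE}_{\hat{\mu}_{0.1}}$. The bootstrap in (h) provides a proper estimate.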
(h) Bootstrap estimate of $SE_{\hat{\mu}_{0.1}}$
boot.fn = function(data, index) return(quantile(data[index], c(0.1)))
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0311 0.5063093
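$\hat{\mu}_{0.1} = 12.75$ with a bootstrap standard error of 0.506, which is small relative to the estimate (about 4%). It is larger than the 0.409 reported in (g), but that figure comes from the formula for the standard error of a mean, so the two are not directly comparable.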
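Parts (c), (f) and (h) repeat the same pattern with a different statistic each time. The sketch below collects the three statistics in a list and bootstraps each one in a single pass, so the standard errors can be compared side by side. The names stat_fns and se are illustrative. Because the random number stream is shared differently than in the separate calls above, the values will be close to 0.405, 0.382 and 0.506 but not identical.

```r
library(MASS)  # Boston data
library(boot)  # boot()

set.seed(1)

# One plain function per statistic of interest
stat_fns <- list(
  mean   = function(x) mean(x),
  median = function(x) median(x),
  q10    = function(x) quantile(x, 0.1, names = FALSE)
)

# Bootstrap each statistic and keep the standard deviation of its replicates
se <- sapply(stat_fns, function(f) {
  b <- boot(Boston$medv, function(data, index) f(data[index]), R = 1000)
  sd(b$t[, 1])
})

round(se, 3)
```

Wrapping each plain function in the (data, index) form inside sapply() keeps the statistic definitions short and makes it easy to add further quantiles to the list.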