Data science note 3

each sample gives us potentially a different result:there is sampling variability
we need to remember this sampling variability when interpreting the sampling results .
A larger sample size resulted in smaller variability.

R play roulette

values=c(“00”,0:36)
play=sample(values,1)
play

#bet on red
red_values=c(1,3,5 ,…)
red_bet=function(spin_value){
return (spin_value %in% red_values)}
red_bet(play)

normal distribution

x=rnorm(10000,mean=70,sd=4)
t=seq(min(x),max(x),length.out=1000)
hist(x,probability=TRUE,ylim=c(0,0.1))
lines(t,dnorm(t,mean=70,sd=4),col=2,lwd=2)

seq(from,to,length.out=)
表示生成一组从from到to的数量为num的数
by = ((to - from)/(length.out - 1))
附seq的其它用法:
seq(from, to)
seq(from, to, by= )
seq(from, to, length.out= )
seq(along.with= )
seq(from)
seq(length.out= )

interpreting the result from one sample

ssize=50
actualsample=sample(UGRace,ssize,replace=F)
sum(actualsample==“Black or African American,non-Hispanic”)/ssize

in reality ,we have only one sample and we do not get to see the entire population.How can we get an idea of how wrong we could be ?
to repeat the random process,we need a population to sample from.
as a “best guess” for the population we can use the sample we actually observed.
repeated sampling with replacement for the population defined by the sample is an idea.

bootstrap
code

ssize=50
actualsample=sample(UGRace,ssize,replace=F)
B=10000
bsampleprop=NULL
for(i in 1:B)
{
bootsample=sample(actualsample,ssize,replace=T)
bsampleprop=c(bsampleprop,sum(bootsample==“Black or African American,non-Hispanic”)/ssize)
}

a toy example

population=c(3,3,3,5,5,5,7,7,7,9,9,9,10,10,10)
mean(population)

#here is our sample with its sample mean
actualsample=sample(population,5,replace=F)
mean(actualsample)
#we can calculate the standard deviation of the bootstrap sample to estimate the standard error of the sample estimator
sqrt(var(bsamplemean))
#or we can calculate a range of possible estimated value using quantiles
quantile(bsamplemean,c(.05,.95))

bootstrap principle

the bootstrap works here because we are able to replicate the chance process of drawing from an unknown population by drawing with replacement from our observed sample
works in the sense that it gives a pretty good estimate of the sampling variability.even when we do not know the exact population generating our sample.
also works in that we did not have to use any complicated calculations to get this estimate.

bootstrap doesn’t always work

1.if our chance process were different,then drawing repeatedly from our observed sample will not help us.
2.we have to believe that the resampling process we use is a reasonable approximation to the way our data was generated.
3.we might call this no free lunch principle: we can use the computer to do calculations for us about the sampling processes only when we know how to tell the computer to mimic the sampling process

bootstrap summary

we have a sample to infer about the population.
our estimate or summary from a sample will be different for different samples
we need to be aware of sampling variability
Bootstrap principle is a way to get to know sampling variability

step

1.start with data sample ( x 1 , . . . , x n ) (x_1,...,x_n) (x1,...,xn)

2.draw a sample of size n with replacement from ( x 1 , . . . , x n ) (x_1,...,x_n) (x1,...,xn)
call it ( x 1 ∗ , . . . , x n ∗ ) (x_1^{*},...,x_n^{*}) (x1,...,xn)
compute the summary statistics S ( x 1 ∗ , . . . , x n ∗ ) S(x_1^{*},...,x_n^{*}) S(x1,...,xn)

repeat these steps for B times

3.finally we have B of variables from S ( x 1 ∗ , . . . , x n ∗ ) S(x_1^{*},...,x_n^{*}) S(x1,...,xn)

1.drawback:it must replicate the same chance mechanism used to get the sample from the population
2.computation complexity

we have not commented on this before ,but a number of the histograms of the estimates derived by multiple samples look like normal distribution.

looking at the problem of estimating th mean for U[50,100].We have noted that as the sample size increases the variability of the sample estimates decreases

Uniform[a,b]
mean=(b+a)/2
variance=(b^2 - a^2)/12
it depends on the length of the interval

causes of variability
1.sample size
2.variation in the population

data:x1,…,xn
average x ˉ = 1 n ∑ i = 1 n x i \bar{x}=\frac1n\sum^n_{i=1}x_i xˉ=n1i=1nxi
xi=1 if i th person is black
=0 otherwise
xbar=proportion of black students
1.taking sums(or averages) of many independent quantities we obtain a normal distribution(this is known at the central limit theorem)
2.this can be used as a definition of what the normal distribution is and where it comes from
3. it is also known as Gaussian,as Gauss derives it in Theoria motus corporum coelestium in sectionibus conicis solem ambientium(1809)(a fairly impressive work)

sampling variance I:the bootstrap

The variance of the bootstrap samples is an estimate of the variance of S(X1,…,Xn)across all possible samples
average ( ( S ( X 1 , . . . , X n ) − S b ) 2 ) ≈ 1 B ∑ b = 1 B ( S b − S ˉ ) 2 ((S(X_1,...,X_n)-S_b)^2)≈\frac1{B}\sum^{B}_{b=1}(S_b-\bar{S})^2 ((S(X1,...,Xn)Sb)2)B1b=1B(SbSˉ)2
standard error ( ( S ( X 1 , . . . , X n ) ) ≈ 1 B ∑ b = 1 B ( S b − S ˉ ) 2 ((S(X_1,...,X_n))≈\sqrt{\frac1{B}\sum^{B}_{b=1}(S_b-\bar{S})^2} ((S(X1,...,Xn))B1b=1B(SbSˉ)2

sampling variance II:

sample(X1,…,Xn)
Sample summmary:S(X1,…,Xn)which we use to estimate the corresponding value in the population Spop
the variance of the sample gives us a way to estimate of the variance of S(X1,…,Xn) across all possible samples
A v e r a g e ( ( S ( X 1 , . . . , X n ) − S p o p ) 2 ) ≈ 1 n ∑ i = 1 n ( X i − X ˉ ) 2 n Average((S(X_1,...,X_n)-S_{pop})^2)≈\frac1{n}\sum^{n}_{i=1}\frac{(X_i-\bar{X})^2}n Average((S(X1,...,Xn)Spop)2)n1i=1nn(XiXˉ)2
S t a n d a r d E r r o r ( S ( X 1 , . . . , X n ) ) ≈ 1 n ∑ i = 1 n ( X i − X ˉ ) 2 n Standard Error(S(X_1,...,X_n))≈\sqrt{\frac1{n}\sum^{n}_{i=1}\frac{(X_i-\bar{X})^2}n} StandardError(S(X1,...,Xn))n1i=1nn(XiXˉ)2

interval estimation
the idea is to report not one number estimate ,but a range of possible values that you expect to cover the true value of the population summary
two rules that we can write out and guarantee that ,in repeated experiments,the intervals cover the true value of the population summary 95% of the times
For averages
( X ˉ − 2 S D ( X 1 , . . . . , X n ) n , X ˉ + 2 S D ( X 1 , . . . . , X n ) n (\bar{X}-\frac{2SD(X_1,....,X_n)}{\sqrt{n}},\bar{X}+\frac{2SD(X_1,....,X_n)}{\sqrt{n}} (Xˉn 2SD(X1,....,Xn),Xˉ+n 2SD(X1,....,Xn)

For Bootstrap
( Q u a n t i l e B o o t s a m p l e ( 0.025 ) , Q u a n t i l e B o o t s a m p l e ( 0.975 ) ) (Quantile_{Bootsample}(0.025),Quantile_{Bootsample}(0.975)) (QuantileBootsample(0.025),QuantileBootsample(0.975))

Rstudio

library(readr)
stanford =read_delim(“D:\专业英文拓展\data\Stanford.txt”,"\t",escape_double=FALSE)
stanford=stanford[,c(1,3)]
names(stanford)=c(“Race/Ethnicity”,“Number”)
stanford=stanford[-10,]
UGRace=rep(Stanford$“Race/Ethnicity”,stanford$Number)
par(mar=c(5,22,4,2))
barplot(sort(summary(as.factor(UGRace)))/length(UGRace),horiz=T,las=1)

"\t"默认分隔符为制表符,stanford=stanford[-10,]删去第十行
par(mar=c(5,22,4,2))# base R style graphics command for margins #par allows you to have multiple graphs side by side
summary(as.factor(UGRace)))计算频数 ,horiz=T,坐标轴旋转,las=1字符均水平。
horizotal参数指定是否将图旋转90度使得x轴平行于纸的长边。horizontal=T,就是要旋转90°,从而绘制横向柱状图

ssize=100
B=1000
SamplePropB=NULL
SamplePropH=NULL
for(i in 1:B){observation=sample(UGRace,ssize,replace=FALSE)
SamplePrppB=c(SamplePropB,sum(observation==“Black or African American,non-Hispanic”)/ssize)
SamplePropB=c(SamplePropH,sum(obervation==“Hispanic/Llatino”)/ssize)
}
par(mfrow=c(1,2))
hist(SamplePropB,main=“Black or African American”,xlab=“Sample Proportion”,xlim=c(0,.3))
abline(v=0.064,col=2,lwd=2)
hist(SamplePropH,main=“Hispanic/Latino”,xlab=“Sample proportion”,xlim=c(0,.4))
abline(v=0.16,col=2,lwd=2)

进行B=1000次模拟获得均值分布

在这里插入图片描述

#Game of Roullette
values=c(“00”,0:36)
play=sample(values,1)
play

ssize1=100
ssize2=1000
B=1000
x=NULL
y=NULL
for(i in 1:B){
x=c(x,mean(red_bet(sample(value,ssize1,replace=TRUE))))
y=c(y,mean(red_bet(sample(value,ssize2,replace=TRUE))))
}
par(mfrow=c(1,2))
hist(x,main=“100 spins”,xlab=“Proportion of red”,xlim=c(0.28,0.62))
abline(v=18/38,col=2,lwd=1)
hist(y,main=“1000 spins”,xlab=“Proportion of red”,xlim=c(0.28,0.62))
abline(v=18/38,col=2,lwd=1)

在这里插入图片描述
#normal distribution
t=seq(min(X),max(X),length.out=1000)
hist(X,probability=TRUE,ylim=c(0,0.1))
lines(t,dnorm(t,mean=70,sd=4),col=2,lwd=2)

par(mfrow=c(1,2))
x=seq(-1,30,length,out=1000)
plot(x,dnorm(x,mean=15,sd=4),main=“density of a normal,mean=15 and sd=4”)
X=rnorm(1000,mean=15,sd=4)
hist(X,main=“Histogram of 1000 samples from N(15,4)”,freq=FALSE)
lines(x,dnorm(x,mean=15,sd=4),col=2)
在这里插入图片描述

#uniform distribution
par(mfrow=c(1,2))
x=seq(-1,30,length,out=1000)
plot(x,dunif(x,min=10,max=20),main=“density of a uniform between 10 and 20”)
X=runif(1000,min=10,max=20)
hist(X,main=“Histogram of 1000 samples from U[10,20]”,freq=FALSE)
lines(x,dunif(x, min=10,max=20),col=2)

#chi-square distribution
par(mfrow=c(1,2))
x=seq(0,30,length.out=1000)
plot(x,dchisq(x,df=10),main=“density of a chi-square with 10 df”,ylab=“density”,pty=“l”)
X=rchisq(1000,df=10)
hist(X,main=“Histogram of 1000 samples from chi^2(10)”,freq=FALSE)
lines(x,dchisq(x,df=10),col=2)

#Empirical density plots
ssize=50
X=rnorm(ssize,mean=5,sd=2)
hist(X,freq=FALSE)
lines(density(X),col=“blue”,lwd=2)
lines(sort(X),dnorm(sort(X),mean=5,sd=2),col=“red”,lwd=2)

#ssize=1000
#ssize=10000
ssize <- 100000
X <- rnorm(ssize,mean=5,sd=2)
par(mfrow=c(1,2))
hist(X, freq=FALSE)
lines(sort(X), dnorm(sort(X),mean=5,sd=2), col=‘red’, lwd=2)
hist(X, freq=FALSE,breaks=70)
lines(sort(X), dnorm(sort(X),mean=5,sd=2), col=‘red’, lwd=2)
hist(X, freq=FALSE,breaks=70) breaks 代表间隔点的个数,个数越多越密,越逼近正态

#Question :what do you observe from the above three plots?

#sample mean examples
X=runif(10000,min=50,max=100)
par(mfrow=c(1,1))
hist(X,freq=FALSE)
abline(v=mean(X),col=2,lwd=2)
mean(X)
lines(sort(X),dunif(sort(X),min=50,max=100),col=“blue”)

ssize=c(10,100,1000,5000)
B=1000
smean=matrix(nrow=B,ncol=length(ssize))
for(i in 1:B){
for(j in 1:length(ssize)){
x=runif(ssize[j],min=50,max=100)
smean[i,j]=mean(X)}}
par(mfrow=c(2,2))
for(i in 1:4){
hist(smean[,i],freq=FALSE,main=“Histogram of 1000 sample means”,xlab=paste(“Mean of a sample of size”,ssize[i]),xlim=c(60,90))}
par(mfrow=c(2,2),mar=c(5,4,4,2))
for(i in 1:4){
hist(smean[,i],freq=FALSE,main=“Histogram of 1000 sample means”,xlab =paste(“Mean if a sample of size”,ssize[i]),xlim=c(60,90))}

在这里插入图片描述
#poisson distribution
library(stats)
X=rpois(10000,lambda=15)
summary(X)
par(mfrow=c(1,1))
hist(X,freq=FALSE)
abline(v=mean(X),col=2,lwd=2)
points(sort(X),dpois(sort(X),15),col=“blue”,pch=20)

ssize=c(10,100,1000,5000)
B=1000
smean=matrix(nrow=B,ncol=length(ssize))
for(i in 1:B){
for(j in 1:length(ssize)){
X=rpois(ssize[j],15)
smean[i,j]=mean(X)
}}
par(mfrow=c(2,2))
for(i in 1:4){
hist(smean[,i],freq=FALSE,main=“histogram of 1000 sample means”,xlab=paste(“mean of a sample of size”,ssize[i]),xlim=c(11,19))
}
在这里插入图片描述

#Effect of sample size on sample mean and sampling variability
ssize=c(10,25,50,75,100,150,500,1000,5000,10000)
B=1000
smean=matrix(NA,nrow=B,ncol=length(ssize))
for(i in 1:B)
{
for(j in 1:length(ssize))
{
X=runif(ssize[j],min=50,max=100)
smean[i,j]=mean(X)
}
}
mean.sample=apply(smean,2,mean)
sd.sample=sqrt(apply(smean,2,var))
par(mfrow=c(1,2))
plot(ssize,mean.sample,xlab=“sample size”,ylab="",main=“average value of estimate”)
plot(ssize,sd.sample,xlab=“sample size”,ylab="",main=“standard deviation of estimate”)
#plot in the log scale
par(mfrow=c(1,2))
plot(ssize,log(mean.sample),xlab=“log(sample size)”,ylab="",main=“average value of estimate”)
plot(ssize,log(sd.sample),xlab=“sample size”,ylab="",main=“standard deviation of estimate”)

在这里插入图片描述

#bootstrap example
library(boot)
data(“claridge”)
summary(claridge)
#let us have a look
library(ggplot2)
ggplot(claridge,aes(x=dnan,y=hand))+geom_count(alpha=0.5,col=“blue”)+ylab(“propensity to use left hand”)+xlab(“DNA variant”)+ggtitle(“claridge data”)

geom_count为计算重叠的点的计数图 重叠的点越多点越大

在这里插入图片描述

B=10000
ssize=nrow(claridge)
bcorr=NULL
for(i in 1:B)
{
bsample=claridge[sample(1:ssize,ssize,replace=TRUE),]
bcorr=c(bcorr,cor(bsample[,1],bsample[,2]))
}
par(mfrow=c(1,1))
hist(bcorr,main=“Bootstrap variability”,xlab="")
nrow(claridge)计算行数,运用bootstrap,从样本中抽取样本:bsample=claridge[sample(1:ssize,ssize,replace=TRUE),]

在这里插入图片描述

library(dplyr)
correlations=vector()
for(i in 1:10000){
sample=sample_n(claridge,37,replace=T)
correlations[i]=cor(samples[,1],samples[,2])
}
correlations=data.frame(correlations)
plot5=ggplot(data=correlations,aes(x=correlations))+geom_histogram(bins=20,fill=“cornflowerblue”,alpha=0.8)+labs(x=“correlation in Bootstrap sample”,title=“Bootstrap variability”)+theme_bw()
plot5

mean(bcorr)
sd(bcorr)

ggplot内data需为dataframe,bins=20柱状20条,fill为填充颜色,alpha为透明度

在这里插入图片描述

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值