Data science note 3

最新推荐文章于 2021-08-08 09:32:22 发布

upright man

最新推荐文章于 2021-08-08 09:32:22 发布

阅读量253

点赞数

分类专栏：小学期data science 文章标签： R语言 ggplot

本文链接：https://blog.csdn.net/uprightman_/article/details/107069543

版权

小学期data science 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

each sample gives us potentially a different result:there is sampling variability
we need to remember this sampling variability when interpreting the sampling results .
A larger sample size resulted in smaller variability.

R play roulette

values=c(“00”,0:36)
play=sample(values,1)
play

#bet on red
red_values=c(1,3,5 ,…)
red_bet=function(spin_value){
return (spin_value %in% red_values)}
red_bet(play)

normal distribution

x=rnorm(10000,mean=70,sd=4)
t=seq(min(x),max(x),length.out=1000)
hist(x,probability=TRUE,ylim=c(0,0.1))
lines(t,dnorm(t,mean=70,sd=4),col=2,lwd=2)

seq(from,to,length.out=) 表示生成一组从from到to的数量为num的数 by = ((to - from)/(length.out - 1)) 附seq的其它用法： seq(from, to) seq(from, to, by= ) seq(from, to, length.out= ) seq(along.with= ) seq(from) seq(length.out= )

interpreting the result from one sample

ssize=50
actualsample=sample(UGRace,ssize,replace=F)
sum(actualsample==“Black or African American,non-Hispanic”)/ssize

in reality ,we have only one sample and we do not get to see the entire population.How can we get an idea of how wrong we could be ?
to repeat the random process,we need a population to sample from.
as a “best guess” for the population we can use the sample we actually observed.
repeated sampling with replacement for the population defined by the sample is an idea.

bootstrap

code

ssize=50
actualsample=sample(UGRace,ssize,replace=F)
B=10000
bsampleprop=NULL
for(i in 1:B)
{
bootsample=sample(actualsample,ssize,replace=T)
bsampleprop=c(bsampleprop,sum(bootsample==“Black or African American,non-Hispanic”)/ssize)
}

a toy example

population=c(3,3,3,5,5,5,7,7,7,9,9,9,10,10,10)
mean(population)

#here is our sample with its sample mean
actualsample=sample(population,5,replace=F)
mean(actualsample)
#we can calculate the standard deviation of the bootstrap sample to estimate the standard error of the sample estimator
sqrt(var(bsamplemean))
#or we can calculate a range of possible estimated value using quantiles
quantile(bsamplemean,c(.05,.95))

bootstrap principle

the bootstrap works here because we are able to replicate the chance process of drawing from an unknown population by drawing with replacement from our observed sample
works in the sense that it gives a pretty good estimate of the sampling variability.even when we do not know the exact population generating our sample.
also works in that we did not have to use any complicated calculations to get this estimate.

bootstrap doesn’t always work

1.if our chance process were different,then drawing repeatedly from our observed sample will not help us.
2.we have to believe that the resampling process we use is a reasonable approximation to the way our data was generated.
3.we might call this no free lunch principle: we can use the computer to do calculations for us about the sampling processes only when we know how to tell the computer to mimic the sampling process

bootstrap summary

we have a sample to infer about the population.
our estimate or summary from a sample will be different for different samples
we need to be aware of sampling variability
Bootstrap principle is a way to get to know sampling variability

step

1.start with data sample $x_1,...,x_n)$

2.draw a sample of size n with replacement from $x_1,...,x_n)$ call it $x_1^{*},...,x_n^{*})$ compute the summary statistics $S(x_1^{*},...,x_n^{*})$ repeat these steps for B times

3.finally we have B of variables from $S(x_1^{*},...,x_n^{*})$

1.drawback:it must replicate the same chance mechanism used to get the sample from the population
2.computation complexity

we have not commented on this before ,but a number of the histograms of the estimates derived by multiple samples look like normal distribution.

looking at the problem of estimating th mean for U[50,100].We have noted that as the sample size increases the variability of the sample estimates decreases

Uniform[a,b]
mean=(b+a)/2
variance=(b^2 - a^2)/12
it depends on the length of the interval

causes of variability
1.sample size
2.variation in the population

data:x1,…,xn
average $\bar{x}=\frac1n\sum^n_{i=1}x_i$
xi=1 if i th person is black
=0 otherwise
xbar=proportion of black students
1.taking sums(or averages) of many independent quantities we obtain a normal distribution(this is known at the central limit theorem)
2.this can be used as a definition of what the normal distribution is and where it comes from
3. it is also known as Gaussian,as Gauss derives it in Theoria motus corporum coelestium in sectionibus conicis solem ambientium(1809)(a fairly impressive work)

sampling variance I:the bootstrap

The variance of the bootstrap samples is an estimate of the variance of S(X1,…,Xn)across all possible samples
average $((S(X_1,...,X_n)-S_b)^2)≈\frac1{B}\sum^{B}_{b=1}(S_b-\bar{S})^2$
standard error $((S(X_1,...,X_n))≈\sqrt{\frac1{B}\sum^{B}_{b=1}(S_b-\bar{S})^2}$

sampling variance II:

sample(X1,…,Xn)
Sample summmary:S(X1,…,Xn)which we use to estimate the corresponding value in the population Spop
the variance of the sample gives us a way to estimate of the variance of S(X1,…,Xn) across all possible samples
$Average((S(X_1,...,X_n)-S_{pop})^2)≈\frac1{n}\sum^{n}_{i=1}\frac{(X_i-\bar{X})^2}n$
$Error(S(X_1,...,X_n))≈\sqrt{\frac1{n}\sum^{n}_{i=1}\frac{(X_i-\bar{X})^2}n}$

interval estimation
the idea is to report not one number estimate ,but a range of possible values that you expect to cover the true value of the population summary
two rules that we can write out and guarantee that ,in repeated experiments,the intervals cover the true value of the population summary 95% of the times
For averages
$(\bar{X}-\frac{2SD(X_1,....,X_n)}{\sqrt{n}},\bar{X}+\frac{2SD(X_1,....,X_n)}{\sqrt{n}}$

For Bootstrap
$Quantile_{Bootsample}(0.025),Quantile_{Bootsample}(0.975))$

Rstudio

library(readr)
stanford =read_delim(“D:\专业英文拓展\data\Stanford.txt”,"\t",escape_double=FALSE)
stanford=stanford[,c(1,3)]
names(stanford)=c(“Race/Ethnicity”,“Number”)
stanford=stanford[-10,]
UGRace=rep(Stanford$“Race/Ethnicity”,stanford$Number)
par(mar=c(5,22,4,2))
barplot(sort(summary(as.factor(UGRace)))/length(UGRace),horiz=T,las=1)

"\t"默认分隔符为制表符，stanford=stanford[-10,]删去第十行 par(mar=c(5,22,4,2))# base R style graphics command for margins #par allows you to have multiple graphs side by side summary(as.factor(UGRace)))计算频数 ,horiz=T，坐标轴旋转，las=1字符均水平。 horizotal参数指定是否将图旋转90度使得x轴平行于纸的长边。horizontal=T,就是要旋转90°,从而绘制横向柱状图

ssize=100
B=1000
SamplePropB=NULL
SamplePropH=NULL
for(i in 1:B){observation=sample(UGRace,ssize,replace=FALSE)
SamplePrppB=c(SamplePropB,sum(observation==“Black or African American,non-Hispanic”)/ssize)
SamplePropB=c(SamplePropH,sum(obervation==“Hispanic/Llatino”)/ssize)
}
par(mfrow=c(1,2))
hist(SamplePropB,main=“Black or African American”,xlab=“Sample Proportion”,xlim=c(0,.3))
abline(v=0.064,col=2,lwd=2)
hist(SamplePropH,main=“Hispanic/Latino”,xlab=“Sample proportion”,xlim=c(0,.4))
abline(v=0.16,col=2,lwd=2)

进行B=1000次模拟获得均值分布

在这里插入图片描述

#Game of Roullette
values=c(“00”,0:36)
play=sample(values,1)
play

ssize1=100
ssize2=1000
B=1000
x=NULL
y=NULL
for(i in 1:B){
x=c(x,mean(red_bet(sample(value,ssize1,replace=TRUE))))
y=c(y,mean(red_bet(sample(value,ssize2,replace=TRUE))))
}
par(mfrow=c(1,2))
hist(x,main=“100 spins”,xlab=“Proportion of red”,xlim=c(0.28,0.62))
abline(v=18/38,col=2,lwd=1)
hist(y,main=“1000 spins”,xlab=“Proportion of red”,xlim=c(0.28,0.62))
abline(v=18/38,col=2,lwd=1)

在这里插入图片描述
#normal distribution
t=seq(min(X),max(X),length.out=1000)
hist(X,probability=TRUE,ylim=c(0,0.1))
lines(t,dnorm(t,mean=70,sd=4),col=2,lwd=2)

par(mfrow=c(1,2))
x=seq(-1,30,length,out=1000)
plot(x,dnorm(x,mean=15,sd=4),main=“density of a normal,mean=15 and sd=4”)
X=rnorm(1000,mean=15,sd=4)
hist(X,main=“Histogram of 1000 samples from N(15,4)”,freq=FALSE)
lines(x,dnorm(x,mean=15,sd=4),col=2)
在这里插入图片描述

#uniform distribution
par(mfrow=c(1,2))
x=seq(-1,30,length,out=1000)
plot(x,dunif(x,min=10,max=20),main=“density of a uniform between 10 and 20”)
X=runif(1000,min=10,max=20)
hist(X,main=“Histogram of 1000 samples from U[10,20]”,freq=FALSE)
lines(x,dunif(x, min=10,max=20),col=2)

#chi-square distribution
par(mfrow=c(1,2))
x=seq(0,30,length.out=1000)
plot(x,dchisq(x,df=10),main=“density of a chi-square with 10 df”,ylab=“density”,pty=“l”)
X=rchisq(1000,df=10)
hist(X,main=“Histogram of 1000 samples from chi^2(10)”,freq=FALSE)
lines(x,dchisq(x,df=10),col=2)

#Empirical density plots
ssize=50
X=rnorm(ssize,mean=5,sd=2)
hist(X,freq=FALSE)
lines(density(X),col=“blue”,lwd=2)
lines(sort(X),dnorm(sort(X),mean=5,sd=2),col=“red”,lwd=2)

#ssize=1000
#ssize=10000
ssize <- 100000
X <- rnorm(ssize,mean=5,sd=2)
par(mfrow=c(1,2))
hist(X, freq=FALSE)
lines(sort(X), dnorm(sort(X),mean=5,sd=2), col=‘red’, lwd=2)
hist(X, freq=FALSE,breaks=70)
lines(sort(X), dnorm(sort(X),mean=5,sd=2), col=‘red’, lwd=2)
hist(X, freq=FALSE,breaks=70) breaks 代表间隔点的个数，个数越多越密，越逼近正态

#Question :what do you observe from the above three plots?

#sample mean examples
X=runif(10000,min=50,max=100)
par(mfrow=c(1,1))
hist(X,freq=FALSE)
abline(v=mean(X),col=2,lwd=2)
mean(X)
lines(sort(X),dunif(sort(X),min=50,max=100),col=“blue”)

ssize=c(10,100,1000,5000)
B=1000
smean=matrix(nrow=B,ncol=length(ssize))
for(i in 1:B){
for(j in 1:length(ssize)){
x=runif(ssize[j],min=50,max=100)
smean[i,j]=mean(X)}}
par(mfrow=c(2,2))
for(i in 1:4){
hist(smean[,i],freq=FALSE,main=“Histogram of 1000 sample means”,xlab=paste(“Mean of a sample of size”,ssize[i]),xlim=c(60,90))}
par(mfrow=c(2,2),mar=c(5,4,4,2))
for(i in 1:4){
hist(smean[,i],freq=FALSE,main=“Histogram of 1000 sample means”,xlab =paste(“Mean if a sample of size”,ssize[i]),xlim=c(60,90))}

在这里插入图片描述
#poisson distribution
library(stats)
X=rpois(10000,lambda=15)
summary(X)
par(mfrow=c(1,1))
hist(X,freq=FALSE)
abline(v=mean(X),col=2,lwd=2)
points(sort(X),dpois(sort(X),15),col=“blue”,pch=20)

ssize=c(10,100,1000,5000)
B=1000
smean=matrix(nrow=B,ncol=length(ssize))
for(i in 1:B){
for(j in 1:length(ssize)){
X=rpois(ssize[j],15)
smean[i,j]=mean(X)
}}
par(mfrow=c(2,2))
for(i in 1:4){
hist(smean[,i],freq=FALSE,main=“histogram of 1000 sample means”,xlab=paste(“mean of a sample of size”,ssize[i]),xlim=c(11,19))
}
在这里插入图片描述

#Effect of sample size on sample mean and sampling variability
ssize=c(10,25,50,75,100,150,500,1000,5000,10000)
B=1000
smean=matrix(NA,nrow=B,ncol=length(ssize))
for(i in 1:B)
{
for(j in 1:length(ssize))
{
X=runif(ssize[j],min=50,max=100)
smean[i,j]=mean(X)
}
}
mean.sample=apply(smean,2,mean)
sd.sample=sqrt(apply(smean,2,var))
par(mfrow=c(1,2))
plot(ssize,mean.sample,xlab=“sample size”,ylab="",main=“average value of estimate”)
plot(ssize,sd.sample,xlab=“sample size”,ylab="",main=“standard deviation of estimate”)
#plot in the log scale
par(mfrow=c(1,2))
plot(ssize,log(mean.sample),xlab=“log(sample size)”,ylab="",main=“average value of estimate”)
plot(ssize,log(sd.sample),xlab=“sample size”,ylab="",main=“standard deviation of estimate”)

在这里插入图片描述

#bootstrap example
library(boot)
data(“claridge”)
summary(claridge)
#let us have a look
library(ggplot2)
ggplot(claridge,aes(x=dnan,y=hand))+geom_count(alpha=0.5,col=“blue”)+ylab(“propensity to use left hand”)+xlab(“DNA variant”)+ggtitle(“claridge data”)

geom_count为计算重叠的点的计数图重叠的点越多点越大

在这里插入图片描述

B=10000
ssize=nrow(claridge)
bcorr=NULL
for(i in 1:B)
{
bsample=claridge[sample(1:ssize,ssize,replace=TRUE),]
bcorr=c(bcorr,cor(bsample[,1],bsample[,2]))
}
par(mfrow=c(1,1))
hist(bcorr,main=“Bootstrap variability”,xlab="")
nrow(claridge)计算行数，运用bootstrap，从样本中抽取样本：bsample=claridge[sample(1:ssize,ssize,replace=TRUE),]

在这里插入图片描述

library(dplyr)
correlations=vector()
for(i in 1:10000){
sample=sample_n(claridge,37,replace=T)
correlations[i]=cor(samples[,1],samples[,2])
}
correlations=data.frame(correlations)
plot5=ggplot(data=correlations,aes(x=correlations))+geom_histogram(bins=20,fill=“cornflowerblue”,alpha=0.8)+labs(x=“correlation in Bootstrap sample”,title=“Bootstrap variability”)+theme_bw()
plot5

mean(bcorr)
sd(bcorr)

ggplot内data需为dataframe，bins=20柱状20条，fill为填充颜色，alpha为透明度

在这里插入图片描述

upright man

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Data science note 3

each sample gives us potentially a different result:there is sampling variabilitywe need to remember this sampling variability when interpreting the sampling results .A larger sample size resulted in smaller variability.R play roulettevalues=c(“00”,0:
复制链接

扫一扫