# [python skill]基于python的bootstrap analysis方法

Q：What is a bootstrap replicate?

A：A single value of a statistic computed from a bootstrap sample.

Q:How many unique bootstrap samples can be drawn (e.g., [-1, 0, 1] and [1, 0, -1] areunique), and what is the maximum mean you can get from a bootstrap sample?

A:There are 27 unique samples, and the maximum mean is 1.

为了更方便大家理解datacamp的老师们以谢菲尔德气象站从1883年到2015年的气象数据（伦敦的降雨情况）（导入numpy的rainfall中）为例作以分析：

for i in range(50):#做50次实验
# Generate bootstrap sample: bs_sample
bs_sample = np.random.choice(rainfall, size=len(rainfall))#每次有放回地抽取一个和rainfall一样长的样本bs_sample

# Compute and plot ECDF from bootstrap sample
x, y = ecdf(bs_sample)#为bs_sample生成一个CDF（累计密度函数），返回x值是降雨量，y值是小于这个x值的概率
_ = plt.plot(x, y, marker='.', linestyle='none',
c='gray', alpha=0.1)#将这次循环中的bs_sample(i)画在图上

# Compute and plot ECDF from original data
x, y = ecdf(rainfall)#为真实值生成一个CDF
_ = plt.plot(x, y, marker='.')#也画在图上

# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')

# Show the plot
plt.show()


output:

By graphically displaying the bootstrap samples with an ECDF, you can get a feel for how bootstrap sampling allows probabilistic descriptions of data.

def bootstrap_replicate_1d(data,func):
bs_sample=np.random.choice(data,len(data))
return func(bs_sample)
#产生一次bootstrap sample，并对这个sample进行一次func操作
def draw_bs_reps(data,func,size=1):
bs_replicates=np.empty(size)
for i in range(size):
bs_replicates[i]=bootstrap_replicate_1d(data,func)
return bs_replicates
#重复size次bootstrap_replicate_1d

draw_bs_reps函数则是对bootstrap sample进行重复，并将返回值记录在案，绳之以法，并最终返回案底。

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall,np.mean,size=10000)#将rainfall做bootstrap sample，并将结果均值，重复10000次

# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))#求出rainfall真实值的标准差
print(sem)
# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)#输出10000次平均值的标准差
print(bs_std)

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, normed=True)#做10000次实验的概率直方图
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

output：

10.5105491505
10.4659270712

10.5105491505
10.5062818229

If we repeated measurements over and over again,p% of the observed values would lie within the p%confidence interval.

np.percentile(bs_replicates,[2.5,97.5])

emergency！！！(存疑)

import numpy as np
from scipy import stats

X1=np.array([14.65,14.95,8.49,9.51,10.23,2.75])
Xmean=X1.mean()
Xstd=X1.std(ddof=1)
interval=stats.t.interval(0.95,len(X1-1),Xmean,Xstd)

print("置信区间为：",interval)

https://blog.csdn.net/brucewong0516/article/details/80205422

output:

array([ 779.76992481,  820.95043233])

# Generate 10,000 bootstrap replicates of the variance: bs_replicates
bs_replicates = draw_bs_reps(rainfall,np.var,10000)#对10000次实验求方差

# Put the variance in units of square centimeters
bs_replicates=bs_replicates/100

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel('variance of annual rainfall (sq. cm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()


output:

# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times,np.mean,10000)

# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates,[2.5,97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')

# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

output:

95% confidence interval = [ 660.67280876  871.63077689] games

def draw_bs_pairs_linreg(x, y, size=1):
"""Perform pairs bootstrap for linear regression."""

# Set up array of indices to sample from: inds
inds = np.arange(len(x))#选用pair中的一个变量长度作为指数

# Initialize replicates: bs_slope_reps, bs_intercept_reps
bs_slope_reps = np.empty(size)#初始化两个空list用于存储每次实验得到的参数
bs_intercept_reps = np.empty(size)

# Generate replicates
for i in range(size):#重复size次实验
bs_inds = np.random.choice(inds, size=len(inds))#抽取和变量长度相当个数的指数
bs_x, bs_y = x[bs_inds], y[bs_inds]#依据指数选取成对的变量
bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x,bs_y,1)#用选取到的变量进行线性拟合

return bs_slope_reps, bs_intercept_reps#返回斜率和截距


# Generate replicates of slope and intercept using pairs bootstrap
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(illiteracy,fertility,1000)

# Compute and print 95% CI for slope
print(np.percentile(bs_slope_reps, [2.5,97.5]))

# Plot the histogram
_ = plt.hist(bs_slope_reps, bins=50, normed=True)
_ = plt.xlabel('slope')
_ = plt.ylabel('PDF')
plt.show()

output：

[ 0.04378061  0.0551616 ]

# Generate array of x-values for bootstrap lines: x
x = np.array([0,100])

# Plot the bootstrap lines
for i in range(100):
_ = plt.plot(x, bs_slope_reps[i]*x + bs_intercept_reps[i],
linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy,fertility,marker='.',linestyle='none')

# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()

output：

thats all thank you~

• 8
点赞
• 43
收藏
觉得还不错? 一键收藏
• 2
评论

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。