文章目录
- 1. Parameter estimation by optimization
- 1.1 Optimal parameters
- 1.2 How often do we get on-hitters?
- 1.3 Do the data follow our story?
- 1.4 How is this parameter optimal?
- 1.5 Linear regression by least squares
- 1.6 EDA of literacy/fertility data
- 1.7 Linear regression
- 1.8 How is it optimal?
- 1.9 The importance of EDA: Anscombe's quartet
- 1.10 The importance of EDA
- 1.11 Linear regression on appropriate Anscombe data
- 1.12 Linear regression on all Anscombe data
- 2. Bootstrap confidence intervals
- 2.1 Generating bootstrap replicates
- 2.2 Getting the terminology down
- 2.3 Bootstrapping by hand
- 2.4 Visualizing bootstrap samples
- 2.5 Bootstrap confidence intervals
- 2.6 Generating many bootstrap replicates
- 2.7 Bootstrap replicates of the mean and the SEM
- 2.8 Confidence intervals of rainfall data
- 2.9 Bootstrap replicates of other statistics
- 2.10 Confidence interval on the rate of no-hitters
- 2.11 Pairs bootstrap
- 2.12 A function to do pairs bootstrap
- 2.13 Pairs bootstrap of literacy/fertility data
- 2.14 Plotting bootstrap regressions
- 3. Introduction to hypothesis testing
- 3.1 Formulating and simulating a hypothesis
- 3.2 Generating a permutation sample
- 3.3 Visualizing permutation sampling
- 3.4 Test statistics and p-values
- 3.5 Test statistics
- 3.6 What is a p-value?
- 3.7 Generating permutation replicates
- 3.8 Look before you leap: EDA before hypothesis testing
- 3.9 Permutation test on frog data
- 3.10 Bootstrap hypothesis tests
- 3.11 A one-sample bootstrap hypothesis test
- 3.12 A two-sample bootstrap hypothesis test for difference of means
- 4. Hypothesis test examples
- 4.1 A/B testing
- 4.2 The vote for the Civil Right Act in 1964
- 4.3 What is equivalent?
- 4.4 A time-on-website analog
- 4.5 What should you have done first?
- 4.6 Test of correlation
- 4.7 Simulating a null hypothesis concerning correlation
- 4.8 Hypothesis test on Pearson correlation
- 4.9 Do neonicotinoid insecticides have unintended consequences?
- 4.10 Bootstrap hypothesis test on bee sperm counts
- 5. Putting it all together: a case study
- 5.1 Finch beaks and the need for statistics
- 5.2 EDA of beak depths of Darwin's finches
- 5.3 ECDFs of beak depths
- 5.4 Parameter estimates of beak depths
- 5.5 Hypothesis test: Are beaks deeper in 2012?
- 5.6 Variation in beak shapes
- 5.7 EDA of beak length and depth
- 5.8 Linear regressions
- 5.9 Displaying the linear regression results
- 5.10 Beak length to depth ratio
- 5.11 How different is the ratio?
- 5.12 Calculation of heritability
- 5.13 EDA of heritability
- 5.14 Correlation of offspring and parental data
- 5.15 Pearson correlation of offspring and parental data
- 5.16 Measuring heritability
- 5.17 Is beak depth heritable at all in G.scandens?
- 5.18 Final thoughts
1. Parameter estimation by optimization
1.1 Optimal parameters
1.2 How often do we get on-hitters?
The number of games played between each no-hitter in the modern era (1901-2015) of Major League Baseball is stored in the array nohitter_times
.
If you assume that no-hitters are described as a Poisson process, then the time between no-hitters is Exponentially distributed. As you have seen, the Exponential distribution has a single parameter, which we will call τ τ τ, the typical interval time. The value of the parameter τ τ τ that makes the exponential distribution best match the data is the mean interval time (where time is in units of number of games) between no-hitters.
Compute the value of this parameter from the data. Then, use np.random.exponential()
to “repeat” the history of Major League Baseball by drawing inter-no-hitter times from an exponential distribution with the
τ
τ
τ you found and plot the histogram as an approximation to the PDF.
Instruction
- Seed the random number generator with
42
. - Compute the mean time (in units of number of games) between no-hitters.
- Draw 100,000 samples from an Exponential distribution with the parameter you computed from the mean of the inter-no-hitter times.
- Plot the theoretical PDF using
plt.hist()
. Remember to use keyword argumentsbins=50
,normed=True
, andhisttype='step'
. Be sure to label your axes. - Show your plot.
在这里插入代码片
1.3 Do the data follow our story?
You have modeled no-hitters using an Exponential distribution. Create an ECDF of the real data. Overlay the theoretical CDF with the ECDF from the data. This helps you to verify that the Exponential distribution describes the observed data.
Instruction
- Compute an ECDF from the actual time between no-hitters (
nohitter_times
). Use theecdf()
function you wrote in the prequel course. - Create a CDF from the theoretical samples you took in the last exercise (
inter_nohitter_time
). - Plot
x_theor
andy_theor
as a line usingplt.plot()
. Then overlay the ECDF of the real data x and y as points. To do this, you have to specify the keyword argumentsmarker = '.'
andlinestyle = 'none'
in addition tox
andy
insideplt.plot()
. - Set a 2% margin on the plot.
- Show the plot.
在这里插入代码片
1.4 How is this parameter optimal?
Now sample out of an exponential distribution with τ τ τ being twice as large as the optimal τ τ τ. Do it again for τ τ τ half as large. Make CDFs of these samples and overlay them with your data. You can see that they do not reproduce the data as well. Thus, the τ τ τ you computed from the mean inter-no-hitter times is optimal in that it best reproduces the data.
Instruction
- Take
10000
samples out of an Exponential distribution with parameter τ 1 2 τ_\frac{1}{2} τ21 =tau/2
. - Take
10000
samples out of an Exponential distribution with parameter τ 2 τ_2 τ2 =2*tau
. - Generate CDFs from these two sets of samples using your
ecdf()
function. - Add these two CDFs as lines to your plot. This has been done for you, so hit ‘Submit Answer’ to view the plot!
在这里插入代码片
1.5 Linear regression by least squares
1.6 EDA of literacy/fertility data
In the next few exercises, we will look at the correlation between female literacy and fertility (defined as the average number of children born per woman) throughout the world. For ease of analysis and interpretation, we will work with the illiteracy rate.
It is always a good idea to do some EDA ahead of our analysis. To this end, plot the fertility versus illiteracy and compute the Pearson correlation coefficient. The Numpy array illiteracy
has the illiteracy rate among females for most of the world’s nations. The array fertility
has the corresponding fertility data.
Instruction
- Plot
fertility
(y-axis) versusilliteracy
(x-axis) as a scatter plot. - Set a 2% margin.
- Compute and print the Pearson correlation coefficient between
illiteracy
andfertility
.
在这里插入代码片
1.7 Linear regression
We will assume that fertility is a linear function of the female illiteracy rate. That is,
f
=
a
i
+
b
f=ai+b
f=ai+b, where a is the slope and b is the intercept. We can think of the intercept as the minimal fertility rate, probably somewhere between one and two. The slope tells us how the fertility rate varies with illiteracy. We can find the best fit line using np.polyfit()
.
Plot the data and the best fit line. Print out the slope and intercept. (Think: what are their units?)
Instruction
- Compute the slope and intercept of the regression line using
np.polyfit()
. Remember,fertility
is on the y-axis andilliteracy
on the x-axis. - Print out the slope and intercept from the linear regression.
- To plot the best fit line, create an array x that consists of 0 and 100 using
np.array()
. Then, compute the theoretical values ofy
based on your regression parameters. I.e.,y = a * x + b
. - Plot the data and the regression line on the same plot. Be sure to label your axes.
- Hit ‘Submit Answer’ to display your plot.
在这里插入代码片
1.8 How is it optimal?
The function np.polyfit()
that you used to get your regression parameters finds the optimal slope and intercept. It is optimizing the sum of the squares of the residuals, also known as RSS (for residual sum of squares). In this exercise, you will plot the function that is being optimized, the RSS, versus the slope parameter a
. To do this, fix the intercept to be what you found in the optimization. Then, plot the RSS vs. the slope. Where is it minimal?
Instruction
- Specify the values of the slope to compute the RSS. Use
np.linspace()
to get200
points in the range between0
and0.1
. For example, to get100
points in the range between0
and0.5
, you could usenp.linspace()
like so:np.linspace(0, 0.5, 100)
. - Initialize an array, rss, to contain the
RSS
usingnp.empty_like()
and the array you created above. Theempty_like()
function returns a new array with the same shape and type as a given array (in this case,a_vals
). - Write a
for
loop to compute the sum of RSS of the slope. Hint: the RSS is given bynp.sum((y_data - a * x_data - b)**2)
. The variableb
you computed in the last exercise is already in your namespace. Here,fertility
is they_data
andilliteracy
thex_data
. - Plot the RSS (
rss
) versus slope (a_vals
). - Hit ‘Submit Answer’ to see the plot!
在这里插入代码片
1.9 The importance of EDA: Anscombe’s quartet
1.10 The importance of EDA
Why should exploratory data analysis be the first step in an analysis of data (after getting your data imported and cleaned, of course)?
□ \square □ You can be protected from misinterpretation of the type demonstrated by Anscombe’s quartet.
■ \blacksquare ■ EDA provides a good starting point for planning the rest of your analysis.
□ \square □ EDA is not really any more difficult than any of the subsequent analysis, so there is no excuse for not exploring the data.
□ \square □ All of these reasons!
1.11 Linear regression on appropriate Anscombe data
For practice, perform a linear regression on the data set from Anscombe’s quartet that is most reasonably interpreted with linear regression.
Instruction
- Compute the parameters for the slope and intercept using
np.polyfit()
. The Anscombe data are stored in the arraysx
andy
. - Print the slope
a
and interceptb
. - Generate theoretical
x
x
x and
y
y
y data from the linear regression. Your
x
x
x array, which you can create with
np.array()
, should consist of3
and15
. To generate the y y y data, multiply the slope byx_theor
and add the intercept. - Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the
marker='.'
andlinestyle='none'
keyword arguments in addition tox
andy
when to plot the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line. - Hit ‘Submit Answer’ to see the plot!
在这里插入代码片
1.12 Linear regression on all Anscombe data
Now, to verify that all four of the Anscombe data sets have the same slope and intercept from a linear regression, you will compute the slope and intercept for each set. The data are stored in lists; anscombe_x = [x1, x2, x3, x4]
and anscombe_y = [y1, y2, y3, y4]
, where, for example, x2
and y2
are the
x
x
x and
y
y
y values for the second Anscombe data set.
Instruction
- Write a
for
loop to do the following for each Anscombe data set.- Compute the slope and intercept.
- Print the slope and intercept.
在这里插入代码片
2. Bootstrap confidence intervals
2.1 Generating bootstrap replicates
2.2 Getting the terminology down
Getting tripped up over terminology is a common cause of frustration in students. Unfortunately, you often will read and hear other data scientists using different terminology for bootstrap samples and replicates. This is even more reason why we need everything to be clear and consistent for this course. So, before going forward discussing bootstrapping, let’s get our terminology down. If we have a data set with n n n repeated measurements, a bootstrap sample is an array of length n n n that was drawn from the original data with replacement. What is a bootstrap replicate?
□ \square □ Just another name for a bootstrap sample.
■ \blacksquare ■ A single value of a statistic computed from a bootstrap sample.
□ \square □ An actual repeat of the measurements.
2.3 Bootstrapping by hand
To help you gain intuition about how bootstrapping works, imagine you have a data set that has only three points, [-1, 0, 1]
. How many unique bootstrap samples can be drawn (e.g., [-1, 0, 1]
and [1, 0, -1]
are unique), and what is the maximum mean you can get from a bootstrap sample? It might be useful to jot down the samples on a piece of paper.
(These are too few data to get meaningful results from bootstrap procedures, but this example is useful for intuition.)
□ \square □ There are 3 unique samples, and the maximum mean is 0.
□ \square □ There are 10 unique samples, and the maximum mean is 0.
□ \square □ There are 10 unique samples, and the maximum mean is 1.
□ \square □ There are 27 unique samples, and the maximum mean is 0.
■ \blacksquare ■ There are 27 unique samples, and the maximum mean is 1.
2.4 Visualizing bootstrap samples
In this exercise, you will generate bootstrap samples from the set of annual rainfall data measured at the Sheffield Weather Station in the UK from 1883 to 2015. The data are stored in the NumPy array rainfall
in units of millimeters (mm). By graphically displaying the bootstrap samples with an ECDF, you can get a feel for how bootstrap sampling allows probabilistic descriptions of data.
Instruction
- Write a
for
loop to acquire50
bootstrap samples of the rainfall data and plot their ECDF.- Use
np.random.choice()
to generate a bootstrap sample from the NumPy arrayrainfall
. Be sure that thesize
of the resampled array islen(rainfall)
. - Use the function
ecdf()
that you wrote in the prequel to this course to generate thex
andy
values for the ECDF of the bootstrap samplebs_sample
. - Plot the ECDF values. Specify
color='gray'
(to make gray dots) andalpha=0.1
(to make them semi-transparent, since we are overlaying so many) in addition to the marker=’.’ and linestyle=‘none’ keyword arguments.
- Use
- Use
ecdf()
to generatex
andy
values for the ECDF of the original rainfall data available in the arrayrainfall
. - Plot the ECDF values of the original data.
- Hit ‘Submit Answer’ to visualize the samples!
在这里插入代码片
2.5 Bootstrap confidence intervals
2.6 Generating many bootstrap replicates
The function bootstrap_replicate_1d()
from the video is available in your namespace. Now you’ll write another function, draw_bs_reps(data, func, size=1)
, which generates many bootstrap replicates from the data set. This function will come in handy for you again and again as you compute confidence intervals and later when you do hypothesis tests.
For your reference, the bootstrap_replicate_1d()
function is provided below:
def bootstrap_replicate_1d(data, func):
"""Generate bootstrap replicate of 1D data."""
bs_sample = np.random.choice(data, len(data))
return func(bs_sample)
Instruction
- Define a function with call signature draw_bs_reps(data, func, size=1).
- Using
np.empty()
, initialize an array calledbs_replicates
of sizesize
to hold all of the bootstrap replicates. - Write a
for
loop that ranges oversize
and computes a replicate usingbootstrap_replicate_1d()
. Refer to the exercise description above to see the function signature ofbootstrap_replicate_1d()
. Store the replicate in the appropriate index ofbs_replicates
. - Return the array of replicates
bs_replicates
. This has already been done for you.
- Using
在这里插入代码片
2.7 Bootstrap replicates of the mean and the SEM
In this exercise, you will compute a bootstrap estimate of the probability density function of the mean annual rainfall at the Sheffield Weather Station. Remember, we are estimating the mean annual rainfall we would get if the Sheffield Weather Station could repeat all of the measurements from 1883 to 2015 over and over again. This is a probabilistic estimate of the mean. You will plot the PDF as a histogram, and you will see that it is Normal.
In fact, it can be shown theoretically that under not-too-restrictive conditions, the value of the mean will always be Normally distributed. (This does not hold in general, just for the mean and a few other statistics.) The standard deviation of this distribution, called the standard error of the mean, or SEM, is given by the standard deviation of the data divided by the square root of the number of data points. I.e., for a data set, sem = np.std(data) / np.sqrt(len(data))
. Using hacker statistics, you get this same result without the need to derive it, but you will verify this result from your bootstrap replicates.
Instruction
- Draw
10000
bootstrap replicates of the mean annual rainfall using yourdraw_bs_reps()
function and therainfall
array. Hint: Pass innp.mean
forfunc
to compute the mean.- As a reminder,
draw_bs_reps()
accepts 3 arguments:data
,func
, andsize
.
- As a reminder,
- Compute and print the standard error of the mean of
rainfall
.- The formula to compute this is
np.std(data) / np.sqrt(len(data))
.
- The formula to compute this is
- Compute and print the standard deviation of your bootstrap replicates
bs_replicates
. - Make a histogram of the replicates using the
normed=True
keyword argument and50
bins. - Hit ‘Submit Answer’ to see the plot!
在这里插入代码片
2.8 Confidence intervals of rainfall data
A confidence interval gives upper and lower bounds on the range of parameter values you might expect to get if we repeat our measurements. For named distributions, you can compute them analytically or look them up, but one of the many beautiful properties of the bootstrap method is that you can take percentiles of your bootstrap replicates to get your confidence interval. Conveniently, you can use the np.percentile()
function.
Use the bootstrap replicates you just generated to compute the 95% confidence interval. That is, give the 2.5th and 97.5th percentile of your bootstrap replicates stored as bs_replicates
. What is the 95% confidence interval?
□ \square □ (765, 776) mm/year
■ \blacksquare ■ ((780, 821) mm/year
□ \square □ ((761, 817) mm/year
□ \square □ ((761, 841) mm/year
2.9 Bootstrap replicates of other statistics
We saw in a previous exercise that the mean is Normally distributed. This does not necessarily hold for other statistics, but no worry: as hackers, we can always take bootstrap replicates! In this exercise, you’ll generate bootstrap replicates for the variance of the annual rainfall at the Sheffield Weather Station and plot the histogram of the replicates.
Here, you will make use of the draw_bs_reps()
function you defined a few exercises ago. It is provided below for your reference:
def draw_bs_reps(data, func, size=1):
"""Draw bootstrap replicates."""
# Initialize array of replicates
bs_replicates = np.empty(size)
# Generate replicates
for i in range(size):
bs_replicates[i] = bootstrap_replicate_1d(data, func)
return bs_replicates
Instruction
- Draw
10000
bootstrap replicates of the variance in annual rainfall, stored in therainfall
dataset, using yourdraw_bs_reps()
function. Hint: Pass innp.var
for computing the variance. - Divide your variance replicates (
bs_replicates
) by100
to put the variance in units of square centimeters for convenience. - Make a histogram of
bs_replicates
using thenormed=True
keyword argument and50
bins.
在这里插入代码片
2.10 Confidence interval on the rate of no-hitters
Consider again the inter-no-hitter intervals for the modern era of baseball. Generate 10,000 bootstrap replicates of the optimal parameter τ τ τ. Plot a histogram of your replicates and report a 95% confidence interval.
Instruction
- Generate
10000
bootstrap replicates of τ from thenohitter_times
data using yourdraw_bs_reps()
function. Recall that the optimal τ τ τ is calculated as the mean of the data. - Compute the 95% confidence interval using
np.percentile()
and passing in two arguments: The arraybs_replicates
, and the list of percentiles - in this case2.5
and97.5
. - Print the confidence interval.
- Plot a histogram of your bootstrap replicates. This has been done for you, so hit ‘Submit Answer’ to see the plot!
在这里插入代码片
2.11 Pairs bootstrap
2.12 A function to do pairs bootstrap
As discussed in the video, pairs bootstrap involves resampling pairs of data. Each collection of pairs fit with a line, in this case using np.polyfit(). We do this again and again, getting bootstrap replicates of the parameter values. To have a useful tool for doing pairs bootstrap, you will write a function to perform pairs bootstrap on a set of x,y data.
Instruction
- Define a function with call signature
draw_bs_pairs_linreg(x, y, size=1)
to perform pairs bootstrap estimates on linear regression parameters.- Use
np.arange()
to set up an array of indices going from0
tolen(x)
. These are what you will resample and use them to pick values out of thex
andy
arrays. - Use
np.empty()
to initialize the slope and intercept replicate arrays to be of sizesize
. - Write a
for
loop to:- Resample the indices
inds
. Usenp.random.choice()
to do this. - Make new
x
x
x and
y
y
y arrays
bs_x
andbs_y
using the the resampled indicesbs_inds
. To do this, slicex
andy
withbs_inds
. - Use
np.polyfit()
on the new x x x and y y y arrays and store the computed slope and intercept.
- Resample the indices
- Return the pair bootstrap replicates of the slope and intercept.
- Use
在这里插入代码片
2.13 Pairs bootstrap of literacy/fertility data
Using the function you just wrote, perform pairs bootstrap to plot a histogram describing the estimate of the slope from the illiteracy/fertility data. Also report the 95% confidence interval of the slope. The data is available to you in the NumPy arrays illiteracy
and fertility
.
As a reminder, draw_bs_pairs_linreg()
has a function signature of draw_bs_pairs_linreg(x, y, size=1)
, and it returns two values: bs_slope_reps
and bs_intercept_reps
.
Instruction
- Use your
draw_bs_pairs_linreg()
function to take1000
bootstrap replicates of the slope and intercept. The x-axis data isilliteracy
and y-axis data isfertility
. - Compute and print the 95% bootstrap confidence interval for the slope.
- Plot and show a histogram of the slope replicates. Be sure to label your axes. This has been done for you, so click ‘Submit Answer’ to see your histogram!
在这里插入代码片
2.14 Plotting bootstrap regressions
A nice way to visualize the variability we might expect in a linear regression is to plot the line you would get from each bootstrap replicate of the slope and intercept. Do this for the first 100 of your bootstrap replicates of the slope and intercept (stored as bs_slope_reps
and bs_intercept_reps
).
Instruction
- Generate an array of
x
x
x-values consisting of
0
and100
for the plot of the regression lines. Use thenp.array()
function for this. - Write a
for
loop in which you plot a regression line with a slope and intercept given by the pairs bootstrap replicates. Do this for100
lines.- When plotting the regression lines in each iteration of the
for
loop, recall the regression equationy = a*x + b
. Here,a
isbs_slope_reps[i]
andb
isbs_intercept_reps[i]
. - Specify the keyword arguments
linewidth=0.5
,alpha=0.2
, andcolor='red'
in your call toplt.plot()
.
- When plotting the regression lines in each iteration of the
- Make a scatter plot with
illiteracy
on the x-axis andfertility
on the y-axis. Remember to specify themarker='.'
andlinestyle='none'
keyword arguments. - Label the axes, set a 2% margin, and show the plot. This has been done for you, so hit ‘Submit Answer’ to visualize the bootstrap regressions!
在这里插入代码片
3. Introduction to hypothesis testing
3.1 Formulating and simulating a hypothesis
3.2 Generating a permutation sample
In the video, you learned that permutation sampling is a great way to simulate the hypothesis that two variables have identical probability distributions. This is often a hypothesis you want to test, so in this exercise, you will write a function to generate a permutation sample from two data sets.
Remember, a permutation sample of two arrays having respectively n1
and n2
entries is constructed by concatenating the arrays together, scrambling the contents of the concatenated array, and then taking the first n1
entries as the permutation sample of the first array and the last n2
entries as the permutation sample of the second array.
Instruction
- Concatenate the two input arrays into one using
np.concatenate()
. Be sure to pass indata1
anddata2
as one argument(data1, data2)
. - Use
np.random.permutation()
to permute the concatenated array. - Store the first
len(data1)
entries ofpermuted_data
asperm_sample_1
and the lastlen(data2)
entries ofpermuted_data
asperm_sample_2
. In practice, this can be achieved by using:len(data1)
andlen(data1):
to slicepermuted_data
. - Return
perm_sample_1
andperm_sample_2
.
在这里插入代码片
3.3 Visualizing permutation sampling
To help see how permutation sampling works, in this exercise you will generate permutation samples and look at them graphically.
We will use the Sheffield Weather Station data again, this time considering the monthly rainfall in June (a dry month) and November (a wet month). We expect these might be differently distributed, so we will take permutation samples to see how their ECDFs would look if they were identically distributed.
As a reminder, permutation_sample()
has a function signature of permutation_sample(data_1, data_2)
with a return value of permuted_data[:len(data_1)], permuted_data[len(data_1):]
, where permuted_data = np.random.permutation(np.concatenate((data_1, data_2)))
.
Instruction
- Write a
for
loop to generate 50 permutation samples, compute their ECDFs, and plot them.- Generate a permutation sample pair from
rain_june
andrain_november
using yourpermutation_sample()
function. - Generate the
x
andy
values for an ECDF for each of the two permutation samples for the ECDF using yourecdf()
function. - Plot the ECDF of the first permutation sample (
x_1
andy_1
) as dots. Do the same for the second permutation sample (x_2
andy_2
).
- Generate a permutation sample pair from
- Generate
x
andy
values for ECDFs for therain_june
andrain_november
data and plot the ECDFs using respectively the keyword argumentscolor='red'
andcolor='blue'
. - Label your axes, set a 2% margin, and show your plot. This has been done for you, so just hit ‘Submit Answer’ to view the plot!
在这里插入代码片
3.4 Test statistics and p-values
3.5 Test statistics
When performing hypothesis tests, your choice of test statistic should be:
□ \square □ something well-known, like the mean or median.
□ \square □ be a parameter that can be estimated.
■ \blacksquare ■ be pertinent to the question you are seeking to answer in your hypothesis test.
3.6 What is a p-value?
The p-value is generally a measure of:
□ \square □ the probability that the hypothesis you are testing is true.
□ \square □ the probability of observing your data if the hypothesis you are testing is true.
■ \blacksquare ■ the probability of observing a test statistic equally or more extreme than the one you observed, given that the null hypothesis is true.
3.7 Generating permutation replicates
As discussed in the video, a permutation replicate is a single value of a statistic computed from a permutation sample. As the draw_bs_reps()
function you wrote in chapter 2 is useful for you to generate bootstrap replicates, it is useful to have a similar function, draw_perm_reps()
, to generate permutation replicates. You will write this useful function in this exercise.
The function has call signature draw_perm_reps(data_1, data_2, func, size=1)
. Importantly, func
must be a function that takes two arrays as arguments. In most circumstances, func
will be a function you write yourself.
Instruction
- Define a function with this signature:
draw_perm_reps(data_1, data_2, func, size=1)
.- Initialize an array to hold the permutation replicates using
np.empty()
. - Write a
for
loop to:- Compute a permutation sample using your
permutation_sample()
function - Pass the samples into
func()
to compute the replicate and store the result in your array of replicates.
- Compute a permutation sample using your
- Return the array of replicates.
- Initialize an array to hold the permutation replicates using
在这里插入代码片
3.8 Look before you leap: EDA before hypothesis testing
Kleinteich and Gorb (Sci. Rep., 4, 5225, 2014) performed an interesting experiment with South American horned frogs. They held a plate connected to a force transducer, along with a bait fly, in front of them. They then measured the impact force and adhesive force of the frog’s tongue when it struck the target.
Frog A is an adult and Frog B is a juvenile. The researchers measured the impact force of 20 strikes for each frog. In the next exercise, we will test the hypothesis that the two frogs have the same distribution of impact forces. But, remember, it is important to do EDA first! Let’s make a bee swarm plot for the data. They are stored in a Pandas data frame, df
, where column ID
is the identity of the frog and column impact_force
is the impact force in Newtons (N).
Instruction
- Use
sns.swarmplot()
to make a bee swarm plot of the data by specifying thex
,y
, anddata
keyword arguments. - Label your axes.
- Show the plot.
在这里插入代码片
3.9 Permutation test on frog data
The average strike force of Frog A was 0.71 Newtons (N), and that of Frog B was 0.42 N for a difference of 0.29 N. It is possible the frogs strike with the same force and this observed difference was by chance. You will compute the probability of getting at least a 0.29 N difference in mean strike force under the hypothesis that the distributions of strike forces for the two frogs are identical. We use a permutation test with a test statistic of the difference of means to test this hypothesis.
For your convenience, the data has been stored in the arrays force_a
and force_b
.
Instruction
- Define a function with call signature
diff_of_means(data_1, data_2)
that returns the differences in means between two data sets, mean ofdata_1
minus mean ofdata_2
. - Use this function to compute the empirical difference of means that was observed in the frogs.
- Draw 10,000 permutation replicates of the difference of means.
- Compute the p-value.
- Print the p-value.
在这里插入代码片
3.10 Bootstrap hypothesis tests
3.11 A one-sample bootstrap hypothesis test
Another juvenile frog was studied, Frog C, and you want to see if Frog B and Frog C have similar impact forces. Unfortunately, you do not have Frog C’s impact forces available, but you know they have a mean of 0.55 N. Because you don’t have the original data, you cannot do a permutation test, and you cannot assess the hypothesis that the forces from Frog B and Frog C come from the same distribution. You will therefore test another, less restrictive hypothesis: The mean strike force of Frog B is equal to that of Frog C.
To set up the bootstrap hypothesis test, you will take the mean as our test statistic. Remember, your goal is to calculate the probability of getting a mean impact force less than or equal to what was observed for Frog B if the hypothesis that the true mean of Frog B’s impact forces is equal to that of Frog C is true. You first translate all of the data of Frog B such that the mean is 0.55 N. This involves adding the mean force of Frog C and subtracting the mean force of Frog B from each measurement of Frog B. This leaves other properties of Frog B’s distribution, such as the variance, unchanged.
Instruction
- Translate the impact forces of Frog B such that its mean is 0.55 N.
- Use your
draw_bs_reps()
function to take 10,000 bootstrap replicates of the mean of your translated forces. - Compute the p-value by finding the fraction of your bootstrap replicates that are less than the observed mean impact force of Frog B. Note that the variable of interest here is
force_b
. - Print your p-value.
在这里插入代码片
3.12 A two-sample bootstrap hypothesis test for difference of means
We now want to test the hypothesis that Frog A and Frog B have the same mean impact force, but not necessarily the same distribution, which is also impossible with a permutation test.
To do the two-sample bootstrap test, we shift both arrays to have the same mean, since we are simulating the hypothesis that their means are, in fact, equal. We then draw bootstrap samples out of the shifted arrays and compute the difference in means. This constitutes a bootstrap replicate, and we generate many of them. The p-value is the fraction of replicates with a difference in means greater than or equal to what was observed.
The objects forces_concat
and empirical_diff_means
are already in your namespace.
Instruction
- Compute the mean of all forces (from
forces_concat
) usingnp.mean()
. - Generate shifted data sets for both
force_a
andforce_b
such that the mean of each is the mean of the concatenated array of impact forces. - Generate 10,000 bootstrap replicates of the mean each for the two shifted arrays.
- Compute the bootstrap replicates of the difference of means by subtracting the replicates of the shifted impact force of Frog B from those of Frog A.
- Compute and print the p-value from your bootstrap replicates.
在这里插入代码片
4. Hypothesis test examples
4.1 A/B testing
4.2 The vote for the Civil Right Act in 1964
The Civil Rights Act of 1964 was one of the most important pieces of legislation ever passed in the USA. Excluding “present” and “abstain” votes, 153 House Democrats and 136 Republicans voted yea. However, 91 Democrats and 35 Republicans voted nay. Did party affiliation make a difference in the vote?
To answer this question, you will evaluate the hypothesis that the party of a House member has no bearing on his or her vote. You will use the fraction of Democrats voting in favor as your test statistic and evaluate the probability of observing a fraction of Democrats voting in favor at least as small as the observed fraction of 153/244. (That’s right, at least as small as. In 1964, it was the Democrats who were less progressive on civil rights issues.) To do this, permute the party labels of the House voters and then arbitrarily divide them into “Democrats” and “Republicans” and compute the fraction of Democrats voting yea.
Instruction
- Construct Boolean arrays,
dems
andreps
that contain the votes of the respective parties; e.g.,dems
has 153True
entries and 91False
entries. - Write a function,
frac_yea_dems(dems, reps)
that returns the fraction of Democrats that voted yea. The first input is an array of Booleans, Two inputs are required to use yourdraw_perm_reps()
function, but the second is not used. - Use your
draw_perm_reps()
function to draw 10,000 permutation replicates of the fraction of Democrat yea votes. - Compute and print the p-value.
在这里插入代码片
4.3 What is equivalent?
You have experience matching stories to probability distributions. Similarly, you use the same procedure for two different A/B tests if their stories match. In the Civil Rights Act example you just did, you performed an A/B test on voting data, which has a Yes/No type of outcome for each subject (in that case, a voter). Which of the following situations involving testing by a web-based company has an equivalent set up for an A/B test as the one you just did with the Civil Rights Act of 1964?
□ \square □ You measure how much time each customer spends on your website before and after an advertising campaign.
■ \blacksquare ■ You measure the number of people who click on an ad on your company’s website before and after changing its color.
□ \square □ You measure how many clicks each person has on your company’s website before and after changing the header layout.
4.4 A time-on-website analog
It turns out that you already did a hypothesis test analogous to an A/B test where you are interested in how much time is spent on the website before and after an ad campaign. The frog tongue force (a continuous quantity like time on the website) is an analog. “Before” = Frog A and “after” = Frog B. Let’s practice this again with something that actually is a before/after scenario.
We return to the no-hitter data set. In 1920, Major League Baseball implemented important rule changes that ended the so-called dead ball era. Importantly, the pitcher was no longer allowed to spit on or scuff the ball, an activity that greatly favors pitchers. In this problem you will perform an A/B test to determine if these rule changes resulted in a slower rate of no-hitters (i.e., longer average time between no-hitters) using the difference in mean inter-no-hitter time as your test statistic. The inter-no-hitter times for the respective eras are stored in the arrays nht_dead
and nht_live
, where “nht” is meant to stand for “no-hitter time.”
Since you will be using your draw_perm_reps()
function in this exercise, it may be useful to remind yourself of its call signature: draw_perm_reps(d1, d2, func, size=1)
.
Instruction
- Compute the observed difference in mean inter-nohitter time using -
diff_of_means()
. - Generate 10,000 permutation replicates of the difference of means using
draw_perm_reps()
. - Compute and print the p-value.
在这里插入代码片
4.5 What should you have done first?
That was a nice hypothesis test you just did to check out whether the rule changes in 1920 changed the rate of no-hitters. But what should you have done with the data first?
■ \blacksquare ■ Performed EDA, perhaps plotting the ECDFs of inter-no-hitter times in the dead ball and live ball eras.
□ \square □ Nothing. The hypothesis test was only a few lines of code.
4.6 Test of correlation
4.7 Simulating a null hypothesis concerning correlation
The observed correlation between female illiteracy and fertility in the data set of 162 countries may just be by chance; the fertility of a given country may actually be totally independent of its illiteracy. You will test this null hypothesis in the next exercise.
To do the test, you need to simulate the data assuming the null hypothesis is true. Of the following choices, which is the best way to do it?
□ \square □ Choose 162 random numbers to represent the illiteracy rate and 162 random numbers to represent the corresponding fertility rate.
□ \square □ Do a pairs bootstrap: Sample pairs of observed data with replacement to generate a new set of (illiteracy, fertility) data.
□ \square □ Do a bootstrap sampling in which you sample 162 illiteracy values with replacement and then 162 fertility values with replacement.
■ \blacksquare ■ Do a permutation test: Permute the illiteracy values but leave the fertility values fixed to generate a new set of (illiteracy, fertility) data.
□ \square □ Do a permutation test: Permute both the illiteracy and fertility values to generate a new set of (illiteracy, fertility data).
4.8 Hypothesis test on Pearson correlation
The observed correlation between female illiteracy and fertility may just be by chance; the fertility of a given country may actually be totally independent of its illiteracy. You will test this hypothesis. To do so, permute the illiteracy values but leave the fertility values fixed. This simulates the hypothesis that they are totally independent of each other. For each permutation, compute the Pearson correlation coefficient and assess how many of your permutation replicates have a Pearson correlation coefficient greater than the observed one.
Instruction
- Compute the observed Pearson correlation between
illiteracy
andfertility
. - Initialize an array to store your permutation replicates.
- Write a
for
loop to draw 10,000 replicates:- Permute the
illiteracy
measurements usingnp.random.permutation()
. - Compute the Pearson correlation between the permuted illiteracy array,
illiteracy_permuted
, andfertility
.
- Permute the
- Compute and print the p-value from the replicates.
在这里插入代码片
4.9 Do neonicotinoid insecticides have unintended consequences?
As a final exercise in hypothesis testing before we put everything together in our case study in the next chapter, you will investigate the effects of neonicotinoid insecticides on bee reproduction. These insecticides are very widely used in the United States to combat aphids and other pests that damage plants.
In a recent study, Straub, et al. (Proc. Roy. Soc. B, 2016) investigated the effects of neonicotinoids on the sperm of pollinating bees. In this and the next exercise, you will study how the pesticide treatment affected the count of live sperm per half milliliter of semen.
First, we will do EDA, as usual. Plot ECDFs of the alive sperm count for untreated bees (stored in the Numpy array control
) and bees treated with pesticide (stored in the Numpy array treated
).
Instruction
- Use your
ecdf()
function to generatex,y
values from thecontrol
andtreated
arrays for plotting the ECDFs. - Plot the ECDFs on the same plot.
- The margins have been set for you, along with the legend and axis labels. Hit ‘Submit Answer’ to see the result!
在这里插入代码片
4.10 Bootstrap hypothesis test on bee sperm counts
Now, you will test the following hypothesis: On average, male bees treated with neonicotinoid insecticide have the same number of active sperm per milliliter of semen than do untreated male bees. You will use the difference of means as your test statistic.
For your reference, the call signature for the draw_bs_reps()
function you wrote in chapter 2 is draw_bs_reps(data, func, size=1)
.
Instruction
- Compute the mean alive sperm count of
control
minus that oftreated
. - Compute the mean of all alive sperm counts. To do this, first concatenate
control
andtreated
and take the mean of the concatenated array. - Generate shifted data sets for both
control
andtreated
such that the shifted data sets have the same mean. This has already been done for you. - Generate 10,000 bootstrap replicates of the mean each for the two shifted arrays. Use your
draw_bs_reps()
function. - Compute the bootstrap replicates of the difference of means.
- The code to compute and print the p-value has been written for you. Hit ‘Submit Answer’ to see the result!
在这里插入代码片
5. Putting it all together: a case study
5.1 Finch beaks and the need for statistics
5.2 EDA of beak depths of Darwin’s finches
For your first foray into the Darwin finch data, you will study how the beak depth (the distance, top to bottom, of a closed beak) of the finch species Geospiza scandens has changed over time. The Grants have noticed some changes of beak geometry depending on the types of seeds available on the island, and they also noticed that there was some interbreeding with another major species on Daphne Major, Geospiza fortis. These effects can lead to changes in the species over time.
In the next few problems, you will look at the beak depth of G. scandens on Daphne Major in 1975 and in 2012. To start with, let’s plot all of the beak depth measurements in 1975 and 2012 in a bee swarm plot.
Instruction
- Create the beeswarm plot.
- Label the axes.
- Show the plot.
在这里插入代码片
5.3 ECDFs of beak depths
While bee swarm plots are useful, we found that ECDFs are often even better when doing EDA. Plot the ECDFs for the 1975 and 2012 beak depth measurements on the same plot.
Instruction
- Compute the ECDF for the 1975 and 2012 data.
- Plot the two ECDFs.
- Set a 2% margin and add axis labels and a legend to the plot.
- Hit ‘Submit Answer’ to view the plot!
在这里插入代码片
5.4 Parameter estimates of beak depths
Estimate the difference of the mean beak depth of the G. scandens samples from 1975 and 2012 and report a 95% confidence interval.
Since in this exercise you will use the draw_bs_reps()
function you wrote in chapter 2.
Instruction
- Compute the difference of the sample means.
- Take 10,000 bootstrap replicates of the mean for the 1975 beak depths using your
draw_bs_reps()
function. Also get 10,000 bootstrap replicates of the mean for the 2012 beak depths. - Subtract the 1975 replicates from the 2012 replicates to get bootstrap replicates of the difference of means.
- Use the replicates to compute the 95% confidence interval.
- Hit ‘Submit Answer’ to view the results!
在这里插入代码片
5.5 Hypothesis test: Are beaks deeper in 2012?
Your plot of the ECDF and determination of the confidence interval make it pretty clear that the beaks of G. scandens on Daphne Major have gotten deeper. But is it possible that this effect is just due to random chance? In other words, what is the probability that we would get the observed difference in mean beak depth if the means were the same?
Be careful! The hypothesis we are testing is not that the beak depths come from the same distribution. For that we could use a permutation test. The hypothesis is that the means are equal. To perform this hypothesis test, we need to shift the two data sets so that they have the same mean and then use bootstrap sampling to compute the difference of means.
Instruction
- Make a concatenated array of the 1975 and 2012 beak depths and compute and store its mean.
- Shift
bd_1975
andbd_2012
such that their means are equal to the one you just computed for the combined data set. - Take 10,000 bootstrap replicates of the mean each for the 1975 and 2012 beak depths.
- Subtract the 1975 replicates from the 2012 replicates to get bootstrap replicates of the difference.
- Compute and print the p-value. The observed difference in means you computed in the last exercise is still in your namespace as
mean_diff
.
在这里插入代码片
5.6 Variation in beak shapes
5.7 EDA of beak length and depth
The beak length data are stored as bl_1975
and bl_2012
, again with units of millimeters (mm). You still have the beak depth data stored in bd_1975
and bd_2012
. Make scatter plots of beak depth (y-axis) versus beak length (x-axis) for the 1975 and 2012 specimens.
Instruction
- Make a scatter plot of the 1975 data. Use the
color='blue'
keyword argument. Also use analpha=0.5
keyword argument to have transparency in case data points overlap. - Do the same for the 2012 data, but use the
color='red'
keyword argument. - Add a legend and label the axes.
- Show your plot.
在这里插入代码片
5.8 Linear regressions
Perform a linear regression for both the 1975 and 2012 data. Then, perform pairs bootstrap estimates for the regression parameters. Report 95% confidence intervals on the slope and intercept of the regression line.
You will use the draw_bs_pairs_linreg()
function you wrote back in chapter 2.
As a reminder, its call signature is draw_bs_pairs_linreg(x, y, size=1)
, and it returns bs_slope_reps
and bs_intercept_reps
. The beak length data are stored as bl_1975
and bl_2012
, and the beak depth data is stored in bd_1975
and bd_2012
.
Instruction
- Compute the slope and intercept for both the 1975 and 2012 data sets.
- Obtain 1000 pairs bootstrap samples for the linear regressions using your
draw_bs_pairs_linreg()
function. - Compute 95% confidence intervals for the slopes and the intercepts.
在这里插入代码片
5.9 Displaying the linear regression results
Now, you will display your linear regression results on the scatter plot, the code for which is already pre-written for you from your previous exercise. To do this, take the first 100 bootstrap samples (stored in bs_slope_reps_1975
, bs_intercept_reps_1975
, bs_slope_reps_2012
, and bs_intercept_reps_2012
) and plot the lines with alpha=0.2
and linewidth=0.5
keyword arguments to plt.plot()
.
Instruction
- Generate the
x
x
x-values for the bootstrap lines using
np.array()
. They should consist of 10 mm and 17 mm. - Write a
for
loop to plot 100 of the bootstrap lines for the 1975 and 2012 data sets. The lines for the 1975 data set should be'blue'
and those for the 2012 data set should be'red'
. - Hit ‘Submit Answer’ to view the plot!
在这里插入代码片
5.10 Beak length to depth ratio
The linear regressions showed interesting information about the beak geometry. The slope was the same in 1975 and 2012, suggesting that for every millimeter gained in beak length, the birds gained about half a millimeter in depth in both years. However, if we are interested in the shape of the beak, we want to compare the ratio of beak length to beak depth. Let’s make that comparison.
Instruction
- Make arrays of the beak length to depth ratio of each bird for 1975 and for 2012.
- Compute the mean of the length to depth ratio for 1975 and for 2012.
- Generate 10,000 bootstrap replicates each for the mean ratio for 1975 and 2012 using your
draw_bs_reps()
function. - Get a 99% bootstrap confidence interval for the length to depth ratio for 1975 and 2012.
- Print the results.
在这里插入代码片
5.11 How different is the ratio?
In the previous exercise, you computed the mean beak length to depth ratio with 99% confidence intervals for 1975 and for 2012. The results of that calculation are shown graphically in the plot accompanying this problem. In addition to these results, what would you say about the ratio of beak length to depth?
□
\square
□ The mean beak length-to-depth ratio decreased by about 0.1, or 7%, from 1975 to 2012. The 99% confidence intervals are not even close to overlapping, so this is a real change. The beak shape changed.
■ \blacksquare ■ It is impossible to say if this is a real effect or just due to noise without computing a p-value. Let me compute the p-value and get back to you.
5.12 Calculation of heritability
5.13 EDA of heritability
The array bd_parent_scandens
contains the average beak depth (in mm) of two parents of the species G. scandens
. The array bd_offspring_scandens
contains the average beak depth of the offspring of the respective parents. The arrays bd_parent_fortis
and bd_offspring_fortis
contain the same information about measurements from G. fortis birds.
Make a scatter plot of the average offspring beak depth (y-axis) versus average parental beak depth (x-axis) for both species. Use the alpha=0.5
keyword argument to help you see overlapping points.
Instruction
- Generate scatter plots for both species. Display the data for G. fortis in blue and G. scandens in red.
- Set the axis labels, make a legend, and show the plot.
在这里插入代码片
5.14 Correlation of offspring and parental data
In an effort to quantify the correlation between offspring and parent beak depths, we would like to compute statistics, such as the Pearson correlation coefficient, between parents and offspring. To get confidence intervals on this, we need to do a pairs bootstrap.
You have already written a function to do pairs bootstrap to get estimates for parameters derived from linear regression. Your task in this exercise is to make a new function with call signature draw_bs_pairs(x, y, func, size=1)
that performs pairs bootstrap and computes a single statistic on pairs samples defined. The statistic of interest is computed by calling func(bs_x, bs_y)
. In the next exercise, you will use pearson_r
for func
.
Instruction
- Set up an array of indices to sample from. (Remember, when doing pairs bootstrap, we randomly choose indices and use those to get the pairs.)
- Initialize the array of bootstrap replicates. This should be a one-dimensional array of length
size
. - Write a
for
loop to draw the samples. - Randomly choose indices from the array of indices you previously set up.
- Extract
x
values andy
values from the input array using the indices you just chose to generate a bootstrap sample. - Use
func
to compute the statistic of interest from the bootstrap samples ofx
andy
and store it in your array of bootstrap replicates. - Return the array of bootstrap replicates.
在这里插入代码片
5.15 Pearson correlation of offspring and parental data
The Pearson correlation coefficient seems like a useful measure of how strongly the beak depth of parents are inherited by their offspring. Compute the Pearson correlation coefficient between parental and offspring beak depths for G. scandens. Do the same for G. fortis. Then, use the function you wrote in the last exercise to compute a 95% confidence interval using pairs bootstrap.
Instruction
- Use the
pearson_r()
function you wrote in the prequel to this course to compute the Pearson correlation coefficient for G. scandens and G. fortis. - Acquire 1000 pairs bootstrap replicates of the Pearson correlation coefficient using the
draw_bs_pairs()
function you wrote in the previous exercise for G. scandens and G. fortis. - Compute the 95% confidence interval for both using your bootstrap replicates.
在这里插入代码片
5.16 Measuring heritability
Remember that the Pearson correlation coefficient is the ratio of the covariance to the geometric mean of the variances of the two data sets. This is a measure of the correlation between parents and offspring, but might not be the best estimate of heritability. If we stop and think, it makes more sense to define heritability as the ratio of the covariance between parent and offspring to the variance of the parents alone. In this exercise, you will estimate the heritability and perform a pairs bootstrap calculation to get the 95% confidence interval.
This exercise highlights a very important point. Statistical inference (and data analysis in general) is not a plug-n-chug enterprise. You need to think carefully about the questions you are seeking to answer with your data and analyze them appropriately. If you are interested in how heritable traits are, the quantity we defined as the heritability is more apt than the off-the-shelf statistic, the Pearson correlation coefficient.
Instruction
- Write a function
heritability(parents, offspring)
that computes heritability defined as the ratio of the covariance of the trait in parents and offspring divided by the variance of the trait in the parents. Hint: Remind yourself of thenp.cov()
function we covered in the prequel to this course. - Use this function to compute the heritability for G. scandens and G. fortis.
- Acquire 1000 bootstrap replicates of the heritability using pairs bootstrap for G. scandens and G. fortis.
- Compute the 95% confidence interval for both using your bootstrap replicates.
- Print the results.
在这里插入代码片
5.17 Is beak depth heritable at all in G.scandens?
The heritability of beak depth in G. scandens seems low. It could be that this observed heritability was just achieved by chance and beak depth is actually not really heritable in the species. You will test that hypothesis here. To do this, you will do a pairs permutation test.
Instruction
- Initialize your array of replicates of heritability. We will take 10,000 pairs permutation replicates.
- Write a
for
loop to generate your replicates.- Permute the
bd_parent_scandens
array usingnp.random.permutation()
. - Compute the heritability between the permuted array and the
bd_offspring_scandens
array using theheritability()
function you wrote in the last exercise. Store the result in the replicates array.
- Permute the
- Compute the p-value as the number of replicates that are greater than the observed
heritability_scandens
you computed in the last exercise.
在这里插入代码片