文章目录
- 1. Graphical exploratory data analysis
- 1.1 Introduction to exploratory aata analysis
- 1.2 Tukey's comments on EDA
- 1.3 Advantages of graphical EDA
- 1.4 Plotting a histogram
- 1.5 Plotting a histogram of iris data
- 1.6 Axis labels!
- 1.7 Adjusting the number of bins in a histogram
- 1.8 Plot all your data: Bee swarm plots
- 1.9 Bee swarm plot
- 1.9 Interpreting a bee swarm plot
- 1.10 Plot all of your data: ECDFs
- 1.11 Computing the ECDF
- 1.12 Plotting the ECDF
- 1.13 Comparison of ECDF
- 1.14 On toward the whole story!
- 2. Quantitative exploratory data analysis
- 2.1 Introduction to summary statistics: The sample mean and median
- 2.2 Means and medians
- 2.3 Computing means
- 2.4 Percentiles, outliers, and box plots
- 2.5 Computing percentiles
- 2.6 Comparing percentiles to ECDF
- 2.7 Box-and whisker plot
- 2.8 Variance and standard deviation
- 2.9 Computing the variance
- 2.10 The standard deviation and the variance
- 2.11 Covariance and the Pearson correlation coefficient
- 2.12 Scatter plots
- 2.13 Variance and covariance by looking
- 2.14 Computing the covariance
- 2.15 Computing the Pearson correlation coefficient
- 3. Thinking probailistically - Discrete variables
- 3.1 Probabilistic logic and statistical inference
- 3.2 What is the goal of statistical Inference?
- 3.3 Why do we use the language of probability?
- 3.4 Random number generators and hacker statistics
- 3.5 Generating random numbers using the np.random module
- 3.6 The np.random module and Bernoulli trials
- 3.7 How many defaults might we expect?
- 3.8 Will the bank fail?
- 3.9 Probability distributions and stories: The Binomial distribution
- 3.10 Sampling out of the Binomial distribution
- 3.11 Plotting the Binomial PMF
- 3.12 Poisson processes and the Poisson distribution
- 3.13 Relationship between Binomial and Poisson distributions
- 3.14 How many no-hitters in a season?
- 3.15 Was 20`5 anomalous?
- 4. Thinking probabilistically - Continuous variables
- 4.1 Probability density functions
- 4.2 Interpreting PDFs
- 4.3 Interpreting CDFs
- 4.4 Introduction to the Normal distribution
- 4.5 The Normal PDF
- 4.6 The Normal CDF
- 4.7 The Normal distribution: Properties and warnings
- 4.8 Gauss and the 10 Deutschmark banknote
- 4.9 Are the Belmont Stakes results Normally distributed?
- 4.10 What are the chances of a horse matching or beating Secretariat's record?
- 4.11 The Exponential distribution
- 4.12 Matching a story and a distribution
- 4.13 Waiting for the next Secretariat
- 4.14 If you have a story, you can simulate it!
- 4.15 Distribution of no-hitters and cycles
- 4.16 Final toughts
1. Graphical exploratory data analysis
1.1 Introduction to exploratory aata analysis
1.2 Tukey’s comments on EDA
Even though you probably have not read Tukey’s book, I suspect you already have a good idea about his viewpoint from the video introducing you to exploratory data analysis. Which of the following quotes is not directly from Tukey?
□ \square □ Exploratory data analysis is detective work.
□ \square □ There is no excuse for failing to plot and look.
□ \square □ The greatest value of a picture is that it forces us to notice what we never expected to see.
□ \square □ It is important to understand what you can do before you learn how to measure how well you seem to have done it.
■ \blacksquare ■ Often times EDA is too time consuming, so it is better to jump right in and do your hypothesis tests.
1.3 Advantages of graphical EDA
Which of the following is not true of graphical EDA?
□
\square
□ It often involves converting tabular data into graphical form.
□ \square □ If done well, graphical representations can allow for more rapid interpretation of data.
■ \blacksquare ■ A nice looking plot is always the end goal of a statistical analysis.
□
\square
□ There is no excuse for neglecting to do graphical EDA.
press
1.4 Plotting a histogram
1.5 Plotting a histogram of iris data
For the exercises in this section, you will use a classic data set collected by botanist Edward Anderson and made famous by Ronald Fisher, one of the most prolific statisticians in history. Anderson carefully measured the anatomical properties of samples of three different species of iris, Iris setosa, Iris versicolor, and Iris virginica. The full data set is available as part of scikit-learn. Here, you will work with his measurements of petal length.
Plot a histogram of the petal lengths of his 50 samples of Iris versicolor using matplotlib/seaborn’s default settings. Recall that to specify the default seaborn style, you can use sns.set()
, where sns
is the alias that seaborn
is imported as.
The subset of the data set containing the Iris versicolor petal lengths in units of centimeters (cm) is stored in the NumPy array versicolor_petal_length
.
In the video, Justin plotted the histograms by using the pandas
library and indexing the DataFrame to extract the desired column. Here, however, you only need to use the provided NumPy array. Also, Justin assigned his plotting statements (except for plt.show()
) to the dummy variable _
. This is to prevent unnecessary output from being displayed. It is not required for your solutions to these exercises, however it is good practice to use it. Alternatively, if you are working in an interactive environment such as a Jupyter notebook, you could use a ;
after your plotting statements to achieve the same effect. Justin prefers using _
. Therefore, you will see it used in the solution code.
Instruction
- Import
matplotlib.pyplot
andseaborn
as their usual aliases (plt
andsns
). - Use
seaborn
to set the plotting defaults. - Plot a histogram of the Iris versicolor petal lengths using
plt.hist()
and the provided NumPy arrayversicolor_petal_length
. - Show the histogram using
plt.show()
.
# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
# Set default Seaborn style
sns.set()
# Plot histogram of versicolor petal lengths
plt.hist(versicolor_petal_length)
# Show histogram
plt.show()
1.6 Axis labels!
In the last exercise, you made a nice histogram of petal lengths of Iris versicolor, but you didn’t label the axes! That’s ok; it’s not your fault since we didn’t ask you to. Now, add axis labels to the plot using plt.xlabel()
and plt.ylabel()
. Don’t forget to add units and assign both statements to _
. The packages matplotlib.pyplot
and seaborn
are already imported with their standard aliases. This will be the case in what follows, unless specified otherwise.
Instruction
- Label the axes. Don’t forget that you should always include units in your axis labels. Your -axis label is just
'count'
. Your -axis label is'petal length (cm)'
. The units are essential! - Display the plot constructed in the above steps using
plt.show()
.
# Plot histogram of versicolor petal lengths
_ = plt.hist(versicolor_petal_length)
# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')
# Show histogram
plt.show()
1.7 Adjusting the number of bins in a histogram
The histogram you just made had ten bins. This is the default of matplotlib. The “square root rule” is a commonly-used rule of thumb for choosing number of bins: choose the number of bins to be the square root of the number of samples. Plot the histogram of Iris versicolor petal lengths again, this time using the square root rule for the number of bins
. You specify the number of bins using the bins keyword argument of plt.hist()
.
Instruction
- Import
numpy
asnp
. This gives access to the square root function,np.sqrt()
. - Determine how many data points you have using
len()
. - Compute the number of bins using the square root rule.
- Convert the number of bins to an integer using the built in
int()
function. - Generate the histogram and make sure to use the
bins
keyword argument. - Hit ‘Submit Answer’ to plot the figure and see the fruit of your labors!
# Import numpy
import numpy as np
# Compute number of data points: n_data
n_data = len(versicolor_petal_length)
# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)
# Convert number of bins to integer: n_bins
n_bins = int(n_bins)
# Plot the histogram
_ = plt.hist(versicolor_petal_length,
bins = n_bins)
# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')
# Show histogram
plt.show()
1.8 Plot all your data: Bee swarm plots
1.9 Bee swarm plot
Make a bee swarm plot of the iris petal lengths. Your x-axis should contain each of the three species, and the y-axis the petal lengths. A data frame containing the data is in your namespace as df
.
For your reference, the code Justin used to create the bee swarm plot in the video is provided below:
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
Instruction
- In the IPython Shell, inspect the DataFrame
df
usingdf.head()
. This will let you identify which column names you need to pass as thex
andy
keyword arguments in your call tosns.swarmplot()
. - Use
sns.swarmplot()
to make a bee swarm plot from the DataFrame containing the Fisher iris data set,df
. The x-axis should contain each of the three species, and the y-axis should contain the petal lengths. - Label the axes.
- Show your plot.
# Create bee swarm plot with Seaborn's default settings
_ = sns.swarmplot(x = 'species',
y = 'petal length (cm)',
data = df)
# Label the axes
_ = plt.xlabel('species')
_ = plt.ylabel('petal length (cm)')
# Show the plot
plt.show()
1.9 Interpreting a bee swarm plot
Which of the following conclusions could you draw from the bee swarm plot of iris petal lengths you generated in the previous exercise? For your convenience, the bee swarm plot is regenerated and shown to the right.
Instruction
□
\square
□ All I. versicolor petals are shorter than I. virginica petals.
□ \square □ I. setosa petals have a broader range of lengths than the other two species.
■ \blacksquare ■ I. virginica petals tend to be the longest, and I. setosa petals tend to be the shortest of the three species.
□ \square □ I. versicolor is a hybrid of I. virginica and I. setosa.
1.10 Plot all of your data: ECDFs
1.11 Computing the ECDF
In this exercise, you will write a function that takes as input a 1D array of data and then returns the x
and y
values of the ECDF. You will use this function over and over again throughout this course and its sequel. ECDFs are among the most important plots in statistical analysis. You can write your own function, foo(x,y)
according to the following skeleton:
def foo(a,b):
"""State what function does here"""
# Computation performed here
return x, y
The function foo()
above takes two arguments a
and b
and returns two values x
and y
. The function header def foo(a,b)
: contains the function signature foo(a,b)
, which consists of the function name, along with its parameters.
Instruction
- Define a function with the signature
ecdf(data)
. Within the function definition,- Compute the number of data points, n, using the
len()
function. - The
x
x
x-values are the sorted data. Use the
np.sort()
function to perform the sorting. - The
y
y
y data of the ECDF go from
1/n
to1
in equally spaced increments. You can construct this usingnp.arange()
. Remember, however, that the end value innp.arange()
is not inclusive. Therefore,np.arange()
will need to go from1
ton+1
. Be sure to divide this byn
. - The function returns the values
x
andy
.
- Compute the number of data points, n, using the
def ecdf(data):
"""Compute ECDF for a one-dimensional array of measurements."""
# Number of data points: n
n = len(data)
# x-data for the ECDF: x
x = np.sort(data)
# y-data for the ECDF: y
y = np.arange(1, n+1) / n
return x, y
1.12 Plotting the ECDF
You will now use your ecdf()
function to compute the ECDF for the petal lengths of Anderson’s Iris versicolor flowers. You will then plot the ECDF. Recall that your ecdf()
function returns two arrays so you will need to unpack them. An example of such unpacking is x, y = foo(data)
, for some function foo()
.
Instruction
- Use ecdf() to compute the ECDF of
versicolor_petal_length
. Unpack the output intox_vers
andy_vers
. - Plot the ECDF as dots. Remember to include
marker = '.'
andlinestyle = 'none'
in addition tox_vers
andy_vers
as arguments insideplt.plot()
. - Label the axes. You can label the y-axis
'ECDF'
. - Show your plot.
# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)
# Generate plot
_ = plt.plot(x_vers,
y_vers,
marker = '.',
linestyle = 'none')
# Label the axes
_ = plt.xlabel('percent of versicolor_petal_length')
_ = plt.ylabel('ECDF')
# Display the plot
plt.show()
1.13 Comparison of ECDF
ECDFs also allow you to compare two or more distributions (though plots get cluttered if you have too many). Here, you will plot ECDFs for the petal lengths of all three iris species. You already wrote a function to generate ECDFs so you can put it to good use!
To overlay all three ECDFs on the same plot, you can use plt.plot()
three times, once for each ECDF. Remember to include marker='.'
and linestyle='none'
as arguments inside plt.plot()
.
Instruction
- Compute ECDFs for each of the three species using your
ecdf()
function. The variablessetosa_petal_length
,versicolor_petal_length
, andvirginica_petal_length
are all in your namespace. Unpack the ECDFs intox_set, y_set
,x_vers, y_vers
andx_virg, y_virg
, respectively. - Plot all three ECDFs on the same plot as dots. To do this, you will need three
plt.plot()
commands. Assign the result of each to_
. - A legend and axis labels have been added for you, so hit ‘Submit Answer’ to see all the ECDFs!
# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)
# Plot all ECDFs on the same plot
_ = plt.plot(x_set,
y_set,
marker='.',
linestyle='none')
_ = plt.plot(x_vers,
y_vers,
marker='.',
linestyle='none')
_ = plt.plot(x_virg,
y_virg,
marker='.',
linestyle='none')
# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'),
loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')
# Display the plot
plt.show()
1.14 On toward the whole story!
2. Quantitative exploratory data analysis
2.1 Introduction to summary statistics: The sample mean and median
2.2 Means and medians
Which one of the following statements is true about means and medians?
□
\square
□ An outlier can significantly affect the value of both the mean and the median.
■
\blacksquare
■ An outlier can significantly affect the value of the mean, but not the median.
□
\square
□ Means and medians are in general both robust to single outliers.
□
\square
□ The mean and median are equal if there is an odd number of data points.pre
2.3 Computing means
The mean of all measurements gives an indication of the typical magnitude of a measurement. It is computed using np.mean()
.
Instruction
- Compute the mean petal length of Iris versicolor from Anderson’s classic data set. The variable
versicolor_petal_length
is provided in your namespace. Assign the mean tomean_length_vers
. - Hit submit to print the result.
在这里插入代码片
2.4 Percentiles, outliers, and box plots
2.5 Computing percentiles
In this exercise, you will compute the percentiles of petal length of Iris versicolor.
Instruction
- Create
percentiles
, a NumPy array of percentiles you want to compute. These are the 2.5th, 25th, 50th, 75th, and 97.5th. You can do so by creating a list containing these ints/floats and convert the list to a NumPy array usingnp.array()
. For example,np.array([30, 50])
would create an array consisting of the 30th and 50th percentiles. - Use
np.percentile()
to compute the percentiles of the petal lengths from the Iris versicolor samples. The variableversicolor_petal_length
is in your namespace. - Print the percentiles.
在这里插入代码片
2.6 Comparing percentiles to ECDF
To see how the percentiles relate to the ECDF, you will plot the percentiles of Iris versicolor petal lengths you calculated in the last exercise on the ECDF plot you generated in chapter 1. The percentile variables from the previous exercise are available in the workspace as ptiles_vers
and percentiles
.
Note that to ensure the Y-axis of the ECDF plot remains between 0 and 1, you will need to rescale the percentiles
array accordingly - in this case, dividing it by 100.
Instruction
- Plot the percentiles as red diamonds on the ECDF. Pass the x and y co-ordinates -
ptiles_vers
andpercentiles/100
- as positional arguments and specify themarker='D'
,color='red'
andlinestyle='none'
keyword arguments. The argument for the y-axis -percentiles/100
has been specified for you. - Display the plot.
在这里插入代码片
2.7 Box-and whisker plot
Making a box plot for the petal lengths is unnecessary because the iris data set is not too large and the bee swarm plot works fine. However, it is always good to get some practice. Make a box plot of the iris petal lengths. You have a pandas DataFrame, df, which contains the petal length data, in your namespace. Inspect the data frame df
in the IPython shell using df.head()
to make sure you know what the pertinent columns are.
For your reference, the code used to produce the box plot in the video is provided below:
_ = sns.boxplot(x='east_west', y='dem_share', data=df_all_states)
_ = plt.xlabel('region')
_ = plt.ylabel('percent of vote for Obama')
In the IPython Shell, you can use sns.boxplot
? or help(sns.boxplot)
for more details on how to make box plots using seaborn.
Instruction
- The set-up is exactly the same as for the bee swarm plot; you just call
sns.boxplot()
with the same keyword arguments as you wouldsns.swarmplot()
. The x-axis is'species'
and y-axis is'petal length (cm)'
. - Don’t forget to label your axes!
- Display the figure using the normal call.
在这里插入代码片
2.8 Variance and standard deviation
2.9 Computing the variance
It is important to have some understanding of what commonly-used functions are doing under the hood. Though you may already know how to compute variances, this is a beginner course that does not assume so. In this exercise, we will explicitly compute the variance of the petal length of Iris veriscolor using the equations discussed in the videos. We will then use np.var()
to compute it.
Instruction
- Create an array called
differences
that is the difference between the petal lengths (versicolor_petal_length
) and the mean petal length. The variableversicolor_petal_length
is already in your namespace as a NumPy array so you can take advantage of NumPy’s vectorized operations. - Square each element in this array. For example,
x**2
squares each element in the arrayx
. Store the result asdiff_sq
. - Compute the mean of the elements in
diff_sq
usingnp.mean()
. Store the result asvariance_explicit
. - Compute the variance of
versicolor_petal_length
usingnp.var()
. Store the result asvariance_np
. - Print both
variance_explicit
andvariance_np
in oneprint
call to make sure they are consistent.
在这里插入代码片
2.10 The standard deviation and the variance
As mentioned in the video, the standard deviation is the square root of the variance. You will see this for yourself by computing the standard deviation using np.std()
and comparing it to what you get by computing the variance with np.var()
and then computing the square root.
Instruction
- Compute the variance of the data in the
versicolor_petal_length
array usingnp.var()
and store it in a variable calledvariance
. - Print the square root of this value.
- Print the standard deviation of the data in the
versicolor_petal_length
array usingnp.std()
.
在这里插入代码片
2.11 Covariance and the Pearson correlation coefficient
2.12 Scatter plots
When you made bee swarm plots, box plots, and ECDF plots in previous exercises, you compared the petal lengths of different species of iris. But what if you want to compare two properties of a single species? This is exactly what we will do in this exercise. We will make a scatter plot of the petal length and width measurements of Anderson’s Iris versicolor flowers. If the flower scales (that is, it preserves its proportion as it grows), we would expect the length and width to be correlated.
For your reference, the code used to produce the scatter plot in the video is provided below:
_ = plt.plot(total_votes/1000, dem_share, marker='.', linestyle='none')
_ = plt.xlabel('total votes (thousands)')
_ = plt.ylabel('percent of vote for Obama')
Instruction
- Use
plt.plot()
with the appropriate keyword arguments to make a scatter plot of versicolor petal length (x-axis) versus petal width (y-axis). The variablesversicolor_petal_length
andversicolor_petal_width
are already in your namespace. Do not forget to use themarker='.'
andlinestyle='none'
keyword arguments. - Label the axes.
- Display the plot.
在这里插入代码片
2.13 Variance and covariance by looking
Consider four scatter plots of x x x- y y y data, appearing to the right. Which has, respectively,
- the highest variance in the variable x,
- the highest covariance,
- negative covariance?
□ \square □ a, c, b
□ \square □ d, c, a
■ \blacksquare ■ d, c, b
□ \square □ d, d, b
2.14 Computing the covariance
The covariance may be computed using the Numpy function np.cov()
. For example, we have two sets of data x
and y
, np.cov(x, y)
returns a 2D array where entries [0,1]
and [1,0]
are the covariances. Entry [0,0]
is the variance of the data in x
, and entry [1,1]
is the variance of the data in y
. This 2D output array is called the covariance matrix, since it organizes the self- and covariance.
To remind you how the I. versicolor petal length and width are related, we include the scatter plot you generated in a previous exercise.
Instruction
- Use
np.cov()
to compute the covariance matrix for the petal length (versicolor_petal_length
) and width (versicolor_petal_width
) of I. versicolor. - Print the covariance matrix.
- Extract the covariance from entry
[0,1]
of the covariance matrix. Note that by symmetry, entry[1,0]
is the same as entry[0,1]
. - Print the covariance.
在这里插入代码片
2.15 Computing the Pearson correlation coefficient
As mentioned in the video, the Pearson correlation coefficient, also called the Pearson r, is often easier to interpret than the covariance. It is computed using the np.corrcoef()
function. Like np.cov()
, it takes two arrays as arguments and returns a 2D array. Entries [0,0]
and [1,1]
are necessarily equal to 1
(can you think about why?), and the value we are after is entry [0,1]
.
In this exercise, you will write a function, pearson_r(x, y)
that takes in two arrays and returns the Pearson correlation coefficient. You will then use this function to compute it for the petal lengths and widths of I. versicolor.
Instruction
- Define a function with signature
pearson_r(x, y)
. - Use
np.corrcoef()
to compute the correlation matrix ofx
andy
(pass them tonp.corrcoef()
in that order). - The function returns entry
[0,1]
of the correlation matrix. - Compute the Pearson correlation between the data in the arrays
versicolor_petal_length
andversicolor_petal_width
. Assign the result tor
. - Print the result.
在这里插入代码片
3. Thinking probailistically - Discrete variables
3.1 Probabilistic logic and statistical inference
3.2 What is the goal of statistical Inference?
Why do we do statistical inference?
□ \square □ To draw probabilistic conclusions about what we might expect if we collected the same data again.
□ \square □ To draw actionable conclusions from data.
□ \square □ To draw more general conclusions from relatively few data or observations.
■ \blacksquare ■ All of these.
3.3 Why do we use the language of probability?
Which of the following is not a reason why we use probabilistic language in statistical inference?
□ \square □ Probability provides a measure of uncertainty.
□ \square □ Probabilistic language is not very precise.
■ \blacksquare ■ Data are almost never exactly the same when acquired again, and probability allows us to say how much we expect them to vary.
3.4 Random number generators and hacker statistics
3.5 Generating random numbers using the np.random module
We will be hammering the np.random
module for the rest of this course and its sequel. Actually, you will probably call functions from this module more than any other while wearing your hacker statistician hat. Let’s start by taking its simplest function, np.random.random()
for a test spin. The function returns a random number between zero and one. Call np.random.random()
a few times in the IPython shell. You should see numbers jumping around between zero and one.
In this exercise, we’ll generate lots of random numbers between zero and one, and then plot a histogram of the results. If the numbers are truly random, all bars in the histogram should be of (close to) equal height.
You may have noticed that, in the video, Justin generated 4 random numbers by passing the keyword argument size=4
to np.random.random()
. Such an approach is more efficient than a for
loop: in this exercise, however, you will write a for
loop to experience hacker statistics as the practice of repeating an experiment over and over again.
Instruction
- Seed the random number generator using the seed
42
. - Initialize an empty array,
random_numbers
, of 100,000 entries to store the random numbers. Make sure you usenp.empty(100000)
to do this. - Write a
for
loop to draw 100,000 random numbers usingnp.random.random()
, storing them in therandom_numbers
array. To do so, loop overrange(100000)
. - Plot a histogram of
random_numbers
. It is not necessary to label the axes in this case because we are just checking the random number generator. Hit ‘Submit Answer’ to show your plot.
在这里插入代码片
3.6 The np.random module and Bernoulli trials
You can think of a Bernoulli trial as a flip of a possibly biased coin. Specifically, each coin flip has a probability p of landing heads (success) and probability
1
−
p
1−p
1−p of landing tails (failure). In this exercise, you will write a function to perform n
Bernoulli trials, perform_bernoulli_trials(n, p)
, which returns the number of successes out of n
Bernoulli trials, each of which has probability p
of success. To perform each Bernoulli trial, use the np.random.random()
function, which returns a random number between zero and one.
Instruction
- Define a function with signature
perform_bernoulli_trials(n, p)
.- Initialize to zero a variable
n_success
the counter ofTrue
s, which are Bernoulli trial successes. - Write a
for
loop where you perform a Bernoulli trial in each iteration and increment the number of success if the result isTrue
. Performn
iterations by looping overrange(n)
.- To perform a Bernoulli trial, choose a random number between zero and one using
np.random.random()
. If the number you chose is less thanp
, incrementn_success
(use the+= 1
operator to achieve this).
- To perform a Bernoulli trial, choose a random number between zero and one using
- The function returns the number of successes
n_success
.
- Initialize to zero a variable
在这里插入代码片
3.7 How many defaults might we expect?
Let’s say a bank made 100 mortgage loans. It is possible that anywhere between 0 and 100 of the loans will be defaulted upon. You would like to know the probability of getting a given number of defaults, given that the probability of a default is p = 0.05
. To investigate this, you will do a simulation. You will perform 100 Bernoulli trials using the perform_bernoulli_trials()
function you wrote in the previous exercise and record how many defaults we get. Here, a success is a default. (Remember that the word “success” just means that the Bernoulli trial evaluates to True
, i.e., did the loan recipient default?) You will do this for another 100 Bernoulli trials. And again and again until we have tried it 1000 times. Then, you will plot a histogram describing the probability of the number of defaults.
Instruction
- Seed the random number generator to 42.
- Initialize
n_defaults
, an empty array, usingnp.empty()
. It should contain 1000 entries, since we are doing 1000 simulations. - Write a
for
loop with1000
iterations to compute the number of defaults per 100 loans using theperform_bernoulli_trials()
function. It accepts two arguments: the number of trialsn
- in this case 100 - and the probability of successp
- in this case the probability of a default, which is0.05
. On each iteration of the loop store the result in an entry ofn_defaults
. - Plot a histogram of
n_defaults
. Include thenormed=True
keyword argument so that the height of the bars of the histogram indicate the probability. - Show your plot.
在这里插入代码片
3.8 Will the bank fail?
Plot the number of defaults you got from the previous exercise, in your namespace as n_defaults
, as a CDF. The ecdf()
function you wrote in the first chapter is available.
If interest rates are such that the bank will lose money if 10 or more of its loans are defaulted upon, what is the probability that the bank will lose money?
Instruction
- Compute the
x
andy
values for the ECDF ofn_defaults
. - Plot the ECDF, making sure to label the axes. Remember to include
marker = '.'
andlinestyle = 'none'
in addition tox
andy
in your callplt.plot()
. - Show the plot.
- Compute the total number of entries in your
n_defaults
array that were greater than or equal to 10. To do so, compute a boolean array that tells you whether a given entry ofn_defaults
is>= 10
. Then sum all the entries in this array usingnp.sum()
. For example,np.sum(n_defaults <= 5)
would compute the number of defaults with 5 or fewer defaults. - The probability that the bank loses money is the fraction of
n_defaults
that are greater than or equal to 10. Print this result by hitting ‘Submit Answer’!
在这里插入代码片
3.9 Probability distributions and stories: The Binomial distribution
3.10 Sampling out of the Binomial distribution
Compute the probability mass function for the number of defaults we would expect for 100 loans as in the last section, but instead of simulating all of the Bernoulli trials, perform the sampling using np.random.binomial()
. This is identical to the calculation you did in the last set of exercises using your custom-written perform_bernoulli_trials()
function, but far more computationally efficient. Given this extra efficiency, we will take 10,000 samples instead of 1000. After taking the samples, plot the CDF as last time. This CDF that you are plotting is that of the Binomial distribution.
Note: For this exercise and all going forward, the random number generator is pre-seeded for you (with np.random.seed(42)
) to save you typing that each time.
Instruction
- Draw samples out of the Binomial distribution using
np.random.binomial()
. You should use parametersn = 100
andp = 0.05
, and set thesize
keyword argument to10000
. - Compute the CDF using your previously-written
ecdf()
function. - Plot the CDF with axis labels. The x-axis here is the number of defaults out of 100 loans, while the y-axis is the CDF.
- Show the plot.
在这里插入代码片
3.11 Plotting the Binomial PMF
As mentioned in the video, plotting a nice looking PMF requires a bit of matplotlib trickery that we will not go into here. Instead, we will plot the PMF of the Binomial distribution as a histogram with skills you have already learned. The trick is setting up the edges of the bins to pass to plt.hist()
via the bins keyword argument. We want the bins centered on the integers. So, the edges of the bins should be -0.5, 0.5, 1.5, 2.5, ...
up to max(n_defaults) + 1.5
. You can generate an array like this using np.arange()
and then subtracting 0.5
from the array.
You have already sampled out of the Binomial distribution during your exercises on loan defaults, and the resulting samples are in the NumPy array n_defaults
.
Instruction
- Using
np.arange()
, compute the bin edges such that the bins are centered on the integers. Store the resulting array in the variablebins
. - Use
plt.hist()
to plot the histogram ofn_defaults
with thenormed=True
andbins=bins
keyword arguments. - Show the plot.
在这里插入代码片
3.12 Poisson processes and the Poisson distribution
3.13 Relationship between Binomial and Poisson distributions
You just heard that the Poisson distribution is a limit of the Binomial distribution for rare events. This makes sense if you think about the stories. Say we do a Bernoulli trial every minute for an hour, each with a success probability of 0.1. We would do 60 trials, and the number of successes is Binomially distributed, and we would expect to get about 6 successes. This is just like the Poisson story we discussed in the video, where we get on average 6 hits on a website per hour. So, the Poisson distribution with arrival rate equal to n p np np approximates a Binomial distribution for n n n Bernoulli trials with probability p p p of success (with n n n large and p p p small). Importantly, the Poisson distribution is often simpler to work with because it has only one parameter instead of two for the Binomial distribution.
Let’s explore these two distributions computationally. You will compute the mean and standard deviation of samples from a Poisson distribution with an arrival rate of 10. Then, you will compute the mean and standard deviation of samples from a Binomial distribution with parameters n n n and p p p such that n p = 10 np=10 np=10.
Instruction
- Using the
np.random.poisson()
function, draw10000
samples from a Poisson distribution with a mean of10
. - Make a list of the n and p values to consider for the Binomial distribution. Choose
n = [20, 100, 1000]
andp = [0.5, 0.1, 0.01]
so that np is always 10. - Using
np.random.binomial()
inside the providedfor
loop, draw10000
samples from a Binomial distribution with eachn, p
pair and print the mean and standard deviation of the samples. There are 3n, p
pairs:20, 0.5
,100, 0.1
, and1000, 0.01
. These can be accessed inside the loop asn[i], p[i]
.
在这里插入代码片
3.14 How many no-hitters in a season?
In baseball, a no-hitter is a game in which a pitcher does not allow the other team to get a hit. This is a rare event, and since the beginning of the so-called modern era of baseball (starting in 1901), there have only been 251 of them through the 2015 season in over 200,000 games. The ECDF of the number of no-hitters in a season is shown to the right. Which probability distribution would be appropriate to describe the number of no-hitters we would expect in a given season?
□
\square
□ Discrete uniform
□ \square □ Binomial
□ \square □ Poisson
■ \blacksquare ■ Both Binomial and Poisson, though Poisson is easier to model and compute.
□ \square □ Both Binomial and Poisson, though Binomial is easier to model and compute.
3.15 Was 20`5 anomalous?
1990 and 2015 featured the most no-hitters of any season of baseball (there were seven). Given that there are on average 251/115 no-hitters per season, what is the probability of having seven or more in a season?
Instruction
- Draw
10000
samples from a Poisson distribution with a mean of251/115
and assign ton_nohitters
. - Determine how many of your samples had a result greater than or equal to
7
and assign ton_large
. - Compute the probability,
p_large
, of having7
or more no-hitters by dividingn_large
by the total number of samples (10000
). - Hit ‘Submit Answer’ to print the probability that you calculated.
在这里插入代码片
4. Thinking probabilistically - Continuous variables
4.1 Probability density functions
4.2 Interpreting PDFs
Consider the PDF shown to the below (it may take a second to load!). Which of the following is true?
■
\blacksquare
■
x
x
x is more likely to be less than 10 than to be greater than 10.
□ \square □ x x x is more likely to be greater than 10 than to be less than 10.
□ \square □ We cannot tell from the PDF if x x x is more likely to be greater than or less than 10.
□ \square □ This is not a valid PDF because it has two peaks.
4.3 Interpreting CDFs
At right is the CDF corresponding to the PDF you considered in the last exercise. Using the CDF, what is the probability that x x x is greater than 10?
■ \blacksquare ■ 0.25
□ \square □ 0.75
□ \square □ 3.75
□ \square □ 15
4.4 Introduction to the Normal distribution
4.5 The Normal PDF
In this exercise, you will explore the Normal PDF and also learn a way to plot a PDF of a known distribution using hacker statistics. Specifically, you will plot a Normal PDF for various values of the variance.
Instruction
- Draw 100,000 samples from a Normal distribution that has a mean of
20
and a standard deviation of1
. Do the same for Normal distributions with standard deviations of3
and10
, each still with a mean of20
. Assign the results tosamples_std1
,samples_std3
andsamples_std10
, respectively. - Plot a histograms of each of the samples; for each, use 100 bins, also using the keyword arguments
normed=True
andhisttype='step'
. The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3plt.hist()
calls. - Hit ‘Submit Answer’ to make a legend, showing which standard deviations you used, and show your plot! There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of PDFs.
# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20, 1, size = 100000)
samples_std3 = np.random.normal(20, 3, size = 100000)
samples_std10 = np.random.normal(20, 10, size = 100000)
# Make histograms
_ = plt.hist(samples_std1, normed=True, histtype = 'step', bins = 100)
_ = plt.hist(samples_std3, normed=True, histtype = 'step', bins = 100)
_ = plt.hist(samples_std10, normed=True, histtype = 'step', bins = 100)
# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()
4.6 The Normal CDF
Now that you have a feel for how the Normal PDF looks, let’s consider its CDF. Using the samples you generated in the last exercise (in your namespace as samples_std1
, samples_std3
, and samples_std10
), generate and plot the CDFs.
Instruction
- Use your
ecdf()
function to generate x and y values for CDFs:x_std1, y_std1
,x_std3, y_std3
andx_std10, y_std10
, respectively. - Plot all three CDFs as dots (do not forget the
marker
andlinestyle
keyword arguments!). - Hit submit to make a legend, showing which standard deviations you used, and to show your plot. There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of CDFs.
# Generate CDFs
x_std1, y_std1 = ecdf(samples_std1)
x_std3, y_std3 = ecdf(samples_std3)
x_std10, y_std10 = ecdf(samples_std10)
# Plot CDFs
_ = plt.plot(x_std1, y_std1,
marker = '.', linestyle = 'none')
_ = plt.plot(x_std3, y_std3,
marker = '.', linestyle = 'none')
_ = plt.plot(x_std10, y_std10,
marker = '.', linestyle = 'none')
# Make a legend and show the plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'),
loc='lower right')
plt.show()
4.7 The Normal distribution: Properties and warnings
4.8 Gauss and the 10 Deutschmark banknote
What are the mean and standard deviation, respectively, of the Normal distribution that was on the 10 Deutschmark banknote, shown to the right?
■
\blacksquare
■ mean = 3, std = 1
□ \square □ mean = 3, std = 2
□ \square □ mean = 0.4, std = 1
□ \square □ mean = 0.6, std = 6
4.9 Are the Belmont Stakes results Normally distributed?
Since 1926, the Belmont Stakes is a 1.5 mile-long race of 3-year old thoroughbred horses. Secretariat ran the fastest Belmont Stakes in history in 1973. While that was the fastest year, 1970 was the slowest because of unusually wet and sloppy conditions. With these two outliers removed from the data set, compute the mean and standard deviation of the Belmont winners’ times. Sample out of a Normal distribution with this mean and standard deviation using the np.random.normal()
function and plot a CDF. Overlay the ECDF from the winning Belmont times. Are these close to Normally distributed?
Note: Justin scraped the data concerning the Belmont Stakes from the Belmont Wikipedia page.
Instruction
- Compute mean and standard deviation of Belmont winners’ times with the two outliers removed. The NumPy array
belmont_no_outliers
has these data. - Take 10,000 samples out of a normal distribution with this mean and standard deviation using
np.random.normal()
. - Compute the CDF of the theoretical samples and the ECDF of the Belmont winners’ data, assigning the results to
x_theor, y_theor
andx, y
, respectively. - Hit submit to plot the CDF of your samples with the ECDF, label your axes and show the plot.
# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)
# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu, sigma, size= 10000)
# Get the CDF of the samples and of the data
x_theor, y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)
# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()
The theoretical CDF and the ECDF of the data suggest that the winning Belmont times are, indeed, Normally distributed. This also suggests that in the last 100 years or so, there have not been major technological or training advances that have significantly affected the speed at which horses can run this race.
4.10 What are the chances of a horse matching or beating Secretariat’s record?
Assume that the Belmont winners’ times are Normally distributed (with the 1970 and 1973 years removed), what is the probability that the winner of a given Belmont Stakes will run it as fast or faster than Secretariat?
Instruction
- Take 1,000,000 samples from the normal distribution using the
np.random.normal()
function. The meanmu
and standard deviationsigma
are already loaded into the namespace of your IPython instance. - Compute the fraction of samples that have a time less than or equal to Secretariat’s time of 144 seconds.
- Print the result.
# Take a million samples out of the Normal distribution: samples
samples = np.random.normal(mu, sigma, size = 1000000)
# Compute the fraction that are faster than 144 seconds: prob
prob = np.sum(samples <= 144)/1000000
# Print the result
print('Probability of besting Secretariat:', prob)
4.11 The Exponential distribution
4.12 Matching a story and a distribution
How might we expect the time between Major League no-hitters to be distributed? Be careful here: a few exercises ago, we considered the probability distribution for the number of no-hitters in a season. Now, we are looking at the probability distribution of the time between no hitters.
□ \square □ Normal
■ \blacksquare ■ Exponential
□ \square □ Poisson
□ \square □ Uniform
4.13 Waiting for the next Secretariat
Unfortunately, Justin was not alive when Secretariat ran the Belmont in 1973. Do you think he will get to see a performance like that? To answer this, you are interested in how many years you would expect to wait until you see another performance like Secretariat’s. How is the waiting time until the next performance as good or better than Secretariat’s distributed? Choose the best answer.
□ \square □ Normal, because the distribution of Belmont winning times are Normally distributed.
□ \square □ Normal, because there is a most-expected waiting time, so there should be a single peak to the distribution.
□ \square □ Exponential: It is very unlikely for a horse to be faster than Secretariat, so the distribution should decay away to zero for high waiting time.
■ \blacksquare ■ Exponential: A horse as fast as Secretariat is a rare event, which can be modeled as a Poisson process, and the waiting time between arrivals of a Poisson process is Exponentially distributed.
4.14 If you have a story, you can simulate it!
Sometimes, the story describing our probability distribution does not have a named distribution to go along with it. In these cases, fear not! You can always simulate it. We’ll do that in this and the next exercise.
In earlier exercises, we looked at the rare event of no-hitters in Major League Baseball. Hitting the cycle is another rare baseball event. When a batter hits the cycle, he gets all four kinds of hits, a single, double, triple, and home run, in a single game. Like no-hitters, this can be modeled as a Poisson process, so the time between hits of the cycle are also Exponentially distributed.
How long must we wait to see both a no-hitter and then a batter hit the cycle? The idea is that we have to wait some time for the no-hitter, and then after the no-hitter, we have to wait for hitting the cycle. Stated another way, what is the total waiting time for the arrival of two different Poisson processes? The total waiting time is the time waited for the no-hitter, plus the time waited for the hitting the cycle.
Now, you will write a function to sample out of the distribution described by this story.
Instruction
- Define a function with call signature
successive_poisson(tau1, tau2, size=1)
that samples the waiting time for a no-hitter and a hit of the cycle.- Draw waiting times
tau1
(size
number of samples) for the no-hitter out of an exponential distribution and assign tot1
. - Draw waiting times
tau2
(size
number of samples) for hitting the cycle out of an exponential distribution and assign tot2
. - The function returns the sum of the waiting times for the two events.
- Draw waiting times
def successive_poisson(tau1, tau2, size=1):
"""Compute time for arrival of 2 successive Poisson processes."""
# Draw samples out of first exponential distribution: t1
t1 = np.random.exponential(tau1)
# Draw samples out of second exponential distribution: t2
t2 = np.random.exponential(tau2)
return t1 + t2
4.15 Distribution of no-hitters and cycles
Now, you’ll use your sampling function to compute the waiting time to observe a no-hitter and hitting of the cycle. The mean waiting time for a no-hitter is 764 games, and the mean waiting time for hitting the cycle is 715 games.
Instruction
- Use your
successive_poisson()
function to draw 100,000 out of the distribution of waiting times for observing a no-hitter and a hitting of the cycle. - Plot the PDF of the waiting times using the step histogram technique of a previous exercise. Don’t forget the necessary keyword arguments. You should use
bins=100
,normed=True
, andhisttype='step'
. - Label the axes.
- Show your plot.
# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715,size = 100000)
# Make the histogram
_ = plt.hist(waiting_times, bins = 100,
normed = True, histtype='step')
# Label axes
_ = plt.xlabel('x')
_ = plt.ylabel('y')
# Show the plot
plt.show()