1. Read, clean, and validate
1.1 DataFrames and Series
1.2 Read the codebook
1.3 Exploring the NSFG data
To get the number of rows and columns in a DataFrame, you can read its shape attribute.
To get the column names, you can read the columns attribute. The result is an Index, which is a Pandas data structure that is similar to a list.
Let’s begin exploring the NSFG data! It has been pre-loaded for you into a DataFrame called nsfg.
Instruction
- Calculate the number of rows and columns in the DataFrame nsfg.
- Display the names of the columns in nsfg.
- Select the column 'birthwgt_oz1' and assign it to a new variable called ounces.
- Display the first 5 elements of ounces.
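The four steps above can be sketched as follows. Because the real NSFG file is not available here, the small nsfg DataFrame below is a made-up stand-in (the 'caseid' column and all values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical stand-in for the pre-loaded NSFG DataFrame (values are made up)
nsfg = pd.DataFrame({
    'caseid': [60418, 60418, 60419, 60420, 60420, 60423],
    'birthwgt_lb1': [5.0, 4.0, 5.0, 8.0, 8.0, 7.0],
    'birthwgt_oz1': [4.0, 12.0, 4.0, 13.0, 0.0, 6.0],
})

# Number of rows and columns, as a (rows, columns) tuple
print(nsfg.shape)

# Column names, returned as an Index
print(nsfg.columns)

# Select one column; the result is a Series
ounces = nsfg['birthwgt_oz1']

# First 5 elements of the Series
print(ounces.head())
```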
1.4 Clean and Validate
1.5 Validate a variable
In the NSFG dataset, the variable 'outcome' encodes the outcome of each pregnancy as shown below:
value | label |
---|---|
1 | Live birth |
2 | Induced abortion |
3 | Stillbirth |
4 | Miscarriage |
5 | Ectopic pregnancy |
6 | Current pregnancy |
How many pregnancies in this dataset ended with a live birth?
■ 6489
□ 9538
□ 1469
□ 6
1.6 Clean a variable
In the NSFG dataset, the variable 'nbrnaliv' records the number of babies born alive at the end of a pregnancy.
If you use .value_counts() to view the responses, you’ll see that the value 8 appears once, and if you consult the codebook, you’ll see that this value indicates that the respondent refused to answer the question.
Your job in this exercise is to replace this value with np.nan. Recall from the video how Allen replaced the values 98 and 99 in the ounces column using the .replace() method:
ounces.replace([98, 99], np.nan, inplace=True)
Instruction
- In the 'nbrnaliv' column, replace the value 8, in place, with the special value NaN.
- Confirm that the value 8 no longer appears in this column by printing the values and their frequencies.
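A minimal sketch of this cleaning step, using a made-up stand-in for nsfg. One deliberate difference from the exercise wording: instead of inplace=True on a selected column, the result is assigned back to the column, because in-place changes on a column selection may not reach the DataFrame in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NSFG data; 8 is the "refused" code
nsfg = pd.DataFrame({'nbrnaliv': [1.0, 1.0, 2.0, 8.0, 1.0, 3.0]})

# Replace the special code with NaN and assign the result back to the column
nsfg['nbrnaliv'] = nsfg['nbrnaliv'].replace(8, np.nan)

# value_counts() leaves out NaN by default, so 8 should no longer be listed
counts = nsfg['nbrnaliv'].value_counts()
print(counts)
```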
1.7 Compute a variable
For each pregnancy in the NSFG dataset, the variable 'agecon' encodes the respondent’s age at conception, and 'agepreg' the respondent’s age at the end of the pregnancy.
Both variables are recorded as integers with two implicit decimal places, so the value 2575 means that the respondent’s age was 25.75.
Instruction 1
Select 'agecon' and 'agepreg', divide them by 100, and assign them to the local variables agecon and agepreg.
Instruction 2
Compute the difference, which is an estimate of the duration of the pregnancy. Keep in mind that for each pregnancy, agepreg will be larger than agecon.
Instruction 3
Use .describe() to compute the mean duration and other summary statistics.
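The three instructions can be sketched together as below. The nsfg DataFrame is a made-up stand-in with three rows, and preg_length is a name chosen here for the difference; neither comes from the original exercise data.

```python
import pandas as pd

# Hypothetical stand-in: ages stored as integers with two implicit decimals
nsfg = pd.DataFrame({
    'agecon': [2575, 3100, 1980],
    'agepreg': [2650, 3175, 2050],
})

# Convert to years, e.g. 2575 becomes 25.75
agecon = nsfg['agecon'] / 100
agepreg = nsfg['agepreg'] / 100

# Estimated duration of each pregnancy, in years
preg_length = agepreg - agecon

# Count, mean, standard deviation, min, quartiles, and max
print(preg_length.describe())
```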
1.8 Filter and visualize
1.9 Make a histogram
Histograms are one of the most useful tools in exploratory data analysis. They quickly give you an overview of the distribution of a variable, that is, what values the variable can have, and how many times each value appears.
As we saw in a previous exercise, the NSFG dataset includes a variable 'agecon' that records age at conception for each pregnancy. Here, you’ll plot a histogram of this variable. You’ll use the bins parameter that you saw in the video, and also a new parameter, histtype, which you can read more about in the matplotlib documentation. Learning how to read documentation is an essential skill. If you want to learn more about matplotlib, you can check out DataCamp’s Introduction to Matplotlib course.
Instruction 1
Plot a histogram of agecon with 20 bins.
Instruction 2
Adapt your code to make an unfilled histogram by setting the parameter histtype to be 'step'.
1.10 Compute birth weight
Now let’s pull together the steps in this chapter to compute the average birth weight for full-term babies.
I’ve provided a function, resample_rows_weighted, that takes the NSFG data and resamples it using the sampling weights in wgt2013_2015. The result is a sample that is representative of the U.S. population.
Then I extract birthwgt_lb1 and birthwgt_oz1, replace special codes with NaN, and compute total birth weight in pounds, birth_weight.
# Resample the data
nsfg = resample_rows_weighted(nsfg, 'wgt2013_2015')
# Clean the weight variables
pounds = nsfg['birthwgt_lb1'].replace([98, 99], np.nan)
ounces = nsfg['birthwgt_oz1'].replace([98, 99], np.nan)
# Compute total birth weight
birth_weight = pounds + ounces/16
Instruction
- Make a Boolean Series called full_term that is true for babies with 'prglngth' greater than or equal to 37 weeks.
- Use full_term and birth_weight to select birth weight in pounds for full-term babies. Store the result in full_term_weight.
- Compute the mean weight of full-term babies.
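A sketch of the filtering step under stated assumptions: the nsfg DataFrame below is a made-up stand-in, and the resampling with resample_rows_weighted is skipped because that helper is provided by the course and not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NSFG data (no resampling applied)
nsfg = pd.DataFrame({
    'prglngth': [39, 35, 40, 37, 30],
    'birthwgt_lb1': [7.0, 5.0, 8.0, 6.0, 99.0],
    'birthwgt_oz1': [8.0, 4.0, 0.0, 98.0, 2.0],
})

# Clean the weight variables, then combine pounds and ounces
pounds = nsfg['birthwgt_lb1'].replace([98, 99], np.nan)
ounces = nsfg['birthwgt_oz1'].replace([98, 99], np.nan)
birth_weight = pounds + ounces / 16

# Boolean Series: True where the pregnancy lasted at least 37 weeks
full_term = nsfg['prglngth'] >= 37

# Keep only the rows where full_term is True
full_term_weight = birth_weight[full_term]

# mean() ignores NaN values by default
print(full_term_weight.mean())
```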
1.11 Filter
In the previous exercise, you computed the mean birth weight for full-term babies; you filtered out preterm babies because their distribution of weight is different.
The distribution of weight is also different for multiple births, like twins and triplets. In this exercise, you’ll filter them out, too, and see what effect it has on the mean.
Instruction
- Use the variable 'nbrnaliv' to make a Boolean Series that is True for single births (where 'nbrnaliv' equals 1) and False otherwise.
- Use Boolean Series and logical operators to select single, full-term babies and compute their mean birth weight.
- For comparison, select multiple, full-term babies and compute their mean birth weight.
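The combination of two Boolean Series might look like the sketch below. The data are made up, and the names single_full_term_weight and mult_full_term_weight are chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the NSFG columns and the computed birth weights
nsfg = pd.DataFrame({
    'prglngth': [39, 38, 40, 37, 33],
    'nbrnaliv': [1, 2, 1, 2, 1],
})
birth_weight = pd.Series([7.5, 5.0, 8.5, 6.0, 4.0])

full_term = nsfg['prglngth'] >= 37
single = nsfg['nbrnaliv'] == 1

# & means "and"; both conditions must hold
single_full_term_weight = birth_weight[single & full_term]
print('Single full-term mean:', single_full_term_weight.mean())

# ~ means "not"; it is applied to single before the & is evaluated
mult_full_term_weight = birth_weight[~single & full_term]
print('Multiple full-term mean:', mult_full_term_weight.mean())
```

Note that ~single is also True for rows where 'nbrnaliv' is missing, since a comparison with NaN evaluates to False.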
2. Distributions
2.1 Probability mass functions
2.2 Make a PMF
The GSS dataset has been pre-loaded for you into a DataFrame called gss. You can explore it in the IPython Shell to get familiar with it.
In this exercise, you’ll focus on one variable in this dataset, 'year', which represents the year each respondent was interviewed.
The Pmf class you saw in the video has already been created for you. You can access it outside of DataCamp via the empiricaldist library.
Instruction 1
Make a PMF for year with normalize=False and display the result.
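To show what an unnormalized PMF contains, here is a sketch that uses pandas value_counts() as a stand-in for the course's Pmf class (this is a swapped-in technique, not the Pmf API). The gss DataFrame is made up.

```python
import pandas as pd

# Hypothetical stand-in for the GSS data
gss = pd.DataFrame({'year': [1972, 1972, 1974, 2016, 2016, 2016]})

# Unnormalized PMF: the number of respondents for each year
pmf_counts = gss['year'].value_counts().sort_index()
print(pmf_counts)

# Normalized PMF: the fraction of respondents for each year
pmf_probs = gss['year'].value_counts(normalize=True).sort_index()
print(pmf_probs)
```

With normalize=False the values are counts, so looking up a year answers "how many respondents"; with normalization the values are fractions that add up to 1.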
Instruction 2
How many respondents were interviewed in 2016?
■ 2867
□ 1613
□ 2538
□ 0.045897
2.3 Plot a PMF
Now let’s plot a PMF for the age of the respondents in the GSS dataset. The variable 'age' contains respondents’ age in years.
Instruction 1
Select the 'age' column from the gss DataFrame and store the result in age.
Instruction 2
Make a normalized PMF of age. Store the result in pmf_age.
Instruction 3
Plot pmf_age as a bar chart.
2.4 Cumulative distribution functions
2.5 Make a CDF
In this exercise, you’ll make a CDF and use it to determine the fraction of respondents in the GSS dataset who are OLDER than 30.
The GSS dataset has been preloaded for you into a DataFrame called gss.
As with the Pmf class from the previous lesson, the Cdf class you just saw in the video has been created for you, and you can access it outside of DataCamp via the empiricaldist library.
Instruction 1
Select the 'age' column. Store the result in age.
Instruction 2
Compute the CDF of age. Store the result in cdf_age.
Instruction 3
Calculate the CDF of 30.
Instruction 4
What fraction of the respondents in the GSS dataset are OLDER than 30?
■ Approximately 75%
□ Approximately 65%
□ Approximately 45%
□ Approximately 25%
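The reasoning behind this question is that the CDF at 30 gives the fraction of respondents aged 30 or younger, so the fraction older than 30 is one minus that value. The sketch below computes the same quantity directly with pandas rather than with the Cdf class, on made-up ages.

```python
import pandas as pd

# Hypothetical stand-in for the GSS ages
gss = pd.DataFrame({'age': [22, 25, 30, 34, 41, 47, 58, 63]})
age = gss['age']

# CDF evaluated at 30: fraction of values less than or equal to 30
cdf_at_30 = (age <= 30).mean()

# Complement: fraction of respondents older than 30
older_than_30 = 1 - cdf_at_30

print(cdf_at_30, older_than_30)
```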
2.6 Compute IQR
Recall from the video that the interquartile range (IQR) is the difference between the 75th and 25th percentiles. It is a measure of variability that is robust in the presence of errors or extreme values.
In this exercise, you’ll compute the interquartile range of income in the GSS dataset. Income is stored in the 'realinc' column, and the CDF of income has already been computed and stored in cdf_income.
Instruction 1
Calculate the 75th percentile of income and store it in percentile_75th.
Instruction 2
Calculate the 25th percentile of income and store it in percentile_25th.
Instruction 3
Calculate the interquartile range of income. Store the result in iqr.
Instruction 4
What is the interquartile range (IQR) of income in the GSS dataset?
■ Approximately 29676
□ Approximately 26015
□ Approximately 34702
□ Approximately 30655
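As a rough illustration of the IQR computation, the sketch below uses the pandas quantile() method on made-up incomes instead of the course's cdf_income object. quantile() interpolates between data points, so on real data its results may differ slightly from those of an inverse CDF lookup.

```python
import pandas as pd

# Hypothetical incomes; not taken from the GSS
income = pd.Series([10000, 20000, 30000, 40000, 50000], name='realinc')

# 75th and 25th percentiles
percentile_75th = income.quantile(0.75)
percentile_25th = income.quantile(0.25)

# Interquartile range: spread of the middle half of the data
iqr = percentile_75th - percentile_25th
print(iqr)
```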
2.7 Plot a CDF
The distribution of income in almost every country is long-tailed; that is, there are a small number of people with very high incomes.
In the GSS dataset, the variable 'realinc' represents total household income, converted to 1986 dollars. We can get a sense of the shape of this distribution by plotting the CDF.
Instruction
- Select 'realinc' from the gss dataset.
- Make a Cdf object called cdf_income.
- Create a plot of cdf_income using .plot().
2.8 Comparing distributions
2.9 Distribution of education
Let’s begin comparing incomes for different levels of education in the GSS dataset, which has been pre-loaded for you into a DataFrame called gss. The variable educ represents the respondent’s years of education.
What fraction of respondents report that they have 12 years of education or fewer?
□ Approximately 22%
□ Approximately 31%
□ Approximately 47%
■ Approximately 53%
2.10 Extract education levels
Let’s create Boolean Series to identify respondents with different levels of education.
In the U.S., 12 years of education usually means the respondent has completed high school (secondary education). A respondent with 14 years of education has probably completed an associate degree (two years of college); someone with 16 years has probably completed a bachelor’s degree (four years of college).
Instruction
- Complete the line that identifies respondents with associate degrees, that is, people with 14 or more years of education but less than 16.
- Complete the line that identifies respondents with 12 or fewer years of education.
- Confirm that the mean of high is the fraction we computed in the previous exercise, about 53%.
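The Boolean Series described above could be built as in the following sketch. The gss data are made up, so the printed fraction will not match the 53% from the real dataset; the names high, assc, and bach follow the ones used in the next exercise.

```python
import pandas as pd

# Hypothetical stand-in for the GSS education column
gss = pd.DataFrame({'educ': [8, 12, 12, 14, 15, 16, 18, 12]})
educ = gss['educ']

# Bachelor's degree: 16 or more years
bach = educ >= 16

# Associate degree: at least 14 years but fewer than 16
assc = (educ >= 14) & (educ < 16)

# High school: 12 or fewer years
high = educ <= 12

# The mean of a Boolean Series is the fraction of True values
print(high.mean())
```

The parentheses around each comparison matter, because & binds more tightly than >= and < in Python.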
2.11 Plot income CDFs
Let’s now see what the distribution of income looks like for people with different education levels. You can do this by plotting the CDFs. Recall how Allen plotted the income CDFs of respondents interviewed before and after 1995:
Cdf(income[pre95]).plot(label='Before 1995')
Cdf(income[~pre95]).plot(label='After 1995')
You can assume that Boolean Series have been defined, as in the previous exercise, to identify respondents with different education levels: high, assc, and bach.
Instruction
Fill in the missing lines of code to plot the CDFs.
2.12 Modeling distributions
2.13 Distribution of income
In many datasets, the distribution of income is approximately lognormal, which means that the logarithms of the incomes fit a normal distribution. We’ll see whether that’s true for the GSS data. As a first step, you’ll compute the mean and standard deviation of the log of incomes using NumPy’s np.log10() function.
Then, you’ll use the computed mean and standard deviation to make a norm object using the scipy.stats.norm() function.
Instruction
- Extract 'realinc' from gss and compute its logarithm using np.log10().
- Compute the mean and standard deviation of the result.
- Make a norm object by passing the computed mean and standard deviation to norm().
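A sketch of these steps on made-up incomes. To keep the example free of third-party dependencies beyond NumPy and pandas, it uses NormalDist from the Python standard library as a stand-in for scipy.stats.norm; the exercise itself uses norm(mean, std).

```python
from statistics import NormalDist

import numpy as np
import pandas as pd

# Hypothetical incomes chosen so the logarithms are easy to check
gss = pd.DataFrame({'realinc': [1000.0, 10000.0, 100000.0]})

# Base-10 logarithm of income
log_income = np.log10(gss['realinc'])

# Mean and (sample) standard deviation of the log incomes
mean, std = log_income.mean(), log_income.std()
print(mean, std)

# Stand-in for scipy.stats.norm(mean, std)
dist = NormalDist(mean, std)

# A normal distribution puts half of its probability below its mean
print(dist.cdf(mean))
```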
2.14 Comparing CDFs
To see whether the distribution of income is well modeled by a lognormal distribution, we’ll compare the CDF of the logarithm of the data to a normal distribution with the same mean and standard deviation. These variables from the previous exercise are available for use:
# Extract realinc and compute its log
log_income = np.log10(gss['realinc'])
# Compute mean and standard deviation
mean, std = log_income.mean(), log_income.std()
# Make a norm object
from scipy.stats import norm
dist = norm(mean, std)
dist is a scipy.stats.norm object with the same mean and standard deviation as the data. It provides .cdf(), which evaluates the normal cumulative distribution function.
Be careful with capitalization: Cdf(), with an uppercase C, creates Cdf objects. dist.cdf(), with a lowercase c, evaluates the normal cumulative distribution function.
2.15 Comparing PDFs
In the previous exercise, we used CDFs to see if the distribution of income is lognormal. We can make the same comparison using a PDF and KDE. That’s what you’ll do in this exercise!
As before, the norm object dist is available in your workspace:
from scipy.stats import norm
dist = norm(mean, std)
Just as all norm objects have a .cdf() method, they also have a .pdf() method.
To create a KDE plot, you can use Seaborn’s kdeplot() function.
Instruction
- Evaluate the normal PDF using dist, which is a norm object with the same mean and standard deviation as the data.
- Make a KDE plot of the logarithms of the incomes, using log_income, which is a Series object.
3. Relationships
3.1 Exploring relationships
3.2 PMF of age
Do people tend to gain weight as they get older? We can answer this question by visualizing the relationship between weight and age. But before we make a scatter plot, it is a good idea to visualize distributions one variable at a time. Here, you’ll visualize age using a bar chart first. Recall that all PMF objects have a .bar() method to make a bar chart.
The BRFSS dataset includes a variable, 'AGE' (note the capitalization!), which represents each respondent’s age. To protect respondents’ privacy, ages are rounded off into 5-year bins. 'AGE' contains the midpoint of the bins.
Instruction
- Extract the variable 'AGE' from the DataFrame brfss and assign it to age.
- Plot the PMF of age as a bar chart.
3.3 Scatter plot
Now let’s make a scatterplot of weight versus age. To make the code run faster, I’ve selected only the first 1000 rows from the brfss DataFrame.
weight and age have already been extracted for you. Your job is to use plt.plot() to make a scatter plot.
Instruction
Make a scatter plot of weight and age with format string 'o' and alpha=0.1.
3.4 Jittering
In the previous exercise, the ages fall in columns because they’ve been rounded into 5-year bins. If we jitter them, the scatter plot will show the relationship more clearly. Recall how Allen jittered height and weight in the video:
height_jitter = height + np.random.normal(0, 2, size=len(brfss))
weight_jitter = weight + np.random.normal(0, 2, size=len(brfss))
Instruction
- Add random noise to age with mean 0 and standard deviation 2.5.
- Make a scatter plot between weight and age with marker size 5 and alpha=0.2. Be sure to also specify 'o'.
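The jittering step alone can be sketched as below, on a made-up brfss DataFrame. The fixed random seed and the name age_jitter are choices made here so that the example is repeatable; plotting is left out.

```python
import numpy as np
import pandas as pd

# Fix the seed so the noise is the same on every run (an illustrative choice)
np.random.seed(0)

# Hypothetical stand-in for the BRFSS data; ages are bin midpoints
brfss = pd.DataFrame({
    'AGE': [22.0, 27.0, 32.0, 37.0, 42.0, 47.0],
    'WTKG3': [61.0, 70.5, 82.0, 77.3, 90.1, 68.4],
})
age = brfss['AGE']
weight = brfss['WTKG3']

# Add normal noise with mean 0 and standard deviation 2.5 to each age
age_jitter = age + np.random.normal(0, 2.5, size=len(brfss))
print(age_jitter)
```

The jittered values could then be passed to plt.plot() together with weight, using 'o', markersize=5, and alpha=0.2 as the instruction asks.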
3.5 Visualizing relationships
3.6 Height and weight
Previously we looked at a scatter plot of height and weight, and saw that taller people tend to be heavier. Now let’s take a closer look using a box plot. The brfss DataFrame contains a variable '_HTMG10' that represents height in centimeters, binned into 10 cm groups.
Recall how Allen created the box plot of 'AGE' and 'WTKG3' in the video, with the y-axis on a logarithmic scale:
sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)
plt.yscale('log')
3.7 Distribution of income
In the next two exercises we’ll look at relationships between income and other variables. In the BRFSS, income is represented as a categorical variable; that is, respondents are assigned to one of 8 income categories. The variable name is 'INCOME2'. Before we connect income with anything else, let’s look at the distribution by computing the PMF. Recall that all Pmf objects have a .bar() method.
Instruction
- Extract 'INCOME2' from the brfss DataFrame and assign it to income.
- Plot the PMF of income as a bar chart.
# Extract income
income = brfss['INCOME2']
# Plot the PMF (Pmf is the class provided by the course)
Pmf(income).bar()
# Label the axes
plt.xlabel('Income level')
plt.ylabel('PMF')
plt.show()
3.8 Income and height
Let’s now use a violin plot to visualize the relationship between income and height.
Instruction
- Create a violin plot to plot the distribution of height ('HTM4') in each income ('INCOME2') group. Specify inner=None to simplify the plot.
3.9 Correlation
3.10 Computing correlations
The purpose of the BRFSS is to explore health risk factors, so it includes questions about diet. The variable '_VEGESU1' represents the number of servings of vegetables respondents reported eating per day.
Let’s see how this variable relates to age and income.
Instruction
- From the brfss DataFrame, select the columns 'AGE', 'INCOME2', and '_VEGESU1'.
- Compute the correlation matrix for these variables.
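A sketch of selecting several columns and computing their correlation matrix. The brfss values are made up, so the resulting correlations say nothing about the real survey; the names columns, subset, and corr are chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the BRFSS data
brfss = pd.DataFrame({
    'AGE': [22, 32, 42, 52, 62, 72],
    'INCOME2': [3, 5, 8, 7, 6, 4],
    '_VEGESU1': [1.0, 2.0, 3.5, 3.0, 2.5, 1.5],
    'HTM4': [170, 165, 180, 175, 160, 168],
})

# Select several columns by passing a list of names
columns = ['AGE', 'INCOME2', '_VEGESU1']
subset = brfss[columns]

# Pairwise Pearson correlations between the selected columns
corr = subset.corr()
print(corr)
```

The result is a square, symmetric DataFrame with ones on the diagonal, since every variable is perfectly correlated with itself.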
3.11 Interpreting correlations
In the previous exercise, the correlation between income and vegetable consumption is about 0.12. The correlation between age and vegetable consumption is about -0.01.
Which of the following are correct interpretations of these results?
- A: People with higher incomes eat more vegetables.
- B: The relationship between income and vegetable consumption is linear.
- C: Older people eat more vegetables.
- D: There could be a strong nonlinear relationship between age and vegetable consumption.
□ A and C only.
□ B and D only.
□ B and C only.
■ A and D only.
3.12 Simple regression
3.13 Income and vegetables
As we saw in a previous exercise, the variable '_VEGESU1' represents the number of vegetable servings respondents reported eating per day.
Let’s estimate the slope of the relationship between vegetable consumption and income.
Instruction
- Extract the columns 'INCOME2' and '_VEGESU1' from subset into xs and ys respectively.
- Compute the simple linear regression of these variables.
3.14 Fit a line
Continuing from the previous exercise:
- Assume that xs and ys contain income codes and daily vegetable consumption, respectively, and res contains the results of a simple linear regression of ys onto xs.
Instruction
- Set fx to the minimum and maximum of xs, stored in a NumPy array.
- Set fy to the points on the fitted line that correspond to the fx.
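The idea of evaluating a fitted line at two x values can be sketched as follows. Here np.polyfit with degree 1 is used as a stand-in for SciPy's linregress() so the example needs only NumPy and pandas; in the exercise, the slope and intercept would come from res instead. The data are made up and exactly linear.

```python
import numpy as np
import pandas as pd

# Hypothetical data lying exactly on the line y = 2 + 0.5 * x
subset = pd.DataFrame({
    'INCOME2': [1, 2, 3, 4, 5, 6, 7, 8],
    '_VEGESU1': [2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0],
})
xs = subset['INCOME2']
ys = subset['_VEGESU1']

# Stand-in for linregress: least-squares fit of a degree-1 polynomial
slope, intercept = np.polyfit(xs, ys, 1)

# Two x values are enough to draw a straight line
fx = np.array([xs.min(), xs.max()])

# Points on the fitted line at those x values
fy = intercept + slope * fx
print(fx, fy)
```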
4. Multivariate Thinking
4.1 Limits of simple regression
4.2 Regression and causation
In the BRFSS dataset, there is a strong relationship between vegetable consumption and income. The income of people who eat 8 servings of vegetables per day is double the income of people who eat none, on average.
Which of the following conclusions can we draw from this data?
A. Eating a good diet leads to better health and higher income.
B. People with higher income can afford a better diet.
C. People with high income are more likely to be vegetarians.
□ A only.
□ B only.
□ B and C.
■ None of them.
4.3 Using StatsModels
Let’s run the same regression using SciPy and StatsModels, and confirm we get the same results.
Instruction
- Compute the regression of '_VEGESU1' as a function of 'INCOME2' using SciPy’s linregress().
- Compute the regression of '_VEGESU1' as a function of 'INCOME2' using StatsModels’ smf.ols().
4.4 Multiple regression
4.5 Plot income and education
To get a closer look at the relationship between income and education, let’s use the variable 'educ' to group the data, then plot mean income in each group.
Instruction
- Group gss by 'educ'. Store the result in grouped.
- From grouped, extract 'realinc' and compute the mean.
- Plot mean_income_by_educ as a scatter plot. Specify 'o' and alpha=0.5.
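The grouping step can be sketched as below on a made-up gss DataFrame; the plotting call is left out.

```python
import pandas as pd

# Hypothetical stand-in for the GSS data
gss = pd.DataFrame({
    'educ': [12, 12, 14, 16, 16, 16],
    'realinc': [20000.0, 30000.0, 35000.0, 50000.0, 60000.0, 70000.0],
})

# Group the rows by years of education
grouped = gss.groupby('educ')

# Mean income within each education group; the index holds the educ values
mean_income_by_educ = grouped['realinc'].mean()
print(mean_income_by_educ)
```

Because the result is a Series indexed by 'educ', passing it to plt.plot() with 'o' puts education on the x-axis and mean income on the y-axis.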
4.6 Non-linear model of education
The graph in the previous exercise suggests that the relationship between income and education is non-linear. So let’s try fitting a non-linear model.
Instruction
- Add a column named 'educ2' to the gss DataFrame; it should contain the values from 'educ' squared.
- Run a regression model that uses 'educ', 'educ2', 'age', and 'age2' to predict 'realinc'.
4.7 Visualizing regression results
4.8 Making predictions
At this point, we have a model that predicts income using age, education, and sex.
Let’s see what it predicts for different levels of education, holding age constant.
Instruction
- Using np.linspace(), add a variable named 'educ' to df with a range of values from 0 to 20.
- Add a variable named 'age' with the constant value 30.
- Use df to generate predicted income as a function of education.
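A sketch of building the prediction DataFrame only. The squared columns 'educ2' and 'age2' are included on the assumption that the fitted model uses them, as in the non-linear model from the earlier exercise; fitting the model itself is not shown.

```python
import numpy as np
import pandas as pd

# Start from an empty DataFrame and add one column at a time
df = pd.DataFrame()

# np.linspace returns 50 evenly spaced values by default
df['educ'] = np.linspace(0, 20)

# A scalar is broadcast to every row, holding age constant
df['age'] = 30

# Squared terms (assumed to be needed by the fitted model)
df['educ2'] = df['educ'] ** 2
df['age2'] = df['age'] ** 2

print(df.head())
```

Given a fitted StatsModels result (called results here, an assumed name), the predictions could then be generated with results.predict(df).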
4.9 Visualizing predictions
Now let’s visualize the results from the previous exercise!
Instruction
- Plot mean_income_by_educ using circles ('o'). Specify an alpha of 0.5.
- Plot the prediction results with a line, with df['educ'] on the x-axis and pred on the y-axis.
4.10 Logistic regression
4.11 Predicting a binary variable
Let’s use logistic regression to predict a binary variable. Specifically, we’ll use age, sex, and education level to predict support for legalizing cannabis (marijuana) in the U.S.
In the GSS dataset, the variable grass records the answer to the question “Do you think the use of marijuana should be made legal or not?”
Instruction 1
Fill in the parameters of smf.logit() to predict grass using the variables age, age2, educ, and educ2, along with sex as a categorical variable.
Instruction 2
Add a column called educ and set it to 12 years; then compute a second column, educ2, which is the square of educ.
Instruction 3
Generate separate predictions for men and women.
Instruction 4
Fill in the missing code to compute the mean of 'grass' for each age group, and then the arguments of plt.plot() to plot pred2 versus df['age'] with the label 'Female'.