Chapter 1
1. Loading data and identifying variable types
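data(): load a dataset that ships with R or an installed package into the workspace
glimpse(): list each variable with its type and first few values (from the dplyr package)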
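A minimal sketch of both steps, using the built-in mtcars dataset as a stand-in. The name var_types is illustrative only, and dplyr is assumed to be installed for glimpse().

```r
# Load dplyr for glimpse() (assumes the package is installed)
library(dplyr)

# Load the built-in mtcars dataset into the workspace
data(mtcars)

# Print one line per variable: name, type, and first few values
glimpse(mtcars)

# Base R alternative: collect the class of each column
var_types <- sapply(mtcars, class)
var_types
```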
2. Filtering based on a factor
eg1. The following code filters the mtcars dataset for cars with 6 cylinders.
mtcars %>%
  filter(cyl == 6)
eg2. Create a new dataset called email50_big that is a subset of the original email50 dataset containing only emails with "big" numbers. This information is stored in the number variable.
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")
eg3. droplevels(): remove unused levels from the factor variables in a dataset. Filtering keeps every original level, even those with no remaining rows.
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")
# Table of the number variable
table(email50_big$number)
# Output
 none small   big
    0     0     7
# Drop levels
email50_big$number_dropped <- droplevels(email50_big$number)
# Table of the number variable
table(email50_big$number_dropped)
# Output
big
  7
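The same behaviour can be reproduced without the email50 data. The sketch below uses a made-up factor (number, big_only, and dropped are illustrative names) and base R only.

```r
# Toy factor with three levels (made-up values, not the email50 data)
number <- factor(c("big", "none", "small", "big", "big"),
                 levels = c("none", "small", "big"))

# Subsetting keeps all three levels, even the empty ones
big_only <- number[number == "big"]
levels(big_only)
table(big_only)

# droplevels() removes the levels that no longer occur
dropped <- droplevels(big_only)
levels(dropped)
table(dropped)
```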
3. Discretize a variable
eg1. Create a categorical version of the num_char variable in the email50 dataset. num_char is the number of characters in an email, in thousands. The new variable has two levels, "below median" and "at or above median", depending on whether an email has fewer characters than the median or at least as many.
# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)
# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
# Count emails in each category
email50_fortified %>%
  count(num_char_cat)
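To see how the split treats a value that sits exactly on the median, here is a small sketch on made-up numbers (toy, med, and toy_cat are illustrative names; dplyr is assumed to be installed).

```r
library(dplyr)

# Five made-up character counts (in thousands); the middle value is the median
toy <- data.frame(num_char = c(1.2, 5.0, 10.4, 22.7, 43.5))
med <- median(toy$num_char)

# Values strictly below the median go to one level, everything else to the other
toy_cat <- toy %>%
  mutate(num_char_cat = ifelse(num_char < med, "below median", "at or above median"))

# The observation equal to the median lands in "at or above median"
toy_cat %>%
  count(num_char_cat)
```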
eg2. Create a new column in email50 called number_yn that is "no" if there is no number in the email and "yes" otherwise. Use case_when() for this. Assign the result to email50_fortified.
# Create number_yn column in email50
email50_fortified <- email50 %>%
  mutate(
    number_yn = case_when(
      # if number is "none", make number_yn "no"
      number == "none" ~ "no",
      # if number is not "none", make number_yn "yes"
      number != "none" ~ "yes"
    )
  )
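One detail worth checking is what case_when() does with a missing value. The sketch below uses a made-up data frame (toy and toy_yn are illustrative names) with one NA, and assumes dplyr is installed. Neither condition evaluates to TRUE for the NA row, so that row is left as NA.

```r
library(dplyr)

# Made-up factor column with one missing value
toy <- data.frame(
  number = factor(c("none", "small", "big", NA),
                  levels = c("none", "small", "big"))
)

toy_yn <- toy %>%
  mutate(
    number_yn = case_when(
      number == "none" ~ "no",
      number != "none" ~ "yes"
    )
  )

# The NA row matches neither condition and stays NA
toy_yn$number_yn
```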
4. Visualizing numerical data
eg. Create a scatterplot of number of exclamation points (exclaim_mess) on the y-axis vs. number of characters (num_char) on the x-axis. (1) Color points by whether or not the email is spam. (2) Note that the spam variable is stored as numerical (0/1) but we want to use it as a categorical variable in this plot. To do this, force R to think of it as such with the factor() function.
# Load ggplot2
library(ggplot2)
# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()
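A possible variation, sketched on made-up data: converting the 0/1 column to a labelled factor first gives a readable legend. The data frame toy_email, the column spam_fct, and the labels "not spam" and "spam" are assumptions for illustration, not part of email50.

```r
library(ggplot2)

# Made-up stand-in for email50 with a numeric 0/1 spam indicator
toy_email <- data.frame(
  num_char     = c(0.5, 3.2, 7.8, 12.1, 20.4, 35.0),
  exclaim_mess = c(0, 1, 4, 2, 9, 6),
  spam         = c(0, 0, 1, 0, 1, 0)
)

# spam is numeric, so ggplot2 would map it to a continuous colour scale
class(toy_email$spam)

# Convert to a factor with descriptive labels
toy_email$spam_fct <- factor(toy_email$spam,
                             levels = c(0, 1),
                             labels = c("not spam", "spam"))

# The factor is mapped to a discrete colour scale with one colour per level
p <- ggplot(toy_email, aes(x = num_char, y = exclaim_mess, color = spam_fct)) +
  geom_point() +
  labs(color = "Email type")
print(p)
```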
Chapter 2
1. Observational studies and experiments
(1) Observational study:
Collect data in a way that does not directly interfere with how the data arise
Only association (correlation) can be inferred, not causation
(2) Experiment:
Randomly assign subjects to various treatments
Causation can be inferred
2. Random sampling and random assignment
(1) Random sampling:
At selection of subjects from population
Helps generalizability of results
(2) Random assignment:
At assignment of subjects to treatments
Helps infer causation from results
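The two steps can be sketched in base R as follows. The population of 100 subject IDs, the sample size of 20, and the two group labels are made up for illustration.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100 subjects
population <- paste0("subject_", 1:100)

# Random sampling: choose which subjects enter the study
study_sample <- sample(population, size = 20)

# Random assignment: shuffle a balanced vector of group labels
groups <- sample(rep(c("treatment", "control"), each = 10))

assignment <- data.frame(subject = study_sample, group = groups)

# Each group receives 10 of the 20 sampled subjects
table(assignment$group)
```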
3. Simpson's paradox
eg1. Number of males and females admitted
Pass the gender and admission status columns to count() on the ucb_admit dataset (which is already pre-loaded) to count how many of each gender are admitted and how many are rejected.
# Load packages
library(dplyr)
glimpse(ucb_admit)
# Output
Rows: 4,526
Columns: 3
$ Admit <fct> Admitted, Admitted, Admitted, Admitted, Admitted, Admitted, ...
$ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, ...
$ Dept <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...
# Count number of male and female applicants admitted
ucb_admit %>%
  count(Gender, Admit)
# Output
# A tibble: 4 x 3
  Gender Admit        n
  <fct>  <fct>    <int>
1 Male   Admitted  1198
2 Male   Rejected  1493
3 Female Admitted   557
4 Female Rejected  1278
eg2. Proportion of males and females admitted overall
The table of counts of gender and admission status you developed earlier is available as ucb_admission_counts.
ucb_admission_counts %>%
  # Group by gender
  group_by(Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for admitted
  filter(Admit == "Admitted")
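The paradox shows up once the same proportion is computed within each department. The sketch below assumes that base R's built-in UCBAdmissions table holds the same data as ucb_admit (its counts also total 4,526). The names ucb_counts, overall, and by_dept are illustrative, and dplyr is assumed to be installed.

```r
library(dplyr)

# Built-in 3-way table converted to one row per Admit/Gender/Dept combination
ucb_counts <- as.data.frame(UCBAdmissions)

# Overall proportion admitted by gender (departments pooled)
overall <- ucb_counts %>%
  group_by(Gender, Admit) %>%
  summarise(n = sum(Freq), .groups = "drop") %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  filter(Admit == "Admitted") %>%
  ungroup()

# Proportion admitted by gender within each department
by_dept <- ucb_counts %>%
  group_by(Dept, Gender) %>%
  mutate(prop = Freq / sum(Freq)) %>%
  filter(Admit == "Admitted") %>%
  ungroup()

overall
by_dept
```

Comparing the two results side by side shows that the pooled comparison and the within-department comparisons do not always point in the same direction, which is the essence of Simpson's paradox.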
Chapter 3
1. Sampling strategies
(1) Simple random sample
(2) Stratified sample
(3) Cluster sample
(4) Multistage sample
2. Sampling in R
us_regions
state region
1 Connecticut Northeast
2 Maine Northeast
3 Massachusetts Northeast
4 New Hampshire Northeast
5 Rhode Island Northeast
6 Vermont Northeast
7 New Jersey Northeast
8 New York Northeast
9 Pennsylvania Northeast
10 Illinois Midwest
11 Indiana Midwest
12 Michigan Midwest
13 Ohio Midwest
14 Wisconsin Midwest
15 Iowa Midwest
16 Kansas Midwest
17 Minnesota Midwest
18 Missouri Midwest
19 Nebraska Midwest
20 North Dakota Midwest
21 South Dakota Midwest
22 Delaware South
23 Florida South
24 Georgia South
25 Maryland South
26 North Carolina South
27 South Carolina South
28 Virginia South
29 District of Columbia South
30 West Virginia South
31 Alabama South
32 Kentucky South
33 Mississippi South
34 Tennessee South
35 Arkansas South
36 Louisiana South
37 Oklahoma South
38 Texas South
39 Arizona West
40 Colorado West
41 Idaho West
42 Montana West
43 Nevada West
44 New Mexico West
45 Utah West
46 Wyoming West
47 Alaska West
48 California West
49 Hawaii West
50 Oregon West
51 Washington West
eg1. Simple random sample in R
The dplyr package and us_regions data frame have been loaded. (1) Use simple random sampling to select eight states from us_regions. Save this sample in a data frame called states_srs. (2) Count the number of states from each region in your sample.
# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(8)
# Count states by region
states_srs %>%
  count(region)
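Because the sample is random, the counts change on every run. A sketch of a reproducible version follows, with two assumptions: the built-in state.name and state.region vectors stand in for us_regions (50 states, no District of Columbia, and the Midwest is labelled "North Central"), and slice_sample(), which supersedes sample_n() from dplyr 1.0.0 onward, is available.

```r
library(dplyr)

# Stand-in for us_regions built from base R's state datasets
us_regions_demo <- data.frame(
  state  = state.name,
  region = as.character(state.region)
)

# Fixing the seed makes the random draw repeatable
set.seed(2024)
states_srs_demo <- us_regions_demo %>%
  slice_sample(n = 8)

# Same seed, same sample
set.seed(2024)
states_srs_again <- us_regions_demo %>%
  slice_sample(n = 8)

identical(states_srs_demo$state, states_srs_again$state)

# Region counts are usually unequal under simple random sampling
states_srs_demo %>%
  count(region)
```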
eg2. Stratified sample in R
(1) Use stratified sampling to select a total of 8 states, where each stratum is a region. Save this sample in a data frame called states_str. (Remember that there are 4 regions, each to be sampled equally!) (2) Count the number of states from each region in your sample to confirm that each region is represented equally in your sample.
# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)
# Count states by region
states_str %>%
  count(region)
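The two remaining strategies from the list, cluster and multistage sampling, could be sketched along the same lines. This is only an illustration: regions are treated as clusters, the built-in state.name and state.region vectors stand in for us_regions, and the names picked_regions, cluster_sample, and multistage_sample are made up. dplyr 1.0.0 or later is assumed for slice_sample().

```r
library(dplyr)

set.seed(42)

# Stand-in for us_regions built from base R's state datasets
us_regions_demo <- data.frame(
  state  = state.name,
  region = as.character(state.region)
)

# Cluster sample: randomly pick 2 regions, then keep every state in them
picked_regions <- sample(unique(us_regions_demo$region), size = 2)
cluster_sample <- us_regions_demo %>%
  filter(region %in% picked_regions)

# Multistage sample: within the picked regions, randomly pick 3 states each
multistage_sample <- cluster_sample %>%
  group_by(region) %>%
  slice_sample(n = 3) %>%
  ungroup()

cluster_sample %>% count(region)
multistage_sample %>% count(region)
```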
3. Principles of experimental design
(1) Control: compare treatment of interest to a control group
(2) Randomize: randomly assign subjects to treatments
(3) Replicate: collect a sufficiently large sample within a study, or replicate the entire study
(4) Block: account for the potential effect of confounding variables (group subjects into blocks based on these variables, then randomize to treatment groups within each block)
eg1. A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.
There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance).
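A sketch of how such a blocked design might be randomized in base R. The 16 subjects, the "low"/"high" levels for light and noise, and all variable names are made up for illustration. Randomization happens separately inside each gender block, so every light and noise combination is represented equally for both genders.

```r
set.seed(7)

# Hypothetical subjects: 8 per gender block
subjects <- data.frame(
  id     = 1:16,
  gender = rep(c("female", "male"), each = 8)
)

# The 4 treatment combinations of the two explanatory variables
conditions <- expand.grid(light = c("low", "high"),
                          noise = c("low", "high"))

# Within each block, shuffle a balanced set of condition indices
subjects$condition <- NA_integer_
for (g in unique(subjects$gender)) {
  rows <- which(subjects$gender == g)
  balanced <- rep(seq_len(nrow(conditions)), length.out = length(rows))
  subjects$condition[rows] <- sample(balanced)
}

# Attach the light and noise levels for each assigned condition
subjects$light <- conditions$light[subjects$condition]
subjects$noise <- conditions$noise[subjects$condition]

# Every gender-by-condition cell contains the same number of subjects
table(subjects$gender, subjects$condition)
```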
eg2. Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.
eg3. In random sampling, we use stratification to control for a variable. In random assignment, we use blocking to achieve the same goal.