Datacamp: Introduction to Data in R

Miyuki酱

Last modified: 2022-08-17 09:59:44

First published: 2022-08-10 18:10:44

Article link: https://blog.csdn.net/weixin_51825567/article/details/126252122

Chapter 1

1. Loading data and identifying variable types

data(): load a built-in or package dataset into R

glimpse(): preview a data frame and see the type of each variable (from the dplyr package)
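
A minimal sketch of the two functions together. It uses the built-in mtcars dataset as a stand-in so that it runs without the course data; str() is shown only as a base R alternative.

```r
library(dplyr)

# Load a built-in dataset into the workspace (mtcars ships with base R)
data(mtcars)

# One line per variable: name, type (<dbl>, <fct>, <chr>, ...) and first values
glimpse(mtcars)

# Base R alternative for checking variable types
str(mtcars)
```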

2. Filtering based on a factor

eg1. The following code filters the mtcars dataset for cars with 6 cylinders.

# Load dplyr for %>% and filter()
library(dplyr)

mtcars %>%
  filter(cyl == 6)

eg2. Create a new dataset called email50_big that is a subset of the original email50 dataset containing only emails with "big" numbers. This information is stored in the number variable.

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

eg3. droplevels(): remove unused levels of factor variables. Filtering keeps every original level, even those with zero rows left, so they still show up in tables.

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

# Table of the number variable
table(email50_big$number)

# Output
none small   big 
   0     0     7

# Drop levels
email50_big$number_dropped <- droplevels(email50_big$number)

# Table of the number variable
table(email50_big$number_dropped)

# Output
big 
  7
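
The same behaviour can be reproduced without the email50 data. The sketch below uses a made-up factor with the same three levels (the values are an assumption for illustration only).

```r
# Toy factor with the same three levels as email50$number (values are made up)
number <- factor(c("big", "none", "small", "big"),
                 levels = c("none", "small", "big"))

# Subsetting keeps all three levels, even though only "big" remains
big_only <- number[number == "big"]
table(big_only)

# droplevels() discards the levels that no longer occur
big_dropped <- droplevels(big_only)
table(big_dropped)
```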

3. Discretize a variable

eg1. Create a categorical version of the num_char variable in the email50 dataset. num_char is the number of characters in an email, in thousands. The new variable has two levels ("below median" and "at or above median"), depending on whether an email has fewer characters than the median or at least as many.

# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)

# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
  
# Count emails in each category
email50_fortified %>%
  count(num_char_cat)
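
To see how the boundary case is handled, here is a small sketch on made-up values (the emails data frame is hypothetical). The observation that sits exactly at the median lands in "at or above median" because the condition uses a strict <.

```r
library(dplyr)

# Hypothetical character counts (in thousands) for five emails
emails <- data.frame(num_char = c(1.2, 5.5, 10.1, 0.3, 7.8))

# Median of the toy values
med <- median(emails$num_char)

# Strict "<" means the email at exactly the median goes to the upper group
emails_cat <- emails %>%
  mutate(num_char_cat = ifelse(num_char < med,
                               "below median",
                               "at or above median"))

# Count emails in each category
emails_cat %>%
  count(num_char_cat)
```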

eg2. Create a new column in email50 called number_yn that is "no" if there is no number in the email and "yes" otherwise. Use case_when() for this. Assign the result to email50_fortified.

# Create number_yn column in email50
email50_fortified <- email50 %>%
  mutate(
    number_yn = case_when(
      # if number is "none", make number_yn "no"
      number == "none" ~ "no",
      # if number is not "none", make number_yn "yes"
      number != "none" ~ "yes"
    )
  )
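
A possible variation, sketched on a made-up data frame (emails is hypothetical): instead of spelling out the second condition, a final TRUE ~ branch acts as a catch-all. Note that this is not identical to the version above when there are missing values, since the catch-all would also label NA as "yes".

```r
library(dplyr)

# Hypothetical data with the same levels as email50$number
emails <- data.frame(
  number = factor(c("none", "small", "big", "none"),
                  levels = c("none", "small", "big"))
)

emails_yn <- emails %>%
  mutate(
    number_yn = case_when(
      # if number is "none", make number_yn "no"
      number == "none" ~ "no",
      # everything else falls through to "yes"
      TRUE ~ "yes"
    )
  )

emails_yn
```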

4. Visualizing numerical data

eg. Create a scatterplot of number of exclamation points (exclaim_mess) on the y-axis vs. number of characters (num_char) on the x-axis. (1) Color points by whether or not the email is spam. (2) Note that the spam variable is stored as numerical (0/1) but we want to use it as a categorical variable in this plot. To do this, force R to think of it as such with the factor() function.

# Load ggplot2
library(ggplot2)

# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()

Chapter 2

1. Observational studies and experiments

(1) Observational study:

Collect data in a way that does not directly interfere with how the data arise

Only association (correlation) can be inferred, not causation

(2) Experiment:

Randomly assign subjects to various treatments

Causation can be inferred

2. Random sampling and random assignment

(1) Random sampling:

At selection of subjects from population

Helps generalizability of results

(2) Random assignment:

At assignment of the selected subjects to treatment groups

Helps infer causation from results
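
The two steps can be sketched in base R as follows. The population, the sample size of 20, and the group labels are all assumptions made up for illustration.

```r
set.seed(42)

# Hypothetical population of 100 subjects
population <- paste0("subject_", 1:100)

# Random sampling: choose who takes part in the study
sampled <- sample(population, size = 20)

# Random assignment: shuffle 10 "treatment" and 10 "control" labels
group <- sample(rep(c("treatment", "control"), each = 10))

study <- data.frame(subject = sampled, group = group)

# Each group should contain 10 subjects
table(study$group)
```

Sampling decides who is in the study (generalizability); assignment decides which treatment each participant receives (causation).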

3. Simpson's paradox

eg1. Number of males and females admitted

Pass the gender and admission status columns to count() on the ucb_admit dataset (which is already pre-loaded) to count how many of each gender are admitted and how many are rejected.

# Load packages
library(dplyr)
glimpse(ucb_admit)

# Output
Rows: 4,526
Columns: 3
$ Admit  <fct> Admitted, Admitted, Admitted, Admitted, Admitted, Admitted, ...
$ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, ...
$ Dept   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...

# Count number of male and female applicants admitted
ucb_admit %>%
  count(Gender, Admit)

# Output
# A tibble: 4 x 3
  Gender Admit        n
  <fct>  <fct>    <int>
1 Male   Admitted  1198
2 Male   Rejected  1493
3 Female Admitted   557
4 Female Rejected  1278

eg2. Proportion of males and females admitted overall

The table of counts of gender and admission status you developed earlier is available as ucb_admission_counts.

ucb_admission_counts %>%
  # Group by gender
  group_by(Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for admitted
  filter(Admit == "Admitted")
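
To see the paradox itself, the comparison has to be repeated within departments. The sketch below is not the course code: it assumes the built-in UCBAdmissions table (base R datasets package) as a stand-in for ucb_admit, and the names ucb_counts, overall, and by_dept are made up here.

```r
library(dplyr)

# Built-in 3-way table -> data frame with columns Admit, Gender, Dept, Freq
ucb_counts <- as.data.frame(UCBAdmissions)

# Proportion admitted by gender, all departments pooled
overall <- ucb_counts %>%
  group_by(Gender) %>%
  summarize(prop_admitted = sum(Freq[Admit == "Admitted"]) / sum(Freq))

# Proportion admitted by gender within each department
by_dept <- ucb_counts %>%
  group_by(Dept, Gender) %>%
  summarize(prop_admitted = sum(Freq[Admit == "Admitted"]) / sum(Freq),
            .groups = "drop")

overall
by_dept
```

Pooled over departments, males are admitted at a higher rate, yet within some departments (department A, for example) the direction reverses. Department acts as the confounding variable.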

Chapter 3

1. Sampling strategies

(1) Simple random sample

(2) Stratified sample

(3) Cluster sample

(4) Multistage sample
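
Strategies (3) and (4) are not coded in these notes. A rough sketch, assuming each region is treated as a cluster, could look like this. regions_stand_in is a hypothetical stand-in built from base R's state.name and state.region; it has 50 rows and uses "North Central" rather than "Midwest", so it is not the same as the us_regions data shown in the next section.

```r
library(dplyr)
set.seed(123)

# Stand-in data: 50 states with their census region (base R datasets)
regions_stand_in <- data.frame(state = state.name, region = state.region)

# Cluster sample: randomly pick 2 whole regions and keep every state in them
chosen_regions <- sample(levels(regions_stand_in$region), size = 2)
states_cluster <- regions_stand_in %>%
  filter(region %in% chosen_regions)

# Multistage sample: within the chosen regions, randomly pick 3 states each
states_multistage <- states_cluster %>%
  group_by(region) %>%
  slice_sample(n = 3) %>%
  ungroup()

states_multistage %>%
  count(region)
```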

2. Sampling in R

us_regions
                  state    region
1           Connecticut Northeast
2                 Maine Northeast
3         Massachusetts Northeast
4         New Hampshire Northeast
5          Rhode Island Northeast
6               Vermont Northeast
7            New Jersey Northeast
8              New York Northeast
9          Pennsylvania Northeast
10             Illinois   Midwest
11              Indiana   Midwest
12             Michigan   Midwest
13                 Ohio   Midwest
14            Wisconsin   Midwest
15                 Iowa   Midwest
16               Kansas   Midwest
17            Minnesota   Midwest
18             Missouri   Midwest
19             Nebraska   Midwest
20         North Dakota   Midwest
21         South Dakota   Midwest
22             Delaware     South
23              Florida     South
24              Georgia     South
25             Maryland     South
26       North Carolina     South
27       South Carolina     South
28             Virginia     South
29 District of Columbia     South
30        West Virginia     South
31              Alabama     South
32             Kentucky     South
33          Mississippi     South
34            Tennessee     South
35             Arkansas     South
36            Louisiana     South
37             Oklahoma     South
38                Texas     South
39              Arizona      West
40             Colorado      West
41                Idaho      West
42              Montana      West
43               Nevada      West
44           New Mexico      West
45                 Utah      West
46              Wyoming      West
47               Alaska      West
48           California      West
49               Hawaii      West
50               Oregon      West
51           Washington      West

eg1. Simple random sample in R

The dplyr package and us_regions data frame have been loaded. (1) Use simple random sampling to select eight states from us_regions. Save this sample in a data frame called states_srs. (2) Count the number of states from each region in your sample.

# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(8)

# Count states by region
states_srs %>%
  count(region)

eg2. Stratified sample in R

(1) Use stratified sampling to select a total of 8 states, where each stratum is a region. Save this sample in a data frame called states_str. (Remember that there are 4 regions, each to be sampled equally!) (2) Count the number of states from each region in your sample to confirm that each region is represented equally in your sample.

# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)

# Count states by region
states_str %>%
  count(region)
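
In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). A sketch of both samples with the newer function follows, using a hypothetical stand-in for us_regions built from base R's state.name and state.region (50 states, with "North Central" in place of "Midwest").

```r
library(dplyr)
set.seed(2022)

# Stand-in for us_regions, built from base R datasets
regions_stand_in <- data.frame(state = state.name, region = state.region)

# Simple random sample of 8 states
states_srs <- regions_stand_in %>%
  slice_sample(n = 8)

# Stratified sample: 2 states from each of the 4 regions
states_str <- regions_stand_in %>%
  group_by(region) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Confirm that each region contributes exactly 2 states
states_str %>%
  count(region)
```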

3. Principles of experimental design

(1) Control: compare treatment of interest to a control group

(2) Randomize: randomly assign subjects to treatments

(3) Replicate: collect a sufficiently large sample within a study, or replicate the entire study

(4) Block: account for the potential effect of confounding variables (Group subjects into blocks based on these variables. Randomize within each block to treatment groups)

eg1. A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.

There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance).
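
Blocking in this study could be sketched as below. Everything concrete here is an assumption for illustration: 16 students, two levels each for light and noise, and the names students, conditions, and design.

```r
library(dplyr)
set.seed(2022)

# Hypothetical subjects: 8 female and 8 male students
students <- data.frame(
  id = 1:16,
  gender = rep(c("female", "male"), each = 8)
)

# Assumed treatment combinations (2 light levels x 2 noise levels)
conditions <- c("bright/quiet", "bright/noisy", "dim/quiet", "dim/noisy")

# Block on gender, then randomize treatments within each block
design <- students %>%
  group_by(gender) %>%
  mutate(condition = sample(rep(conditions, length.out = n()))) %>%
  ungroup()

# Every gender/condition combination should appear exactly twice
design %>%
  count(gender, condition)
```

Because the shuffle happens inside each gender group, both genders are represented equally under every condition, which is what the researcher asked for.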

eg2. Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.

eg3. In random sampling, we stratify to control for a variable. In random assignment, we block to achieve the same goal.