DataCamp: Introduction to Data in R

Chapter 1

1. Loading data and identifying variable types

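    data(): load a dataset that ships with R or an installed package into the workspace

    glimpse(): show each variable's name, type, and first few values (available via dplyr)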
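
A minimal sketch of this step, assuming the dplyr package is installed. It uses the built-in mtcars dataset as a stand-in, since the course datasets may not be available locally.

```r
# Load dplyr, which provides glimpse()
library(dplyr)

# Load the built-in mtcars dataset into the workspace
data(mtcars)

# One line per variable: name, type (e.g. <dbl>), and the first few values
glimpse(mtcars)
```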

2. Filtering based on a factor

    eg1. The following code filters the mtcars dataset for cars with 6 cylinders

mtcars %>%
  filter(cyl == 6)

    eg2. Create a new dataset called email50_big that is a subset of the original email50 dataset containing only emails with "big" numbers. This information is stored in the number variable.

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

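    eg3. droplevels(): remove unused levels from factor variables. After filtering, the levels "none" and "small" stay attached to the factor with a count of 0 until they are dropped.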

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

# Table of the number variable
table(email50_big$number)

# Output
none small   big 
   0     0     7 

# Drop levels
email50_big$number_dropped <- droplevels(email50_big$number)

# Table of the number variable
table(email50_big$number_dropped)

# Output
big 
  7
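
Why the empty levels show up can be sketched with a small hypothetical factor (the values below are made up for illustration, not taken from email50): subsetting a factor keeps its full set of levels until droplevels() is called.

```r
# Hypothetical factor with the same three levels as the number variable
number <- factor(
  c("none", "small", "big", "small", "big"),
  levels = c("none", "small", "big")
)

# Keep only the "big" values; the unused levels are still attached
big_only <- number[number == "big"]
levels(big_only)

# droplevels() removes levels that no longer occur in the data
big_dropped <- droplevels(big_only)
levels(big_dropped)
```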

3. Discretize a variable

    eg1. Create a categorical version of the num_char variable in the email50 dataset. num_char is the number of characters in an email, in thousands. The new variable has two levels ("below median" and "at or above median"), depending on whether an email has fewer characters than the median or at least as many.

# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)

# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
  
# Count emails in each category
email50_fortified %>%
  count(num_char_cat)
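
The same pattern can be tried without the course data. The sketch below is a stand-in under stated assumptions: it applies the median split to mpg in the built-in mtcars dataset, and the names med_mpg, mpg_cat and mtcars_fortified are made up for illustration.

```r
library(dplyr)

# Median of the variable to be discretized
med_mpg <- median(mtcars$mpg)

# Two-level categorical version of mpg
mtcars_fortified <- mtcars %>%
  mutate(mpg_cat = ifelse(mpg < med_mpg, "below median", "at or above median"))

# Count cars in each category
mtcars_fortified %>%
  count(mpg_cat)
```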

    eg2. Create a new column in email50 called number_yn that is "no" if there is no number in the email and "yes" otherwise. Use case_when() for this. Assign the result to email50_fortified.

# Create number_yn column in email50
email50_fortified <- email50 %>%
  mutate(
    number_yn = case_when(
      # if number is "none", make number_yn "no"
      number == "none" ~ "no",
      # if number is not "none", make number_yn "yes"
      number != "none" ~ "yes"
    )
  )
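
case_when() checks its conditions in order and returns NA for rows that match none of them, so a catch-all TRUE condition is often used as the last branch. A sketch with a small hypothetical data frame (emails_demo and its values are made up for illustration):

```r
library(dplyr)

# Hypothetical data with the same levels as the number variable
emails_demo <- data.frame(
  number = factor(c("none", "small", "big", "none"),
                  levels = c("none", "small", "big"))
)

emails_demo <- emails_demo %>%
  mutate(
    number_yn = case_when(
      # if number is "none", make number_yn "no"
      number == "none" ~ "no",
      # every row not matched above becomes "yes"
      TRUE ~ "yes"
    )
  )

emails_demo
```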

4. Visualizing numerical data

    eg. Create a scatterplot of number of exclamation points (exclaim_mess) on the y-axis vs. number of characters (num_char) on the x-axis. (1) Color points by whether or not the email is spam. (2) Note that the spam variable is stored as numerical (0/1) but we want to use it as a categorical variable in this plot. To do this, force R to think of it as such with the factor() function.

# Load ggplot2
library(ggplot2)

# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()
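
The effect of factor() on the color scale can be seen on any numeric variable. A sketch using the built-in mtcars dataset as a stand-in (assuming ggplot2 is installed): cyl is stored as a number, so mapping it directly gives a continuous color gradient, while factor(cyl) gives one discrete color per value.

```r
library(ggplot2)

# cyl mapped as a number: continuous color gradient
p_numeric <- ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point()

# cyl wrapped in factor(): one discrete color per level
p_factor <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Draw the second plot
p_factor
```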

Chapter 2

1. Observational studies and experiments

    (1) Observational study:

         Collect data in a way that does not directly interfere with how the data arise

         Only association (correlation) can be inferred, not causation

    (2) Experiment: 

         Randomly assign subjects to various treatments

         Causation can be inferred

2. Random sampling and random assignment

    (1) Random sampling:

         At selection of subjects from population

         Helps generalizability of results

    (2) Random assignment:

         At assignment of the selected subjects to treatments

         Helps infer causation from results
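
The two steps can be sketched in base R. Everything here is hypothetical: a population of 1000 subject IDs, a sample of 20, and two groups named treatment and control.

```r
set.seed(42)  # for reproducibility only

# Hypothetical population of subject IDs
population <- 1:1000

# Random sampling: choose which subjects enter the study
study_sample <- sample(population, size = 20)

# Random assignment: shuffle group labels across the sampled subjects
groups <- sample(rep(c("treatment", "control"), each = 10))

study <- data.frame(subject = study_sample, group = groups)
table(study$group)
```

Sampling decides who is in the study, which affects how far the results generalize. Assignment decides who gets which treatment, which is what supports causal conclusions.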

3. Simpson's paradox

    eg1. Number of males and females admitted

         Pass the gender and admission status columns to count() on the ucb_admit dataset (which is pre-loaded) to count how many applicants of each gender are admitted and how many are rejected.

# Load packages
library(dplyr)
glimpse(ucb_admit)

# Output
Rows: 4,526
Columns: 3
$ Admit  <fct> Admitted, Admitted, Admitted, Admitted, Admitted, Admitted, ...
$ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, ...
$ Dept   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...

# Count number of male and female applicants admitted
ucb_admit %>%
  count(Gender, Admit)

# Output
# A tibble: 4 x 3
  Gender Admit        n
  <fct>  <fct>    <int>
1 Male   Admitted  1198
2 Male   Rejected  1493
3 Female Admitted   557
4 Female Rejected  1278

    eg2. Proportion of males and females admitted overall

         The table of counts of gender and admission status from eg1 is available as ucb_admission_counts.

ucb_admission_counts %>%
  # Group by gender
  group_by(Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for admitted
  filter(Admit == "Admitted")
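
Simpson's paradox arises when a relationship seen in aggregated data reverses once a third variable is taken into account. The sketch below compares the overall admission rates with the rates within each department. It assumes that ucb_admit is derived from R's built-in UCBAdmissions table (the overall counts shown above match it); the names ucb_counts, overall and by_dept are made up for illustration.

```r
library(dplyr)

# Counts by admission status, gender and department
ucb_counts <- as.data.frame(UCBAdmissions) %>%
  rename(n = Freq)

# Overall proportion admitted, by gender
overall <- ucb_counts %>%
  group_by(Gender, Admit) %>%
  summarize(n = sum(n), .groups = "drop") %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  filter(Admit == "Admitted")

# Proportion admitted by gender within each department
by_dept <- ucb_counts %>%
  group_by(Dept, Gender) %>%
  mutate(prop = n / sum(n)) %>%
  filter(Admit == "Admitted") %>%
  arrange(Dept, Gender)

overall
by_dept
```

Overall, males have the higher admission rate, but within most departments the rate for females is higher. Department is the third variable that changes the picture.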

Chapter 3

1. Sampling strategies

    (1) Simple random sample

    (2) Stratified sample

    (3) Cluster sample

    (4) Multistage sample

2. Sampling in R

us_regions
                  state    region
1           Connecticut Northeast
2                 Maine Northeast
3         Massachusetts Northeast
4         New Hampshire Northeast
5          Rhode Island Northeast
6               Vermont Northeast
7            New Jersey Northeast
8              New York Northeast
9          Pennsylvania Northeast
10             Illinois   Midwest
11              Indiana   Midwest
12             Michigan   Midwest
13                 Ohio   Midwest
14            Wisconsin   Midwest
15                 Iowa   Midwest
16               Kansas   Midwest
17            Minnesota   Midwest
18             Missouri   Midwest
19             Nebraska   Midwest
20         North Dakota   Midwest
21         South Dakota   Midwest
22             Delaware     South
23              Florida     South
24              Georgia     South
25             Maryland     South
26       North Carolina     South
27       South Carolina     South
28             Virginia     South
29 District of Columbia     South
30        West Virginia     South
31              Alabama     South
32             Kentucky     South
33          Mississippi     South
34            Tennessee     South
35             Arkansas     South
36            Louisiana     South
37             Oklahoma     South
38                Texas     South
39              Arizona      West
40             Colorado      West
41                Idaho      West
42              Montana      West
43               Nevada      West
44           New Mexico      West
45                 Utah      West
46              Wyoming      West
47               Alaska      West
48           California      West
49               Hawaii      West
50               Oregon      West
51           Washington      West

    eg1. Simple random sample in R

         The dplyr package and us_regions data frame have been loaded. (1) Use simple random sampling to select eight states from us_regions. Save this sample in a data frame called states_srs. (2) Count the number of states from each region in your sample.

# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(8)

# Count states by region
states_srs %>%
  count(region)

    eg2. Stratified sample in R

         (1) Use stratified sampling to select a total of 8 states, where each stratum is a region. Save this sample in a data frame called states_str. (There are 4 regions and each is sampled equally, so take 2 states per region.) (2) Count the number of states from each region in your sample to confirm that each region is represented equally.

# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)

# Count states by region
states_str %>%
  count(region)
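
In current dplyr releases sample_n() is marked as superseded in favor of slice_sample(), which treats grouped data the same way. A sketch of the stratified sample with slice_sample(), assuming dplyr 1.0.0 or later. Since us_regions is only available in the course, the stand-in data frame states_demo is built from R's built-in state.name and state.region vectors (50 states, no District of Columbia, and the Midwest is labeled "North Central").

```r
library(dplyr)

# Stand-in for us_regions, built from base R's state data
states_demo <- data.frame(state = state.name, region = state.region)

set.seed(1)  # for reproducibility only

# Stratified sample: 2 states from each of the 4 regions
states_str_demo <- states_demo %>%
  group_by(region) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Count states by region
states_str_demo %>%
  count(region)
```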

3. Principles of experimental design

    (1) Control: compare treatment of interest to a control group

    (2) Randomize: randomly assign subjects to treatments

    (3) Replicate: collect a sufficiently large sample within a study, or replicate the entire study

    (4) Block: account for the potential effect of known or suspected confounding variables by grouping subjects into blocks based on those variables, then randomizing to treatment groups within each block

    eg1. A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.

         There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance).

    eg2. Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.

    eg3. In random sampling, we stratify to control for a variable. In random assignment, we block to achieve the same goal.
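
Blocking as described in eg1 can be sketched as randomizing within each gender. All names and sizes below are hypothetical: 16 students, 8 per gender, and four light/noise conditions.

```r
library(dplyr)

set.seed(7)  # for reproducibility only

# Hypothetical subjects: 8 per gender
students <- data.frame(
  id = 1:16,
  gender = rep(c("female", "male"), each = 8)
)

# The four treatment conditions (2 light levels x 2 noise levels)
conditions <- c("bright/quiet", "bright/noisy", "dim/quiet", "dim/noisy")

# Block by gender, then shuffle the conditions within each block
assigned <- students %>%
  group_by(gender) %>%
  mutate(condition = sample(rep(conditions, length.out = n()))) %>%
  ungroup()

# Each gender appears equally often under every condition
assigned %>%
  count(gender, condition)
```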
