R for Applied Epidemiology and Public Health - 7. Cleaning data - 2

掉头就走

Last modified: 2022-04-10 19:54:42


First published: 2022-04-10 17:24:49

Original source: https://epirhandbook.com/en/cleaning-data-and-core-functions.html#nomenclature-1

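I nearly lost it: the continuation of the previous post had to be rewritten twice because of a Ctrl+Z mishap, and twice more because it would not submit... so I am splitting it into parts.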

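Select or re-order columns

Use select() from dplyr to choose the columns you want to retain and to specify their order in the data frame.

NOTE: In the examples below, the linelist data frame is modified with select() and displayed, but not saved. This is for demonstration purposes. The modified column names are printed by piping the data frame to names().

Here are ALL the column names in the linelist at this point in the cleaning pipe chain: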

names(linelist)
##  [1] "case_id"              "generation"           "date_infection"       "date_onset"           "date_hospitalisation" "date_outcome"        
##  [7] "outcome"              "gender"               "hospital"             "lon"                  "lat"                  "infector"            
## [13] "source"               "age"                  "age_unit"             "row_num"              "wt_kg"                "ht_cm"               
## [19] "ct_blood"             "fever"                "chills"               "cough"                "aches"                "vomit"               
## [25] "temp"                 "time_admission"       "merged_header"        "x28"

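Keep columns

Select only the columns you want to keep: put their names in the select() command, without quotation marks. They will appear in the data frame in the order you provide. Note that if you include a column that does not exist, R will return an error (see any_of() below if you want no error in that situation).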

# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names()  # display the column names
## [1] "case_id"              "date_onset"           "date_hospitalisation" "fever"

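"tidyselect" helper functions

These helper functions exist to make it easy to specify columns to keep, discard, or transform. They come from the tidyselect package, which is included in the tidyverse and underlies how columns are selected in dplyr functions.

For example, if you want to re-order the columns, everything() is a useful function that means "all other columns not yet mentioned". The command below moves the columns date_onset and date_hospitalisation to the beginning (left) of the dataset, but keeps all the other columns afterward. Note that everything() is written with empty parentheses: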

# move date_onset and date_hospitalisation to beginning
linelist %>% 
  select(date_onset, date_hospitalisation, everything()) %>% 
  names()
##  [1] "date_onset"           "date_hospitalisation" "case_id"              "generation"           "date_infection"       "date_outcome"        
##  [7] "outcome"              "gender"               "hospital"             "lon"                  "lat"                  "infector"            
## [13] "source"               "age"                  "age_unit"             "row_num"              "wt_kg"                "ht_cm"               
## [19] "ct_blood"             "fever"                "chills"               "cough"                "aches"                "vomit"               
## [25] "temp"                 "time_admission"       "merged_header"        "x28"

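Here are other "tidyselect" helper functions that also work within dplyr functions such as select(), across(), and summarise():

everything() - all other columns not mentioned
last_col() - the last column
where() - applies a function to all columns and selects those for which it returns TRUE
contains() - columns whose name contains a character string
- example: select(contains("time"))
starts_with() - matches a specified prefix
- example: select(starts_with("date_"))
ends_with() - matches a specified suffix
- example: select(ends_with("_post"))
matches() - applies a regular expression (regex)
- example: select(matches("[pt]al"))
num_range() - a numerical range such as x01, x02, x03
any_of() - matches IF the column exists, but returns no error if it is not found
- example: select(any_of(c("date_onset", "date_death", "cardiac_arrest")))

In addition, use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.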
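As a rough illustration of how the helpers and operators above combine, here is a minimal sketch. The data frame demo and its columns are hypothetical stand-ins invented for this example; they are not the handbook's linelist.

```r
library(dplyr)

# Hypothetical toy data frame (NOT the handbook's linelist)
demo <- tibble(
  case_id    = c("a1", "b2"),
  date_onset = as.Date(c("2014-05-01", "2014-05-03")),
  date_hosp  = as.Date(c("2014-05-02", "2014-05-05")),
  wt_kg      = c(60, 72),
  ht_cm      = c(165, 180)
)

# | (OR): columns that start with "date_" or are numeric
cols_or <- demo %>%
  select(starts_with("date_") | where(is.numeric)) %>%
  names()

# & (AND) with ! (opposite): numeric columns that do NOT end with "_cm"
cols_and <- demo %>%
  select(where(is.numeric) & !ends_with("_cm")) %>%
  names()

# : selects a run of consecutive columns
cols_range <- demo %>%
  select(case_id:date_hosp) %>%
  names()
```

Note that Date columns are not matched by where(is.numeric), so the date columns in cols_or come only from starts_with().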

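Use where() to specify logical criteria for columns. If you provide a function inside where(), do not include the function's empty parentheses. The command below selects columns that are class Numeric.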

# select columns that are class Numeric
linelist %>% 
  select(where(is.numeric)) %>% 
  names()
## [1] "generation" "lon"        "lat"        "row_num"    "wt_kg"      "ht_cm"      "ct_blood"   "temp"

Use contains() to select only the columns whose name contains a specified character string. ends_with() and starts_with() provide more nuance.

# select columns containing certain characters
linelist %>% 
  select(contains("date")) %>% 
  names()
## [1] "date_infection"       "date_onset"           "date_hospitalisation" "date_outcome"

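The function matches() works similarly to contains() but can be given a regular expression (see the Characters and strings page), such as multiple strings separated by OR bars within the parentheses: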

# searched for multiple character matches
linelist %>% 
  select(matches("onset|hosp|fev")) %>%   # note the OR symbol "|"
  names()
## [1] "date_onset"           "date_hospitalisation" "hospital"             "fever"

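CAUTION: If a column name that you specifically provide does not exist in the data, select() returns an error and stops your code. Consider using any_of() to cite columns that may or may not exist; this is especially useful in negative (remove) selections.

Only one of the columns below exists, but no error is produced and the code continues without stopping your cleaning chain.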

linelist %>% 
  select(any_of(c("date_onset", "village_origin", "village_detection", "village_residence", "village_travel"))) %>% 
  names()
## [1] "date_onset"
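The negative (remove) case mentioned above can be sketched as follows. This is only an illustration on a hypothetical toy data frame; the column merged_header is deliberately absent to show that no error is raised.

```r
library(dplyr)

# Hypothetical toy data frame; "merged_header" does not exist in it
demo <- tibble(
  case_id = c("a1", "b2"),
  fever   = c("yes", "no"),
  x28     = c(NA, NA)
)

# Drop the listed columns if present; absent names are silently ignored
kept <- demo %>%
  select(-any_of(c("x28", "merged_header"))) %>%
  names()
```

Written as select(-c(x28, merged_header)) instead, the same call would stop with an error because merged_header is missing.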

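Remove columns

Indicate which columns to remove by placing a minus symbol "-" in front of the column name (e.g. select(-outcome)) or in front of a vector of column names (as below). All other columns will be retained.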

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove date_onset and all columns from fever to vomit
  names()
##  [1] "case_id"              "generation"           "date_infection"       "date_hospitalisation" "date_outcome"         "outcome"             
##  [7] "gender"               "hospital"             "lon"                  "lat"                  "infector"             "source"              
## [13] "age"                  "age_unit"             "row_num"              "wt_kg"                "ht_cm"                "ct_blood"            
## [19] "temp"                 "time_admission"       "merged_header"        "x28"

You can also remove a column using base R syntax, by defining it as NULL. For example:

linelist$date_onset <- NULL   # deletes column with base R syntax

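De-duplication

See the handbook page on De-duplication for extensive options on how to de-duplicate data. Only a very simple row de-duplication example is presented here.

The package dplyr offers the distinct() function. This function examines every row and reduces the data frame to only the unique rows. That is, it removes rows that are 100% duplicates.

When evaluating duplicate rows, it takes into account a range of columns; by default it considers all columns. As shown on the De-duplication page, you can adjust this column range so that the uniqueness of rows is evaluated only with respect to certain columns.

In this simple example, we just add the empty command distinct() to the pipe chain. This ensures there are no rows that are 100% duplicates of other rows (evaluated across all columns).

To see how many rows were dropped, compare nrow(linelist) before and after de-duplication. Any removed rows were 100% duplicates of other rows.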

linelist <- linelist %>% 
  distinct()
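To make the before/after row counts concrete, here is a small sketch on a hypothetical data frame (the values are invented for illustration). It also shows one way to judge uniqueness on a single column, using the .keep_all argument of distinct().

```r
library(dplyr)

# Hypothetical toy data with one fully duplicated row ("b2")
demo <- tibble(
  case_id = c("a1", "b2", "b2", "c3"),
  outcome = c("Recover", "Death", "Death", "Death")
)

n_before   <- nrow(demo)               # rows before de-duplication
demo_dedup <- demo %>% distinct()      # uniqueness judged across ALL columns
n_after    <- nrow(demo_dedup)         # rows after de-duplication

# Uniqueness judged on 'outcome' only; the first row per value is kept,
# and .keep_all = TRUE retains the other columns
by_outcome <- demo %>% distinct(outcome, .keep_all = TRUE)
```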

Below, the distinct() command is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    #####################################################
    
    # de-duplicate
    distinct()

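Column creation and transformation

We recommend using the dplyr function mutate() to add a new column or to modify an existing one.

Below is an example of creating a new column with mutate(). The syntax is: mutate(new_column_name = value or transformation)

In Stata, this is similar to the command generate, but R's mutate() can also be used to modify an existing column.

New columns

The most basic mutate() command to create a new column might look like this. It creates a new column new_col where the value in every row is 10.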

linelist <- linelist %>% 
  mutate(new_col = 10)

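You can also reference values in other columns to perform calculations. Below, a new column bmi is created to hold the Body Mass Index (BMI) for each case, calculated with the formula BMI = kg/m^2 from the columns ht_cm and wt_kg.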

linelist <- linelist %>% 
  mutate(bmi = wt_kg / (ht_cm/100)^2)

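If creating multiple new columns, separate each with a comma and a new line. Below are examples of new columns, including one that consists of values from other columns combined using str_glue() from the stringr package.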

new_col_demo <- linelist %>%                       
  mutate(
    new_var_dup    = case_id,             # new column = duplicate/copy another existing column
    new_var_static = 7,                   # new column = all values the same
    new_var_static = new_var_static + 5,  # you can overwrite a column, and it can be a calculation using other variables
    new_var_paste  = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
    ) %>% 
  select(case_id, hospital, date_hospitalisation, contains("new"))        # show only new columns, for demonstration purposes

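TIP: A variation on mutate() is the function transmute(). This function adds new columns just like mutate(), but also drops all other columns that you do not mention within its parentheses.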
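A minimal sketch of the difference, on a hypothetical toy data frame: only the columns named inside transmute() survive.

```r
library(dplyr)

# Hypothetical toy data frame
demo <- tibble(
  case_id = c("a1", "b2"),
  wt_kg   = c(60, 72),
  ht_cm   = c(165, 180)
)

# Keep case_id, create bmi, and drop wt_kg and ht_cm in one step
bmi_only <- demo %>%
  transmute(case_id,
            bmi = wt_kg / (ht_cm / 100)^2)

names(bmi_only)   # only the columns mentioned in transmute()
```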

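Convert column class

Columns containing values that are dates, numbers, or logical values (TRUE/FALSE) will only behave as expected if they are correctly classified. There is a difference between "2" of class character and 2 of class numeric!

There are ways to set column class during the import commands, but this is often cumbersome. See the R Basics section on object classes to learn more about converting the class of objects and columns.

First, let's run some checks on important columns to see if they are the correct class. We also saw this at the beginning when we ran skim().

Currently, the class of the age column is character. To perform quantitative analyses, we need these numbers to be recognized as numeric! To resolve this, use the ability of mutate() to re-define a column with a transformation. We define the column as itself, but converted to a different class. Here is a basic example, converting or ensuring that the column age is class Numeric: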

linelist <- linelist %>% 
  mutate(age = as.numeric(age))

In a similar way, you can use as.character() and as.logical(). To convert to class Factor, you can use factor() from base R or as_factor() from forcats. Read more about this in the Factors page.
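One thing worth checking after as.numeric() is which values could not be converted, because they silently become NA (with a warning). The sketch below uses an invented character vector, not the real age column, and base R only.

```r
# Hypothetical raw values, as they might arrive in a character column
age_raw <- c("12", "7", "2 months", NA)

# Non-numeric strings are coerced to NA; the warning is suppressed here
# only to keep the example quiet
age_num <- suppressWarnings(as.numeric(age_raw))

# Values that were present in the raw data but failed to convert
failed <- age_raw[is.na(age_num) & !is.na(age_raw)]
failed
```

Inspecting failed before overwriting the column shows which entries need manual cleaning first.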

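You must be careful when converting to class Date. Several methods are explained on the page Working with dates. Typically, the raw date values must all be in the same format for conversion to work correctly (e.g. "MM/DD/YYYY" or "DD MM YYYY"). After converting to class Date, check your data to confirm that each value was converted correctly.

Grouped data

If your data frame is already grouped (see the page on Grouping data), mutate() may behave differently than if the data frame is not grouped. Any summarizing functions, like mean(), median(), max(), etc. will calculate by group, not by all the rows.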

# age normalized to mean of ALL rows
linelist %>% 
  mutate(age_norm = age / mean(age, na.rm=T))

# age normalized to mean of hospital group
linelist %>% 
  group_by(hospital) %>% 
  mutate(age_norm = age / mean(age, na.rm=T))
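To see the difference in numbers, here is the same pair of commands on a hypothetical four-row data frame, with ungroup() added at the end so that later steps are not affected by the grouping.

```r
library(dplyr)

# Hypothetical toy data: two hospitals, two patients each
demo <- tibble(
  hospital = c("A", "A", "B", "B"),
  age      = c(10, 30, 40, 60)
)

# Mean over ALL rows is 35
overall <- demo %>%
  mutate(age_norm = age / mean(age, na.rm = TRUE))

# Mean is computed within each hospital (20 for A, 50 for B)
by_hosp <- demo %>%
  group_by(hospital) %>%
  mutate(age_norm = age / mean(age, na.rm = TRUE)) %>%
  ungroup()
```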

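Transform multiple columns

Often, to write concise code, you want to apply the same transformation to multiple columns at once. A transformation can be applied to multiple columns at once using the across() function from the package dplyr (also contained within the tidyverse package). across() is most commonly used within mutate() and summarise(); filter() has the related helpers if_any() and if_all(). See how it is applied to summarise() in the page on Descriptive tables.

Specify the columns to the argument .cols = and the function(s) to apply to .fns =. Any additional arguments to provide to the .fns function can be included after a comma, still within across().

`across()` column selection

Specify the columns to the argument .cols =. You can name them individually, or use "tidyselect" helper functions. Specify the function to .fns =. Note that using the function mode demonstrated below, the function is written without its parentheses ( ).

Here the transformation as.character() is applied to specific columns named within across().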

linelist <- linelist %>% 
  mutate(across(.cols = c(temp, ht_cm, wt_kg), .fns = as.character))

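The "tidyselect" helper functions are available to help you specify columns. They are detailed above in the section on selecting and re-ordering columns, and they include: everything(), last_col(), where(), starts_with(), ends_with(), contains(), matches(), num_range(), and any_of().

Here is an example of converting to class character every column whose name contains the string "date" (to change all columns instead, use .cols = everything()):

# change to character all columns whose name contains "date"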
linelist <- linelist %>% 
  mutate(across(.cols = contains("date"), .fns = as.character))

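Below is an example of mutating the columns that are currently class POSIXct (a raw datetime class that shows timestamps); in other words, those for which the function is.POSIXct() evaluates to TRUE. We then apply the function as.Date() to these columns to convert them to a normal class Date.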

linelist <- linelist %>% 
  mutate(across(.cols = where(is.POSIXct), .fns = as.Date))

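Note that within across() we also use the function where(), because is.POSIXct evaluates to either TRUE or FALSE.
Note that is.POSIXct() is from the package lubridate. Other similar "is" functions such as is.character(), is.numeric(), and is.logical() are from base R.

`across()` functions

You can read the documentation with ?across for details on how to provide functions to across(). A few summary points: there are several ways to specify the function(s) to perform on a column, and you can even define your own functions:

You can provide the function name alone (e.g. mean or as.character)
You can provide the function in purrr style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page)
You can specify multiple functions by providing a list (e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))))
- If you provide multiple functions, multiple transformed columns are returned per input column, with unique names in the format col_fn. You can adjust how the new columns are named with the .names = argument using glue syntax (see the page on Characters and strings), where {.col} and {.fn} are shorthand for the input column and function.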
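The list form and the .names = argument can be sketched as below. The data frame is hypothetical, and the naming pattern "{.fn}_{.col}" is just one possible choice made for this example.

```r
library(dplyr)

# Hypothetical toy data with some missing values
demo <- tibble(
  wt_kg = c(60, NA, 72),
  ht_cm = c(165, 180, NA)
)

summary_tbl <- demo %>%
  summarise(across(
    .cols  = c(wt_kg, ht_cm),
    .fns   = list(mean   = ~ mean(.x, na.rm = TRUE),  # purrr-style function
                  n_miss = ~ sum(is.na(.x))),         # count of missing values
    .names = "{.fn}_{.col}"                           # e.g. mean_wt_kg
  ))

names(summary_tbl)
```

Each input column yields one output column per function, so two columns and two functions give four summary columns.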

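`coalesce()`

This dplyr function finds the first non-missing value at each position. It "fills in" missing values with the first available value, in the order you specify.

Here is an example outside the context of a data frame: suppose you have two vectors, one containing the patient's village of detection and another containing the patient's village of residence. You can use coalesce() to pick the first non-missing value for each index: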

village_detection <- c("a", "b", NA,  NA)
village_residence <- c("a", "c", "a", "d")

village <- coalesce(village_detection, village_residence)
village    # print
## [1] "a" "b" "a" "d"

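This works the same if you provide data frame columns: for each row, the function assigns the new column value as the first non-missing value in the columns you provided (in the order provided).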

linelist <- linelist %>% 
  mutate(village = coalesce(village_detection, village_residence))

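Cumulative math

If you want a column to reflect the cumulative sum/mean/min/max etc. as assessed down the rows of a data frame to that point, use a cumulative function.

cumsum() returns the cumulative sum, as shown below: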

sum(c(2,4,15,10))     # returns only one number
## [1] 31

cumsum(c(2,4,15,10))  # returns the cumulative sum at each step
## [1]  2  6 21 31

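This can be used in a data frame when making a new column. For example, to calculate the cumulative number of cases per day in an outbreak, consider code like this: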

cumulative_case_counts <- linelist %>%  # begin with case linelist
  count(date_onset) %>%                 # count of rows per day, as column 'n'   
  mutate(cumulative_cases = cumsum(n))  # new column, of the cumulative sum at each row

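Using base R

To define a new column (or re-define a column) using base R, write the name of the data frame, connected with $, to the new column (or the column to be modified). Use the assignment operator <- to define the new value(s). Remember that when using base R you must specify the data frame name before the column name every time (e.g. dataframe$column). Here is an example of creating the bmi column using base R:

linelist$bmi <- linelist$wt_kg / (linelist$ht_cm / 100)^2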

Add to pipe chain

Below, a new column is added to the pipe chain and some classes are converted.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    # add new column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>% 
  
    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age))

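Re-code values

Here are a few scenarios where you need to re-code (change) values:

to edit one specific value (e.g. one date with an incorrect year or format)
to reconcile values not spelled the same
to create a new column of categorical values
to create a new column of numeric categories (e.g. age categories)

Specific values

To change values manually you can use the recode() function within the mutate() function.

Imagine there is a nonsensical date in the data (e.g. "2014-14-15"): you could fix the date manually in the raw source data, or you could write the change into the cleaning pipeline via mutate() and recode(). The latter is more transparent and reproducible to anyone else seeking to understand or repeat your analysis.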

# fix incorrect values                   # old value       # new value
linelist <- linelist %>% 
  mutate(date_onset = recode(date_onset, "2014-14-15" = "2014-04-15"))

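The mutate() line above can be read as: "mutate the column date_onset to equal the column date_onset re-coded so that OLD VALUE is changed to NEW VALUE". Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old).

Here is another example re-coding multiple values within one column.

In linelist the values in the column "hospital" must be cleaned. There are several different spellings and many missing values. The recode() command below re-defines the column "hospital" as the current column "hospital", but with the specified recode changes. Don't forget the comma after each!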

table(linelist$hospital, useNA = "always")  # print table of all unique values, including missing  
## 
##                      Central Hopital                     Central Hospital                           Hospital A                           Hospital B 
##                                   11                                  457                                  290                                  289 
##                     Military Hopital                    Military Hospital                     Mitylira Hopital                    Mitylira Hospital 
##                                   32                                  798                                    1                                   79 
##                                Other                         Port Hopital                        Port Hospital St. Mark's Maternity Hospital (SMMH) 
##                                  907                                   48                                 1756                                  417 
##   St. Marks Maternity Hopital (SMMH)                                 <NA> 
##                                   11                                 1512
linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                     # for reference: OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      ))

Now we see that the spellings in the hospital column have been corrected and consolidated:

table(linelist$hospital, useNA = "always")
## 
##                     Central Hospital                           Hospital A                           Hospital B                    Military Hospital 
##                                  468                                  290                                  289                                  910 
##                                Other                        Port Hospital St. Mark's Maternity Hospital (SMMH)                                 <NA> 
##                                  907                                 1804                                  428                                 1512

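TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R's value for missing, NA). You can reference this value with two quotation marks with no space in between ("").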
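As one possible way to act on blank strings, the sketch below converts "" to a true NA with dplyr's na_if(). The data frame is hypothetical, and using na_if() here is a choice made for this illustration rather than something the tip above prescribes.

```r
library(dplyr)

# Hypothetical toy data containing a blank string
demo <- tibble(hospital = c("Port Hospital", "", "Other"))

# Turn the blank value "" into a real missing value
demo <- demo %>%
  mutate(hospital = na_if(hospital, ""))

sum(is.na(demo$hospital))   # number of values now recognized as missing
```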

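By logic

Below we demonstrate how to re-code the values in a column using logic and conditions:

Using replace(), ifelse() and if_else() for simple logic
Using case_when() for more complex logic

Simple logic

`replace()`

To re-code with simple logical criteria, you can use replace() within mutate(). replace() is a function from base R. Use a logic condition to specify the rows to change. The general syntax is:

mutate(col_to_change = replace(col_to_change, criteria for rows, new value)).

One common situation for replace() is changing just one value in one row, using a unique row identifier. Below, the gender is changed to "Female" in the row where the column case_id is "2195".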

# Example: change gender of one specific observation to "Female" 
linelist <- linelist %>% 
  mutate(gender = replace(gender, case_id == "2195", "Female"))

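The equivalent command using base R syntax and indexing brackets [ ] is below. It reads as "Change the value of the data frame linelist's column gender (for the rows where linelist's column case_id has the value '2195') to 'Female'".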

linelist$gender[linelist$case_id == "2195"] <- "Female"

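`ifelse()` and `if_else()`

Another tool for simple logic is ifelse() and its partner if_else(). However, in most cases for re-coding it is clearer to use case_when() (detailed below). These "if else" commands are simplified versions of an if and else programming statement. The general syntax is:
ifelse(condition, value to return if condition evaluates to TRUE, value to return if condition evaluates to FALSE)

Below, the column source_known is defined. Its value in a given row is set to "known" if the row's value in column source is not missing. If the value in source is missing, then the value in source_known is set to "unknown".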

linelist <- linelist %>% 
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))

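if_else() is the dplyr version, which preserves dates (base ifelse() drops the Date class and returns plain numbers) and is stricter about types. If the 'true' value is a date, the 'false' value must also qualify as a date, hence the date-typed missing value as.Date(NA) rather than a missing value of another class.

# Create a date of death column, which is NA if patient has not died.
linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, as.Date(NA)))

Avoid stringing together many ifelse() commands... use case_when() instead! case_when() is much easier to read and you will make fewer errors.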

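Complex logic

Use dplyr's case_when() if you are re-coding into many new groups, or if you need to use complex logic statements to re-code values. This function evaluates every row in the data frame, assesses whether the row meets specified criteria, and assigns the correct new value.

case_when() commands consist of statements that have a Right-Hand Side (RHS) and a Left-Hand Side (LHS) separated by a "tilde" ~. The logic criteria are on the left side and the resulting values are on the right side of each statement. Statements are separated by commas.

For example, here we utilize the columns age and age_unit to create a column age_years: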

linelist <- linelist %>% 
  mutate(age_years = case_when(
            age_unit == "years"  ~ age,       # if age is given in years
            age_unit == "months" ~ age/12,    # if age is given in months
            is.na(age_unit)      ~ age,       # if age unit is missing, assume years
            TRUE                 ~ NA_real_)) # any other circumstance, assign missing

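As each row in the data is evaluated, the criteria are applied/evaluated in the order the case_when() statements are written - from top to bottom. If the top criteria evaluates to TRUE for a given row, the RHS value is assigned, and the remaining criteria are not even tested for that row. Thus, it is best to write the most specific criteria first, and the most general last.

Along those lines, in your final statement, place TRUE on the left side, which will capture any row that did not meet any of the previous criteria. The right side of this statement could be assigned a value like "check me!" or missing.

DANGER: Values on the right side must all be the same class - either numeric, character, date, logical, etc. To assign missing (NA), you may need to use special variations of NA such as NA_character_, NA_real_ (for numeric or POSIX), and as.Date(NA). Read more in Working with dates.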
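The top-to-bottom evaluation and the typed NA can be illustrated with a small sketch. The ages and group labels below are invented for this example.

```r
library(dplyr)

# Hypothetical toy data, including a missing age
demo <- tibble(age = c(2, 18, 70, NA))

demo <- demo %>%
  mutate(age_group = case_when(
    age < 5   ~ "under 5",       # most specific criterion first
    age < 65  ~ "5-64",          # only reached by rows that failed the test above
    age >= 65 ~ "65+",
    TRUE      ~ NA_character_))  # catch-all; NA typed to match the character RHS

demo$age_group
```

If the first two statements were swapped, the age of 2 would be labelled "5-64", because the first matching statement wins.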

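Missing values

Below are special functions for handling missing values in the context of data cleaning.

See the page on Missing data for more detailed tips on identifying and handling missing values. For example, the is.na() function logically tests for missingness.

`replace_na()`

To change missing values (NA) to a specific value, such as "Missing", use the function replace_na() from the tidyr package within mutate(). Note that this is used in the same manner as recode() above - the name of the variable must be repeated within replace_na().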

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

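`fct_explicit_na()`

This is a function from the forcats package. The forcats package handles columns of class Factor. Factors are R's way to handle ordered values such as c("First", "Second", "Third") or to set the order in which values (e.g. hospitals) appear in tables and plots. See the page on Factors.

If your data are class Factor and you try to convert NA to "Missing" by using replace_na(), you will get this error: invalid factor level, NA generated. You have tried to add "Missing" as a value when it was not defined as a possible level of the factor, and it was rejected.

The easiest way to solve this is to use the forcats function fct_explicit_na(), which converts a column to class factor and converts NA values to the character "(Missing)". In forcats 1.0.0 and later this function is deprecated in favor of fct_na_value_to_level().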

linelist %>% 
  mutate(hospital = fct_explicit_na(hospital))

A slower alternative would be to add the factor level using fct_expand() and then convert the missing values.
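That two-step alternative might look like the sketch below, on a hypothetical factor column. Base R's replace() is used for the second step here as an assumption of this example; other approaches are possible.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: a factor column with one missing value
demo <- tibble(hospital = factor(c("Port Hospital", NA, "Other")))

demo <- demo %>%
  mutate(
    hospital = fct_expand(hospital, "Missing"),               # step 1: add the level
    hospital = replace(hospital, is.na(hospital), "Missing")  # step 2: fill the NAs
  )

levels(demo$hospital)   # "Missing" is appended as the last level
```

Because "Missing" is a defined level by the time the values are filled, no "invalid factor level" warning is raised.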

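`na_if()`

To convert a specific value to NA, use dplyr's na_if(). The command below performs the opposite operation of replace_na(). In the example below, any values of "Missing" in the column hospital are converted to NA.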

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

Note: na_if() cannot be used for logic criteria (e.g. "all values > 99") - use replace() or case_when() for this:

# Convert temperatures above 40 to NA 
linelist <- linelist %>% 
  mutate(temp = replace(temp, temp > 40, NA))

# Convert onset dates earlier than 1 Jan 2000 to missing
linelist <- linelist %>% 
  mutate(date_onset = replace(date_onset, date_onset < as.Date("2000-01-01"), NA))

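Cleaning dictionary

Use the R package linelist and its function clean_variable_spelling() to clean a data frame with a cleaning dictionary. linelist is a package developed by RECON - the R Epidemics Consortium.

1. Create a cleaning dictionary with 3 columns:

A "from" column (the incorrect value)
A "to" column (the correct value)
A column specifying the column for the changes to be applied (or ".global" to apply to all columns)

Note: .global dictionary entries will be overridden by column-specific dictionary entries.

2. Import the dictionary file into R. This example can be downloaded via the instructions on the Download handbook and data page.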

cleaning_dict <- import("cleaning_dict.csv")

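3. Pass the raw linelist to clean_variable_spelling(), specifying the cleaning dictionary data frame to wordlists =. The spelling_vars = argument can be used to specify which column in the dictionary refers to the columns (3rd by default), or can be set to NULL to have the dictionary apply to all character and factor columns. Note that this function can take a long time to run.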

linelist <- linelist %>% 
  linelist::clean_variable_spelling(
    wordlists = cleaning_dict,
    spelling_vars = "col"         # dict column containing column names, defaults to 3rd column in dict
  )

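Note that the column names in the cleaning dictionary must correspond to the names at this point in your cleaning script. See this online reference for the linelist package for more details.

Add to pipe chain

Below, some new columns and column transformations are added to the pipe chain.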

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 
  
    # add column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # add column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
   # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
   ###################################################

    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_))

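Numeric categories

Here we describe some special approaches for creating categories from numerical columns. Common examples include age categories, groups of lab values, etc. Here we will discuss:

age_categories(), from the epikit package
cut(), from base R
case_when()
quantile breaks with quantile() and ntile()

Review distribution

For this example we will create an age_cat column using the age_years column. First, examine the distribution of your data, to make appropriate cut-points.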

# examine the distribution
hist(linelist$age_years)

summary(linelist$age_years, na.rm=T)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   16.04   23.00   84.00     107

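CAUTION: Sometimes, numeric variables will import as class "character". This occurs if there are non-numeric characters in some of the values, for example an entry of "2 months" for age, or (depending on your R locale settings) if a comma is used in the decimals place (e.g. "4,5" to mean four and one half years).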
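For the decimal-comma case, one simple repair is to swap the comma for a period before converting. The sketch below uses base R on an invented vector and assumes that every comma in the data is a decimal mark (not a thousands separator).

```r
# Hypothetical character values using a comma as the decimal mark
age_chr <- c("4,5", "12", "0,25")

# Replace the literal comma with a period, then convert to numeric
age_num <- as.numeric(gsub(",", ".", age_chr, fixed = TRUE))
age_num
```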

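`age_categories()`

With the epikit package, you can use the age_categories() function to easily categorize and label numeric columns (note: this function can be applied to non-age numeric variables too). As a bonus, the output column is automatically an ordered factor.

Here are the required inputs:

A numeric vector (column)
The breakers = argument - provide a numeric vector of break points for the new groups

First, the simplest example: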

# Simple example
################
pacman::p_load(epikit)                    # load package

linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(             # create new column
      age_years,                            # numeric column to make groups from
      breakers = c(0, 5, 10, 15, 20,        # break points
                   30, 40, 50, 60, 70)))

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-69   70+  <NA> 
##  1227  1223  1048   827  1216   597   251    78    27     7   107

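The break values you specify are by default the lower bounds - that is, they are included in the "higher" group, so each group includes its lower break value and excludes the next break value. As shown below, you can add 1 to each break value to obtain groups that instead include the upper value (0-5, 6-10, and so on).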

# Include upper ends for the same categories
############################################
linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years, 
      breakers = c(0, 6, 11, 16, 21, 31, 41, 51, 61, 71)))

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-5  6-10 11-15 16-20 21-30 31-40 41-50 51-60 61-70   71+  <NA> 
##  1469  1195  1040   770  1149   547   231    70    24     6   107

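You can adjust how the labels are displayed with separator =. The default is "-".

You can adjust how the top numbers are handled with the ceiling = argument. To set an upper cut-off, set ceiling = TRUE. In this use, the highest break value provided is a "ceiling" and a category "XX+" is not created. Any values above the highest break value (or above upper =, if defined) are categorized as NA. Below is an example with ceiling = TRUE, so that there is no category of XX+ and values above 70 (the highest break value) are assigned as NA.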

# With ceiling set to TRUE
##########################
linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years, 
      breakers = c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70),
      ceiling = TRUE)) # 70 is ceiling, all above become NA

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-70  <NA> 
##  1227  1223  1048   827  1216   597   251    78    28   113

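Alternatively, instead of breakers =, you can provide all of lower =, upper =, and by =:

lower = the lowest number you want considered - default is 0
upper = the highest number you want considered
by = the number of years between groups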

linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years, 
      lower = 0,
      upper = 100,
      by = 10))

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99  100+  <NA> 
##  2450  1875  1216   597   251    78    27     6     1     0     0   107

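`cut()`

cut() is a base R alternative to age_categories(), but I think you will see why age_categories() was developed to simplify this process. Some notable differences from age_categories() are:

You do not need to install/load another package
You can specify whether groups are open/closed on the right/left
You must provide accurate labels yourself
If you want 0 included in the lowest group you must specify this

The basic syntax within cut() is to first provide the numeric column to be cut (age_years), and then the breaks argument, which is a numeric vector c() of break points. Using cut(), the resulting column is a factor (unordered unless you set ordered_result = TRUE).

By default, the categorization occurs so that each group includes its right/upper break value and excludes its left/lower break value. This is the opposite behavior from the age_categories() function. The default labels use the notation "(A, B]", which means A is not included but B is. Reverse this behavior by providing the right = FALSE argument.

Thus, by default, "0" values are excluded from the lowest group, and categorized as NA! "0" values could be infants coded as age 0, so be careful! To change this, add the argument include.lowest = TRUE so that any "0" values will be included in the lowest group. The automatically generated label for the lowest category will then be "[A, B]". Note that if you include include.lowest = TRUE together with right = FALSE, the extreme inclusion applies to the highest break point value and category, not the lowest.

You can provide a vector of customized labels using the labels = argument. As these are manually written, be very careful to ensure they are accurate! Check your work using cross-tabulation, as described below.

An example of cut() applied to age_years to make the new variable age_cat is below: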

# Create new variable, by cutting the numeric age variable
# lower break is excluded but upper break is included in each category
linelist <- linelist %>% 
  mutate(
    age_cat = cut(
      age_years,
      breaks = c(0, 5, 10, 15, 20,
                 30, 50, 70, 100),
      include.lowest = TRUE         # include 0 in lowest group
      ))

# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
## 
##    [0,5]   (5,10]  (10,15]  (15,20]  (20,30]  (30,50]  (50,70] (70,100]     <NA> 
##     1469     1195     1040      770     1149      778       94        6      107
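For comparison, here is a sketch of cut() with right = FALSE and hand-written labels, which gives groups that include their lower bound, as age_categories() does. The ages, break points and labels are invented for this example, and the top label "20+" is only accurate while no value exceeds the highest break (100); larger values would become NA.

```r
# Hypothetical ages, including boundary values and a missing value
age_years <- c(0, 4, 5, 19, 20, 70, NA)

age_cat <- cut(
  age_years,
  breaks         = c(0, 5, 20, 100),
  right          = FALSE,   # groups are [0,5), [5,20), [20,100]
  include.lowest = TRUE,    # with right = FALSE this closes the HIGHEST group at 100
  labels         = c("0-4", "5-19", "20+"))

# cross-check the boundary values against their assigned categories
table(age_years, age_cat, useNA = "always")
```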

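Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine the assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 16-20).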

# Cross tabulation of the numeric and category columns. 
table("Numeric Values" = linelist$age_years,   # names specified in table for clarity.
      "Categories"     = linelist$age_cat,
      useNA = "always")                        # don't forget to examine NA values
##                     Categories
## Numeric Values       [0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,100] <NA>
##   0                    136      0       0       0       0       0       0        0    0
##   0.0833333333333333     1      0       0       0       0       0       0        0    0
##   0.25                   2      0       0       0       0       0       0        0    0
##   0.333333333333333      6      0       0       0       0       0       0        0    0
##   0.416666666666667      1      0       0       0       0       0       0        0    0
##   0.5                    6      0       0       0       0       0       0        0    0
##   0.583333333333333      3      0       0       0       0       0       0        0    0
##   0.666666666666667      3      0       0       0       0       0       0        0    0
##   0.75                   3      0       0       0       0       0       0        0    0
##   0.833333333333333      1      0       0       0       0       0       0        0    0
##   0.916666666666667      1      0       0       0       0       0       0        0    0
##   1                    275      0       0       0       0       0       0        0    0
##   1.5                    2      0       0       0       0       0       0        0    0
##   2                    308      0       0       0       0       0       0        0    0
##   3                    246      0       0       0       0       0       0        0    0
##   4                    233      0       0       0       0       0       0        0    0
##   5                    242      0       0       0       0       0       0        0    0
##   6                      0    241       0       0       0       0       0        0    0
##   7                      0    256       0       0       0       0       0        0    0
##   8                      0    239       0       0       0       0       0        0    0
##   9                      0    245       0       0       0       0       0        0    0
##   10                     0    214       0       0       0       0       0        0    0
##   11                     0      0     220       0       0       0       0        0    0
##   12                     0      0     224       0       0       0       0        0    0
##   13                     0      0     191       0       0       0       0        0    0
##   14                     0      0     199       0       0       0       0        0    0
##   15                     0      0     206       0       0       0       0        0    0
##   16                     0      0       0     186       0       0       0        0    0
##   17                     0      0       0     164       0       0       0        0    0
##   18                     0      0       0     141       0       0       0        0    0
##   19                     0      0       0     130       0       0       0        0    0
##   20                     0      0       0     149       0       0       0        0    0
##   21                     0      0       0       0     158       0       0        0    0
##   22                     0      0       0       0     149       0       0        0    0
##   23                     0      0       0       0     125       0       0        0    0
##   24                     0      0       0       0     144       0       0        0    0
##   25                     0      0       0       0     107       0       0        0    0
##   26                     0      0       0       0     100       0       0        0    0
##   27                     0      0       0       0     117       0       0        0    0
##   28                     0      0       0       0      85       0       0        0    0
##   29                     0      0       0       0      82       0       0        0    0
##   30                     0      0       0       0      82       0       0        0    0
##   31                     0      0       0       0       0      68       0        0    0
##   32                     0      0       0       0       0      84       0        0    0
##   33                     0      0       0       0       0      78       0        0    0
##   34                     0      0       0       0       0      58       0        0    0
##   35                     0      0       0       0       0      58       0        0    0
##   36                     0      0       0       0       0      33       0        0    0
##   37                     0      0       0       0       0      46       0        0    0
##   38                     0      0       0       0       0      45       0        0    0
##   39                     0      0       0       0       0      45       0        0    0
##   40                     0      0       0       0       0      32       0        0    0
##   41                     0      0       0       0       0      34       0        0    0
##   42                     0      0       0       0       0      26       0        0    0
##   43                     0      0       0       0       0      31       0        0    0
##   44                     0      0       0       0       0      24       0        0    0
##   45                     0      0       0       0       0      27       0        0    0
##   46                     0      0       0       0       0      25       0        0    0
##   47                     0      0       0       0       0      16       0        0    0
##   48                     0      0       0       0       0      21       0        0    0
##   49                     0      0       0       0       0      15       0        0    0
##   50                     0      0       0       0       0      12       0        0    0
##   51                     0      0       0       0       0       0      13        0    0
##   52                     0      0       0       0       0       0       7        0    0
##   53                     0      0       0       0       0       0       4        0    0
##   54                     0      0       0       0       0       0       6        0    0
##   55                     0      0       0       0       0       0       9        0    0
##   56                     0      0       0       0       0       0       7        0    0
##   57                     0      0       0       0       0       0       9        0    0
##   58                     0      0       0       0       0       0       6        0    0
##   59                     0      0       0       0       0       0       5        0    0
##   60                     0      0       0       0       0       0       4        0    0
##   61                     0      0       0       0       0       0       2        0    0
##   62                     0      0       0       0       0       0       1        0    0
##   63                     0      0       0       0       0       0       5        0    0
##   64                     0      0       0       0       0       0       1        0    0
##   65                     0      0       0       0       0       0       5        0    0
##   66                     0      0       0       0       0       0       3        0    0
##   67                     0      0       0       0       0       0       2        0    0
##   68                     0      0       0       0       0       0       1        0    0
##   69                     0      0       0       0       0       0       3        0    0
##   70                     0      0       0       0       0       0       1        0    0
##   72                     0      0       0       0       0       0       0        1    0
##   73                     0      0       0       0       0       0       0        3    0
##   76                     0      0       0       0       0       0       0        1    0
##   84                     0      0       0       0       0       0       0        1    0
##   <NA>                   0      0       0       0       0       0       0        0  107

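Re-labeling `NA` values

You may want to assign NA values a label such as "Missing". Because the new column is class Factor (restricted values), you cannot simply change it with replace_na(), as this value will be rejected. Instead, use fct_explicit_na() from forcats as explained in the Factors page.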

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(
    age_years,
    breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
    right = FALSE,
    include.lowest = TRUE,        
    labels = c("0-4", "5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70-100")),
         
    # make missing values explicit
    age_cat = fct_explicit_na(
      age_cat,
      na_level = "Missing age")  # you can specify the label
  )    

# table to view counts
table(linelist$age_cat, useNA = "always")
## 
##         0-4         5-9       10-14       15-19       20-29       30-49       50-69      70-100 Missing age        <NA> 
##        1227        1223        1048         827        1216         848         105           7         107           0
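The same idea can be reproduced in base R on a toy vector. This is only a sketch of the principle (add the new label as a level first, then assign it), not the forcats implementation; the ages and the label below are made up for illustration. To my knowledge, newer forcats releases also provide fct_na_value_to_level() for this purpose.

```r
# Toy ages (hypothetical values, not taken from linelist)
ages <- c(2, 7, NA, 45, 101)

# cut() returns a factor; NA and out-of-range values (101) both become NA
age_cat <- cut(ages, breaks = c(0, 5, 10, 50, 100),
               right = FALSE, include.lowest = TRUE)

# Add the new label as a valid level first, then assign it to the missing entries
age_cat_explicit <- factor(age_cat, levels = c(levels(age_cat), "Missing age"))
age_cat_explicit[is.na(age_cat_explicit)] <- "Missing age"

# The missing values now appear under their own label
table(age_cat_explicit, useNA = "always")
```

Note that the value 101 also ends up as "Missing age" here, because it falls outside the breaks; it is worth checking the range of your data before labelling everything as missing.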

Quickly make breaks and labels

For a fast way to create vectors of breaks and labels, use something like the code below.

# Make break points from 0 to 90 by 5
age_seq <- seq(from = 0, to = 90, by = 5)
age_seq

# Make labels for the intervals between breaks, assuming default cut() settings
# (right-closed intervals (0,5], (5,10], ...); n breaks define n - 1 intervals
age_labels <- paste0(head(age_seq, -1) + 1, "-", age_seq[-1])
age_labels

# check that there is exactly one label per interval
length(age_labels) == length(age_seq) - 1
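To see how such vectors feed into cut(), here is a small sketch on made-up ages. It relies on the fact that cut() expects one label per interval, that is, one fewer label than break points. The names age_breaks, toy_ages and toy_cat are illustrative only.

```r
# Break points from 0 to 90 by 5 (19 values define 18 intervals)
age_breaks <- seq(from = 0, to = 90, by = 5)

# One label per interval; default cut() intervals are (0,5], (5,10], ...
age_labels <- paste0(head(age_breaks, -1) + 1, "-", age_breaks[-1])

# Toy ages (hypothetical); note that 0 is NOT inside (0,5] under the defaults
toy_ages <- c(0, 1, 5, 6, 90)
toy_cat  <- cut(toy_ages, breaks = age_breaks, labels = age_labels)

as.character(toy_cat)
# NA "1-5" "1-5" "6-10" "86-90"
```

If age 0 should be included in the first group, set include.lowest = TRUE (or use right = FALSE and adjust the labels accordingly).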

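Quantile breaks

In common understanding, "quantiles" or "percentiles" typically refer to a value below which a given proportion of values fall. For example, the 95th percentile of ages in linelist would be the age below which 95% of the ages fall.

However, in common speech, "quartiles" and "deciles" can also refer to the groups of data as equally divided into 4 or 10 groups (note there will be one more break point than groups).

To get quantile break points, you can use quantile() from the stats package in base R. You provide a numeric vector (e.g. a column in a dataset) and a vector of numeric probability values ranging from 0 to 1.0. The break points are returned as a numeric vector. Explore the details of the statistical methodologies by entering ?quantile.

If your input numeric vector has any missing values, it is best to set na.rm = TRUE
Set names = FALSE to get an un-named numeric vector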

quantile(linelist$age_years,               # specify numeric vector to work on
  probs = c(0, .25, .50, .75, .90, .95),   # specify the percentiles you want
  na.rm = TRUE)                            # ignore missing values 
##  0% 25% 50% 75% 90% 95% 
##   0   6  13  23  33  41
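The two arguments mentioned above can be checked on a toy vector. This is a minimal sketch with made-up numbers, not the linelist data.

```r
x <- c(1, 2, 3, 4, 5, NA)   # toy vector with one missing value

# Without na.rm = TRUE, quantile() stops with an error when NA is present
res <- try(quantile(x, probs = c(0, 0.5, 1)), silent = TRUE)
inherits(res, "try-error")

# Named result (the default) versus a plain numeric vector
q_named   <- quantile(x, probs = c(0, 0.5, 1), na.rm = TRUE)
q_unnamed <- quantile(x, probs = c(0, 0.5, 1), na.rm = TRUE, names = FALSE)

names(q_named)   # "0%" "50%" "100%"
q_unnamed        # 1 3 5
```

The un-named form is convenient when the result is passed on as the breaks argument of cut().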

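You can use the results of quantile() as break points in age_categories() or cut(). Below we create a new column deciles using cut(), where the breaks are defined using quantile() on age_years. We then display the results using tabyl() from janitor so you can see the percentages (see the Descriptive tables page). Note how they are not exactly 10% in each group.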

linelist %>%                                # begin with linelist
  mutate(deciles = cut(age_years,           # create new column decile as cut() on column age_years
    breaks = quantile(                      # define cut breaks using quantile()
      age_years,                               # operate on age_years
      probs = seq(0, 1, by = 0.1),             # 0.0 to 1.0 by 0.1
      na.rm = TRUE),                           # ignore missing values
    include.lowest = TRUE)) %>%             # for cut() include age 0
  janitor::tabyl(deciles)                   # pipe to table to display
##  deciles   n    percent valid_percent
##    [0,2] 748 0.11319613    0.11505922
##    (2,5] 721 0.10911017    0.11090601
##    (5,7] 497 0.07521186    0.07644978
##   (7,10] 698 0.10562954    0.10736810
##  (10,13] 635 0.09609564    0.09767728
##  (13,17] 755 0.11425545    0.11613598
##  (17,21] 578 0.08746973    0.08890940
##  (21,26] 625 0.09458232    0.09613906
##  (26,33] 596 0.09019370    0.09167820
##  (33,84] 648 0.09806295    0.09967697
##     <NA> 107 0.01619249            NA
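One caveat, stated here as a general property of cut() rather than something shown in the source: when many observations share the same value, several quantiles can coincide, and cut() refuses non-unique breaks. A hedged workaround is to drop duplicates with unique(), accepting that you get fewer groups. The data below are made up to provoke the problem.

```r
# Toy data with many ties: fifty zeros followed by 1 to 10
x <- c(rep(0, 50), 1:10)

# Decile break points; most of them are 0 because of the ties
q <- quantile(x, probs = seq(0, 1, by = 0.1), na.rm = TRUE, names = FALSE)

# cut() stops with an error because the breaks are not unique
res <- try(cut(x, breaks = q, include.lowest = TRUE), silent = TRUE)
inherits(res, "try-error")

# Workaround: keep only the distinct break points (fewer, uneven groups)
groups <- cut(x, breaks = unique(q), include.lowest = TRUE)
table(groups)
```

With data this skewed, quantile-based groups are not very informative, so manually chosen breaks may be the better option.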

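Evenly-sized groups

Another tool to make numeric groups is the dplyr function ntile(), which attempts to break your data into n evenly-sized groups. Be aware that, unlike with quantile(), the same value could appear in more than one group. Provide the numeric vector and then the number of groups. The values in the new column are just group "numbers" (e.g. 1 to 10), not the range of values themselves as when using cut().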

# make groups with ntile()
ntile_data <- linelist %>% 
  mutate(even_groups = ntile(age_years, 10))

# make table of counts and proportions by group
ntile_table <- ntile_data %>% 
  janitor::tabyl(even_groups)
  
# attach min/max values to demonstrate ranges
ntile_ranges <- ntile_data %>% 
  group_by(even_groups) %>% 
  summarise(
    min = min(age_years, na.rm=T),
    max = max(age_years, na.rm=T)
  )
## Warning in min(age_years, na.rm = T): no non-missing arguments to min; returning Inf
# combine and print - note that values are present in multiple groups
left_join(ntile_table, ntile_ranges, by = "even_groups")
##  even_groups   n    percent valid_percent min  max
##            1 651 0.09851695    0.10013844   0    2
##            2 650 0.09836562    0.09998462   2    5
##            3 650 0.09836562    0.09998462   5    7
##            4 650 0.09836562    0.09998462   7   10
##            5 650 0.09836562    0.09998462  10   13
##            6 650 0.09836562    0.09998462  13   17
##            7 650 0.09836562    0.09998462  17   21
##            8 650 0.09836562    0.09998462  21   26
##            9 650 0.09836562    0.09998462  26   33
##           10 650 0.09836562    0.09998462  33   84
##           NA 107 0.01619249            NA Inf -Inf
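The overlap visible in the min and max columns above (for example, age 2 is both the max of group 1 and the min of group 2) is easy to reproduce on a tiny vector. A minimal sketch with made-up values:

```r
library(dplyr)

# Toy vector with ties (hypothetical values)
x <- c(1, 1, 1, 1, 2, 2)

# Three groups of two observations each; the value 1 is split across groups 1 and 2
groups <- ntile(x, 3)
groups
# 1 1 2 2 3 3
```

Group sizes are equal, but group membership for tied values depends on their position in the data, so ntile() groups should not be read as value ranges.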

`case_when()`

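It is possible to use the dplyr function case_when() to create categories from a numeric column, but it is easier to use age_categories() from epikit or cut(), because these create an ordered factor automatically.

If using case_when(), please review the proper use as described earlier in the Re-code values section of this page. Also be aware that all right-hand side values must be of the same class. Thus, if you want NA on the right-hand side, you should either write "Missing" or use the special NA value NA_character_.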
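For comparison, here is a sketch of age categories built with case_when(). The cut-offs and labels are invented for illustration; the point is that every right-hand side is a character value (including NA_character_) and that the ordering has to be added manually afterwards.

```r
library(dplyr)

# Toy ages (hypothetical values)
ages <- c(3, 17, 40, 70, NA)

# All right-hand side values are character, including the missing case
age_group <- case_when(
  ages < 18              ~ "0-17",
  ages >= 18 & ages < 65 ~ "18-64",
  ages >= 65             ~ "65+",
  TRUE                   ~ NA_character_
)

# case_when() returns a plain character vector; convert it to an ordered factor yourself
age_group <- factor(age_group, levels = c("0-17", "18-64", "65+"), ordered = TRUE)
age_group
```

This extra factor() step is what age_categories() and cut() save you.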

Add to pipe chain

Below, code to create two categorical age columns is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 

    # add column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # add column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################   
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5)))

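Add rows

One-by-one

Adding rows one-by-one manually is tedious, but it can be done with add_row() from dplyr. Remember that each column must contain values of only one class (character, numeric, logical, etc.), so adding a row requires some care to maintain this.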

linelist <- linelist %>% 
  add_row(row_num = 666,
          case_id = "abc",
          generation = 4,
          `infection date` = as.Date("2020-10-10"),
          .before = 2)

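Use .before and .after to specify the placement of the row you want to add. .before = 3 will put the new row before the current 3rd row. The default behavior is to add the row to the end. Columns not specified will be left empty (NA).

The new row number may look strange ("...23"), but the row numbers of the pre-existing rows have changed. So if you use the command twice, examine and test the insertion carefully.

If a class you provide is off, you will see an error like this: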

Error: Can't combine ..1$infection date <date> and ..2$infection date <character>.

(When inserting a row with a date value, remember to wrap the date in the function as.Date(), as in as.Date("2020-10-10").)
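The placement and class rules can be tried out on a small tibble. This is a sketch on made-up data; add_row() is called here via the tibble package, where it is defined, and the column names are illustrative.

```r
library(tibble)

# Toy data (hypothetical cases)
toy <- tibble(
  case_id    = c("a1", "b2", "c3"),
  date_onset = as.Date(c("2020-01-01", "2020-01-05", "2020-01-09"))
)

# Insert a complete row before the current 2nd row
toy_added <- add_row(toy, case_id = "new",
                     date_onset = as.Date("2020-10-10"), .before = 2)

# Columns that are not specified are filled with NA; the row goes to the end
toy_partial <- add_row(toy, case_id = "no_date")

# Supplying a character where a Date is expected triggers the class error
res <- try(add_row(toy, case_id = "bad", date_onset = "2020-10-10"), silent = TRUE)
inherits(res, "try-error")
```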

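Bind rows

To combine datasets by binding the rows of one data frame to the bottom of another data frame, you can use bind_rows() from dplyr. This is explained in more detail in the Joining data page.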
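As a brief illustration, bind_rows() matches columns by name and fills columns that exist in only one data frame with NA. The two data frames below are made up.

```r
library(dplyr)

# Two toy data frames with partly different columns (hypothetical data)
batch_1 <- tibble(case_id = c("a1", "b2"), age = c(34, 51))
batch_2 <- tibble(case_id = "c3", hospital = "Port Hospital")

combined <- bind_rows(batch_1, batch_2)
combined
# 3 rows; age is NA for "c3", hospital is NA for "a1" and "b2"
```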

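Filter rows

A typical cleaning step after you have cleaned the columns and re-coded values is to filter the data frame for specific rows using the dplyr verb filter().

Within filter(), specify the logic that must be TRUE for a row in the dataset to be kept. Below we show how to filter rows based on simple and complex logical conditions.

Simple filter

This simple example re-defines the data frame linelist as itself, having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses evaluates to TRUE are kept.

In this example, the logical statement is gender == "f", which asks whether the value in the column gender is equal to "f" (case sensitive).

Before the filter is applied, the number of rows in linelist can be checked with nrow(linelist).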

linelist <- linelist %>% 
  filter(gender == "f")   # keep only rows where gender is equal to "f"
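A toy version of the same filter shows two details: the comparison is case sensitive, and rows where gender is missing are dropped as well. The data are made up.

```r
library(dplyr)

# Toy data (hypothetical cases)
toy <- tibble(case_id = 1:5,
              gender  = c("f", "m", "F", NA, "f"))

nrow(toy)        # 5 rows before filtering

toy_f <- toy %>% filter(gender == "f")

toy_f$case_id    # 1 5: "F" and NA are not kept
nrow(toy_f)      # 2 rows after filtering
```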

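Filter by row number

In a data frame or tibble, each row will usually have a "row number" that (when seen in the R Viewer) appears to the left of the first column. It is not itself a true column in the data, but it can be used in a filter() statement.

To filter based on "row number", you can use the dplyr function row_number() with empty parentheses as part of a logical filtering statement. Often you will use the %in% operator and a range of numbers as part of that logical statement, as shown below. To see the first N rows, you can also use the base R function head().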

# View first 100 rows
linelist %>% head(100)     # or use tail() to see the n last rows

# Show row 5 only
linelist %>% filter(row_number() == 5)

# View rows 2 through 20, and three specific columns
linelist %>% filter(row_number() %in% 2:20) %>% select(date_onset, outcome, age)

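You can also convert the row numbers to a true column by piping your data frame to the tibble function rownames_to_column() (do not put anything in the parentheses).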
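A short sketch of both ways to turn row numbers into a real column, on a made-up data frame. The column name row_id is illustrative; "rowname" is the default name used by rownames_to_column().

```r
library(dplyr)
library(tibble)

# Toy data frame (hypothetical values)
toy <- data.frame(age = c(34, 51, 27))

# Option 1: rownames_to_column() adds a character column called "rowname"
toy_rn <- toy %>% rownames_to_column()

# Option 2: row_number() inside mutate() adds an integer column
toy_idx <- toy %>% mutate(row_id = row_number())

toy_rn
toy_idx
```

The second option gives an integer column, which is usually easier to filter on with %in% and numeric ranges.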

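Complex filter

More complex logical statements can be constructed using parentheses ( ), OR |, negate !, %in%, and AND & operators. An example is below:

Note: You can use the ! operator in front of a logical criterion to negate it. For example, !is.na(column) evaluates to TRUE if the column value is not missing. Likewise, !column %in% c("a", "b", "c") evaluates to TRUE if the column value is not in the vector.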
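The operators can be combined as in the following sketch. The data frame and its values are invented for illustration; the comments state which rows each condition keeps.

```r
library(dplyr)

# Toy data (hypothetical cases)
toy <- tibble(
  case_id  = 1:6,
  gender   = c("f", "f", "m", "m", "f", NA),
  age      = c(25, 60, 70, 15, NA, 80),
  hospital = c("A", "B", "C", "A", "C", "B")
)

# AND combined with a parenthesised OR: female AND (older than 50 OR at hospital C)
complex <- toy %>% filter(gender == "f" & (age > 50 | hospital %in% c("C")))
complex$case_id       # 2 5

# Negated %in%: hospital is NOT A or B
not_ab <- toy %>% filter(!hospital %in% c("A", "B"))
not_ab$case_id        # 3 5

# Negated is.na(): age is NOT missing
has_age <- toy %>% filter(!is.na(age))
has_age$case_id       # 1 2 3 4 6
```

Case 6 is dropped by the first filter because its gender is NA, so the whole condition evaluates to NA rather than TRUE.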

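Examine the data

Below is a simple one-line command to create a histogram of onset dates. Notice that a second, smaller outbreak from 2012 to 2013 is also included in this raw dataset. For our analyses, we want to remove entries from this earlier outbreak.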

hist(linelist$date_onset, breaks = 50)

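How filters handle missing numeric and date values

Can we just filter by date_onset to keep rows after June 2013? Caution! Applying the code filter(date_onset > as.Date("2013-06-01")) would remove any rows in the later epidemic with a missing date of onset!

DANGER: Filtering to greater than (>) or less than (<) a date or number can remove any rows with missing values (NA)! This is because a comparison with NA evaluates to NA, and filter() keeps only the rows where the condition is TRUE.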
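What happens can be checked on toy data: a comparison involving NA evaluates to NA, and filter() keeps only rows where the condition is TRUE. A minimal sketch with made-up onset dates:

```r
library(dplyr)

# A comparison with NA is neither TRUE nor FALSE
NA > 5                      # NA

# Toy data (hypothetical cases); case 3 has no onset date
toy <- tibble(
  case_id    = 1:4,
  date_onset = as.Date(c("2013-01-15", "2014-03-02", NA, "2014-07-20"))
)

# The plain comparison silently drops the row with the missing date
strict <- toy %>% filter(date_onset > as.Date("2013-06-01"))
strict$case_id              # 2 4

# Keep missing dates explicitly by adding an is.na() condition
keep_na <- toy %>% filter(date_onset > as.Date("2013-06-01") | is.na(date_onset))
keep_na$case_id             # 2 3 4
```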

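(See the Working with dates page for more information on working with dates and the package lubridate.)

Design the filter

Examine a cross-tabulation to make sure we exclude only the correct rows: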

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2012 2013 2014 2015 <NA>
##   Central Hospital                        0    0  351   99   18
##   Hospital A                            229   46    0    0   15
##   Hospital B                            227   47    0    0   15
##   Military Hospital                       0    0  676  200   34
##   Missing                                 0    0 1117  318   77
##   Other                                   0    0  684  177   46
##   Port Hospital                           9    1 1372  347   75
##   St. Mark's Maternity Hospital (SMMH)    0    0  322   93   13
##   <NA>                                    0    0    0    0    0

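What other criteria can we filter on to remove the first outbreak (in 2012 and 2013) from the dataset? We see that:

The first epidemic in 2012 and 2013 occurred at Hospital A and Hospital B, and there were also 10 cases at Port Hospital.
Hospitals A and B did not have cases in the second epidemic, but Port Hospital did.

We want to exclude:

The rows with onset in 2012 and 2013 at either Hospital A, Hospital B, or Port Hospital, counted by nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01"))):
- Exclude the rows with onset in 2012 and 2013, counted by nrow(linelist %>% filter(date_onset < as.Date("2013-06-01")))
- Exclude the rows from Hospitals A and B with missing onset dates, counted by nrow(linelist %>% filter(hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset)))
- Do not exclude the other rows with missing onset dates, counted by nrow(linelist %>% filter(!hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset))).

We start with a linelist of nrow(linelist) rows. Here is our filter statement: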

linelist <- linelist %>% 
  # keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
## [1] 6019

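When we re-make the cross-tabulation, we see that Hospitals A and B are removed completely, the 10 Port Hospital cases from 2012 and 2013 are removed, and all other values are the same, just as we wanted.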

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2014 2015 <NA>
##   Central Hospital                      351   99   18
##   Military Hospital                     676  200   34
##   Missing                              1117  318   77
##   Other                                 684  177   46
##   Port Hospital                        1372  347   75
##   St. Mark's Maternity Hospital (SMMH)  322   93   13
##   <NA>                                    0    0    0

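Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity.

Note: some readers may notice that it would be easier to just filter by date_hospitalisation, because it is 100% complete with no missing values. This is true. But date_onset is used for the purpose of demonstrating a complex filter.

Standalone

Filtering can also be done as a stand-alone command (not part of a pipe chain). As with other dplyr verbs, in this case the first argument must be the dataset itself.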

# dataframe <- filter(dataframe, condition(s) for rows to keep)

linelist <- filter(linelist, !is.na(case_id))

You can also use base R to subset using square brackets, which reflect the [rows, columns] that you want to retain.

# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)

linelist <- linelist[!is.na(linelist$case_id), ]   # inside [ ], columns must be referenced with linelist$
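Unlike filter(), base R brackets do not look up column names inside the data frame, so the column has to be referenced with the $ operator. A minimal sketch on made-up data:

```r
# Toy data frame (hypothetical cases); the second case_id is missing
toy <- data.frame(case_id = c("a1", NA, "c3"),
                  age     = c(30, 41, 12))

# Rows: keep where case_id is not missing; columns: blank means keep all
kept <- toy[!is.na(toy$case_id), ]

# Rows and columns at once; drop = FALSE keeps the result a data frame
kept_ids <- toy[!is.na(toy$case_id), "case_id", drop = FALSE]

kept
kept_ids
```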

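Quickly review records

Often you want to quickly review a few records, for only a few columns. The base R function View() will print a data frame for viewing in RStudio.

View the linelist in RStudio:

View(linelist)

Here are two examples of viewing specific cells (specific rows and specific columns):

With the dplyr functions filter() and select():

Within View(), pipe the dataset to filter() to keep certain rows, and then to select() to keep certain columns. For example, to review the onset and hospitalization dates of 3 specific cases: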

View(linelist %>%
       filter(case_id %in% c("11f8ea", "76b97a", "47a5f5")) %>%
       select(date_onset, date_hospitalisation))

You can achieve the same with base R syntax, using brackets [ ] to subset what you want to see.

View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])

Add to pipe chain

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 

    # add column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # add column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
  
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5))) %>% 
    
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

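Row-wise calculations

If you want to perform a calculation within a row, you can use rowwise() from dplyr. See the online vignette on row-wise calculations.
For example, this code applies rowwise() and then creates a new column that sums, for each row in the linelist, the number of the specified symptom columns that have the value "yes". The columns are specified within sum() by name, inside a vector c(). rowwise() is essentially a special kind of group_by(), so it is best to use ungroup() when you are done (see the Grouping data page).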

linelist %>%
  rowwise() %>%
  mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes")) %>% 
  ungroup() %>% 
  select(fever, chills, cough, aches, vomit, num_symptoms) # for display
## # A tibble: 5,888 x 6
##    fever chills cough aches vomit num_symptoms
##    <chr> <chr>  <chr> <chr> <chr>        <int>
##  1 no    no     yes   no    yes              2
##  2 <NA>  <NA>   <NA>  <NA>  <NA>            NA
##  3 <NA>  <NA>   <NA>  <NA>  <NA>            NA
##  4 no    no     no    no    no               0
##  5 no    no     yes   no    yes              2
##  6 no    no     yes   no    yes              2
##  7 <NA>  <NA>   <NA>  <NA>  <NA>            NA
##  8 no    no     yes   no    yes              2
##  9 no    no     yes   no    yes              2
## 10 no    no     yes   no    no               1
## # ... with 5,878 more rows

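As you specify the columns to evaluate, you may want to use the "tidyselect" helper functions described in the select() section of this page. You just have to make one adjustment (because you are not using them within a dplyr function like select() or summarise()).

Put the column-specification criteria within the dplyr function c_across(). This is because c_across() (see its documentation) is designed specifically to work with rowwise(). For example, the following code:

Applies rowwise(), so the following operation (sum()) is applied within each row (not summing entire columns)
Creates a new column num_NA_dates, defined for each row as the number of columns (with a name containing "date") for which is.na() evaluates to TRUE (they are missing data)
Applies ungroup() to remove the effects of rowwise() for subsequent steps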

linelist %>%
  rowwise() %>%
  mutate(num_NA_dates = sum(is.na(c_across(contains("date"))))) %>% 
  ungroup() %>% 
  select(num_NA_dates, contains("date")) # for display
## # A tibble: 5,888 x 5
##    num_NA_dates date_infection date_onset date_hospitalisation date_outcome
##           <int> <date>         <date>     <date>               <date>      
##  1            1 2014-05-08     2014-05-13 2014-05-15           NA          
##  2            1 NA             2014-05-13 2014-05-14           2014-05-18  
##  3            1 NA             2014-05-16 2014-05-18           2014-05-30  
##  4            1 2014-05-04     2014-05-18 2014-05-20           NA          
##  5            0 2014-05-18     2014-05-21 2014-05-22           2014-05-29  
##  6            0 2014-05-03     2014-05-22 2014-05-23           2014-05-24  
##  7            0 2014-05-22     2014-05-27 2014-05-29           2014-06-01  
##  8            0 2014-05-28     2014-06-02 2014-06-03           2014-06-07  
##  9            1 NA             2014-06-05 2014-06-06           2014-06-18  
## 10            1 NA             2014-06-05 2014-06-07           2014-06-09  
## # ... with 5,878 more rows
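To make the effect of rowwise() concrete, the sketch below runs the same sum() with and without it on a made-up symptom table. The final rowSums() line is a vectorised base R alternative added here for comparison; it is not part of the approach described above.

```r
library(dplyr)

# Toy symptom data (hypothetical cases)
toy <- tibble(fever = c("yes", "no", "yes"),
              cough = c("yes", "yes", "no"))

# Without rowwise(): sum() runs over the entire columns, so every row gets the total
no_rowwise <- toy %>%
  mutate(num_symptoms = sum(c(fever, cough) == "yes"))
no_rowwise$num_symptoms     # 4 4 4

# With rowwise(): sum() is evaluated separately within each row
with_rowwise <- toy %>%
  rowwise() %>%
  mutate(num_symptoms = sum(c_across(c(fever, cough)) == "yes")) %>%
  ungroup()
with_rowwise$num_symptoms   # 2 1 1

# Vectorised alternative in base R (an assumption of this sketch, not from the text)
vectorised <- rowSums(toy[, c("fever", "cough")] == "yes")
```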

You could also provide other functions, such as max(), to get the latest or most recent date for each row:

linelist %>%
  rowwise() %>%
  mutate(latest_date = max(c_across(contains("date")), na.rm=T)) %>% 
  ungroup() %>% 
  select(latest_date, contains("date"))  # for display
## # A tibble: 5,888 x 5
##    latest_date date_infection date_onset date_hospitalisation date_outcome
##    <date>      <date>         <date>     <date>               <date>      
##  1 2014-05-15  2014-05-08     2014-05-13 2014-05-15           NA          
##  2 2014-05-18  NA             2014-05-13 2014-05-14           2014-05-18  
##  3 2014-05-30  NA             2014-05-16 2014-05-18           2014-05-30  
##  4 2014-05-20  2014-05-04     2014-05-18 2014-05-20           NA          
##  5 2014-05-29  2014-05-18     2014-05-21 2014-05-22           2014-05-29  
##  6 2014-05-24  2014-05-03     2014-05-22 2014-05-23           2014-05-24  
##  7 2014-06-01  2014-05-22     2014-05-27 2014-05-29           2014-06-01  
##  8 2014-06-07  2014-05-28     2014-06-02 2014-06-03           2014-06-07  
##  9 2014-06-18  NA             2014-06-05 2014-06-06           2014-06-18  
## 10 2014-06-09  NA             2014-06-05 2014-06-07           2014-06-09  
## # ... with 5,878 more rows

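Arrange and sort

Use the dplyr function arrange() to sort or order the rows by column values.

Simply list the columns in the order they should be sorted on. Specify .by_group = TRUE if you want the sorting to first occur by any groupings applied to the data (see the Grouping data page).

By default, columns are sorted in "ascending" order (which applies to numeric and also to character columns). You can sort a variable in "descending" order by wrapping it with desc().

Sorting data with arrange() is particularly useful when making tables for presentation, when using slice() to take the "top" rows per group, or when setting factor level order by order of appearance.

For example, to sort our linelist rows by hospital, then by date_onset in descending order, we would use: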

linelist %>% 
   arrange(hospital, desc(date_onset))
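The resulting order is easier to see on a handful of rows. This sketch uses made-up data and also shows that missing values are placed last, even for a column wrapped in desc().

```r
library(dplyr)

# Toy data (hypothetical cases)
toy <- tibble(
  hospital   = c("Port", "Central", "Port", "Central", "Port"),
  date_onset = as.Date(c("2014-05-01", "2014-06-10", NA, "2014-04-02", "2014-07-15"))
)

# Hospital A-Z first; within each hospital, most recent onset first
sorted <- toy %>%
  arrange(hospital, desc(date_onset))

sorted
# Central 2014-06-10, Central 2014-04-02, Port 2014-07-15, Port 2014-05-01, Port NA
```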
