R语言入门-数据分析实操(tyidyverse工作流+代码)

本文链接：https://blog.csdn.net/Allenmumu/article/details/118826190

1. 数据分析的开端，Tidyverse？

tidyverse 是一个清晰的 R 包集合，在数据操作、探索和可视化领域提供统一的数据科学解决方案，这些解决方案具有共同的设计理念。它是由 RStudio 背后的首席科学家 Hadley Wickham 创建的。 tidyverse 中的 R 包旨在提高统计学家和数据科学家的工作效率。包引导他们完成工作流程，促进沟通并产生可重复的工作产品。 tidyverse 本质上侧重于使工作流成为可能的工具的互连。在数据科学项目中采用 tidyverse有很多优势。它提供一致的功能、工作流覆盖范围、数据科学教育、数据科学工具开发的简化路径以及提高生产力的潜力。

它的主要目标之一是帮助任何需要分析数据的人高效地工作。如下图所示，tidyverse可以帮助你实现：

数据导入/导出
数据清洗处理
批量建模
数据/模型结果可视化
生成(可交互)的分析报告. (eg. pdf, word, ppt)

本文主要讨论前两个目标及数据的创建导入导出，以及数据清洗处理。

在这里插入图片描述

tidyverse核心：管道操作 %>%

进行数据分析前先导入必要的包和数据：

library(tidyverse)
library(rio)
data("german", package = "rchallenge")

首先查看你的数据集：

german %>% glimpse()

## Rows: 1,000
## Columns: 21
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…

如你想对数据集german进行如下操作,：

选择status,savings,amount,age,credit_risk这些列

german %>% 
  select(status,savings,amount, age, credit_risk) %>%

再按分类变量status分类

german %>% 
  select(status,savings,amount, age, credit_risk) %>% 
  group_by(status) %>% 
  glimpse()

## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings     <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount      <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age         <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…

并计算每一类的个数。

german %>% 
  select(status,savings,amount, age, credit_risk) %>% 
  group_by(status) %>% 
  glimpse()

## Rows: 1,000
## Columns: 5
## Groups: status [4]
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ savings     <fct> unknown/no savings account, unknown/no savings account, ..…
## $ amount      <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 1098, 3758,…
## $ age         <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…

通过tidyverse这样的工作流程，我们可以按照自己的思路一步一步的处理清洗我们的数据。

2. 数据导入/导出

数据读取

这里建议使用rio包里的import函数导入。可以导入xlxs,.csv,.txt, SPSS, Stata, SAS等大部分的格式。

当然每个格式也有单独对应的包，如readr可以用来读取csv,readxl读取xls,haven读取SPSS,Stat,SAS文件等。

library(rio)
df = import("yourdirectory/data.csv")
df = import("yourdirectory/data.xlxs")

数据导出

同理这里推荐使用rio包里的export函数。

export(df, "yourdirectory/name.csv")
export(df, "yourdirectory/name.xlxs")

3. tidyverse数据清洗处理流程

这里主要介绍关于行列创建修改的基本操作，以及一些分组汇总知识。更多细节请查看。

R语言编程–基于tidyverse

3.1选择列

主要使用到的函数有 relocate/select

select 选择所需要的列
relocate 选择需要的列并排序，它保留了所有列，但为重新排序提供了更大的灵活性。

german %>% select(status,credit_risk,job,age) %>% glimpse()

## Rows: 1,000
## Columns: 4
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…
## $ job         <fct> skilled employee/official, skilled employee/official, unsk…
## $ age         <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24, 31, 31, 23…

german %>% select(2,3,5,7) %>% glimpse()

## Rows: 1,000
## Columns: 4
## $ duration            <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6, 48…
## $ credit_history      <fct> all credits at this bank paid back duly, all credi…
## $ amount              <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361, 109…
## $ employment_duration <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 <= .…

# 选取这些列排到最前面，保留剩余列
german %>% relocate(status,credit_risk,job,age) %>% glimpse()

## Rows: 1,000
## Columns: 21
## $ status                  <fct> no checking account, no checking account, ... …
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…

# 选取这些列，且以status-age 为顺序，age是倒数第三列。
german %>% relocate(status,credit_risk,job,age, .after = last_col(offset = 3))

使用函数选择

german %>% select(starts_with("s")) %>% glimpse() # 选择列名是以s开始的

## Rows: 1,000
## Columns: 2
## $ status  <fct> no checking account, no checking account, ... < 0 DM, no check…
## $ savings <fct> unknown/no savings account, unknown/no savings account, ... < …

german %>% select(ends_with("s")) %>% glimpse() # 选择列名是以s结束的

## Rows: 1,000
## Columns: 5
## $ status                  <fct> no checking account, no checking account, ... …
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…

german %>% select(contains("s")) # 选择列名中包含s的

german %>% select(where(is.factor))  # 选择是列属性是factor的

使用逻辑符选择

# 选择这三列，并且这三列中是因子的
german %>% select(c(status,age,credit_risk) & where(is.factor)) %>% glimpse()

## Rows: 1,000
## Columns: 2
## $ status      <fct> no checking account, no checking account, ... < 0 DM, no c…
## $ credit_risk <fct> good, good, good, good, good, good, good, good, good, good…

# 不选择这age,status这两列
german %>% select(!c(age,status))

german %>% select(-c(age,status))

3.2 修改列

创建新列/修改新列/修改旧列 mutate

german %>% 
  # 创建新列age_mean计算age的均值，一般默认会把新列放在最后，.before=1使得放在第一列
  mutate(age_mean = mean(age), .before = 1) %>% 
  # 也可以修改原有的列
  mutate(age = age + 1) %>% 
  #也可以同时创建多个列，顺序是依次往后计算，所以前面创建的列可以在后面直接用。
  mutate(duration_mean = mean(duration),
         duration_median = median(duration),
         duration_sd = sd(duration),
         order = duration_mean > duration_median,
         order = as.numeric(order),
         .before= 1) %>% 
  glimpse()

## Rows: 1,000
## Columns: 26
## $ duration_mean           <dbl> 20.903, 20.903, 20.903, 20.903, 20.903, 20.903…
## $ duration_median         <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18…
## $ duration_sd             <dbl> 12.05881, 12.05881, 12.05881, 12.05881, 12.058…
## $ order                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age_mean                <dbl> 35.542, 35.542, 35.542, 35.542, 35.542, 35.542…
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <dbl> 22, 37, 24, 40, 39, 49, 40, 41, 66, 24, 37, 25…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…

同时修改多个列 across

german %>%
  # 将选定列转化为字符型
  mutate(across(c(duration, age), as.character)) %>% 
  # 将每一列转化为数字型
  mutate(across(everything(), as.numeric)) %>% 
  # 分别计算duration, age 列的均值和方差
  mutate(across(c(duration, age),
                .fns = list(mean = mean, stdev =sd)),
         .before = 1)

german %>% 
  group_by(status) %>% 
  summarise(
    across(c(duration,age), .fns = list(mean = mean, stdev = sd)),
    .groups = "drop"
  )

对列进行重新编码
- if_else() 多用于处理二分类，也可以多分类但不推荐

# 当age>=60，老人，1
german %>% 
  mutate(old = if_else(age<60,0,1))

# 也可以用多个if_else 嵌套处理多分类
# age<30:0 ; age<60:1; else:2 
german %>% 
  mutate(age_category = if_else(age<30,0,
                                if_else(age<60,1,2)),
         .before=1) %>% 
  glimpse()

## Rows: 1,000
## Columns: 22
## $ age_category            <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1…
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…

case_when() 用于处理多分类，要比if_else()更直接好用

# 同理对年龄进行多分类
german %>% 
  mutate(age_category = case_when(
    age<30 ~ 0,
    age<60 ~ 1,
    TRUE ~ 2
  ),
  .before=1)

german %>% 
  mutate(age_category = case_when(
    age<30 ~ "young",
    age<60 ~ "middle",
    TRUE ~ "old"
  ),
  .before=1) %>% 
  glimpse()

one-hot 处理 correlationfunnel

把所有变量进行one-hot处理，生成二进制 (0/1) 变量的特征集。

Numeric data 被切分（使用n_bins）成分类数据，然后所有分类数据都被单热编码以产生二进制特征。为了防止低频类别（高基数类别）增加维数（结果数据框的宽度），我们使用 thresh_infreq = 0.01 和 name_infreq = “OTHER” 对多余的类别进行分组。
Categorical data : one-hot encoding

# 把所有变量进行one-hot 处理
# install.packages("correlationfunnel")
german %>% 
  correlationfunnel::binarize(n_bins = 5, thresh_infreq = 0.1, name_infreq = "OTHER",
                              one_hot = TRUE) %>% 
  glimpse()

## Rows: 1,000
## Columns: 74
## $ status__no_checking_account                                   <dbl> 1, 1, 0,…
## $ `status__..._<_0_DM`                                          <dbl> 0, 0, 1,…
## $ `status__..._>=_200_DM_/_salary_for_at_least_1_year`          <dbl> 0, 0, 0,…
## $ status__OTHER                                                 <dbl> 0, 0, 0,…
........
## $ duration__12_15                                               <dbl> 0, 0, 0,…
## $ duration__15_24                                               <dbl> 1, 0, 0,…
## $ duration__24_30                                               <dbl> 0, 0, 0,…
## $ `telephone__yes_(under_customer_name)`                        <dbl> 0, 0, 0,…
## $ foreign_worker__no                                            <dbl> 1, 1, 1,…
## $ foreign_worker__OTHER                                         <dbl> 0, 0, 0,…
## $ credit_risk__bad                                              <dbl> 0, 0, 0,…
## $ credit_risk__good                                             <dbl> 1, 1, 1,…

关于correlationfunnel包的更多具体使用请查看R packages: Correlationfunnel 相关漏斗图，可视化因变量自变量之间的相关关系

3.3 筛选行 filter()

# 选取满足多个条件的行，这里, 相当于 & 表示和的意思
german %>% 
  filter(age > 60, credit_risk == "good", telephone == "no") %>% 
  glimpse()

german %>% 
  filter(age != 35 | age>60, credit_risk == "good") %>% 
  glimpse()

# 选取age列和duration列都大于50的行
german %>% 
  filter(if_all(c(age, duration), ~ .x > 50)) %>% 
  glimpse()

分组汇总

german %>% 
  group_by(status) %>% 
  count(credit_risk, name = "n") %>% 
  glimpse()

## Rows: 8
## Columns: 3
## Groups: status [4]
## $ status      <fct> no checking account, no checking account, ... < 0 DM, ... …
## $ credit_risk <fct> bad, good, bad, good, bad, good, bad, good
## $ n           <int> 135, 139, 105, 164, 14, 49, 46, 348

german_wider = german %>% 
  group_by(status) %>% 
  count(credit_risk, name = "n") %>% 
  pivot_wider(
    names_from = credit_risk,
    values_from = n,
    values_fill = 0
  ) %>% 
  ungroup() %>% 
  glimpse()

## Rows: 4
## Columns: 3
## $ status <fct> no checking account, ... < 0 DM, 0<= ... < 200 DM, ... >= 200 D…
## $ bad    <int> 135, 105, 14, 46
## $ good   <int> 139, 164, 49, 348

生成分组表格的快捷方式

library(table1)

# 汇总如下变量 
table1(~  duration + age + status + job , data = german)

总结

tidyverse 的优势包括一致的功能、工作流覆盖、数据科学教育的途径、开发数据科学工具的简约方法以及提高生产力的可能性。tidyverse 用一种更加优雅且符合人直观思路的方式，以管道式、泛函式编程技术实现了数据科学的
整个流程:数据导入、数据清洗、数据操作、数据可视化、数据建模、可重现与交互报告。

如果你也面临以下问题待解决：

处理自己领域真实的数据
对数据进行建模分析、可视化
得出有用的结论

欢迎使用tidyverse工作流。是从basic R 还是tidy R入门R语言的争议一直都有，没有绝对的对错，只有合适与否。基于我自身，我需要解决的就是上述三个问题，因此tidy R 工作流程更适合我。