R for Data Science Summary: Tidy Data

Author: 我要养只哈士奇

Article link: https://blog.csdn.net/weixin_38423453/article/details/82969121

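This article summarizes the Tidy Data concept from R for Data Science and stresses why tidy data matters. Using the tidyr package, it introduces Gathering, Spreading, Separating, and Uniting, which ensure that each variable has its own column and each observation has its own row. Worked examples show how to convert untidy data into tidy format, including handling missing values and unifying data formats, with the whole transformation finally expressed as a single pipeline.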

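Data analysis in R, especially with the tidyverse, works best when the dataset is tidy. The idea is somewhat similar to normal forms in relational databases. A dataset is tidy when: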

Each variable has its own column
Each observation has its own row
Each value has its own cell
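
As a rough illustration of these three rules, the sketch below rebuilds by hand the table1 example printed later in this article and checks the rules programmatically. The object name `tidy_tb` is only an illustrative choice and does not come from the book.

```r
library(tidyverse)

# Hand-built copy of table1 (values copied from the table1 printout in this article).
tidy_tb <- tribble(
  ~country,      ~year,  ~cases,  ~population,
  "Afghanistan", 1999L,    745L,    19987071L,
  "Afghanistan", 2000L,   2666L,    20595360L,
  "Brazil",      1999L,  37737L,   172006362L,
  "Brazil",      2000L,  80488L,   174504898L,
  "China",       1999L, 212258L,  1272915272L,
  "China",       2000L, 213766L,  1280428583L
)

# Rule 1: every variable is a column.
names(tidy_tb)

# Rule 2: every observation (one country in one year) is exactly one row,
# so the number of distinct (country, year) pairs equals the number of rows.
nrow(distinct(tidy_tb, country, year)) == nrow(tidy_tb)

# Rule 3: every cell holds a single value, so each column has one atomic type.
map_chr(tidy_tb, class)
```

A check like this does not prove that a dataset is tidy (that depends on what counts as a variable and an observation in the problem at hand), but it catches the most common violations.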

Here we use the tidyr package, which is part of the tidyverse, to bring each dataset into tidy form:

library(tidyverse)

table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # ... with 6 more rows
table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

# Spread across two tibbles
table4a  # cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
table4b  # population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

Of these tables, only table1 is tidy. A tidy structure is what makes dplyr functions such as mutate() and summarise() straightforward to use:

# Compute rate per 10,000
table1 %>% 
  mutate(rate = cases / population * 10000)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071 0.373
#> 2 Afghanistan  2000   2666   20595360 1.29 
#> 3 Brazil       1999  37737  172006362 2.19 
#> 4 Brazil       2000  80488  174504898 4.61 
#> 5 China        1999 212258 1272915272 1.67 
#> 6 China        2000 213766 1280428583 1.67

# Compute cases per year
table1 %>% 
  count(year, wt = cases)
#> # A tibble: 2 x 2
#>    year      n
#>   <int>  <int>
#> 1  1999 250740
#> 2  2000 296920

# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))
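
For contrast, here is a minimal sketch of the same rate calculation done directly on the untidy pair table4a (cases) and table4b (population). It assumes both tables list the countries in the same row order, which holds for these tidyr example datasets but is not something the code verifies. The name `rate_wide` is only illustrative.

```r
library(tidyverse)

# table4a holds cases and table4b holds population, with one column per year.
# Every year has to be spelled out by hand, and the two tables are matched
# purely by row position (an assumption, not a checked join).
rate_wide <- tibble(
  country = table4a$country,
  `1999`  = table4a$`1999` / table4b$`1999` * 10000,
  `2000`  = table4a$`2000` / table4b$`2000` * 10000
)

rate_wide
```

The result carries the same numbers as the rate column computed from table1, but adding another year means adding another line of code, and the output is again in the wide, untidy layout.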


Gathering

table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

In this dataset, the two columns 1999 and 2000 are values rather than variables; the column names should be placed in