R for Data Science Summary: Tidy Data

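This article summarizes the tidy data concept from R for Data Science and explains why tidy data matters. Using the tidyr package, it covers gathering, spreading, separating, and uniting, so that every variable has its own column and every observation has its own row. Worked examples show how to convert untidy data into tidy form, including handling missing values and unifying data formats, and finally how to chain the whole transformation with the pipe.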

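Working with data in R, at least with the tidyverse tools, goes most smoothly when the dataset is tidy, a structure somewhat similar to normal forms in databases. A tidy dataset follows three rules:

  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell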

Here we use the tidyr package, which is part of the tidyverse, to bring each dataset into tidy form:

library(tidyverse)

table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # ... with 6 more rows
table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

# Spread across two tibbles
table4a  # cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
table4b  # population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583
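One rough way to compare these tables against the "one observation per row" rule is to count how many rows each country and year occupies. The sketch below assumes that an observation is one country in one year (my reading of the data, not something the tables state), and the names `rows_per_obs_1` and `rows_per_obs_2` are made up for illustration.

```r
library(dplyr)
library(tidyr)  # ships the example tables table1 and table2

# In table1 every (country, year) pair occupies exactly one row.
rows_per_obs_1 <- table1 %>% count(country, year)

# In table2 each (country, year) pair is split over two rows:
# one row for cases and one row for population.
rows_per_obs_2 <- table2 %>% count(country, year)

all(rows_per_obs_1$n == 1)
all(rows_per_obs_2$n == 2)
```

This check only addresses the second rule; table3 and table4a/table4b break the other rules (two values packed into one cell, and values used as column names), which a row count alone would not reveal.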

Of these, only table1 is tidy. Tidy structure is what lets dplyr functions such as mutate() and summarise() work directly on the data:

# Compute rate per 10,000
table1 %>% 
  mutate(rate = cases / population * 10000)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071 0.373
#> 2 Afghanistan  2000   2666   20595360 1.29 
#> 3 Brazil       1999  37737  172006362 2.19 
#> 4 Brazil       2000  80488  174504898 4.61 
#> 5 China        1999 212258 1272915272 1.67 
#> 6 China        2000 213766 1280428583 1.67

# Compute cases per year
table1 %>% 
  count(year, wt = cases)
#> # A tibble: 2 x 2
#>    year      n
#>   <int>  <int>
#> 1  1999 250740
#> 2  2000 296920

# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))
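For contrast, here is a minimal sketch of the same rate computation starting from the untidy table2 and table3. Both need a reshaping step before mutate() can be applied. It uses tidyr's spread() and separate(), the functions behind the spreading and separating steps named in the summary; the result names `rate_from_table2` and `rate_from_table3` are hypothetical.

```r
library(dplyr)
library(tidyr)

# table2: cases and population share one column, so spread them
# into two columns first, then compute the rate per 10,000.
rate_from_table2 <- table2 %>%
  spread(key = type, value = count) %>%
  mutate(rate = cases / population * 10000)

# table3: cases and population are packed into one string such as
# "745/19987071", so split that string into two columns first.
# convert = TRUE turns the resulting character columns into numbers.
rate_from_table3 <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE) %>%
  mutate(rate = cases / population * 10000)
```

Both results should carry the same rate column as the table1 computation above; the extra step is the cost of starting from an untidy layout.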


Gathering

table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

In this dataset the two columns 1999 and 2000 are values rather than variables; the column names should be placed in …
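A minimal sketch of the gathering step for table4a follows. It assumes the target layout is a `year` column holding the former column names and a `cases` column holding the cell values; the text above breaks off before naming them, so those two names are my assumption, as are the result names `tidy4a` and `tidy4a_long`. The second call uses pivot_longer(), which tidyr 1.0.0 and later provide as the successor to gather().

```r
library(tidyr)

# Move the column names `1999` and `2000` into a "year" column
# and the cell values into a "cases" column.
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")

# Same reshaping with pivot_longer() (requires tidyr >= 1.0.0).
tidy4a_long <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases")
```

Both calls return six rows, one per country and year, although the row order differs between them. The `year` column comes back as character because it is built from column names; convert it afterwards if a numeric year is needed.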
