The R package dplyr: data cleaning and tidying

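This package is mainly used for cleaning and tidying data. Coursera course link: Getting and Cleaning Data

You can also load the swirl package and work through the Getting and Cleaning Data course interactively.

For example: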

library(swirl)
install_from_swirl("Getting and Cleaning Data")
swirl()

This post mainly follows the introductory vignette that ships with the package: Introduction to dplyr

1. Sample data

> library(nycflights13)
> dim(flights)
[1] 336776     16
> head(flights, 3)
Source: local data frame [3 x 16]

  year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
1 2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227
2 2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227
3 2013     1   1      542         2      923        33      AA  N619AA   1141    JFK  MIA      160
Variables not shown: distance (dbl), hour (dbl), minute (dbl)

2. Converting a large data frame into a friendlier tbl_df

> flights_df <- tbl_df(flights)
> flights_df

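As a minimal sketch of what the conversion buys you, the snippet below builds a hypothetical wide data frame (the name `wide` and its contents are made up for illustration) and converts it with as_tibble(), which dplyr re-exports from the tibble package. Note that tbl_df() is deprecated in dplyr 1.0.0 and later, so as_tibble() is probably the safer choice on a current installation.

```r
library(dplyr)

# Hypothetical data: 50 rows and 30 columns of zeros, too wide to print nicely.
wide <- as.data.frame(matrix(0, nrow = 50, ncol = 30))

# as_tibble() plays the role that tbl_df() plays in the post.
wide_tbl <- as_tibble(wide)

class(wide_tbl)  # includes "tbl_df" as well as "data.frame"
dim(wide_tbl)    # the data itself is unchanged: 50 rows, 30 columns

# Printing shows only the first rows and the columns that fit on screen.
wide_tbl
```

Only the printing behaviour changes; the object is still a data frame and works with the verbs in the following sections.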

3. Filtering rows with filter()

> filter(flights_df, month == 1, day == 1)
Source: local data frame [842 x 16]

   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
1  2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227
2  2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227
This keeps the rows where month is 1 and day is 1.

The same result in base R:

flights_df[flights_df$month == 1 & flights_df$day == 1, ]
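The two forms agree on the flights data, but they treat missing values differently. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) to show that filter() drops rows where the condition is NA, while base subsetting with `[` returns an all-NA row for them.

```r
library(dplyr)

# Hypothetical toy data; the last row has a missing month.
toy <- data.frame(
  month  = c(1, 1, 2, NA),
  day    = c(1, 2, 1, 1),
  flight = c(101, 102, 103, 104)
)

# dplyr: comma-separated conditions are combined with AND.
by_filter <- filter(toy, month == 1, day == 1)

# Base R: the same test written with & and explicit toy$ prefixes.
by_base <- toy[toy$month == 1 & toy$day == 1, ]

nrow(by_filter)  # 1: only flight 101 matches
nrow(by_base)    # 2: flight 101 plus one all-NA row from the NA condition
```

If the base R form is preferred, wrapping the condition in which() is one way to drop the NA rows.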

4. Selecting rows by position with slice()

slice(flights_df, 1:10)
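A minimal sketch of a few other ways to index with slice(), using a hypothetical six-row data frame. Negative positions drop rows, and n() inside slice() refers to the number of rows.

```r
library(dplyr)

# Hypothetical toy data.
toy <- data.frame(id = 1:6, value = c(10, 20, 30, 40, 50, 60))

first_three <- slice(toy, 1:3)  # rows 1 to 3
skip_first  <- slice(toy, -1)   # everything except row 1
last_row    <- slice(toy, n())  # the last row
```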

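5. Sorting with arrange()

> arrange(flights_df, year, month, day)
This sorts flights_df in ascending order by year, then month, then day.

For descending order, wrap the column in desc():

> arrange(flights_df, year, desc(month), day)
Comparable sorts written with base R's order() (desc() still comes from dplyr):

flights_df[order(flights_df$year, flights_df$month, flights_df$day), ]
flights_df[order(desc(flights_df$arr_delay)), ]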
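To make the ordering rules concrete, here is a small sketch on made-up data (`toy` is hypothetical). Later columns only break ties within earlier ones, and desc() reverses a single column without affecting the others.

```r
library(dplyr)

# Hypothetical toy data with two months and four days.
toy <- data.frame(month = c(2, 1, 2, 1), day = c(5, 9, 1, 3))

# Ascending by month, then by day within each month.
asc <- arrange(toy, month, day)
asc$day    # 3 9 1 5

# Descending month, but days still ascending within a month.
mixed <- arrange(toy, desc(month), day)
mixed$day  # 1 5 3 9

# The base R equivalent of the first sort.
base_asc <- toy[order(toy$month, toy$day), ]
```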


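6. Selecting columns with select()

Pick the columns you need by name.

select(flights_df, year, month, day)
This keeps three columns.
Use the : operator to select a range of adjacent columns:
select(flights_df, year:day)
Use - to drop the columns you do not want:

select(flights_df, -(year:day))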
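The three forms can be checked side by side on a small made-up data frame (`toy` and its columns are illustrative only). Comparing the resulting column names is a quick way to confirm what each call keeps.

```r
library(dplyr)

# Hypothetical toy data with four columns.
toy <- data.frame(
  year    = 2013,
  month   = 1:2,
  day     = 3:4,
  carrier = c("UA", "AA"),
  stringsAsFactors = FALSE
)

by_name  <- select(toy, year, month, day)  # list columns one by one
by_range <- select(toy, year:day)          # every column from year to day
dropped  <- select(toy, -(year:day))       # everything except that range

names(by_range)  # "year" "month" "day"
names(dropped)   # "carrier"
```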

7. Adding columns with mutate()

mutate() creates new columns from existing ones.

> mutate(flights_df,
+        gain = arr_delay - dep_delay,
+        speed = distance / air_time * 60)
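A worked sketch of the same arithmetic on two made-up flights (all values in `toy` are hypothetical). In nycflights13, distance is in miles and air_time is in minutes, so distance / air_time * 60 is a speed in miles per hour. The last column, gain_per_hour, is an extra illustration showing that mutate() can refer to a column created earlier in the same call.

```r
library(dplyr)

# Hypothetical toy flights: delays in minutes, distance in miles, air_time in minutes.
toy <- data.frame(
  arr_delay = c(11, 20),
  dep_delay = c(2, 4),
  distance  = c(700, 1000),
  air_time  = c(100, 200)
)

out <- mutate(toy,
  gain  = arr_delay - dep_delay,          # minutes of delay added in the air
  speed = distance / air_time * 60,       # miles per hour
  gain_per_hour = gain / (air_time / 60)  # uses the gain column defined above
)

out$gain   # 9 and 16
out$speed  # 420 and 300 (miles per hour)
```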


8. Summarising with summarise() (summarize() is an alias)
> summarise(flights,
+           delay = mean(dep_delay, na.rm = TRUE))

This computes the mean of dep_delay, ignoring missing values.
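The na.rm = TRUE argument matters because the delay columns contain missing values. A minimal sketch on made-up data (`toy` is hypothetical) shows the difference: without it, a single NA makes the whole mean NA.

```r
library(dplyr)

# Hypothetical toy data with one missing delay.
toy <- data.frame(dep_delay = c(2, 4, NA, 10))

# Without na.rm, the NA propagates into the result.
with_na <- summarise(toy, delay = mean(dep_delay))

# With na.rm = TRUE, the mean is taken over the three known values.
without_na <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))

with_na$delay     # NA
without_na$delay  # (2 + 4 + 10) / 3
```

In both cases the result collapses to a single row.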

9. Random sampling

sample_n(flights_df, 10)
This draws 10 rows at random.
sample_frac(flights_df, 0.01)
This draws a random 1% of the rows.
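Because the rows are drawn at random, each run gives a different result unless the seed is fixed. The sketch below, on a hypothetical 200-row data frame, uses set.seed() to make the draw repeatable. In dplyr 1.0.0 and later these two functions are superseded by slice_sample(), though they remain available.

```r
library(dplyr)

set.seed(42)  # fix the random seed so the draw is repeatable

# Hypothetical toy data: 200 rows identified by id.
toy <- data.frame(id = 1:200)

ten_rows    <- sample_n(toy, 10)       # exactly 10 rows, without replacement
one_percent <- sample_frac(toy, 0.01)  # 1% of 200 rows, i.e. 2 rows

nrow(ten_rows)     # 10
nrow(one_percent)  # 2
```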

10. Grouping with group_by()

by_tailnum <- group_by(flights, tailnum)
# Group the flights by tailnum and store the grouped table as by_tailnum
delay <- summarise(by_tailnum,
                   count = n(),
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE))
# For each tailnum: the number of flights, plus the mean distance and mean arr_delay
delay <- filter(delay, count > 20, dist < 2000)
library(ggplot2)  # ggplot() comes from ggplot2, which must be loaded first
ggplot(delay, aes(dist, delay)) +
    geom_point(aes(size = count), alpha = 1/2) +
    geom_smooth() +
    scale_size_area()
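The grouped summary above can be traced by hand on a small made-up table (`toy` and its two tail numbers are hypothetical). After group_by(), summarise() returns one row per group rather than one row overall.

```r
library(dplyr)

# Hypothetical toy data: two planes, five flights, one missing arrival delay.
toy <- data.frame(
  tailnum   = c("A", "A", "B", "B", "B"),
  distance  = c(100, 300, 500, 500, 800),
  arr_delay = c(10, NA, -5, 5, 30),
  stringsAsFactors = FALSE
)

per_plane <- summarise(group_by(toy, tailnum),
  count = n(),                            # flights per plane
  dist  = mean(distance, na.rm = TRUE),   # mean distance per plane
  delay = mean(arr_delay, na.rm = TRUE)   # mean arrival delay per plane
)

per_plane$count  # 2 and 3
per_plane$dist   # 200 and 600
per_plane$delay  # 10 and 10
```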



When several steps are chained, each intermediate result has to be stored by assignment:

a1 <- group_by(flights, year, month, day)
a2 <- select(a1, arr_delay, dep_delay)
a3 <- summarise(a2,
  arr = mean(arr_delay, na.rm = TRUE),
  dep = mean(dep_delay, na.rm = TRUE))
a4 <- filter(a3, arr > 30 | dep > 30)

11. Chaining with the pipe operator %>%

Start with the data, then apply each operation to it in turn:

flights %>%
    group_by(year, month, day) %>%
    select(arr_delay, dep_delay) %>%
    summarise(
        arr = mean(arr_delay, na.rm = TRUE),
        dep = mean(dep_delay, na.rm = TRUE)
    ) %>%
    filter(arr > 30 | dep > 30)
None of the steps has to repeat the data name.


To learn more about the package, see its reference manual (60 pages): dplyr
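The pipe works because x %>% f(y) is evaluated as f(x, y): the left-hand side becomes the first argument of the call on the right. The sketch below, on made-up data (`toy` is hypothetical), writes the same computation as nested calls and as a pipeline and checks that the results match.

```r
library(dplyr)

# Hypothetical toy data: two months of arrival delays.
toy <- data.frame(month = c(1, 1, 2, 2), arr_delay = c(40, 50, 5, NA))

# Nested calls: read from the inside out.
nested <- filter(
  summarise(group_by(toy, month), arr = mean(arr_delay, na.rm = TRUE)),
  arr > 30
)

# Piped calls: read from top to bottom.
piped <- toy %>%
  group_by(month) %>%
  summarise(arr = mean(arr_delay, na.rm = TRUE)) %>%
  filter(arr > 30)

piped$month  # 1: only month 1 has a mean delay above 30
piped$arr    # 45
```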
