dplyr: R Language

http://127.0.0.1:22867/library/dplyr/doc/introduction.html

Introduction to dplyr


1. functions

filter() and slice()

arrange()

select() and rename()

distinct()

mutate() and transmute()

summarise()

sample_n() and sample_frac()


2. filter()

select a subset of the rows of a data frame.

filter( data_frame, filtering expressions....)


e.g. filter(flights, month==1,day==1)

which is equivalent to:

flights[flights$month==1 & flights$day==1,]

filter(flights,month==1 | day==2)


slice: select rows by position

e.g. slice(flights,1:10)
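
A minimal sketch of filter() and slice() on a small made-up table. The data frame toy and its values are hypothetical stand-ins for flights, so the row counts only hold for this toy data.

```r
library(dplyr)

# Hypothetical toy data standing in for the flights table
toy <- data.frame(
  month     = c(1, 1, 2, 3),
  day       = c(1, 2, 2, 1),
  dep_delay = c(5, -3, 10, 0)
)

# Comma-separated conditions are combined with AND
both <- filter(toy, month == 1, day == 1)

# Use | inside a single expression for OR
either <- filter(toy, month == 1 | day == 2)

# slice() picks rows by position, not by value
first_two <- slice(toy, 1:2)

nrow(both)       # 1 row matches both conditions
nrow(either)     # 3 rows match at least one condition
nrow(first_two)  # 2 rows
```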


3. arrange()

takes a dataframe, and a set of column names to order by.

arrange(data_frame,column_names)

descending order: 

arrange(flights,desc(year),desc(month),desc(day))


e.g. arrange(flights,year,month,day)
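
A short sketch of ascending and descending ordering, again on a hypothetical toy table rather than the real flights data. Note that desc() wraps one column at a time.

```r
library(dplyr)

# Hypothetical toy data; only the date columns matter here
toy <- data.frame(
  year  = c(2013, 2013, 2013),
  month = c(2, 1, 1),
  day   = c(1, 15, 3)
)

# Later columns break ties in earlier ones
ascending <- arrange(toy, year, month, day)

# Wrap each column separately in desc() for descending order
descending <- arrange(toy, desc(month), desc(day))

ascending$day   # 3 15 1
descending$day  # 1 15 3
```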


4. select

select columns you are interested in.

select will drop all variables not explicitly mentioned

e.g. select(flights,year,month,day)

select(flights,year:day)

select(flights,-(year:day))


rename(flights, tail_num=tailnum): unlike select(), this will not drop any variables.


distinct()

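distinct() returns only the unique rows of a table, so it is often combined with select() to list the unique values of one or more columns.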

e.g. distinct(select(flights,tailnum))

distinct(select(flights,origin,dest))
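
The column verbs can be compared side by side on a toy table. This is only a sketch: toy and its column values are invented, but the column names mirror the ones used in the notes.

```r
library(dplyr)

# Hypothetical toy data with a few flights-like columns
toy <- data.frame(
  year    = 2013,
  month   = c(1, 1, 2),
  day     = c(1, 1, 5),
  tailnum = c("N1", "N1", "N2"),
  origin  = c("JFK", "JFK", "LGA"),
  stringsAsFactors = FALSE
)

kept    <- select(toy, year:day)            # columns year through day
dropped <- select(toy, -(year:day))         # everything except that range
renamed <- rename(toy, tail_num = tailnum)  # renames, keeps all columns
planes  <- distinct(select(toy, tailnum))   # unique tail numbers

names(kept)     # "year" "month" "day"
names(dropped)  # "tailnum" "origin"
nrow(planes)    # 2
```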


5. mutate()

ADD new variables that are functions of existing columns

e.g. mutate(flights, gain=arr_delay-dep_delay,speed=distance/air_time*60)

the key difference between mutate() and base R's transform() is that mutate() allows you to refer to columns that you just created in the same call.

e.g. mutate(flights,gain=arr_delay-dep_delay,gain_per_hour=gain/(air_time/60))

transform(flights, gain=arr_delay-dep_delay,gain_per_hour=gain/(air_time/60)) fails with an error, because transform() cannot see the gain column created in the same call.


transmute: it will only keep the new variables.

e.g. transmute(flights,gain=arr_delay-dep_delay)
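
The difference between mutate(), transmute() and transform() can be checked on a tiny example. The table toy is hypothetical, and the tryCatch() wrapper is only there so the sketch keeps running after transform() fails. It assumes no object called gain exists in the calling environment.

```r
library(dplyr)

# Hypothetical toy data with the columns used in the notes
toy <- data.frame(
  arr_delay = c(20, -5),
  dep_delay = c(10, 5),
  air_time  = c(120, 60)
)

# mutate() can use gain immediately after defining it
with_gain <- mutate(toy,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# transmute() keeps only the newly created column
only_gain <- transmute(toy, gain = arr_delay - dep_delay)

# transform() evaluates every expression against the original columns,
# so gain is not found; the error is caught and replaced by a marker string
base_attempt <- tryCatch(
  transform(toy,
    gain = arr_delay - dep_delay,
    gain_per_hour = gain / (air_time / 60)
  ),
  error = function(e) "error"
)

with_gain$gain_per_hour  # 5 -10
names(only_gain)         # "gain"
```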


6. summarise()

which collapses a dataframe to a single row.


e.g. summarise(flights,delay=mean(dep_delay,na.rm=T))
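
A small sketch of why na.rm matters here, using a hypothetical three-row table with one missing delay.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(dep_delay = c(10, NA, 20))

# Without na.rm the mean of a column containing NA is NA
with_na <- summarise(toy, delay = mean(dep_delay))

# With na.rm = TRUE the missing value is ignored
without_na <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))

with_na$delay     # NA
without_na$delay  # 15
```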



7. sample_n() and sample_frac()

take a random sample of rows.

replace=TRUE will sample with replacement, i.e. perform a bootstrap sample.


e.g. sample_n(flights,10)

sample_frac(flights,0.01)
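
A sketch of the sampling verbs on a hypothetical 100-row table. The argument that switches on sampling with replacement is spelled replace. Recent dplyr releases mark these two functions as superseded by slice_sample(), but they still work as shown.

```r
library(dplyr)

set.seed(1)  # only to make the sketch reproducible

# Hypothetical toy data: 100 rows with an id column
toy <- data.frame(id = 1:100)

ten   <- sample_n(toy, 10)       # fixed number of rows
tenth <- sample_frac(toy, 0.1)   # fixed fraction of rows

# Bootstrap sample: same size as the data, drawn with replacement
boot <- sample_n(toy, 100, replace = TRUE)

nrow(ten)    # 10
nrow(tenth)  # 10
nrow(boot)   # 100
```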


These five functions provide the basis of a language of data manipulation. At the most basic level, you can only alter a tidy data frame in five useful ways: you can reorder the rows (arrange()), pick observations and variables of interest (filter() and select()), add new variables that are functions of existing variables (mutate()) or collapse many values to a summary (summarise()). The remainder of the language comes from applying the five functions to different types of data, such as grouped data, as described next.



Grouped operations


 In dplyr, you use the group_by() function to describe how to break a dataset down into groups of rows. 

You can then use the resulting object in exactly the same functions as above; they'll automatically work "by group" when the input is a grouped data frame.


  • grouped select() is the same as ungrouped select(), except that grouping variables are always retained.

  • grouped arrange() orders first by grouping variables (in recent dplyr versions only when called with .by_group = TRUE)

  • mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x), and are described in detail in vignette("window-functions")

  • sample_n() and sample_frac() sample the specified number/fraction of rows in each group.

  • slice() extracts rows within each group

e.g. by_tailnum=group_by(flights,tailnum)
delay=summarise(by_tailnum,count=n(),dist=mean(distance,na.rm=T),delay=mean(arr_delay,na.rm=T))
delay=filter(delay,count>20,dist<2000)

aggregate functions: n(): number of observations in the current group

n_distinct(x): count the number of unique values in x

first(x), last(x), nth(x,n)
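
The aggregate helpers can be tried together in one grouped summarise(). This is a sketch on invented data: toy, per_plane and the column values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: five flights by two planes
toy <- data.frame(
  tailnum  = c("N1", "N1", "N2", "N2", "N2"),
  dest     = c("LAX", "SFO", "LAX", "LAX", "BOS"),
  distance = c(100, 300, 50, 50, 200),
  stringsAsFactors = FALSE
)

by_tailnum <- group_by(toy, tailnum)

per_plane <- summarise(by_tailnum,
  count      = n(),               # rows in the current group
  dests      = n_distinct(dest),  # unique destinations per plane
  first_dest = first(dest),       # first destination in row order
  dist       = mean(distance)
)

per_plane$count  # 2 3
per_plane$dests  # 2 2
per_plane$dist   # 200 100
```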



%>%


x %>% f(y) turns into f(x, y), so you can use it to rewrite multiple operations so that they read from left to right, top to bottom.

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
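
To see that the pipe only changes how the code reads, the same steps can be written nested and piped and then compared. The table toy is hypothetical and much smaller than flights.

```r
library(dplyr)

# Hypothetical toy data: arrival delays in two months
toy <- data.frame(
  month     = c(1, 1, 2, 2),
  arr_delay = c(40, 50, 5, NA)
)

# Nested form: has to be read from the inside out
nested <- filter(
  summarise(group_by(toy, month), arr = mean(arr_delay, na.rm = TRUE)),
  arr > 30
)

# Piped form: x %>% f(y) is evaluated as f(x, y)
piped <- toy %>%
  group_by(month) %>%
  summarise(arr = mean(arr_delay, na.rm = TRUE)) %>%
  filter(arr > 30)

piped$month  # 1
piped$arr    # 45
```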
