dplyr: R Language

samantha_wang

http://127.0.0.1:22867/library/dplyr/doc/introduction.html

Introduction to dplyr

1. functions

filter() and slice()

arrange()

select() and rename()

distinct()

mutate() and transmute()

summarise()

sample_n() and sample_frac()

2. filter()

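Selects a subset of the rows of a data frame. The first argument is the data frame; the remaining arguments are filtering expressions, and multiple expressions are combined with AND.

filter(data_frame, filtering_expressions, ...)

e.g. filter(flights, month == 1, day == 1)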

which is equivalent to:

flights[flights$month==1 & flights$day==1,]

To combine conditions with OR, use | explicitly:

filter(flights, month == 1 | day == 2)

slice() selects rows by position.

e.g. slice(flights,1:10)
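
As a minimal sketch of the calls above, the following uses a small made-up data frame (`toy`, a hypothetical stand-in for `flights`, so it runs without the nycflights13 package) to compare AND, OR, the base R form, and `slice()`:

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- data.frame(
  month     = c(1, 1, 2, 2, 3),
  day       = c(1, 2, 1, 2, 1),
  dep_delay = c(5, -3, 12, NA, 0)
)

# Comma-separated conditions are combined with AND
jan1 <- filter(toy, month == 1, day == 1)

# Base R subsetting that selects the same rows here
jan1_base <- toy[toy$month == 1 & toy$day == 1, ]

# OR has to be written explicitly with |
jan_or_day2 <- filter(toy, month == 1 | day == 2)

# slice() selects rows by position
first_two <- slice(toy, 1:2)
```

One difference worth keeping in mind: `filter()` drops rows where the condition evaluates to NA, whereas base `[` subsetting returns rows filled with NA for them.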

3. arrange()

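Takes a data frame and a set of column names to order by. Later columns break ties in earlier ones.

arrange(data_frame, column_names)

e.g. arrange(flights, year, month, day)

For descending order, wrap each column in desc(), which takes a single column:

arrange(flights, desc(year), desc(month), desc(day))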
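
The ordering rules can be checked on a tiny example. This is only a sketch; `toy` is a made-up data frame, not the real `flights` data. Note that `desc()` takes a single column, so each column to be reversed gets its own `desc()` call.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  year  = c(2013, 2013, 2013, 2013),
  month = c(2, 1, 1, 2),
  day   = c(1, 15, 3, 9)
)

# Ascending: sort by year, then month, then day
asc <- arrange(toy, year, month, day)

# Descending: one desc() per column
dsc <- arrange(toy, desc(month), desc(day))
```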

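4. select() and rename()

select() keeps the columns you are interested in and drops all variables not explicitly mentioned.

e.g. select(flights, year, month, day)

select(flights, year:day) keeps all columns from year to day (inclusive).

select(flights, -(year:day)) drops all columns from year to day (inclusive).

rename(flights, tail_num = tailnum) renames a column without dropping any variable.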

distinct()

distinct() is often used with select() to return only the unique values of the chosen columns.

e.g. distinct(select(flights,tailnum))

distinct(select(flights,origin,dest))
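
A short sketch of column selection, renaming, and `distinct()` on a made-up data frame (`toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  year    = c(2013, 2013, 2013),
  month   = c(1, 1, 2),
  day     = c(1, 2, 3),
  tailnum = c("N1", "N1", "N2"),
  origin  = c("JFK", "JFK", "LGA"),
  stringsAsFactors = FALSE
)

# Keep the range of columns from year to day
date_cols <- select(toy, year:day)

# Drop the same range; note the colon inside the parentheses
no_date <- select(toy, -(year:day))

# rename() keeps every column and only changes the name
renamed <- rename(toy, tail_num = tailnum)

# Unique values of a single column
planes <- distinct(select(toy, tailnum))
```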

5. mutate()

Adds new variables that are functions of existing columns.

e.g. mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60)

The key difference between mutate() and base R's transform() is that mutate() lets you refer to columns you have just created in the same call.

e.g. mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60))

transform(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60)) raises an error, because gain does not exist yet when gain_per_hour is evaluated.

transmute() keeps only the new variables.

e.g. transmute(flights,gain=arr_delay-dep_delay)
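
The difference between `mutate()`, `transform()`, and `transmute()` can be sketched on made-up data (`toy` is hypothetical). The `try()` call assumes no object named `gain` exists in the calling environment, otherwise `transform()` would silently use it.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  arr_delay = c(20, -5),
  dep_delay = c(10, 5),
  air_time  = c(120, 60)
)

# mutate() can use gain immediately after defining it
with_gain <- mutate(
  toy,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# transform() evaluates all expressions against the original columns,
# so gain is not found; try() captures the error instead of stopping
base_result <- try(
  transform(
    toy,
    gain = arr_delay - dep_delay,
    gain_per_hour = gain / (air_time / 60)
  ),
  silent = TRUE
)

# transmute() returns only the newly created column
only_gain <- transmute(toy, gain = arr_delay - dep_delay)
```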

6. summarise()

Collapses a data frame to a single row.

e.g. summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
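
To see why `na.rm = TRUE` matters here, consider this sketch on a made-up column containing a missing value (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy <- data.frame(dep_delay = c(10, NA, 20))

# Without na.rm the mean is NA
with_na <- summarise(toy, delay = mean(dep_delay))

# With na.rm = TRUE the missing value is ignored
no_na <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))
```

Both results have exactly one row; only the second contains a usable mean.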

7. sample_n() and sample_frac()

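Take a random sample of rows: sample_n() for a fixed number of rows, sample_frac() for a fixed fraction.

Use replace = TRUE to sample with replacement, e.g. for a bootstrap sample.

e.g. sample_n(flights, 10)

sample_frac(flights, 0.01)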
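
A minimal sampling sketch on a made-up 100-row data frame (`toy` is hypothetical; the seed is arbitrary and only there to make the run repeatable):

```r
library(dplyr)

set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical toy data with 100 rows
toy <- data.frame(id = 1:100)

# Fixed number of rows
ten <- sample_n(toy, 10)

# Fixed fraction of rows (5% of 100 rows)
five_pct <- sample_frac(toy, 0.05)

# Bootstrap sample: same size as the data, drawn with replacement
boot <- sample_n(toy, 100, replace = TRUE)
```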

These five functions provide the basis of a language of data manipulation. At the most basic level, you can only alter a tidy data frame in five useful ways: you can reorder the rows (arrange()), pick observations and variables of interest (filter() and select()), add new variables that are functions of existing variables (mutate()), or collapse many values to a summary (summarise()). The remainder of the language comes from applying the five functions to different types of data, such as grouped data, as described next.

Grouped operations

In dplyr, you use the group_by() function to describe how to break a dataset down into groups of rows.

You can then use the resulting object in exactly the same functions as above; they automatically work "by group" when the input is a grouped data frame.

Grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
Grouped arrange() orders first by the grouping variables.
mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x), and are described in detail in vignette("window-functions").
sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
slice() extracts rows within each group.

e.g. by_tailnum <- group_by(flights, tailnum)

delay <- summarise(by_tailnum, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE))

delay <- filter(delay, count > 20, dist < 2000)

Aggregate functions provided by dplyr, in addition to base functions such as mean() and min():

n(): the number of observations in the current group

n_distinct(x): the number of unique values in x

first(x), last(x), nth(x, n): the first, last, and n-th value of x
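
The grouped summary above can be sketched on made-up data. Here `toy` and its values are hypothetical, and the filter thresholds are scaled down to fit the tiny data set.

```r
library(dplyr)

# Hypothetical toy data: two planes with a handful of flights each
toy <- data.frame(
  tailnum   = c("N1", "N1", "N1", "N2", "N2"),
  distance  = c(100, 200, 300, 1000, 3000),
  arr_delay = c(5, NA, 15, 30, 10),
  dest      = c("BOS", "BOS", "ORD", "LAX", "SFO"),
  stringsAsFactors = FALSE
)

by_tailnum <- group_by(toy, tailnum)

# One output row per plane
delay <- summarise(
  by_tailnum,
  count      = n(),                 # flights per plane
  dests      = n_distinct(dest),    # distinct destinations per plane
  first_dest = first(dest),         # first destination in row order
  dist       = mean(distance, na.rm = TRUE),
  delay      = mean(arr_delay, na.rm = TRUE)
)

# Keep planes with more than two flights and a short mean distance
delay <- filter(delay, count > 2, dist < 2000)
```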

Chaining with %>%

x %>% f(y) turns into f(x, y), so you can rewrite nested operations as a chain that reads from left to right, top to bottom.

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
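
To make the rewriting rule concrete, this sketch runs the same steps once as nested calls and once as a pipeline and checks that the results agree. The data frame `toy` is made up, and the `select()` step from the chain above is left out for brevity.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  month     = c(1, 1, 2, 2),
  arr_delay = c(40, 50, 5, NA),
  dep_delay = c(10, 20, 0, 4)
)

# Nested form: has to be read from the inside out
nested <- filter(
  summarise(
    group_by(toy, month),
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)

# Piped form: each step receives the previous result as its first argument
piped <- toy %>%
  group_by(month) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

same <- isTRUE(all.equal(as.data.frame(nested), as.data.frame(piped)))
```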