http://127.0.0.1:22867/library/dplyr/doc/introduction.html
Introduction to dplyr
1. functions
filter() and slice()
arrange()
select() and rename()
distinct()
mutate() and transmute()
summarise()
sample_n() and sample_frac()
2. filter()
select a subset of the rows of a data frame.
filter(data_frame, filtering_expressions, ...)
e.g. filter(flights, month==1,day==1)
which is equivalent to:
flights[flights$month==1 & flights$day==1,]
use | for OR: filter(flights,month==1 | day==2)
slice: select rows by position
e.g. slice(flights,1:10)
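A small sketch of filter() and slice() on a toy table, so the results can be checked by eye. The data frame toy and its values are made up for illustration; they stand in for flights, which lives in the nycflights13 package.

```r
library(dplyr)

# Hypothetical stand-in for the flights table
toy <- data.frame(
  month     = c(1, 1, 2, 3),
  day       = c(1, 2, 1, 2),
  dep_delay = c(5, -3, 12, 0)
)

# Comma-separated conditions are combined with AND
both <- filter(toy, month == 1, day == 1)

# Use | for OR (note the double == in each comparison)
either <- filter(toy, month == 1 | day == 2)

# Base R equivalent of the AND filter
base_both <- toy[toy$month == 1 & toy$day == 1, ]

# Rows by position
first_two <- slice(toy, 1:2)
```

With this toy data, both has 1 row, either has 3 rows, and first_two has 2 rows.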
3. arrange()
takes a dataframe, and a set of column names to order by.
arrange(data_frame,column_names)
descending order:
arrange(flights,desc(year),desc(month),desc(day)) (desc() takes one column at a time)
e.g. arrange(flights,year,month,day)
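A minimal sketch of ascending and descending ordering, assuming a made-up three-row table named toy in place of flights.

```r
library(dplyr)

# Hypothetical stand-in for the flights table
toy <- data.frame(
  year  = c(2013, 2013, 2013),
  month = c(2, 1, 1),
  day   = c(1, 15, 3)
)

# Ascending: later columns break ties in earlier ones
asc <- arrange(toy, year, month, day)

# Descending: wrap each column in its own desc()
dsc <- arrange(toy, desc(month), desc(day))
```

Here asc$day comes out as 3, 15, 1 and dsc$day as 1, 15, 3.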
4. select() and rename()
select columns you are interested in.
select will drop all variables not explicitly mentioned
e.g. select(flights,year,month,day)
select(flights,year:day)
select(flights,-(year:day)) drops all columns from year to day (inclusive)
rename(flights, tail_num=tailnum) renames a column and does not drop any variable
distinct()
distinct() is often used together with select() to find the unique values (or unique combinations of values) in a table
e.g. distinct(select(flights,tailnum))
distinct(select(flights,origin,dest))
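A short sketch contrasting select(), rename() and distinct(), again on a hypothetical toy table rather than the real flights data.

```r
library(dplyr)

# Hypothetical stand-in for the flights table
toy <- data.frame(
  year    = c(2013, 2013, 2013),
  month   = c(1, 1, 2),
  day     = c(1, 2, 1),
  tailnum = c("N1", "N1", "N2"),
  stringsAsFactors = FALSE
)

kept    <- select(toy, year:day)              # range of columns
dropped <- select(toy, -(year:day))           # everything except that range
renamed <- rename(toy, tail_num = tailnum)    # new_name = old_name, keeps all columns
planes  <- distinct(select(toy, tailnum))     # unique tail numbers
```

kept has the columns year, month, day; dropped has only tailnum; renamed still has four columns; planes has 2 rows.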
5. mutate()
ADD new variables that are functions of existing columns
e.g. mutate(flights, gain=arr_delay-dep_delay,speed=distance/air_time*60)
the key difference between mutate() and base transform() is that mutate() allows you to refer to columns that you just created in the same call.
e.g. mutate(flights,gain=arr_delay-dep_delay,gain_per_hour=gain/(air_time/60))
transform(flights, gain=arr_delay-dep_delay,gain_per_hour=gain/(air_time/60)) this will induce an error
transmute: it will only keep the new variables.
e.g. transmute(flights,gain=arr_delay-dep_delay)
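A sketch of the mutate() / transform() / transmute() difference on made-up numbers. The tryCatch() wrapper is only there so the failing transform() call does not stop the script; it assumes no object called gain exists in the calling environment.

```r
library(dplyr)

# Hypothetical stand-in for the flights table
toy <- data.frame(
  arr_delay = c(10, 20),
  dep_delay = c(5, 30),
  air_time  = c(60, 120)
)

# mutate() can use gain right after defining it
m <- mutate(toy,
  gain          = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# transform() evaluates all arguments against the original columns,
# so gain is not found and an error is raised
res <- tryCatch(
  transform(toy,
    gain          = arr_delay - dep_delay,
    gain_per_hour = gain / (air_time / 60)
  ),
  error = function(e) e
)

# transmute() keeps only the newly created column
tm <- transmute(toy, gain = arr_delay - dep_delay)
```

With these values m$gain is 5, -10 and m$gain_per_hour is 5, -5; tm has the single column gain.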
6. summarise()
which collapses a dataframe to a single row.
e.g. summarise(flights,delay=mean(dep_delay,na.rm=T))
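A tiny sketch of summarise() on an ungrouped table, using a made-up dep_delay column with one missing value to show why na.rm matters.

```r
library(dplyr)

# Hypothetical delays, one of them missing
toy <- data.frame(dep_delay = c(10, NA, 20))

# The whole table collapses to one row
s <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))

# Without na.rm = TRUE the mean of a column containing NA is NA
s_na <- summarise(toy, delay = mean(dep_delay))
```

Here s is a one-row table with delay equal to 15, while s_na$delay is NA.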
7. sample_n() and sample_frac()
take a random sample of rows.
replace=TRUE samples with replacement, i.e. performs a bootstrap sample.
e.g. sample_n(flights,10) sample_frac(flights,0.01)
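A sketch of the sampling helpers on a hypothetical ten-row table. The id column and the seed are only there to make the example reproducible and easy to inspect.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so repeated runs give the same rows

# Hypothetical table with ten identifiable rows
toy <- data.frame(id = 1:10)

three <- sample_n(toy, 3)                   # 3 rows, no repeats
boot  <- sample_n(toy, 10, replace = TRUE)  # bootstrap: same size, repeats allowed
fifth <- sample_frac(toy, 0.2)              # 20% of the rows
```

three has 3 distinct rows, boot has 10 rows that may repeat, and fifth has 2 rows.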
These five functions provide the basis of a language of data manipulation. At the most basic level, you can only alter a tidy data frame in five useful ways: you can reorder the rows (arrange()), pick observations and variables of interest (filter() and select()), add new variables that are functions of existing variables (mutate()) or collapse many values to a summary (summarise()). The remainder of the language comes from applying the five functions to different types of data, like to grouped data, as described next.
Grouped operations
In dplyr, you use the group_by() function to describe how to break a dataset down into groups of rows.
You can then use the resulting object in exactly the same functions as above; they’ll automatically work “by group” when the input is a grouped data frame.
- grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
- grouped arrange() orders first by grouping variables (in current dplyr versions this only happens with .by_group = TRUE).
- mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x), and are described in detail in vignette("window-functions").
- sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
- slice() extracts rows within each group.
aggregate functions: n(): number of observations in the current group
n_distinct(x): count the number of unique values in x
first(x), last(x), nth(x,n)
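A sketch of a grouped summarise() using n(), n_distinct() and first(). The table toy and the output column names (flights, planes, first_plane) are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the flights table
toy <- data.frame(
  dest    = c("A", "A", "B", "B", "B"),
  tailnum = c("N1", "N1", "N2", "N3", "N3"),
  stringsAsFactors = FALSE
)

by_dest <- group_by(toy, dest)

# One output row per destination
out <- summarise(by_dest,
  flights     = n(),                  # rows in the current group
  planes      = n_distinct(tailnum),  # unique tail numbers in the group
  first_plane = first(tailnum)        # first value in the group
)
```

For destination A this gives 2 flights, 1 plane and first plane N1; for B it gives 3 flights, 2 planes and first plane N2.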
%>%
x %>% f(y) turns into f(x, y), so you can use it to rewrite multiple operations so that they read from left to right, top to bottom.
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
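To check that the pipe is only a rewriting of nested calls, here is a sketch that runs the same group, summarise, filter chain both ways on a made-up four-row table (toy is hypothetical, grouped by day only to keep it small).

```r
library(dplyr)

# Hypothetical stand-in for the flights table
toy <- data.frame(
  day       = c(1, 1, 2, 2),
  arr_delay = c(40, 50, 5, NA),
  dep_delay = c(10, 20, 0, 5)
)

# Piped form: read top to bottom
piped <- toy %>%
  group_by(day) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

# Nested form: the same calls, read inside out
nested <- filter(
  summarise(
    group_by(toy, day),
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)
```

Both forms keep only day 1 (mean arrival delay 45, mean departure delay 15); day 2 is filtered out.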