This post was originally published on the Quantide blog. Read the full article here.
If you want to apply an arbitrary operation to a data frame and get back more than one number, use dplyr's do()!
This post explores some basic concepts of do() and gives some advice on using it and programming with it.
do() is a verb (function) of dplyr. dplyr is a powerful R package for data manipulation, written and maintained by Hadley Wickham. It lets you perform the common data manipulation tasks on data frames: filtering rows, selecting specific columns, re-ordering rows, adding new columns, summarizing data and computing arbitrary operations.
First of all, you have to install the dplyr package:
install.packages("dplyr")
and then load it:
library(dplyr)
We will analyze the use of do() with the following dataset, created with random data:
set.seed(100)
ds <- data.frame(group=c(rep("a",100), rep("b",100), rep("c",100)),
x=rnorm(n = 300, mean = 3, sd = 2), y=rnorm(n = 300, mean = 2, sd = 2))
We first convert it to a tbl_df object, which has a more compact print method. The data themselves are not changed. (In recent dplyr versions tbl_df() is deprecated; as_tibble() is its replacement.)
ds <- tbl_df(ds)
ds
Source: local data frame [300 x 3]

    group        x           y
   (fctr)    (dbl)       (dbl)
1       a 1.995615 -1.71089045
2       a 3.263062 -0.03712943
3       a 2.842166 -0.09022217
4       a 4.773570  0.69742469
5       a 3.233943  2.76536531
6       a 3.637260  4.06379942
7       a 1.836419  2.26214995
8       a 4.429065  2.75438347
9       a 1.349481 -1.77539016
10      a 2.280276  3.04043881
..    ...      ...         ...
Basic Concepts of do() (Non-Standard Evaluation Version)
As already mentioned, do() applies arbitrary operations to a data frame and can return more than one number.
To use do(), you must know that:
- it always returns a data frame
- unlike the other data manipulation verbs of dplyr, do() needs the . placeholder inside the function to apply, referring to the data it has to work with:
# Head of ds
ds %>% do(head(.))
Source: local data frame [6 x 3]

   group        x           y
  (fctr)    (dbl)       (dbl)
1      a 1.995615 -1.71089045
2      a 3.263062 -0.03712943
3      a 2.842166 -0.09022217
4      a 4.773570  0.69742469
5      a 3.233943  2.76536531
6      a 3.637260  4.06379942
- it is designed to be used with dplyr's group_by() to compute operations within groups:
# Head of ds by group
ds %>% group_by(group) %>% do(head(.))
Source: local data frame [18 x 3]
Groups: group [3]

    group          x           y
   (fctr)      (dbl)       (dbl)
1       a 1.99561530 -1.71089045
2       a 3.26306233 -0.03712943
3       a 2.84216582 -0.09022217
4       a 4.77356962  0.69742469
5       a 3.23394254  2.76536531
6       a 3.63726018  4.06379942
7       b 2.33415330 -0.56965729
8       b 5.72622741  1.71643653
9       b 2.06170532  4.87756954
10      b 4.68575126 -0.08011508
11      b 0.08401255 -0.04767590
12      b 2.19938816  4.18954758
13      c 3.05634353 -0.89257491
14      c 2.28659319  2.63171152
15      c 4.70525275  1.31450497
16      c 4.02673050 -1.86270620
17      c 5.03640599  2.48564201
18      c 0.95704183  1.27446410
- the argument of do() can be named or unnamed:
  - named arguments (more than one can be supplied) become list-columns, with one element for each group:
# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(out=tail(.$x, 3))
Source: local data frame [3 x 2]
Groups: <by row>

   group      out
  (fctr)    (chr)
1      a <dbl[3]>
2      b <dbl[3]>
3      c <dbl[3]>
  - an unnamed argument (only one can be supplied) must be a data frame, and the group labels are repeated accordingly:
# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(data.frame(out=tail(.$x, 3)))
Source: local data frame [9 x 2]
Groups: group [3]

   group       out
  (fctr)     (dbl)
1      a 3.8270397
2      a 0.6426337
3      a 0.6519305
4      b 3.3238824
5      b 0.8290942
6      b 4.1538746
7      c 6.5861213
8      c 4.6280643
9      c 0.3599512
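To make the named case more concrete, here is a minimal sketch that supplies two named arguments at once and then pulls a value out of one of the resulting list-columns. The column names first_x and last_x and the object res are introduced here for illustration only; the dataset is rebuilt so that the block runs on its own.

```r
library(dplyr)

# Same random dataset as above, rebuilt so the example is self-contained
set.seed(100)
ds <- data.frame(group = c(rep("a", 100), rep("b", 100), rep("c", 100)),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

# Two named arguments: each one becomes its own list-column
# (first_x and last_x are illustrative names, not from the original post)
res <- ds %>%
  group_by(group) %>%
  do(first_x = head(.$x, 1), last_x = tail(.$x, 3))

# One row per group; each cell of a list-column holds that group's result
names(res)
res$last_x[[1]]      # the last 3 values of x in group "a"
lengths(res$last_x)  # number of values stored for each group
```

Elements of a list-column are extracted with [[ ]], just like elements of any other R list.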
do() is used in the same way with user-defined functions.
Let us define the following function, which performs two simple operations and returns a data frame:
my_fun <- function(x, y) {
  # Shifted mean of x and scaled mean of y, returned as a one-row data frame
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  return(data.frame(res_x, res_y))
}
If the argument is named, the result is:
# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(out=my_fun(x=.$x, y=.$y))
Source: local data frame [3 x 2]
Groups: <by row>

   group                out
  (fctr)              (chr)
1      a <data.frame [1,2]>
2      b <data.frame [1,2]>
3      c <data.frame [1,2]>
Otherwise, if the argument is unnamed, the result is:
# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(my_fun(x=.$x, y=.$y))
Source: local data frame [3 x 3]
Groups: group [3]

   group    res_x     res_y
  (fctr)    (dbl)     (dbl)
1      a 5.005825  9.167546
2      b 5.022282  8.683619
3      c 5.025586 11.240558
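As a rough cross-check of what the unnamed, grouped call computes, the same split-apply-combine steps can be sketched in base R. This is only an illustration of the idea, not how dplyr implements do(); the names pieces, results and by_hand are introduced here.

```r
# Same random dataset and function as above, rebuilt so the example is self-contained
set.seed(100)
ds <- data.frame(group = c(rep("a", 100), rep("b", 100), rep("c", 100)),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

my_fun <- function(x, y) {
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  return(data.frame(res_x, res_y))
}

# Split: one data frame per group
pieces <- split(ds, ds$group)

# Apply: run my_fun() on each piece
results <- lapply(pieces, function(d) my_fun(x = d$x, y = d$y))

# Combine: stack the one-row results and add the group label back
by_hand <- data.frame(group = names(results),
                      do.call(rbind, results),
                      row.names = NULL)
by_hand
```

The result has one row per group with the columns group, res_x and res_y, which is the same shape as the output of the unnamed do() call.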
Programming with do_() (Standard Evaluation Version)
How can we enclose the previous operations inside a function? Simple! Use do_() (the SE version of do()) and the interp() function of the lazyeval package.
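The underscored verbs such as do_() have been deprecated since dplyr 0.7, so the sketch below deliberately swaps in a different technique: it wraps plain do() in a function and passes the column names as strings, without do_() or interp(). The wrapper apply_my_fun() and its arguments are hypothetical names introduced here, and the grouping step assumes dplyr 1.0 or later for across() and all_of().

```r
library(dplyr)

my_fun <- function(x, y) {
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  return(data.frame(res_x, res_y))
}

# Hypothetical wrapper (not from the original post):
# group_col, x_col and y_col are column names given as character strings
apply_my_fun <- function(data, group_col, x_col, y_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%      # group by a column named in a string
    do(my_fun(x = .[[x_col]], y = .[[y_col]]))   # . is the data of the current group
}

# Same random dataset as above, rebuilt so the example is self-contained
set.seed(100)
ds <- data.frame(group = c(rep("a", 100), rep("b", 100), rep("c", 100)),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

res <- apply_my_fun(ds, group_col = "group", x_col = "x", y_col = "y")
res
```

Selecting columns with .[[x_col]] keeps the column names as ordinary strings, so the wrapper can be called with any data frame whose column names are known only at run time.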
The post dplyr do: Some Tips for Using and Programming appeared first on MilanoR.