Purrr package for R is good for performance

Hadley’s project purrr

So, if you haven’t seen it, there’s some goodness over on GitHub, where Hadley Wickham has been
working to fill in some of the holes in R for anyone who wants a fuller set of functional programming
constructs to work with.

But, in true Hadley style, in addition to all of the functional programming syntactical goodness, the code is fast as well.

To install the package, which is not on CRAN as of this post, one need simply run:


# install.packages("devtools")
devtools::install_github("hadley/purrr")

Here is an example using purrr. It splits a data frame into pieces, fits a model to each piece, summarises each model and extracts R^2.

library(purrr)
 
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
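
For comparison, here is a rough base R sketch of the same four steps (split, fit, summarise, extract). It is not from the purrr documentation, and the intermediate names pieces, fits and r2 are only illustrative.

```r
# Base R sketch of the split / fit / summarise / extract steps.
pieces <- split(mtcars, mtcars$cyl)   # one data frame per cylinder count

# Fit mpg ~ wt separately on each piece
fits <- lapply(pieces, function(d) lm(mpg ~ wt, data = d))

# Pull R^2 out of each model summary as a named numeric vector
r2 <- vapply(fits, function(m) summary(m)$r.squared, numeric(1))
r2
```

The purrr version says the same thing with less ceremony: map() replaces lapply(), and map_dbl("r.squared") replaces the vapply() call with its anonymous function.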

Here is another, more complicated example. It generates 100 random test-training splits, fits a model to each training split then evaluates based on the test split:

library(dplyr)
randomgroup <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length = n), c(0, cumsum(probs)),
    rightmost.closed = TRUE)
  names(probs)[sample(g)]
}
partition <- function(df, n, probs) {
  replicate(n, split(df, randomgroup(nrow(df), probs)), FALSE) %>%
    zip() %>%
    as_data_frame()
}
 
msd <- function(x, y) sqrt(mean((x - y) ^ 2))
 
# Generate 100 random test-training splits
boot <- partition(mtcars, 100, c(test = 0.8, training = 0.2))
boot
 
boot <- boot %>% mutate(
  # Fit the models
  models = map(training, ~ lm(mpg ~ wt, data = .)),
  # Make predictions on test data
  preds = map2(models, test, predict),
  diffs = map2(preds, test %>% map("mpg"), msd)
)
 
# Evaluate root-mean-squared difference between predicted and actual
mean(unlist(boot$diffs))

As Hadley writes about the philosophy for purrr, the goal is not to try and simulate Haskell in R: purrr does not implement currying or destructuring binds or pattern matching. The goal is to give you similar expressiveness to an FP language, while allowing you to write code that looks and works like R.

  • Instead of point free style, use the pipe, %>%, to write code that can be read from left to right.

  • Instead of currying, we use … to pass in extra arguments.

  • Anonymous functions are verbose in R, so we provide two convenient shorthands. For predicate functions, ~ .x + 1 is equivalent to function(.x) .x + 1. For chains of transformation functions, . %>% f() %>% g() is equivalent to function(.) . %>% f() %>% g().

  • R is weakly typed, so we can implement general zip(), rather than having to specialise on the number of arguments. (That said I still provide map2() and map3() since it’s useful to clearly separate which arguments are vectorised over).

  • R has named arguments, so instead of providing different functions for minor variations (e.g. detect() and detectLast()) I use a named argument, .first. Type-stable functions are easy to reason about so additional arguments will never change the type of the output.
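
A few of these points can be sketched in a handful of lines. This is only an illustration, assuming purrr is installed; the names plus_one, xs, means, root_of_sum and sums are made up for the example.

```r
library(purrr)

# Formula shorthand: ~ .x + 1 stands in for function(.x) .x + 1
plus_one <- map_dbl(1:3, ~ .x + 1)

# Extra arguments travel through ... to the mapped function
xs <- list(a = c(1, NA, 3), b = c(4, 5, 6))
means <- map_dbl(xs, mean, na.rm = TRUE)

# A pipeline that starts with . defines a reusable function
root_of_sum <- . %>% sum() %>% sqrt()
root_of_sum(c(9, 16))

# map2() makes it explicit that two inputs are iterated in parallel
sums <- map2_dbl(1:3, 4:6, ~ .x + .y)
```

Each call returns a plain numeric vector, so the results drop straight into ordinary R code.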

Timings

OK, so how about some measurements of performance. Let us create a 500 x 10,000 matrix of random values, along with a grouping vector f that assigns each of the 500 rows to one of five groups.

# Some data
nvars <- 10000
nsamples <- 500
sample_groups <- 5
MAT <- replicate(nvars, runif(n=nsamples))
 
# And a grouping vector:
 
f <- rep_len(1:sample_groups, nsamples)
f <- LETTERS[f]

The first task is to calculate the mean of every column within each group. First, a base R approach
using aggregate(), with the aggregation function passed in as a setting.

# Settings
aggr_FUN  <- mean
combi_FUN <- function(x,y) "/"(x,y) 
 
# helper function
pasteC <- function(x,y) paste(x,y,sep=" - ")
 
# aggregate
system.time({
temp2 <- aggregate(. ~ class, data = cbind.data.frame(class=f,MAT), aggr_FUN)
})

which yields

user system elapsed
13.457 1.187 14.766

Here’s an approach with reshape2

# reshape2
library(reshape2)
system.time({
temp3 <- recast(data.frame(class=f,MAT),class ~ variable,id.var="class",aggr_FUN)
})

which has

user system elapsed
1.945 0.454 2.525

That is roughly 6x faster by elapsed time (7x by user time). Finally, here is a purrr approach. First, look at the elegance of the representation. Then look at the timings.

# purrr 
library(purrr)
system.time({
    tmp <- data.frame(class = f, MAT) %>%
        slice_rows("class") %>%
        by_slice(map, aggr_FUN)
})

user system elapsed
0.512 0.043 0.569

Another 4x speedup, or roughly 26x faster than the original approach with aggregate. Impressive. The purrr work deserves to be
looked at and picked up by R devs, as it is both elegant and performant.
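
The speedup figures depend on which column of system.time() you compare. A small sketch that recomputes them from the timings reported in this post (the numbers are copied from above, not re-measured, and the variable names are illustrative):

```r
# Timings as reported in this post, in seconds
timings <- rbind(
  aggregate = c(user = 13.457, system = 1.187, elapsed = 14.766),
  reshape2  = c(user = 1.945,  system = 0.454, elapsed = 2.525),
  purrr     = c(user = 0.512,  system = 0.043, elapsed = 0.569)
)

elapsed <- timings[, "elapsed"]
user    <- timings[, "user"]

# Speedup of each approach relative to aggregate()
speedup_elapsed <- elapsed[["aggregate"]] / elapsed
speedup_user    <- user[["aggregate"]] / user

round(rbind(elapsed = speedup_elapsed, user = speedup_user), 1)
```

These are single runs on one machine, so treat the ratios as indicative rather than as a careful benchmark.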

The first ten columns of the result look like this:

tmp[,1:10]
Source: local data frame [5 x 10]
 
  class        X1        X2        X3        X4        X5        X6        X7        X8        X9
1     A 0.5194124 0.5066943 0.5326734 0.5042122 0.4190162 0.4882796 0.4947138 0.4701085 0.4982535
2     B 0.5267829 0.4545410 0.4883640 0.4894278 0.4672661 0.4477106 0.4832262 0.4583598 0.4767773
3     C 0.4703151 0.4994032 0.4842406 0.4960585 0.5276044 0.4817216 0.4853307 0.5331066 0.4881527
4     D 0.5139762 0.5318747 0.5071466 0.4657025 0.4972884 0.4815889 0.5049296 0.4685044 0.5535197
5     E 0.5439962 0.4479991 0.4640088 0.4946168 0.4716724 0.5370196 0.5011706 0.5219855 0.5160875