Divide and parallelize large data problems with Rcpp


by Błażej Moska, computer science student and data science intern

Got stuck with too large a dataset? R speed drives you mad? Divide, parallelize and go with Rcpp!

One of the frustrating moments when working with data is when you need results urgently, but your dataset is too large to deliver them in time. This often happens when we need to use an algorithm with high computational complexity. I will demonstrate it with an example I have been working on.

Suppose we have a large dataset consisting of association rules. For some reason we want to slim it down. Whenever two rules' consequents are the same and one rule's antecedent is a subset of the other rule's antecedent, we want to keep the rule with the smaller antecedent (the probability of observing the smaller item set is higher than the probability of observing the larger one). This is illustrated below:

{A,B,C}=>{D}

{E}=>{F}

{A,B}=>{D}

{A}=>{D}

How can we achieve that? For example, using the pseudo-algorithm below:

For i=1 to n:
  For j=i+1 to n:
    if consequents[i] = consequents[j]:
      if antecedent[i] contains antecedent[j], flag rule i with 1
      else if antecedent[j] contains antecedent[i], flag rule j with 1
      # otherwise both rules keep flag 0
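To make the loop concrete, here is a plain R sketch. The function name prune_rules and the data layout (a data frame rules with a list-column antecedent holding each rule's item set and a character column consequent) are assumptions made for illustration; the original analysis does not show its R code.

prune_rules <- function(rules) {
  n    <- nrow(rules)
  flag <- integer(n)                      # 1 marks a rule to be dropped
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (rules$consequent[i] == rules$consequent[j]) {
        if (all(rules$antecedent[[j]] %in% rules$antecedent[[i]])) {
          flag[i] <- 1                    # antecedent j sits inside antecedent i: drop the larger rule i
        } else if (all(rules$antecedent[[i]] %in% rules$antecedent[[j]])) {
          flag[j] <- 1                    # antecedent i sits inside antecedent j: drop the larger rule j
        }
      }
    }
  }
  flag
}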

How many operations do we need to perform with this simple algorithm?

For i = 1 we need to iterate \(n-1\) times, for i = 2 \(n-2\) times, for i = 3 \(n-3\) times and so on, down to \(n-(n-1)=1\) for the last value of i. Summing these gives the well-known arithmetic series:

\[
\sum_{i=1}^{n-1}{i} = \frac{n(n-1)}{2}
\]

So the above algorithm has asymptotic complexity of \(O(n^2)\). It means, more or less, that the computation time grows with the square of the size of the data. For a dataset containing around 1,300,000 records this becomes a serious issue: roughly \(1.3\cdot10^{6}\,(1.3\cdot10^{6}-1)/2 \approx 8.4\times10^{11}\) pair comparisons. With R I was unable to perform the computation in reasonable time. Since a compiled language performs much better with simple arithmetic operations like these, the second idea was to use Rcpp. It is faster to some extent, but with such a large data frame I was still unable to get results in a satisfying time; a rough sketch of such a compiled function is shown below.
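For illustration, such a function could first be prototyped inline with Rcpp::cppFunction, using the same assumed data layout as in the R sketch above. This is only a sketch, not the code used in the original analysis:

library(Rcpp)

cppFunction('
IntegerVector pruneRulesCpp(DataFrame rules) {
  List antecedent = rules["antecedent"];          // list-column of item sets
  std::vector<std::string> consequent =
    as< std::vector<std::string> >(rules["consequent"]);
  int n = rules.nrows();
  IntegerVector flag(n);                          // 1 marks a rule to be dropped
  for (int i = 0; i < n - 1; i++) {
    CharacterVector ai = antecedent[i];
    for (int j = i + 1; j < n; j++) {
      if (consequent[i] != consequent[j]) continue;
      CharacterVector aj = antecedent[j];
      if (is_true(all(in(aj, ai)))) {
        flag[i] = 1;                              // antecedent j is contained in antecedent i
      } else if (is_true(all(in(ai, aj)))) {
        flag[j] = 1;                              // antecedent i is contained in antecedent j
      }
    }
  }
  return flag;
}')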

So are there any other options? Yes, there are. If we take a look at our dataset, we can see that it can be aggregated in such a way that each individual “chunk” consists of records with exactly the same consequent:

{A,B}=>{D}

{A}=>{D}

{C,G}=>{F}

{Y}=>{F}
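Under the same assumed data frame layout, that aggregation is a one-liner in base R (illustrative, not from the original post):

chunks <- split(rules, rules$consequent)   # one data frame of rules per distinct consequent
length(chunks)                             # about 3300 chunks for the dataset described here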

After such a division I got 3300 chunks, so the average number of observations per chunk was around 400. The next step was to rerun the algorithm sequentially, chunk by chunk. Since the algorithm has quadratic complexity, this is much faster than working on the whole dataset at once: roughly \(3300 \times \frac{400 \cdot 399}{2} \approx 2.6\times10^{8}\) comparisons instead of \(8.4\times10^{11}\), about three orders of magnitude fewer. While R failed again, Rcpp finally returned a result (after 5 minutes). But there is still room for improvement. Since the chunks can be processed independently, we can parallelize the computation using, for example, the foreach package (which I demonstrated in a previous article). While passing R functions to foreach is a simple task, parallelizing Rcpp takes a little more work. We need to follow the steps below:

  1. Create a .cpp file containing all the functions you need.
  2. Create a package from it using Rcpp. This can be done, for example, with:
    Rcpp.package.skeleton("nameOfYourPackage", cpp_files = "path_to_your_cpp_file")
  3. Install your Rcpp package from source:
    install.packages("directory_of_your_rcpp_package", repos = NULL, type = "source")
  4. Load your library:
    library(name_of_your_rcpp_package)
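Putting steps 1 to 4 together with hypothetical names (a package called rulePruner built from a source file prune_rules.cpp; neither name comes from the original post), the workflow could look like this:

library(Rcpp)

# Step 2: generate a package skeleton around the C++ source file
Rcpp.package.skeleton("rulePruner", cpp_files = "prune_rules.cpp")

# Step 3: install the generated package from source
install.packages("rulePruner", repos = NULL, type = "source")

# Step 4: load it; workers will attach it through the .packages argument of foreach
library(rulePruner)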

Now you can use your Rcpp function in foreach:

results <- foreach(k = 1:length(chunks),
                   .packages = c("name_of_your_rcpp_package")) %dopar% {
  your_cpp_function(chunks[[k]])   # each worker prunes one chunk independently
}
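Note that %dopar% only runs in parallel once a backend has been registered. A minimal sketch using the doParallel package (any foreach backend would do; this part is not shown in the original post):

library(doParallel)

cl <- makeCluster(parallel::detectCores() - 1)   # leave one core free
registerDoParallel(cl)                           # register the backend used by %dopar%

# ... run the foreach loop shown above ...

stopCluster(cl)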

Even with foreach I waited forever for the pure R version, but Rcpp returned the results in approximately 2.5 minutes. Not too bad!

Here are some conclusions. Firstly, it's worth knowing more languages/tools than just R. Secondly, there is often an escape from the large-dataset trap. There is little chance that somebody will need to do exactly the same task as in the example above, but a much higher probability that someone will face a similar problem that can be solved in the same way.
