Divide and parallelize large data problems with Rcpp


by Błażej Moska, computer science student and data science intern

Got stuck with too large a dataset? R speed drives you mad? Divide, parallelize and go with Rcpp!

One of the frustrating moments while working with data is when you need results urgently, but your dataset is large enough to make that impossible. This often happens when we need to use an algorithm with high computational complexity. I will demonstrate this with an example I have been working on.

Suppose we have a large dataset consisting of association rules. For some reason we want to slim it down. Whenever two rules have the same consequent and one rule’s antecedent is a subset of the other rule’s antecedent, we want to keep the smaller rule (the probability of obtaining a smaller set is higher than the probability of obtaining a bigger set). This is illustrated below:

{A,B,C}=>{D}

{E}=>{F}

{A,B}=>{D}

{A}=>{D}

How can we achieve that? For example, with the pseudo-algorithm below:

For i = 1 to n:
  For j = i+1 to n:
    if consequents[i] = consequents[j]:
      if antecedent[i] contains antecedent[j]:
        flag antecedent[i] with 1     # rule i has the larger antecedent
      else if antecedent[j] contains antecedent[i]:
        flag antecedent[j] with 1     # rule j has the larger antecedent
      # rules that are never flagged keep 0 and are retained
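
To make the idea concrete, here is a minimal R sketch of that loop. The function name prune_flags() and the data layout (antecedents as a list of character vectors) are my own illustration, not the code used in the original experiment:

prune_flags <- function(antecedents, consequents) {
  n <- length(antecedents)
  flag <- integer(n)                      # 1 = remove this rule, 0 = keep it
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (consequents[i] == consequents[j]) {
        if (all(antecedents[[j]] %in% antecedents[[i]])) {
          flag[i] <- 1                    # rule i has the larger antecedent
        } else if (all(antecedents[[i]] %in% antecedents[[j]])) {
          flag[j] <- 1                    # rule j has the larger antecedent
        }
      }
    }
  }
  flag
}

# The four rules from the example above:
ants <- list(c("A", "B", "C"), "E", c("A", "B"), "A")
cons <- c("D", "F", "D", "D")
prune_flags(ants, cons)   # 1 0 1 0 -> keep {E}=>{F} and {A}=>{D}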

How many operations do we need to perform with this simple algorithm?

For the first value of i we need to iterate \(n-1\) times, for the second \(n-2\) times, for the third \(n-3\), and so on, finally reaching \(n-(n-1)=1\). Summing these up gives:

\[
\sum_{i=1}^{n-1}{i}= \frac{n(n-1)}{2}
\]

So the above has asymptotic complexity of \(O(n^2)\): roughly speaking, the amount of work grows with the square of the size of the data. For a dataset of around 1,300,000 records that means roughly \(8 \times 10^{11}\) pairwise comparisons, which becomes a serious issue. With R I was unable to perform the computation in reasonable time. Since a compiled language performs better with simple arithmetic operations, the second idea was to use Rcpp. Yes, it is faster, to some extent, but with such a large data frame I was still unable to get results in a satisfying time. So are there any other options?
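
For illustration, this is how a function like the R sketch above could be moved to C++ with Rcpp. It is a hypothetical version written for this article (function name prune_flags_cpp() and data layout are my assumptions), not the exact code from the original experiment:

library(Rcpp)

sourceCpp(code = '
#include <Rcpp.h>
#include <algorithm>
#include <set>
#include <string>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector prune_flags_cpp(List antecedents, CharacterVector consequents) {
  int n = antecedents.size();
  std::vector<std::string> cons = as< std::vector<std::string> >(consequents);

  // Convert every antecedent to a sorted std::set once, up front.
  std::vector< std::set<std::string> > ants(n);
  for (int i = 0; i < n; i++) {
    std::vector<std::string> a = as< std::vector<std::string> >(antecedents[i]);
    ants[i] = std::set<std::string>(a.begin(), a.end());
  }

  IntegerVector flag(n);                  // 1 = remove this rule, 0 = keep it
  for (int i = 0; i < n - 1; i++) {
    for (int j = i + 1; j < n; j++) {
      if (cons[i] != cons[j]) continue;   // different consequents: nothing to do
      // std::includes(A, B) is true when B is a subset of A (both sorted)
      if (std::includes(ants[i].begin(), ants[i].end(),
                        ants[j].begin(), ants[j].end()))
        flag[i] = 1;                      // rule i has the larger antecedent
      else if (std::includes(ants[j].begin(), ants[j].end(),
                             ants[i].begin(), ants[i].end()))
        flag[j] = 1;                      // rule j has the larger antecedent
    }
  }
  return flag;
}')

prune_flags_cpp(ants, cons)   # same flags as the R version: 1 0 1 0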

Yes, there are. If we take a look at our dataset, we can see that it can be split in such a way that each individual “chunk” consists of records with exactly the same consequent:

{A,B}=>{D}

{A}=>{D}

{C,G}=>{F}

{Y}=>{F}
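
In R, one simple way to perform this split (assuming a hypothetical data frame rules with columns antecedent and consequent) is:

# Split the rules into one chunk per distinct consequent
chunks <- split(rules, rules$consequent)
length(chunks)                          # number of chunks
mean(vapply(chunks, nrow, integer(1)))  # average chunk size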

After such a division I got 3,300 chunks, so the average number of observations per chunk was around 400. The next step was to rerun the algorithm sequentially on each chunk. Since our algorithm has quadratic complexity, it is much faster to do it that way than on the whole dataset at once: roughly \(3300 \cdot 400^2/2 \approx 2.6 \times 10^{8}\) pairwise comparisons instead of about \(8 \times 10^{11}\), a reduction of more than three orders of magnitude. While R failed again, Rcpp finally returned a result (after 5 minutes). But there is still room for improvement. Since the chunks can be processed independently, we can perform the computation in parallel using, for example, the foreach package (which I demonstrated in a previous article). While passing R functions to foreach is a simple task, parallelizing Rcpp is a little more involved. We need to perform the steps below:

  1. Create a .cpp file that includes all of the functions needed
  2. Create a package using Rcpp. This can be achieved with, for example:
    Rcpp.package.skeleton("nameOfYourPackage", cpp_files = "directory_of_your_cpp_file")
  3. Install your Rcpp package from source:
    install.packages("directory_of_your_rcpp_package", repos=NULL, type="source")
  4. Load your library:
    library(name_of_your_rcpp_package)
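
Putting the four steps together with hypothetical names (assuming the prune_flags_cpp() function sketched earlier has been saved, with its Rcpp attributes, in a file called prune_rules.cpp in the working directory):

library(Rcpp)
# steps 1-2: wrap the .cpp file in a package skeleton
Rcpp.package.skeleton("rulePruner", cpp_files = "prune_rules.cpp")
# step 3: install the package from the source directory just created
install.packages("./rulePruner", repos = NULL, type = "source")
# step 4: load it like any other package
library(rulePruner)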

Now you can use your Rcpp function in foreach:

results <- foreach(k = 1:length(chunks),
                   .packages = c("name_of_your_rcpp_package")) %dopar%
                                 { your_cpp_function(chunks[[k]]) }
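
Note that foreach needs a parallel backend registered before %dopar% actually runs in parallel; one common choice (my assumption here, the original does not say which backend was used) is doParallel:

library(foreach)
library(doParallel)

cl <- makeCluster(parallel::detectCores() - 1)   # leave one core free
registerDoParallel(cl)

# ... run the foreach call above ...

stopCluster(cl)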

Even with foreach I waited forever for the R results, but Rcpp gave them in approximately 2.5 minutes. Not too bad!

Here are some conclusions. Firstly, it is worth knowing more languages/tools than just R. Secondly, there is often an escape from the large-dataset trap. There is little chance that somebody will face exactly the same task as in the example above, but a much higher chance that someone will face a similar problem that can be solved in the same way.
