Reposted from: http://stackoverflow.com/questions/34529953/why-is-the-fold-action-necessary-in-spark
Empty RDD
fold cannot be replaced by reduce when the RDD is empty:
val rdd = sc.emptyRDD[Int]
rdd.reduce(_ + _)
// java.lang.UnsupportedOperationException: empty collection at
// org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$ ...
rdd.fold(0)(_ + _)
// Int = 0
You can of course combine reduce with a check on isEmpty, but it is rather ugly, as the sketch below shows.
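For comparison, a minimal sketch of that workaround (using the empty RDD[Int] from above and 0 as the default):
val sum = if (rdd.isEmpty()) 0 else rdd.reduce(_ + _)
// Int = 0, but isEmpty() launches an extra job just to check for elements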
Mutable buffer
Another use case for fold is aggregation with a mutable buffer. Consider the following RDD:
import breeze.linalg.DenseVector
val rdd = sc.parallelize(Array.fill(100)(DenseVector(1)), 8)
Let's say we want the sum of all elements. A naive solution is to simply reduce with +:
rdd.reduce(_ + _)
Unfortunately, it creates a new vector for each element. Since object creation and subsequent garbage collection are expensive, it could be better to use a mutable object. This is not possible with reduce (immutability of the RDD doesn't imply immutability of the elements), but it can be achieved with fold as follows:
rdd.fold(DenseVector(0))((acc, x) => acc += x)
The zero element is used here as a mutable buffer, initialized once per partition, leaving the actual data untouched.
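To see the difference between the two operators outside Spark, here is a small local sketch with Breeze (not part of the original answer; the object identities are the point: + allocates, += mutates in place):
import breeze.linalg.DenseVector

val a = DenseVector(1)
val b = DenseVector(1)
val c = a + b      // allocates a fresh result vector; a is unchanged
a += b             // mutates a in place; no new vector is created
println(c eq a)    // false: + built a new object
println(a)         // DenseVector(2): the buffer itself was updated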
Note that fold applies the operation as acc = op(obj, acc) rather than acc = op(acc, obj). For why this argument order is used, see SPARK-6416 and SPARK-7683.
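The order only matters when op is not commutative. A quick local illustration with plain Scala collections (not Spark):
val xs = Seq("a", "b", "c")
xs.foldLeft("z")((acc, x) => acc + x)  // acc = op(acc, obj) => "zabc"
xs.foldLeft("z")((acc, x) => x + acc)  // acc = op(obj, acc) => "cbaz"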