A summary of aggregation functions in Spark

Functions in PairRDDFunctions:

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

Aggregate the values of each key, using given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s, as in scala.TraversableOnce. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.

def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
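
A minimal sketch of the single-argument overload, assuming a local SparkContext named sc and made-up sample data. It builds a per-key (sum, count) pair, so the result type U = (Int, Int) differs from the value type V = Int:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("aggregateByKey-demo").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one V into the running U within a partition
  (a, b) => (a._1 + b._1, a._2 + b._2)    // combOp: merge two U's across partitions
)
sumCount.collect().foreach(println)       // e.g. (a,(3,2)) and (b,(3,1))
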
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]

Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.
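
For example, the classic word-count pattern (assuming the same sc and toy data as the sketch above):

val words = sc.parallelize(Seq("spark", "rdd", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // values are combined locally per partition, then shuffled
counts.collect().foreach(println)                        // (spark,2), (rdd,1)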

def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]


Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a Map. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.
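
A short sketch (same assumed sc). Note that the result is a local scala.collection.Map on the driver rather than an RDD, so it is only appropriate when the number of distinct keys is small:

val localCounts: scala.collection.Map[String, Int] =
  sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).reduceByKeyLocally(_ + _)
println(localCounts)   // Map(a -> 3, b -> 3)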

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]


Group the values for each key in the RDD into a single sequence. Allows controlling the partitioning of the resulting key-value pair RDD by passing a Partitioner. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.

  • Note

    As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError. This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
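
A minimal sketch of the Partitioner overload, assuming the same sc and an illustrative 4-partition HashPartitioner; remember that the order of values inside each group is not guaranteed:

import org.apache.spark.HashPartitioner

val grouped = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
  .groupByKey(new HashPartitioner(4))
grouped.mapValues(_.toList).collect().foreach(println)   // e.g. (a,List(1, 2)) and (b,List(3))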

def combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]


Simplified version of combineByKeyWithClassTag that hash-partitions the resulting RDD using the existing partitioner/parallelism level. This method is here for backward compatibility. It does not provide combiner classtag information to the shuffle.

  • See also

    combineByKeyWithClassTag

def combineByKeyWithClassTag[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C)(implicit ct: ClassTag[C]): RDD[(K, C)]


Simplified version of combineByKeyWithClassTag that hash-partitions the resulting RDD using the existing partitioner/parallelism level.

  • Annotations

    @Experimental()
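
A minimal sketch of a per-key average built with combineByKey, again assuming the same sc and made-up scores; the (Double, Int) combiner type C holding a running (sum, count) is an illustrative choice, not part of the API:

val scores = sc.parallelize(Seq(("math", 90.0), ("math", 70.0), ("english", 80.0)))
val avg = scores.combineByKey(
  (v: Double) => (v, 1),                                  // createCombiner: first value seen for a key in a partition
  (c: (Double, Int), v: Double) => (c._1 + v, c._2 + 1),  // mergeValue: fold another value into the combiner
  (c1: (Double, Int), c2: (Double, Int)) => (c1._1 + c2._1, c1._2 + c2._2)  // mergeCombiners: merge across partitions
).mapValues { case (sum, count) => sum / count }
avg.collect().foreach(println)   // (math,80.0), (english,80.0)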
