Understanding Spark's aggregate function

奋斗的瘦胖子

Category: spark

Article link: https://blog.csdn.net/QQ1131221088/article/details/104106328

Let's start with the source code (PySpark):

def aggregate(self, zeroValue, seqOp, combOp):
        """
        Aggregate the elements of each partition, and then the results for all
        the partitions, using a given combine functions and a neutral "zero
        value."

        The functions C{op(t1, t2)} is allowed to modify C{t1} and return it
        as its result value to avoid object allocation; however, it should not
        modify C{t2}.

        The first function (seqOp) can return a different result type, U, than
        the type of this RDD. Thus, we need one operation for merging a T into
        an U and one operation for merging two U

        >>> seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
        >>> combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
        >>> sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)
        (10, 4)
        >>> sc.parallelize([]).aggregate((0, 0), seqOp, combOp)
        (0, 0)
        """
        seqOp = fail_on_stopiteration(seqOp)
        combOp = fail_on_stopiteration(combOp)

        def func(iterator):
            acc = zeroValue
            for obj in iterator:
                acc = seqOp(acc, obj)
            yield acc
        # collecting result of mapPartitions here ensures that the copy of
        # zeroValue provided to each partition is unique from the one provided
        # to the final reduce call
        vals = self.mapPartitions(func).collect()
        return reduce(combOp, vals, zeroValue)

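Definition of aggregate:
aggregate is an aggregation function on an RDD. Given two combine functions (seqOp and combOp) and an initial value (zeroValue), it reduces the data of all partitions to a single result.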

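Advantage: the elements are first aggregated within each partition, and only then are the per-partition results combined. As the source shows, just one value per partition is collected for the final merge, which makes it relatively efficient.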
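The two-stage flow can be mimicked without Spark. The following is only a sketch in plain Python: `local_aggregate` is a hypothetical helper (not part of PySpark), and the split of the data into two partitions is an assumption, since Spark decides the actual partitioning.

```python
from functools import reduce


def local_aggregate(partitions, zero_value, seq_op, comb_op):
    """Mimic RDD.aggregate on a list of lists (one inner list per partition)."""
    # Stage 1: fold each partition on its own, starting from zero_value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Stage 2: merge the per-partition results, again starting from zero_value.
    return partials, reduce(comb_op, partials, zero_value)


seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)   # merge one element into (sum, count)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])  # merge two (sum, count) pairs

# Assumed layout: [1, 2, 3, 4] split into two partitions.
partials, result = local_aggregate([[1, 2], [3, 4]], (0, 0), seq_op, comb_op)
print(partials)  # [(3, 2), (7, 2)]
print(result)    # (10, 4)
```

Only the two partial tuples take part in the second stage, not the four original elements.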

For example:

>>> seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
>>> combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
>>> sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)
(10, 4)
>>> sc.parallelize([]).aggregate((0, 0), seqOp, combOp)
(0, 0)
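A (sum, count) pair like the one returned here is usually one step away from a mean. A small sketch of that last step follows; `mean_from_aggregate` is a hypothetical helper name, and returning `None` for the empty case is just one possible choice.

```python
def mean_from_aggregate(sum_and_count):
    """Turn the (sum, count) tuple produced by aggregate into a mean."""
    total, count = sum_and_count
    # An empty RDD yields (0, 0), so guard against dividing by zero.
    return total / count if count else None


print(mean_from_aggregate((10, 4)))  # 2.5
print(mean_from_aggregate((0, 0)))   # None
```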

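In the example, seqOp is the aggregation function applied within each partition (it merges one element into the accumulator), and combOp is the aggregation function applied across partitions (it merges two accumulators).
Like fold, aggregate also takes an initial value, zeroValue, which should be neutral for both functions.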
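Why the initial value should be neutral follows from the source: zeroValue seeds every partition and is then used once more in the final reduce. The sketch below imitates this in plain Python with addition for both steps, which is the situation of fold, where a single function plays both roles. The two-partition layout is an assumption, and `total_with_zero` is a hypothetical name.

```python
from functools import reduce
from operator import add

# Assumed layout: two partitions. On a real cluster the count depends on Spark.
partitions = [[1, 2], [3, 4]]


def total_with_zero(zero_value):
    # zero_value seeds each partition ...
    partials = [reduce(add, part, zero_value) for part in partitions]
    # ... and is used once more when the partial results are merged.
    return reduce(add, partials, zero_value)


print(total_with_zero(0))   # 10
print(total_with_zero(10))  # 40
```

With a non-neutral value, the result contains it once per partition plus once for the final merge, so it changes with the number of partitions.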
