How the aggregate action in Spark works

heayin123

Article link: https://blog.csdn.net/u012684933/article/details/45919851

The example code is as follows:

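import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public final class BasicAvg {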
    public static class AvgCount implements Serializable {
        public AvgCount(int total, int num) {
            total_ = total;
            num_ = num;
        }
        public int total_;
        public int num_;
        public float avg() {
            return total_ / (float) num_;
        }
    }

    public static void main(String[] args) throws Exception {
        String master;
        if (args.length > 0) {
            master = args[0];
        } else {
            master = "local";
        }

        JavaSparkContext sc = new JavaSparkContext(
                master, "basicavg", System.getenv("SPARK_HOME"), System.getenv("JARS"));
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(8, 2, 3, 4));
        Function2<AvgCount, Integer, AvgCount> addAndCount = new Function2<AvgCount, Integer, AvgCount>() {
            @Override
            public AvgCount call(AvgCount a, Integer x) {
                a.total_ += x;
                a.num_ += 1;
                return a;
            }
        };
        Function2<AvgCount, AvgCount, AvgCount> combine = new Function2<AvgCount, AvgCount, AvgCount>() {
            @Override
            public AvgCount call(AvgCount a, AvgCount b) {
                a.total_ += b.total_;
                a.num_ += b.num_;
                return a;
            }
        };
        AvgCount initial = new AvgCount(10,20);
        AvgCount result = rdd.aggregate(initial, addAndCount, combine);
        System.out.println(result.avg());
        sc.stop();
    }
}

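Set breakpoints at the positions shown in the screenshot (breakpoint 1 inside the call method of addAndCount, breakpoint 2 inside the call method of combine):

After starting the debugger, the program first stops at breakpoint 1. Screenshot of the first stop:

This shows that when the call callback of addAndCount runs, its first argument initially holds the value of initial, and its second argument is an element of rdd.

Continuing execution, the screenshot at the second stop at breakpoint 1:

Conclusion: the addAndCount function passed to rdd.aggregate adds the elements of rdd, one after another, onto initial.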
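To make the four stops at breakpoint 1 concrete, here is a minimal plain-Java sketch of the accumulation, assuming all four elements are folded in order starting from the values of initial (10, 20). It does not use Spark at all; the class SeqOpTrace and its trace method are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java trace of what breakpoint 1 observes; no Spark classes are used.
final class SeqOpTrace {
    // Returns the accumulator state {total, num} after each element is folded in.
    static List<int[]> trace(int initialTotal, int initialNum, List<Integer> elements) {
        List<int[]> states = new ArrayList<>();
        int total = initialTotal;
        int num = initialNum;
        for (int x : elements) {
            total += x;   // same update as addAndCount: a.total_ += x
            num += 1;     // same update as addAndCount: a.num_ += 1
            states.add(new int[] {total, num});
        }
        return states;
    }

    public static void main(String[] args) {
        // Same starting value and elements as in BasicAvg.
        for (int[] s : trace(10, 20, Arrays.asList(8, 2, 3, 4))) {
            System.out.println("total=" + s[0] + ", num=" + s[1]);
        }
    }
}
```

Under these assumptions the accumulator moves through (18, 21), (20, 22), (23, 23) and ends at (27, 24), one step per stop at breakpoint 1.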

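Continuing further, breakpoint 1 is hit 4 times in total, because rdd has only 4 elements.

After those 4 stops, breakpoint 2 is hit once. Screenshot at that point:

This shows that combine merges the value of initial with the result that addAndCount accumulated over its 4 calls.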

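So rdd.aggregate() first accumulates the elements of rdd (the starting value of the accumulation is arbitrary and is given by the first argument of aggregate), and then merges the accumulated result with the initial value (again the first argument of aggregate). In this run initial is therefore counted twice, so the program should print 37 / 44 ≈ 0.84 rather than the true average 17 / 4 = 4.25. The single stop at breakpoint 2 indicates that the data sat in one partition; with several partitions, each partition starts its accumulation from its own copy of initial and combine is called once per partition result, so initial should normally be a neutral value such as new AvgCount(0, 0).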
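The effect of the initial value and of the number of partitions can be sketched without Spark. The model below is only an approximation of the behaviour traced above: AggregateModel is a hypothetical class, the split into [8, 2] and [3, 4] is an assumed partitioning (the real split depends on the Spark configuration), and partition results are merged in list order, which Spark does not guarantee.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the aggregate behaviour traced above; no Spark classes are used.
final class AggregateModel {
    // Returns {total, num, seqOpCalls, combOpCalls} for the given zero value and partitions.
    static int[] aggregate(int zeroTotal, int zeroNum, List<List<Integer>> partitions) {
        int seqCalls = 0;
        int combCalls = 0;
        // The merged result also starts from the zero value.
        int total = zeroTotal;
        int num = zeroNum;
        for (List<Integer> partition : partitions) {
            // Each partition folds its elements into its own copy of the zero value.
            int pTotal = zeroTotal;
            int pNum = zeroNum;
            for (int x : partition) {
                pTotal += x;
                pNum += 1;
                seqCalls++;
            }
            // One combine call per partition result.
            total += pTotal;
            num += pNum;
            combCalls++;
        }
        return new int[] {total, num, seqCalls, combCalls};
    }

    public static void main(String[] args) {
        List<List<Integer>> one = Arrays.asList(Arrays.asList(8, 2, 3, 4));
        List<List<Integer>> two = Arrays.asList(Arrays.asList(8, 2), Arrays.asList(3, 4));
        System.out.println(Arrays.toString(aggregate(10, 20, one))); // [37, 44, 4, 1]
        System.out.println(Arrays.toString(aggregate(10, 20, two))); // [47, 64, 4, 2]
        System.out.println(Arrays.toString(aggregate(0, 0, two)));   // [17, 4, 4, 2]
    }
}
```

With one partition the model reproduces the debugging session: 4 accumulation calls, 1 combine call, and a result of (37, 44). With two partitions the non-neutral initial value is counted three times, while a zero value of (0, 0) gives (17, 4) regardless of the split.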
