Spark: differences between groupByKey, reduceByKey, sortByKey, cogroup, and join [explanation + code examples]

Latest recommended article published on 2023-01-03 17:01:00

道法—自然


Article link: https://blog.csdn.net/wyqwilliam/article/details/81623626


1. reduceByKey(func, numPartitions=None)
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce. Output will be hash-partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified.
In other words, reduceByKey merges the multiple values that belong to each key. Most importantly, it can perform the merge locally first, and the merge operation itself can be customized by passing a function.

All values in the dataset that share a key are aggregated together with the specified reduce function.

# -*- coding:utf-8 -*-
import os
from operator import add

from pyspark import SparkConf
from pyspark import SparkContext

if __name__ == '__main__':
    # Point SPARK_HOME at the local Spark installation
    os.environ["SPARK_HOME"] = "/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6"
    conf = SparkConf().setMaster('local').setAppName('reduce')
    sc = SparkContext(conf=conf)
    data = [('tom', 90), ('jerry', 97), ('luck', 92), ('tom', 78), ('luck', 64), ('jerry', 50)]
    rdd = sc.parallelize(data)
    # Sum the scores per key; sorted() keeps the printed order deterministic
    print(sorted(rdd.reduceByKey(add).collect()))
    # Shut down the SparkContext
    sc.stop()
# Output
[('jerry', 147), ('luck', 156), ('tom', 168)]
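
The description above asks for an associative reduce function. A minimal sketch of why that matters, in plain Python without Spark: the helper `reduce_in_partitions` and the two ways of splitting the values are hypothetical, chosen only to imitate values for one key being reduced partition by partition and then merged.

```python
from functools import reduce
from operator import add, sub


def reduce_in_partitions(func, partitions):
    """Reduce each partition locally, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


# The same four values for one key, split across partitions in two ways.
split_a = [[90, 78], [64, 50]]
split_b = [[90], [78, 64, 50]]

# Addition is associative: the split does not change the answer.
add_a = reduce_in_partitions(add, split_a)
add_b = reduce_in_partitions(add, split_b)

# Subtraction is not associative: the answer depends on the split.
sub_a = reduce_in_partitions(sub, split_a)
sub_b = reduce_in_partitions(sub, split_b)

print(add_a, add_b)  # 282 282
print(sub_a, sub_b)  # -2 126
```

Because the way records are distributed over partitions is not something the caller controls, a function such as subtraction would give results that vary with the data layout, whereas addition, max, or min would not.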

With reduceByKey, Spark can combine the output records that share a key within each partition before the data is moved.
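
The local combining step can be imitated in plain Python. This is only a sketch of the idea, not Spark's implementation: the functions `combine_locally` and `reduce_by_key` and the two-partition layout of the records are assumptions made for illustration.

```python
from operator import add


def combine_locally(partition, func):
    """Merge values that share a key inside one partition (the map-side step)."""
    merged = {}
    for key, value in partition:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def reduce_by_key(partitions, func):
    """Combine inside each partition first, then merge the partial results."""
    partials = [combine_locally(p, func) for p in partitions]
    # Number of records that would still have to be moved after local combining
    shuffled_records = sum(len(p) for p in partials)
    final = {}
    for partial in partials:
        for key, value in partial.items():
            final[key] = func(final[key], value) if key in final else value
    return final, shuffled_records


# Hypothetical layout: six records split over two partitions.
partitions = [
    [('tom', 90), ('jerry', 97), ('tom', 78)],
    [('luck', 92), ('luck', 64), ('jerry', 50)],
]
totals, shuffled = reduce_by_key(partitions, add)
print(sorted(totals.items()))  # [('jerry', 147), ('luck', 156), ('tom', 168)]
print(shuffled)                # 4
```

In this layout only four partial records need to be moved instead of the original six; the saving grows with the number of repeated keys per partition.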

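To decide which machine each pair is moved to, Spark calls a partitioning function on the pair's key. When more data is moved onto a single executor machine than fits in its memory, Spark writes the data to disk. However, it writes the data out one key at a time, so if a single key has more key-value pairs than fit in memory, an out-of-memory exception occurs. Later Spark releases are expected to handle this more gracefully, and that work can continue to be improved. Even so, writing data to disk should still be avoided; this ...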
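
The idea of choosing a destination from the key alone can be sketched in plain Python. Everything here is an assumption for illustration: `partition_for` and `shuffle` are hypothetical helpers, and `zlib.crc32` merely stands in for a hash function because it is stable across Python runs; it is not the function Spark uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition index from the key alone (stand-in for a hash partitioner)."""
    # crc32 is an assumption of this sketch, chosen for run-to-run stability.
    return zlib.crc32(key.encode('utf-8')) % num_partitions


def shuffle(records, num_partitions):
    """Send every record to the bucket selected by its key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets


records = [('tom', 90), ('jerry', 97), ('luck', 92),
           ('tom', 78), ('luck', 64), ('jerry', 50)]
buckets = shuffle(records, 2)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Since the destination depends only on the key, every record for a given key lands in the same bucket. That is what makes per-key aggregation possible, and it is also why a single key with very many records can overload one machine.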
