Difference between groupByKey and reduceByKey

笛在月明

If we compare the results of the two transformations, “groupByKey” and “reduceByKey”, they are the same, so you may be wondering what the difference is. The “reduceByKey” transformation first combines the values for each key within each partition, so each partition holds at most one value per key before the shuffle. After the shuffle, in the reduce phase, the executors apply the same operation to the partial results, in my case a sum: lambda x, y: x + y.
[Figure: reduceByKey combines the values for each key within each partition before the shuffle]

Source: Databricks

With the “groupByKey” transformation, the values for each key are not combined within each partition. All of the data is shuffled directly, and only then are the values for each key merged. “groupByKey” therefore moves much more data across the network to produce the same answer, so it is better to use “reduceByKey” when a large amount of data would otherwise be shuffled.
[Figure: groupByKey shuffles every key-value pair before the values are merged]
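
The difference in shuffle volume can be sketched without Spark at all. The plain-Python simulation below is only an illustration, not how Spark is implemented: partitions are modelled as lists, and the "shuffle" is simply the list of records that would leave their partition. The data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 2)],
    [("a", 4), ("b", 3), ("b", 5)],
]

def group_by_key_sum(parts):
    """groupByKey style: shuffle every record, then sum per key."""
    shuffled = [pair for part in parts for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def reduce_by_key_sum(parts):
    """reduceByKey style: sum inside each partition, then shuffle the partial sums."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

group_result, group_shuffled = group_by_key_sum(partitions)
reduce_result, reduce_shuffled = reduce_by_key_sum(partitions)
print(group_result, group_shuffled)    # {'a': 7, 'b': 9} 6
print(reduce_result, reduce_shuffled)  # {'a': 7, 'b': 9} 4
```

With this toy input the totals match, but the groupByKey-style path shuffles all 6 records while the reduceByKey-style path shuffles only 4 partial sums. The gap grows with the number of repeated keys per partition.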