Understanding reduceByKey and groupByKey on RDDs

Julian Win

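Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/perfer258/article/details/82082299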

Data preparation

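import org.apache.spark.{SparkConf, SparkContext}

// Sample words: "a" and "c" appear twice, "b" once
val words = Array("a", "a", "b", "c", "c")
// Run locally in a single JVM
val conf = new SparkConf().setAppName("word-count").setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(words)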

The reduceByKey method

rdd.map((_,1)).reduceByKey(_+_).collect().foreach(println)

The groupByKey method

rdd.map((_,1)).groupByKey().map(word => (word._1, word._2.sum)).collect().foreach(println)

Both approaches produce the same counts. The API documentation describes the two methods as follows:

reduceByKey
Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.

groupByKey()
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with the existing partitioner/parallelism level. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
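The reduceByKey description asks for an "associative and commutative" function because values are merged in two stages: first within each mapper, then across mappers. The following plain-Scala sketch (no Spark involved) illustrates why. The object name, the sample partitions, and the helper names are hypothetical and only serve the illustration.

```scala
object ReduceOrderSketch {
  // Hypothetical values for a single key, split across two partitions
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4))

  // Reduce inside each partition first (the local merge), then reduce the partial results
  def twoStage(f: (Int, Int) => Int, parts: Seq[Seq[Int]]): Int =
    parts.map(_.reduce(f)).reduce(f)

  // Reduce all values in one pass, as if no local merge had happened
  def singlePass(f: (Int, Int) => Int, parts: Seq[Seq[Int]]): Int =
    parts.flatten.reduce(f)

  def main(args: Array[String]): Unit = {
    val add: (Int, Int) => Int = _ + _
    val sub: (Int, Int) => Int = _ - _
    // Addition is associative and commutative, so both strategies agree
    println(twoStage(add, partitions) == singlePass(add, partitions))
    // Subtraction is neither, so the two-stage result differs
    println(twoStage(sub, partitions) == singlePass(sub, partitions))
  }
}
```

With addition, both strategies yield 10. With subtraction, the two-stage version yields 0 while the single pass yields -8, which is why a function like subtraction is not a safe argument for reduceByKey.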

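Comparing the two descriptions, reduceByKey first merges the values for each key locally on each mapper before sending data to the reducer, whereas groupByKey partitions the computed RDD directly, with no local merge, so every key-value pair is sent through the shuffle.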

Suppose our data file is distributed across two nodes. Then:

[Figure: how reduceByKey works]

[Figure: how groupByKey works]
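The two-node scenario can also be sketched in plain Scala, without Spark, to make the difference in shuffled data concrete. This is a simplified model, not Spark's actual implementation: the split of the five words across the two nodes is an assumption, and `ShuffleSketch` and its helper names are hypothetical.

```scala
object ShuffleSketch {
  // Assumed split of the article's five words across two nodes (hypothetical layout)
  val nodes: Seq[Seq[String]] = Seq(Seq("a", "a", "b"), Seq("c", "c"))

  // reduceByKey style: each node sums its own (word, 1) pairs before the shuffle
  def shuffledWithCombine(ns: Seq[Seq[String]]): Seq[(String, Int)] =
    ns.flatMap(node => node.groupBy(identity).map { case (w, ws) => (w, ws.size) })

  // groupByKey style: every (word, 1) pair crosses the shuffle unchanged
  def shuffledWithoutCombine(ns: Seq[Seq[String]]): Seq[(String, Int)] =
    ns.flatMap(node => node.map(w => (w, 1)))

  // Reducer side: sum whatever arrived for each key
  def reduceSide(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (w, rs) => (w, rs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val combined = shuffledWithCombine(nodes)
    val raw = shuffledWithoutCombine(nodes)
    println(s"records shuffled with local merge: ${combined.size}")
    println(s"records shuffled without local merge: ${raw.size}")
    // Both strategies end with the same per-word counts
    println(reduceSide(combined) == reduceSide(raw))
  }
}
```

Under the assumed split, the local merge sends 3 records across the shuffle instead of 5, and both paths end with the same counts. The gap widens as keys repeat more often within a node.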
