Spark Study Notes (3): Advanced Operators

枣泥馅

Last modified: 2022-12-08 10:15:57

First published: 2021-01-11 15:06:03

Article link: https://blog.csdn.net/u011447164/article/details/112411606

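1. mapPartitionsWithIndex and mapPartitions(func)
Both operators apply a function to each partition of an RDD as a whole, rather than to one element at a time.
def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]
Int: the partition index, starting from 0
Iterator[T]: the elements of the partition
Iterator[U]: the elements returned after processing
Example:
Create an RDD with two partitions: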
val rdd1=sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)
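Define a function that tags each element with the index of its partition:
def func1(index: Int, iter: Iterator[Int]): Iterator[String] = {
  // Prefix every element with the index of the partition that holds it
  iter.map(elem => "PartID: " + index + ", Value: " + elem)
}
With the function defined, apply it to rdd1: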
val rdd2=rdd1.mapPartitionsWithIndex(func1)
rdd2.collect
mapPartitions(func) works the same way, except that func does not receive the partition index.
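The difference between the two operators can be sketched without a cluster. The block below is a minimal plain-Scala model, not Spark itself: `PartitionModel` and its two helper functions are hypothetical stand-ins for the RDD methods, and the partition layout (1 to 4 in partition 0, 5 to 9 in partition 1) is an assumption about how the nine elements are split into two slices.

```scala
// Plain-Scala model of per-partition processing; no Spark dependency.
object PartitionModel {
  // Assumed layout of List(1..9) after splitting into 2 partitions.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7, 8, 9))

  // Hypothetical stand-in for RDD.mapPartitionsWithIndex:
  // f is called once per partition with the partition index and an iterator.
  def mapPartitionsWithIndex[T, U](parts: Seq[Seq[T]])(
      f: (Int, Iterator[T]) => Iterator[U]): Seq[U] =
    parts.zipWithIndex.flatMap { case (part, index) => f(index, part.iterator).toSeq }

  // Hypothetical stand-in for RDD.mapPartitions: the same idea without the index.
  def mapPartitions[T, U](parts: Seq[Seq[T]])(f: Iterator[T] => Iterator[U]): Seq[U] =
    parts.flatMap(part => f(part.iterator).toSeq)

  def main(args: Array[String]): Unit = {
    // One output element per partition, because f runs once per partition.
    println(mapPartitions(partitions)(iter => Iterator(iter.sum)))
    // One output element per input element, each tagged with its partition index.
    println(mapPartitionsWithIndex(partitions)((index, iter) =>
      iter.map(elem => s"PartID: $index, Value: $elem")))
  }
}
```

In spark-shell the corresponding calls would be `rdd1.mapPartitions(iter => Iterator(iter.sum))` and `rdd1.mapPartitionsWithIndex(func1)`.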
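2. aggregate: an aggregation that first runs locally within each partition and then globally across partitions
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
zeroValue: the initial value, used in both the local and the global step
seqOp: the function applied within each partition
combOp: the function that merges the results across partitions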

val rdd1=sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)
rdd1.aggregate(0)(math.max(_, _), _ + _)

seqOp folds the elements inside each partition; combOp then merges the per-partition results that seqOp produced.
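The two steps can be modelled in plain Scala to see where the initial value enters. This is a sketch under stated assumptions rather than Spark's implementation: `AggregateModel.aggregate` is a hypothetical helper, and the partition layout is assumed to be 1 to 4 and 5 to 9.

```scala
// Plain-Scala model of aggregate; no Spark dependency.
object AggregateModel {
  // Hypothetical stand-in for RDD.aggregate over an explicit partition layout.
  def aggregate[T, U](parts: Seq[Seq[T]])(zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // Local step: fold each partition, starting from zeroValue.
    val perPartition = parts.map(_.foldLeft(zeroValue)(seqOp))
    // Global step: merge the partition results, again starting from zeroValue.
    perPartition.foldLeft(zeroValue)(combOp)
  }

  // Assumed layout of List(1..9) after splitting into 2 partitions.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7, 8, 9))

  def main(args: Array[String]): Unit = {
    // Partition maxima are 4 and 9, so the result is 0 + 4 + 9 = 13.
    println(aggregate(partitions)(0)(math.max(_, _), _ + _))
    // With zeroValue = 10 both maxima become 10, and the global step adds 10 again: 30.
    println(aggregate(partitions)(10)(math.max(_, _), _ + _))
  }
}
```

The second call shows why the initial value should normally be neutral for both functions: it is applied once per partition and once more in the global step.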
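aggregateByKey: similar to aggregate, but it operates on (key, value) pairs. seqOp folds the values that share a key within each partition, and combOp then merges the per-partition results for that key.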
Example:
1. Prepare partitioned (key, value) data

val pairRDD=sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("dog",12),("mouse",2)),2)

2. Check how the data is distributed across the partitions.
Define a function to apply to each partition:

def func3(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
  iter.map("PartID:" + index + " ,value:" + _)
}
pairRDD.mapPartitionsWithIndex(func3).collect

The output is:

Array[String] = Array(
PartID:0 ,value:(cat,2), PartID:0 ,value:(cat,5), PartID:0 ,value:(mouse,4),
 PartID:1 ,value:(cat,12), PartID:1 ,value:(dog,12), PartID:1 ,value:(mouse,2)
 )

3. Apply aggregateByKey

(1) For each animal, take the largest count in each zoo (partition) and sum those maxima

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect

Result: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
(2) Sum the counts for each animal

pairRDD.aggregateByKey(0)(_+_,_+_).collect

Result: res28: Array[(String, Int)] = Array((dog,12), (cat,19), (mouse,6))
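Both results can be reproduced with a small plain-Scala model of the operator. This is only a sketch: `AggregateByKeyModel.aggregateByKey` is a hypothetical helper that imitates the local and global steps over the two partitions printed earlier, and it returns a Map instead of an RDD.

```scala
// Plain-Scala model of aggregateByKey; no Spark dependency.
object AggregateByKeyModel {
  // Hypothetical stand-in for aggregateByKey over an explicit partition layout.
  def aggregateByKey[K, V, U](parts: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Local step: within each partition, fold the values of each key from zeroValue.
    val perPartition: Seq[Map[K, U]] = parts.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Global step: merge the per-partition results key by key (zeroValue is not used here).
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(combOp)
    }
  }

  // The two partitions of pairRDD, as printed by mapPartitionsWithIndex.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("cat", 2), ("cat", 5), ("mouse", 4)),
    Seq(("cat", 12), ("dog", 12), ("mouse", 2)))

  def main(args: Array[String]): Unit = {
    // (1) cat: max(2, 5) + 12 = 17, mouse: 4 + 2 = 6, dog: 12
    println(aggregateByKey(partitions)(0)(math.max(_, _), _ + _))
    // (2) cat: 2 + 5 + 12 = 19, mouse: 4 + 2 = 6, dog: 12
    println(aggregateByKey(partitions)(0)(_ + _, _ + _))
  }
}
```

When seqOp and combOp are the same function and the initial value is neutral, as in case (2), `pairRDD.reduceByKey(_ + _)` gives the same totals.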
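Summary: as I understand it, aggregateByKey first performs a local aggregation by key inside each partition, much like a local reduceByKey.
Grouped by key, partition 0 holds ("cat", (2, 5)), ("mouse", (4))
and partition 1 holds ("cat", (12)), ("dog", (12)), ("mouse", (2)).
The initial value 0 of aggregateByKey then behaves as if it were added to each key's values in each partition, so partition 0 becomes
("cat", (0, 2, 5)), ("mouse", (0, 4)) and partition 1 becomes ("cat", (0, 12)), ("dog", (0, 12)), ("mouse", (0, 2)). More precisely, it is the starting accumulator into which seqOp folds each key's values. Both seqOp and combOp operate only on the values, never on the keys.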
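To make the role of the initial value concrete, the local step alone can be traced for the max example. The block is a plain-Scala sketch, with `ZeroValueTrace.localMax` as a hypothetical helper, and it deliberately also tries a non-neutral initial value of 10 (an assumption added for illustration, not a value from the example).

```scala
// Trace of the per-partition step only; no Spark dependency.
object ZeroValueTrace {
  // The two partitions of pairRDD, as printed by mapPartitionsWithIndex.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("cat", 2), ("cat", 5), ("mouse", 4)),
    Seq(("cat", 12), ("dog", 12), ("mouse", 2)))

  // Hypothetical helper: group one partition by key, then fold each key's
  // values with math.max, starting from zeroValue.
  def localMax(part: Seq[(String, Int)], zeroValue: Int): Map[String, Int] =
    part.groupBy(_._1).map { case (key, kvs) =>
      key -> kvs.map(_._2).foldLeft(zeroValue)(math.max(_, _))
    }

  def main(args: Array[String]): Unit = {
    // Print the local result of every partition for both initial values.
    for (zero <- Seq(0, 10); (part, index) <- partitions.zipWithIndex)
      println(s"zeroValue=$zero partition=$index -> ${localMax(part, zero)}")
  }
}
```

With an initial value of 10, every key whose values are all below 10 reports 10 in that partition, which is why a value that is neutral for seqOp (0 here, since all counts are positive) is the usual choice.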
