Using mapPartitions in Spark

绛门人

Reposted from: http://blog.csdn.net/lsshlsw/article/details/48627737

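mapPartitions is similar to map. map operates on each element of an RDD, whereas mapPartitions (like foreachPartition) operates on the iterator of each partition of the RDD. If the map step has to create an expensive extra object over and over (for example, when writing the RDD's data to a database over JDBC, map needs a connection per element while mapPartitions needs only one connection per partition), mapPartitions is much more efficient than map.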
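The difference in connection count can be illustrated without a cluster or a database. The following is a minimal sketch in plain Scala, not the original author's code: `FakeConnection`, `writePerElement`, and `writePerPartition` are hypothetical names, the "connection" only increments a counter, and the three partitions of `sc.parallelize(1 to 9, 3)` are simulated with ordinary collections.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an expensive resource such as a JDBC connection.
final class FakeConnection(opened: AtomicInteger) {
  opened.incrementAndGet()       // count every "connection" that gets created
  def write(x: Int): Unit = ()   // pretend to write one record
  def close(): Unit = ()
}

// Per-element style (what map would force): one connection per record.
def writePerElement(iter: Iterator[Int], opened: AtomicInteger): Unit =
  iter.foreach { x =>
    val conn = new FakeConnection(opened)
    try conn.write(x) finally conn.close()
  }

// Per-partition style (mapPartitions / foreachPartition): one connection per iterator.
def writePerPartition(iter: Iterator[Int], opened: AtomicInteger): Unit = {
  val conn = new FakeConnection(opened)
  try iter.foreach(conn.write) finally conn.close()
}

// Simulate three partitions holding (1,2,3), (4,5,6), (7,8,9).
val partitions = (1 to 9).grouped(3).toList

val perElement = new AtomicInteger(0)
partitions.foreach(p => writePerElement(p.iterator, perElement))

val perPartition = new AtomicInteger(0)
partitions.foreach(p => writePerPartition(p.iterator, perPartition))

println(s"per element: ${perElement.get}, per partition: ${perPartition.get}")
```

In a real Spark job the per-partition function would be passed to `rdd.foreachPartition`, with the real connection opened inside that function so that it is created on the executor rather than serialized from the driver. The shared counter here is only for the local simulation and would not work across executors.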

Spark SQL and DataFrame apply this mapPartitions-style optimization to programs by default.

Demo

Pair each number with its double.

For example: input 2, result (2,4)

Using map

val a = sc.parallelize(1 to 9, 3)   // RDD of 1..9 split into 3 partitions
// Called once per element
def mapDoubleFunc(x: Int): (Int, Int) = {
    (x, x * 2)
}
val mapResult = a.map(mapDoubleFunc)

println(mapResult.collect().mkString)
 

Result

(1,2)(2,4)(3,6)(4,8)(5,10)(6,12)(7,14)(8,16)(9,18)

Using mapPartitions

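val a = sc.parallelize(1 to 9, 3)   // RDD of 1..9 split into 3 partitions
// Called once per partition with an iterator over that partition's elements
def doubleFunc(iter: Iterator[Int]): Iterator[(Int, Int)] = {
  var res = List[(Int, Int)]()
  while (iter.hasNext) {
    val cur = iter.next()
    // Prepend, so each partition's pairs end up in reverse order
    res = (cur, cur * 2) :: res
  }
  res.iterator
}
val result = a.mapPartitions(doubleFunc)
println(result.collect().mkString)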
 

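Result (within each partition the pairs come back in reverse order, because doubleFunc prepends each new pair to the list)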

(3,6)(2,4)(1,2)(6,12)(5,10)(4,8)(9,18)(8,16)(7,14)
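doubleFunc collects a whole partition into a List before returning, which reverses the order and holds every element of the partition in memory at once. As a hedged alternative (not part of the original post), the partition iterator can be transformed lazily with `Iterator.map`. In the sketch below, `doubleFuncLazy` is a hypothetical name and the three partitions are simulated with plain Scala collections so that the snippet runs without a SparkContext.

```scala
// Lazy variant: transform the partition iterator without building an intermediate List.
def doubleFuncLazy(iter: Iterator[Int]): Iterator[(Int, Int)] =
  iter.map(x => (x, x * 2))

// Simulate the three partitions (1,2,3), (4,5,6), (7,8,9) and apply the function to each.
val simulated: List[(Int, Int)] =
  (1 to 9).grouped(3).toList.flatMap(p => doubleFuncLazy(p.iterator))

println(simulated.mkString)
// prints (1,2)(2,4)(3,6)(4,8)(5,10)(6,12)(7,14)(8,16)(9,18)
```

With Spark, the same function has the shape expected by mapPartitions, so it could be used as `a.mapPartitions(doubleFuncLazy)`. The pairs then keep their original order within each partition.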
