Spark's map, mapPartitions, and mapPartitionsWithIndex Explained

How They Work

The official Spark documentation defines the three functions as follows. It is worth reading the definitions carefully to understand the differences between them.

Transformation | Meaning
map(func) | Return a new distributed dataset formed by passing each element of the source through a function func.
mapPartitions(func) | Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func) | Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

What They Have in Common

All three functions are transformation operators, and therefore lazy: they only describe a computation, and nothing actually runs until an action triggers the job.
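For example, here is a minimal sketch of that laziness (assuming a local SparkContext named sc; in local mode the trace output shows up in the same console):

    def trace(x):
        print("processing", x)    # side effect, so we can see when work happens
        return x * 2

    mapped = sc.parallelize([1, 2, 3]).map(trace)   # lazy: nothing is printed yet
    print(mapped.collect())       # the action runs the job and prints [2, 4, 6]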

How They Differ

map processes data one element at a time; that is, the function you pass to map takes a single element as its input (plus whatever other parameters you need to pass along).

mapPartitions processes a whole partition's data at once: the function's input is an iterator over all the data in one partition. Inside the function you can process the elements one by one, and you return all the results as an iterator again. Writing the function as a generator with yield is usually the cleanest way to do this.

mapPartitionsWithIndex is actually not very different from mapPartitions: under the hood, mapPartitions calls mapPartitionsWithIndex and simply closes over (ignores) the index argument. The function passed to mapPartitionsWithIndex additionally receives the partition's index.
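To make that relationship concrete, here is a minimal sketch (again assuming a local SparkContext sc) showing that mapPartitions behaves like mapPartitionsWithIndex with the index ignored:

    rdd = sc.parallelize([1, 2, 3, 4], 2)

    def double_all(iterator):
        for x in iterator:
            yield x * 2

    print(rdd.mapPartitions(double_all).collect())                             # [2, 4, 6, 8]
    print(rdd.mapPartitionsWithIndex(lambda i, it: double_all(it)).collect())  # same: [2, 4, 6, 8]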

Usage Examples

map:

Usage:

    rdd.map(lambda x: func(x))

Function definition:

    def func(x):          # add any extra parameters you need here
        new_x = x         # transform the single element
        return new_x
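A concrete, runnable instance of this template (a sketch assuming a local SparkContext sc; add_one is just an illustrative transformation):

    rdd = sc.parallelize([1, 2, 3])

    def add_one(x):
        return x + 1                    # called once per element

    print(rdd.map(add_one).collect())   # [2, 3, 4]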

mapPartitions:

Usage:

    rdd.mapPartitions(func)

Function definition:

    def func(partition):
        for line in partition:
            new_line = line    # process each element of the partition
            yield new_line     # yield makes func a generator, i.e. an iterator
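A concrete instance of this template (a sketch assuming a local SparkContext sc). The usual reason to prefer mapPartitions is that any expensive setup, with the factor assignment below standing in for, say, opening a database connection, runs once per partition rather than once per element:

    rdd = sc.parallelize([1, 2, 3, 4], 2)

    def scale(partition):
        factor = 10                # stand-in for expensive per-partition setup
        for x in partition:
            yield x * factor       # results are emitted lazily via the generator

    print(rdd.mapPartitions(scale).collect())   # [10, 20, 30, 40]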

mapPartitionsWithIndex:

Usage:

    rdd.mapPartitionsWithIndex(func)

Function definition:

    def func(index, partition):    # index is the partition number
        for line in partition:
            new_line = line
            yield new_line
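A concrete instance (a sketch assuming a local SparkContext sc) that tags every element with the partition it came from:

    rdd = sc.parallelize(["a", "b", "c", "d"], 2)

    def tag(index, partition):
        for x in partition:
            yield (index, x)       # attach the partition index to each element

    print(rdd.mapPartitionsWithIndex(tag).collect())
    # [(0, 'a'), (0, 'b'), (1, 'c'), (1, 'd')]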

Into the Source Code

def mapPartitions(self, f, preservesPartitioning=False):
    """Return a new RDD by applying a function to each partition of this RDD.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
    >>> def f(iterator): yield sum(iterator)
    >>> rdd.mapPartitions(f).collect()
    [3, 7]
    """
    def func(s, iterator):    # s is the partition index; it is ignored here
        return f(iterator)
    return self.mapPartitionsWithIndex(func, preservesPartitioning)

def mapPartitionsWithIndex(self, f, preservesPartitioning=False):
    """Return a new RDD by applying a function to each partition of this RDD,
    while tracking the index of the original partition.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 4)
    >>> def f(splitIndex, iterator): yield splitIndex
    >>> rdd.mapPartitionsWithIndex(f).sum()
    6
    """
    return PipelinedRDD(self, f, preservesPartitioning)

def map(self, f, preservesPartitioning=False):
    """Return a new RDD by applying a function to each element of this RDD.

    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> sorted(rdd.map(lambda x: (x, 1)).collect())
    [('a', 1), ('b', 1), ('c', 1)]
    """
    def func(_, iterator):    # the partition index is discarded
        return map(fail_on_stopiteration(f), iterator)
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
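Note how the source backs up the earlier point: mapPartitions wraps f in a function that ignores the partition index (the unused s argument), and map wraps f in a per-iterator map call that likewise discards the index (the _ argument); both then delegate to mapPartitionsWithIndex, which constructs a PipelinedRDD. In PySpark, PipelinedRDD is the mechanism that lets consecutive narrow transformations be chained (pipelined) and evaluated in a single pass over each partition.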