Spark's map, mapPartitions, and mapPartitionsWithIndex Explained

奋斗的瘦胖子

Article link: https://blog.csdn.net/QQ1131221088/article/details/104051087

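How It Works

The official Spark documentation defines the three functions as follows. Read the definitions carefully to understand how they differ.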

Transformation	Meaning
map(func)	Return a new distributed dataset formed by passing each element of the source through a function func.
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

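Similarities

All three functions are transformation operators, so they are lazy: nothing is computed until an action is called on the resulting RDD.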
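Laziness can be illustrated without a cluster. The sketch below is only an analogy in plain Python, not Spark code: the built-in map plays the role of a transformation, and list() plays the role of an action such as collect(). The names double and calls are made up for illustration.

```python
calls = []  # records every element the function has actually processed

def double(x):
    calls.append(x)
    return x * 2

data = [1, 2, 3]

lazy = map(double, data)      # like a transformation: only describes the work
calls_before = list(calls)    # snapshot: double() has not run yet

result = list(lazy)           # like an action: forces the computation
calls_after = list(calls)

print(calls_before, result, calls_after)
```

The snapshot taken before list() is empty, which mirrors how a Spark transformation on its own triggers no job.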

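Differences

map processes the data one element at a time. The function passed to map therefore receives a single element, plus any other parameters you need to pass in.

mapPartitions processes one whole partition at a time. Its function receives an iterator over all the elements of a partition, can process them one by one inside the function, and must return the results as an iterator. Writing the function as a generator with yield works even better, because results are produced one at a time instead of being collected into a list first.

mapPartitionsWithIndex differs little from mapPartitions: in PySpark, mapPartitions simply calls mapPartitionsWithIndex behind the scenes and wraps the user function so that the index argument is dropped. The function passed to mapPartitionsWithIndex additionally receives the partition index.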
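The difference in granularity can be made visible by counting function calls. The following is a minimal sketch in plain Python, assuming an RDD can be modelled as a list of partitions; it does not use Spark, and the names per_element, per_partition, and call_count are hypothetical.

```python
# Assumption: an RDD is modelled as a plain list of partitions.
partitions = [[1, 2, 3], [4, 5, 6]]
call_count = {"per_element": 0, "per_partition": 0}

def per_element(x):
    call_count["per_element"] += 1
    return x + 1

def per_partition(iterator):
    call_count["per_partition"] += 1
    for x in iterator:
        yield x + 1

# map-style: the function is invoked once for every element.
map_style = [[per_element(x) for x in part] for part in partitions]

# mapPartitions-style: the function is invoked once for every partition.
partition_style = [list(per_partition(iter(part))) for part in partitions]

print(map_style == partition_style, call_count)
```

Both styles produce the same values here; only the number of function invocations differs, which is why per-partition functions are a natural place for work that should happen once per partition rather than once per element.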

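Usage Examples

map:

Usage:

       rdd.map(lambda x: func(x, ...))   # "..." stands for any extra arguments

Function definition:

      def func(x, *args):   # *args: any extra arguments you pass in
          new_x = x         # transform the single element x here
          return new_x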
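To make the template concrete, here is a minimal sketch of passing an extra parameter alongside each element. It uses Python's built-in map on a list so that it runs without Spark; the same lambda or functools.partial object could be handed to rdd.map. The names scale and factor are hypothetical.

```python
from functools import partial

def scale(x, factor):
    # x is the single element; factor is an extra parameter
    return x * factor

data = [1, 2, 3]

# Option 1: wrap the call in a lambda that supplies the extra argument.
with_lambda = list(map(lambda x: scale(x, 10), data))

# Option 2: bind the extra argument up front with functools.partial.
with_partial = list(map(partial(scale, factor=10), data))

print(with_lambda, with_partial)
```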

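mapPartitions:

Usage:

rdd.mapPartitions(func)

Function definition:

def func(partitions):
    # partitions is an iterator over the elements of one partition
    for line in partitions:
        new_line = line  # transform each element here
        yield new_line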
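A function for mapPartitions may either build a list and return an iterator over it, or yield results one by one. The sketch below contrasts the two styles on a single hypothetical partition, in plain Python without Spark; with_list and with_yield are illustrative names.

```python
def with_list(iterator):
    # Builds the whole result in memory, then returns an iterator over it.
    results = []
    for line in iterator:
        results.append(line.upper())
    return iter(results)

def with_yield(iterator):
    # Generator: hands back one result at a time.
    for line in iterator:
        yield line.upper()

partition = ["a", "b", "c"]   # hypothetical contents of one partition

list_version = list(with_list(iter(partition)))
yield_version = list(with_yield(iter(partition)))
print(list_version, yield_version)
```

The outputs are identical; the generator version simply avoids holding every result of a large partition in memory at once.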

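mapPartitionsWithIndex:

Usage:

rdd.mapPartitionsWithIndex(func)

Function definition:

def func(index, partitions):
    # index is the partition number; partitions iterates over its elements
    for line in partitions:
        new_line = line  # transform each element here
        yield new_line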
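One typical use of the index is to tag each element with the partition it came from. This is a minimal sketch in plain Python, assuming partitions are modelled as a list of lists and numbered from 0 as in Spark; tag_with_index is a hypothetical name, and the loop stands in for what Spark does when it calls the function once per partition.

```python
def tag_with_index(index, iterator):
    # Pair every element with the index of its partition.
    for line in iterator:
        yield (index, line)

# Assumption: three partitions of unequal size.
partitions = [["a", "b"], ["c"], ["d", "e"]]

tagged = []
for index, part in enumerate(partitions):
    tagged.extend(tag_with_index(index, iter(part)))

print(tagged)
```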

Source Code

def mapPartitions(self, f, preservesPartitioning=False):
    """
    Return a new RDD by applying a function to each partition of this RDD.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
    >>> def f(iterator): yield sum(iterator)
    >>> rdd.mapPartitions(f).collect()
    [3, 7]
    """
    def func(s, iterator):
        return f(iterator)
    return self.mapPartitionsWithIndex(func, preservesPartitioning)

def mapPartitionsWithIndex(self, f, preservesPartitioning=False):
    """
    Return a new RDD by applying a function to each partition of this RDD,
    while tracking the index of the original partition.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 4)
    >>> def f(splitIndex, iterator): yield splitIndex
    >>> rdd.mapPartitionsWithIndex(f).sum()
    6
    """
    return PipelinedRDD(self, f, preservesPartitioning)

def map(self, f, preservesPartitioning=False):
    """
    Return a new RDD by applying a function to each element of this RDD.

    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> sorted(rdd.map(lambda x: (x, 1)).collect())
    [('a', 1), ('b', 1), ('c', 1)]
    """
    def func(_, iterator):
        return map(fail_on_stopiteration(f), iterator)
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
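The delegation pattern in the source above can be imitated with a toy class. This is a simplified sketch, not PySpark itself: MiniRDD is a hypothetical stand-in that stores partitions as lists and evaluates eagerly, and it omits preservesPartitioning, PipelinedRDD, and fail_on_stopiteration. It only shows that map and mapPartitions can both be expressed through mapPartitionsWithIndex.

```python
class MiniRDD:
    """Toy stand-in for an RDD: a list of partitions, each a list of elements."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    def mapPartitionsWithIndex(self, f):
        # The single primitive: call f(index, iterator) once per partition.
        return MiniRDD(f(i, iter(p)) for i, p in enumerate(self.partitions))

    def mapPartitions(self, f):
        # Same as above, but the index argument is dropped.
        def func(_, iterator):
            return f(iterator)
        return self.mapPartitionsWithIndex(func)

    def map(self, f):
        # Per-element function lifted to a per-partition function.
        def func(_, iterator):
            return map(f, iterator)
        return self.mapPartitionsWithIndex(func)

    def collect(self):
        return [x for p in self.partitions for x in p]


def partition_sum(iterator):
    yield sum(iterator)

def partition_index(index, iterator):
    yield index

rdd = MiniRDD([[1, 2], [3, 4]])              # two partitions, chosen by hand
sums = rdd.mapPartitions(partition_sum).collect()
tens = rdd.map(lambda x: x * 10).collect()
indices = MiniRDD([[1], [2], [3], [4]]).mapPartitionsWithIndex(partition_index).collect()
print(sums, tens, indices)
```

The partition layouts here are chosen by hand to match the docstring examples; in real Spark the split is decided by parallelize.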