Spark中map和mapPartitions的区别

最新推荐文章于 2023-05-12 23:17:30 发布

飞天小老头

最新推荐文章于 2023-05-12 23:17:30 发布

阅读量2.4k

点赞数 3

分类专栏： SPARK 文章标签： spark scala big data

本文链接：https://blog.csdn.net/AnameJL/article/details/121689987

版权

SPARK 专栏收录该内容

13 篇文章 0 订阅

订阅专栏

map和mapPartitions的区别

map和mapPartitions的本质区别

map是对RDD中的每个元素进行操作,源码如下所示

 /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
  }

通过源码我们可以看出,map传入的是分区中的每个元素,是对每个元素就进行一次传入的逻辑运算.

mapPartitions针对的迭代器进行操作,源码如下所示

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (_: TaskContext, _: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

通过源码我们可以看到,这里传入的参数是Iterator返回值也是Iterator,所传入的计算逻辑是对一个Iterator进行一次.

map和mapPartitions优缺点

map优点
当一个分区中的数据量极大时且内存资源不够时,map可以处理一部分数据就对处理完并且在存在内存中的数据进行垃圾回收或者落地存储中间结果,或者其他的处理方式,这样一般很少导致OOM
map缺点
如果在map的逻辑运算中,需要创建jdbc链接,那这里就会对分区中的每个元素创建一个链接,这样就很损耗资源,影响运行速度.
mapPartitions优点
mapPartitions的优点就和map的缺点相对应,应为mapPartitions是针对一个迭代器进行操作的,所以可以创建更少的对象或链接,这样就提高了性能.
mapPartitions的缺点
mapPartitions的缺点就和map的优点相对应,如果一个迭代器中的数据量很大而内存资源有限时,这种情况就可能会导致OOM的情况,因为mapPartitions要将一整个迭代中的数据加载到内存中一次性完成计算.