Spark operators map(), mapPartitions(), and mapPartitionsWithIndex()

Plume_WZ

Article link: https://blog.csdn.net/wz272343078/article/details/95046927

map(): returns a new RDD by applying a function to every element of this RDD.

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }
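
As a usage illustration of the definition above, here is a minimal sketch of calling map() on a small RDD. The object name `MapExample`, the `local[2]` master, and the sample data are assumptions made for this example, not part of the Spark source.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the name and data are illustrative only.
object MapExample {
  // Doubles every element of a small RDD and returns the result to the driver.
  def run(): Array[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-example")
    val sc = new SparkContext(conf)
    try {
      // Six elements spread over three partitions.
      val rdd = sc.parallelize(1 to 6, numSlices = 3)
      // The function is applied to each element individually.
      rdd.map(x => x * 2).collect()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // prints "2, 4, 6, 8, 10, 12"
}
```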

mapPartitions(): returns a new RDD by applying a function to each partition of this RDD.

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }
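
A minimal sketch of mapPartitions() in use: the function receives one iterator per partition and here returns a single value (the partition's sum) for each. The object name `MapPartitionsExample` and the sample data are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the name and data are illustrative only.
object MapPartitionsExample {
  // Returns one sum per partition.
  def run(): Array[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-partitions-example")
    val sc = new SparkContext(conf)
    try {
      // With a Range and numSlices = 3, the partitions are (1, 2), (3, 4), (5, 6).
      val rdd = sc.parallelize(1 to 6, numSlices = 3)
      // The function is called once per partition with that partition's iterator.
      rdd.mapPartitions(iter => Iterator(iter.sum)).collect()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // prints "3, 7, 11"
}
```

Because the function sees a whole partition, any per-partition setup (opening a connection, building a parser) can be done once at the top of the function instead of once per element.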

mapPartitionsWithIndex(): returns a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.

  /**
   * Return a new RDD by applying a function to each partition of this RDD, while tracking the index
   * of the original partition.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }
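
A minimal sketch of mapPartitionsWithIndex(), tagging each element with the index of the partition it came from. The object name `MapPartitionsWithIndexExample` and the sample data are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the name and data are illustrative only.
object MapPartitionsWithIndexExample {
  // Pairs every element with its partition index.
  def run(): Array[(Int, Int)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-partitions-with-index-example")
    val sc = new SparkContext(conf)
    try {
      // With a Range and numSlices = 3, the partitions are (1, 2), (3, 4), (5, 6).
      val rdd = sc.parallelize(1 to 6, numSlices = 3)
      // The function receives the partition index together with the partition's iterator.
      rdd.mapPartitionsWithIndex((index, iter) => iter.map(x => (index, x))).collect()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // prints "(0,1), (0,2), (1,3), (1,4), (2,5), (2,6)"
}
```

This is a convenient way to inspect how data is distributed across partitions.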

Comparing map() and mapPartitions():

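map(): processes one element at a time; the function is called once per element.
mapPartitions() is similar to map(), but runs separately on each partition of the RDD, so when it runs on an RDD of type T the function must have type Iterator[T] => Iterator[U]. Given N elements in M partitions, the function passed to map() is called N times, while the function passed to mapPartitions() is called M times; each call handles all the elements of one partition.
mapPartitions(): processes one partition at a time. The iterator itself is consumed lazily, but if the function buffers the whole partition (for example by calling toList on the iterator), that partition's data cannot be released until the call finishes, which may cause an OOM error.
When memory is ample, mapPartitions() is recommended, because any per-call setup cost is paid M times rather than N times.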
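
The N-versus-M call counts can be checked with a small sketch that counts invocations using accumulators. The object name `CallCountExample`, the accumulator names, and the sample data (N = 6, M = 3) are assumptions for illustration. Accumulator updates made inside transformations can be applied more than once if a task is retried, so the counts below hold for a clean local run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the name and data are illustrative only.
object CallCountExample {
  // Returns (number of map() function calls, number of mapPartitions() function calls).
  def run(): (Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("call-count-example")
    val sc = new SparkContext(conf)
    try {
      // N = 6 elements in M = 3 partitions.
      val rdd = sc.parallelize(1 to 6, numSlices = 3)
      val mapCalls = sc.longAccumulator("mapCalls")
      val partitionCalls = sc.longAccumulator("partitionCalls")

      // Incremented once per element.
      rdd.map { x => mapCalls.add(1L); x * 2 }.collect()

      // Incremented once per partition; the elements are still transformed lazily.
      rdd.mapPartitions { iter =>
        partitionCalls.add(1L)
        iter.map(_ * 2)
      }.collect()

      (mapCalls.value.longValue(), partitionCalls.value.longValue())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run()) // prints "(6,3)"
}
```

Note that the mapPartitions() function above returns `iter.map(...)` rather than collecting the partition into a list first, so elements stream through one at a time and the whole partition does not need to be held in memory.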
