The difference between the map and mapPartitionsWithIndex methods of a Spark RDD

weixin_43866709

Category: spark. Tags: mapPartitionsWithIndex

Article link: https://blog.csdn.net/weixin_43866709/article/details/88641820

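An RDD's map method, when it runs on an Executor, takes the records out and processes them one at a time: the function you pass in is called once per record.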

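mapPartitionsWithIndex hands your function one whole partition at a time, as an iterator, and also gives you that partition's index. (A partition does not itself hold data; it only records which data to read. The Task generated for the partition is what actually reads the records.)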
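To make the "once per record" versus "once per partition" contrast concrete, here is a small plain-Scala model. It does not use Spark at all: the nested `Seq` is a hypothetical stand-in for an RDD with two partitions, and the names `mapStyle`, `mapPartitionsWithIndexStyle` and the two counters are invented for this sketch.

```scala
// Plain-Scala model, NOT Spark: each inner Seq stands in for one partition.
object MapVsMapPartitionsSketch {
  // Hypothetical stand-in for an RDD with two partitions.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

  // Counters that record how often the user function is invoked.
  var perElementCalls = 0
  var perPartitionCalls = 0

  // map-style: the user function is invoked once for every element.
  def mapStyle(f: Int => Int): Seq[Seq[Int]] =
    partitions.map(_.map { x => perElementCalls += 1; f(x) })

  // mapPartitionsWithIndex-style: the user function is invoked once per
  // partition and receives the partition index plus an iterator over it.
  def mapPartitionsWithIndexStyle[U](
      f: (Int, Iterator[Int]) => Iterator[U]): Seq[List[U]] =
    partitions.zipWithIndex.map { case (part, idx) =>
      perPartitionCalls += 1
      f(idx, part.iterator).toList
    }
}
```

With six elements in two partitions, the map-style function runs six times and the partition-style function runs twice. In real Spark the same idea applies per Task, which is why per-partition setup work (for example opening a connection) is usually placed inside a partition-level function.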

First, let's look at the source code of mapPartitionsWithIndex:

/**
   * Return a new RDD by applying a function to each partition of this RDD, while tracking the index
   * of the original partition.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }

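Explanation: it returns a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
From the source we can see that mapPartitionsWithIndex takes a function and a Boolean, preservesPartitioning, which defaults to false. The function itself takes two parameters: an Int, which is the partition index, and an Iterator, which yields the data in that partition.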
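The effect of the Boolean flag can be observed through the `partitioner` field of the resulting RDD. The following is a minimal sketch, assuming spark-core is on the classpath and that a SparkContext is supplied by the caller; the object name, the helper `partitionerKept` and the sample data are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

object PreservesPartitioningSketch {
  // Returns whether the result RDD still has a partitioner,
  // first with the default flag (false), then with preservesPartitioning = true.
  def partitionerKept(sc: SparkContext): (Boolean, Boolean) = {
    // A pair RDD with an explicit partitioner (sample data is an assumption).
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
      .partitionBy(new HashPartitioner(2))

    // Changes only the values; the keys are left untouched.
    val bump = (index: Int, iter: Iterator[(String, Int)]) =>
      iter.map { case (k, v) => (k, v + index) }

    val dropped = pairs.mapPartitionsWithIndex(bump)
    val kept = pairs.mapPartitionsWithIndex(bump, preservesPartitioning = true)

    // partitioner is RDD metadata, so no job has to run to inspect it.
    (dropped.partitioner.isDefined, kept.partitioner.isDefined)
  }
}
```

Setting the flag to true is only safe here because the function does not modify the keys, which matches the condition stated in the doc comment of the source.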

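Now let's write a function:
Purpose: while reading the data in a partition, also get the partition's index, so that we know which partition each record belongs to (that is, which partition's Task handled it).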

val func = (index: Int, iter: Iterator[Int]) => {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
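As a usage sketch, the function above can be passed straight to mapPartitionsWithIndex. The sample data (1 to 6), the two-partition split, the `local[2]` master, and the names `MapPartitionsWithIndexDemo` and `tagWithPartition` are assumptions chosen for a quick local run, not part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsWithIndexDemo {
  // Same function as in the post: tags every value with its partition index.
  val func = (index: Int, iter: Iterator[Int]) => {
    iter.map(x => "[partID:" + index + ", val: " + x + "]")
  }

  // Builds a small RDD with two partitions and tags each element.
  def tagWithPartition(sc: SparkContext): Array[String] = {
    val rdd = sc.parallelize(1 to 6, 2) // 2 = number of partitions requested
    rdd.mapPartitionsWithIndex(func).collect()
  }

  def main(args: Array[String]): Unit = {
    // App name and master are assumptions for running locally.
    val conf = new SparkConf()
      .setAppName("MapPartitionsWithIndexDemo")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try tagWithPartition(sc).foreach(println)
    finally sc.stop()
  }
}
```

Each printed line shows the partition index next to the value, so it is easy to see how parallelize distributed the elements across the two partitions.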
