Summary of the differences between map and mapPartitions, foreach and foreachPartition, and map and flatMap

失散Lost

Category: Spark

Article link: https://blog.csdn.net/jason_9527/article/details/102905142

Versions: spark-2.2.0 && scala-2.11.8

Differences between map and mapPartitions

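map applies a function to each element, whereas mapPartitions applies a function to each partition as a whole, passing it the partition's iterator.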

map source code

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

mapPartitions source code

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

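Examples
1. Suppose there are 2 partitions and 10,000 records in total; in the ideal case each partition holds 5,000 records. We traverse them with map and with mapPartitions. (Large data volume.)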

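(1) Traversing with map(func)
When func is passed to map and applied to the RDD, it runs 10,000 times, once per record, receiving a single record each time. This is especially costly when func talks to a database: if it opens a connection on each call, every record creates its own DB connection.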

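(2) Traversing with mapPartitions(func)
mapPartitions works on whole partitions, so func runs only 2 times (the number of partitions). If func talks to a database, only two connections are created.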
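The invocation counts above can be sketched without a cluster. The following is a minimal model in plain Scala, not Spark code: the two partitions are ordinary sequences, and `FakeConnection`, `perRecord`, and `perPartition` are hypothetical names introduced only for this illustration.

```scala
object MapVsMapPartitions {
  // Hypothetical stand-in for an expensive resource such as a DB connection.
  final class FakeConnection {
    def lookup(x: Int): Int = x * 2
    def close(): Unit = ()
  }

  // Two "partitions" of 5,000 records each, modelled as plain Scala sequences.
  val partitions: Seq[Seq[Int]] = Seq(1 to 5000, 5001 to 10000)

  // map-style: the function runs once per record and opens a connection each time.
  // Returns (connections opened, records produced).
  def perRecord(): (Int, Int) = {
    var opened = 0
    val out = partitions.map { part =>
      part.iterator.map { x =>
        opened += 1
        val conn = new FakeConnection
        try conn.lookup(x) finally conn.close()
      }.toVector
    }
    (opened, out.map(_.size).sum)
  }

  // mapPartitions-style: the function runs once per partition and reuses one connection.
  // Returns (connections opened, records produced).
  def perPartition(): (Int, Int) = {
    var opened = 0
    val f: Iterator[Int] => Iterator[Int] = { iter =>
      opened += 1
      val conn = new FakeConnection
      // The results are buffered before close() so that no lazy element
      // outlives the connection; this costs memory for large partitions.
      try iter.map(conn.lookup).toVector.iterator finally conn.close()
    }
    val out = partitions.map(part => f(part.iterator).toVector)
    (opened, out.map(_.size).sum)
  }
}
```

In this model, `perRecord()` opens one connection per record and `perPartition()` opens one per partition, while both produce the same number of output records.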

2. Now suppose there are 2 partitions with 10 records each (very small data volume).

In this case map and mapPartitions differ very little.

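3. Now suppose there are 2 partitions with 1,000,000 records each (very large data volume).

With mapPartitions, a single call to func has to handle all 1,000,000 records of a partition. func receives them as an iterator, so if it loads the whole partition into memory or accumulates all of its results before returning, memory may well run short, and if no more memory can be freed an OOM (out of memory) error can occur.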
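Whether a partition-level function is memory-hungry depends on how it consumes the iterator it is given. The sketch below uses plain Scala iterators rather than Spark; `streaming`, `buffering`, and `pulled` are hypothetical helpers, and the partition size is scaled down to 100,000 to keep the example light.

```scala
object LazyPartitionProcessing {
  // Streaming style: transforms records one at a time as the caller pulls them.
  def streaming(iter: Iterator[Int]): Iterator[Int] = iter.map(_ + 1)

  // Buffering style: copies the whole partition into a List before transforming it.
  def buffering(iter: Iterator[Int]): Iterator[Int] = iter.toList.map(_ + 1).iterator

  // Counts how many input records are pulled from a 100,000-record "partition"
  // when only the first `n` output records are read.
  def pulled(f: Iterator[Int] => Iterator[Int], n: Int): Int = {
    var count = 0
    val source = Iterator.range(0, 100000).map { x => count += 1; x }
    f(source).take(n).foreach(_ => ())
    count
  }
}
```

Reading three outputs from the streaming version pulls only three inputs, whereas the buffering version pulls the entire partition first. The second pattern is the one that risks exhausting memory on very large partitions.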

Differences between foreach and foreachPartition

foreach source code

  // Actions (launch a job to return a value to the user program)

  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

foreachPartition source code

  /**
   * Applies a function f to each partition of this RDD.
   */
  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
  }

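Clearly, foreach operates on each record, while foreachPartition operates on each partition.

So the advantages and disadvantages are essentially the same as for map versus mapPartitions:

When func creates and closes database connections, network (TCP) or other I/O connections, file streams, and the like, using foreachPartition to process each partition as a whole gives better performance.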
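The open/close pattern can be illustrated with a small model that mirrors the two source snippets above (`iter.foreach(f)` versus `f(iter)`). This is plain Scala under stated assumptions, not Spark code: `FakeSink` is a hypothetical writer that only records its calls in a log, and the partitions are ordinary sequences.

```scala
import scala.collection.mutable.ArrayBuffer

object ForeachVsForeachPartition {
  // Hypothetical sink standing in for a DB, socket, or file writer.
  // Constructing it logs "open"; it performs no real I/O.
  final class FakeSink(log: ArrayBuffer[String]) {
    log += "open"
    def write(x: Int): Unit = { log += s"write $x"; () }
    def close(): Unit = { log += "close"; () }
  }

  // Two small "partitions" modelled as plain sequences.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4))

  // foreach-style: the function sees one record at a time,
  // so a sink created inside it is opened and closed per record.
  def foreachStyle(log: ArrayBuffer[String]): Unit =
    partitions.foreach { part =>
      part.iterator.foreach { x =>
        val sink = new FakeSink(log)
        try sink.write(x) finally sink.close()
      }
    }

  // foreachPartition-style: the function sees the partition's iterator,
  // so one sink can be shared by every record in that partition.
  def foreachPartitionStyle(log: ArrayBuffer[String]): Unit =
    partitions.foreach { part =>
      val iter = part.iterator
      val sink = new FakeSink(log)
      try iter.foreach(sink.write) finally sink.close()
    }
}
```

Both variants write the same four records; the first opens four sinks and the second opens two, one per partition.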

Differences between map and flatMap

map source code

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

flatMap source code

  /**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

As you can see, there is almost no difference: one calls Scala's map on the partition iterator and the other calls Scala's flatMap, which adds a flatten step.
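Since both RDD methods delegate to the corresponding Scala iterator methods, the flatten step can be seen on a plain iterator. This is a minimal sketch; the sample `lines` data is an assumption made up for illustration.

```scala
object MapVsFlatMap {
  // Sample input, invented for this illustration.
  val lines: Seq[String] = Seq("a b", "c d e")

  // map: exactly one output element per input element, so the result stays nested.
  val mapped: List[List[String]] =
    lines.iterator.map(_.split(" ").toList).toList

  // flatMap: each input yields zero or more outputs, flattened into one sequence.
  val flattened: List[String] =
    lines.iterator.flatMap(_.split(" ").toList).toList
}
```

Here `mapped` keeps one inner list per input line, while `flattened` contains the individual words in a single list.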
