The map, mapPartitions, foreach, and foreachPartition Operators in Spark


chilai4545



Original link: https://my.oschina.net/dreamness/blog/3077996


map and mapPartitions

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

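  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }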

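mapPartitions calls the supplied function once per partition, passing it an iterator over that partition's elements, whereas map calls the supplied function once per element.
As the source above shows, map has no separate mechanism of its own: it builds a MapPartitionsRDD whose per-partition function is iter.map(cleanF), the same RDD class that mapPartitions produces. mapPartitions is therefore not inherently faster for a simple element-wise transformation. It pays off when there is setup work (for example, creating a parser or opening a connection) that can be done once per partition instead of once per element.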
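
The once-per-partition benefit is easiest to see with a helper object that is costly to build. The following is a minimal sketch in plain Scala, with no Spark dependency, of a function with the `Iterator[T] => Iterator[U]` shape that mapPartitions accepts. `ExpensiveParser`, `PartitionFunctions`, and the `setupCount` counter are hypothetical names introduced only for illustration.

```scala
// Hypothetical helper standing in for something costly to construct
// (a parser, a compiled pattern, a client object, ...).
final class ExpensiveParser {
  ExpensiveParser.setupCount += 1          // count constructions, for the demo only
  def parse(line: String): Int = line.trim.toInt
}

object ExpensiveParser {
  var setupCount: Int = 0                  // demo-only counter, not thread-safe
}

object PartitionFunctions {
  // Per-element style: the kind of function passed to rdd.map(...).
  // A new parser is built for every single record.
  def parseElement(line: String): Int =
    new ExpensiveParser().parse(line)

  // Per-partition style: the kind of function passed to rdd.mapPartitions(...).
  // The parser is built once and reused for every record in the partition.
  def parsePartition(lines: Iterator[String]): Iterator[Int] = {
    val parser = new ExpensiveParser()
    lines.map(parser.parse)                // stays lazy: the partition is not buffered
  }
}
```

Assuming an `RDD[String]` named `lines`, these would be used as `lines.map(PartitionFunctions.parseElement)` and `lines.mapPartitions(PartitionFunctions.parsePartition)`. Returning `lines.map(...)` rather than a materialized collection keeps the per-partition function streaming, so it does not hold the whole partition in memory.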

foreach and foreachPartition

  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

  /**
   * Applies a function f to each partition of this RDD.
   */
  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
  }

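foreach and foreachPartition mirror the two operators above. The difference is that they are actions: they trigger a job and return nothing, whereas map and mapPartitions are lazy transformations.
For operations such as inserting rows into a database, prefer foreachPartition, so that a connection can be opened once per partition rather than once per element.
However, if a partition is very large and the function holds all of its data in memory (for example, by collecting the iterator into a single batch), the task can fail with an OOM error. Before using it, estimate the size of the RDD, the size of each partition, and the resources available to the whole job.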
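
One way to keep memory bounded when writing from foreachPartition is to consume the partition iterator in fixed-size batches instead of collecting it. The sketch below is plain Scala with no Spark or JDBC dependency. The `Sink` trait, `InMemorySink`, and `PartitionWriter.writePartition` are hypothetical stand-ins for a real database connection, and the batch size is an arbitrary assumption.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a database connection; a real job would wrap
// a JDBC connection or a client library here.
trait Sink[T] {
  def writeBatch(rows: Seq[T]): Unit
  def close(): Unit
}

// In-memory sink, used only to make the sketch runnable.
final class InMemorySink[T] extends Sink[T] {
  val batches: ArrayBuffer[Seq[T]] = ArrayBuffer.empty[Seq[T]]
  var closed: Boolean = false
  def writeBatch(rows: Seq[T]): Unit = batches.append(rows)
  def close(): Unit = { closed = true }
}

object PartitionWriter {
  // Has the Iterator[T] => Unit shape expected by rdd.foreachPartition once
  // batchSize and openSink are supplied. openSink is called once per partition,
  // and rows are flushed batchSize at a time, so only one batch is buffered.
  def writePartition[T](rows: Iterator[T], batchSize: Int)(openSink: () => Sink[T]): Unit = {
    val sink = openSink()
    try rows.grouped(batchSize).foreach(batch => sink.writeBatch(batch))
    finally sink.close()                   // release the connection even on failure
  }
}
```

In a Spark job this might be wired as `rdd.foreachPartition(iter => PartitionWriter.writePartition(iter, 500)(() => openMySink()))`, where `openMySink` is a hypothetical factory. The sink should be created inside the partition function, on the executor, because connection objects are generally not serializable.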

Reposted from: https://my.oschina.net/dreamness/blog/3077996