Spark Streaming: foreach, foreachRDD, foreachPartition

Latest recommended article published on 2022-12-01 18:53:30

爱吃甜食_

Category: Spark

Article link: https://blog.csdn.net/a3125504x/article/details/108444519


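Differences between the common iteration functions in Spark Streaming

Differences
Official example
Code example
Official documentation link

Differences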

foreach

Source code

  /** Applies a function `f` to all values produced by this iterator.
   *
   *  @param  f   the function that is applied for its side-effect to every element.
   *              The result of function `f` is discarded.
   *
   *  @tparam  U  the type parameter describing the result of function `f`.
   *              This result will always be ignored. Typically `U` is `Unit`,
   *              but this is not necessary.
   *
   *  @note    Reuse: $consumesIterator
   *
   *  @usecase def foreach(f: A => Unit): Unit
   *    @inheritdoc
   */
  def foreach[U](f: A => U) {
    while (hasNext) f(next()) }

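An action operator: it is applied to every element and runs on the executors. The source above is Scala's Iterator.foreach, which RDD.foreach invokes on the iterator of each partition.
Like foreachPartition, foreach walks the iterator of each partition and applies the user-supplied function (func) to its contents. The difference lies in what func receives: not an iterator, but one element of the RDD per call (for example a single key-value record), that is, the actual data.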
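
The difference in what func receives can be made visible by counting how often it is called. The following is a minimal sketch, not code from the original post: it assumes spark-core is on the classpath and runs in local mode, and the names ForeachSketch, elementCalls and partitionCalls are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForeachSketch {
  /** Returns how many times func was invoked by (foreach, foreachPartition). */
  def run(): (Long, Long) = {
    // Assumption: local mode with two threads, for illustration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreach-sketch")
    val sc = new SparkContext(conf)
    try {
      // Four key-value records spread over two partitions.
      val rdd = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4), numSlices = 2)
      val elementCalls = sc.longAccumulator("elementCalls")
      val partitionCalls = sc.longAccumulator("partitionCalls")

      // foreach: func receives one record per call.
      rdd.foreach { case (key, value) =>
        println(s"record: $key -> $value")
        elementCalls.add(1)
      }

      // foreachPartition: func receives the iterator of a whole partition.
      rdd.foreachPartition { iter =>
        partitionCalls.add(1)
        println(s"partition with ${iter.size} records")
      }

      (elementCalls.sum, partitionCalls.sum)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (perElement, perPartition) = run()
    println(s"foreach called func $perElement times, foreachPartition $perPartition times")
  }
}
```

With four records in two partitions, foreach calls func four times and foreachPartition calls it twice. On a real cluster the println output from inside the closures appears in the executor logs, not on the driver console.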

foreachRDD

Source code

  /**
   * Apply a function to each RDD in this DStream. This is an output operator, so
   * 'this' DStream will be registered as an output stream and therefore materialized.
   */
  def foreachRDD(foreachFunc: RDD[T] => Unit): Unit = ssc.withScope {
   
    val cleanedF = context.sparkContext.clean(foreachFunc, false)
    foreachRDD((r: RDD[T], _: Time) => cleanedF(r), displayInnerRDDOps = true)
  }

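An output operator, not a transformation (the source comment above says so explicitly). It is applied to every RDD in the DStream. The function passed to foreachRDD runs on the driver, while any RDD action called inside it runs on the executors. It is a commonly used output operation and is normally combined with an RDD action in its body.

Note
If foreachRDD is used without an RDD action inside it, no job is triggered.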
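
The usual way to combine foreachRDD with an action looks roughly like the sketch below. It is an illustration under assumptions, not the author's code: spark-streaming is assumed to be on the classpath, a queue-backed stream stands in for a real source, and the names ForeachRDDSketch and processed are hypothetical.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDSketch {
  /** Runs the stream for at most timeoutMs and returns the number of records processed. */
  def run(timeoutMs: Long): Long = {
    // Assumption: local mode with two threads and 1-second batches.
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreachRDD-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val processed = ssc.sparkContext.longAccumulator("processed")

    // Assumption: a queue holding a single RDD stands in for a real input source.
    val queue = mutable.Queue[RDD[Int]](ssc.sparkContext.parallelize(1 to 6, numSlices = 2))
    val stream = ssc.queueStream(queue)

    stream.foreachRDD { rdd =>
      // This body runs on the driver, once per batch.
      // Without the action below, no job would be submitted for the batch.
      rdd.foreachPartition { iter =>
        // This body runs on an executor, once per partition.
        // A real sink would open one connection here and reuse it for all records.
        iter.foreach { record =>
          println(s"record: $record")
          processed.add(1)
        }
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(timeoutMs)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    processed.sum
  }

  def main(args: Array[String]): Unit =
    println(s"records processed: ${run(5000L)}")
}
```

Using foreachPartition inside foreachRDD, instead of foreach, lets per-partition setup such as a database connection be paid once per partition instead of once per record.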

foreachPartition

Source code

  /**
   * Applies a function f to each partition of this RDD.
   */
  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
   
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => cleanF(iter)