Differences between common traversal functions in Spark Streaming
Differences
foreach
Source code
/** Applies a function `f` to all values produced by this iterator.
*
* @param f the function that is applied for its side-effect to every element.
* The result of function `f` is discarded.
*
* @tparam U the type parameter describing the result of function `f`.
* This result will always be ignored. Typically `U` is `Unit`,
* but this is not necessary.
*
* @note Reuse: $consumesIterator
*
* @usecase def foreach(f: A => Unit): Unit
* @inheritdoc
*/
def foreach[U](f: A => U) {
while (hasNext) f(next()) }
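- An action operator. It is applied to every element and runs on the executors.
- Like foreachPartition, foreach iterates over the iterator of each partition and processes its contents with the user-supplied function (func). The difference is that func no longer receives an iterator: on each call it receives a single element of the RDD (for example one key-value pair), that is, the actual data.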
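The contrast between the two calling conventions can be sketched with plain Scala collections. This is only an illustration and does not use Spark: the object `ForeachSketch`, the helper names `perElement` and `perPartition`, and the two hard-coded "partitions" are all hypothetical stand-ins for an RDD.

```scala
import scala.collection.mutable.ArrayBuffer

object ForeachSketch {
  // Hypothetical stand-in for an RDD: two "partitions" of key-value pairs.
  // A fresh set of iterators is built on every call, because iterators are consumed once.
  def partitions(): Seq[Iterator[(String, Int)]] =
    Seq(Iterator("a" -> 1, "b" -> 2), Iterator("c" -> 3))

  // foreach style: func receives one element at a time.
  def perElement(func: ((String, Int)) => Unit): Unit =
    partitions().foreach(iter => iter.foreach(func))

  // foreachPartition style: func receives the whole iterator of a partition.
  def perPartition(func: Iterator[(String, Int)] => Unit): Unit =
    partitions().foreach(iter => func(iter))

  def main(args: Array[String]): Unit = {
    val elements = ArrayBuffer.empty[String]
    perElement { case (k, v) => elements += s"$k=$v" }
    println(elements.mkString(", "))

    val sizes = ArrayBuffer.empty[Int]
    perPartition { iter => sizes += iter.size } // iter.size consumes the iterator
    println(sizes.mkString(", "))
  }
}
```

In this sketch `perElement` calls func three times, once per pair, while `perPartition` calls it twice, once per partition. That is why per-partition setup work, such as opening a connection, is usually placed in the foreachPartition form.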
foreachRDD
Source code
/**
* Apply a function to each RDD in this DStream. This is an output operator, so
* 'this' DStream will be registered as an output stream and therefore materialized.
*/
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit = ssc.withScope {
val cleanedF = context.sparkContext.clean(foreachFunc, false)
foreachRDD((r: RDD[T], _: Time) => cleanedF(r), displayInnerRDDOps = true)
}
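- An output operation (not a transformation). It is applied to every RDD generated by the DStream and is one of the most commonly used output operations. The function passed to foreachRDD runs on the driver, and it usually calls an RDD action internally.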
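Note
If the function passed to foreachRDD contains no RDD action, no Spark job is triggered for the batch, because RDD transformations are lazy.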
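A minimal sketch of how these pieces are commonly combined, assuming Spark Streaming is on the classpath and the job runs in local mode. The object name `ForeachRDDSketch`, the queue-backed test input, and the sample records are assumptions made for illustration; a real job would read from an actual source.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ForeachRDDSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Illustrative input: a queue of RDDs, consumed one RDD per batch.
    val queue = new mutable.Queue[RDD[(String, Int)]]()
    queue += ssc.sparkContext.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3), numSlices = 2)
    val stream = ssc.queueStream(queue)

    stream.foreachRDD { rdd =>
      // This block runs on the driver, once per batch.
      // mapValues is a lazy transformation: on its own it triggers no job.
      val doubled = rdd.mapValues(_ * 2)

      // foreachPartition is an action: it triggers a job, and the function
      // below runs on the executors, once per partition.
      doubled.foreachPartition { records =>
        // Per-partition setup (for example opening a connection) would go here.
        records.foreach { case (k, v) => println(s"$k -> $v") }
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Removing the `foreachPartition` call and keeping only `mapValues` would leave the batch without any action, so nothing would be computed.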
foreachPartition
Source code
/**
* Applies a function f to each partition of this RDD.
*/
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))