Spark `filter` source (from `RDD.scala`):
/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
Here `context`, `pid`, and `iter` stand for the `TaskContext`, the partition index, and the partition's iterator, respectively.
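To make the shape of that `(context, pid, iter)` function concrete, here is a plain-Scala simulation (no Spark required) of what `MapPartitionsRDD` does for `filter`: it runs the function once per partition, and `filter` simply ignores the context and partition index and filters the partition's iterator. The object and method names below are illustrative only.

```scala
object PerPartitionFilterSketch {
  // Mirrors (context, pid, iter) => iter.filter(cleanF): each partition's
  // iterator is filtered independently; pid is available but unused.
  def runPerPartition[T](partitions: Vector[Vector[T]], f: T => Boolean): Vector[Vector[T]] =
    partitions.zipWithIndex.map { case (part, pid) =>
      part.iterator.filter(f).toVector
    }

  def main(args: Array[String]): Unit = {
    val parts = Vector(Vector(1, 2), Vector(3, 5), Vector(8, 9)) // 3 "partitions"
    println(runPerPartition(parts, (x: Int) => x != 2))
    // Vector(Vector(1), Vector(3, 5), Vector(8, 9))
  }
}
```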
Scala `Iterator.filter` source:
/** Returns an iterator over all the elements of this iterator that satisfy the predicate `p`.
 *  The order of the elements is preserved.
 *
 *  @param p the predicate used to test values.
 *  @return an iterator which produces those values of this iterator which satisfy the predicate `p`.
 *  @note Reuse: $consumesAndProducesIterator
 */
def filter(p: A => Boolean): Iterator[A] = new AbstractIterator[A] {
  // TODO 2.12 - Make a full-fledged FilterImpl that will reverse sense of p
  private var hd: A = _
  private var hdDefined: Boolean = false
  def hasNext: Boolean = hdDefined || {
    do {
      if (!self.hasNext) return false
      hd = self.next()
    } while (!p(hd))
    hdDefined = true
    true
  }
  def next() = if (hasNext) { hdDefined = false; hd } else empty.next()
}
The body of `hasNext` is the key part: it pulls elements from the underlying iterator until one satisfies the predicate `p`, caches it in `hd`, and discards the rest, so the result is a new iterator containing only the matching elements, in their original order. These per-partition iterators then make up the new RDD.
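The lazy, one-element-lookahead behavior of `Iterator.filter` can be demonstrated in plain Scala. The helper below is illustrative; the counter shows that a single `next()` only consumes as many source elements as needed to find the first match.

```scala
object IteratorFilterDemo {
  // Filtering through an iterator preserves order and drops non-matches.
  def filteredList(xs: List[Int], p: Int => Boolean): List[Int] =
    xs.iterator.filter(p).toList

  def main(args: Array[String]): Unit = {
    // Laziness: count how many source elements one next() call consumes.
    var pulled = 0
    val src = Iterator(1, 2, 3, 5).map { x => pulled += 1; x }
    val it = src.filter(_ > 1)
    it.next()       // pulls 1 (rejected by p), then 2 (accepted, cached in hd)
    println(pulled) // 2 — only enough elements to find the first match
    println(filteredList(List(1, 2, 3, 5, 8, 9), _ != 2)) // List(1, 3, 5, 8, 9)
  }
}
```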
Example:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object Test extends App {
  val sparkConf = new SparkConf()
    .setAppName("Test")
    .setMaster("local[6]")
  val spark = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()
  val value: RDD[Int] = spark.sparkContext.parallelize(List(1, 2, 3, 5, 8, 9), 3)
  println(value.filter(_ != 2).getNumPartitions) // prints 3
}
The partitioning is not changed: the filtered RDD still has 3 partitions.
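The reason follows from `preservesPartitioning = true` and the absence of a shuffle: each partition is filtered in place, so a partition can become empty but is never removed. A plain-Scala simulation of this (names are illustrative, not Spark API):

```scala
object PartitionPreservationSketch {
  // Filter each "partition" independently, as Spark's filter task does.
  def filterPartitions(parts: Vector[Vector[Int]], p: Int => Boolean): Vector[Vector[Int]] =
    parts.map(_.filter(p))

  def main(args: Array[String]): Unit = {
    val parts = Vector(Vector(1, 2), Vector(3, 5), Vector(8, 9))
    val result = filterPartitions(parts, _ > 4)
    println(result.length) // 3 — partition count unchanged
    println(result)        // Vector(Vector(), Vector(5), Vector(8, 9)) — first is empty
  }
}
```

In real Spark, `getNumPartitions` on the filtered RDD would likewise still report 3; only a repartition or shuffle-producing operation changes the partition count.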