RDD Transformation: filter

搬砖小工053

How it works

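filter selects elements: it applies the function f to every element, keeps the elements for which f returns true in the resulting RDD, and drops those for which it returns false. Older Spark releases implemented this as FilteredRDD(this, sc.clean(f)); the source code listed in this post builds a MapPartitionsRDD instead.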

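[Figure: schematic of the filter operator]
In the figure, each box represents one RDD partition, and T can be any type. The user-defined predicate f is applied to every data item, and only the items for which it returns true are kept. For example, V2 and V3 are filtered out and V1 is kept; the resulting partition is labeled V1'.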
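To make the per-partition behavior concrete, here is a minimal sketch in plain Scala with no Spark dependency. It models an RDD as a vector of partitions, which is a simplification for illustration only; the names FilterModel and filterPartitions are hypothetical, and the split of 1 to 10 into two halves is assumed to mirror what makeRDD(1 to 10, 2) produces.

```scala
// Simplified model, not Spark's real data structure:
// an "RDD" is a vector of partitions, each partition a vector of elements.
object FilterModel {
  // Apply the predicate f to each partition independently.
  // The number of partitions stays the same; only elements are dropped.
  def filterPartitions[T](partitions: Vector[Vector[T]])(f: T => Boolean): Vector[Vector[T]] =
    partitions.map(part => part.iterator.filter(f).toVector)

  def main(args: Array[String]): Unit = {
    val partitions = Vector(Vector(1, 2, 3, 4, 5), Vector(6, 7, 8, 9, 10))
    val kept = filterPartitions(partitions)(_ > 3)
    println(kept) // Vector(Vector(4, 5), Vector(6, 7, 8, 9, 10))
  }
}
```

Each partition is processed on its own, so no data moves between partitions and no shuffle is needed.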

Source code

/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
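The source above suggests that filter behaves like mapPartitions with an iterator-level filter and preservesPartitioning = true. The following sketch checks both points through the public API. The object name FilterSketch, the helper names, the local[2] master, and the app name are assumptions made for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object FilterSketch {
  // filter and an equivalent mapPartitions call should yield the same elements.
  def sameAsMapPartitions(sc: SparkContext): Boolean = {
    val rdd = sc.makeRDD(1 to 10, 2)
    val isLarge: Int => Boolean = _ > 3
    val viaFilter = rdd.filter(isLarge).collect()
    val viaMapPartitions =
      rdd.mapPartitions(iter => iter.filter(isLarge), preservesPartitioning = true).collect()
    viaFilter.sameElements(viaMapPartitions)
  }

  // Because partitioning is preserved, a filtered pair RDD keeps its partitioner.
  def keepsPartitioner(sc: SparkContext): Boolean = {
    val pairs = sc.makeRDD(1 to 10, 2).map(i => (i, i * i)).partitionBy(new HashPartitioner(2))
    val filtered = pairs.filter { case (k, _) => k > 3 }
    filtered.partitioner == pairs.partitioner
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("filter-sketch"))
    try {
      println(sameAsMapPartitions(sc)) // true
      println(keepsPartitioner(sc))    // true
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the partitioner matters for pair RDDs: a later join or reduceByKey on the filtered RDD can reuse the existing partitioning instead of shuffling again.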

Usage

scala> val rdd = sc.makeRDD(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:27

scala> rdd.filter(_ > 3).collect
res1: Array[Int] = Array(4, 5, 6, 7, 8, 9, 10)
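One consequence worth seeing in practice: filter never changes the number of partitions, so a selective predicate can leave some partitions empty. The sketch below uses glom to inspect partition sizes after filtering. The object name FilterPartitionSizes, the helper partitionSizes, and the local[2] master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterPartitionSizes {
  // Number of elements left in each partition after filtering.
  def partitionSizes(sc: SparkContext): List[Int] = {
    val rdd = sc.makeRDD(1 to 10, 2) // partitions hold 1 to 5 and 6 to 10
    rdd.filter(_ > 5).glom().map(_.length).collect().toList
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("filter-sizes"))
    try {
      println(partitionSizes(sc)) // List(0, 5)
    } finally {
      sc.stop()
    }
  }
}
```

If a filter leaves many partitions nearly empty, following it with coalesce is a common way to reduce the partition count.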