One RDD Function a Day (2): flatMap

weixin_33717298

Original link: https://my.oschina.net/hunglish/blog/1542515

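The previous article, "One RDD Function a Day (1): map", walked through the source code of the map function and the Scala syntax it relies on. To follow on naturally from it, this article covers flatMap. As the name suggests, flatMap is map plus a flatten step. What exactly does "flatten" mean, and what is it good for in practice? Let's go step by step, starting with the source code.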

/**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

For comparison, here is the map function:

/**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

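The two methods differ only in whether they call iter.map or iter.flatMap, so let's look one level deeper, at the implementation of iter.flatMap.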

/** Creates a new iterator by applying a function to all values produced by this iterator
   *  and concatenating the results.
   *
   *  @param f the function to apply on each element.
   *  @return  the iterator resulting from applying the given iterator-valued function
   *           `f` to each value produced by this iterator and concatenating the results.
   *  @note    Reuse: $consumesAndProducesIterator
   */
  def flatMap[B](f: A => GenTraversableOnce[B]): Iterator[B] = new AbstractIterator[B] {
    private var cur: Iterator[B] = empty
    private def nextCur() { cur = f(self.next()).toIterator }
    def hasNext: Boolean = {
      // Equivalent to cur.hasNext || self.hasNext && { nextCur(); hasNext }
      // but slightly shorter bytecode (better JVM inlining!)
      while (!cur.hasNext) {
        if (!self.hasNext) return false
        nextCur()
      }
      true
    }
    def next(): B = (if (hasNext) cur else empty).next()
  }

Again, rather than reading this implementation in isolation, let's compare it with iter.map:

/** Creates a new iterator that maps all produced values of this iterator
   *  to new values using a transformation function.
   *
   *  @param f  the transformation function
   *  @return a new iterator which transforms every value produced by this
   *          iterator by applying the function `f` to it.
   *  @note   Reuse: $consumesAndProducesIterator
   */
  def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
    def hasNext = self.hasNext
    def next() = f(self.next())
  }

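First, iter.map is simple: it returns an anonymous subclass of AbstractIterator that overrides two methods, hasNext and next. hasNext delegates to the underlying iterator, and next applies f to the underlying iterator's next element; anyone familiar with iterators will recognize the roles of these two methods. As mentioned in the previous article, MapPartitionsRDD defines a compute method. It is called for a given partition and passes that partition's iterator to the closure supplied by map or flatMap, which is a good illustration of how Scala closures are put to use. This gives us a clear picture: what iter.map and iter.flatMap do is traverse the elements of each partition of the RDD, applying f along the way.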
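To make the per-partition traversal concrete, here is a minimal sketch in plain Scala that does not use Spark at all. `ToyPartitions`, `Partitions` and `mapPartitions` are hypothetical names introduced only for illustration: each inner `Seq` stands in for one partition, and the function receives that partition's iterator, roughly mirroring the shape of `(context, pid, iter) => iter.map(cleanF)` with `context` and `pid` left out.

```scala
object ToyPartitions {
  // Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
  type Partitions[T] = Seq[Seq[T]]

  // Apply f to the iterator of every partition and materialize each result.
  // This only imitates the idea behind MapPartitionsRDD.compute; it is not Spark code.
  def mapPartitions[T, U](data: Partitions[T])(f: Iterator[T] => Iterator[U]): Partitions[U] =
    data.map(partition => f(partition.iterator).toList)

  def main(args: Array[String]): Unit = {
    val data: Partitions[String] = Seq(Seq("a b", "c"), Seq("d e f"))

    // map: one output element per input element in each partition.
    println(mapPartitions(data)(_.map(_.length)))
    // prints List(List(3, 1), List(5))

    // flatMap: each input element may expand to several output elements.
    println(mapPartitions(data)(_.flatMap(_.split(" ").toList)))
    // prints List(List(a, b, c), List(d, e, f))
  }
}
```

Note that the partition boundaries survive in both cases: flattening happens inside a partition, never across partitions.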

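Note one detail: flatMap defines nextCur. Instead of handing back f(self.next()) as a single value, it converts that result to an iterator with toIterator and stores it in cur. next() then drains cur one element at a time, and hasNext calls nextCur again only once cur is exhausted, skipping any empty results along the way. This is how the individual results end up concatenated. Only one level of nesting is flattened; the traversal is not recursive. This difference in approach leads to different results in practice, as the following example shows: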
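The role of cur and nextCur may be easier to follow in a stripped-down version. The sketch below is a simplified re-implementation written for this article, not the library code: `FlatMapSketch` and `flatMapIter` are hypothetical names, and the function is assumed to return an `Iterator[B]` directly so that no toIterator conversion is needed.

```scala
object FlatMapSketch {
  // Hypothetical helper: a simplified flatMap over iterators.
  def flatMapIter[A, B](self: Iterator[A])(f: A => Iterator[B]): Iterator[B] =
    new Iterator[B] {
      // The inner iterator currently being drained.
      private var cur: Iterator[B] = Iterator.empty

      def hasNext: Boolean = {
        // Keep pulling from the outer iterator until an inner iterator
        // has data, or until the outer iterator itself is exhausted.
        while (!cur.hasNext) {
          if (!self.hasNext) return false
          cur = f(self.next())
        }
        true
      }

      // Calling next() on an empty iterator throws NoSuchElementException.
      def next(): B = if (hasNext) cur.next() else Iterator.empty.next()
    }

  def main(args: Array[String]): Unit = {
    val words = flatMapIter(Iterator("a b", "", "c"))(line =>
      line.split(" ").iterator.filter(_.nonEmpty))
    println(words.toList) // prints List(a, b, c)
  }
}
```

The empty string in the middle yields an empty inner iterator, which the while loop in hasNext simply skips.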

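/* Examine the difference between map and flatMap */
object mapAndFlatMap {

  val familyA = List("Jonson", "Harry", "Marry")
  val familyB = List("Jack", "Rose", "Ben")
  val family = List(familyA, familyB)

  def main(args: Array[String]): Unit = {
    // map hands each inner List to the function as a whole, so two lists are printed
    family.map(x => print(x + " "))
    // here the inner map visits every name, so the six names are printed one by one
    family.flatMap(x => x.map(name => print(name + " ")))
  }
}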

The output is:

List(Jonson, Harry, Marry) List(Jack, Rose, Ben) Jonson Harry Marry Jack Rose Ben
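The printed output shows what each call visits, but not what each call returns. As a complementary sketch using the same family data (the object name `MapVsFlatMapResults` and the upper-casing step are assumptions added for illustration), the following compares the shapes of the two results:

```scala
object MapVsFlatMapResults {
  val familyA = List("Jonson", "Harry", "Marry")
  val familyB = List("Jack", "Rose", "Ben")
  val family = List(familyA, familyB)

  // map keeps the nesting: one output element (a List) per input element.
  val mapped: List[List[String]] =
    family.map(members => members.map(_.toUpperCase))

  // flatMap concatenates the inner results into a single flat List.
  val flattened: List[String] =
    family.flatMap(members => members.map(_.toUpperCase))

  def main(args: Array[String]): Unit = {
    println(mapped)
    // prints List(List(JONSON, HARRY, MARRY), List(JACK, ROSE, BEN))
    println(flattened)
    // prints List(JONSON, HARRY, MARRY, JACK, ROSE, BEN)
  }
}
```

In other words, for ordinary collections, flatMap(f) gives the same result as map(f) followed by flatten.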

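Note that the function passed to flatMap must return something traversable: TraversableOnce[U] for RDD.flatMap and GenTraversableOnce[B] for Iterator.flatMap, as the signatures above show.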
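A few small cases on plain Scala lists illustrate what "traversable" allows in practice. This is a sketch with made-up data (`FlatMapReturnTypes` and its values are hypothetical); it uses List.flatMap rather than RDD.flatMap, on the assumption that the flattening behaviour carries over.

```scala
object FlatMapReturnTypes {
  val numbers = List(1, 2, 3, 4)

  // Returning a List: element n expands to n copies of itself.
  val repeated: List[Int] = numbers.flatMap(n => List.fill(n)(n))

  // Returning an Option: None contributes nothing, Some(n) contributes n,
  // so flatMap can filter and transform in one pass.
  val evens: List[Int] = numbers.flatMap(n => if (n % 2 == 0) Some(n) else None)

  // Only one level of nesting is removed, not all of them.
  val nested = List(List(List(1), List(2)), List(List(3)))
  val oneLevel: List[List[Int]] = nested.flatMap(inner => inner)

  def main(args: Array[String]): Unit = {
    println(repeated) // prints List(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
    println(evens)    // prints List(2, 4)
    println(oneLevel) // prints List(List(1), List(2), List(3))
  }
}
```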

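To summarize, flatMap and map both apply a function to every element of an RDD; they differ only in granularity. map produces exactly one output element per input element, whereas flatMap lets each input element produce zero or more output elements and concatenates them into a single RDD. In practice, choose whichever fits the task at hand.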
