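The previous article, "One RDD Function a Day - 1: map", walked through the source code of map and filled in some of the Scala syntax that the source relies on. To follow on naturally from it, this article covers flatMap. As the name suggests, flatMap is map plus a flat (flatten) step. So what does this flat actually mean, and what is it good for in practice? Let's take it step by step, starting with the source code.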
/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
For comparison, here is the map function:
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
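As shown above, the two definitions differ only in whether they call iter.map or iter.flatMap, so let's look further at the implementation of iter.flatMap.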
/** Creates a new iterator by applying a function to all values produced by this iterator
 *  and concatenating the results.
 *
 *  @param f the function to apply on each element.
 *  @return  the iterator resulting from applying the given iterator-valued function
 *           `f` to each value produced by this iterator and concatenating the results.
 *  @note    Reuse: $consumesAndProducesIterator
 */
def flatMap[B](f: A => GenTraversableOnce[B]): Iterator[B] = new AbstractIterator[B] {
  private var cur: Iterator[B] = empty
  private def nextCur() { cur = f(self.next()).toIterator }
  def hasNext: Boolean = {
    // Equivalent to cur.hasNext || self.hasNext && { nextCur(); hasNext }
    // but slightly shorter bytecode (better JVM inlining!)
    while (!cur.hasNext) {
      if (!self.hasNext) return false
      nextCur()
    }
    true
  }
  def next(): B = (if (hasNext) cur else empty).next()
}
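To make the loop in hasNext more concrete, here is a minimal sketch in plain Scala (no Spark required). The object name IteratorFlatMapTrace and the helper expand are made up for illustration. It records which inputs the mapping function has been applied to, so we can see that nothing is evaluated until an element is requested, and that empty results are skipped over.

```scala
import scala.collection.mutable.ArrayBuffer

object IteratorFlatMapTrace {
  // Returns: (inputs seen before any pull, first element,
  //           inputs seen after the first pull, remaining elements)
  def trace(): (List[Int], Int, List[Int], List[Int]) = {
    val seen = ArrayBuffer.empty[Int]

    // Hypothetical mapping function: n copies of n, so 0 maps to an empty list.
    def expand(n: Int): List[Int] = {
      seen += n
      List.fill(n)(n)
    }

    val it = Iterator(0, 2, 0, 1).flatMap(expand)
    val beforeFirstPull = seen.toList // List(): flatMap itself evaluates nothing
    val first = it.next()             // 2: the empty result for 0 is skipped
    val afterFirstPull = seen.toList  // List(0, 2): only as much input as needed
    val rest = it.toList              // List(2, 1): the second 0 is skipped as well
    (beforeFirstPull, first, afterFirstPull, rest)
  }

  def main(args: Array[String]): Unit = println(trace())
}
```

The while loop in hasNext is what skips the two empty lists: it keeps calling nextCur() until cur has an element or the underlying iterator is exhausted.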
Again, rather than reading this implementation in isolation, we compare it with iter.map:
/** Creates a new iterator that maps all produced values of this iterator
 *  to new values using a transformation function.
 *
 *  @param f  the transformation function
 *  @return a new iterator which transforms every value produced by this
 *          iterator by applying the function `f` to it.
 *  @note   Reuse: $consumesAndProducesIterator
 */
def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
  def hasNext = self.hasNext
  def next() = f(self.next())
}
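First, iter.map is simple: it returns an anonymous subclass of AbstractIterator that implements the two methods hasNext and next, whose roles are familiar to anyone who has worked with iterators. As mentioned in the previous article, MapPartitionsRDD defines a compute method, which hands the iterator over a partition's elements to the closure that was passed in; it is called once for each partition. This is a good illustration of how closures are used in Scala. It also gives us a clear picture: what iter.map and iter.flatMap do is traverse the elements of the RDD, one partition at a time.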
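The relationship between compute and the closure can be sketched with a toy model in plain Scala. Everything here (ToyRDD, ToyParallelRDD, ToyMapPartitionsRDD) is a hypothetical, heavily simplified stand-in, not Spark's actual classes: there is no TaskContext, no ClassTag, no closure cleaning, and the function passed to flatMap is assumed to return an Iterable rather than a TraversableOnce.

```scala
object ToyRDDDemo {
  // Hypothetical, simplified stand-in for an RDD: a fixed number of partitions,
  // each of which can be computed into an iterator over its elements.
  abstract class ToyRDD[T] {
    def numPartitions: Int
    def compute(pid: Int): Iterator[T] // elements of ONE partition

    def map[U](f: T => U): ToyRDD[U] =
      new ToyMapPartitionsRDD[U, T](this, (_, iter) => iter.map(f))

    def flatMap[U](f: T => Iterable[U]): ToyRDD[U] =
      new ToyMapPartitionsRDD[U, T](this, (_, iter) => iter.flatMap(f))

    // Evaluates every partition in order and concatenates the results.
    def collect(): List[T] =
      (0 until numPartitions).flatMap(pid => compute(pid)).toList
  }

  // Leaf RDD backed by in-memory partitions.
  class ToyParallelRDD[T](parts: Vector[Vector[T]]) extends ToyRDD[T] {
    def numPartitions: Int = parts.length
    def compute(pid: Int): Iterator[T] = parts(pid).iterator
  }

  // Applies f to the parent's iterator for the requested partition only.
  class ToyMapPartitionsRDD[U, T](prev: ToyRDD[T], f: (Int, Iterator[T]) => Iterator[U])
      extends ToyRDD[U] {
    def numPartitions: Int = prev.numPartitions
    def compute(pid: Int): Iterator[U] = f(pid, prev.compute(pid))
  }

  val rdd = new ToyParallelRDD(Vector(Vector("a b", "c"), Vector("d e f")))
  val perLine: List[List[String]] = rdd.map(_.split(" ").toList).collect()
  val words: List[String] = rdd.flatMap(_.split(" ").toList).collect()

  def main(args: Array[String]): Unit = {
    println(perLine) // one list of tokens per input line
    println(words)   // all tokens in a single flat list
  }
}
```

In this sketch the closure only ever sees the iterator of the partition being computed, so map and flatMap never need the whole dataset at once.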
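Note one detail: flatMap defines nextCur. Instead of returning f(self.next()) directly as a value, it converts the result to an iterator with toIterator and stores it in cur; hasNext and next then drain cur before pulling the next element from the underlying iterator. The effect is to unpack one level of nesting: the elements of each collection returned by f are emitted one by one. This difference in approach produces different results in practice. Here is an example: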
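/* Examine the difference between map and flatMap */
object mapAndFlatMap {
  val familyA = List("Jonson", "Harry", "Marry")
  val familyB = List("Jack", "Rose", "Ben")
  val family = List(familyA, familyB)
  def main(args: Array[String]): Unit = {
    // map hands each inner List to the function, so whole lists are printed
    family.map(x => print(x + " "))
    // the inner map visits every name of each inner List, so single names are printed
    family.flatMap(x => x.map(x => print(x + " ")))
  }
}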
The output is:
List(Jonson, Harry, Marry) List(Jack, Rose, Ben) Jonson Harry Marry Jack Rose Ben
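The example above only prints as a side effect, so the difference in the returned values is not visible. As a complementary sketch (the object name MapVsFlatMapValues and the upper-casing step are made up for illustration), the following reuses the same family data and keeps the results:

```scala
object MapVsFlatMapValues {
  val familyA = List("Jonson", "Harry", "Marry")
  val familyB = List("Jack", "Rose", "Ben")
  val family = List(familyA, familyB)

  // map keeps the nesting: one result per inner list.
  val nested: List[List[String]] =
    family.map(members => members.map(_.toUpperCase))

  // flatMap concatenates the per-list results into a single list.
  val flat: List[String] =
    family.flatMap(members => members.map(_.toUpperCase))

  // Only ONE level is removed: the elements here are still lists.
  val oneLevel: List[List[String]] =
    List(family, family).flatMap(x => x)

  def main(args: Array[String]): Unit = {
    println(nested)        // List(List(JONSON, HARRY, MARRY), List(JACK, ROSE, BEN))
    println(flat)          // List(JONSON, HARRY, MARRY, JACK, ROSE, BEN)
    println(oneLevel.size) // 4
  }
}
```

The last value shows that flatMap is not recursive: flattening a list of lists of lists removes only the outermost level.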
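Note that the function passed to flatMap must return something traversable: a TraversableOnce for RDD.flatMap, or a GenTraversableOnce for Iterator.flatMap, as the signatures above show.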
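A short sketch of what this constraint means in practice, using ordinary Scala collections rather than an RDD (the object name FlatMapReturnTypes and the sample data are assumptions for illustration):

```scala
import scala.util.Try

object FlatMapReturnTypes {
  // Array results are accepted through the standard implicit array wrapper.
  val words: List[String] = List("a b", "c").flatMap(_.split(" "))

  // Option results are accepted too: None contributes nothing, Some one element.
  val numbers: List[Int] = List("1", "x", "3").flatMap(s => Try(s.toInt).toOption)

  // An Iterator is also a valid result.
  val repeated: List[Int] = List(1, 2).flatMap(n => Iterator.fill(n)(n))

  // A plain value is not traversable, so this line would not compile:
  // val broken = List(1, 2).flatMap(n => n + 1)

  def main(args: Array[String]): Unit = {
    println(words)    // List(a, b, c)
    println(numbers)  // List(1, 3)
    println(repeated) // List(1, 2, 2)
  }
}
```

Returning an Option from the function is a common way to map and filter in a single pass.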
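To summarize, both flatMap and map apply a function to every element of an RDD; they differ only in granularity. map produces exactly one output element per input element, whereas flatMap flattens the collection returned for each input, so one input can yield zero or more output elements. In practice, choose whichever fits the need.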
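To tie this back to RDDs, here is a minimal sketch of the same contrast on a real RDD. It assumes Spark is on the classpath and runs in local mode; the object name RddMapVsFlatMap, the application name, and the sample lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddMapVsFlatMap {
  // Returns (number of map results, number of flatMap results, flattened tokens).
  def run(sc: SparkContext): (Long, Long, Array[String]) = {
    val lines = sc.parallelize(Seq("hello spark", "hello flatMap"), numSlices = 2)

    // map: one Array[String] per input line (RDD[Array[String]])
    val tokensPerLine = lines.map(_.split(" "))

    // flatMap: the arrays are flattened into individual tokens (RDD[String])
    val tokens = lines.flatMap(_.split(" "))

    (tokensPerLine.count(), tokens.count(), tokens.collect())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("map-vs-flatMap").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (perLine, flat, tokens) = run(sc)
      println(s"map produced $perLine elements, flatMap produced $flat")
      println(tokens.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

With two input lines of two words each, map yields two elements and flatMap yields four, which is the granularity difference described above.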