1. The difference between map and flatMap
map and flatMap in RDD.scala
package com.grace.updateState

import org.apache.spark.{SparkConf, SparkContext}

object MapAndFlatMap {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map_flatMap_demo").setMaster("local"))
    val arrayRDD = sc.parallelize(Array("a_b", "c_d", "e_f"))

    arrayRDD.foreach(println) // output 1:
    // a_b
    // c_d
    // e_f

    arrayRDD.map(string => {
      string.split("_")
    }).foreach(x => {
      println(x.mkString(",")) // output 2:
      // a,b
      // c,d
      // e,f
    })

    arrayRDD.flatMap(string => {
      string.split("_")
    }).foreach(x => {
      println(x.mkString(",")) // output 3:
      // a
      // b
      // c
      // d
      // e
      // f
    })
  }
}
Comparing output 2 with output 3, the conclusion follows directly:
after map, the RDD's contents are Array(Array("a","b"), Array("c","d"), Array("e","f"));
after flatMap, the RDD's contents are Array("a","b","c","d","e","f").
In other words, flatMap(func) applies func to each input element and then flattens all of the returned collections into a single collection.
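The same contrast holds for plain Scala collections, whose map/flatMap semantics RDDs mirror; a minimal sketch that needs no Spark cluster (object and variable names are illustrative):

```scala
object MapVsFlatMapDemo extends App {
  val data = Array("a_b", "c_d", "e_f")

  // map keeps exactly one output element per input element,
  // so splitting each string yields an array of arrays
  val mapped: Array[Array[String]] = data.map(_.split("_"))
  println(mapped.map(_.mkString(",")).mkString("; ")) // a,b; c,d; e,f

  // flatMap flattens each returned array into one flat array
  val flatMapped: Array[String] = data.flatMap(_.split("_"))
  println(flatMapped.mkString(",")) // a,b,c,d,e,f

  // flatMap(f) is equivalent to map(f) followed by flatten
  assert(flatMapped.sameElements(data.map(_.split("_")).flatten))
}
```

This also makes the equivalence explicit: flatMap(f) behaves like map(f).flatten.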
Source code
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
map and flatMap in DStream.scala
Source code
/** Return a new DStream by applying a function to all elements of this DStream. */
def map[U: ClassTag](mapFunc: T => U): DStream[U] = ssc.withScope {
  new MappedDStream(this, context.sparkContext.clean(mapFunc))
}