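2. flatMap operation
flatMap splits each element of the original RDD into multiple elements and packs all of the resulting elements into a new RDD.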
scala> val rddData = sc.parallelize(Array("one,two.three","four,five,six","seven,eight,nine,ten"))
rddData: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rddData2 = rddData.flatMap(_.split(","))
rddData2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at flatMap at <console>:26
scala> rddData2.collect
res0: Array[String] = Array(one, two.three, four, five, six, seven, eight, nine, ten)
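Notes:
rddData.flatMap(_.split(",")): splits each string in the Array on ",". Each split returns an array of substrings, which is the kind of collection the function passed to flatMap must return. flatMap then flattens those arrays into a single RDD[String], so the result holds the individual substrings rather than one array per input string. Because "two.three" contains no comma, it stays a single element in the result.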
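To make the flattening step concrete, the following minimal sketch contrasts map with flatMap on the same input. It is only an illustration: a plain Scala Array stands in for the RDD (an assumption, so that the snippet runs without a SparkContext), and the names lines, nested and flat are hypothetical. The statements can be pasted into the Scala REPL or spark-shell.

```scala
// Assumption: a plain Scala Array stands in for the RDD, so no SparkContext is needed.
val lines = Array("one,two.three", "four,five,six", "seven,eight,nine,ten")

// map produces exactly one output per input, so the result stays nested:
// one Array[String] for each original string.
val nested: Array[Array[String]] = lines.map(_.split(","))

// flatMap applies the same function, then concatenates the per-element arrays
// into one flat collection of substrings.
val flat: Array[String] = lines.flatMap(_.split(","))

println(nested.length)        // 3
println(flat.length)          // 9
println(flat.mkString(", "))  // one, two.three, four, five, six, seven, eight, nine, ten
```

On a real RDD, calling rddData.map(_.split(",")) would likewise yield an RDD[Array[String]] with three elements, while flatMap yields the nine-element RDD[String] shown in the transcript above.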