Spark: val b = a.flatMap(x => 1 to x) explained in detail

Original post, 2016-08-28 15:37:05

flatMap

flatMap is similar to map. The difference: map turns each element of the source RDD into exactly one element of the new RDD, while flatMap can turn each source element into multiple (or zero) elements of the new RDD. Example: for each element x of the source RDD, produce x elements, namely the values 1 through x.

val b = a.flatMap(x => 1 to x)

For each element of a, count up from 1 in steps of 1 until the element's value is reached, producing a list: element 1 yields the list 1; element 2 yields the list 1, 2; and so on.
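
Before the concrete walkthrough, here is a minimal contrast sketch of what the same function does under map versus flatMap (assuming a running spark-shell with a SparkContext sc, as in the sessions below; the exact rendering of Range values in the output depends on the Scala version):

val a = sc.parallelize(1 to 4, 2)

// map keeps exactly one output element per input element, so the result
// is an RDD of collections (here, Ranges), not an RDD of Ints:
a.map(x => 1 to x).collect
// => Array(Range(1), Range(1, 2), Range(1, 2, 3), Range(1, 2, 3, 4))

// flatMap flattens those collections into a single RDD[Int]:
a.flatMap(x => 1 to x).collect
// => Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)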

For example:

scala> val a = sc.parallelize(1 to 4, 2)

1. Generate 4 lists:

    1

    1, 2

    1, 2, 3

    1, 2, 3, 4

2. Concatenate (flatten) the 4 lists:

    1, 1, 2, 1, 2, 3, 1, 2, 3, 4
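
This two-step picture (map to lists, then flatten) can be checked on plain Scala collections, with no Spark involved; a small sketch of the equivalence flatMap = map + flatten:

// Step 1: map produces one collection per element.
val lists = (1 to 4).map(x => 1 to x)

// Step 2: flatten concatenates those collections into one sequence.
val merged = lists.flatten
// => Vector(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)

// flatMap performs both steps in a single pass:
assert(merged == (1 to 4).flatMap(x => 1 to x))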


scala> val a = sc.parallelize(1 to 4, 2)
scala> val b = a.flatMap(x => 1 to x)
scala> b.collect
res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
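
Because a was created with 2 partitions, the flattening happens independently inside each partition. One way to see this is glom, a standard RDD method that gathers each partition into an array (the exact split of 1 to 4 across the two partitions is an assumption about how parallelize slices a range):

b.glom.collect
// Likely: Array(Array(1, 1, 2), Array(1, 2, 3, 1, 2, 3, 4))
// i.e. partition 0 held the inputs 1 and 2; partition 1 held 3 and 4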


scala> val a = sc.parallelize(1 to 4, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[73] at parallelize at <console>:22

scala> a.collect
16/08/28 15:25:28 INFO spark.SparkContext: Starting job: collect at <console>:25
16/08/28 15:25:28 INFO scheduler.DAGScheduler: Got job 34 (collect at <console>:25) with 2 output partitions (allowLocal=false)
16/08/28 15:25:28 INFO scheduler.DAGScheduler: Final stage: Stage 37(collect at <console>:25)
16/08/28 15:25:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/08/28 15:25:28 INFO scheduler.DAGScheduler: Missing parents: List()
16/08/28 15:25:28 INFO scheduler.DAGScheduler: Submitting Stage 37 (ParallelCollectionRDD[73] at parallelize at <console>:22), which has no missing parents
16/08/28 15:25:28 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 37 (ParallelCollectionRDD[73] at parallelize at <console>:22)
16/08/28 15:25:28 INFO scheduler.TaskSchedulerImpl: Adding task set 37.0 with 2 tasks
16/08/28 15:25:28 INFO scheduler.TaskSetManager: Starting task 37.0:0 as TID 401 on executor localhost: localhost (PROCESS_LOCAL)
16/08/28 15:25:28 INFO scheduler.TaskSetManager: Serialized task 37.0:0 as 1089 bytes in 6 ms
16/08/28 15:25:28 INFO scheduler.TaskSetManager: Starting task 37.0:1 as TID 402 on executor localhost: localhost (PROCESS_LOCAL)
16/08/28 15:25:28 INFO scheduler.TaskSetManager: Serialized task 37.0:1 as 1089 bytes in 3 ms
16/08/28 15:25:28 INFO executor.Executor: Running task ID 401
16/08/28 15:25:28 INFO executor.Executor: Running task ID 402
16/08/28 15:25:28 INFO executor.Executor: Serialized size of result for 402 is 550
16/08/28 15:25:28 INFO executor.Executor: Serialized size of result for 401 is 550
16/08/28 15:25:28 INFO executor.Executor: Sending result for 402 directly to driver
16/08/28 15:25:28 INFO executor.Executor: Finished task ID 402
16/08/28 15:25:28 INFO executor.Executor: Sending result for 401 directly to driver
16/08/28 15:25:28 INFO executor.Executor: Finished task ID 401
16/08/28 15:25:28 INFO scheduler.TaskSetManager: Finished TID 402 in 179 ms on localhost (progress: 1/2)
16/08/28 15:25:28 INFO scheduler.DAGScheduler: Completed ResultTask(37, 1)
16/08/28 15:25:28 INFO scheduler.TaskSetManager: Finished TID 401 in 207 ms on localhost (progress: 2/2)
16/08/28 15:25:28 INFO scheduler.DAGScheduler: Completed ResultTask(37, 0)
16/08/28 15:25:28 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 37.0, whose tasks have all completed, from pool 
16/08/28 15:25:28 INFO scheduler.DAGScheduler: Stage 37 (collect at <console>:25) finished in 0.242 s
16/08/28 15:25:28 INFO spark.SparkContext: Job finished: collect at <console>:25, took 0.49719503 s
res56: Array[Int] = Array(1, 2, 3, 4)

scala> val b = a.flatMap(x => 1 to x)
b: org.apache.spark.rdd.RDD[Int] = FlatMappedRDD[74] at flatMap at <console>:24

scala> b.collect
16/08/28 15:25:54 INFO spark.SparkContext: Starting job: collect at <console>:27
16/08/28 15:25:54 INFO scheduler.DAGScheduler: Got job 35 (collect at <console>:27) with 2 output partitions (allowLocal=false)
16/08/28 15:25:54 INFO scheduler.DAGScheduler: Final stage: Stage 38(collect at <console>:27)
16/08/28 15:25:54 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/08/28 15:25:54 INFO scheduler.DAGScheduler: Missing parents: List()
16/08/28 15:25:54 INFO scheduler.DAGScheduler: Submitting Stage 38 (FlatMappedRDD[74] at flatMap at <console>:24), which has no missing parents
16/08/28 15:25:54 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 38 (FlatMappedRDD[74] at flatMap at <console>:24)
16/08/28 15:25:54 INFO scheduler.TaskSchedulerImpl: Adding task set 38.0 with 2 tasks
16/08/28 15:25:54 INFO scheduler.TaskSetManager: Starting task 38.0:0 as TID 403 on executor localhost: localhost (PROCESS_LOCAL)
16/08/28 15:25:54 INFO scheduler.TaskSetManager: Serialized task 38.0:0 as 1330 bytes in 3 ms
16/08/28 15:25:54 INFO scheduler.TaskSetManager: Starting task 38.0:1 as TID 404 on executor localhost: localhost (PROCESS_LOCAL)
16/08/28 15:25:54 INFO scheduler.TaskSetManager: Serialized task 38.0:1 as 1330 bytes in 2 ms
16/08/28 15:25:54 INFO executor.Executor: Running task ID 403
16/08/28 15:25:54 INFO executor.Executor: Running task ID 404
16/08/28 15:25:54 INFO executor.Executor: Serialized size of result for 403 is 554
16/08/28 15:25:54 INFO executor.Executor: Sending result for 403 directly to driver
16/08/28 15:25:54 INFO executor.Executor: Finished task ID 403
16/08/28 15:25:54 INFO scheduler.DAGScheduler: Completed ResultTask(38, 0)
16/08/28 15:25:54 INFO scheduler.TaskSetManager: Finished TID 403 in 58 ms on localhost (progress: 1/2)
16/08/28 15:25:54 INFO executor.Executor: Serialized size of result for 404 is 570
16/08/28 15:25:54 INFO executor.Executor: Sending result for 404 directly to driver
16/08/28 15:25:54 INFO executor.Executor: Finished task ID 404
16/08/28 15:25:54 INFO scheduler.TaskSetManager: Finished TID 404 in 71 ms on localhost (progress: 2/2)
16/08/28 15:25:54 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 38.0, whose tasks have all completed, from pool 
16/08/28 15:25:54 INFO scheduler.DAGScheduler: Completed ResultTask(38, 1)
16/08/28 15:25:54 INFO scheduler.DAGScheduler: Stage 38 (collect at <console>:27) finished in 0.082 s
16/08/28 15:25:54 INFO spark.SparkContext: Job finished: collect at <console>:27, took 0.178752245 s
res57: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)

scala> val a = sc.parallelize(1 to 2, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[75] at parallelize at <console>:22

scala> a.collect
16/08/28 15:27:27 INFO spark.SparkContext: Starting job: collect at <console>:25
16/08/28 15:27:27 INFO scheduler.DAGScheduler: Got job 36 (collect at <console>:25) with 2 output partitions (allowLocal=false)
16/08/28 15:27:27 INFO scheduler.DAGScheduler: Final stage: Stage 39(collect at <console>:25)
16/08/28 15:27:27 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/08/28 15:27:27 INFO scheduler.DAGScheduler: Missing parents: List()
16/08/28 15:27:27 INFO scheduler.DAGScheduler: Submitting Stage 39 (ParallelCollectionRDD[75] at parallelize at <console>:22), which has no missing parents
16/08/28 15:27:27 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 39 (ParallelCollectionRDD[75] at parallelize at <console>:22)
16/08/28 15:27:27 INFO scheduler.TaskSchedulerImpl: Adding task set 39.0 with 2 tasks
16/08/28 15:27:27 INFO scheduler.TaskSetManager: Starting task 39.0:0 as TID 405 on executor localhost: localhost (PROCESS_LOCAL)
16/08/28 15:27:27 INFO scheduler.TaskSetManager: Serialized task 39.0:0 as 1089 bytes in 3 ms
16/08/28 15:27:27 INFO scheduler.TaskSetManager: Starting task 39.0:1 as TID 406 on executor localhost: localhost (PROCESS_LOCAL)
16/08/28 15:27:27 INFO scheduler.TaskSetManager: Serialized task 39.0:1 as 1089 bytes in 5 ms
16/08/28 15:27:27 INFO executor.Executor: Running task ID 405
16/08/28 15:27:27 INFO executor.Executor: Running task ID 406
16/08/28 15:27:27 INFO executor.Executor: Serialized size of result for 405 is 546
16/08/28 15:27:27 INFO executor.Executor: Sending result for 405 directly to driver
16/08/28 15:27:27 INFO executor.Executor: Finished task ID 405
16/08/28 15:27:27 INFO scheduler.DAGScheduler: Completed ResultTask(39, 0)
16/08/28 15:27:27 INFO scheduler.TaskSetManager: Finished TID 405 in 67 ms on localhost (progress: 1/2)
16/08/28 15:27:27 INFO executor.Executor: Serialized size of result for 406 is 546
16/08/28 15:27:27 INFO executor.Executor: Sending result for 406 directly to driver
16/08/28 15:27:27 INFO executor.Executor: Finished task ID 406
16/08/28 15:27:27 INFO scheduler.TaskSetManager: Finished TID 406 in 92 ms on localhost (progress: 2/2)
16/08/28 15:27:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 39.0, whose tasks have all completed, from pool 
16/08/28 15:27:27 INFO scheduler.DAGScheduler: Completed ResultTask(39, 1)
16/08/28 15:27:27 INFO scheduler.DAGScheduler: Stage 39 (collect at <console>:25) finished in 0.116 s
16/08/28 15:27:27 INFO spark.SparkContext: Job finished: collect at <console>:25, took 0.149541039 s
res58: Array[Int] = Array(1, 2)

scala> val b = a.flatMap(x => 1 to x)
b: org.apache.spark.rdd.RDD[Int] = FlatMappedRDD[76] at flatMap at <console>:24

scala> b.collect
16/08/28 15:27:41 INFO spark.SparkContext: Starting job: collect at <console>:27
16/08/28 15:27:41 INFO scheduler.DAGScheduler: Got job 37 (collect at <console>:27) with 2 output partitions (allowLocal=false)
16/08/28 15:27:41 INFO scheduler.DAGScheduler: Final stage: Stage 40(collect at <console>:27)
16/08/28 15:27:41 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/08/28 15:27:41 INFO scheduler.DAGScheduler: Missing parents: List()
16/08/28 15:27:41 INFO scheduler.DAGScheduler: Submitting Stage 40 (FlatMappedRDD[76] at flatMap at <console>:24), which has no missing parents
16/08/28 15:27:41 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 40 (FlatMappedRDD[76] at flatMap at <console>:24)
16/08/28 15:27:41 INFO scheduler.TaskSchedulerImpl: Adding task set 40.0 with 2 tasks
16/08/28 15:27:41 INFO scheduler.TaskSetManager: Starting task 40.0:0 as TID 407 on executor localhost: localhost (PROCESS_LOCAL)
16/08/28 15:27:41 INFO scheduler.TaskSetManager: Serialized task 40.0:0 as 1329 bytes in 3 ms
16/08/28 15:27:41 INFO scheduler.TaskSetManager: Starting task 40.0:1 as TID 408 on executor localhost: localhost (PROCESS_LOCAL)
16/08/28 15:27:41 INFO scheduler.TaskSetManager: Serialized task 40.0:1 as 1329 bytes in 4 ms
16/08/28 15:27:41 INFO executor.Executor: Running task ID 407
16/08/28 15:27:41 INFO executor.Executor: Running task ID 408
16/08/28 15:27:41 INFO executor.Executor: Serialized size of result for 407 is 546
16/08/28 15:27:41 INFO executor.Executor: Sending result for 407 directly to driver
16/08/28 15:27:41 INFO executor.Executor: Serialized size of result for 408 is 550
16/08/28 15:27:41 INFO executor.Executor: Sending result for 408 directly to driver
16/08/28 15:27:41 INFO executor.Executor: Finished task ID 408
16/08/28 15:27:41 INFO scheduler.DAGScheduler: Completed ResultTask(40, 0)
16/08/28 15:27:41 INFO scheduler.TaskSetManager: Finished TID 407 in 56 ms on localhost (progress: 1/2)
16/08/28 15:27:41 INFO scheduler.TaskSetManager: Finished TID 408 in 69 ms on localhost (progress: 2/2)
16/08/28 15:27:41 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 40.0, whose tasks have all completed, from pool 
16/08/28 15:27:41 INFO scheduler.DAGScheduler: Completed ResultTask(40, 1)
16/08/28 15:27:41 INFO scheduler.DAGScheduler: Stage 40 (collect at <console>:27) finished in 0.077 s
16/08/28 15:27:41 INFO spark.SparkContext: Job finished: collect at <console>:27, took 0.151573644 s
res59: Array[Int] = Array(1, 1, 2)
16/08/28 15:27:41 INFO executor.Executor: Finished task ID 407
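
One more consequence worth noting: flatMap may emit zero elements for a given input. A minimal sketch (the input 0 to 2 is illustrative, not taken from the sessions above): since 1 to 0 is an empty range, the element 0 contributes nothing to the result.

val c = sc.parallelize(0 to 2, 2)
c.flatMap(x => 1 to x).collect
// => Array(1, 1, 2)   -- 0 contributes nothing, 1 contributes [1], 2 contributes [1, 2]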

See also: the difference between map and flatMap in Spark

http://blog.csdn.net/u013361361/article/details/44463307
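
A classic practical illustration of that difference is splitting lines of text into words (the sample data here is hypothetical, not taken from the linked post):

val lines = sc.parallelize(Seq("hello world", "hi"))

lines.map(_.split(" ")).collect
// => Array(Array(hello, world), Array(hi))   -- an RDD of arrays

lines.flatMap(_.split(" ")).collect
// => Array(hello, world, hi)                 -- an RDD of words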


