Launching spark-shell, with Examples

The transcript below starts spark-shell (Spark 1.4.0) on a CDH node and walks through a few basic RDD operations against a small text file. Note that the first attempt fails with "command not found": spark-shell's bin directory is only added to PATH inside /etc/profile, so sourcing that file in the current shell makes the command resolvable.

[root@cdh1 hadoop]# spark-shell
bash: spark-shell: command not found
[root@cdh1 hadoop]# source /etc/profile
[root@cdh1 hadoop]# spark-shell
16/06/17 10:07:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/17 10:07:59 INFO spark.SecurityManager: Changing view acls to: root
16/06/17 10:07:59 INFO spark.SecurityManager: Changing modify acls to: root
16/06/17 10:07:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/06/17 10:07:59 INFO spark.HttpServer: Starting HTTP Server
16/06/17 10:07:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/06/17 10:07:59 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:48030
16/06/17 10:07:59 INFO util.Utils: Successfully started service 'HTTP class server' on port 48030.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/


Using Scala version 2.10.4 (Java HotSpot(TM) Client VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
16/06/17 10:08:08 WARN util.Utils: Your hostname, cdh1 resolves to a loopback address: 127.0.0.1; using 192.168.0.103 instead (on interface eth3)
16/06/17 10:08:08 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/06/17 10:08:08 INFO spark.SparkContext: Running Spark version 1.4.0
16/06/17 10:08:09 INFO spark.SecurityManager: Changing view acls to: root
16/06/17 10:08:09 INFO spark.SecurityManager: Changing modify acls to: root
16/06/17 10:08:09 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/06/17 10:08:10 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/06/17 10:08:10 INFO Remoting: Starting remoting
16/06/17 10:08:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.0.103:46302]
16/06/17 10:08:11 INFO util.Utils: Successfully started service 'sparkDriver' on port 46302.
16/06/17 10:08:11 INFO spark.SparkEnv: Registering MapOutputTracker
16/06/17 10:08:11 INFO spark.SparkEnv: Registering BlockManagerMaster
16/06/17 10:08:11 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-4e24ee66-6db3-41f1-9fe5-12da91e71fc3/blockmgr-36b2f8b4-bb92-4759-9a4e-90ae9c128577
16/06/17 10:08:11 INFO storage.MemoryStore: MemoryStore started with capacity 267.3 MB
16/06/17 10:08:11 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-4e24ee66-6db3-41f1-9fe5-12da91e71fc3/httpd-aeb3a7a4-3947-4f11-9504-8ff175a289b2
16/06/17 10:08:11 INFO spark.HttpServer: Starting HTTP Server
16/06/17 10:08:11 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/06/17 10:08:11 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:40375
16/06/17 10:08:11 INFO util.Utils: Successfully started service 'HTTP file server' on port 40375.
16/06/17 10:08:12 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/06/17 10:08:12 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/06/17 10:08:13 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/06/17 10:08:13 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/06/17 10:08:13 INFO ui.SparkUI: Started SparkUI at http://192.168.0.103:4040
16/06/17 10:08:13 INFO executor.Executor: Starting executor ID driver on host localhost
16/06/17 10:08:13 INFO executor.Executor: Using REPL class URI: http://192.168.0.103:48030
16/06/17 10:08:13 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42970.
16/06/17 10:08:13 INFO netty.NettyBlockTransferService: Server created on 42970
16/06/17 10:08:13 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/06/17 10:08:13 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:42970 with 267.3 MB RAM, BlockManagerId(driver, localhost, 42970)
16/06/17 10:08:13 INFO storage.BlockManagerMaster: Registered BlockManager
16/06/17 10:08:14 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
16/06/17 10:08:15 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
16/06/17 10:08:17 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/06/17 10:08:17 INFO metastore.ObjectStore: ObjectStore, initialize called
16/06/17 10:08:18 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/06/17 10:08:18 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/06/17 10:08:18 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/06/17 10:08:19 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/06/17 10:08:21 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/06/17 10:08:21 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5.  Encountered: "@" (64), after : "".
16/06/17 10:08:23 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/06/17 10:08:23 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/06/17 10:08:25 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/06/17 10:08:25 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/06/17 10:08:26 INFO metastore.ObjectStore: Initialized ObjectStore
16/06/17 10:08:26 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa
16/06/17 10:08:28 INFO metastore.HiveMetaStore: Added admin role in metastore
16/06/17 10:08:28 INFO metastore.HiveMetaStore: Added public role in metastore
16/06/17 10:08:28 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
16/06/17 10:08:29 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
16/06/17 10:08:29 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
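
With the shell up, two entry points are pre-created, as the messages above say: sc (the SparkContext) and sqlContext (a HiveContext, per the Hive initialization log). A couple of quick sanity checks using standard SparkContext API; the exact master value shown is an assumption for a shell started without a --master URL:

    sc.version   // "1.4.0", matching the banner above
    sc.master    // e.g. "local[*]" when no --master URL was passed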


scala> val textFile = sc.textFile("file:///spark.test.txt");
16/06/17 10:48:39 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
16/06/17 10:48:39 INFO storage.MemoryStore: ensureFreeSpace(85352) called with curMem=0, maxMem=280248975
16/06/17 10:48:39 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 83.4 KB, free 267.2 MB)
16/06/17 10:48:40 INFO storage.MemoryStore: ensureFreeSpace(19999) called with curMem=85352, maxMem=280248975
16/06/17 10:48:40 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.5 KB, free 267.2 MB)
16/06/17 10:48:40 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42970 (size: 19.5 KB, free: 267.2 MB)
16/06/17 10:48:40 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:21
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
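
textFile is a transformation and therefore lazy: the log above only creates a broadcast for the Hadoop configuration, and the file itself is not read until an action such as count() runs below. The same API works against HDFS by changing the URI scheme; a hypothetical variant, assuming the file had also been uploaded to HDFS at the same path:

    val hdfsFile = sc.textFile("hdfs:///spark.test.txt")   // hypothetical HDFS copy; path is an assumption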


scala> textFile.count();
16/06/17 10:49:24 INFO mapred.FileInputFormat: Total input paths to process : 1
16/06/17 10:49:24 INFO spark.SparkContext: Starting job: count at <console>:24
16/06/17 10:49:24 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:24) with 1 output partitions (allowLocal=false)
16/06/17 10:49:24 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(count at <console>:24)
16/06/17 10:49:24 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/17 10:49:24 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/17 10:49:24 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:21), which has no missing parents
16/06/17 10:49:24 INFO storage.MemoryStore: ensureFreeSpace(2976) called with curMem=105351, maxMem=280248975
16/06/17 10:49:24 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.9 KB, free 267.2 MB)
16/06/17 10:49:24 INFO storage.MemoryStore: ensureFreeSpace(1755) called with curMem=108327, maxMem=280248975
16/06/17 10:49:24 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1755.0 B, free 267.2 MB)
16/06/17 10:49:24 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42970 (size: 1755.0 B, free: 267.2 MB)
16/06/17 10:49:24 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
16/06/17 10:49:24 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:21)
16/06/17 10:49:24 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/06/17 10:49:24 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1393 bytes)
16/06/17 10:49:24 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/06/17 10:49:25 INFO rdd.HadoopRDD: Input split: file:/spark.test.txt:0+331
16/06/17 10:49:25 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/06/17 10:49:25 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/06/17 10:49:25 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/06/17 10:49:25 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/06/17 10:49:25 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/06/17 10:49:25 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1830 bytes result sent to driver
16/06/17 10:49:25 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 927 ms on localhost (1/1)
16/06/17 10:49:25 INFO scheduler.DAGScheduler: ResultStage 0 (count at <console>:24) finished in 0.963 s
16/06/17 10:49:25 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/06/17 10:49:25 INFO scheduler.DAGScheduler: Job 0 finished: count at <console>:24, took 1.495097 s
res0: Long = 7
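
count() is an action, which is why a full job (stage, task, result) appears above. The result 7 matches the seven lines of /spark.test.txt shown at the end of this post. Since the same RDD is scanned again further down, caching it is a common optional step; a minimal sketch using the standard cache() API:

    textFile.cache()    // mark the RDD for in-memory storage
    textFile.count()    // the next action populates the cache; later actions reuse it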


scala> val linesWithSpark = textFile.filter(line => line.contains("spark"));
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23


scala> textFile.filter(line => line.contains("spark")).count();
16/06/17 10:51:17 INFO spark.SparkContext: Starting job: count at <console>:24
16/06/17 10:51:17 INFO scheduler.DAGScheduler: Got job 1 (count at <console>:24) with 1 output partitions (allowLocal=false)
16/06/17 10:51:17 INFO scheduler.DAGScheduler: Final stage: ResultStage 1(count at <console>:24)
16/06/17 10:51:17 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/17 10:51:17 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/17 10:51:17 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at <console>:24), which has no missing parents
16/06/17 10:51:17 INFO storage.MemoryStore: ensureFreeSpace(3200) called with curMem=110082, maxMem=280248975
16/06/17 10:51:17 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 267.2 MB)
16/06/17 10:51:17 INFO storage.MemoryStore: ensureFreeSpace(1866) called with curMem=113282, maxMem=280248975
16/06/17 10:51:17 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1866.0 B, free 267.2 MB)
16/06/17 10:51:17 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:42970 (size: 1866.0 B, free: 267.2 MB)
16/06/17 10:51:17 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
16/06/17 10:51:17 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at <console>:24)
16/06/17 10:51:17 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/06/17 10:51:17 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1393 bytes)
16/06/17 10:51:17 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
16/06/17 10:51:17 INFO rdd.HadoopRDD: Input split: file:/spark.test.txt:0+331
16/06/17 10:51:17 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 1830 bytes result sent to driver
16/06/17 10:51:17 INFO scheduler.DAGScheduler: ResultStage 1 (count at <console>:24) finished in 0.023 s
16/06/17 10:51:17 INFO scheduler.DAGScheduler: Job 1 finished: count at <console>:24, took 0.042199 s
res1: Long = 5


scala> 16/06/17 10:51:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 21 ms on localhost (1/1)
16/06/17 10:51:17 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
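
filter is another lazy transformation, so it can be chained directly onto textFile, and only the terminal count() triggers a job. The result 5 is the number of lines containing "spark" as a substring (which includes the "spark2" line). The linesWithSpark RDD defined just above gives the same answer:

    linesWithSpark.count()   // 5, same as the chained version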




scala> textFile.first();
16/06/17 10:51:50 INFO spark.SparkContext: Starting job: first at <console>:24
16/06/17 10:51:50 INFO scheduler.DAGScheduler: Got job 2 (first at <console>:24) with 1 output partitions (allowLocal=true)
16/06/17 10:51:50 INFO scheduler.DAGScheduler: Final stage: ResultStage 2(first at <console>:24)
16/06/17 10:51:50 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/17 10:51:50 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/17 10:51:50 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[1] at textFile at <console>:21), which has no missing parents
16/06/17 10:51:50 INFO storage.MemoryStore: ensureFreeSpace(3136) called with curMem=115148, maxMem=280248975
16/06/17 10:51:50 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.1 KB, free 267.2 MB)
16/06/17 10:51:50 INFO storage.MemoryStore: ensureFreeSpace(1813) called with curMem=118284, maxMem=280248975
16/06/17 10:51:50 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1813.0 B, free 267.2 MB)
16/06/17 10:51:50 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:42970 (size: 1813.0 B, free: 267.2 MB)
16/06/17 10:51:50 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:874
16/06/17 10:51:50 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[1] at textFile at <console>:21)
16/06/17 10:51:50 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
16/06/17 10:51:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1393 bytes)
16/06/17 10:51:50 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
16/06/17 10:51:50 INFO rdd.HadoopRDD: Input split: file:/spark.test.txt:0+331
16/06/17 10:51:50 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 1850 bytes result sent to driver
16/06/17 10:51:50 INFO scheduler.DAGScheduler: ResultStage 2 (first at <console>:24) finished in 0.014 s
16/06/17 10:51:50 INFO scheduler.DAGScheduler: Job 2 finished: first at <console>:24, took 0.040965 s
res2: String = hello java hello hadoop hello hive hehe spar1 hh jjj kk


scala> 16/06/17 10:51:50 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 19 ms on localhost (1/1)
16/06/17 10:51:50 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
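
first() returns only the first element of the RDD, here the first line of the file. To preview a few lines instead of one, take(n), also standard RDD API, returns the first n elements:

    textFile.take(3)   // Array[String] of the first three lines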




scala> textFile.count();
16/06/17 10:52:13 INFO spark.SparkContext: Starting job: count at <console>:24
16/06/17 10:52:13 INFO scheduler.DAGScheduler: Got job 3 (count at <console>:24) with 1 output partitions (allowLocal=false)
16/06/17 10:52:13 INFO scheduler.DAGScheduler: Final stage: ResultStage 3(count at <console>:24)
16/06/17 10:52:13 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/17 10:52:13 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/17 10:52:13 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[1] at textFile at <console>:21), which has no missing parents
16/06/17 10:52:13 INFO storage.MemoryStore: ensureFreeSpace(2976) called with curMem=120097, maxMem=280248975
16/06/17 10:52:13 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.9 KB, free 267.1 MB)
16/06/17 10:52:13 INFO storage.MemoryStore: ensureFreeSpace(1755) called with curMem=123073, maxMem=280248975
16/06/17 10:52:13 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1755.0 B, free 267.1 MB)
16/06/17 10:52:13 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:42970 (size: 1755.0 B, free: 267.2 MB)
16/06/17 10:52:13 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:874
16/06/17 10:52:13 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[1] at textFile at <console>:21)
16/06/17 10:52:13 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
16/06/17 10:52:13 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, PROCESS_LOCAL, 1393 bytes)
16/06/17 10:52:13 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
16/06/17 10:52:13 INFO rdd.HadoopRDD: Input split: file:/spark.test.txt:0+331
16/06/17 10:52:13 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 1830 bytes result sent to driver
16/06/17 10:52:13 INFO scheduler.DAGScheduler: ResultStage 3 (count at <console>:24) finished in 0.004 s
16/06/17 10:52:13 INFO scheduler.DAGScheduler: Job 3 finished: count at <console>:24, took 0.019111 s
res3: Long = 7


scala> 16/06/17 10:52:13 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 9 ms on localhost (1/1)
16/06/17 10:52:13 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 




scala> var size = textFile.map(line=>line.split(" ").size)
size: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:23


scala> size.collect();
16/06/17 10:53:30 INFO spark.SparkContext: Starting job: collect at <console>:26
16/06/17 10:53:30 INFO scheduler.DAGScheduler: Got job 4 (collect at <console>:26) with 1 output partitions (allowLocal=false)
16/06/17 10:53:30 INFO scheduler.DAGScheduler: Final stage: ResultStage 4(collect at <console>:26)
16/06/17 10:53:30 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/17 10:53:30 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/17 10:53:30 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[4] at map at <console>:23), which has no missing parents
16/06/17 10:53:30 INFO storage.MemoryStore: ensureFreeSpace(3392) called with curMem=124828, maxMem=280248975
16/06/17 10:53:30 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.3 KB, free 267.1 MB)
16/06/17 10:53:30 INFO storage.MemoryStore: ensureFreeSpace(1967) called with curMem=128220, maxMem=280248975
16/06/17 10:53:30 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1967.0 B, free 267.1 MB)
16/06/17 10:53:30 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:42970 (size: 1967.0 B, free: 267.2 MB)
16/06/17 10:53:30 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:874
16/06/17 10:53:30 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[4] at map at <console>:23)
16/06/17 10:53:30 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
16/06/17 10:53:30 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, PROCESS_LOCAL, 1393 bytes)
16/06/17 10:53:30 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
16/06/17 10:53:30 INFO rdd.HadoopRDD: Input split: file:/spark.test.txt:0+331
16/06/17 10:53:30 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 4). 1803 bytes result sent to driver
16/06/17 10:53:30 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 10 ms on localhost (1/1)
16/06/17 10:53:30 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
16/06/17 10:53:30 INFO scheduler.DAGScheduler: ResultStage 4 (collect at <console>:26) finished in 0.009 s
16/06/17 10:53:30 INFO scheduler.DAGScheduler: Job 4 finished: collect at <console>:26, took 0.019113 s
res4: Array[Int] = Array(11, 8, 8, 8, 8, 8, 8)
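
Each element of the result is the number of space-separated tokens in one line: only the first line has 11 words, the other six have 8. The mapper's logic, shown locally on the first line of the file:

    "hello java hello hadoop hello hive hehe spar1 hh jjj kk".split(" ").size   // 11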


scala> size.reduce((a, b)=>if (a > b) a else b)
16/06/17 10:53:53 INFO spark.SparkContext: Starting job: reduce at <console>:26
16/06/17 10:53:53 INFO scheduler.DAGScheduler: Got job 5 (reduce at <console>:26) with 1 output partitions (allowLocal=false)
16/06/17 10:53:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 5(reduce at <console>:26)
16/06/17 10:53:53 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/17 10:53:53 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/17 10:53:53 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[4] at map at <console>:23), which has no missing parents
16/06/17 10:53:53 INFO storage.MemoryStore: ensureFreeSpace(3368) called with curMem=130187, maxMem=280248975
16/06/17 10:53:53 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.3 KB, free 267.1 MB)
16/06/17 10:53:53 INFO storage.MemoryStore: ensureFreeSpace(1944) called with curMem=133555, maxMem=280248975
16/06/17 10:53:53 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1944.0 B, free 267.1 MB)
16/06/17 10:53:53 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:42970 (size: 1944.0 B, free: 267.2 MB)
16/06/17 10:53:53 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:874
16/06/17 10:53:53 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[4] at map at <console>:23)
16/06/17 10:53:53 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
16/06/17 10:53:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, localhost, PROCESS_LOCAL, 1393 bytes)
16/06/17 10:53:53 INFO executor.Executor: Running task 0.0 in stage 5.0 (TID 5)
16/06/17 10:53:53 INFO rdd.HadoopRDD: Input split: file:/spark.test.txt:0+331
16/06/17 10:53:53 INFO executor.Executor: Finished task 0.0 in stage 5.0 (TID 5). 1908 bytes result sent to driver
16/06/17 10:53:53 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 5) in 10 ms on localhost (1/1)
16/06/17 10:53:53 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
16/06/17 10:53:53 INFO scheduler.DAGScheduler: ResultStage 5 (reduce at <console>:26) finished in 0.005 s
16/06/17 10:53:53 INFO scheduler.DAGScheduler: Job 5 finished: reduce at <console>:26, took 0.017716 s
res5: Int = 11
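
The if-else lambda in reduce computes a running maximum, so 11 is the largest per-line word count. Two equivalent spellings, both standard API:

    size.reduce((a, b) => Math.max(a, b))   // same reduce, via java.lang.Math
    size.max()                              // RDD.max(), with the implicit Int ordering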


scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b);
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:23
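
This is the classic word count: flatMap splits lines into words, map pairs each word with 1, and reduceByKey sums the 1s per word. The same data shapes, illustrated with plain Scala collections on a single line (a local illustration, not Spark API):

    Seq("hello java hello")
      .flatMap(_.split(" "))                       // Seq("hello", "java", "hello")
      .map(w => (w, 1))                            // Seq(("hello",1), ("java",1), ("hello",1))
      .groupBy(_._1).mapValues(_.map(_._2).sum)    // Map("hello" -> 2, "java" -> 1)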


scala> 16/06/17 10:54:28 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on localhost:42970 in memory (size: 1866.0 B, free: 267.2 MB)
16/06/17 10:54:28 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on localhost:42970 in memory (size: 1813.0 B, free: 267.2 MB)
16/06/17 10:54:28 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on localhost:42970 in memory (size: 1755.0 B, free: 267.2 MB)
16/06/17 10:54:28 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on localhost:42970 in memory (size: 1967.0 B, free: 267.2 MB)
16/06/17 10:54:28 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on localhost:42970 in memory (size: 1944.0 B, free: 267.2 MB)




scala> wordCounts.collect();
16/06/17 10:55:12 INFO spark.SparkContext: Starting job: collect at <console>:26
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Registering RDD 6 (map at <console>:23)
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Got job 6 (collect at <console>:26) with 1 output partitions (allowLocal=false)
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Final stage: ResultStage 7(collect at <console>:26)
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 6)
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 6)
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 6 (MapPartitionsRDD[6] at map at <console>:23), which has no missing parents
16/06/17 10:55:12 INFO storage.MemoryStore: ensureFreeSpace(4104) called with curMem=110082, maxMem=280248975
16/06/17 10:55:12 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 4.0 KB, free 267.2 MB)
16/06/17 10:55:12 INFO storage.MemoryStore: ensureFreeSpace(2297) called with curMem=114186, maxMem=280248975
16/06/17 10:55:12 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.2 MB)
16/06/17 10:55:12 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:42970 (size: 2.2 KB, free: 267.2 MB)
16/06/17 10:55:12 INFO spark.SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:874
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 6 (MapPartitionsRDD[6] at map at <console>:23)
16/06/17 10:55:12 INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 1 tasks
16/06/17 10:55:12 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 6, localhost, PROCESS_LOCAL, 1382 bytes)
16/06/17 10:55:12 INFO executor.Executor: Running task 0.0 in stage 6.0 (TID 6)
16/06/17 10:55:12 INFO rdd.HadoopRDD: Input split: file:/spark.test.txt:0+331
16/06/17 10:55:12 INFO executor.Executor: Finished task 0.0 in stage 6.0 (TID 6). 2001 bytes result sent to driver
16/06/17 10:55:12 INFO scheduler.DAGScheduler: ShuffleMapStage 6 (map at <console>:23) finished in 0.097 s
16/06/17 10:55:12 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/06/17 10:55:12 INFO scheduler.DAGScheduler: running: Set()
16/06/17 10:55:12 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 7)
16/06/17 10:55:12 INFO scheduler.DAGScheduler: failed: Set()
16/06/17 10:55:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 6.0 (TID 6) in 100 ms on localhost (1/1)
16/06/17 10:55:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool 
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Missing parents for ResultStage 7: List()
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Submitting ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:23), which is now runnable
16/06/17 10:55:12 INFO storage.MemoryStore: ensureFreeSpace(2288) called with curMem=116483, maxMem=280248975
16/06/17 10:55:12 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 2.2 KB, free 267.2 MB)
16/06/17 10:55:12 INFO storage.MemoryStore: ensureFreeSpace(1377) called with curMem=118771, maxMem=280248975
16/06/17 10:55:12 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 1377.0 B, free 267.2 MB)
16/06/17 10:55:12 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:42970 (size: 1377.0 B, free: 267.2 MB)
16/06/17 10:55:12 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:874
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:23)
16/06/17 10:55:12 INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
16/06/17 10:55:12 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 7.0 (TID 7, localhost, PROCESS_LOCAL, 1165 bytes)
16/06/17 10:55:12 INFO executor.Executor: Running task 0.0 in stage 7.0 (TID 7)
16/06/17 10:55:12 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/06/17 10:55:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms
16/06/17 10:55:12 INFO executor.Executor: Finished task 0.0 in stage 7.0 (TID 7). 1248 bytes result sent to driver
16/06/17 10:55:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 7.0 (TID 7) in 94 ms on localhost (1/1)
16/06/17 10:55:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool 
16/06/17 10:55:12 INFO scheduler.DAGScheduler: ResultStage 7 (collect at <console>:26) finished in 0.087 s
16/06/17 10:55:12 INFO scheduler.DAGScheduler: Job 6 finished: collect at <console>:26, took 0.541458 s
res6: Array[(String, Int)] = Array((spark,4), (hive,7), (hadoop,7), (hehe,7), (hh,1), (spar1,1), (kk,1), (hello,21), (java,7), (spark2,1), (jjj,1), (spk,1))
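
reduceByKey introduces a shuffle, which is why this job runs in two stages above (ShuffleMapStage 6 feeding ResultStage 7). To present the result ordered by frequency, a minimal follow-up sketch using the standard sortBy API:

    wordCounts.sortBy(_._2, ascending = false).collect()   // most frequent words first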


scala> :quit
Stopping spark context.
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/06/17 10:56:32 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/06/17 10:56:32 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.0.103:4040
16/06/17 10:56:32 INFO scheduler.DAGScheduler: Stopping DAGScheduler
16/06/17 10:56:32 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/06/17 10:56:32 INFO util.Utils: path = /tmp/spark-4e24ee66-6db3-41f1-9fe5-12da91e71fc3/blockmgr-36b2f8b4-bb92-4759-9a4e-90ae9c128577, already present as root for deletion.
16/06/17 10:56:32 INFO storage.MemoryStore: MemoryStore cleared
16/06/17 10:56:32 INFO storage.BlockManager: BlockManager stopped
16/06/17 10:56:32 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/06/17 10:56:32 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/06/17 10:56:32 INFO spark.SparkContext: Successfully stopped SparkContext
16/06/17 10:56:32 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/06/17 10:56:32 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/06/17 10:56:32 INFO util.Utils: Shutdown hook called
16/06/17 10:56:32 INFO util.Utils: Deleting directory /tmp/spark-3e11f3a4-cff9-45f3-8bb8-c98195c49bad
16/06/17 10:56:33 INFO util.Utils: Deleting directory /tmp/spark-b630d853-78dd-4f24-bdc2-683717e162d6
16/06/17 10:56:33 INFO util.Utils: Deleting directory /tmp/spark-4e24ee66-6db3-41f1-9fe5-12da91e71fc3
16/06/17 10:56:33 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
[root@cdh1 hadoop]# cat /spark.test.txt
hello java hello hadoop hello hive hehe spar1 hh jjj kk
hello java hello hadoop hello hive hehe spark2
hello java hello hadoop hello hive hehe spk
hello java hello hadoop hello hive hehe spark
hello java hello hadoop hello hive hehe spark
hello java hello hadoop hello hive hehe spark
hello java hello hadoop hello hive hehe spark
[root@cdh1 hadoop]# 
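
The file contents make the earlier results easy to verify by hand: 7 lines (count = 7); "hello" appears 3 times on each of the 7 lines, hence (hello,21); the bare word "spark" appears on 4 lines, hence (spark,4); and 5 lines contain the substring "spark" (those 4 plus the "spark2" line), matching the filtered count of 5. A hypothetical local cross-check in plain Scala (no Spark), assuming it runs on the same box:

    val words = scala.io.Source.fromFile("/spark.test.txt").getLines().flatMap(_.split(" ")).toList
    println(words.count(_ == "hello"))   // 21, matching (hello,21)
    println(words.count(_ == "spark"))   // 4,  matching (spark,4)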