Spark SQL example


[root@cdh3 data]# spark-shell
16/07/03 04:24:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/03 04:24:21 INFO spark.SecurityManager: Changing view acls to: root
16/07/03 04:24:21 INFO spark.SecurityManager: Changing modify acls to: root
16/07/03 04:24:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/07/03 04:24:22 INFO spark.HttpServer: Starting HTTP Server
16/07/03 04:24:22 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/07/03 04:24:22 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52511
16/07/03 04:24:22 INFO util.Utils: Successfully started service 'HTTP class server' on port 52511.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/


Using Scala version 2.10.4 (Java HotSpot(TM) Client VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
16/07/03 04:24:28 WARN util.Utils: Your hostname, cdh3 resolves to a loopback address: 127.0.0.1; using 192.168.48.6 instead (on interface eth0)
16/07/03 04:24:28 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/07/03 04:24:28 INFO spark.SparkContext: Running Spark version 1.4.0
16/07/03 04:24:28 INFO spark.SecurityManager: Changing view acls to: root
16/07/03 04:24:28 INFO spark.SecurityManager: Changing modify acls to: root
16/07/03 04:24:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/07/03 04:24:29 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/07/03 04:24:29 INFO Remoting: Starting remoting
16/07/03 04:24:29 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.48.6:37334]
16/07/03 04:24:29 INFO util.Utils: Successfully started service 'sparkDriver' on port 37334.
16/07/03 04:24:30 INFO spark.SparkEnv: Registering MapOutputTracker
16/07/03 04:24:30 INFO spark.SparkEnv: Registering BlockManagerMaster
16/07/03 04:24:30 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-67f26daa-dc42-445e-9121-8bf6473794b0/blockmgr-ee5ddc3f-a239-4017-93d0-6d01775a6454
16/07/03 04:24:30 INFO storage.MemoryStore: MemoryStore started with capacity 267.3 MB
16/07/03 04:24:30 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-67f26daa-dc42-445e-9121-8bf6473794b0/httpd-8eb5b3d0-1c0a-4d06-96f9-afeb46ba6f42
16/07/03 04:24:30 INFO spark.HttpServer: Starting HTTP Server
16/07/03 04:24:30 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/07/03 04:24:30 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:47390
16/07/03 04:24:30 INFO util.Utils: Successfully started service 'HTTP file server' on port 47390.
16/07/03 04:24:30 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/07/03 04:24:30 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/07/03 04:24:30 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/07/03 04:24:30 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/07/03 04:24:30 INFO ui.SparkUI: Started SparkUI at http://192.168.48.6:4040
16/07/03 04:24:30 INFO executor.Executor: Starting executor ID driver on host localhost
16/07/03 04:24:30 INFO executor.Executor: Using REPL class URI: http://192.168.48.6:52511
16/07/03 04:24:31 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37284.
16/07/03 04:24:31 INFO netty.NettyBlockTransferService: Server created on 37284
16/07/03 04:24:31 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/07/03 04:24:31 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:37284 with 267.3 MB RAM, BlockManagerId(driver, localhost, 37284)
16/07/03 04:24:31 INFO storage.BlockManagerMaster: Registered BlockManager
16/07/03 04:24:31 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
16/07/03 04:24:32 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
16/07/03 04:24:34 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/07/03 04:24:34 INFO metastore.ObjectStore: ObjectStore, initialize called
16/07/03 04:24:35 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/07/03 04:24:35 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/07/03 04:24:36 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/07/03 04:24:37 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/07/03 04:24:40 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/07/03 04:24:41 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5.  Encountered: "@" (64), after : "".
16/07/03 04:24:42 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/07/03 04:24:42 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/07/03 04:24:43 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/07/03 04:24:43 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/07/03 04:24:44 INFO metastore.ObjectStore: Initialized ObjectStore
16/07/03 04:24:44 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa
16/07/03 04:24:46 INFO metastore.HiveMetaStore: Added admin role in metastore
16/07/03 04:24:46 INFO metastore.HiveMetaStore: Added public role in metastore
16/07/03 04:24:46 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
16/07/03 04:24:47 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
16/07/03 04:24:47 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.


scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1ccc018


scala> val rddCustomers = sc.textFile("data/customers.txt");
16/07/03 04:26:17 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
16/07/03 04:26:17 INFO storage.MemoryStore: ensureFreeSpace(85352) called with curMem=0, maxMem=280248975
16/07/03 04:26:17 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 83.4 KB, free 267.2 MB)
16/07/03 04:26:18 INFO storage.MemoryStore: ensureFreeSpace(19999) called with curMem=85352, maxMem=280248975
16/07/03 04:26:18 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.5 KB, free 267.2 MB)
16/07/03 04:26:18 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37284 (size: 19.5 KB, free: 267.2 MB)
16/07/03 04:26:18 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:21
rddCustomers: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21


scala> val schemaString = "customer_id name city state zip_code";
schemaString: String = customer_id name city state zip_code


scala> import org.apache.spark.sql._
import org.apache.spark.sql._


scala> 


scala> import org.apache.spark.sql.types._;
import org.apache.spark.sql.types._


scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)));
schema: org.apache.spark.sql.types.StructType = StructType(StructField(customer_id,StringType,true), StructField(name,StringType,true), StructField(city,StringType,true), StructField(state,StringType,true), StructField(zip_code,StringType,true))
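Every field here is built as a nullable `StringType`, which is why `zip_code` sorts lexicographically later in the session. If a column should be numeric, the same pattern works with a different `DataType`. A hypothetical variation (not part of the session above, which keeps everything as strings):

```scala
import org.apache.spark.sql.types._

// Same programmatic-schema pattern, but with zip_code as an integer.
// The Row-building step would then need p(4).trim.toInt to match.
val typedSchema = StructType(Seq(
  StructField("customer_id", StringType,  true),
  StructField("name",        StringType,  true),
  StructField("city",        StringType,  true),
  StructField("state",       StringType,  true),
  StructField("zip_code",    IntegerType, true)))
```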


scala> val rowRDD = rddCustomers.map(_.split(",")).map(p => Row(p(0).trim,p(1),p(2),p(3),p(4)));
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:29


scala> val dfCustomers = sqlContext.createDataFrame(rowRDD, schema);
dfCustomers: org.apache.spark.sql.DataFrame = [customer_id: string, name: string, city: string, state: string, zip_code: string]


scala> dfCustomers.registerTempTable("customers");


scala> val custNames = sqlContext.sql("SELECT name FROM customers");
custNames: org.apache.spark.sql.DataFrame = [name: string]


scala> custNames.show();
16/07/03 04:28:26 INFO mapred.FileInputFormat: Total input paths to process : 1
16/07/03 04:28:27 INFO spark.SparkContext: Starting job: show at <console>:32
16/07/03 04:28:27 INFO scheduler.DAGScheduler: Got job 0 (show at <console>:32) with 1 output partitions (allowLocal=false)
16/07/03 04:28:27 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(show at <console>:32)
16/07/03 04:28:27 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/03 04:28:27 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/03 04:28:27 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at show at <console>:32), which has no missing parents
16/07/03 04:28:27 INFO storage.MemoryStore: ensureFreeSpace(5944) called with curMem=105351, maxMem=280248975
16/07/03 04:28:27 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.8 KB, free 267.2 MB)
16/07/03 04:28:27 INFO storage.MemoryStore: ensureFreeSpace(3007) called with curMem=111295, maxMem=280248975
16/07/03 04:28:27 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.9 KB, free 267.2 MB)
16/07/03 04:28:27 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:37284 (size: 2.9 KB, free: 267.2 MB)
16/07/03 04:28:27 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
16/07/03 04:28:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at show at <console>:32)
16/07/03 04:28:27 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/07/03 04:28:28 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1418 bytes)
16/07/03 04:28:28 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/03 04:28:28 INFO rdd.HadoopRDD: Input split: hdfs://cdh3:9000/user/root/data/customers.txt:0+185
16/07/03 04:28:28 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/07/03 04:28:28 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/07/03 04:28:28 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/07/03 04:28:28 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/07/03 04:28:28 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/07/03 04:28:28 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2220 bytes result sent to driver
16/07/03 04:28:28 INFO scheduler.DAGScheduler: ResultStage 0 (show at <console>:32) finished in 0.515 s
16/07/03 04:28:28 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 452 ms on localhost (1/1)
16/07/03 04:28:28 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/07/03 04:28:28 INFO scheduler.DAGScheduler: Job 0 finished: show at <console>:32, took 0.749336 s
+---------------+
|           name|
+---------------+
|     John Smith|
|    Joe Johnson|
|      Bob Jones|
|     Andy Davis|
| James Williams|
+---------------+




scala> custNames.map(t => "Name: " + t(0)).collect().foreach(println);
16/07/03 04:28:45 INFO spark.SparkContext: Starting job: collect at <console>:32
16/07/03 04:28:45 INFO scheduler.DAGScheduler: Got job 1 (collect at <console>:32) with 1 output partitions (allowLocal=false)
16/07/03 04:28:45 INFO scheduler.DAGScheduler: Final stage: ResultStage 1(collect at <console>:32)
16/07/03 04:28:45 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/03 04:28:45 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/03 04:28:45 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[9] at map at <console>:32), which has no missing parents
16/07/03 04:28:45 INFO storage.MemoryStore: ensureFreeSpace(6288) called with curMem=114302, maxMem=280248975
16/07/03 04:28:45 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 6.1 KB, free 267.2 MB)
16/07/03 04:28:45 INFO storage.MemoryStore: ensureFreeSpace(3128) called with curMem=120590, maxMem=280248975
16/07/03 04:28:45 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.1 KB, free 267.1 MB)
16/07/03 04:28:45 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:37284 (size: 3.1 KB, free: 267.2 MB)
16/07/03 04:28:45 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
16/07/03 04:28:45 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[9] at map at <console>:32)
16/07/03 04:28:45 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/07/03 04:28:45 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1418 bytes)
16/07/03 04:28:45 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
16/07/03 04:28:45 INFO rdd.HadoopRDD: Input split: hdfs://cdh3:9000/user/root/data/customers.txt:0+185
16/07/03 04:28:45 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 1896 bytes result sent to driver
16/07/03 04:28:45 INFO scheduler.DAGScheduler: ResultStage 1 (collect at <console>:32) finished in 0.026 s
16/07/03 04:28:45 INFO scheduler.DAGScheduler: Job 1 finished: collect at <console>:32, took 0.057442 s
16/07/03 04:28:45 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 31 ms on localhost (1/1)
16/07/03 04:28:45 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
Name:  John Smith
Name:  Joe Johnson
Name:  Bob Jones
Name:  Andy Davis
Name:  James Williams


scala> val customersByCity = sqlContext.sql("SELECT name,zip_code FROM customers ORDER BY zip_code").show();
16/07/03 04:29:18 INFO spark.SparkContext: Starting job: show at <console>:29
16/07/03 04:29:18 INFO scheduler.DAGScheduler: Got job 2 (show at <console>:29) with 1 output partitions (allowLocal=false)
16/07/03 04:29:18 INFO scheduler.DAGScheduler: Final stage: ResultStage 2(show at <console>:29)
16/07/03 04:29:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/03 04:29:18 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/03 04:29:18 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[12] at show at <console>:29), which has no missing parents
16/07/03 04:29:18 INFO storage.MemoryStore: ensureFreeSpace(7040) called with curMem=123718, maxMem=280248975
16/07/03 04:29:18 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 6.9 KB, free 267.1 MB)
16/07/03 04:29:18 INFO storage.MemoryStore: ensureFreeSpace(3495) called with curMem=130758, maxMem=280248975
16/07/03 04:29:18 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 3.4 KB, free 267.1 MB)
16/07/03 04:29:18 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:37284 (size: 3.4 KB, free: 267.2 MB)
16/07/03 04:29:18 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:874
16/07/03 04:29:18 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[12] at show at <console>:29)
16/07/03 04:29:18 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
16/07/03 04:29:18 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1418 bytes)
16/07/03 04:29:18 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
16/07/03 04:29:18 INFO rdd.HadoopRDD: Input split: hdfs://cdh3:9000/user/root/data/customers.txt:0+185
16/07/03 04:29:18 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 3464 bytes result sent to driver
16/07/03 04:29:18 INFO scheduler.DAGScheduler: ResultStage 2 (show at <console>:29) finished in 0.069 s
16/07/03 04:29:18 INFO scheduler.DAGScheduler: Job 2 finished: show at <console>:29, took 0.100239 s
16/07/03 04:29:18 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 74 ms on localhost (1/1)
16/07/03 04:29:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
+---------------+--------+
|           name|zip_code|
+---------------+--------+
|    Joe Johnson|   75201|
|      Bob Jones|   77028|
|     Andy Davis|   78227|
|     John Smith|   78727|
| James Williams|   78727|
+---------------+--------+


customersByCity: Unit = ()


scala> customersByCity.map(t => t(0) + "," + t(1)).collect().foreach(println);
<console>:32: error: value map is not a member of Unit
              customersByCity.map(t => t(0) + "," + t(1)).collect().foreach(println);
                              ^
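The error comes from the earlier assignment: `show()` is an action that prints the table and returns `Unit`, so `customersByCity` never held a DataFrame. The fix is to keep the DataFrame and call `show()` separately, as the session does next. The pattern, as a sketch (assumes the running spark-shell session above):

```scala
// show() prints rows for inspection but returns Unit, so don't assign its result.
val customersByCity = sqlContext.sql(
  "SELECT name, zip_code FROM customers ORDER BY zip_code")
customersByCity.show()  // side effect only
customersByCity.map(t => t(0) + "," + t(1)).collect().foreach(println)
```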


scala> val customersByCity = sqlContext.sql("SELECT name,zip_code FROM customers ORDER BY zip_code");
customersByCity: org.apache.spark.sql.DataFrame = [name: string, zip_code: string]


scala> customersByCity.map(t => t(0) + "," + t(1)).collect().foreach(println);
16/07/03 04:30:19 INFO execution.Exchange: Using SparkSqlSerializer2.
16/07/03 04:30:19 INFO spark.SparkContext: Starting job: map at <console>:32
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Got job 3 (map at <console>:32) with 1 output partitions (allowLocal=false)
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Final stage: ResultStage 3(map at <console>:32)
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[16] at map at <console>:32), which has no missing parents
16/07/03 04:30:19 INFO storage.MemoryStore: ensureFreeSpace(6696) called with curMem=134253, maxMem=280248975
16/07/03 04:30:19 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 6.5 KB, free 267.1 MB)
16/07/03 04:30:19 INFO storage.MemoryStore: ensureFreeSpace(3331) called with curMem=140949, maxMem=280248975
16/07/03 04:30:19 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 3.3 KB, free 267.1 MB)
16/07/03 04:30:19 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:37284 (size: 3.3 KB, free: 267.2 MB)
16/07/03 04:30:19 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:874
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[16] at map at <console>:32)
16/07/03 04:30:19 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
16/07/03 04:30:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, PROCESS_LOCAL, 1418 bytes)
16/07/03 04:30:19 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
16/07/03 04:30:19 INFO rdd.HadoopRDD: Input split: hdfs://cdh3:9000/user/root/data/customers.txt:0+185
16/07/03 04:30:19 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 2528 bytes result sent to driver
16/07/03 04:30:19 INFO scheduler.DAGScheduler: ResultStage 3 (map at <console>:32) finished in 0.030 s
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Job 3 finished: map at <console>:32, took 0.049610 s
16/07/03 04:30:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 30 ms on localhost (1/1)
16/07/03 04:30:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
16/07/03 04:30:19 INFO spark.SparkContext: Starting job: collect at <console>:32
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Registering RDD 17 (map at <console>:32)
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Got job 4 (collect at <console>:32) with 5 output partitions (allowLocal=false)
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Final stage: ResultStage 5(collect at <console>:32)
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 4)
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 4 (MapPartitionsRDD[17] at map at <console>:32), which has no missing parents
16/07/03 04:30:19 INFO storage.MemoryStore: ensureFreeSpace(8128) called with curMem=144280, maxMem=280248975
16/07/03 04:30:19 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 7.9 KB, free 267.1 MB)
16/07/03 04:30:19 INFO storage.MemoryStore: ensureFreeSpace(4089) called with curMem=152408, maxMem=280248975
16/07/03 04:30:19 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.0 KB, free 267.1 MB)
16/07/03 04:30:19 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:37284 (size: 4.0 KB, free: 267.2 MB)
16/07/03 04:30:19 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:874
16/07/03 04:30:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 4 (MapPartitionsRDD[17] at map at <console>:32)
16/07/03 04:30:19 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
16/07/03 04:30:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, PROCESS_LOCAL, 1407 bytes)
16/07/03 04:30:19 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
16/07/03 04:30:19 INFO rdd.HadoopRDD: Input split: hdfs://cdh3:9000/user/root/data/customers.txt:0+185
16/07/03 04:30:19 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:37284 in memory (size: 2.9 KB, free: 267.2 MB)
16/07/03 04:30:19 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on localhost:37284 in memory (size: 3.1 KB, free: 267.2 MB)
16/07/03 04:30:19 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on localhost:37284 in memory (size: 3.4 KB, free: 267.2 MB)
16/07/03 04:30:19 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on localhost:37284 in memory (size: 3.3 KB, free: 267.2 MB)
16/07/03 04:30:19 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 4). 2005 bytes result sent to driver
16/07/03 04:30:19 INFO scheduler.DAGScheduler: ShuffleMapStage 4 (map at <console>:32) finished in 0.289 s
16/07/03 04:30:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 293 ms on localhost (1/1)
16/07/03 04:30:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
16/07/03 04:30:20 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/07/03 04:30:20 INFO scheduler.DAGScheduler: running: Set()
16/07/03 04:30:20 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 5)
16/07/03 04:30:20 INFO scheduler.DAGScheduler: failed: Set()
16/07/03 04:30:20 INFO scheduler.DAGScheduler: Missing parents for ResultStage 5: List()
16/07/03 04:30:20 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[22] at map at <console>:32), which is now runnable
16/07/03 04:30:20 INFO storage.MemoryStore: ensureFreeSpace(9712) called with curMem=117568, maxMem=280248975
16/07/03 04:30:20 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 9.5 KB, free 267.1 MB)
16/07/03 04:30:20 INFO storage.MemoryStore: ensureFreeSpace(4645) called with curMem=127280, maxMem=280248975
16/07/03 04:30:20 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 4.5 KB, free 267.1 MB)
16/07/03 04:30:20 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:37284 (size: 4.5 KB, free: 267.2 MB)
16/07/03 04:30:20 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:874
16/07/03 04:30:20 INFO scheduler.DAGScheduler: Submitting 5 missing tasks from ResultStage 5 (MapPartitionsRDD[22] at map at <console>:32)
16/07/03 04:30:20 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 5 tasks
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/03 04:30:20 INFO executor.Executor: Running task 0.0 in stage 5.0 (TID 5)
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
16/07/03 04:30:20 INFO executor.Executor: Finished task 0.0 in stage 5.0 (TID 5). 908 bytes result sent to driver
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 5.0 (TID 6, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 5) in 218 ms on localhost (1/5)
16/07/03 04:30:20 INFO executor.Executor: Running task 1.0 in stage 5.0 (TID 6)
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/07/03 04:30:20 INFO executor.Executor: Finished task 1.0 in stage 5.0 (TID 6). 906 bytes result sent to driver
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 5.0 (TID 7, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 5.0 (TID 6) in 21 ms on localhost (2/5)
16/07/03 04:30:20 INFO executor.Executor: Running task 2.0 in stage 5.0 (TID 7)
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/07/03 04:30:20 INFO executor.Executor: Finished task 2.0 in stage 5.0 (TID 7). 907 bytes result sent to driver
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 5.0 (TID 8, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 5.0 (TID 7) in 15 ms on localhost (3/5)
16/07/03 04:30:20 INFO executor.Executor: Running task 3.0 in stage 5.0 (TID 8)
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/07/03 04:30:20 INFO executor.Executor: Finished task 3.0 in stage 5.0 (TID 8). 932 bytes result sent to driver
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 5.0 (TID 9, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 5.0 (TID 8) in 12 ms on localhost (4/5)
16/07/03 04:30:20 INFO executor.Executor: Running task 4.0 in stage 5.0 (TID 9)
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/03 04:30:20 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/07/03 04:30:20 INFO executor.Executor: Finished task 4.0 in stage 5.0 (TID 9). 886 bytes result sent to driver
16/07/03 04:30:20 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 5.0 (TID 9) in 7 ms on localhost (5/5)
16/07/03 04:30:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
16/07/03 04:30:20 INFO scheduler.DAGScheduler: ResultStage 5 (collect at <console>:32) finished in 0.259 s
16/07/03 04:30:20 INFO scheduler.DAGScheduler: Job 4 finished: collect at <console>:32, took 0.641571 s
 Joe Johnson, 75201
 Bob Jones, 77028
 Andy Davis, 78227
 John Smith, 78727
 James Williams, 78727


scala> 
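The whole session condenses into the following sketch, assuming a spark-shell (Spark 1.4, where `sc` is predefined) and `data/customers.txt` present as comma-separated lines in HDFS:

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val sqlContext = new SQLContext(sc)

// Build a schema programmatically: one nullable StringType field per name.
val schemaString = "customer_id name city state zip_code"
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Parse each CSV line into a Row whose arity matches the schema.
val rowRDD = sc.textFile("data/customers.txt")
  .map(_.split(","))
  .map(p => Row(p(0).trim, p(1), p(2), p(3), p(4)))

// Apply the schema and register the result for SQL queries.
val dfCustomers = sqlContext.createDataFrame(rowRDD, schema)
dfCustomers.registerTempTable("customers")

sqlContext.sql("SELECT name FROM customers").show()
sqlContext.sql("SELECT name, zip_code FROM customers ORDER BY zip_code")
  .map(t => t(0) + "," + t(1)).collect().foreach(println)
```

Note that `registerTempTable` and the RDD-style `map` on a DataFrame are the Spark 1.x API; later Spark versions replace them with `createOrReplaceTempView` and Dataset operations.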