Environment:
Local: Windows 7 + JDK 1.8 + IntelliJ IDEA 2018.1.2 + Maven 3.3.9 + the Scala plugin; the machine needs internet access (to download the various dependencies)
Remote: CentOS 7.3 + JDK 1.8 + scala-2.11.12 + hadoop-2.6.0-cdh5.7.0 + hive-1.1.0-cdh5.7.0-bin + spark-2.2.0-bin-2.6.0-cdh5.7.0
1. Create a Maven + Scala project in IDEA
Click Finish and wait for the project initialization to complete.
2. Edit the pom.xml configuration file
2.1 Edit the <properties> tag
Configure the Scala, Spark, and Hadoop versions:
<properties>
  <scala.version>2.11.8</scala.version>
  <spark.version>2.2.0</spark.version>
  <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
</properties>
Here, scala.version is the Scala version we plan to use; it must be one that the chosen Spark version supports.
spark.version is the version Spark was built as; on the production server it can be read from the install directory name shown by echo $SPARK_HOME.
hadoop.version is the version Hadoop was built as; on the production server it can be read from the directory name shown by echo $HADOOP_HOME.
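As a quick sanity check (a minimal sketch, not part of the original project), the Scala binary version of a runtime can be derived from its full version string; it must match the _2.11 suffix of the spark-core artifact used below:

```scala
object VersionCheck {
  // The Scala "binary version" is the first two components, e.g. "2.11" for
  // 2.11.8; spark-core_2.11 requires a 2.11.x Scala at runtime.
  def binaryVersion(full: String): String =
    full.split("\\.").take(2).mkString(".")

  def main(args: Array[String]): Unit =
    // scala.util.Properties.versionNumberString is e.g. "2.11.8"
    println(binaryVersion(scala.util.Properties.versionNumberString))
}
```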
2.2 Edit the <dependencies> tag
Add the Spark and Hadoop dependencies:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>
The coordinates for the Spark and Hadoop dependencies can be found in the official Spark documentation; the version numbers were already defined in the <properties> tag above.
2.3 Edit the <repositories> tag
Because the Hadoop we use is a CDH build, the CDH repository must be added:
<repositories>
  <repository>
    <id>cloudera</id>
    <name>cloudera</name>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
  </repository>
</repositories>
Here, id and name are arbitrary labels; url must be the correct CDH repository address.
After these changes, Maven resolves the dependencies and downloads the jars and their sources. If IDEA does not resolve and download them automatically, open pom.xml and choose right-click - Maven - Reimport. If red squiggly errors remain, run Maven Projects - Spark - Lifecycle - clean - Run Maven Build on the right-hand side.
Once the downloads finish, the Maven + Scala project is complete, with the Spark and Hadoop dependencies in place. Now we can start coding.
3. Delete the redundant files in the project
Delete the files marked with the red boxes (in the original screenshots).
4. Spark Hello World programming
Create a new package.
Right-click the new package and create a new Scala class.
Write the program.
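The class written at this step is not shown above; the following sketch reconstructs it from the modified version in section 6 (whose commented-out wc.collect().foreach(println) line was the original output step):

```scala
package com.bigdata.spark.core

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val textFile = sc.textFile(args(0)) // argument: the HDFS input path
    val wc = textFile.flatMap(line => line.split("\r"))
      .map((_, 1)).reduceByKey(_ + _)
    wc.collect().foreach(println) // print (word, count) pairs to the driver's stdout
    sc.stop()
  }
}
```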
Package it: Maven Projects - Spark - Lifecycle - package, right-click - Run Maven Build. Some artifacts are downloaded from the network during packaging.
When packaging finishes, a .jar file is produced in the corresponding output folder.
5. Upload the generated jar to the server and submit the job
5.1 Upload the jar to the server
[hadoop@hadoop01 lib]$ pwd
/home/hadoop/lib
[hadoop@hadoop01 lib]$ ls -ltr
total 12
-rw-r--r--. 1 hadoop hadoop 3086 Jun 28 21:12 hive-1.0-SNAPSHOT.jar
-rw-r--r--. 1 hadoop hadoop 7859 Jul 16 20:41 spark-1.0.jar
[hadoop@hadoop01 lib]$
5.2 Prepare the data file on HDFS
The data file is already on the server's HDFS:
[hadoop@hadoop01 data]$ hadoop fs -ls /tmp/data/
Found 1 items
-rw-r--r-- 1 hadoop supergroup 36 2018-07-16 21:21 /tmp/data/input3.txt
[hadoop@hadoop01 data]$ hadoop fs -cat /tmp/data/input3.txt
hello
world
hello
scala
hello
scala
[hadoop@hadoop01 data]$
5.3 Submit the job
See the official documentation for how to submit applications.
It shows that jobs can be submitted via spark-submit. Copy the spark-submit command from the docs and fill in the arguments:
./bin/spark-submit \
--class com.bigdata.spark.core.WordCountApp \ # fully qualified class name
--master local[2] \
/home/hadoop/lib/spark-1.0.jar \ # path of the jar just uploaded to the server
hdfs://hadoop01:9000/tmp/data # application argument: the HDFS data directory; hadoop01 may also be the host's IP address
Submit the job on the server:
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$ ./bin/spark-submit \
> --class com.bigdata.spark.core.WordCountApp \
> --master local[2] \
> /home/hadoop/lib/spark-1.0.jar \
> hdfs://hadoop01:9000/tmp/data
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/07/16 21:22:12 INFO SparkContext: Running Spark version 2.2.0
18/07/16 21:22:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/07/16 21:22:13 INFO SparkContext: Submitted application: com.bigdata.spark.core.WordCountApp
18/07/16 21:22:13 INFO SecurityManager: Changing view acls to: hadoop
18/07/16 21:22:13 INFO SecurityManager: Changing modify acls to: hadoop
18/07/16 21:22:13 INFO SecurityManager: Changing view acls groups to:
18/07/16 21:22:13 INFO SecurityManager: Changing modify acls groups to:
18/07/16 21:22:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/07/16 21:22:13 INFO Utils: Successfully started service 'sparkDriver' on port 42063.
18/07/16 21:22:13 INFO SparkEnv: Registering MapOutputTracker
18/07/16 21:22:13 INFO SparkEnv: Registering BlockManagerMaster
18/07/16 21:22:13 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/07/16 21:22:13 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/07/16 21:22:13 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e4468706-de65-4d02-9c93-8ade4f0466cd
18/07/16 21:22:13 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
18/07/16 21:22:14 INFO SparkEnv: Registering OutputCommitCoordinator
18/07/16 21:22:14 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/07/16 21:22:14 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.132.37.38:4040
18/07/16 21:22:14 INFO SparkContext: Added JAR file:/home/hadoop/lib/spark-1.0.jar at spark://10.132.37.38:42063/jars/spark-1.0.jar with timestamp 1531747334342
18/07/16 21:22:14 INFO Executor: Starting executor ID driver on host localhost
18/07/16 21:22:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40669.
18/07/16 21:22:14 INFO NettyBlockTransferService: Server created on 10.132.37.38:40669
18/07/16 21:22:14 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/07/16 21:22:14 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.132.37.38, 40669, None)
18/07/16 21:22:14 INFO BlockManagerMasterEndpoint: Registering block manager 10.132.37.38:40669 with 366.3 MB RAM, BlockManagerId(driver, 10.132.37.38, 40669, None)
18/07/16 21:22:14 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.132.37.38, 40669, None)
18/07/16 21:22:14 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.132.37.38, 40669, None)
18/07/16 21:22:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 222.5 KB, free 366.1 MB)
18/07/16 21:22:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.1 KB, free 366.1 MB)
18/07/16 21:22:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.132.37.38:40669 (size: 21.1 KB, free: 366.3 MB)
18/07/16 21:22:15 INFO SparkContext: Created broadcast 0 from textFile at WordCountApp.scala:10
18/07/16 21:22:16 INFO FileInputFormat: Total input paths to process : 1
18/07/16 21:22:16 INFO SparkContext: Starting job: collect at WordCountApp.scala:14
18/07/16 21:22:16 INFO DAGScheduler: Registering RDD 3 (map at WordCountApp.scala:12)
18/07/16 21:22:16 INFO DAGScheduler: Got job 0 (collect at WordCountApp.scala:14) with 2 output partitions
18/07/16 21:22:16 INFO DAGScheduler: Final stage: ResultStage 1 (collect at WordCountApp.scala:14)
18/07/16 21:22:16 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/07/16 21:22:16 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/07/16 21:22:16 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountApp.scala:12), which has no missing parents
18/07/16 21:22:16 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 366.1 MB)
18/07/16 21:22:16 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.8 KB, free 366.1 MB)
18/07/16 21:22:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.132.37.38:40669 (size: 2.8 KB, free: 366.3 MB)
18/07/16 21:22:16 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
18/07/16 21:22:16 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountApp.scala:12) (first 15 tasks are for partitions Vector(0, 1))
18/07/16 21:22:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/07/16 21:22:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 4848 bytes)
18/07/16 21:22:16 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, ANY, 4848 bytes)
18/07/16 21:22:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/07/16 21:22:16 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/07/16 21:22:16 INFO Executor: Fetching spark://10.132.37.38:42063/jars/spark-1.0.jar with timestamp 1531747334342
18/07/16 21:22:16 INFO TransportClientFactory: Successfully created connection to /10.132.37.38:42063 after 24 ms (0 ms spent in bootstraps)
18/07/16 21:22:16 INFO Utils: Fetching spark://10.132.37.38:42063/jars/spark-1.0.jar to /tmp/spark-2188b919-286c-4b29-a1d7-7a68e64ileTemp4710383261656161395.tmp
18/07/16 21:22:16 INFO Executor: Adding file:/tmp/spark-2188b919-286c-4b29-a1d7-7a68e648ba9b/userFiles-39fcbe1f-2327-4dd9-9de7-992
18/07/16 21:22:16 INFO HadoopRDD: Input split: hdfs://hadoop01:9000/tmp/data/input3.txt:18+18
18/07/16 21:22:16 INFO HadoopRDD: Input split: hdfs://hadoop01:9000/tmp/data/input3.txt:0+18
18/07/16 21:22:16 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1197 bytes result sent to driver
18/07/16 21:22:16 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1197 bytes result sent to driver
18/07/16 21:22:16 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 345 ms on localhost (executor driver) (1/2)
18/07/16 21:22:16 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 368 ms on localhost (executor driver) (2/2)
18/07/16 21:22:16 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/07/16 21:22:16 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCountApp.scala:12) finished in 0.390 s
18/07/16 21:22:16 INFO DAGScheduler: looking for newly runnable stages
18/07/16 21:22:16 INFO DAGScheduler: running: Set()
18/07/16 21:22:16 INFO DAGScheduler: waiting: Set(ResultStage 1)
18/07/16 21:22:16 INFO DAGScheduler: failed: Set()
18/07/16 21:22:16 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCountApp.scala:12), which has no missing parents
18/07/16 21:22:16 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.2 KB, free 366.1 MB)
18/07/16 21:22:16 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1986.0 B, free 366.0 MB)
18/07/16 21:22:16 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.132.37.38:40669 (size: 1986.0 B, free: 366.3 MB)
18/07/16 21:22:16 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
18/07/16 21:22:16 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCountApp.scala:12) (first 15 tasks are for partitions Vector(0, 1))
18/07/16 21:22:16 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/07/16 21:22:16 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, localhost, executor driver, partition 1, PROCESS_LOCAL, 4621 bytes)
18/07/16 21:22:16 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, localhost, executor driver, partition 0, ANY, 4621 bytes)
18/07/16 21:22:16 INFO Executor: Running task 0.0 in stage 1.0 (TID 3)
18/07/16 21:22:16 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
18/07/16 21:22:16 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/07/16 21:22:16 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 2 blocks
18/07/16 21:22:16 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
18/07/16 21:22:16 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
18/07/16 21:22:16 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 1091 bytes result sent to driver
18/07/16 21:22:16 INFO Executor: Finished task 0.0 in stage 1.0 (TID 3). 1289 bytes result sent to driver
18/07/16 21:22:16 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/2)
18/07/16 21:22:16 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 45 ms on localhost (executor driver) (2/2)
18/07/16 21:22:16 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/07/16 21:22:16 INFO DAGScheduler: ResultStage 1 (collect at WordCountApp.scala:14) finished in 0.049 s
18/07/16 21:22:16 INFO DAGScheduler: Job 0 finished: collect at WordCountApp.scala:14, took 0.751515 s
(scala,2)
(hello,3)
(world,1)
18/07/16 21:22:16 INFO SparkUI: Stopped Spark web UI at http://10.132.37.38:4040
18/07/16 21:22:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/07/16 21:22:16 INFO MemoryStore: MemoryStore cleared
18/07/16 21:22:16 INFO BlockManager: BlockManager stopped
18/07/16 21:22:16 INFO BlockManagerMaster: BlockManagerMaster stopped
18/07/16 21:22:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/07/16 21:22:16 INFO SparkContext: Successfully stopped SparkContext
18/07/16 21:22:16 INFO ShutdownHookManager: Shutdown hook called
18/07/16 21:22:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-2188b919-286c-4b29-a1d7-7a68e648ba9b
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$
With that, we have completed one job submission.
Tips:
1) The CLASS_NAME after --class must be fully qualified; in IDEA you can get it by right-clicking the Scala class name and choosing Copy Reference
(com.bigdata.spark.core.WordCountApp)
2) The input argument ([application-arguments]: the HDFS file path) accepts wildcards. For example, to process only the .txt files under
hdfs://hadoop01:9000/tmp/data/ , write
hdfs://hadoop01:9000/tmp/data/*.txt
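Hadoop implements its own path globbing for HDFS, but the wildcard semantics are the familiar shell-style ones. As a hypothetical local illustration (not from the original text), java.nio's glob matcher behaves the same way for the *.txt pattern:

```scala
import java.nio.file.{FileSystems, Paths}

object GlobDemo {
  // A "glob:*.txt" pattern matches bare file names ending in ".txt";
  // "*" does not cross directory boundaries.
  private val txtOnly = FileSystems.getDefault.getPathMatcher("glob:*.txt")

  def isTxt(name: String): Boolean = txtOnly.matches(Paths.get(name))
}
```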
At this point the result is only printed back to the current shell. Next, return to IDEA and modify the program to write the result to HDFS instead.
6. Modify the program to write the result to HDFS
package com.bigdata.spark.core

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val textFile = sc.textFile(args(0)) // argument: the HDFS input path
    val wc = textFile.flatMap(line => line.split("\r")) // split("\r") strips carriage returns left over from Windows-created files
      .map((_, 1)).reduceByKey(_ + _)
    //wc.collect().foreach(println) // replaced by the line below
    wc.saveAsTextFile(args(1)) // argument: the HDFS output path
    sc.stop()
  }
}
Repeat steps 4-5 above: package again via Maven Projects - Spark - Lifecycle - package, and upload the jar to /home/hadoop/lib/ on the server again.
Add the HDFS output path and submit again:
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$ ./bin/spark-submit \
> --class com.bigdata.spark.core.WordCountApp \
> --master local[2] \
> /home/hadoop/lib/spark-1.0.jar \
> hdfs://hadoop01:9000/tmp/data hdfs://hadoop01:9000/tmp/output_data
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/07/16 21:36:49 INFO SparkContext: Running Spark version 2.2.0
18/07/16 21:36:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/07/16 21:36:50 INFO SparkContext: Submitted application: com.bigdata.spark.core.WordCountApp
18/07/16 21:36:50 INFO SecurityManager: Changing view acls to: hadoop
18/07/16 21:36:50 INFO SecurityManager: Changing modify acls to: hadoop
18/07/16 21:36:50 INFO SecurityManager: Changing view acls groups to:
18/07/16 21:36:50 INFO SecurityManager: Changing modify acls groups to:
18/07/16 21:36:50 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/07/16 21:36:50 INFO Utils: Successfully started service 'sparkDriver' on port 37500.
18/07/16 21:36:50 INFO SparkEnv: Registering MapOutputTracker
18/07/16 21:36:50 INFO SparkEnv: Registering BlockManagerMaster
18/07/16 21:36:50 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/07/16 21:36:50 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/07/16 21:36:50 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-39c7d9b9-366a-4d8a-8999-c5e760f825d5
18/07/16 21:36:50 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
18/07/16 21:36:50 INFO SparkEnv: Registering OutputCommitCoordinator
18/07/16 21:36:50 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/07/16 21:36:50 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.8:4040
18/07/16 21:36:50 INFO SparkContext: Added JAR file:/home/hadoop/lib/spark-1.0.jar at spark://10.0.0.8:37500/jars/spark-1.0.jar with timestamp 1531748210949
18/07/16 21:36:51 INFO Executor: Starting executor ID driver on host localhost
18/07/16 21:36:51 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36075.
18/07/16 21:36:51 INFO NettyBlockTransferService: Server created on 10.0.0.8:36075
18/07/16 21:36:51 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/07/16 21:36:51 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.8, 36075, None)
18/07/16 21:36:51 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.8:36075 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.8, 36075, None)
18/07/16 21:36:51 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.8, 36075, None)
18/07/16 21:36:51 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.8, 36075, None)
18/07/16 21:36:51 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 222.5 KB, free 366.1 MB)
18/07/16 21:36:52 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.1 KB, free 366.1 MB)
18/07/16 21:36:52 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.0.8:36075 (size: 21.1 KB, free: 366.3 MB)
18/07/16 21:36:52 INFO SparkContext: Created broadcast 0 from textFile at WordCountApp.scala:10
18/07/16 21:36:52 INFO FileInputFormat: Total input paths to process : 1
18/07/16 21:36:52 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/07/16 21:36:52 INFO SparkContext: Starting job: saveAsTextFile at WordCountApp.scala:15
18/07/16 21:36:52 INFO DAGScheduler: Registering RDD 3 (map at WordCountApp.scala:12)
18/07/16 21:36:53 INFO DAGScheduler: Got job 0 (saveAsTextFile at WordCountApp.scala:15) with 2 output partitions
18/07/16 21:36:53 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at WordCountApp.scala:15)
18/07/16 21:36:53 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/07/16 21:36:53 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/07/16 21:36:53 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountApp.scala:12), which has no missing parents
18/07/16 21:36:53 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 366.1 MB)
18/07/16 21:36:53 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.8 KB, free 366.1 MB)
18/07/16 21:36:53 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.0.8:36075 (size: 2.8 KB, free: 366.3 MB)
18/07/16 21:36:53 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
18/07/16 21:36:53 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountApp.scala:12) (first 15 tasks are for partitions Vector(0, 1))
18/07/16 21:36:53 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/07/16 21:36:53 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 4848 bytes)
18/07/16 21:36:53 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, ANY, 4848 bytes)
18/07/16 21:36:53 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/07/16 21:36:53 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/07/16 21:36:53 INFO Executor: Fetching spark://10.0.0.8:37500/jars/spark-1.0.jar with timestamp 1531748210949
18/07/16 21:36:53 INFO TransportClientFactory: Successfully created connection to /10.0.0.8:37500 after 25 ms (0 ms spent in bootstraps)
18/07/16 21:36:53 INFO Utils: Fetching spark://10.0.0.8:37500/jars/spark-1.0.jar to /tmp/spark-1828226d-6ca6-47a3-be12-feb32a44a193/userFiles-b1a75a7b-2368-4cdc-86c0-a448052e6d12/fetchFileTemp8565928102062995692.tmp
18/07/16 21:36:53 INFO Executor: Adding file:/tmp/spark-1828226d-6ca6-47a3-be12-feb32a44a193/userFiles-b1a75a7b-2368-4cdc-86c0-a448052e6d12/spark-1.0.jar to class loader
18/07/16 21:36:53 INFO HadoopRDD: Input split: hdfs://hadoop01:9000/tmp/data/input3.txt:0+18
18/07/16 21:36:53 INFO HadoopRDD: Input split: hdfs://hadoop01:9000/tmp/data/input3.txt:18+18
18/07/16 21:36:53 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1197 bytes result sent to driver
18/07/16 21:36:53 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1197 bytes result sent to driver
18/07/16 21:36:53 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 376 ms on localhost (executor driver) (1/2)
18/07/16 21:36:53 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 365 ms on localhost (executor driver) (2/2)
18/07/16 21:36:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/07/16 21:36:53 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCountApp.scala:12) finished in 0.403 s
18/07/16 21:36:53 INFO DAGScheduler: looking for newly runnable stages
18/07/16 21:36:53 INFO DAGScheduler: running: Set()
18/07/16 21:36:53 INFO DAGScheduler: waiting: Set(ResultStage 1)
18/07/16 21:36:53 INFO DAGScheduler: failed: Set()
18/07/16 21:36:53 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at WordCountApp.scala:15), which has no missing parents
18/07/16 21:36:53 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 67.6 KB, free 366.0 MB)
18/07/16 21:36:53 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 24.2 KB, free 366.0 MB)
18/07/16 21:36:53 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.0.8:36075 (size: 24.2 KB, free: 366.3 MB)
18/07/16 21:36:53 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
18/07/16 21:36:53 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at WordCountApp.scala:15) (first 15 tasks are for partitions Vector(0, 1))
18/07/16 21:36:53 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/07/16 21:36:53 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, localhost, executor driver, partition 1, PROCESS_LOCAL, 4621 bytes)
18/07/16 21:36:53 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, localhost, executor driver, partition 0, ANY, 4621 bytes)
18/07/16 21:36:53 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
18/07/16 21:36:53 INFO Executor: Running task 0.0 in stage 1.0 (TID 3)
18/07/16 21:36:53 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 2 blocks
18/07/16 21:36:53 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/07/16 21:36:53 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
18/07/16 21:36:53 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms
18/07/16 21:36:53 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/07/16 21:36:53 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/07/16 21:36:53 INFO FileOutputCommitter: Saved output of task 'attempt_20180716213652_0001_m_000001_2' to hdfs://hadoop01:9000/tmp/output_data/_temporary/0/task_20180716213652_0001_m_000001
18/07/16 21:36:53 INFO SparkHadoopMapRedUtil: attempt_20180716213652_0001_m_000001_2: Committed
18/07/16 21:36:53 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 1138 bytes result sent to driver
18/07/16 21:36:53 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 169 ms on localhost (executor driver) (1/2)
18/07/16 21:36:53 INFO FileOutputCommitter: Saved output of task 'attempt_20180716213652_0001_m_000000_3' to hdfs://hadoop01:9000/tmp/output_data/_temporary/0/task_20180716213652_0001_m_000000
18/07/16 21:36:53 INFO SparkHadoopMapRedUtil: attempt_20180716213652_0001_m_000000_3: Committed
18/07/16 21:36:53 INFO Executor: Finished task 0.0 in stage 1.0 (TID 3). 1181 bytes result sent to driver
18/07/16 21:36:53 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 233 ms on localhost (executor driver) (2/2)
18/07/16 21:36:53 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/07/16 21:36:53 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at WordCountApp.scala:15) finished in 0.236 s
18/07/16 21:36:53 INFO DAGScheduler: Job 0 finished: saveAsTextFile at WordCountApp.scala:15, took 0.987969 s
18/07/16 21:36:53 INFO SparkUI: Stopped Spark web UI at http://10.0.0.8:4040
18/07/16 21:36:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/07/16 21:36:53 INFO MemoryStore: MemoryStore cleared
18/07/16 21:36:53 INFO BlockManager: BlockManager stopped
18/07/16 21:36:53 INFO BlockManagerMaster: BlockManagerMaster stopped
18/07/16 21:36:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/07/16 21:36:53 INFO SparkContext: Successfully stopped SparkContext
18/07/16 21:36:53 INFO ShutdownHookManager: Shutdown hook called
18/07/16 21:36:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-1828226d-6ca6-47a3-be12-feb32a44a193
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$
We can see that the result was written under /tmp/output_data/ on HDFS:
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$ hdfs dfs -ls /tmp/output_data/
Found 3 items
-rw-r--r-- 3 hadoop supergroup 0 2018-07-16 22:22 /tmp/output_data/_SUCCESS # empty marker file indicating the job succeeded
-rw-r--r-- 3 hadoop supergroup 30 2018-07-16 22:22 /tmp/output_data/part-00000 # the actual output
-rw-r--r-- 3 hadoop supergroup 0 2018-07-16 22:22 /tmp/output_data/part-00001 # empty file
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$ hdfs dfs -cat /tmp/output_data/part-00000
(scala,2)
(hello,3)
(world,1)
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$ hdfs dfs -cat /tmp/output_data/part-00001
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$
7. Further programming
Goal: sort the output by value.
Copy WordCountApp.scala and paste it into the same package.
Edit SortWordCountApp.scala:
package com.bigdata.spark.core

import org.apache.spark.{SparkConf, SparkContext}

object SortWordCountApp { // renamed
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val textFile = sc.textFile(args(0))
    val wc = textFile.flatMap(line => line.split("\r"))
      .map((_, 1)).reduceByKey(_ + _)
    // wc.collect().foreach(println)
    // wc.saveAsTextFile(args(1))
    // swap key and value, sort by key (descending), then swap back
    val sorted = wc.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1))
    sorted.saveAsTextFile(args(1))
    sc.stop()
  }
}
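The swap - sort - swap trick can be checked on plain Scala collections before touching the cluster. A minimal sketch (sortBy on the negated count plays the role of sortByKey(false)):

```scala
object SortCheck {
  // Mirror of map(_.swap).sortByKey(false).map(_.swap) from the Spark job:
  // flip to (count, word), sort by count descending, flip back.
  def sortByValueDesc(wc: Seq[(String, Int)]): Seq[(String, Int)] =
    wc.map(_.swap).sortBy(-_._1).map(_.swap)
}
```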
Package it, upload it to the server, and run it again:
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$ ./bin/spark-submit \
> --class com.bigdata.spark.core.SortWordCountApp \ # note: the CLASS_NAME is no longer the same
> --master local[2] \
> /home/hadoop/lib/spark-1.0.jar \
> hdfs://hadoop01:9000/tmp/data hdfs://hadoop01:9000/tmp/output_data1
The output:
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$ hdfs dfs -cat /tmp/output_data1/part-00000
(hello,3)
(scala,2)
(world,1)
[hadoop@hadoop01 spark-2.2.0-bin-2.6.0-cdh5.7.0]$
The result is now sorted in descending order by value.
8. Quick testing
The process above shows how to develop in IDEA, but while writing we cannot tell whether the code is correct, so we repeatedly modify code - package - submit - test, which is far too cumbersome. A simpler approach is to test with spark-shell.
[hadoop@hadoop01 ~]$ spark-shell --master local[2]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/07/16 23:23:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java cla
18/07/16 23:23:39 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.132.37.38:4040
Spark context available as 'sc' (master = local[2], app id = local-1531754613823).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val textFile = sc.textFile("hdfs://hadoop01:9000/tmp/data")
textFile: org.apache.spark.rdd.RDD[String] = hdfs://hadoop01:9000/tmp/data MapPartitionsRDD[1] at textFile at <console>:24
scala> val wc = textFile.flatMap(line => line.split("\r")).map((_,1)).reduceByKey(_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:26
scala> wc.collect // view the result
res2: Array[(String, Int)] = Array((scala,2), (hello,3), (world,1))
scala> wc.map(x => (x._2, x._1)).collect // swap key and value
res3: Array[(Int, String)] = Array((2,scala), (3,hello), (1,world))
scala> wc.map(x => (x._2, x._1)).sortByKey(false).collect // sort
res4: Array[(Int, String)] = Array((3,hello), (2,scala), (1,world))
scala> wc.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).collect // swap back
res5: Array[(String, Int)] = Array((hello,3), (scala,2), (world,1))
scala>
The advantage of spark-shell is that code can be validated quickly; moving code into IDEA only after it has been tested this way makes development much more efficient.
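In the same spirit, the whole transformation chain can be dry-run on plain Scala collections with no cluster at all (a sketch; groupBy plus a size count stands in for reduceByKey):

```scala
object LocalWordCount {
  // Collections analogue of textFile.flatMap(...).map((_, 1)).reduceByKey(_ + _):
  // split each line into words, group identical words, count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split("\\s+"))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```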