Developing a WordCount Program with JDK 8 + Scala 2.11 + Spark 2.0.0 + IntelliJ IDEA 2017.3.4 and Running It on a Cluster

一 Installing the JDK
Download link
Downloaded file: jdk-8u162-windows-x64.exe
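After the installer finishes, a quick sanity check from a command prompt confirms the JDK is on the PATH (the exact version banner varies by build):

java -version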

二 Installing Scala
Download link
Downloaded file: scala-2.11.8.msi
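Likewise, once the MSI installer completes, the Scala version can be verified from a command prompt:

scala -version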

三 Installing IntelliJ IDEA
Download the Community Edition:
ideaIC-2017.3.4.exe
Installation path: D:\Program Files\JetBrains\IntelliJ
Before starting development, install the Scala plugin: after launching the IDE, click "File -> Settings", select "Plugins", search for Scala, and install it.

四 Creating the Project sparkproject
1 Step one
2 Step two
3 When the wizard finishes, wait a while; SBT automatically downloads the required JARs and related files
4 Load the Spark packages (a minimal build.sbt sketch follows this list)
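What SBT downloads is driven by build.sbt. A minimal sketch matching the versions used here (the project name and version number are assumptions, not taken from the original setup):

name := "sparkproject"

version := "1.0"

scalaVersion := "2.11.8"

// Spark 2.0.0 is built against Scala 2.11; marking spark-core as "provided"
// keeps it out of the application JAR, since the cluster already supplies it
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"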

五 The WordCount Program
1 Create a SparkContext
2 Load the data
3 Split each line into words
4 Convert the words to (word, 1) pairs and count them
The complete code is as follows:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkContext; the master URL is supplied by spark-submit
    val conf = new SparkConf().setAppName("wordcount")
    val sc = new SparkContext(conf)
    // 2. Load the data
    val input = sc.textFile("/root/helloSpark.txt")
    // 3. Split each line into words
    val words = input.flatMap(line => line.split(" "))
    // 4. Convert to (word, 1) pairs and sum the counts per word
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    // saveAsTextFile returns Unit, so there is nothing useful to assign
    counts.saveAsTextFile("/root/result")
    sc.stop()
  }
}
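To debug the logic inside IntelliJ before packaging, a local master can be set on the SparkConf. A minimal sketch, assuming the input file also exists on the development machine (WordCountLocal is a hypothetical name, not part of the project above):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process on all available cores; for testing only,
    // the cluster run below still goes through spark-submit with --master
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile("/root/helloSpark.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()            // the result set is tiny, safe to bring to the driver
      .foreach(println)
    sc.stop()
  }
}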

六 Packaging
Step 1: Configure the JAR artifact properties
Step 2: Build the JAR
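Since the project is SBT-based, the JAR can also be produced from the command line instead of the IDE dialogs. A sketch, assuming the build.sbt above (the exact file name under target/ depends on the name and version settings):

sbt clean package
# the JAR lands under target/scala-2.11/; copy it to the cluster node,
# e.g. as /root/sparkproject.jar, to match the spark-submit command below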

七 Running the Job
[root@master sbin]# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-2.0.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
localhost: \S
localhost: Kernel \r on an \m
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.0.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-master.out
[root@master sbin]# jps
2438 Master
2583 Jps
2504 Worker
[root@master sbin]# spark-submit --master spark://master:7077 --class WordCount /root/sparkproject.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/02/04 13:29:59 INFO SparkContext: Running Spark version 2.0.0
18/02/04 13:29:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/04 13:30:00 INFO SecurityManager: Changing view acls to: root
18/02/04 13:30:00 INFO SecurityManager: Changing modify acls to: root
18/02/04 13:30:00 INFO SecurityManager: Changing view acls groups to:
18/02/04 13:30:00 INFO SecurityManager: Changing modify acls groups to:
18/02/04 13:30:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
18/02/04 13:30:00 INFO Utils: Successfully started service 'sparkDriver' on port 48514.
18/02/04 13:30:00 INFO SparkEnv: Registering MapOutputTracker
18/02/04 13:30:00 INFO SparkEnv: Registering BlockManagerMaster
18/02/04 13:30:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-d67b5767-77b0-4a83-b75c-15f10ee470b3
18/02/04 13:30:00 INFO MemoryStore: MemoryStore started with capacity 413.9 MB
18/02/04 13:30:00 INFO SparkEnv: Registering OutputCommitCoordinator
18/02/04 13:30:01 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/02/04 13:30:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.110:4040
18/02/04 13:30:01 INFO SparkContext: Added JAR file:/root/sparkproject.jar at spark://192.168.0.110:48514/jars/sparkproject.jar with timestamp 1517722201114
18/02/04 13:30:01 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://master:7077...
18/02/04 13:30:01 INFO TransportClientFactory: Successfully created connection to master/192.168.0.110:7077 after 28 ms (0 ms spent in bootstraps)
18/02/04 13:30:01 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20180204133001-0000
18/02/04 13:30:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59853.
18/02/04 13:30:01 INFO NettyBlockTransferService: Server created on 192.168.0.110:59853
18/02/04 13:30:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.110, 59853)
18/02/04 13:30:01 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.110:59853 with 413.9 MB RAM, BlockManagerId(driver, 192.168.0.110, 59853)
18/02/04 13:30:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.110, 59853)
18/02/04 13:30:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180204133001-0000/0 on worker-20180204131918-192.168.0.110-52161 (192.168.0.110:52161) with 1 cores
18/02/04 13:30:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20180204133001-0000/0 on hostPort 192.168.0.110:52161 with 1 cores, 1024.0 MB RAM
18/02/04 13:30:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180204133001-0000/0 is now RUNNING
18/02/04 13:30:02 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/02/04 13:30:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 145.5 KB, free 413.8 MB)
18/02/04 13:30:04 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.3 KB, free 413.8 MB)
18/02/04 13:30:04 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.110:59853 (size: 16.3 KB, free: 413.9 MB)
18/02/04 13:30:04 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:7
18/02/04 13:30:04 INFO FileInputFormat: Total input paths to process : 1
18/02/04 13:30:05 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/02/04 13:30:05 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/02/04 13:30:05 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/02/04 13:30:05 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/02/04 13:30:05 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/02/04 13:30:05 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/02/04 13:30:05 INFO SparkContext: Starting job: saveAsTextFile at WordCount.scala:10
18/02/04 13:30:05 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:9)
18/02/04 13:30:05 INFO DAGScheduler: Got job 0 (saveAsTextFile at WordCount.scala:10) with 2 output partitions
18/02/04 13:30:05 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at WordCount.scala:10)
18/02/04 13:30:05 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/02/04 13:30:05 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/02/04 13:30:06 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:9), which has no missing parents
18/02/04 13:30:06 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.5 KB, free 413.8 MB)
18/02/04 13:30:06 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.6 KB, free 413.8 MB)
18/02/04 13:30:06 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.0.110:59853 (size: 2.6 KB, free: 413.9 MB)
18/02/04 13:30:06 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1012
18/02/04 13:30:06 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:9)
18/02/04 13:30:06 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/02/04 13:30:07 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.0.110:55321) with ID 0
18/02/04 13:30:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.0.110, partition 0, PROCESS_LOCAL, 5418 bytes)
18/02/04 13:30:08 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 0 on executor id: 0 hostname: 192.168.0.110.
18/02/04 13:30:08 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.110:36305 with 413.9 MB RAM, BlockManagerId(0, 192.168.0.110, 36305)
18/02/04 13:30:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.0.110:36305 (size: 2.6 KB, free: 413.9 MB)
18/02/04 13:30:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.110:36305 (size: 16.3 KB, free: 413.9 MB)
18/02/04 13:30:17 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.0.110, partition 1, PROCESS_LOCAL, 5418 bytes)
18/02/04 13:30:17 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 1 on executor id: 0 hostname: 192.168.0.110.
18/02/04 13:30:17 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 9372 ms on 192.168.0.110 (1/2)
18/02/04 13:30:17 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 295 ms on 192.168.0.110 (2/2)
18/02/04 13:30:17 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/04 13:30:17 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:9) finished in 11.182 s
18/02/04 13:30:17 INFO DAGScheduler: looking for newly runnable stages
18/02/04 13:30:17 INFO DAGScheduler: running: Set()
18/02/04 13:30:17 INFO DAGScheduler: waiting: Set(ResultStage 1)
18/02/04 13:30:17 INFO DAGScheduler: failed: Set()
18/02/04 13:30:17 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at WordCount.scala:10), which has no missing parents
18/02/04 13:30:17 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 72.1 KB, free 413.7 MB)
18/02/04 13:30:17 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 26.1 KB, free 413.7 MB)
18/02/04 13:30:17 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.0.110:59853 (size: 26.1 KB, free: 413.9 MB)
18/02/04 13:30:17 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1012
18/02/04 13:30:17 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at WordCount.scala:10)
18/02/04 13:30:17 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/02/04 13:30:17 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 192.168.0.110, partition 0, NODE_LOCAL, 5206 bytes)
18/02/04 13:30:17 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 2 on executor id: 0 hostname: 192.168.0.110.
18/02/04 13:30:17 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.0.110:36305 (size: 26.1 KB, free: 413.9 MB)
18/02/04 13:30:18 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 192.168.0.110:55321
18/02/04 13:30:18 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 155 bytes
18/02/04 13:30:18 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 192.168.0.110, partition 1, NODE_LOCAL, 5206 bytes)
18/02/04 13:30:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 3 on executor id: 0 hostname: 192.168.0.110.
18/02/04 13:30:18 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 870 ms on 192.168.0.110 (1/2)
18/02/04 13:30:18 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 237 ms on 192.168.0.110 (2/2)
18/02/04 13:30:18 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/02/04 13:30:18 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at WordCount.scala:10) finished in 1.097 s
18/02/04 13:30:18 INFO DAGScheduler: Job 0 finished: saveAsTextFile at WordCount.scala:10, took 13.562280 s
18/02/04 13:30:19 INFO SparkContext: Invoking stop() from shutdown hook
18/02/04 13:30:19 INFO SparkUI: Stopped Spark web UI at http://192.168.0.110:4040
18/02/04 13:30:19 INFO StandaloneSchedulerBackend: Shutting down all executors
18/02/04 13:30:19 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/02/04 13:30:19 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/02/04 13:30:19 INFO MemoryStore: MemoryStore cleared
18/02/04 13:30:19 INFO BlockManager: BlockManager stopped
18/02/04 13:30:19 INFO BlockManagerMaster: BlockManagerMaster stopped
18/02/04 13:30:19 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/02/04 13:30:19 WARN Dispatcher: Message RemoteProcessDisconnected(192.168.0.110:55323) dropped. RpcEnv already stopped.
18/02/04 13:30:19 WARN Dispatcher: Message RemoteProcessDisconnected(192.168.0.110:55323) dropped. RpcEnv already stopped.
18/02/04 13:30:19 INFO SparkContext: Successfully stopped SparkContext
18/02/04 13:30:19 INFO ShutdownHookManager: Shutdown hook called
18/02/04 13:30:19 INFO ShutdownHookManager: Deleting directory /tmp/spark-8a2cf344-de9d-48c6-9cd2-926f9fd19de1
[root@master sbin]# cd /root
[root@master ~]# ls
sparkproject.jar           helloSpark.txt        result
[root@master ~]# cat helloSpark.txt
go to home hello java
so many to hello word kafka java
[root@master ~]# cd result
[root@master result]# ll
total 8
-rw-r--r-- 1 root root 52 Feb  4 13:30 part-00000
-rw-r--r-- 1 root root 25 Feb  4 13:30 part-00001
-rw-r--r-- 1 root root  0 Feb  4 13:30 _SUCCESS
[root@master result]# cat part-0000
cat: part-0000: No such file or directory
[root@master result]# cat part-00000
(word,1)
(hello,2)
(java,2)
(go,2)
(so,2)
(kafka,1)
[root@master result]# cat part-00001
(many,1)
(home,1)
(to,3)
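Each reduce partition writes its own part file, which is why the counts are split across part-00000 and part-00001. If a single output file is preferred, one option is to coalesce to one partition before saving. A sketch (/root/result-single is an assumed path, since saveAsTextFile fails if the target directory already exists; funneling everything through one task is only sensible for small results):

// in WordCount.main, replacing the save step
counts.coalesce(1).saveAsTextFile("/root/result-single")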