Spark word count on HDFS: an example for Eclipse

Development environment

Eclipse 4.8.0 (Java EE)
Install the Scala plug-in:
Help / Eclipse Marketplace / search for "scala" / Scala IDE 4.7.x / Install

Referenced libraries (Java Build Path):
JRE 1.8.181
Scala 2.11.11

Extract spark-2.3.1-bin-hadoop2.7.tgz
to D:\cwgis\apps\spark\
Then add all *.jar files under D:\cwgis\apps\spark\jars\ via
Java Build Path / Libraries / Add External JARs…
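If you prefer a build tool over adding JARs by hand, an equivalent sbt definition might look like the sketch below. This is an assumption, not part of the original setup; the Scala and Spark versions simply mirror the ones used above.

```scala
// build.sbt -- hypothetical sbt equivalent of the manual Build Path setup
name := "SparkWordCount"

scalaVersion := "2.11.11"

// Matches the spark-2.3.1-bin-hadoop2.7 distribution referenced above
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
```

With this in place, `sbt package` produces a JAR instead of relying on Eclipse's classpath.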

Create a new project: Scala Wizards / Scala Project
Project name: SparkWordCount

Create a new file, SparkWordCount.scala:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // To read the paths from the command line instead:
    // if (args.length < 2) {
    //   println("Usage: <in> <out>")
    //   return
    // }
    val in_path  = "hdfs://192.168.145.180:8020/spark/hellospark" // args(0)
    val out_path = "hdfs://192.168.145.180:8020/spark/output"     // args(1)
    println("hello spark")

    val conf = new SparkConf().setAppName("SparkWordCount")
    // Run locally inside Eclipse, using all available cores
    conf.setMaster("local[*]")
    // Tolerate a leftover context when re-running from the same Eclipse JVM
    conf.set("spark.driver.allowMultipleContexts", "true")

    val sc = new SparkContext(conf)
    val textRDD = sc.textFile(in_path)
    // Split each line on spaces, pair every word with 1, then sum the counts per word
    val result = textRDD.flatMap(line => line.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey((a, b) => a + b)
    result.saveAsTextFile(out_path)
    // result.collect().foreach(println)  // to print the counts to the console instead

    println(in_path)
    println(out_path)
    println("hello end")
    sc.stop()
  }
}
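The flatMap / map / reduceByKey chain at the heart of the program can be mimicked on a plain Scala collection, which makes the word-count logic easy to verify without a cluster. This is a sketch of ours, not Spark API: reduceByKey is approximated with groupBy plus a per-group sum, and the object and helper names are hypothetical.

```scala
// Local, cluster-free sketch of the same word-count pipeline.
// reduceByKey((a, b) => a + b) corresponds here to grouping the
// (word, 1) pairs by word and summing their counts.
object WordCountLocal {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // one element per word
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupBy(_._1)                      // plays the role of the shuffle by key
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Same three lines as the spark/hellospark input file below
    val sample = Seq("hello spark", "hello world", "hello spark!")
    println(wordCount(sample))
  }
}
```

Note that, as in the Spark version, "spark" and "spark!" are counted as different words because the lines are split only on spaces.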

Running the example

Contents of spark/hellospark:

hello spark
hello world
hello spark!

Contents of spark/output:

(spark!,1)
(hello,3)
(world,1)
(spark,1)
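Note that saveAsTextFile writes a directory of part files (one per partition, plus a _SUCCESS marker) rather than a single file; the listing above is the concatenation of those parts. Assuming a working HDFS client, the output could be inspected as sketched below; the NameNode address is the one used in the program.

```shell
# List the files saveAsTextFile produced, then print their combined contents
hdfs dfs -ls  hdfs://192.168.145.180:8020/spark/output
hdfs dfs -cat "hdfs://192.168.145.180:8020/spark/output/part-*"

# saveAsTextFile fails if the output directory already exists,
# so remove it before re-running the job:
hdfs dfs -rm -r hdfs://192.168.145.180:8020/spark/output
```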

The console output of a run is shown below:

hello spark
2018-09-07 20:27:46 INFO  SparkContext:54 - Running Spark version 2.3.1
2018-09-07 20:27:46 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-09-07 20:27:46 INFO  SparkContext:54 - Submitted application: SparkWordCount
2018-09-07 20:27:46 INFO  SecurityManager:54 - Changing view acls to: hsg
2018-09-07 20:27:46 INFO  SecurityManager:54 - Changing modify acls to: hsg
2018-09-07 20:27:46 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-09-07 20:27:46 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-09-07 20:27:46 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hsg); groups with view permissions: Set(); users  with modify permissions: Set(hsg); groups with modify permissions: Set()
2018-09-07 20:27:46 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 1623.
2018-09-07 20:27:46 INFO  SparkEnv:54 - Registering MapOutputTracker
2018-09-07 20:27:46 INFO  SparkEnv:54 - Registering BlockManagerMaster
2018-09-07 20:27:46 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-09-07 20:27:46 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-09-07 20:27:46 INFO  DiskBlockManager:54 - Created local directory at C:\Users\hsg\AppData\Local\Temp\blockmgr-2184be5b-b56c-4e63-a47f-a6bee53a2cce
2018-09-07 20:27:46 INFO  MemoryStore:54 - MemoryStore started with capacity 1987.5 MB
2018-09-07 20:27:46 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2018-09-07 20:27:46 INFO  log:192 - Logging initialized @1353ms
2018-09-07 20:27:46 INFO  Server:346 - jetty-9.3.z-SNAPSHOT
2018-09-07 20:27:47 INFO  Server:414 - Started @1408ms
2018-09-07 20:27:47 INFO  AbstractConnector:278 - Started ServerConnector@bcef303{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-09-07 20:27:47 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@531f4093{/jobs,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@403f0a22{/jobs/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@503ecb24{/jobs/job,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6995bf68{/jobs/job/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5143c662{/stages,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@77825085{/stages/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3568f9d2{/stages/stage,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5b1ebf56{/stages/stage/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@294a6b8e{/stages/pool,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b1d6571{/stages/pool/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1b835480{/storage,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3549bca9{/storage/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4f25b795{/storage/rdd,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6fb365ed{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e950bcf{/environment,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@16414e40{/environment/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74bada02{/executors,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@525575{/executors/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@46dffdc3{/executors/threadDump,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5a709816{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@78383390{/static,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@31bcf236{/,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b3ed2f0{/api,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a12c404{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1941a8ff{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://hsgpc:4040
2018-09-07 20:27:47 INFO  Executor:54 - Starting executor ID driver on host localhost
2018-09-07 20:27:47 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 1664.
2018-09-07 20:27:47 INFO  NettyBlockTransferService:54 - Server created on hsgpc:1664
2018-09-07 20:27:47 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-09-07 20:27:47 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, hsgpc, 1664, None)
2018-09-07 20:27:47 INFO  BlockManagerMasterEndpoint:54 - Registering block manager hsgpc:1664 with 1987.5 MB RAM, BlockManagerId(driver, hsgpc, 1664, None)
2018-09-07 20:27:47 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, hsgpc, 1664, None)
2018-09-07 20:27:47 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, hsgpc, 1664, None)
2018-09-07 20:27:47 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6722db6e{/metrics/json,null,AVAILABLE,@Spark}
2018-09-07 20:27:47 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 236.7 KB, free 1987.3 MB)
2018-09-07 20:27:47 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.9 KB, free 1987.2 MB)
2018-09-07 20:27:47 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on hsgpc:1664 (size: 22.9 KB, free: 1987.5 MB)
2018-09-07 20:27:47 INFO  SparkContext:54 - Created broadcast 0 from textFile at SparkWordCount.scala:22
2018-09-07 20:27:48 INFO  FileInputFormat:249 - Total input paths to process : 1
2018-09-07 20:27:48 INFO  deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2018-09-07 20:27:48 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-09-07 20:27:48 INFO  SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2018-09-07 20:27:48 INFO  DAGScheduler:54 - Registering RDD 3 (map at SparkWordCount.scala:23)
2018-09-07 20:27:48 INFO  DAGScheduler:54 - Got job 0 (runJob at SparkHadoopWriter.scala:78) with 2 output partitions
2018-09-07 20:27:48 INFO  DAGScheduler:54 - Final stage: ResultStage 1 (runJob at SparkHadoopWriter.scala:78)
2018-09-07 20:27:48 INFO  DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 0)
2018-09-07 20:27:48 INFO  DAGScheduler:54 - Missing parents: List(ShuffleMapStage 0)
2018-09-07 20:27:48 INFO  DAGScheduler:54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at SparkWordCount.scala:23), which has no missing parents
2018-09-07 20:27:48 INFO  MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 1987.2 MB)
2018-09-07 20:27:48 INFO  MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.8 KB, free 1987.2 MB)
2018-09-07 20:27:48 INFO  BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on hsgpc:1664 (size: 2.8 KB, free: 1987.5 MB)
2018-09-07 20:27:48 INFO  SparkContext:54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1039
2018-09-07 20:27:48 INFO  DAGScheduler:54 - Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at SparkWordCount.scala:23) (first 15 tasks are for partitions Vector(0, 1))
2018-09-07 20:27:48 INFO  TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2018-09-07 20:27:48 INFO  TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 7880 bytes)
2018-09-07 20:27:48 INFO  TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, ANY, 7880 bytes)
2018-09-07 20:27:48 INFO  Executor:54 - Running task 0.0 in stage 0.0 (TID 0)
2018-09-07 20:27:48 INFO  Executor:54 - Running task 1.0 in stage 0.0 (TID 1)
2018-09-07 20:27:48 INFO  HadoopRDD:54 - Input split: hdfs://192.168.145.180:8020/spark/hellospark:0+18
2018-09-07 20:27:48 INFO  HadoopRDD:54 - Input split: hdfs://192.168.145.180:8020/spark/hellospark:18+18
2018-09-07 20:27:48 INFO  Executor:54 - Finished task 0.0 in stage 0.0 (TID 0). 1147 bytes result sent to driver
2018-09-07 20:27:48 INFO  TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 300 ms on localhost (executor driver) (1/2)
2018-09-07 20:27:48 INFO  Executor:54 - Finished task 1.0 in stage 0.0 (TID 1). 1104 bytes result sent to driver
2018-09-07 20:27:48 INFO  TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 301 ms on localhost (executor driver) (2/2)
2018-09-07 20:27:48 INFO  TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool 
2018-09-07 20:27:48 INFO  DAGScheduler:54 - ShuffleMapStage 0 (map at SparkWordCount.scala:23) finished in 0.358 s
2018-09-07 20:27:48 INFO  DAGScheduler:54 - looking for newly runnable stages
2018-09-07 20:27:48 INFO  DAGScheduler:54 - running: Set()
2018-09-07 20:27:48 INFO  DAGScheduler:54 - waiting: Set(ResultStage 1)
2018-09-07 20:27:48 INFO  DAGScheduler:54 - failed: Set()
2018-09-07 20:27:48 INFO  DAGScheduler:54 - Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at SparkWordCount.scala:24), which has no missing parents
2018-09-07 20:27:48 INFO  MemoryStore:54 - Block broadcast_2 stored as values in memory (estimated size 72.3 KB, free 1987.2 MB)
2018-09-07 20:27:48 INFO  MemoryStore:54 - Block broadcast_2_piece0 stored as bytes in memory (estimated size 26.1 KB, free 1987.1 MB)
2018-09-07 20:27:48 INFO  BlockManagerInfo:54 - Added broadcast_2_piece0 in memory on hsgpc:1664 (size: 26.1 KB, free: 1987.4 MB)
2018-09-07 20:27:48 INFO  SparkContext:54 - Created broadcast 2 from broadcast at DAGScheduler.scala:1039
2018-09-07 20:27:48 INFO  DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at SparkWordCount.scala:24) (first 15 tasks are for partitions Vector(0, 1))
2018-09-07 20:27:48 INFO  TaskSchedulerImpl:54 - Adding task set 1.0 with 2 tasks
2018-09-07 20:27:48 INFO  TaskSetManager:54 - Starting task 0.0 in stage 1.0 (TID 2, localhost, executor driver, partition 0, ANY, 7649 bytes)
2018-09-07 20:27:48 INFO  TaskSetManager:54 - Starting task 1.0 in stage 1.0 (TID 3, localhost, executor driver, partition 1, ANY, 7649 bytes)
2018-09-07 20:27:48 INFO  Executor:54 - Running task 0.0 in stage 1.0 (TID 2)
2018-09-07 20:27:48 INFO  Executor:54 - Running task 1.0 in stage 1.0 (TID 3)
2018-09-07 20:27:48 INFO  ShuffleBlockFetcherIterator:54 - Getting 2 non-empty blocks out of 2 blocks
2018-09-07 20:27:48 INFO  ShuffleBlockFetcherIterator:54 - Getting 1 non-empty blocks out of 2 blocks
2018-09-07 20:27:48 INFO  ShuffleBlockFetcherIterator:54 - Started 0 remote fetches in 5 ms
2018-09-07 20:27:48 INFO  ShuffleBlockFetcherIterator:54 - Started 0 remote fetches in 5 ms
2018-09-07 20:27:48 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-09-07 20:27:48 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-09-07 20:27:49 INFO  FileOutputCommitter:535 - Saved output of task 'attempt_20180907202748_0005_m_000001_0' to hdfs://192.168.145.180:8020/spark/output/_temporary/0/task_20180907202748_0005_m_000001
2018-09-07 20:27:49 INFO  SparkHadoopMapRedUtil:54 - attempt_20180907202748_0005_m_000001_0: Committed
2018-09-07 20:27:49 INFO  Executor:54 - Finished task 1.0 in stage 1.0 (TID 3). 1502 bytes result sent to driver
2018-09-07 20:27:49 INFO  TaskSetManager:54 - Finished task 1.0 in stage 1.0 (TID 3) in 696 ms on localhost (executor driver) (1/2)
2018-09-07 20:27:49 INFO  FileOutputCommitter:535 - Saved output of task 'attempt_20180907202748_0005_m_000000_0' to hdfs://192.168.145.180:8020/spark/output/_temporary/0/task_20180907202748_0005_m_000000
2018-09-07 20:27:49 INFO  SparkHadoopMapRedUtil:54 - attempt_20180907202748_0005_m_000000_0: Committed
2018-09-07 20:27:49 INFO  Executor:54 - Finished task 0.0 in stage 1.0 (TID 2). 1459 bytes result sent to driver
2018-09-07 20:27:49 INFO  TaskSetManager:54 - Finished task 0.0 in stage 1.0 (TID 2) in 716 ms on localhost (executor driver) (2/2)
2018-09-07 20:27:49 INFO  TaskSchedulerImpl:54 - Removed TaskSet 1.0, whose tasks have all completed, from pool 
2018-09-07 20:27:49 INFO  DAGScheduler:54 - ResultStage 1 (runJob at SparkHadoopWriter.scala:78) finished in 0.734 s
2018-09-07 20:27:49 INFO  DAGScheduler:54 - Job 0 finished: runJob at SparkHadoopWriter.scala:78, took 1.260756 s
2018-09-07 20:27:49 INFO  SparkHadoopWriter:54 - Job job_20180907202748_0005 committed.
hdfs://192.168.145.180:8020/spark/hellospark
hdfs://192.168.145.180:8020/spark/output
hello end
2018-09-07 20:27:49 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2018-09-07 20:27:49 INFO  AbstractConnector:318 - Stopped Spark@bcef303{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-09-07 20:27:49 INFO  SparkUI:54 - Stopped Spark web UI at http://hsgpc:4040
2018-09-07 20:27:49 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-09-07 20:27:49 INFO  MemoryStore:54 - MemoryStore cleared
2018-09-07 20:27:49 INFO  BlockManager:54 - BlockManager stopped
2018-09-07 20:27:49 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2018-09-07 20:27:49 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-09-07 20:27:49 INFO  SparkContext:54 - Successfully stopped SparkContext
2018-09-07 20:27:49 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-09-07 20:27:49 INFO  ShutdownHookManager:54 - Deleting directory C:\Users\hsg\AppData\Local\Temp\spark-97dc8724-e958-4db6-a005-5e365dcdcdba
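To run the same program outside Eclipse, the packaged class could be handed to spark-submit, as sketched below. The JAR name is hypothetical, and with the command-line check in main re-enabled the two paths would arrive as args(0) and args(1).

```shell
# Hypothetical submission of the packaged job; the JAR path is illustrative
spark-submit \
  --class SparkWordCount \
  --master local[*] \
  SparkWordCount.jar \
  hdfs://192.168.145.180:8020/spark/hellospark \
  hdfs://192.168.145.180:8020/spark/output
```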

(the end)
