Example: Running a Spark Program Locally

Use Scala IDE for Eclipse (download it from the official site) to develop a Spark program that runs locally.

After the download completes, make the following changes to the IDE before writing any code:

(1) Scala IDE ships with Scala 2.11.7 by default; switch the project to Scala 2.10.x, the version Spark 1.6.0 is built against.

(2) Add the Spark 1.6.0 jar dependencies to the project.
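If you prefer to manage the dependency with sbt instead of adding the jars to the build path by hand, a minimal build.sbt sketch could look like the following (the project name and the exact 2.10.x patch version are assumptions):

// build.sbt -- minimal sketch; project name and Scala patch version are assumptions
name := "MyFirstScala"

version := "1.0"

// Spark 1.6.0 is built against Scala 2.10.x
scalaVersion := "2.10.6"

// spark-core provides SparkConf, SparkContext and the RDD API used below
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"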


package com.jiang.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    /**
     * Step 1: Create the SparkConf object, which holds the runtime configuration of the
     * Spark program. For example, setMaster sets the URL of the Spark cluster's Master
     * that the program connects to; setting it to "local" makes the program run locally.
     */
    val conf = new SparkConf() // create the SparkConf object

    conf.setAppName("Wow")  // application name, shown in the monitoring UI while the program runs
    conf.setMaster("local") // run locally; no Spark cluster is needed

    /**
     * Step 2: Create the SparkContext object.
     * SparkContext is the sole entry point to all Spark functionality; whether you use
     * Scala, Java, Python or R, every Spark program must have one.
     * Its core role is to initialize the components the application needs at runtime,
     * including the DAGScheduler, TaskScheduler and SchedulerBackend, and it also
     * registers the program with the Master. It is the single most important object in
     * a Spark application.
     */
    val sc = new SparkContext(conf) // create the SparkContext; the SparkConf instance supplies the runtime parameters and configuration

    /**
     * Step 3: Use the SparkContext to create an RDD from the concrete data source
     * (HDFS, HBase, the local file system, a DB, S3, and so on).
     * There are basically three ways to create an RDD: from an external data source,
     * from a Scala collection, or from another RDD via a transformation.
     * The data is divided into a series of Partitions; the data assigned to one
     * Partition is the processing scope of one Task.
     */
    val lines = sc.textFile("README.md", 1)

    /**
     * Step 4: Apply Transformation-level operations (higher-order functions such as
     * map and filter) to the initial RDD to perform the actual computation.
     * Step 4.1: Split each line into individual words.
     */
    val words = lines.flatMap { line => line.split(" ") }

    /**
     * Step 4.2: On top of the split words, count each word occurrence as 1, i.e. word => (word, 1).
     */
    val pairs = words.map { word => (word, 1) }

    /**
     * Step 4.3: On top of the per-occurrence counts of 1, sum up the total number of
     * times each word appears in the file.
     */
    val wordCounts = pairs.reduceByKey(_ + _)

    wordCounts.foreach(wordNumberPair => println(wordNumberPair._1 + " : " + wordNumberPair._2))

    sc.stop()
  }
}
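Running the WordCount object directly in the IDE (Run As > Scala Application) produces the output shown below. Alternatively, once the class has been packaged into a jar (wordcount.jar is a hypothetical name), it could be launched from the command line with spark-submit, for example:

spark-submit --class com.jiang.spark.WordCount --master local wordcount.jar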


Program output

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/02/03 13:59:13 INFO SparkContext: Running Spark version 1.6.0
16/02/03 13:59:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/03 13:59:14 INFO SecurityManager: Changing view acls to: FengMac
16/02/03 13:59:14 INFO SecurityManager: Changing modify acls to: FengMac
16/02/03 13:59:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(FengMac); users with modify permissions: Set(FengMac)
16/02/03 13:59:14 INFO Utils: Successfully started service 'sparkDriver' on port 49558.
16/02/03 13:59:15 INFO Slf4jLogger: Slf4jLogger started
16/02/03 13:59:15 INFO Remoting: Starting remoting
16/02/03 13:59:15 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.27.35.4:49559]
16/02/03 13:59:15 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 49559.
16/02/03 13:59:15 INFO SparkEnv: Registering MapOutputTracker
16/02/03 13:59:15 INFO SparkEnv: Registering BlockManagerMaster
16/02/03 13:59:15 INFO DiskBlockManager: Created local directory at /private/var/folders/dw/1rt91pqx2dv298yfpf0ngz1w0000gn/T/blockmgr-d1b26cd4-bd6d-4b49-bf5c-336f915ee092
16/02/03 13:59:15 INFO MemoryStore: MemoryStore started with capacity 457.9 MB
16/02/03 13:59:15 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/03 13:59:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/02/03 13:59:15 INFO SparkUI: Started SparkUI at http://172.27.35.4:4040
16/02/03 13:59:15 INFO Executor: Starting executor ID driver on host localhost
16/02/03 13:59:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49560.
16/02/03 13:59:15 INFO NettyBlockTransferService: Server created on 49560
16/02/03 13:59:15 INFO BlockManagerMaster: Trying to register BlockManager
16/02/03 13:59:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:49560 with 457.9 MB RAM, BlockManagerId(driver, localhost, 49560)
16/02/03 13:59:16 INFO BlockManagerMaster: Registered BlockManager
16/02/03 13:59:16 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 127.4 KB)
16/02/03 13:59:16 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 141.3 KB)
16/02/03 13:59:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49560 (size: 13.9 KB, free: 457.9 MB)
16/02/03 13:59:16 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:31
16/02/03 13:59:16 INFO FileInputFormat: Total input paths to process : 1
16/02/03 13:59:16 INFO SparkContext: Starting job: foreach at WordCount.scala:50
16/02/03 13:59:17 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:43)
16/02/03 13:59:17 INFO DAGScheduler: Got job 0 (foreach at WordCount.scala:50) with 1 output partitions
16/02/03 13:59:17 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at WordCount.scala:50)
16/02/03 13:59:17 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/02/03 13:59:17 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/02/03 13:59:17 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:43), which has no missing parents
16/02/03 13:59:17 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.0 KB, free 145.4 KB)
16/02/03 13:59:17 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 147.6 KB)
16/02/03 13:59:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49560 (size: 2.3 KB, free: 457.9 MB)
16/02/03 13:59:17 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/03 13:59:17 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:43)
16/02/03 13:59:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/02/03 13:59:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2143 bytes)
16/02/03 13:59:17 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/02/03 13:59:17 INFO HadoopRDD: Input split: file:/Users/FengMac/workspace/MyFirstScala/README.md:0+3359
16/02/03 13:59:17 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/02/03 13:59:17 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/02/03 13:59:17 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/02/03 13:59:17 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/02/03 13:59:17 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/02/03 13:59:17 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2253 bytes result sent to driver
16/02/03 13:59:17 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 286 ms on localhost (1/1)
16/02/03 13:59:17 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/02/03 13:59:17 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:43) finished in 0.306 s
16/02/03 13:59:17 INFO DAGScheduler: looking for newly runnable stages
16/02/03 13:59:17 INFO DAGScheduler: running: Set()
16/02/03 13:59:17 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/02/03 13:59:17 INFO DAGScheduler: failed: Set()
16/02/03 13:59:17 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:48), which has no missing parents
16/02/03 13:59:17 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.5 KB, free 150.1 KB)
16/02/03 13:59:17 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1584.0 B, free 151.7 KB)
16/02/03 13:59:17 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:49560 (size: 1584.0 B, free: 457.9 MB)
16/02/03 13:59:17 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/03 13:59:17 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:48)
16/02/03 13:59:17 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/02/03 13:59:17 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1894 bytes)
16/02/03 13:59:17 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/02/03 13:59:17 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/02/03 13:59:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
package : 1
For : 2
Programs : 1
processing. : 1
Because : 1
The : 1
cluster. : 1
its : 1
[run : 1
APIs : 1
have : 1
Try : 1
computation : 1
through : 1
several : 1
This : 2
graph : 1
Hive : 2
storage : 1
["Specifying : 1
To : 2
page](http://spark.apache.org/documentation.html) : 1
Once : 1
"yarn" : 1
prefer : 1
SparkPi : 2
engine : 1
version : 1
file : 1
documentation, : 1
processing, : 1
the : 21
are : 1
systems. : 1
params : 1
not : 1
different : 1
refer : 2
Interactive : 2
R, : 1
given. : 1
if : 4
build : 3
when : 1
be : 2
Tests : 1
Apache : 1
./bin/run-example : 2
programs, : 1
including : 3
Spark. : 1
package. : 1
1000).count() : 1
Versions : 1
HDFS : 1
Data. : 1
>>> : 1
programming : 1
Testing : 1
module, : 1
Streaming : 1
environment : 1
run: : 1
clean : 1
1000: : 2
rich : 1
GraphX : 1
Please : 3
is : 6
run : 7
URL, : 1
threads. : 1
same : 1
MASTER=spark://host:7077 : 1
on : 5
built : 1
against : 1
[Apache : 1
tests : 2
examples : 2
at : 2
optimized : 1
usage : 1
using : 2
graphs : 1
talk : 1
Shell : 2
class : 2
abbreviated : 1
directory. : 1
README : 1
computing : 1
overview : 1
`examples` : 2
example: : 1
## : 8
N : 1
set : 2
use : 3
Hadoop-supported : 1
tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools). : 1
running : 1
find : 1
contains : 1
project : 1
Pi : 1
need : 1
or : 3
Big : 1
Java, : 1
high-level : 1
uses : 1
<class> : 1
Hadoop, : 2
available : 1
requires : 1
(You : 1
see : 1
Documentation : 1
of : 5
tools : 1
using: : 1
cluster : 2
must : 1
supports : 2
built, : 1
system : 1
build/mvn : 1
Hadoop : 3
this : 1
Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version) : 1
particular : 2
Python : 2
Spark : 13
general : 2
YARN, : 1
pre-built : 1
[Configuration : 1
locally : 2
library : 1
A : 1
locally. : 1
sc.parallelize(1 : 1
only : 1
Configuration : 1
following : 2
basic : 1
# : 1
changed : 1
More : 1
which : 2
learning, : 1
first : 1
./bin/pyspark : 1
also : 4
should : 2
for : 11
[params]`. : 1
documentation : 3
[project : 2
mesos:// : 1
Maven](http://maven.apache.org/). : 1
setup : 1
<http://spark.apache.org/> : 1
latest : 1
your : 1
MASTER : 1
example : 3
scala> : 1
DataFrames, : 1
provides : 1
configure : 1
distributions. : 1
can : 6
About : 1
instructions. : 1
do : 2
easiest : 1
no : 1
how : 2
`./bin/run-example : 1
Note : 1
individual : 1
spark:// : 1
It : 2
Scala : 2
Alternatively, : 1
an : 3
variable : 1
submit : 1
machine : 1
thread, : 1
them, : 1
detailed : 2
stream : 1
And : 1
distribution : 1
return : 2
Thriftserver : 1
./bin/spark-shell : 1
"local" : 1
start : 1
You : 3
Spark](#building-spark). : 1
one : 2
help : 1
with : 3
print : 1
Spark"](http://spark.apache.org/docs/latest/building-spark.html). : 1
data : 1
wiki](https://cwiki.apache.org/confluence/display/SPARK). : 1
in : 5
-DskipTests : 1
downloaded : 1
versions : 1
online : 1
Guide](http://spark.apache.org/docs/latest/configuration.html) : 1
comes : 1
[building : 1
Python, : 2
Many : 1
building : 2
Running : 1
from : 1
way : 1
Online : 1
site, : 1
other : 1
Example : 1
analysis. : 1
sc.parallelize(range(1000)).count() : 1
you : 4
runs. : 1
Building : 1
higher-level : 1
protocols : 1
guidance : 2
a : 8
guide, : 1
name : 1
fast : 1
SQL : 2
will : 1
instance: : 1
to : 14
core : 1
 : 67
web : 1
"local[N]" : 1
programs : 2
package.) : 1
that : 2
MLlib : 1
["Building : 1
shell: : 2
Scala, : 1
and : 10
command, : 2
./dev/run-tests : 1
sample : 1
16/02/03 13:59:17 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
16/02/03 13:59:17 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 109 ms on localhost (1/1)
16/02/03 13:59:17 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
16/02/03 13:59:17 INFO DAGScheduler: ResultStage 1 (foreach at WordCount.scala:50) finished in 0.110 s
16/02/03 13:59:17 INFO DAGScheduler: Job 0 finished: foreach at WordCount.scala:50, took 0.621807 s
16/02/03 13:59:17 INFO SparkUI: Stopped Spark web UI at http://172.27.35.4:4040
16/02/03 13:59:17 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/02/03 13:59:17 INFO MemoryStore: MemoryStore cleared
16/02/03 13:59:17 INFO BlockManager: BlockManager stopped
16/02/03 13:59:17 INFO BlockManagerMaster: BlockManagerMaster stopped
16/02/03 13:59:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/02/03 13:59:17 INFO SparkContext: Successfully stopped SparkContext
16/02/03 13:59:17 INFO ShutdownHookManager: Shutdown hook called
16/02/03 13:59:17 INFO ShutdownHookManager: Deleting directory /private/var/folders/dw/1rt91pqx2dv298yfpf0ngz1w0000gn/T/spark-0aa0bdbc-cd39-48c1-b860-1643cb858d26
16/02/03 13:59:17 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.


