First, borrowing a diagram from Mingfeng of Taobao to illustrate the Spark on YARN architecture:
YARN/MRv2 is the next-generation MapReduce framework. The core idea of the redesign is to split the JobTracker's two main responsibilities, resource management and job scheduling/monitoring, into separate components. A new global ResourceManager manages the allocation of compute resources across all applications, while a per-application ApplicationMaster handles the scheduling and coordination of its own application. An application is either a single traditional MapReduce job or a DAG (directed acyclic graph) of jobs. The ResourceManager, together with the NodeManager on each machine, manages the user's processes on that machine and organizes the computation.
First, run one of the example programs shipped with Spark on YARN:
1. The classic Pi program:
# Point YARN_CONF_DIR or HADOOP_CONF_DIR at the directory holding the YARN or Hadoop configuration files
export YARN_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
export SPARK_JAR=/home/hadoop/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
spark-class org.apache.spark.deploy.yarn.Client --jar ../examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone
Alternatively, you can put these commands in a shell script and run that.
The terminal then shows:
[hadoop@localhost bin]$ spark-class org.apache.spark.deploy.yarn.Client --jar ../examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone
14/05/22 11:26:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/22 11:26:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/05/22 11:26:51 INFO yarn.Client: Got Cluster metric info from ApplicationsManager (ASM), number of NodeManagers: 1
14/05/22 11:26:51 INFO yarn.Client: Queue info ... queueName: default, queueCurrentCapacity: 0.0, queueMaxCapacity: 1.0,
queueApplicationCount = 3, queueChildQueueCount = 0
14/05/22 11:26:51 INFO yarn.Client: Max mem capabililty of a single resource in this cluster 8192
14/05/22 11:26:51 INFO yarn.Client: Preparing Local resources
14/05/22 11:26:52 INFO yarn.Client: Uploading file:/home/hadoop/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar to hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1400726883363_0004/spark-examples_2.10-assembly-0.9.1.jar
14/05/22 11:26:56 INFO yarn.Client: Uploading file:/home/hadoop/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar to hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1400726883363_0004/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
14/05/22 11:27:00 INFO yarn.Client: Setting up the launch environment
14/05/22 11:27:01 INFO yarn.Client: Setting up container launch context
14/05/22 11:27:01 INFO yarn.Client: Command for starting the Spark ApplicationMaster: $JAVA_HOME/bin/java -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.examples.SparkPi --jar ../examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --args 'yarn-standalone' --worker-memory 1024 --worker-cores 1 --num-workers 2 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
14/05/22 11:27:01 INFO yarn.Client: Submitting application to ASM
14/05/22 11:27:01 INFO impl.YarnClientImpl: Submitted application application_1400726883363_0004 to ResourceManager at /0.0.0.0:8032
14/05/22 11:27:02 INFO yarn.Client: Application report from ASM:
application identifier: application_1400726883363_0004
appId: 4
clientToAMToken: null
appDiagnostics:
appMasterHost: N/A
appQueue: default
Meanwhile, at http://localhost:8088/cluster you can see the application submitted by user hadoop (name=Spark) running.
After a short wait you will see:
14/05/22 11:28:12 INFO yarn.Client: Application report from ASM:
application identifier: application_1400726883363_0004
appId: 4
clientToAMToken: null
appDiagnostics:
appMasterHost: localhost
appQueue: default
appMasterRpcPort: 0
appStartTime: 1400729221502
yarnAppState: FINISHED
distributedFinalState: SUCCEEDED
appTrackingUrl: localhost:8088/proxy/application_1400726883363_0004/A
appUser: hadoop
distributedFinalState: SUCCEEDED means the program has finished; at this point the job status on the 8088 page also changes to SUCCEEDED.
Next, a WordCount program written by hand:
1. Environment setup:
I use Eclipse with the Scala IDE for Spark integrated.
It can be installed via Help ---> Install New Software --->
First create a new project and choose the Scala wizard.
2. Enter the project name, then click Finish.
Next, add the Spark on YARN assembly jar to the project's Java Build Path. The jar sits under the Spark installation directory, at /home/hadoop/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar; depending on your Spark and Hadoop versions, the version numbers in the jar name may differ.
Click OK, and you can move on to writing the program.
The program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit conversions that add reduceByKey to pair RDDs

object WordCount {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("usage is org.test.WordCount <master> <input> <output>")
      return
    }
    println(System.getenv("SPARK_HOME"))
    println(System.getenv("SPARK_TEST_JAR"))
    // master URL, application name, Spark home, and the jar(s) to ship to the cluster
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
    val textFile = sc.textFile(args(1))
    val result = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    result.saveAsTextFile(args(2))
    sc.stop()
  }
}
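To see what the pipeline does without a cluster, the same flatMap / map / reduce-by-key shape can be sketched on plain Scala collections (no Spark dependency; countWords is just an illustrative name, not part of any API):

```scala
// Local word count: mirrors the RDD pipeline above on an in-memory Seq.
def countWords(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.split(" "))   // split every line into words, flattening the result
    .map(word => (word, 1))  // pair each word with a count of 1
    .groupBy(_._1)           // gather all (word, 1) pairs for the same word
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // total the 1s per word

println(countWords(Seq("hello world", "hello spark")))
```

The groupBy-then-sum step plays the role of reduceByKey(_ + _); on a real RDD, reduceByKey additionally combines counts inside each partition before shuffling.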
Then export it as a jar, put it in the directory of your choice, and write a YARN client shell script.
The script:
export YARN_CONF_DIR=/home/hadoop/hadoop/etc/hadoop/
export SPARK_JAR=/home/hadoop/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
./spark-class org.apache.spark.deploy.yarn.Client --jar spark-wordcount-scala.jar --class WordCount --args yarn-standalone --args hdfs://localhost:9000/input --args hdfs://localhost:9000/sparkoutput --num-workers 1 --master-memory 2g --worker-memory 2g --worker-cores 2
Then let the little elephant take off!
Test results:
[hadoop@localhost target]$ hadoop fs -cat /sparkoutput/part-00000
14/05/22 11:50:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(,2)
(caozw,1)
(hello,4)
(2.2.0,1)
(world,1)
(hitachi,2)
(bhh,1)
[hadoop@localhost target]$ hadoop fs -cat /sparkoutput/part-00001
14/05/22 11:50:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(hadoop,1)
(china,1)
(develop,1)
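As a quick sanity check, the (word,count) lines printed above can be parsed back into a map and compared against the input. parseCounts below is a hypothetical helper written for this post, not part of Spark or Hadoop; it assumes each line has exactly the (word,count) shape that saveAsTextFile gives a pair RDD:

```scala
// Parse output lines of the form "(word,count)" back into a Map.
def parseCounts(lines: Seq[String]): Map[String, Int] =
  lines.flatMap { line =>
    // Strip the surrounding parentheses, then split on the first comma only,
    // so an empty word such as in "(,2)" still parses.
    line.stripPrefix("(").stripSuffix(")").split(",", 2) match {
      case Array(word, count) => Some(word -> count.trim.toInt)
      case _                  => None
    }
  }.toMap

println(parseCounts(Seq("(hello,4)", "(world,1)", "(,2)")))
```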