A first Spark on YARN program (giving the little elephant its wings)


First, let me borrow a diagram from Mingfeng of Taobao to illustrate the Spark on YARN architecture:

[Spark on YARN architecture diagram]
YARN/MRv2 is the next-generation MapReduce framework. The core idea of the redesign is to split the two main responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager globally manages the allocation of compute resources across all applications, while each application's ApplicationMaster handles the scheduling and coordination for that application. An application is either a single traditional MapReduce job or a DAG (directed acyclic graph) of jobs. The ResourceManager, together with the NodeManager on each machine, manages the user processes on that machine and organizes the computation.
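To make the ResourceManager's role a bit more concrete, here is a small sketch (not from the original post; the object name ClusterInfo is mine) written against the Hadoop 2.2 YarnClient API. It asks the RM for the same cluster and queue information that the Spark YARN client prints in the logs further down:

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object ClusterInfo {
  def main(args: Array[String]) {
    val client = YarnClient.createYarnClient()
    client.init(new YarnConfiguration())  // picks up yarn-site.xml from the configuration directory
    client.start()

    // the ResourceManager knows about every NodeManager in the cluster...
    val metrics = client.getYarnClusterMetrics
    println("number of NodeManagers: " + metrics.getNumNodeManagers)

    // ...and about the scheduler queues that applications are submitted to
    val queue = client.getQueueInfo("default")
    println("queueName: " + queue.getQueueName +
      ", queueCurrentCapacity: " + queue.getCurrentCapacity +
      ", queueMaxCapacity: " + queue.getMaximumCapacity)

    client.stop()
  }
}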



First, let's run one of the example programs shipped with Spark on YARN:

1. The classic Pi program (SparkPi):

# Use YARN_CONF_DIR or HADOOP_CONF_DIR to point at the directory that holds the YARN/Hadoop configuration files

export  YARN_CONF_DIR=/home/hadoop/hadoop/etc/hadoop

export SPARK_JAR=/home/hadoop/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar


spark-class org.apache.spark.deploy.yarn.Client --jar ../examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class  org.apache.spark.examples.SparkPi --args yarn-standalone 


Alternatively, you can put these commands into a shell script and run that.

At this point the terminal shows:

[hadoop@localhost bin]$ spark-class org.apache.spark.deploy.yarn.Client --jar ../examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --class  org.apache.spark.examples.SparkPi --args yarn-standalone
14/05/22 11:26:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/22 11:26:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/05/22 11:26:51 INFO yarn.Client: Got Cluster metric info from ApplicationsManager (ASM), number of NodeManagers: 1
14/05/22 11:26:51 INFO yarn.Client: Queue info ... queueName: default, queueCurrentCapacity: 0.0, queueMaxCapacity: 1.0,
      queueApplicationCount = 3, queueChildQueueCount = 0
14/05/22 11:26:51 INFO yarn.Client: Max mem capabililty of a single resource in this cluster 8192
14/05/22 11:26:51 INFO yarn.Client: Preparing Local resources
14/05/22 11:26:52 INFO yarn.Client: Uploading file:/home/hadoop/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar to hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1400726883363_0004/spark-examples_2.10-assembly-0.9.1.jar
14/05/22 11:26:56 INFO yarn.Client: Uploading file:/home/hadoop/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar to hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1400726883363_0004/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
14/05/22 11:27:00 INFO yarn.Client: Setting up the launch environment
14/05/22 11:27:01 INFO yarn.Client: Setting up container launch context
14/05/22 11:27:01 INFO yarn.Client: Command for starting the Spark ApplicationMaster: $JAVA_HOME/bin/java -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.examples.SparkPi --jar ../examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar --args  'yarn-standalone'  --worker-memory 1024 --worker-cores 1 --num-workers 2 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
14/05/22 11:27:01 INFO yarn.Client: Submitting application to ASM
14/05/22 11:27:01 INFO impl.YarnClientImpl: Submitted application application_1400726883363_0004 to ResourceManager at /0.0.0.0:8032
14/05/22 11:27:02 INFO yarn.Client: Application report from ASM:
     application identifier: application_1400726883363_0004
     appId: 4
     clientToAMToken: null
     appDiagnostics:
     appMasterHost: N/A
     appQueue: default


Meanwhile, at http://localhost:8088/cluster you can also see the Spark application submitted by user hadoop running.


After a short wait you will see:

14/05/22 11:28:12 INFO yarn.Client: Application report from ASM:
     application identifier: application_1400726883363_0004
     appId: 4
     clientToAMToken: null
     appDiagnostics:
     appMasterHost: localhost
     appQueue: default
     appMasterRpcPort: 0
     appStartTime: 1400729221502
     yarnAppState: FINISHED
     distributedFinalState: SUCCEEDED
     appTrackingUrl: localhost:8088/proxy/application_1400726883363_0004/A
     appUser: hadoop
distributedFinalState: SUCCEEDED means the program has finished; at this point the job status on the 8088 page also changes to SUCCEEDED.
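If you don't feel like re-running the client or refreshing the 8088 page, roughly the same report can be pulled programmatically. Here is a minimal sketch against the Hadoop 2.2 YarnClient API (the object name WaitForApp is mine; the application id is just the two parts of an id like application_1400726883363_0004, passed on the command line):

import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object WaitForApp {
  def main(args: Array[String]) {
    // e.g. 1400726883363 and 4, the two parts of application_1400726883363_0004
    val appId = ApplicationId.newInstance(args(0).toLong, args(1).toInt)
    val client = YarnClient.createYarnClient()
    client.init(new YarnConfiguration())
    client.start()

    // keep asking the ResourceManager for the report until the application reaches a terminal state
    var report = client.getApplicationReport(appId)
    while (report.getYarnApplicationState != YarnApplicationState.FINISHED &&
           report.getYarnApplicationState != YarnApplicationState.FAILED &&
           report.getYarnApplicationState != YarnApplicationState.KILLED) {
      Thread.sleep(2000)
      report = client.getApplicationReport(appId)
    }
    println("yarnAppState: " + report.getYarnApplicationState)
    println("distributedFinalState: " + report.getFinalApplicationStatus)
    client.stop()
  }
}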


Next, a WordCount program we write ourselves:

1. Environment setup:

Here I use Eclipse with the Scala IDE plugin integrated.

It can be installed via Help → Install New Software → ...

First, create a new project and choose the Scala Wizard.

2. Enter the project name, then click Finish.

Next, add the Spark on YARN assembly jar to the project's Java Build Path. It lives under the Spark installation directory, at /home/hadoop/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar; depending on your version, the version string in the jar name may differ (an sbt-based alternative is sketched a little further down).



 

Click OK, and you can move on to writing the program.
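If you prefer building outside Eclipse, an alternative (not used in the original setup) is to let sbt fetch the Spark dependency instead of adding the assembly jar to the build path by hand. A minimal build.sbt might look like this, assuming Spark 0.9.1 built against Hadoop 2.2.0:

// build.sbt -- dependencies are marked "provided" because the spark-assembly jar
// already ships them to the cluster at run time
name := "spark-wordcount-scala"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"     % "0.9.1" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.2.0" % "provided"
)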


Here is the program:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings in the implicit conversions needed for reduceByKey

object WordCount {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("usage is org.test.WordCount <master> <input> <output>")
      return
    }
    println(System.getenv("SPARK_HOME"))
    println(System.getenv("SPARK_TEST_JAR"))
    // master comes from the command line; Spark home and the application jar come from the environment
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
    val textFile = sc.textFile(args(1))                      // read the input from HDFS
    val result = textFile.flatMap(line => line.split(" "))   // split each line into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                    // sum the counts per word
    result.saveAsTextFile(args(2))                           // write the result back to HDFS
    sc.stop()
  }
}
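Before submitting to YARN it can be worth sanity-checking the same logic in local mode. A minimal sketch (the object name WordCountLocal and the local input path are just examples):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCountLocal {
  def main(args: Array[String]) {
    // "local" runs everything inside the current JVM, no YARN involved
    val sc = new SparkContext("local", "WordCountLocal")
    val counts = sc.textFile("file:///tmp/words.txt")        // example local input file
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}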

Then export it as a jar, place it in a directory of your choice, and write a YARN client shell script.

The script is as follows:

export  YARN_CONF_DIR=/home/hadoop/hadoop/etc/hadoop/
export SPARK_JAR=/home/hadoop/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
./spark-class org.apache.spark.deploy.yarn.Client --jar spark-wordcount-scala.jar --class WordCount --args yarn-standalone --args hdfs://localhost:9000/input --args hdfs://localhost:9000/sparkoutput --num-workers 1 --master-memory 2g --worker-memory 2g --worker-cores 2 



Now let the little elephant take off!


Test results:

[hadoop@localhost target]$ hadoop fs -cat /sparkoutput/part-00000
14/05/22 11:50:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(,2)
(caozw,1)
(hello,4)
(2.2.0,1)
(world,1)
(hitachi,2)
(bhh,1)

[hadoop@localhost target]$ hadoop fs -cat /sparkoutput/part-00001
14/05/22 11:50:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(hadoop,1)
(china,1)
(develop,1)





