Starting Spark jobs directly via YARN REST API

Bernhard Walter · created 2016-04-18 17:20 · edited 2016-04-18 17:48
Short Description:
This article describes how to submit spark jobs directly via the YARN REST API. This allows submitting jobs from a workstation or via Knox.
Article

There are situations when one might want to submit a Spark job via a REST API:

  • If you want to submit Spark jobs from your IDE on your workstation outside the cluster
  • If the cluster can only be accessed via Knox (perimeter security)

One possibility is to use the Oozie REST API and the Oozie Spark action.

However, this article looks into the option of using the YARN REST API directly. Starting with the Cluster Applications API I tried to come up with an approach that resembles the spark-submit command.

1) Copy Spark assembly jar to HDFS

By default the Spark assembly jar file is not available in HDFS. For remote submission via the REST API we need it there.

Some standard locations in HDP are:

  • HDP 2.3.2:
    • Version: 2.3.2.0-2950
    • Spark Jar: /usr/hdp/2.3.2.0-2950/spark/lib/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
  • HDP 2.4.0:
    • Version: 2.4.0.0-169
    • Spark Jar: /usr/hdp/2.4.0.0-169/spark/lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar
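If you run a different HDP release, the assembly jar can usually be located via the standard HDP layout. A small sketch, assuming `hdp-select` is installed and the `spark-client` component is selected:

    # list the installed HDP stack version(s)
    hdp-select versions
    # the assembly jar of the currently selected Spark client
    ls /usr/hdp/current/spark-client/lib/spark-assembly-*.jar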

This is a one-time preparation step; for HDP 2.4, for example, it would be:

    
    
    sudo su - hdfs
    HDP_VERSION=2.4.0.0-169
    SPARK_JAR=spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar
    hdfs dfs -mkdir "/hdp/apps/${HDP_VERSION}/spark/"
    hdfs dfs -put "/usr/hdp/${HDP_VERSION}/spark/lib/$SPARK_JAR" "/hdp/apps/${HDP_VERSION}/spark/spark-hdp-assembly.jar"

2) Upload your spark application jar file to HDFS

Upload your Spark application jar file packaged by sbt to a project folder in HDFS via WebHDFS (maybe use something better than "/tmp"):

    
    
    export APP_FILE=simple-project_2.10-1.0.jar
    curl -X PUT "${WEBHDFS_HOST}:50070/webhdfs/v1/tmp/simple-project?op=MKDIRS"
    curl -i -X PUT "${WEBHDFS_HOST}:50070/webhdfs/v1/tmp/simple-project/${APP_FILE}?op=CREATE&overwrite=true"
    # take the Location header from the response and issue a PUT request
    LOCATION="http://..."
    curl -i -X PUT -T "target/scala-2.10/${APP_FILE}" "${LOCATION}"

3) Create a Spark properties file and upload it to HDFS

    
    
    spark.yarn.submit.file.replication=3
    spark.yarn.executor.memoryOverhead=384
    spark.yarn.driver.memoryOverhead=384
    spark.master=yarn
    spark.submit.deployMode=cluster
    spark.eventLog.enabled=true
    spark.yarn.scheduler.heartbeat.interval-ms=5000
    spark.yarn.preserve.staging.files=true
    spark.yarn.queue=default
    spark.yarn.containerLauncherMaxThreads=25
    spark.yarn.max.executor.failures=3
    spark.executor.instances=2
    spark.eventLog.dir=hdfs\:///spark-history
    spark.history.kerberos.enabled=true
    spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
    spark.history.ui.port=18080
    spark.history.fs.logDirectory=hdfs\:///spark-history
    spark.executor.memory=2G
    spark.executor.cores=2
    spark.history.kerberos.keytab=none
    spark.history.kerberos.principal=none

Then upload it via WebHDFS as spark-yarn.properties to your simple-project folder, using the same two-step pattern as before.
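For example (assuming the same host and target folder as in step 2):

    curl -i -X PUT "${WEBHDFS_HOST}:50070/webhdfs/v1/tmp/simple-project/spark-yarn.properties?op=CREATE&overwrite=true"
    # take the Location header from the response and issue a PUT request
    LOCATION="http://..."
    curl -i -X PUT -T "spark-yarn.properties" "${LOCATION}"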

4) Create a Spark Job json file

a) We need to construct the command to start the Spark ApplicationMaster:
    
    
    java -server -Xmx1024m -Dhdp.version=2.4.0.0-169 \
         -Dspark.yarn.app.container.log.dir=/hadoop/yarn/log/rest-api \
         -Dspark.app.name=SimpleProject \
         org.apache.spark.deploy.yarn.ApplicationMaster \
         --class IrisApp --jar __app__.jar \
         --arg '--class' --arg 'SimpleProject' \
         1><LOG_DIR>/AppMaster.stdout 2><LOG_DIR>/AppMaster.stderr

It is important to provide the Spark application name and the HDP version. The <LOG_DIR> placeholder will be resolved at runtime.

b) We need to set some general environment variables:
    
    
    JAVA_HOME="/usr/jdk64/jdk1.8.0_60/"
    SPARK_YARN_MODE=true
    HDP_VERSION="2.4.0.0-169"

Then we need to tell Spark which files to distribute across all Spark executors. For this we need to set four variables. One variable has the format "<hdfs path 1>#<cache name 1>,<hdfs path 2>#<cache name 2>,...", and the other three contain comma-separated timestamps, file sizes and visibilities of the files (in the same order):

    
    
    SPARK_YARN_CACHE_FILES: "hdfs://<<name-node>>:8020/tmp/simple-project/simple-project.jar#__app__.jar,hdfs://<<name-node>>:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar#__spark__.jar"
    SPARK_YARN_CACHE_FILES_FILE_SIZES: "10588,191724610"
    SPARK_YARN_CACHE_FILES_TIME_STAMPS: "1460990579987,1460219553714"
    SPARK_YARN_CACHE_FILES_VISIBILITIES: "PUBLIC,PRIVATE"

Replace <<name-node>> with the correct address. File size and timestamp can be retrieved from HDFS via WebHDFS.
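For example, a GETFILESTATUS call returns both values (path as uploaded in step 2; the response below is trimmed to the relevant fields):

    curl -s "${WEBHDFS_HOST}:50070/webhdfs/v1/tmp/simple-project/${APP_FILE}?op=GETFILESTATUS"
    # {
    #   "FileStatus": {
    #     ...
    #     "length": 10588,
    #     "modificationTime": 1460990579987,
    #     ...
    #   }
    # }

Use `length` as the file size and `modificationTime` as the timestamp.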

Next, construct the classpath:

    
    
    CLASSPATH="{{PWD}}<CPS>__spark__.jar<CPS>{{PWD}}/__app__.jar<CPS>{{PWD}}/__app__.properties<CPS>{{HADOOP_CONF_DIR}}<CPS>/usr/hdp/current/hadoop-client/*<CPS>/usr/hdp/current/hadoop-client/lib/*<CPS>/usr/hdp/current/hadoop-hdfs-client/*<CPS>/usr/hdp/current/hadoop-hdfs-client/lib/*<CPS>/usr/hdp/current/hadoop-yarn-client/*<CPS>/usr/hdp/current/hadoop-yarn-client/lib/*<CPS>{{PWD}}/mr-framework/hadoop/share/hadoop/common/*<CPS>{{PWD}}/mr-framework/hadoop/share/hadoop/common/lib/*<CPS>{{PWD}}/mr-framework/hadoop/share/hadoop/yarn/*<CPS>{{PWD}}/mr-framework/hadoop/share/hadoop/yarn/lib/*<CPS>{{PWD}}/mr-framework/hadoop/share/hadoop/hdfs/*<CPS>{{PWD}}/mr-framework/hadoop/share/hadoop/hdfs/lib/*<CPS>{{PWD}}/mr-framework/hadoop/share/hadoop/tools/lib/*<CPS>/usr/hdp/2.4.0.0-169/hadoop/lib/hadoop-lzo-0.6.0.2.4.0.0-169.jar<CPS>/etc/hadoop/conf/secure<CPS>"

Notes:

  • __spark__.jar and __app__.jar are the same as provided in SPARK_YARN_CACHE_FILES
  • Spark will resolve <CPS> to `:`

c) Create the Spark job json file

The information above will be added to the Spark job json file as the command and environment attributes (for details see the attachment; remove the .txt ending).

The last missing piece is the so-called local_resources attribute, which describes all files in HDFS that are necessary for the Spark job:

  • the Spark assembly jar (as in the caching environment variables)
  • the Spark application jar for this project (as in the caching environment variables)
  • the Spark properties file for this project (only for the Application Master, no caching necessary)

All three need to be given in the following form:

    
    
    {
      "key": "__app__.jar",
      "value": {
        "resource": "hdfs://<<name-node>>:8020/tmp/simple-project/simple-project.jar",
        "size": 10588,
        "timestamp": 1460990579987,
        "type": "FILE",
        "visibility": "APPLICATION"
      }
    },

Again, replace <<name-node>>. Timestamp, HDFS path, size and key need to be the same as in the caching environment variables.

Save it as spark-yarn.json (for details see the attachment; remove the .txt ending).
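Since the attachment is not reproduced here, the following skeleton sketches the overall structure expected by the YARN Cluster Applications API. The <<...>> placeholders and abbreviated values are assumptions to be replaced with the exact command, environment entries and local resources constructed above; the AM resource values (memory/vCores) are illustrative, and the full attribute list is documented in the Cluster Applications API:

    {
      "application-id": "application_1460195242962_0054",
      "application-name": "SimpleProject",
      "application-type": "YARN",
      "queue": "default",
      "resource": { "memory": 1024, "vCores": 1 },
      "am-container-spec": {
        "local-resources": {
          "entry": [
            {
              "key": "__spark__.jar",
              "value": {
                "resource": "hdfs://<<name-node>>:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar",
                "size": <<size>>, "timestamp": <<timestamp>>, "type": "FILE", "visibility": "PUBLIC"
              }
            },
            {
              "key": "__app__.jar",
              "value": {
                "resource": "hdfs://<<name-node>>:8020/tmp/simple-project/simple-project.jar",
                "size": 10588, "timestamp": 1460990579987, "type": "FILE", "visibility": "APPLICATION"
              }
            },
            {
              "key": "__app__.properties",
              "value": {
                "resource": "hdfs://<<name-node>>:8020/tmp/simple-project/spark-yarn.properties",
                "size": <<size>>, "timestamp": <<timestamp>>, "type": "FILE", "visibility": "APPLICATION"
              }
            }
          ]
        },
        "environment": {
          "entry": [
            { "key": "JAVA_HOME", "value": "/usr/jdk64/jdk1.8.0_60/" },
            { "key": "SPARK_YARN_MODE", "value": "true" },
            { "key": "HDP_VERSION", "value": "2.4.0.0-169" },
            { "key": "SPARK_YARN_CACHE_FILES", "value": "<<as constructed above>>" },
            { "key": "SPARK_YARN_CACHE_FILES_FILE_SIZES", "value": "<<as constructed above>>" },
            { "key": "SPARK_YARN_CACHE_FILES_TIME_STAMPS", "value": "<<as constructed above>>" },
            { "key": "SPARK_YARN_CACHE_FILES_VISIBILITIES", "value": "<<as constructed above>>" },
            { "key": "CLASSPATH", "value": "<<as constructed above>>" }
          ]
        },
        "commands": {
          "command": "java -server -Xmx1024m ... org.apache.spark.deploy.yarn.ApplicationMaster ... 1><LOG_DIR>/AppMaster.stdout 2><LOG_DIR>/AppMaster.stderr"
        }
      }
    }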

5) Submit the job

First, request an application ID from YARN:

    
    
    curl -s -X POST -d '' \
         https://$KNOX_SERVER:8443/gateway/default/resourcemanager/v1/cluster/apps/new-application
    # {
    #   "application-id": "application_1460195242962_0054",
    #   "maximum-resource-capability": {
    #     "memory": 8192,
    #     "vCores": 3
    #   }
    # }

Edit the "application-id" in spark-yarn.json and then submit the job:

    
    
    curl -s -i -X POST -H "Content-Type: application/json" ${HADOOP_RM}/ws/v1/cluster/apps \
         --data-binary @spark-yarn.json
    # HTTP/1.1 100 Continue
    #
    # HTTP/1.1 202 Accepted
    # Cache-Control: no-cache
    # Expires: Sun, 10 Apr 2016 13:02:47 GMT
    # Date: Sun, 10 Apr 2016 13:02:47 GMT
    # Pragma: no-cache
    # Expires: Sun, 10 Apr 2016 13:02:47 GMT
    # Date: Sun, 10 Apr 2016 13:02:47 GMT
    # Pragma: no-cache
    # Content-Type: application/json
    # Location: http://<<resource-manager>>:8088/ws/v1/cluster/apps/application_1460195242962_0054
    # Content-Length: 0
    # Server: Jetty(6.1.26.hwx)

6) Track the job

    
    
    curl -s "http://<<resource-manager>>:8088/ws/v1/cluster/apps/application_1460195242962_0054"
    # {
    #   "app": {
    #     "id": "application_1460195242962_0054",
    #     "user": "dr.who",
    #     "name": "IrisApp",
    #     "queue": "default",
    #     "state": "FINISHED",
    #     "finalStatus": "SUCCEEDED",
    #     "progress": 100,
    #     "trackingUI": "History",
    #     "trackingUrl": "http://<<resource-manager>>:8088/proxy/application_1460195242962_0054/",
    #     "diagnostics": "",
    #     "clusterId": 1460195242962,
    #     "applicationType": "YARN",
    #     "applicationTags": "",
    #     "startedTime": 1460293367576,
    #     "finishedTime": 1460293413568,
    #     "elapsedTime": 45992,
    #     "amContainerLogs": "http://<<node-manager>>:8042/node/containerlogs/container_e29_1460195242962_0054_01_000001/dr.who",
    #     "amHostHttpAddress": "<<node-manager>>:8042",
    #     "allocatedMB": -1,
    #     "allocatedVCores": -1,
    #     "runningContainers": -1,
    #     "memorySeconds": 172346,
    #     "vcoreSeconds": 112,
    #     "queueUsagePercentage": 0,
    #     "clusterUsagePercentage": 0,
    #     "preemptedResourceMB": 0,
    #     "preemptedResourceVCores": 0,
    #     "numNonAMContainerPreempted": 0,
    #     "numAMContainerPreempted": 0,
    #     "logAggregationStatus": "SUCCEEDED"
    #   }
    # }
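If a script should wait for completion, a minimal polling sketch over the `state` field shown above could look like this (assuming `python` is available on the workstation for JSON parsing; ResourceManager address and application ID as above):

    APP_ID=application_1460195242962_0054
    RM_URL="http://<<resource-manager>>:8088/ws/v1/cluster/apps/${APP_ID}"
    while true; do
      # extract the "state" field from the JSON response
      STATE=$(curl -s "${RM_URL}" | python -c 'import sys, json; print(json.load(sys.stdin)["app"]["state"])')
      echo "state: ${STATE}"
      case "${STATE}" in
        FINISHED|FAILED|KILLED) break ;;
      esac
      sleep 10
    done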

7) Using Knox (without Kerberos)

The whole process works with Knox; just replace the WebHDFS and ResourceManager URLs with their Knox substitutes:

a) Resource Manager:

http://<<resource-manager>>:8088/ws/v1 ==> https://<<knox-gateway>>:8443/gateway/default/resourcemanager/v1

b) WebHDFS host:

http://<<webhdfs-host>>:50070/webhdfs/v1 ==> https://<<knox-gateway>>:8443/gateway/default/webhdfs/v1

Additionally, you need to provide Knox credentials (e.g. Basic Authentication <<user>>:<<password>>).
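For example, the new-application request from step 5 becomes (user and password are placeholders; add `-k` only if the Knox gateway uses a self-signed certificate):

    curl -s -k -u "<<user>>:<<password>>" -X POST -d '' \
         "https://<<knox-gateway>>:8443/gateway/default/resourcemanager/v1/cluster/apps/new-application"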

8) More details

More details and a Python script to ease the whole process can be found in the Spark-Yarn-REST-API repo.

Any comments on how to make this process easier are highly appreciated ...
