1. Overview
I previously wrote up a similar topic in "Spark execution optimization: uploading dependencies to HDFS (using spark.yarn.jar and spark.yarn.archive)". The directory configured through spark.yarn.jars should really contain only the jars from Spark's own jars directory: mixing in other jars is very likely to cause conflicts, and when multiple projects pull in different versions of the same dependencies, a shared directory quickly becomes unmanageable. I have a Spark analytics project with many dependencies, and with only spark.yarn.jars configured, uploading the application's jars at submit time was still slow, so the project's dependency jars also needed to go to HDFS. After digging through material and the official docs, I found that both application-jar and --jars accept HDFS paths as well as local ones.
There is also a third parameter that serves the same purpose at submit time: spark.yarn.dist.jars.
1.1 application-jar / --jars
The official documentation describes application-jar and --jars as follows: for --jars, the file, hdfs, http, https, ftp, and local schemes can all be used; multiple jars must be separated by commas, and directory expansion does not apply. In other words, a form like --jars hdfs:///spark-yarn/dbp-jars/*.jar is not supported.
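In other words, every jar has to be listed explicitly; the paths below are illustrative:

```
# Works: explicit comma-separated list
--jars hdfs:///spark-yarn/dbp-jars/a.jar,hdfs:///spark-yarn/dbp-jars/b.jar

# Does not work: directory glob
--jars hdfs:///spark-yarn/dbp-jars/*.jar
```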
1.2 spark.yarn.dist.jars
The official description says only that the jars are comma-separated and placed in the working directory of each executor, which reads as if it refers to local paths. In practice, however, I found that jars on HDFS work too, and I have also seen an article setting this property directly in code.
2. Usage
Both --jars and spark.yarn.dist.jars require the jars to be comma-separated, so some preprocessing is needed first. There are too many jars on HDFS to list by hand, so I use the shell snippet below; all of the project's external jars have been uploaded to /spark-yarn/xxx-jars/ on HDFS.
# List every jar under /spark-yarn/xxx-jars/, one path per line
hdfs_jars=$(hadoop fs -ls -C /spark-yarn/xxx-jars/)

# Prefix each path with hdfs:// and join with commas
for f in ${hdfs_jars}; do
  app_CLASSPATH="hdfs://${f},${app_CLASSPATH}"
done

# Dependency jars: drop the trailing comma
JAR_PATH=${app_CLASSPATH%,}
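As a side note, the same list can be built in a single pipeline with standard sed and paste; this is just a sketch, and list_jars is a hypothetical helper name:

```shell
# Build a comma-separated --jars list from newline-separated HDFS paths.
# list_jars (hypothetical helper) prefixes each path with hdfs:// and
# joins all lines with commas, with no trailing comma.
list_jars() {
  sed 's|^|hdfs://|' | paste -sd, -
}

# In real use, the input would come from HDFS:
#   JAR_PATH=$(hadoop fs -ls -C /spark-yarn/xxx-jars/ | list_jars)
printf '%s\n' /spark-yarn/xxx-jars/a.jar /spark-yarn/xxx-jars/b.jar | list_jars
# -> hdfs:///spark-yarn/xxx-jars/a.jar,hdfs:///spark-yarn/xxx-jars/b.jar
```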
Now the submit command can reference JAR_PATH directly, for example:
spark-submit \
--class org.apache.spark.xxxx.ExecutorMethod \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
--jars $JAR_PATH \
/path/to/examples.jar \
param..
Or:
spark-submit \
--class org.apache.spark.xxxx.ExecutorMethod \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
--conf spark.yarn.dist.jars=$JAR_PATH \
/path/to/examples.jar \
param..
With the jars on HDFS, the submit log shows two kinds of messages. The "Source and destination file systems are the same. Not copying" lines are the effect of the spark.yarn.jars setting in spark-defaults.conf; the "Same name resource" lines are the effect of the HDFS jars passed via --conf spark.yarn.dist.jars (or --jars):
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/antlr4-runtime-4.7.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/aopalliance-1.0.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/aopalliance-repackaged-2.4.0-b34.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/apache-log4j-extras-1.2.17.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/arpack_combined_all-0.1.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/arrow-format-0.8.0.3.1.0.0-78.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/arrow-memory-0.8.0.3.1.0.0-78.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/arrow-vector-0.8.0.3.1.0.0-78.jar
...
...
21/08/26 13:28:14 WARN Client: Same name resource hdfs://xxx/spark-yarn/xxx-jars/commons-crypto-1.0.0.jar added multiple times to distributed cache
21/08/26 13:28:14 WARN Client: Same name resource hdfs://xxx/spark-yarn/xxx-jars/commons-configuration2-2.1.1.jar added multiple times to distributed cache
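For reference, the spark.yarn.jars setting that produces the "Not copying" lines usually lives in spark-defaults.conf and should point only at Spark's own runtime jars; the path below is illustrative:

```
# spark-defaults.conf (illustrative path to Spark's own jars on HDFS)
spark.yarn.jars    hdfs:///spark-yarn/jars/*.jar
```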
Note:
If spark.yarn.jars is pointed at the application jars directly, e.g.
--conf spark.yarn.jars="hdfs:///spark-yarn/xxx-jars/*.jar"
the job fails immediately with: Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster. This is expected: spark.yarn.jars is meant to list Spark's own runtime jars, so overriding it means the ApplicationMaster class is never shipped to YARN.
Both submit commands above run fine, but pulling HDFS jars in via --jars additionally prints the warnings below at submit time, so I personally recommend spark.yarn.dist.jars:
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-lang-2.6.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-io-1.3.2.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-httpclient-3.1.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-csv-1.4.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-crypto-1.0.0.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-configuration2-2.1.1.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-configuration-1.10.jar.
...
...
Some articles claim that a jar stored on HDFS cannot be executed when the job is submitted in local mode, so I ran some tests of my own on an Ambari 2.7.3 cluster. After uploading the application jar (test-program.jar) to HDFS, the submit log shows the application jar being fetched to the local machine:
21/08/26 15:19:58 INFO SparkContext: Added JAR hdfs://xxx/spark-yarn/xxx-jars/RoaringBitmap-0.5.11.jar at hdfs://xxx/spark-yarn/xxx-jars/RoaringBitmap-0.5.11.jar with timestamp 1629962398294
21/08/26 15:19:58 INFO SparkContext: Added JAR hdfs://xxx/spark-yarn/xxx-jars/HdrHistogram-2.1.9.jar at hdfs://xxx/spark-yarn/xxx-jars/HdrHistogram-2.1.9.jar with timestamp 1629962398294
21/08/26 15:19:58 INFO SparkContext: Added JAR hdfs:///spark-yarn/programs/test-program.jar at hdfs:///spark-yarn/programs/test-program.jar with timestamp 1629962398294
...
21/08/26 15:20:17 INFO Utils: Fetching hdfs://xxx/spark-yarn/xxx-jars/chill-java-0.9.3.jar to /tmp/spark-ee38cb1d-da49-4c38-bb6c-eb6288ca056a/userFiles-fc789632-ac6c-4347-a279-7971a0f38098/fetchFileTemp2882690771436990167.tmp
21/08/26 15:20:17 INFO Executor: Adding file:/tmp/spark-ee38cb1d-da49-4c38-bb6c-eb6288ca056a/userFiles-fc789632-ac6c-4347-a279-7971a0f38098/chill-java-0.9.3.jar to class loader
21/08/26 15:20:17 INFO Executor: Fetching hdfs:///spark-yarn/programs/test-program.jar with timestamp 1629962398294
21/08/26 15:20:17 INFO Utils: Fetching hdfs:/spark-yarn/programs/test-program.jar to /tmp/spark-ee38cb1d-da49-4c38-bb6c-eb6288ca056a/userFiles-fc789632-ac6c-4347-a279-7971a0f38098/fetchFileTemp6784106893527808599.tmp
21/08/26 15:20:17 INFO Executor: Adding file:/tmp/spark-ee38cb1d-da49-4c38-bb6c-eb6288ca056a/userFiles-fc789632-ac6c-4347-a279-7971a0f38098/test-program.jar to class loader
...
...
Conclusion
In summary, and now verified: when submitting Spark jobs on YARN, all jar dependencies can be uploaded to HDFS, including the application jar itself, regardless of whether the job is submitted in local or yarn mode.
References
spark-submit cannot execute a jar on HDFS
Spark - using yarn client mode
Spark official docs: Submitting Applications