1. Overview
I previously wrote up a similar topic in "Spark execution optimization: uploading dependencies to HDFS (using spark.yarn.jar and spark.yarn.archive)". The directory configured through spark.yarn.jars should really contain only the jars from Spark's own jars directory: mixing in other jars is very likely to cause conflicts, and when multiple projects pull in different versions of the same dependencies, a shared directory quickly becomes unmanageable. I have a Spark analytics project with many dependencies, and with only spark.yarn.jars configured, uploading the application's jars at submit time was still slow, so the project's dependency jars also needed to go to HDFS. After digging through material and the official docs, I found that both application-jar and --jars accept HDFS paths as well as local ones.
There is also a third parameter that serves the same purpose at submit time: spark.yarn.dist.jars.
1.1 application-jar / --jars
The official documentation describes application-jar and --jars as follows: for --jars, the file, hdfs, http, https, ftp, and local schemes can all be used; multiple jars must be separated by commas, and directory expansion does not apply. In other words, a form like --jars hdfs:///spark-yarn/dbp-jars/*.jar is not supported.
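In other words, every jar has to be listed explicitly; the paths below are illustrative:

```
# Works: explicit comma-separated list
--jars hdfs:///spark-yarn/dbp-jars/a.jar,hdfs:///spark-yarn/dbp-jars/b.jar

# Does not work: directory glob
--jars hdfs:///spark-yarn/dbp-jars/*.jar
```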
1.2 spark.yarn.dist.jars
The official description says only that the jars are comma-separated and placed in the working directory of each executor, which reads as if it refers to local paths. In practice, however, I found that jars on HDFS work too, and I have also seen an article setting this property directly in code.
2. Usage
Both --jars and spark.yarn.dist.jars require the jars to be comma-separated, so some preprocessing is needed first. There are too many jars on HDFS to list by hand, so I use the shell snippet below; all of the project's external jars have been uploaded to /spark-yarn/xxx-jars/ on HDFS.
# List every jar under /spark-yarn/xxx-jars/, one path per line
hdfs_jars=$(hadoop fs -ls -C /spark-yarn/xxx-jars/)

# Prefix each path with hdfs:// and join with commas
for f in ${hdfs_jars}; do
  app_CLASSPATH="hdfs://${f},${app_CLASSPATH}"
done

# Dependency jars: drop the trailing comma
JAR_PATH=${app_CLASSPATH%,}
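As a side note, the same list can be built in a single pipeline with standard sed and paste; this is just a sketch, and list_jars is a hypothetical helper name:

```shell
# Build a comma-separated --jars list from newline-separated HDFS paths.
# list_jars (hypothetical helper) prefixes each path with hdfs:// and
# joins all lines with commas, with no trailing comma.
list_jars() {
  sed 's|^|hdfs://|' | paste -sd, -
}

# In real use, the input would come from HDFS:
#   JAR_PATH=$(hadoop fs -ls -C /spark-yarn/xxx-jars/ | list_jars)
printf '%s\n' /spark-yarn/xxx-jars/a.jar /spark-yarn/xxx-jars/b.jar | list_jars
# -> hdfs:///spark-yarn/xxx-jars/a.jar,hdfs:///spark-yarn/xxx-jars/b.jar
```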
Now the submit command can reference JAR_PATH directly, for example:
spark-submit \
--class org.apache.spark.xxxx.ExecutorMethod \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
--jars $JAR_PATH \
/path/to/examples.jar \
param..
Or:
spark-submit \
--class org.apache.spark.xxxx.ExecutorMethod \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
--conf spark.yarn.dist.jars=$JAR_PATH \
/path/to/examples.jar \
param..
With the jars on HDFS, the submit log shows two kinds of messages. The "Source and destination file systems are the same. Not copying" lines are the effect of the spark.yarn.jars setting in spark-defaults.conf; the "Same name resource" lines are the effect of the HDFS jars passed via --conf spark.yarn.dist.jars (or --jars):
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/antlr4-runtime-4.7.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/aopalliance-1.0.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/aopalliance-repackaged-2.4.0-b34.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/apache-log4j-extras-1.2.17.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/arpack_combined_all-0.1.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/arrow-format-0.8.0.3.1.0.0-78.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/arrow-memory-0.8.0.3.1.0.0-78.jar
21/08/26 13:28:12 INFO Client: Source and destination file systems are the same. Not copying hdfs://xxx/spark-yarn/jars/arrow-vector-0.8.0.3.1.0.0-78.jar
...
...
21/08/26 13:28:14 WARN Client: Same name resource hdfs://xxx/spark-yarn/xxx-jars/commons-crypto-1.0.0.jar added multiple times to distributed cache
21/08/26 13:28:14 WARN Client: Same name resource hdfs://xxx/spark-yarn/xxx-jars/commons-configuration2-2.1.1.jar added multiple times to distributed cache
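For reference, the spark.yarn.jars setting that produces the "Not copying" lines usually lives in spark-defaults.conf and should point only at Spark's own runtime jars; the path below is illustrative:

```
# spark-defaults.conf (illustrative path to Spark's own jars on HDFS)
spark.yarn.jars    hdfs:///spark-yarn/jars/*.jar
```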
Note:
If spark.yarn.jars is pointed at the application jars directly, e.g.
--conf spark.yarn.jars="hdfs:///spark-yarn/xxx-jars/*.jar"
the job fails immediately with: Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster. This is expected: spark.yarn.jars is meant to list Spark's own runtime jars, so overriding it means the ApplicationMaster class is never shipped to YARN.
Both submit commands above run fine, but pulling HDFS jars in via --jars additionally prints the warnings below at submit time, so I personally recommend spark.yarn.dist.jars:
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-lang-2.6.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-io-1.3.2.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-httpclient-3.1.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-csv-1.4.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-crypto-1.0.0.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-configuration2-2.1.1.jar.
Warning: Skip remote jar hdfs://xxxxx/spark-yarn/xxx-jars/commons-configuration-1.10.jar.
...
...
Some articles claim that a jar stored on HDFS cannot be executed when the job is submitted in local mode, so I ran some tests of my own on an Ambari 2.7.3 cluster. After uploading the application jar (test-program.jar) to HDFS, the submit log shows the application jar being fetched to the local machine:
21/08/26 15:19:58 INFO SparkContext: Added JAR hdfs://xxx/spark-yarn/xxx-jars/RoaringBitmap-0.5.11.jar at hdfs://xxx/spark-yarn/xxx-jars/RoaringBitmap-0.5.11.jar with timestamp 1629962398294
21/08/26 15:19:58 INFO SparkContext: Added JAR hdfs://xxx/spark-yarn/xxx-jars/HdrHistogram-2.1.9.jar at hdfs://xxx/spark-yarn/xxx-jars/HdrHistogram-2.1.9.jar with timestamp 1629962398294
21/08/26 15:19:58 INFO SparkContext: Added JAR hdfs:///spark-yarn/programs/test-program.jar at hdfs:///spark-yarn/programs/test-program.jar with timestamp 1629962398294
...
21/08/26 15:20:17 INFO Utils: Fetching hdfs://xxx/spark-yarn/xxx-jars/chill-java-0.9.3.jar to /tmp/spark-ee38cb1d-da49-4c38-bb6c-eb6288ca056a/userFiles-fc789632-ac6c-4347-a279-7971a0f38098/fetchFileTemp2882690771436990167.tmp
21/08/26 15:20:17 INFO Executor: Adding file:/tmp/spark-ee38cb1d-da49-4c38-bb6c-eb6288ca056a/userFiles-fc789632-ac6c-4347-a279-7971a0f38098/chill-java-0.9.3.jar to class loader
21/08/26 15:20:17 INFO Executor: Fetching hdfs:///spark-yarn/programs/test-program.jar with timestamp 1629962398294
21/08/26 15:20:17 INFO Utils: Fetching hdfs:/spark-yarn/programs/test-program.jar to /tmp/spark-ee38cb1d-da49-4c38-bb6c-eb6288ca056a/userFiles-fc789632-ac6c-4347-a279-7971a0f38098/fetchFileTemp6784106893527808599.tmp
21/08/26 15:20:17 INFO Executor: Adding file:/tmp/spark-ee38cb1d-da49-4c38-bb6c-eb6288ca056a/userFiles-fc789632-ac6c-4347-a279-7971a0f38098/test-program.jar to class loader
...
...
Conclusion
In summary, and now verified: when submitting Spark jobs on YARN, all jar dependencies can be uploaded to HDFS, including the application jar itself, regardless of whether the job is submitted in local or yarn mode.
References
spark-submit cannot execute a jar on HDFS
Spark - using yarn client mode
Spark official docs: Submitting Applications