Spark startup optimization: avoid uploading dependency jars to HDFS on every submit (using spark.yarn.jars and spark.yarn.archive)

When a Spark application is submitted on YARN and neither spark.yarn.archive nor spark.yarn.jars is configured, the client logs "Neither spark.yarn.jars nor spark.yarn.archive is set", then packs every jar under SPARK_HOME/jars into a zip and uploads it to HDFS. This step is slow. Either of the following two methods avoids it and shortens application startup time:

  1. Add spark.yarn.archive or spark.yarn.jars to spark-defaults.conf.

  2. Pass one of them on the spark-submit command line: --conf "spark.yarn.archive=hdfs://hdp01:8020/spark-yarn/zip/spark_jars_2.4.0.zip"  or  --conf "spark.yarn.jars=hdfs://hdp01:8020/spark-yarn/jars/*.jar"

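To make the Spark runtime jars accessible from the YARN side, set spark.yarn.archive or spark.yarn.jars.

The official definitions of the two properties are as follows (see http://spark.apache.org/docs/2.4.7/running-on-yarn.html):

spark.yarn.jars: List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN uses Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache them on nodes so that they do not need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

spark.yarn.archive: An archive containing the Spark jars needed for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars, and the archive is used in all of the application's containers. The archive should contain jar files in its root directory. As with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

Note: if both are set, spark.yarn.archive takes precedence.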
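
The lookup order described above can be summarized in a small shell sketch. This is only an illustration of the documented precedence, not Spark's actual code; the function name spark_libs_source and its output labels are hypothetical.

```shell
# Illustrative sketch of the precedence between the two properties.
# $1 = value of spark.yarn.archive (empty string if unset)
# $2 = value of spark.yarn.jars    (empty string if unset)
spark_libs_source() {
  if [ -n "$1" ]; then
    # spark.yarn.archive wins whenever it is set
    echo "archive:$1"
  elif [ -n "$2" ]; then
    echo "jars:$2"
  else
    # Neither is set: jars under SPARK_HOME are zipped and uploaded
    echo "upload:SPARK_HOME/jars"
  fi
}
```

For example, calling spark_libs_source with both values set prints the archive entry, which matches the note that spark.yarn.archive has the higher priority.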

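1. Neither spark.yarn.archive nor spark.yarn.jars is set

If neither spark.yarn.archive nor spark.yarn.jars is set in spark-defaults.conf or on the spark-submit command line, the client logs the warning "Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME". Spark then creates a zip file containing all jars under $SPARK_HOME/jars and uploads it to the distributed cache.

Every application uploads its own copy of this zip, which puts load on HDFS, consumes HDFS space, and lengthens startup time.

For Spark 2.4.0 this zip is about 225 MB. In the log below, building and uploading it takes about 6 seconds (15:34:25 to 15:34:31).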


21/04/26 15:34:25 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.25.21.134:4040
21/04/26 15:34:25 INFO RMProxy: Connecting to ResourceManager at hdp01/172.25.21.104:8050
21/04/26 15:34:25 INFO Client: Requesting a new application from cluster with 3 NodeManagers
21/04/26 15:34:25 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (76800 MB per container)
21/04/26 15:34:25 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
21/04/26 15:34:25 INFO Client: Setting up container launch context for our AM
21/04/26 15:34:25 INFO Client: Setting up the launch environment for our AM container
21/04/26 15:34:25 INFO Client: Preparing resources for our AM container
21/04/26 15:34:25 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/04/26 15:34:27 INFO Client: Uploading resource file:/tmp/spark-d2d0ee3b-b814-4533-8bff-eb14a306d141/__spark_libs__6295765865935354270.zip -> hdfs://hdp01:8020/user/root/.sparkStaging/application_1619146133746_0029/__spark_libs__6295765865935354270.zip
21/04/26 15:34:31 INFO Client: Uploading resource file:/tmp/spark-d2d0ee3b-b814-4533-8bff-eb14a306d141/__spark_conf__6280792259904379056.zip -> hdfs://hdp01:8020/user/root/.sparkStaging/application_1619146133746_0029/__spark_conf__.zip
21/04/26 15:34:31 INFO SecurityManager: Changing view acls to: root
21/04/26 15:34:31 INFO SecurityManager: Changing modify acls to: root
21/04/26 15:34:31 INFO SecurityManager: Changing view acls groups to:
21/04/26 15:34:31 INFO SecurityManager: Changing modify acls groups to:
21/04/26 15:34:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/04/26 15:34:32 INFO Client: Submitting application application_1619146133746_0029 to ResourceManager
21/04/26 15:34:32 INFO YarnClientImpl: Submitted application application_1619146133746_0029
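
To check whether a given submission hit this fallback, one option is to save the client output to a file and search it for the warning shown above. The sketch below assumes the output was captured to a file; the function name falls_back_to_upload and the file name submit.log are hypothetical.

```shell
# Returns 0 (success) if the captured client log contains the fallback
# warning, meaning the Spark jars were zipped and uploaded for this run.
# $1 = path to a file holding the spark-submit / spark-sql output
falls_back_to_upload() {
  grep -q "Neither spark.yarn.jars nor spark.yarn.archive is set" "$1"
}

# Example (assumes the output was saved first, e.g. with 2>&1 | tee submit.log):
#   falls_back_to_upload submit.log && echo "Spark jars are uploaded on every submit"
```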

    

   

2. Setting spark.yarn.jars

Upload all jars from the jars directory under the Spark root to HDFS:

 hdfs dfs -mkdir -p  /spark-yarn/jars
 hdfs dfs -put /opt/spark/jars/* /spark-yarn/jars/

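Edit spark-defaults.conf (note: on every node) and add this setting: spark.yarn.jars hdfs://hdp01:8020/spark-yarn/jars/*.jar

Alternatively, pass spark.yarn.jars on the command line: spark-sql --master yarn --conf "spark.yarn.jars=hdfs://hdp01:8020/spark-yarn/jars/*.jar"

As the log shows, the Spark jars are no longer uploaded because "Source and destination file systems are the same", which shortens application startup. Only the configuration the application needs is uploaded, and it is under 300 KB.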

21/04/26 15:48:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.25.21.134:4040
21/04/26 15:48:39 INFO RMProxy: Connecting to ResourceManager at hdp01/172.25.21.104:8050
21/04/26 15:48:39 INFO Client: Requesting a new application from cluster with 3 NodeManagers
21/04/26 15:48:39 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (76800 MB per container)
21/04/26 15:48:39 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
21/04/26 15:48:39 INFO Client: Setting up container launch context for our AM
21/04/26 15:48:39 INFO Client: Setting up the launch environment for our AM container
21/04/26 15:48:39 INFO Client: Preparing resources for our AM container
21/04/26 15:48:39 INFO Client: Source and destination file systems are the same. Not copying hdfs://hdp01:8020/spark-yarn/jars/JavaEWAH-0.3.2.jar
21/04/26 15:48:39 INFO Client: Source and destination file systems are the same. Not copying hdfs://hdp01:8020/spark-yarn/jars/RoaringBitmap-0.5.11.jar
......
21/04/26 15:48:39 INFO Client: Source and destination file systems are the same. Not copying hdfs://hdp01:8020/spark-yarn/jars/zookeeper-3.4.6.jar
21/04/26 15:48:39 INFO Client: Source and destination file systems are the same. Not copying hdfs://hdp01:8020/spark-yarn/jars/zstd-jni-1.3.2-2.jar
21/04/26 15:48:39 INFO Client: Uploading resource file:/tmp/spark-7435eb56-7303-44e9-882a-4733c406351c/__spark_conf__1185932954890838093.zip -> hdfs://hdp01:8020/user/root/.sparkStaging/application_1619146133746_0032/__spark_conf__.zip
21/04/26 15:48:39 INFO SecurityManager: Changing view acls to: root
21/04/26 15:48:39 INFO SecurityManager: Changing modify acls to: root
21/04/26 15:48:39 INFO SecurityManager: Changing view acls groups to:
21/04/26 15:48:39 INFO SecurityManager: Changing modify acls groups to:
21/04/26 15:48:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/04/26 15:48:40 INFO Client: Submitting application application_1619146133746_0032 to ResourceManager

   

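3. Setting spark.yarn.archive

Pack all jars from the jars directory under the Spark root into a zip file and upload it to HDFS.

When packing, make sure all jars sit in the root directory of the zip (if the zip command is missing, install it first: yum install zip).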

cd /opt/spark/jars/
zip -q -r spark_jars_2.4.0.zip *
hdfs dfs -mkdir -p /spark-yarn/zip
hdfs dfs -put spark_jars_2.4.0.zip /spark-yarn/zip/
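
Because the archive must have its jars in the root directory, it can be worth checking the layout before relying on it. The sketch below reads entry names from standard input and fails if any of them contains a directory component. The function name entries_at_root is hypothetical, and the usage line assumes the unzip tool is installed.

```shell
# Reads archive entry names on stdin, one per line.
# Succeeds only if no entry contains a "/" (that is, everything is at the root).
entries_at_root() {
  ! grep -q '/'
}

# Example (assumes unzip is available; -Z1 lists entry names only):
#   unzip -Z1 spark_jars_2.4.0.zip | entries_at_root && echo "layout OK"
```

If the check fails, the zip was probably created from the parent directory, so the jars ended up under a jars/ prefix inside the archive.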

Edit spark-defaults.conf (note: on every node) and add this setting: spark.yarn.archive hdfs://hdp01:8020/spark-yarn/zip/spark_jars_2.4.0.zip

Alternatively, pass spark.yarn.archive on the command line: spark-sql --master yarn --conf "spark.yarn.archive=hdfs://hdp01:8020/spark-yarn/zip/spark_jars_2.4.0.zip"

As the log shows ("Source and destination file systems are the same. Not copying hdfs://hdp01:8020/spark-yarn/zip/spark_jars_2.4.0.zip"), the Spark jars are no longer uploaded, which shortens application startup. Only the configuration the application needs is uploaded, and it is under 300 KB.

21/04/26 15:57:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.25.21.134:4040
21/04/26 15:57:33 INFO RMProxy: Connecting to ResourceManager at hdp01/172.25.21.104:8050
21/04/26 15:57:33 INFO Client: Requesting a new application from cluster with 3 NodeManagers
21/04/26 15:57:33 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (76800 MB per container)
21/04/26 15:57:33 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
21/04/26 15:57:33 INFO Client: Setting up container launch context for our AM
21/04/26 15:57:33 INFO Client: Setting up the launch environment for our AM container
21/04/26 15:57:33 INFO Client: Preparing resources for our AM container
21/04/26 15:57:33 INFO Client: Source and destination file systems are the same. Not copying hdfs://hdp01:8020/spark-yarn/zip/spark_jars_2.4.0.zip
21/04/26 15:57:33 INFO Client: Uploading resource file:/tmp/spark-7ca10ef0-2e82-47f8-86b8-c29db05b6bbe/__spark_conf__6479054559100777354.zip -> hdfs://hdp01:8020/user/root/.sparkStaging/application_1619146133746_0033/__spark_conf__.zip
21/04/26 15:57:34 INFO SecurityManager: Changing view acls to: root
21/04/26 15:57:34 INFO SecurityManager: Changing modify acls to: root
21/04/26 15:57:34 INFO SecurityManager: Changing view acls groups to:
21/04/26 15:57:34 INFO SecurityManager: Changing modify acls groups to:
21/04/26 15:57:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/04/26 15:57:35 INFO Client: Submitting application application_1619146133746_0033 to ResourceManager
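
The same confirmation can be scripted against a saved log: the Spark jars count as reused when the "Source and destination file systems are the same" message appears and no __spark_libs__ zip is uploaded. This is a rough sketch based only on the log lines shown in this post; the function name libs_reused_from_hdfs and the file name submit.log are hypothetical.

```shell
# Returns 0 if the log shows the Spark jars being reused from HDFS
# and contains no upload of a __spark_libs__ zip.
# $1 = path to a file holding the spark-submit / spark-sql output
libs_reused_from_hdfs() {
  grep -q "Source and destination file systems are the same" "$1" &&
    ! grep -q "Uploading resource .*__spark_libs__" "$1"
}

# Example:
#   libs_reused_from_hdfs submit.log && echo "Spark jars were not re-uploaded"
```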
