1. Problem:
When submitting a job with Spark on YARN, the following warning appears:
WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
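The warning shows up in the YARN client output of an ordinary submission; for example (class name and application jar below are placeholders):

spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar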
2. Solution:
1) Create the archive: jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
2) Create a directory on HDFS: hdfs dfs -mkdir -p /system/SparkJars/jar
   Upload the archive to HDFS: hdfs dfs -put spark-libs.jar /system/SparkJars/jar
3) In spark-defaults.conf, set spark.yarn.archive=hdfs:///system/SparkJars/jar/spark-libs.jar
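Put together, the fix is a short shell sequence (a sketch using the example path above; the final ls is only there to verify the upload, and spark-defaults.conf lives under $SPARK_HOME/conf):

# build an uncompressed archive of all Spark runtime jars
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
# create the target directory on HDFS and upload the archive
hdfs dfs -mkdir -p /system/SparkJars/jar
hdfs dfs -put spark-libs.jar /system/SparkJars/jar
# verify the archive is where spark.yarn.archive will point
hdfs dfs -ls /system/SparkJars/jar
# finally, append to $SPARK_HOME/conf/spark-defaults.conf:
#   spark.yarn.archive=hdfs:///system/SparkJars/jar/spark-libs.jar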
3. Result:
This is a tuning measure for Spark on YARN: hosting the Spark jars on HDFS lets YARN cache them on the nodes, so each submission no longer spends time uploading everything under $SPARK_HOME/jars. Submit a job again and the warning above no longer appears.
4. Explanation from the official docs:
The page [https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties] explains:
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
Looking further at the specific Spark Properties:
spark.yarn.jars (default: none): List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
spark.yarn.archive (default: none): An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
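As a sketch of the two alternatives, either line below would go into spark-defaults.conf (the jars directory in Option 1 is hypothetical; if both are set, spark.yarn.archive wins):

# Option 1: individual jars on HDFS, globs allowed
spark.yarn.jars      hdfs:///system/SparkJars/jars/*.jar
# Option 2: a single archive, replaces spark.yarn.jars when set
spark.yarn.archive   hdfs:///system/SparkJars/jar/spark-libs.jar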
In short: by default Spark on YARN uses the Spark jars installed locally (under the Spark installation directory), but those jars can also sit in any world-readable location on HDFS. That lets YARN cache them on the nodes, so the jars don't have to be uploaded each time an application runs.
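The same setting can also be supplied per job on the command line instead of globally in spark-defaults.conf; a hypothetical submission (class name and application jar are placeholders):

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///system/SparkJars/jar/spark-libs.jar \
  --class com.example.MyApp my-app.jar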