1. Problem:
When submitting a job with Spark on YARN, the following warning appears:
WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
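The warning shows up in the YARN client output of an ordinary submission; for example (class name and application jar below are placeholders):

spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar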
2. Solution:
1) Create the archive: jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
2) Create a directory on HDFS: hdfs dfs -mkdir -p /system/SparkJars/jar
   Upload the archive to HDFS: hdfs dfs -put spark-libs.jar /system/SparkJars/jar
3) In spark-defaults.conf, set spark.yarn.archive=hdfs:///system/SparkJars/jar/spark-libs.jar
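Put together, the fix is a short shell sequence (a sketch using the example path above; the final ls is only there to verify the upload, and spark-defaults.conf lives under $SPARK_HOME/conf):

# build an uncompressed archive of all Spark runtime jars
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
# create the target directory on HDFS and upload the archive
hdfs dfs -mkdir -p /system/SparkJars/jar
hdfs dfs -put spark-libs.jar /system/SparkJars/jar
# verify the archive is where spark.yarn.archive will point
hdfs dfs -ls /system/SparkJars/jar
# finally, append to $SPARK_HOME/conf/spark-defaults.conf:
#   spark.yarn.archive=hdfs:///system/SparkJars/jar/spark-libs.jar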
3. Result:
This is a tuning measure for Spark on YARN: hosting the Spark jars on HDFS lets YARN cache them on the nodes, so each submission no longer spends time uploading everything under $SPARK_HOME/jars. Submit a job again and the warning above no longer appears.
4. Explanation from the official docs:
The page [https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties] explains:
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
Looking further at the specific Spark Properties:
spark.yarn.jars (default: none): List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
spark.yarn.archive (default: none): An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
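As a sketch of the two alternatives, either line below would go into spark-defaults.conf (the jars directory in Option 1 is hypothetical; if both are set, spark.yarn.archive wins):

# Option 1: individual jars on HDFS, globs allowed
spark.yarn.jars      hdfs:///system/SparkJars/jars/*.jar
# Option 2: a single archive, replaces spark.yarn.jars when set
spark.yarn.archive   hdfs:///system/SparkJars/jar/spark-libs.jar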
In short: by default Spark on YARN uses the Spark jars installed locally (under the Spark installation directory), but those jars can also sit in any world-readable location on HDFS. That lets YARN cache them on the nodes, so the jars don't have to be uploaded each time an application runs.
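The same setting can also be supplied per job on the command line instead of globally in spark-defaults.conf; a hypothetical submission (class name and application jar are placeholders):

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///system/SparkJars/jar/spark-libs.jar \
  --class com.example.MyApp my-app.jar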