Shipping files to the cluster and reading external files with spark-submit --files

// List all files under the file directory:
LOGGER.info("**** Executor-external all files under the file directory: " + subDir(new File(arr(2)).getParentFile).mkString(System.lineSeparator()))

// List all files under the working directory:
LOGGER.info("**** Executor-external all files under the working directory: " + subDir(new File(arr(3))).mkString(System.lineSeparator()))

LOGGER.info("Program finished")
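
The values `arr(2)` and `arr(3)` are not defined in this excerpt; judging from the log labels, they plausibly hold the SparkFiles-resolved path and the JVM working directory. A hypothetical reconstruction (not the author's actual code):

```scala
import org.apache.spark.SparkFiles

// Hypothetical: how the two probed values could be produced.
val filePath = SparkFiles.get("spark.properties")  // the "resolved file" path
val workDir  = System.getProperty("user.dir")      // the working directory
```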

##### Cluster-mode output:



**** Driver resolved file spark.properties: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/spark-c307a8ed-3853-47af-ad92-7d1970683b3a/userFiles-eb591a81-139b-482f-8056-1829f6219a4b/spark.properties

**** Driver working directory: /srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001

**** Driver properties:
spark.yarn.dist.files -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/spark.properties,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/hbase-site.xml,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/kdc.conf
spark.yarn.dist.jars -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/bcprov-ext-jdk15on-1.68.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/CryptoUtil-1.1.5.304.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/fastjson-1.2.78.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/commons-pool2-2.8.1.jar

**** Driver all files under the temp directory: (note: empty here!)
**** Driver all files under the working directory: /srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/user.keytab-ca6ead50-35d4-4f1d-8e37-d87c7366106c
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/tmp
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/bcprov-ext-jdk15on-1.68.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/carbon.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/mapred-site.xml
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/launch_container.sh
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/commons-pool2-2.8.1.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/topology.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/jaas-zk.conf
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/spark.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/container_tokens
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/hbase-site.xml
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/app.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/jets3t.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/log4j-executor.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/fastjson-1.2.78.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/CryptoUtil-1.1.5.304.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/spark_libs
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/spark_conf
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/kdc.conf
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/arm
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/x86

**** Executor resolved file spark.properties: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/./spark.properties

**** Executor working directory: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003

**** Executor properties:
spark.yarn.dist.files not set
spark.yarn.dist.jars not set

**** Executor-internal all files under the file directory: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/app.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/spark_libs
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/mapred-site.xml
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/kdc.conf
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/log4j-executor.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/jets3t.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/spark.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/jaas-zk.conf
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/spark_conf
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/bcprov-ext-jdk15on-1.68.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/arm
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/topology.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/launch_container.sh
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/fastjson-1.2.78.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/x86
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/commons-pool2-2.8.1.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/carbon.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/hbase-site.xml
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/CryptoUtil-1.1.5.304.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/tmp
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/container_tokens

**** Executor-internal all files under the working directory: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/./app.jar
(same as above, omitted)

**** Executor-external all files under the file directory:

**** Executor-external all files under the working directory:


As you can see, on the executor the file path and the working directory are the same.


On the driver they differ, because the recorded file path points at a temporary file directory; the same file does exist in the working directory as well.
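
A minimal probe that reproduces this difference, as a sketch (it assumes a live `SparkContext` named `sc` and that `spark.properties` was shipped with `--files`):

```scala
import org.apache.spark.SparkFiles

// Driver side: in cluster mode this prints the userFiles-... temp directory.
println("driver: " + SparkFiles.get("spark.properties"))

// Executor side: this prints a path inside the container working directory.
sc.parallelize(1 to 1, 1).foreach { _ =>
  println("executor: " + SparkFiles.get("spark.properties"))
}
```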


##### Client-mode output:



**** Driver resolved file spark.properties: /tmp/spark-aa67cf79-1f10-4d85-99a0-c9fcced0ee80/userFiles-814810e5-3b1d-435b-a575-796b7e3a2a91/spark.properties
**** Driver working directory: /opt/HIBI-ExecuteShell
**** Driver properties:
spark.yarn.dist.files -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/spark.properties,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/hbase-site.xml,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/kdc.conf
spark.yarn.dist.jars -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/bcprov-ext-jdk15on-1.68.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/CryptoUtil-1.1.5.304.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/fastjson-1.2.78.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/commons-pool2-2.8.1.jar

**** Driver all files under the temp directory:
**** Driver all files under the working directory: /opt/HIBI-ExecuteShell/… (omitted; no spark.properties here)

**** Executor resolved file spark.properties: /srv/BigData/hadoop/data2/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0019/container_e25_1686228163560_0019_01_000002/./spark.properties

**** Executor working directory: /srv/BigData/hadoop/data2/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0019/container_e25_1686228163560_0019_01_000002
**** Executor properties:
spark.yarn.dist.files not set
spark.yarn.dist.jars not set

The remaining executor output is the same as in cluster mode (omitted).


As you can see, in client mode the driver does not copy the files into its working directory.  
 So why is the working directory `/opt/HIBI-ExecuteShell`?



echo $SPARK_HOME
/opt/ficlient20210106/Spark2x/spark
echo $JAVA_HOME
/opt/ficlient20210106/JDK/jdk-8u201


This is because the JVM's working directory is simply whatever directory the process was launched from: the custom spark-submit launcher script `exec-hive.sh` lives under `/opt/HIBI-ExecuteShell`, and jobs are submitted as:



/opt/HIBI-ExecuteShell/exec-hive.sh your_script.sh


The next section explains why FileInputStream and Source.fromFile also work when given a bare file name: Scala IO resolves relative paths against the JVM's working directory, and that directory is exactly the driver's and the executors' working directory.


### Scala IO


* Console (standard input and output) – a wrapper around System.out, System.in, and System.err; it also provides print and friends.
* StdIn (standard input) – a reference to the standard-input part of Console.
* Source (file input; output requires Java's APIs) – a wrapper around java.io.File. See the sketch after this list.
* File output – Scala has no dedicated file-output class; use Java's classes.
* Relative paths
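
A minimal sketch tying these pieces together (the file name is illustrative):

```scala
import java.io.PrintWriter
import scala.io.{Source, StdIn}

// Console / StdIn: standard output and input.
Console.println("type a line:")
val line = StdIn.readLine()

// File output: Scala has no dedicated API, so use Java's PrintWriter.
val out = new PrintWriter("demo.txt")
out.println(line)
out.close()

// Source: file input.
println(Source.fromFile("demo.txt").mkString)
```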


When calling `Source.fromFile(filePath)`, `filePath` may be a relative path:  
 relative paths are resolved against the JVM system property `user.dir`.


In plain Scala, `user.dir` is the directory from which the program was launched.  
 In IDEA, `user.dir` is the project root directory.
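
For example (a sketch; it assumes a `spark.properties` sits in the launch directory):

```scala
import scala.io.Source

// The bare name is resolved against user.dir: the directory the JVM was
// launched from (the container dir on YARN, the project root in IDEA).
println(System.getProperty("user.dir"))
val props = Source.fromFile("spark.properties").getLines().toList
```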


Listing all files and subdirectories of a directory:



import java.io.File

// Recursively list all files and subdirectories under a directory
def main(args: Array[String]): Unit = {
  for (d <- subDirRec(new File("d:\\AAA\\")))
    println(d)
}

def subDirRec(dir: File): Iterator[File] = {
  val dirs = dir.listFiles().filter(_.isDirectory())
  val files = dir.listFiles().filter(_.isFile())
  files.toIterator ++ dirs.toIterator.flatMap(subDirRec _)
}

// Non-recursive version
def subDir(dir: File): Iterator[File] = {
  if (dir.listFiles() == null) Array.empty[File].toIterator
  else dir.listFiles().toIterator
}
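
Applied to the directories probed earlier, usage might look like this (a sketch; it assumes a live SparkContext so that SparkFiles is initialized):

```scala
import java.io.File
import org.apache.spark.SparkFiles

subDir(new File(SparkFiles.getRootDirectory())).foreach(println)   // SparkFiles dir
subDir(new File(System.getProperty("user.dir"))).foreach(println)  // working dir
```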


In yarn-client mode, the Application Master only requests resources from YARN for the executors; the client then communicates with the containers to schedule the job. The figure below illustrates yarn-client mode.  
 ![yarn-client mode](https://img-blog.csdnimg.cn/20190606105648807.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0FuZ2VsX1NodXJh,size_16,color_FFFFFF,t_70#pic_center)


### Note: SparkFiles.get throws an exception


Symptom  
 The Spark job throws the following exception:



Exception in thread "main" java.lang.NullPointerException
    at org.apache.spark.SparkFiles$.getRootDirectory(SparkFiles.scala:37)
    at org.apache.spark.SparkFiles$.get(SparkFiles.scala:31)


Analysis  
 SparkFiles.get() was called before the SparkContext was initialized.


Solution  
 Initialize the SparkContext first.
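
In code, the fix is simply a matter of ordering (a minimal sketch):

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Create the context first; SparkFiles' root directory is set during startup.
val sc = new SparkContext(new SparkConf().setAppName("files-demo"))
val path = SparkFiles.get("spark.properties")  // safe only after this point
```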


#### Important notes


Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.  
 In cluster mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored. In client mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in spark.local.dir. This is because the Spark driver does not run on the YARN cluster in client mode, only the Spark executors do.  
 The --files and --archives options support specifying file names with the # similar to Hadoop. For example, you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.  
 The --jars option allows the SparkContext.addJar function to work if you are using it with local files and running in cluster mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
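
For example, with the `#` alias an executor can read the file under its new name. A sketch (it assumes the job was submitted with `--files localtest.txt#appSees.txt`):

```scala
import org.apache.spark.SparkFiles
import scala.io.Source

// On an executor the file is localized under the alias name, so both of
// these refer to the same localized copy:
val viaSparkFiles = SparkFiles.get("appSees.txt")
val viaBareName   = Source.fromFile("appSees.txt").mkString  // container working dir
```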



