Spark: shipping files to the cluster and reading external files with --files

This article looks at how spark-submit ships files in YARN client and cluster modes: what happens during file transfer, how jars are loaded, and how to obtain file paths and contents. It also covers the relevant Scala IO basics (standard input/output, file input) and closes with a SparkFiles.get error and its fix.


This article covers submitting jobs with spark-submit in YARN client and cluster modes.

Adding files

spark-submit --files file_paths
where file_paths accepts several schemes: file:, hdfs://, http://, ftp://, local:. Separate multiple paths with commas.

--files takes a comma-separated list of file paths. In cluster mode the files are shipped to the work directory of every executor and of the driver (check the user.dir property); in client mode they are shipped only to each executor's work directory. In client mode, file_paths must point to local files. In cluster mode, file_paths can be local files or globally visible ones (e.g. hdfs://path); if a local path is used in cluster mode, only the machine that submits the job needs to have the file, not every node.
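For example, a submission mixing a local file and an HDFS file might look like this (paths and main class are illustrative, not taken from the verification below):

spark-submit --master yarn --deploy-mode cluster \
  --files /opt/test/spark.properties,hdfs:///config/hbase-site.xml \
  --class com.example.Main my-app.jar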

How the files are actually transferred:

When you run spark-submit --files, the paths after --files are recorded and passed to the driver process. When the driver starts, it calls SparkContext.addFile(file_path) for each of them, copying the files into the driver's temporary file directory. When the executors start, they fetch the files from the driver into their own work directories.

So the path returned by SparkFiles.get(fileName) (get(fileName) looks up the full path of a file uploaded via SparkContext.addFile()) is SparkEnv.get.driverTmpDir + fileName on the driver and workDir + fileName on the executors.
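A minimal sketch of resolving the shipped file's path on both sides, assuming the job was submitted with --files /opt/test/spark.properties as above:

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Driver side: driverTmpDir + fileName (client mode) or the container
// work dir (cluster mode; see the verification below).
val driverPath = SparkFiles.get("spark.properties")

// Executor side: resolves to the executor's work dir + fileName.
val executorPaths = spark.sparkContext
  .parallelize(Seq(1))
  .map(_ => SparkFiles.get("spark.properties"))
  .collect()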

Verified: the driver's temporary directory ends up empty (but in cluster mode the file is also written to the driver's work directory).

Whether the file is written to the work directory first and then the temp directory, or the other way round, I could not verify; that would take a read through the source. It doesn't matter in practice: in cluster mode the driver's temp directory ends up empty and the file sits in its work directory, just like on the executors. In client mode the file is not copied into the driver's work directory, so to read it on the driver you need the original path.

A quick word about jars:

--jars: in cluster mode the external JARs are added to both the driver's and the executors' classpath; in client mode, only to the executors' classpath. In client mode the paths must be local. In cluster mode the paths can be local files or globally visible ones (e.g. hdfs://path); if a local path is used in cluster mode, only the submitting machine needs the file, not every node. These jars are also placed in the driver's and executors' work directories.

The recorded original file paths show up in these properties:

import scala.collection.JavaConverters._
println(System.getProperties.asScala.mkString(System.lineSeparator()))

  • spark.yarn.dist.archives Comma separated list of archives to be extracted into the working directory of each executor.
  • spark.yarn.dist.files Comma-separated list of files to be placed in the working directory of each executor.
  • spark.yarn.dist.jars Comma-separated list of jars to be placed in the working directory of each executor.

Getting file paths and contents

In cluster mode:

Get the file path:
filePath = SparkFiles.get(fileName)

Get a file input stream:
driver: inputStream = new FileInputStream(fileName)
executor: inputStream = new FileInputStream(fileName) or inputStream = new FileInputStream(SparkFiles.get(fileName))

Get the file contents:
driver: Source.fromFile(fileName)
executor: Source.fromFile(fileName) or Source.fromFile(SparkFiles.get(fileName))
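Putting this together, a minimal cluster-mode sketch (the file name matches the verification below; everything else is illustrative):

import scala.io.Source
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Driver (cluster mode): the file sits in the driver's work dir,
// so the bare relative name resolves correctly.
val driverContent = Source.fromFile("spark.properties").mkString

// Executor: the bare name or SparkFiles.get both work, since both
// point into the executor's work dir.
val executorContent = spark.sparkContext
  .parallelize(Seq(1))
  .map(_ => Source.fromFile(SparkFiles.get("spark.properties")).mkString)
  .collect()
  .head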

In client mode:

Executors behave exactly as in cluster mode.

driver:
You need the file's original path. Two options:
Method 1: pass the original path in as a program argument oriPath, then Source.fromFile(oriPath).
Method 2: read System.getProperty("spark.yarn.dist.files"), filter it for the file name you need to get oriPath, then Source.fromFile(oriPath); see the sketch below.
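A sketch of method 2 (the property format is as observed in the output below; error handling is minimal):

import scala.io.Source

// Recover the original client-side path from spark.yarn.dist.files,
// a comma-separated list of file: URIs.
val fileName = "spark.properties"
val oriPath = System.getProperty("spark.yarn.dist.files")
  .split(",")
  .find(_.endsWith("/" + fileName))
  .map(_.stripPrefix("file://"))   // keep only the local filesystem path
  .getOrElse(sys.error(s"$fileName not found in spark.yarn.dist.files"))

val content = Source.fromFile(oriPath).mkString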

Verification (shipping a spark.properties file):

spark-submit --files /opt/test/spark.properties

import java.io.File

import scala.collection.JavaConverters._

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

val LOGGER = LoggerFactory.getLogger("FilesCheck") // logger name is illustrative

val ss = SparkSession.builder().enableHiveSupport().getOrCreate()

ss.sparkContext.setLogLevel("INFO")
// resolved path of the shipped file
val filePath = SparkFiles.get("spark.properties")
val workDir = System.getProperty("user.dir")
LOGGER.info("**** Driver file spark.properties: " + filePath)
LOGGER.info("**** Driver work dir: " + workDir)

LOGGER.info(s"**** Driver properties: ${System.getProperties.asScala.mkString(System.lineSeparator())}")

// non-recursive listing of a directory's entries
def subDir(dir: File): Iterator[File] = {
  LOGGER.info(s"**** subDir of $dir")
  if (dir.listFiles() == null) Array.empty[File].toIterator
  else dir.listFiles().toIterator
}

// list everything in the temp directory:
LOGGER.info("**** Driver temp dir contents: " + subDir(new File(filePath).getParentFile).mkString(System.lineSeparator()))

// list everything in the work directory:
LOGGER.info("**** Driver work dir contents: " + subDir(new File(workDir)).mkString(System.lineSeparator()))

// run the same checks inside an executor task and ship the results back
import ss.sqlContext.implicits._
val arr = Seq("1").toDF().map { x =>
  List(System.getProperties.asScala.mkString(System.lineSeparator()),
    System.getenv().asScala.mkString(System.lineSeparator()),
    SparkFiles.get("spark.properties"),
    System.getProperty("user.dir"),
    subDir(new File(System.getProperty("user.dir"))).mkString(System.lineSeparator()),
    subDir(new File(SparkFiles.get("spark.properties")).getParentFile).mkString(System.lineSeparator())
  )
}.collect().head

// resolved path of the shipped file
LOGGER.info("**** Executor file spark.properties: " + arr(2))
LOGGER.info("**** Executor work dir: " + arr(3))
LOGGER.info(s"**** Executor properties: " + arr(0))

// work dir listing, computed inside the task:
LOGGER.info("**** Executor (inside task) work dir contents: " + arr(4))

// file dir listing, computed inside the task:
LOGGER.info("**** Executor (inside task) file dir contents: " + arr(5))

// file dir listing, path from the executor but listed on the driver:
LOGGER.info("**** Executor (from driver) file dir contents: " + subDir(new File(arr(2)).getParentFile).mkString(System.lineSeparator()))

// work dir listing, path from the executor but listed on the driver:
LOGGER.info("**** Executor (from driver) work dir contents: " + subDir(new File(arr(3))).mkString(System.lineSeparator()))

LOGGER.info("Done")

Cluster-mode output:
**** Driver file spark.properties: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/spark-c307a8ed-3853-47af-ad92-7d1970683b3a/userFiles-eb591a81-139b-482f-8056-1829f6219a4b/spark.properties

**** Driver work dir: /srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001

**** Driver properties:
spark.yarn.dist.files -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/spark.properties,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/hbase-site.xml,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/kdc.conf
spark.yarn.dist.jars -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/bcprov-ext-jdk15on-1.68.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/CryptoUtil-1.1.5.304.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/fastjson-1.2.78.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/commons-pool2-2.8.1.jar

**** Driver temp dir contents: (note: empty!)
**** Driver work dir contents: /srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/user.keytab-ca6ead50-35d4-4f1d-8e37-d87c7366106c
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/tmp
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/bcprov-ext-jdk15on-1.68.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/carbon.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/mapred-site.xml
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/launch_container.sh
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/commons-pool2-2.8.1.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/topology.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/jaas-zk.conf
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/spark.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/container_tokens
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/hbase-site.xml
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/__app__.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/jets3t.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/log4j-executor.properties
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/fastjson-1.2.78.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/CryptoUtil-1.1.5.304.jar
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/__spark_libs__
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/__spark_conf__
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/kdc.conf
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/arm
/srv/BigData/hadoop/data3/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000001/x86 



**** Executor file spark.properties: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/./spark.properties

**** Executor work dir: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003

**** Executor properties: 
spark.yarn.dist.files: not present
spark.yarn.dist.jars: not present


**** Executor (inside task) work dir contents: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/__app__.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/__spark_libs__
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/mapred-site.xml
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/kdc.conf
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/log4j-executor.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/jets3t.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/spark.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/jaas-zk.conf
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/__spark_conf__
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/bcprov-ext-jdk15on-1.68.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/arm
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/topology.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/launch_container.sh
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/fastjson-1.2.78.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/x86
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/commons-pool2-2.8.1.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/carbon.properties
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/hbase-site.xml
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/CryptoUtil-1.1.5.304.jar
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/tmp
/srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/container_tokens



**** Executor (inside task) file dir contents: /srv/BigData/hadoop/data1/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0011/container_e25_1686228163560_0011_01_000003/./__app__.jar
(same as above, omitted)

**** Executor (from driver) file dir contents:

**** Executor (from driver) work dir contents:
 

As you can see, on the executor the file directory and the work directory are the same; and the driver cannot list the executor's local directories at all, hence the two empty listings above.

On the driver, the file path and the work path differ, because what is recorded is the temporary file directory; the file does exist in the work directory, though.

Client-mode output:
**** Driver file spark.properties: /tmp/spark-aa67cf79-1f10-4d85-99a0-c9fcced0ee80/userFiles-814810e5-3b1d-435b-a575-796b7e3a2a91/spark.properties
**** Driver work dir: /opt/HIBI-ExecuteShell
**** Driver properties:
spark.yarn.dist.files -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/spark.properties,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/hbase-site.xml,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/config/kdc.conf
spark.yarn.dist.jars -> file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/bcprov-ext-jdk15on-1.68.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/CryptoUtil-1.1.5.304.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/fastjson-1.2.78.jar,file:///opt/HIBI-ExecuteShell/config/CommonDisplayAds/12_0_2_300sp4_sdv3/lib/commons-pool2-2.8.1.jar
 
**** Driver temp dir contents:
**** Driver work dir contents: /opt/HIBI-ExecuteShell/... (omitted; no spark.properties there)

**** Executor file spark.properties: /srv/BigData/hadoop/data2/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0019/container_e25_1686228163560_0019_01_000002/./spark.properties
 
**** Executor work dir: /srv/BigData/hadoop/data2/nm/localdir/usercache/AdsNewsFeed/appcache/application_1686228163560_0019/container_e25_1686228163560_0019_01_000002
**** Executor properties:
spark.yarn.dist.files: not present
spark.yarn.dist.jars: not present

The rest of the executor output matches cluster mode (omitted).

As you can see, in client mode the driver does not copy the file into its work directory.
Also, why is the work dir /opt/HIBI-ExecuteShell?

> echo $SPARK_HOME
/opt/ficlient20210106/Spark2x/spark
> echo $JAVA_HOME
/opt/ficlient20210106/JDK/jdk-8u201

Because the JVM's working directory is simply wherever it was launched from: the custom spark-submit launcher script exec-hive.sh lives under /opt/HIBI-ExecuteShell, and jobs are submitted as:

/opt/HIBI-ExecuteShell/exec-hive.sh your-script.sh

Here is why passing a bare file name to FileInputStream or Source.fromFile also works: Scala IO resolves relative paths against the JVM's working directory, and the JVM's working directory is the same as the driver's and executors' work directories.
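A quick sketch of what that resolution looks like:

import java.io.File

// A relative path resolves against the JVM system property user.dir,
// which in a YARN container is the driver's/executor's work directory.
val workDir  = System.getProperty("user.dir")
val resolved = new File("spark.properties").getAbsolutePath
println(s"user.dir = $workDir")
println(s"resolved = $resolved")   // workDir + "/spark.properties"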

Scala IO

  • Console (standard input/output) – a wrapper around System.out, System.in, and System.err; includes print.
  • StdIn (standard input) – a reference to the standard-input part of Console.
  • Source (file input; for output you fall back to the Java API) – a wrapper around java.io.File.
  • File output – Scala has no dedicated file-output API; use the Java classes.
  • Relative paths

When calling Source.fromFile(filePath), filePath may be a relative path:
relative paths are rooted at the JVM system property user.dir.

Launched from a shell, user.dir is the directory Scala was started from.
In IDEA, user.dir is the project root.

Listing a directory's files and subdirectories:

// walk all files and subdirectories under a directory
import java.io.File

def main(args: Array[String]): Unit = {
  for (d <- subDirRec(new File("d:\\AAA\\")))
    println(d)
}

def subDirRec(dir: File): Iterator[File] = {
  val dirs  = dir.listFiles().filter(_.isDirectory())
  val files = dir.listFiles().filter(_.isFile())
  files.toIterator ++ dirs.toIterator.flatMap(subDirRec _)
}

// non-recursive version
def subDir(dir: File): Iterator[File] = {
  if (dir.listFiles() == null) Array.empty[File].toIterator
  else dir.listFiles().toIterator
}

In yarn-client mode, the Application Master only requests resources from YARN for the executors; the client then communicates with the containers directly to schedule the work, as shown in the diagram below.
(figure: yarn-client mode)

Heads-up: SparkFiles.get can throw

Symptom
The Spark job throws the following exception:

Exception in thread "main" java.lang.NullPointerException
	at org.apache.spark.SparkFiles$.getRootDirectory(SparkFiles.scala:37)
	at org.apache.spark.SparkFiles$.get(SparkFiles.scala:31)
	...

Analysis
SparkFiles.get() was called before the SparkContext was initialized.

Fix
Initialize the SparkContext first.
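A minimal sketch of the correct ordering (file name taken from the verification above):

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

// Wrong order: calling SparkFiles.get before any SparkContext exists
// leaves SparkEnv unset and throws the NullPointerException above.
// Right order: create the session (and with it the SparkContext) first.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val path  = SparkFiles.get("spark.properties")   // safe now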

Important notes

  • Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
  • In cluster mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored. In client mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in spark.local.dir. This is because the Spark driver does not run on the YARN cluster in client mode, only the Spark executors do.
  • The --files and --archives options support specifying file names with the # similar to Hadoop. For example, you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
  • The --jars option allows the SparkContext.addJar function to work if you are using it with local files and running in cluster mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
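A sketch of that last note (the jar path is illustrative, not from this article's job):

import org.apache.spark.sql.SparkSession

// In cluster mode, --jars is what makes addJar work for local paths.
val spark = SparkSession.builder().getOrCreate()
spark.sparkContext.addJar("/opt/test/my-udf.jar")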

