If you run your application with spark-submit in cluster mode, it accepts a `--files` flag for shipping files from the driver node to the workers. I believe the reason you are able to run in local mode is that the driver and the workers are on the same machine, whereas in cluster mode the driver and workers may be on different machines. In that case, Spark needs to know which files to send to the worker nodes. You can use the following flags, as described in *Learning Spark* by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia:
--master
Indicates the cluster manager to connect to. The options for this flag are described in Table 7-1.
--deploy-mode
Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode.
--class
The “main” class of your application if you’re running a Java or Scala program.
--name
A human-readable name for your application. This will be displayed in Spark’s web UI.
--jars
A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
--files
A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
--py-files
A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
--executor-memory
The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory
The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
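For example, a cluster-mode submission that ships a data file to every worker might look like the following sketch. The master URL, application name, file paths, and memory settings are all placeholders for illustration, not values from the question:

```shell
# Submit in cluster mode; --files copies lookup.txt into the working
# directory of each executor (paths and names here are illustrative).
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --name my-app \
  --files /local/path/lookup.txt \
  --executor-memory 2g \
  my_app.py
```

Inside the application, the shipped file can then be located with `SparkFiles.get("lookup.txt")` (in PySpark) instead of a driver-local path, so the same code works in both local and cluster mode.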
Update: I think Kiran has a Hadoop setup (as he mentioned elsewhere) and was not able to read the file from HDFS programmatically. If that is not the case, please ignore this answer.