7. spark-submit: the script for submitting Spark applications in production

Just Jump

Category: Spark: The Definitive Guide    Tags: spark-submit

Original link: https://www.amazon.com/-/zh/dp/1491912219/ref=sr_1_1?__mk_zh_CN=

I. Run spark-submit --help to see which options are available when submitting a job.

Options:	Description	Notes (my own translation plus remarks from experience; corrections are welcome)
--master MASTER_URL	spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]).	Most commonly local (local mode) or yarn (running on a cluster).
--deploy-mode DEPLOY_MODE	Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).	Whether the driver starts on the local client or on a worker node in the cluster. In cluster mode, YARN manages the driver process, and the client can exit once the application has been created. In client mode, the driver runs inside the client process; YARN only provides resources for the executors and does not manage the driver.
--class CLASS_NAME	Your application's main class (for Java / Scala apps).	Main class of the Java/Scala application.
--name NAME	A name of your application.	A name for the application.
--jars JARS	Comma-separated list of jars to include on the driver and executor classpaths.	Comma-separated list of jars, added to the driver and executor classpaths.
--packages	Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.	Comma-separated list of Maven coordinates, added to the driver and executor classpaths. They are resolved from the local Maven repository first, then Maven Central and any repositories given by --repositories.
--exclude-packages	Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts.	Comma-separated list of groupId:artifactId entries that are skipped when resolving the dependencies of --packages, to avoid dependency conflicts.
--repositories	Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
--py-files PY_FILES	Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.	Comma-separated list of .zip, .egg, or .py files, added to the PYTHONPATH.
--files FILES	Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).	Comma-separated list of files that are copied into the working directory of each executor. They can be located with SparkFiles.get(fileName). Note from experience: because the files end up in each executor's working directory, code running on the executors can usually open them by bare file name without calling SparkFiles.get(fileName).
--conf, -c PROP=VALUE	Arbitrary Spark configuration property.	Any Spark configuration property, given as key=value.
--properties-file FILE	Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.	Properties (configuration) file.
--driver-memory MEM	Memory for driver (e.g. 1000M, 2G) (Default: 1024M).	Driver memory.
--driver-java-options	Extra Java options to pass to the driver.
--driver-library-path	Extra library path entries to pass to the driver.
--driver-class-path	Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.
--executor-memory MEM	Memory per executor (e.g. 1000M, 2G) (Default: 1G).	Memory per executor.
--proxy-user NAME	User to impersonate when submitting the application. This argument does not work with --principal / --keytab.
--help, -h	Show this help message and exit.	spark-submit --help prints the command-line help.
--verbose, -v	Print additional debug output.
--version	Print the version of current Spark.	spark-submit --version prints the current version.
Cluster deploy mode only:		Options that apply only in cluster deploy mode.
--driver-cores NUM	Number of cores used by the driver, only in cluster mode (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise	If given, restarts the driver on failure.
Spark standalone, Mesos or K8s with cluster deploy mode only:
--kill SUBMISSION_ID	If given, kills the driver specified.
--status SUBMISSION_ID	If given, requests the status of the driver specified.
Spark standalone, Mesos and Kubernetes only:
--total-executor-cores NUM	Total cores for all executors.
Spark standalone, YARN and Kubernetes only:
--executor-cores NUM	Number of cores used by each executor. (Default: 1 in YARN and K8S modes, or all available cores on the worker in standalone mode).
Spark on YARN and Kubernetes only:		Options that apply to YARN and Kubernetes deployments.
--num-executors NUM	Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM.	Number of executors.
--principal PRINCIPAL	Principal to be used to login to KDC.
--keytab KEYTAB	The full path to the file that contains the keytab for the principal specified above.
Spark on YARN only:		Options that apply only to YARN deployments.
--queue QUEUE_NAME	The YARN queue to submit to (Default: "default").	Queue name.
--archives ARCHIVES	Comma separated list of archives to be extracted into the working directory of each executor.
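
The same property can arrive from several places: the properties file (conf/spark-defaults.conf unless --properties-file is given), the spark-submit flags, and the SparkConf built inside the application. Spark's documented precedence is that values set in code win over spark-submit flags, which in turn win over the properties file. The following is a minimal plain-Python sketch of that merge order; the function name and the sample values are made up for illustration and are not part of Spark.

```python
def effective_conf(properties_file, submit_flags, in_code):
    """Merge three config sources; later sources override earlier ones."""
    merged = {}
    # Lowest precedence first, highest precedence last.
    for source in (properties_file, submit_flags, in_code):
        merged.update(source)
    return merged


# Hypothetical values, chosen only to show which source wins.
properties_file = {"spark.executor.memory": "4g", "spark.speculation": "false"}
submit_flags = {"spark.executor.memory": "16g"}   # e.g. --executor-memory 16G
in_code = {"spark.speculation": "true"}           # e.g. set on SparkConf

conf = effective_conf(properties_file, submit_flags, in_code)
print(conf)
```

One caveat this sketch does not model: some properties, such as spark.driver.memory in client mode, must be set before the driver JVM starts, so setting them in code has no effect and they belong on the command line or in the properties file.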

II. spark-submit for Scala applications

1. On a YARN cluster

1.1 spark-submit command template

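# Set QUEUE_NAME to the target YARN queue before running.
# Note: --driver-cores only takes effect in cluster deploy mode, so it has no effect with --deploy-mode client.
# Note: spark.storage.memoryFraction and spark.shuffle.memoryFraction are legacy settings; since Spark 1.6 they are read only when spark.memory.useLegacyMode=true, and Spark 3.0 removed legacy mode.
# Note: Spark 3.1 renamed spark.blacklist.enabled to spark.excludeOnFailure.enabled; the old name is deprecated.
spark-submit --class TestClass \
--master yarn \
--queue ${QUEUE_NAME} \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--driver-cores 2 \
--num-executors 4 \
--executor-cores 4 \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.shuffle.partitions=6400 \
--conf spark.default.parallelism=6400 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--conf spark.hadoop.hive.exec.orc.split.strategy=ETL \
--name scala_test \
AtestSparkApplication.jar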
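
When sizing --executor-memory on YARN, keep in mind that each executor container also needs off-heap overhead on top of the heap. The sketch below estimates the container size for the template's values (4 executors of 16G). It assumes Spark's default overhead rule (10% of executor memory, with a 384 MiB minimum) and ignores YARN's rounding to its allocation increment; the function and constant names are my own.

```python
MIN_OVERHEAD_MIB = 384    # assumed default minimum overhead
OVERHEAD_FACTOR = 0.10    # assumed default overhead fraction


def container_memory_mib(executor_memory_mib, factor=OVERHEAD_FACTOR):
    """Heap plus estimated overhead for one executor container, in MiB."""
    overhead = max(MIN_OVERHEAD_MIB, int(executor_memory_mib * factor))
    return executor_memory_mib + overhead


# Values from the template: --executor-memory 16G, --num-executors 4.
per_executor = container_memory_mib(16 * 1024)
total = per_executor * 4

print("per executor container (MiB):", per_executor)
print("all 4 executors (MiB):", total)
```

Because the template also enables dynamic allocation, the number of executors can grow beyond the initial 4, so the total is a starting point rather than an upper bound.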

1.2 Example object script

2. Local mode

2.1 spark-submit command template

# Note: in local mode the executors run inside the driver JVM, so --executor-memory and the dynamic allocation / shuffle service settings have no practical effect; size the job with --driver-memory.
spark-submit --class TestClass \
--master local \
--deploy-mode client \
--driver-memory 1G \
--conf spark.driver.maxResultSize=1G \
--executor-memory 16G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.shuffle.memoryFraction=0.4 \
--conf spark.blacklist.enabled=true \
--conf spark.speculation=true \
--name scala_test \
AtestSparkApplication.jar
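
The form of the local master URL decides how many worker threads Spark uses: local runs with a single thread, local[K] with K threads, and local[*] with as many threads as the machine has logical cores. The template above passes plain local, so it runs single-threaded. Here is a small sketch of that mapping in plain Python; the helper is hypothetical and only mirrors the documented URL forms, it is not Spark's own parser.

```python
import os
import re


def local_threads(master):
    """Return the worker-thread count implied by a local master URL."""
    if master == "local":
        return 1
    # Accept local[K], local[*] and the local[K,F] form (F = max task failures).
    match = re.fullmatch(r"local\[(\*|\d+)(?:\s*,\s*\d+)?\]", master)
    if match is None:
        raise ValueError("not a local master URL: %r" % master)
    count = match.group(1)
    return (os.cpu_count() or 1) if count == "*" else int(count)


for url in ("local", "local[4]", "local[*]"):
    print(url, "->", local_threads(url))
```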

2.2 Example object script

III. spark-submit for Python scripts

1. On a YARN cluster

1.1 spark-submit command template

(1) A single Python script with no other dependency files

spark-submit \
 --master yarn \
 --queue ${QUEUE_NAME} \
 --deploy-mode client \
 --driver-memory 4G \
 --driver-cores 4 \
 --executor-memory 8G \
 --executor-cores 4 \
 --num-executors 100 \
 --conf spark.default.parallelism=1600 \
 --name "spark_demo_yarn" \
 pyspark_example_yarn.py
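
A quick way to judge whether spark.default.parallelism fits the requested resources is to compare it with the number of task slots. The sketch below does that arithmetic for the command above, assuming one CPU per task (the default spark.task.cpus=1); the function names are my own. The Spark tuning guide suggests roughly 2 to 3 tasks per CPU core as a starting point.

```python
def task_slots(num_executors, executor_cores):
    """Tasks that can run at the same time across all executors."""
    return num_executors * executor_cores


def tasks_per_slot(parallelism, slots):
    """How many tasks each slot processes on average in one stage."""
    return parallelism / slots


# Values from the command above.
slots = task_slots(num_executors=100, executor_cores=4)
ratio = tasks_per_slot(parallelism=1600, slots=slots)

print("task slots:", slots)
print("tasks per slot:", ratio)
```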

(2) A Python script plus one or more text files
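
For this case the text files can be shipped with --files, which takes a comma-separated list and places each file in the working directory of every executor. The sketch below assembles such a command in Python and shows the names under which the files would appear on the executors. The file names stopwords.txt and cities.txt are hypothetical, and the helper functions are my own, not part of Spark.

```python
import os


def files_option(paths):
    """Build the --files argument pair from a list of local paths."""
    return ["--files", ",".join(paths)]


def executor_names(paths):
    """Files land in the executor working directory under their base names."""
    return [os.path.basename(p) for p in paths]


# Hypothetical input files that the job needs.
paths = ["conf/stopwords.txt", "data/cities.txt"]

cmd = (["spark-submit", "--master", "yarn", "--deploy-mode", "client"]
       + files_option(paths)
       + ["pyspark_example_yarn.py"])

print(" ".join(cmd))
print(executor_names(paths))
```

In client mode the driver runs on the submitting machine, so driver-side code would read the files from their local paths (or via SparkFiles.get) rather than from an executor working directory.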

(3) A Python script plus one or more dependent Python scripts
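
Dependent modules can be passed with --py-files, either as individual .py files or bundled into one .zip that is put on the PYTHONPATH. The following sketch bundles modules into a zip with the standard library and builds the matching command. The module helpers.py, the archive name deps.zip, and the function zip_modules are all hypothetical, chosen only for illustration.

```python
import os
import tempfile
import zipfile


def zip_modules(module_paths, zip_path):
    """Bundle .py files at the top level of a zip archive for --py-files."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in module_paths:
            # Store each module under its base name so "import helpers" works.
            archive.write(path, arcname=os.path.basename(path))
    return zip_path


# Create a throwaway helper module so the sketch runs on its own.
workdir = tempfile.mkdtemp()
helper = os.path.join(workdir, "helpers.py")
with open(helper, "w") as handle:
    handle.write("def double(x):\n    return 2 * x\n")

deps = zip_modules([helper], os.path.join(workdir, "deps.zip"))

cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "client",
       "--py-files", deps, "pyspark_example_yarn.py"]
print(" ".join(cmd))
```

With the archive on the PYTHONPATH, the main script would import the bundled module by name, for example `import helpers`.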

1.2 Example script: pyspark_example_yarn.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function

from pyspark.sql import SparkSession


if __name__ == '__main__':

    # Hive support is needed to query the Hive table below.
    spark = SparkSession.builder \
                .appName("Word Count") \
                .config("spark.some.config.option", "some-value") \
                .enableHiveSupport() \
                .getOrCreate()

    # Count the distinct users who bought either SKU within the date range.
    df = spark.sql("""
            SELECT
                COUNT(a.user_id)
            FROM
                (
                    SELECT
                        user_id
                    FROM
                        app.app_purchase_table
                    WHERE
                        dt >= '2019-01-01'
                        AND dt <= '2020-12-31'
                        AND sku_code IN (700052, 721057)
                    GROUP BY
                        user_id
                ) a
            """)
    df.show()

    # Release the application's resources when done.
    spark.stop()

2. Local mode

2.1 spark-submit command template

spark-submit \
 --master local \
 --deploy-mode client \
 --name "spark_demo_local" \
 pyspark_example_local.py

2.2 Example script: pyspark_example_local.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function

from pyspark.sql import SparkSession


if __name__ == '__main__':

    spark = SparkSession.builder \
                .appName("Word Count") \
                .config("spark.some.config.option", "some-value") \
                .enableHiveSupport() \
                .getOrCreate()

    # Sum the ids 501..4999 and print the collected result.
    print(spark.range(5000).where("id > 500").selectExpr("sum(id)").collect())
    # Display the ids greater than 400 (show() prints the first 20 rows by default).
    spark.range(500).where("id > 400").show()

    # Release the application's resources when done.
    spark.stop()
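
As a sanity check on what the local script should print, the two queries can be reproduced in plain Python without Spark. This is only a cross-check of the arithmetic, not a replacement for running the job; the variable names are my own.

```python
# spark.range(5000).where("id > 500").selectExpr("sum(id)")
expected_sum = sum(i for i in range(5000) if i > 500)

# spark.range(500).where("id > 400")
rows = [i for i in range(500) if i > 400]

print("sum(id):", expected_sum)
print("matching rows:", len(rows), "from", rows[0], "to", rows[-1])
```

The Spark job should therefore collect a single row holding 12372250, and the second query matches 99 rows (ids 401 to 499), of which show() displays the first 20.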
