134. Spark Core Programming Advanced: spark-submit Basics and Examples

Basic parameters

wordcount.sh

/usr/local/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master spark://spark-project-1:7077 \
--deploy-mode client \
--conf <key>=<value> \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar \
${1}

The options in the spark-submit command above are explained below:
--class: the main class of the Spark application, i.e., the entry point of the job; usually a Java or Scala class containing a main method. The fully qualified name is required, e.g., org.leo.spark.study.WordCount.
--master: the master URL of the cluster manager. In standalone mode this is the master's IP address plus port, e.g., spark://192.168.0.101:7077; the default standalone port is 7077.
--deploy-mode: decides whether the driver process is launched on a worker node or on the machine submitting the job. The default is client, which starts the driver on the local machine; with cluster, the driver is started on a worker.
--conf: sets any Spark configuration property, in key=value format; if the value contains spaces, wrap the whole key=value pair in double quotes (see the example after this list).
application-jar: the full path, on the submitting machine, of the packaged jar containing the Spark application.
application-arguments: arguments passed to the main method of the main class. In a shell script they are forwarded with ${1}-style positional references; inside, e.g., a Java main method they are then read from args[0] and so on.
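As an illustration of the quoting rule for --conf, here is a sketch based on the script above. spark.eventLog.enabled and spark.executor.extraJavaOptions are real Spark properties, but the values are only illustrative; the GC flags are simply an example of a value containing spaces:

# the quoted key=value pair below would otherwise be split at the space
/usr/local/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master spark://spark-project-1:7077 \
--deploy-mode client \
--conf spark.eventLog.enabled=false \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar \
${1}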

Passing arguments to the main class

Anything placed after the application jar is handed to the main class; here the two arguments hello and haha are hardcoded:

/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master spark://spark-project-1:7077 \
--deploy-mode client \
--num-executors 1 \
--driver-memory 100m \
--executor-memory 450m \
--executor-cores 1 \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar \
hello \
haha
The same arguments can instead be forwarded from the shell's positional parameters. Save the following as standalone-client.sh:
/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master spark://spark-project-1:7077 \
--deploy-mode client \
--num-executors 1 \
--driver-memory 100m \
--executor-memory 450m \
--executor-cores 1 \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar \
${1} \
${2}

The script is then invoked with the arguments appended:

./standalone-client.sh hello haha

Examples

  1. Run in local mode with 8 threads
    --class specifies the main class to execute
    --master specifies the cluster mode: local runs locally with a single thread, while local[8] simulates cluster execution with 8 threads inside the driver process
/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master local[8] \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar
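A common variant (an assumption about intent, not part of the original script) is local[*], which starts one worker thread per logical core on the machine:

# local[*]: as many worker threads as there are logical cores
/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master "local[*]" \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar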
  2. Run in standalone client mode
    --executor-memory specifies the amount of memory per executor; here each executor gets 2G
    --total-executor-cores specifies the total number of CPU cores across all executors; here 100 cores in total
/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master spark://192.168.0.101:7077 \
--executor-memory 2G \
--total-executor-cores 100 \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar
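These resource flags are shorthand for configuration properties, so a sketch of the equivalent --conf form looks like this (spark.executor.memory and spark.cores.max are the properties behind --executor-memory and --total-executor-cores in standalone mode):

/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master spark://192.168.0.101:7077 \
--conf spark.executor.memory=2g \
--conf spark.cores.max=100 \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar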
  3. Run in standalone cluster mode
    the --supervise flag tells Spark to monitor the driver node and restart the driver automatically if it dies
/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master spark://192.168.0.101:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 2G \
--total-executor-cores 100 \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar
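If a supervised driver keeps failing and restarting, it can be killed by its driver ID (visible on the master's web UI); this is the invocation documented for standalone mode, with <driverId> as a placeholder:

# kill a supervised driver by ID on the standalone master
/opt/module/spark/bin/spark-class org.apache.spark.deploy.Client kill spark://192.168.0.101:7077 <driverId>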
  4. Run in yarn-cluster mode
    --num-executors specifies how many executors in total are used to run the application
/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master yarn-cluster \
--executor-memory 20G \
--num-executors 50 \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar
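Since Spark 2.0 the yarn-cluster master URL is deprecated; the equivalent submission is written as --master yarn plus --deploy-mode cluster:

/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar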

  5. Run a Python application in standalone client mode
/opt/module/spark/bin/spark-submit \
--master spark://192.168.0.101:7077 \
/usr/local/python-spark-wordcount.py
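If the Python application depends on extra modules, they can be shipped alongside it with --py-files; deps.zip here is a hypothetical archive of those modules:

# deps.zip (hypothetical) is distributed to the executors' PYTHONPATH
/opt/module/spark/bin/spark-submit \
--master spark://192.168.0.101:7077 \
--py-files deps.zip \
/usr/local/python-spark-wordcount.py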

Commonly used configuration

/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master yarn-cluster \
--num-executors 100 \
--executor-cores 2 \
--executor-memory 6G \
--driver-memory 1G \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar
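Recurring flags like these can also be kept in a properties file and passed with --properties-file (my-defaults.conf is a hypothetical file name; without the flag, spark-submit reads conf/spark-defaults.conf). spark.executor.instances, spark.executor.cores, spark.executor.memory, and spark.driver.memory are the properties behind the flags above:

# my-defaults.conf (hypothetical) would contain:
#   spark.executor.instances  100
#   spark.executor.cores      2
#   spark.executor.memory     6g
#   spark.driver.memory       1g
/opt/module/spark/bin/spark-submit \
--class com.zj.spark.core.WordCountCluster \
--master yarn \
--deploy-mode cluster \
--properties-file my-defaults.conf \
/opt/spark-study/mysparkstudy-1.0-SNAPSHOT-jar-with-dependencies.jar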