Spark Optimization

Latest recommended article published 2021-03-12 15:41:48

vitorl_Ch

Category: Basics

Article link: https://blog.csdn.net/weixin_45116848/article/details/103979254

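This article summarizes the main aspects of Spark optimization: resource tuning, parallelism tuning, code optimization, data locality, memory management, shuffle optimization, tuning executor off-heap memory, solutions for data skew, and Spark troubleshooting, with the goal of improving the performance and stability of Spark applications.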

Spark Optimization Summary

1. Resource Tuning

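Default resource parameters specified when deploying the Spark cluster (configuration file)
- spark-env.sh under conf in the Spark installation directory
- SPARK_WORKER_CORES
- SPARK_WORKER_MEMORY
- SPARK_WORKER_INSTANCES: number of workers started on each machine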
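Allocating more resources to the current application when submitting it (Linux submit command)
- Submit command options
- --executor-cores (if not set, each worker starts one executor for the current application by default, and that executor uses all of the worker's cores and 1 GB of memory)
- --executor-memory
- --total-executor-cores (if not set, all cores remaining in the cluster are allocated to the current application by default)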
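The submit options above can be assembled programmatically. The following is a minimal sketch, assuming a standalone cluster; the master URL, class name, and jar path are hypothetical placeholders, and `build_submit_command` and `expected_executors` are helper names invented for this illustration, not part of Spark.

```python
import shlex


def build_submit_command(master, app_jar, main_class,
                         executor_cores, executor_memory, total_executor_cores):
    """Return a spark-submit argument list with explicit resource options."""
    return [
        "spark-submit",
        "--master", master,
        "--class", main_class,
        "--executor-cores", str(executor_cores),              # cores per executor
        "--executor-memory", executor_memory,                 # memory per executor, e.g. "4g"
        "--total-executor-cores", str(total_executor_cores),  # cap on cores for the whole application
        app_jar,
    ]


def expected_executors(total_executor_cores, executor_cores):
    """Rough upper bound on executors, assuming the workers have enough free cores and memory."""
    return total_executor_cores // executor_cores


if __name__ == "__main__":
    # All concrete values below are placeholders for illustration.
    cmd = build_submit_command(
        master="spark://master-host:7077",
        app_jar="/path/to/app.jar",
        main_class="com.example.Main",
        executor_cores=2,
        executor_memory="4g",
        total_executor_cores=8,
    )
    print(" ".join(shlex.quote(part) for part in cmd))
    print("executors (at most):", expected_executors(8, 2))
```

Setting both the per-executor cores and the total core cap makes the executor count predictable, instead of letting one executor per worker take every core.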
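Setting them in the application code or in spark-defaults.conf (code-level settings)
- spark.executor.cores
- spark.executor.memory
- spark.cores.max
- Dynamic resource allocation
  - spark.shuffle.service.enabled true // enable the external shuffle service
  - spark.shuffle.service.port 7337 // port of the shuffle service (7337 is Spark's default; it must match the port configured in yarn-site)
  - spark.dynamicAllocation.enabled true // enable dynamic resource allocation
  - spark.dynamicAllocation.minExecutors 1 // minimum number of executors allocated to each application
  - spark.dynamicAllocation.maxExecutors 30 // maximum number of executors allocated concurrently to each application
  - spark.dynamicAllocation.schedulerBacklogTimeout 1s // if tasks have been pending for longer than this timeout (default 1s), executors are requested in rounds, adding 1, 2, 4, 8, ... executors per round (as far as resources are available)
  - spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s // controls the interval between subsequent request rounds (defaults to the same value as schedulerBacklogTimeout)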
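The exponential ramp-up controlled by the two backlog timeouts can be illustrated with a small simulation. This is a simplified model, not Spark's actual scheduler: it assumes the task backlog persists the whole time and that every request is granted immediately. The function name `executor_ramp_up` is made up for this sketch.

```python
def executor_ramp_up(start_executors, max_executors,
                     backlog_timeout_s=1.0, sustained_timeout_s=1.0):
    """Return (seconds_of_backlog, total_executors) after each request round.

    The first round fires after backlog_timeout_s; later rounds fire every
    sustained_timeout_s. Each round asks for twice as many executors as the
    previous one (1, 2, 4, 8, ...), capped at max_executors.
    """
    timeline = []
    total = start_executors
    to_add = 1
    elapsed = backlog_timeout_s
    while total < max_executors:
        total = min(total + to_add, max_executors)
        timeline.append((elapsed, total))
        to_add *= 2
        elapsed += sustained_timeout_s
    return timeline


if __name__ == "__main__":
    # Values taken from the settings listed above: min 1, max 30, 1s and 5s timeouts.
    for seconds, executors in executor_ramp_up(1, 30, 1.0, 5.0):
        print(f"after {seconds:>4.1f}s of backlog: {executors} executors")
```

Under these assumptions, a larger sustained timeout slows the ramp-up considerably, so short bursty jobs may finish before the application reaches its maximum executor count.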

2. Parallelism Tuning

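When reading HDFS data, lowering the block size effectively raises the number of partitions in the RDD; sc.textFile(xx, minPartitions) sets a minimum partition count directly
sc.parallelize(xxx, numPartitions)
sc.makeRDD(xxx, numPartitions)
sc.parallelizePairs(xxx, numPartitions)
repartition/coalesce // repartition always shuffles, whether it increases or decreases the partition count; coalesce can reduce the partition count without a shuffle
reduceByKey/groupByKey/join (xxx, numPartitions)
spark.default.parallelism (not set by default)
spark.sql.shuffle.partitions (default 200)
Custom number of partitions (for example through a custom partitioner)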
If the data comes in through Spark Streaming
- Receiver: spark.streaming.blockInterval (default 200ms)
- Direct: the number of partitions of the Kafka topics being read
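For the two streaming modes, the resulting partition counts can be estimated with simple arithmetic. In receiver mode, each receiver produces roughly one block (and therefore one partition) per block interval within a batch; in the direct Kafka approach, each Kafka partition maps to one RDD partition. The sketch below uses hypothetical function names and topic names for illustration.

```python
def receiver_partitions_per_batch(batch_interval_ms, block_interval_ms=200, receivers=1):
    """Approximate partitions per batch in receiver mode: batch interval / block interval."""
    if block_interval_ms <= 0:
        raise ValueError("block_interval_ms must be positive")
    return receivers * (batch_interval_ms // block_interval_ms)


def direct_partitions(kafka_partitions_by_topic):
    """Partitions per batch in direct mode: one per Kafka partition across all topics."""
    return sum(kafka_partitions_by_topic.values())


if __name__ == "__main__":
    # A 5-second batch with the default 200ms block interval.
    print(receiver_partitions_per_batch(5000))
    # Lowering the block interval to 100ms raises the parallelism.
    print(receiver_partitions_per_batch(5000, block_interval_ms=100))
    # Hypothetical topics with 6 and 4 Kafka partitions.
    print(direct_partitions({"clicks": 6, "orders": 4}))
```

In receiver mode, lowering spark.streaming.blockInterval raises parallelism; in direct mode, parallelism is raised by adding Kafka partitions or by repartitioning after the read.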