Spark on YARN Tuning Guide - CSDN Blog

Article link: https://blog.csdn.net/weixin_43661914/article/details/147400563

1. Resource Allocation Strategy

1.1 YARN Resource Configuration

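<!-- yarn-site.xml core settings -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>maximum memory of a single container, in MB</value> <!-- cannot usefully exceed yarn.nodemanager.resource.memory-mb; e.g. 57344 (56 GB) on a 64 GB node -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>total memory per node available to containers, in MB</value> <!-- typically leave about 20% of physical memory for the OS -->
</property>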

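1.2 Executor Resource Configuration

# Recommended configuration example:
#   --executor-memory  heap memory per executor
#   --executor-cores   cores per executor
#   --num-executors    fixed number of executors
#   --driver-memory    driver memory
spark-submit \
  --executor-memory 8G \
  --executor-cores 4 \
  --num-executors 20 \
  --driver-memory 4G \
  <application-jar>   # placeholder for your application JAR or .py file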

Golden ratio formulas:
total cores = number of executors × cores per executor
recommended memory per executor = cores per executor × 4-8 GB
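
The two formulas above are simple enough to check by hand, but a small helper makes the trade-off visible. The following is a minimal Python sketch; the function name `size_executors` and its return shape are illustrative assumptions, not part of Spark.

```python
def size_executors(num_executors: int, cores_per_executor: int,
                   gb_per_core_range: tuple = (4, 8)) -> dict:
    """Apply the rule of thumb: total cores and suggested executor memory range."""
    low_gb, high_gb = gb_per_core_range
    return {
        # total cores = number of executors x cores per executor
        "total_cores": num_executors * cores_per_executor,
        # suggested memory per executor = cores x 4-8 GB, as (min, max) in GB
        "suggested_memory_gb": (cores_per_executor * low_gb,
                                cores_per_executor * high_gb),
    }


plan = size_executors(num_executors=20, cores_per_executor=4)
print(plan)
```

With the values from the example command (20 executors, 4 cores each) this gives 80 total cores and a suggested 16 to 32 GB per executor, so the 8G in the example sits on the lean side of the rule and may need raising for memory-heavy jobs.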

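2. Memory Optimization

2.1 Memory Allocation Strategy

# Share of (JVM heap - 300 MB reserved) given to Spark's unified execution + storage region
spark.memory.fraction=0.6
# Share of that unified region in which cached blocks are protected from eviction by execution
spark.memory.storageFraction=0.5
# Off-heap overhead per executor container (default: 10% of executor memory, minimum 384 MB)
spark.executor.memoryOverhead=1G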

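2.2 Memory Tuning Recommendations

Avoiding OOM: increase spark.executor.memoryOverhead when YARN kills executors for exceeding their container memory limit; heap OOM errors inside the JVM call for more --executor-memory instead

Heavy-shuffle workloads: consider lowering spark.memory.fraction (down to about 0.4) to leave more heap for user data structures and ease GC pressure; this also shrinks the unified execution/storage region, so check spill metrics after the change

Cache-heavy workloads: raise spark.memory.storageFraction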
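
To see what these settings mean in megabytes, the arithmetic can be sketched in Python. This is a rough model under stated assumptions: it uses Spark's documented 300 MB heap reservation and the default overhead rule (10% of executor memory, at least 384 MB), and it ignores YARN rounding container sizes up to its allocation increment. The function name `memory_layout` is hypothetical.

```python
RESERVED_MB = 300        # heap reserved by Spark's unified memory manager
MIN_OVERHEAD_MB = 384    # floor for the default memoryOverhead
OVERHEAD_FACTOR = 0.10   # default overhead as a share of executor memory


def memory_layout(executor_memory_mb, memory_fraction=0.6,
                  storage_fraction=0.5, overhead_mb=None):
    """Estimate executor memory regions in MB (rounded to whole MB)."""
    usable = executor_memory_mb - RESERVED_MB
    unified = usable * memory_fraction            # execution + storage
    storage_protected = unified * storage_fraction
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB, executor_memory_mb * OVERHEAD_FACTOR)
    return {
        "unified_mb": round(unified),
        "storage_protected_mb": round(storage_protected),
        "user_mb": round(usable - unified),       # user data structures
        "overhead_mb": round(overhead_mb),
        "container_mb": round(executor_memory_mb + overhead_mb),
    }


# 8G executor with the 1G overhead from section 2.1
layout = memory_layout(8 * 1024, overhead_mb=1024)
print(layout)
```

The `container_mb` figure is what YARN has to grant per executor, which is why raising memoryOverhead increases the container size without changing the heap.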

3. Parallelism Tuning

3.1 Partition Control

spark.default.parallelism = total cores × 2-3
spark.sql.shuffle.partitions = 200-1000  // adjust to the data volume

// Manual repartitioning
df.repartition(200).write.parquet(...)
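
One way to turn "adjust to the data volume" into a number is to aim for a target partition size. The sketch below is an illustration only: the 128 MB target is an assumption rather than a value from this article, the result is clamped to the 200-1000 range given above, and both function names are hypothetical.

```python
import math


def suggest_parallelism(total_cores: int, factor: int = 2) -> int:
    """spark.default.parallelism = total cores x 2-3."""
    return total_cores * factor


def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_bytes: int = 128 * 1024**2,
                               lower: int = 200, upper: int = 1000) -> int:
    """Pick a partition count so each partition is near the target size,
    clamped to the 200-1000 range."""
    wanted = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(lower, min(upper, wanted))


print(suggest_parallelism(80, factor=3))
print(suggest_shuffle_partitions(64 * 1024**3))  # 64 GB of shuffle data
```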

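3.2 Handling Data Skew

Solutions:

Salting: key = originKey + "_" + random.nextInt(100)

Two-stage aggregation: aggregate locally on the salted key first, then globally on the original key

Filtering abnormal keys: df.filter("key != 'abnormal_value'")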
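
Salting and two-stage aggregation work together, which is easier to see on a toy dataset. The following plain-Python sketch imitates the idea without any Spark API: stage 1 counts by salted key (in Spark this would spread one hot key over many tasks), stage 2 strips the salt and sums the partial counts. The function name `salted_count` and the sample keys are assumptions for illustration.

```python
import random
from collections import Counter


def salted_count(keys, num_salts=100, seed=0):
    """Count keys in two stages: by salted key, then by original key."""
    rng = random.Random(seed)

    # Stage 1: local aggregation on "originKey_salt"
    partial = Counter()
    for key in keys:
        partial[f"{key}_{rng.randrange(num_salts)}"] += 1

    # Stage 2: strip the salt (text after the last "_") and aggregate globally
    total = Counter()
    for salted_key, count in partial.items():
        original_key = salted_key.rsplit("_", 1)[0]
        total[original_key] += count
    return partial, total


keys = ["hot"] * 1000 + ["cold"] * 10   # one heavily skewed key
partial, total = salted_count(keys, num_salts=10)
print(total)
```

After stage 1 the 1000 "hot" records are split across up to 10 salted keys, so no single group holds all of them, while the final totals match an unsalted count.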

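4. Shuffle Optimization

4.1 Core Parameters

# In-memory buffer for each shuffle file output stream
spark.shuffle.file.buffer=64k
# Maximum size of map outputs each reduce task fetches at the same time
spark.reducer.maxSizeInFlight=96m
# Maximum retries for shuffle fetches that fail with IO errors
spark.shuffle.io.maxRetries=10
# Enable the external shuffle service
spark.shuffle.service.enabled=true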
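
Because spark.reducer.maxSizeInFlight is a per-reduce-task buffer, raising it multiplies across the tasks running concurrently in an executor. The sketch below estimates a rough upper bound, assuming one task per core (spark.task.cpus=1); the helper names are hypothetical and the size parser only handles the k/m/g suffixes used in this article.

```python
def parse_size_mb(value: str) -> float:
    """Convert a size such as '64k', '96m' or '1g' to megabytes."""
    units = {"k": 1 / 1024, "m": 1, "g": 1024}
    return float(value[:-1]) * units[value[-1].lower()]


def fetch_buffer_mb_per_executor(max_size_in_flight: str,
                                 executor_cores: int) -> float:
    """Rough upper bound on fetch buffers held by one executor,
    assuming one running reduce task per core."""
    return parse_size_mb(max_size_in_flight) * executor_cores


print(fetch_buffer_mb_per_executor("96m", executor_cores=4))
```

With 96m and 4 cores the bound is 384 MB per executor, which is worth comparing against the executor heap before raising the value further.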

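4.2 Optimization Options

Tungsten engine: spark.sql.tungsten.enabled=true was a Spark 1.5-era switch; in current versions Tungsten is always on, so the setting can be dropped
Sort shuffle: spark.shuffle.manager=sort has been the default since Spark 1.2 and is the only built-in shuffle manager since Spark 2.0, so it rarely needs to be set
Compression codec:

spark.shuffle.compress=true
spark.io.compression.codec=snappy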

5. Dynamic Resource Allocation

5.1 Enabling Dynamic Allocation

spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=50
spark.dynamicAllocation.initialExecutors=10
spark.shuffle.service.enabled=true
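
These settings only make sense together if min ≤ initial ≤ max and the shuffle service is on. A small pre-submit check can catch mistakes early. This is a sketch that follows the configuration shown above; the function name `check_dynamic_allocation` is hypothetical and the check is not something Spark itself provides.

```python
def check_dynamic_allocation(conf: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    if conf.get("spark.dynamicAllocation.enabled", "false") != "true":
        return []  # dynamic allocation is off, nothing to check

    problems = []
    min_execs = int(conf.get("spark.dynamicAllocation.minExecutors", 0))
    init_execs = int(conf.get("spark.dynamicAllocation.initialExecutors",
                              min_execs))
    max_raw = conf.get("spark.dynamicAllocation.maxExecutors")
    max_execs = int(max_raw) if max_raw is not None else None  # unset = no cap

    if init_execs < min_execs:
        problems.append("initialExecutors is below minExecutors")
    if max_execs is not None and max_execs < max(min_execs, init_execs):
        problems.append("maxExecutors is below minExecutors or initialExecutors")
    if conf.get("spark.shuffle.service.enabled", "false") != "true":
        problems.append("spark.shuffle.service.enabled is not true")
    return problems


article_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "50",
    "spark.dynamicAllocation.initialExecutors": "10",
    "spark.shuffle.service.enabled": "true",
}
print(check_dynamic_allocation(article_conf))
```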

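5.2 Strategy Recommendations

Long-running jobs: raise spark.dynamicAllocation.executorIdleTimeout (default 60s) so idle executors are not released and re-requested between stages

Short jobs: lower spark.dynamicAllocation.schedulerBacklogTimeout (default 1s) so executors are requested sooner once tasks queue up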

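6. Example Configurations

6.1 Small to Medium Clusters

spark-submit \
  --executor-memory 4G \
  --executor-cores 2 \
  --num-executors 20 \
  --conf spark.sql.shuffle.partitions=200 \
  <application-jar>   # placeholder for your application JAR or .py file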
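
When several jobs share a tuning profile, generating the command from one dictionary avoids copy-paste drift. The sketch below builds the argument list with the flags used in this article; the function name `build_spark_submit` and the file name `app.jar` are hypothetical placeholders.

```python
import shlex


def build_spark_submit(app, executor_memory, executor_cores,
                       num_executors, conf=None):
    """Build a spark-submit argument list from a tuning profile."""
    argv = [
        "spark-submit",
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--num-executors", str(num_executors),
    ]
    # Emit extra settings as --conf key=value, sorted for stable output
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app)  # the application must come after all options
    return argv


argv = build_spark_submit("app.jar", "4G", 2, 20,
                          {"spark.sql.shuffle.partitions": 200})
print(shlex.join(argv))
```

The list form can be passed directly to `subprocess.run`, while `shlex.join` gives a copy-pasteable shell line.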

7. Monitoring and Logs

7.1 Metrics to Monitor

Key metrics:

Executor CPU and memory utilization
Shuffle read/write throughput
Share of time spent in GC (should stay below 10%)
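
The GC threshold is a ratio of GC time to task time, both of which are shown per executor on the Executors tab of the Spark UI. A minimal sketch of the check, with hypothetical function names and input values read off by hand:

```python
def gc_time_ratio(gc_time_ms: float, task_time_ms: float) -> float:
    """Share of task time spent in JVM garbage collection."""
    if task_time_ms <= 0:
        return 0.0  # no task time recorded yet
    return gc_time_ms / task_time_ms


def gc_needs_attention(gc_time_ms: float, task_time_ms: float,
                       threshold: float = 0.10) -> bool:
    """True when the GC share reaches the 10% guideline."""
    return gc_time_ratio(gc_time_ms, task_time_ms) >= threshold


print(gc_needs_attention(gc_time_ms=1500, task_time_ms=10000))
```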

7.2 Log Analysis

# Enable event logging
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-logs

# Command for viewing the logs
yarn logs -applicationId <appId>

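8. Caveats

Avoid over-requesting resources: total requested memory (executors plus driver, including memoryOverhead) ≤ memory available in YARN
Check for parameter conflicts: dynamic allocation and a manually fixed --num-executors should not be combined; depending on the Spark version, the fixed count either disables dynamic allocation or is only used as the initial executor count
Monitor GC: frequent full GCs mean the memory configuration needs adjusting
Caching and data locality: prefer the MEMORY_AND_DISK_SER storage level, so cached partitions that do not fit in memory spill to disk in serialized form instead of being recomputed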
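
The first caveat can be checked before submitting. The sketch below estimates the total container memory a job will ask for and compares it with cluster capacity. It assumes cluster deploy mode (the driver also runs in a YARN container), the default overhead rule of 10% with a 384 MB floor, and a hypothetical node count; it ignores per-node packing and YARN rounding, so treat the result as a lower bound.

```python
def container_mb(memory_mb: int, overhead_factor: float = 0.10,
                 min_overhead_mb: int = 384) -> int:
    """Heap plus default memoryOverhead for one container."""
    return memory_mb + max(min_overhead_mb, int(memory_mb * overhead_factor))


def total_request_mb(num_executors: int, executor_memory_mb: int,
                     driver_memory_mb: int) -> int:
    """Total memory requested from YARN: all executors plus the driver."""
    return (num_executors * container_mb(executor_memory_mb)
            + container_mb(driver_memory_mb))


def fits_in_yarn(request_mb: int, node_memory_mb: int, num_nodes: int) -> bool:
    """Compare the request with yarn.nodemanager.resource.memory-mb x nodes."""
    return request_mb <= node_memory_mb * num_nodes


# 20 executors x 8G plus a 4G driver, as in section 1.2
request = total_request_mb(20, 8 * 1024, 4 * 1024)
print(request, fits_in_yarn(request, node_memory_mb=57344, num_nodes=4))
```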