Spark Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical memory used

This post walks through a memory problem hit while running a Spark SQL job. The error was at first misread as a plain shortage of executor memory, and increasing executor memory did not fix it. A closer reading of the log showed that the YARN memory limit was being exceeded because the off-heap (overhead) memory was set too low. Raising `spark.yarn.executor.memoryOverhead` to 2G made the job succeed, underlining how important it is to read logs carefully and understand Spark's parameters.


A Spark SQL job at the company was failing, and a colleague asked me to help figure out why.
The log:

19-10-2021 10:12:06 CST SPARK_SQL-1632390310963 INFO - SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
19-10-2021 10:12:06 CST SPARK_SQL-1632390310963 INFO - SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19-10-2021 10:13:17 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:13:17 ERROR cluster.YarnScheduler: Lost executor 1 on xxxxx: Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:14:32 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:14:32 ERROR cluster.YarnScheduler: Lost executor 2 on xxxxxx: Container killed by YARN for exceeding memory limits. 11.5 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:15:21 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:15:21 ERROR cluster.YarnScheduler: Lost executor 3 on xxxxxx: Container killed by YARN for exceeding memory limits. 12.2 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:16:10 ERROR cluster.YarnScheduler: Lost executor 4 on xxxxxxxxxxx: Container killed by YARN for exceeding memory limits. 11.9 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:16:10 ERROR scheduler.TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:16:10 ERROR datasources.FileFormatWriter: Aborting job 795511c8-163e-41c7-8b2b-eca6672b1b69.
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, wp-ns-prod-cdh-172-22-1-8, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 11.9 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - Driver stacktrace:
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at scala.Option.foreach(Option.scala:257)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
………………

The important part is this line:

Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

At first

At first I only read the first half of the message and missed the rest, so I assumed the job was simply short on memory (I also didn't ask my colleague right away how much data the task reads; otherwise I might have spotted the real cause sooner), and I increased the executor memory.
The job still failed with the same error, and the memory usage reported in the message kept growing, which was puzzling.

Later

I asked my colleague how much data the job actually reads, and the answer was: not much.
Then I re-read the error message carefully, noticed the second half, and realized the problem was the off-heap (overhead) memory.

  • spark.yarn.executor.memoryOverhead defaults to max(executorMemory * 0.10, 384M), so however much you raise executorMemory, the overhead only grows by a tenth of that. Instead of pushing executorMemory even higher, it made more sense to increase spark.yarn.executor.memoryOverhead directly, as the sketch below illustrates.
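
A rough back-of-the-envelope check of the numbers (the 10 GB executor-memory figure below is an assumption, chosen because it is consistent with the 11 GB container limit in the log): the YARN container request is executor memory plus the overhead, and the default overhead is only 10% of executor memory with a 384 MiB floor.

```bash
# Minimal sketch with assumed numbers: estimate the YARN container size for one executor.
# Default overhead = max(10% of executor memory, 384 MiB).
EXECUTOR_MEM_MB=$((10 * 1024))   # assuming --executor-memory 10g
MIN_OVERHEAD_MB=384
OVERHEAD_MB=$(( EXECUTOR_MEM_MB / 10 > MIN_OVERHEAD_MB ? EXECUTOR_MEM_MB / 10 : MIN_OVERHEAD_MB ))
CONTAINER_MB=$(( EXECUTOR_MEM_MB + OVERHEAD_MB ))
echo "YARN container request: ${CONTAINER_MB} MiB"   # 11264 MiB, i.e. the 11 GB limit in the log
```

Under this assumption, raising executorMemory by another 2 GB only buys roughly 200 MB of extra overhead headroom, which is why the job kept failing in the same way.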

When resubmitting the job, I added the config: --conf "spark.yarn.executor.memoryOverhead=2G"
Re-ran the job, and this time it succeeded.
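
For reference, a minimal sketch of what the full submit command could look like with this setting; the master/deploy mode, resource sizes, main class, and jar name are placeholders rather than the actual job's values. (On Spark 2.3 and later the same setting is also exposed as spark.executor.memoryOverhead.)

```bash
# Sketch only: the class, jar, and resource sizes below are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 10g \
  --executor-cores 4 \
  --conf "spark.yarn.executor.memoryOverhead=2G" \
  --class com.example.MySparkSqlJob \
  my-spark-sql-job.jar
```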

Wrap-up

The parameters used when submitting a Spark job really matter; they can decide whether the job runs at all.
When a job fails, reading the error message carefully is just as important.
I also need to go back and review Spark's configuration parameters in more depth.
