A Spark SQL job at work started failing, and a colleague asked me to help track down the cause.
The log:
19-10-2021 10:12:06 CST SPARK_SQL-1632390310963 INFO - SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
19-10-2021 10:12:06 CST SPARK_SQL-1632390310963 INFO - SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19-10-2021 10:13:17 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:13:17 ERROR cluster.YarnScheduler: Lost executor 1 on xxxxx: Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:14:32 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:14:32 ERROR cluster.YarnScheduler: Lost executor 2 on xxxxxx: Container killed by YARN for exceeding memory limits. 11.5 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:15:21 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:15:21 ERROR cluster.YarnScheduler: Lost executor 3 on xxxxxx: Container killed by YARN for exceeding memory limits. 12.2 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:16:10 ERROR cluster.YarnScheduler: Lost executor 4 on xxxxxxxxxxx: Container killed by YARN for exceeding memory limits. 11.9 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:16:10 ERROR scheduler.TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - 21/10/19 10:16:10 ERROR datasources.FileFormatWriter: Aborting job 795511c8-163e-41c7-8b2b-eca6672b1b69.
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, wp-ns-prod-cdh-172-22-1-8, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 11.9 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - Driver stacktrace:
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at scala.Option.foreach(Option.scala:257)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
19-10-2021 10:16:10 CST SPARK_SQL-1632390310963 INFO - at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
………………
The key line is this one:
Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager
At first
I didn't notice the second half of the message; reading only the first half, I assumed the executors were simply running out of memory (I also didn't immediately ask my colleague roughly how much data the job reads, or I would have spotted the real cause sooner), so I increased the job's executor memory.
The job still failed with the same error, and the reported memory usage kept climbing, which was puzzling.
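The "11 GB" ceiling in the log is not the executor heap alone but the heap plus YARN's memory overhead allowance. A minimal sketch of that arithmetic (the 10g executor size is a hypothetical figure, not taken from the log):

```python
# YARN enforces a per-container ceiling of executor memory plus
# memoryOverhead; a container exceeding it gets killed.
def container_limit_mb(executor_memory_mb, overhead_mb=None):
    # Default overhead is max(10% of executor memory, 384 MB).
    if overhead_mb is None:
        overhead_mb = max(int(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + overhead_mb

# A hypothetical 10g executor reproduces the 11 GB ceiling seen in the log:
print(container_limit_mb(10 * 1024))  # 11264 MB, i.e. 11 GB
```

So when the process's physical memory crosses that combined limit, YARN kills the container regardless of how the usage splits between heap and off-heap.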
Later
I asked my colleague how much data the job actually reads, and was told it wasn't much.
I then reread the error carefully, noticed the second half of the message, and realized the problem was likely off-heap (overhead) memory rather than the heap itself.
- spark.yarn.executor.memoryOverhead defaults to max(executorMemory * 0.10, 384 MB), so however much you raise executorMemory, the overhead only grows by 10% of the increase. It therefore made more sense to raise spark.yarn.executor.memoryOverhead directly.
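To see why bumping executorMemory alone barely helps, here is a quick sketch of the default formula (the 10g and 14g figures are illustrative, not from the log):

```python
def default_overhead_mb(executor_memory_mb):
    # spark.yarn.executor.memoryOverhead default: max(10%, 384 MB)
    return max(int(executor_memory_mb * 0.10), 384)

# Adding 4 GB of executor memory buys only ~400 MB of extra overhead:
print(default_overhead_mb(10 * 1024))  # 1024
print(default_overhead_mb(14 * 1024))  # 1433
```

If the off-heap usage (netty buffers, shuffle, JVM metaspace, etc.) needs more than that, the container keeps getting killed no matter how large the heap grows.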
I resubmitted the job with the extra configuration: --conf "spark.yarn.executor.memoryOverhead=2G"
This time the job ran to completion.
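For reference, pinning the overhead at 2 GB raises the container ceiling accordingly (the 10g executor memory is again an assumption; the actual value isn't in the log):

```python
# An explicit spark.yarn.executor.memoryOverhead replaces the
# max(10%, 384 MB) default outright.
executor_memory_mb = 10 * 1024  # hypothetical executor memory
overhead_mb = 2 * 1024          # --conf spark.yarn.executor.memoryOverhead=2G
print(executor_memory_mb + overhead_mb)  # 12288 MB: a 12 GB ceiling
```

That extra gigabyte of headroom goes entirely to off-heap usage, which is where this job was actually hitting the wall.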
Wrap-up
The parameters you pass when submitting a Spark job matter a great deal; they can determine whether the job runs at all.
When a job fails, reading the full error message carefully is just as important.
I should also go back and review Spark's configuration parameters in more depth.