A Spark job intermittently fails with the following error:
===
19/11/03 07:40:27 ERROR YarnClusterScheduler: Lost executor 28 on ip-10-19-201-115.ec2.internal: Container killed by YARN for exceeding memory limits. 24.0 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19/11/03 07:40:27 WARN TaskSetManager: Lost task 1124.0 in stage 45.0 (TID 36556, ip-10-19-201-115.ec2.internal, executor 28): ExecutorLostFailure (executor 28 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 24.0 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
===
Solution: the AWS engineers suggested the following options:
a.) Increase memory overhead
b.) Reduce the number of executor cores
c.) Increase the number of partitions
d.) Increase driver and executor memory
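To see why option a.) matters, it helps to know how YARN sizes an executor container: it reserves the JVM heap (`--executor-memory`) plus the off-heap overhead (`spark.executor.memoryOverhead`, which Spark documents as defaulting to max(384 MB, 10% of executor memory) when unset). A minimal sketch of that arithmetic, using the numbers from the failing job:

```python
def executor_container_mb(executor_memory_mb, memory_overhead_mb=None):
    """Total memory YARN reserves for one executor container, in MB.

    If the overhead is not set explicitly, Spark's documented default is
    max(384 MB, 10% of executor memory).
    """
    if memory_overhead_mb is None:
        memory_overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + memory_overhead_mb

# The failing job: 20 GB heap + 4096 MB explicit overhead = 24 GB,
# exactly the limit YARN reported ("24.0 GB of 24 GB physical memory used").
print(executor_container_mb(20 * 1024, 4096))  # 24576 MB = 24 GB
```

When a task's off-heap usage (shuffle buffers, netty, Python workers, etc.) pushes past that overhead allowance, YARN kills the whole container, which is exactly the error above.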
Based on these suggestions, we adjusted our spark-submit command as follows:
Before:
spark-submit --deploy-mode cluster --master yarn --driver-memory 10G --executor-memory 20G --conf spark.executor.memoryOverhead=4096 --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf spark.network.timeout=300s --conf spark.executor.heartbeatInterval=100s --conf spark.driver.maxResultSize=3G --executor-cores 14 --conf spark.default.parallelism=800 ........
After:
spark-submit --deploy-mode cluster --master yarn --driver-memory 10G --executor-memory 19G --conf spark.executor.memoryOverhead=5120 --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf spark.network.timeout=300s --conf spark.executor.heartbeatInterval=100s --conf spark.driver.maxResultSize=3G --executor-cores 14 --conf spark.default.parallelism=800 ........
Changed parameters:
--executor-memory 19G --conf spark.executor.memoryOverhead=5120
With these two parameters adjusted, the job should hopefully be fine. Since the failure is intermittent, we need to watch it for a few days; if the error recurs, we will reduce the number of executor cores and increase the number of partitions.
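A quick sanity check on the change above, plus the fallback plan: the adjustment keeps the container at the same 24 GB total but shifts 1 GB from heap into off-heap overhead; cutting cores would further raise the overhead headroom available to each concurrently running task. This is an illustrative sketch only (the 10-core figure is a hypothetical fallback, not a tested setting):

```python
def overhead_per_task_mb(overhead_mb, cores):
    # All tasks running concurrently on an executor share its off-heap
    # overhead, so fewer cores means more overhead headroom per task.
    return overhead_mb / cores

before = 20 * 1024 + 4096   # 24576 MB: 20 GB heap + 4 GB overhead
after  = 19 * 1024 + 5120   # 24576 MB: same container size, 1 GB moved
                            # from heap into off-heap overhead

print(before == after)                 # container still fits the 24 GB limit
print(overhead_per_task_mb(5120, 14))  # current: ~366 MB per concurrent task
print(overhead_per_task_mb(5120, 10))  # hypothetical fewer-core fallback
```

Increasing partitions works on the other axis: smaller partitions mean each task buffers less data at once, so per-task memory pressure drops even with the same core count.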