Spark jobs on AWS EMR: the Exit status: -100 Container released on a *lost* node error

Article link: https://blog.csdn.net/Zzz_Zzz_Z/article/details/112985536

1. Problem Description

Recently, Spark jobs running on our AWS EMR cluster have often failed with the error message Exit status: -100. Diagnostics: Container released on a lost node.

The error log looks like this:

ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node
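When this error recurs, it helps to know which hosts keep losing containers. The following is a minimal sketch, not part of the original troubleshooting, that pulls the executor ID, host, container ID and exit status out of a log line shaped like the one above. The function name `parse_lost_executor` and the regular expression are illustrative assumptions based only on this single sample line; other Spark or YARN versions may format the message differently.

```python
import re

# Pattern derived from the single sample line above (an assumption, not a
# documented log format).
LOST_EXECUTOR = re.compile(
    r"Lost executor (?P<executor>\d+) on (?P<host>\S+): "
    r"Container marked as failed: (?P<container>container_\S+) on host: \S+\. "
    r"Exit status: (?P<status>-?\d+)\. Diagnostics: (?P<diagnostics>.*)"
)


def parse_lost_executor(line):
    """Return the fields of a 'Lost executor' line as a dict, or None."""
    match = LOST_EXECUTOR.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["status"] = int(info["status"])  # e.g. -100 for a lost node
    return info
```

Counting the `host` values across a driver log shows whether the failures cluster on a few nodes, which is a useful hint when looking for the cause.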

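2. Root Cause Analysis

The logs look much the same as for an Exit code 137 error: they give no reason at all, and the Container is simply released.

We later consulted AWS support engineers and found the cause: our AWS EMR cluster was running on spot instances. AWS forcibly reclaims this kind of instance when its capacity pool runs short.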
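One way to confirm that a node is about to be reclaimed is the spot interruption notice that EC2 exposes through instance metadata at `spot/instance-action`. The sketch below is an illustration rather than something the original investigation used. It assumes it runs on the spot instance itself and that IMDSv2 is reachable, and the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2.0):
    """Return the raw interruption notice (JSON text), or None if none is pending."""
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    notice_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(notice_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def seconds_until_reclaim(notice_json, now):
    """Parse a notice such as {"action": "terminate", "time": "...Z"}."""
    notice = json.loads(notice_json)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return notice["action"], (when - now).total_seconds()
```

On a node, one would call `fetch_instance_action()` periodically and, if it returns a notice, pass it to `seconds_until_reclaim(notice, datetime.now(timezone.utc))`. A pending notice around the time of the failure is consistent with the spot reclamation described above.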

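3. Solution

To reduce the chance of losing Container processes when an instance is forcibly reclaimed, what we can do on the data warehouse development side is to avoid, as far as possible, jobs that both run for a long time and occupy many nodes.

There are roughly two approaches:

1) Keep the number of Executors down, because more Executors means a higher probability that some of them are placed on spot instance nodes.

2) Make the job run faster, so that tasks placed on spot instances finish quickly.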
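The first approach can be expressed as Spark configuration. The sketch below builds the `--conf` arguments for `spark-submit` that cap the executor count, using the standard Spark properties `spark.dynamicAllocation.maxExecutors` (with dynamic allocation) or `spark.executor.instances` (with a fixed allocation). The helper name `executor_cap_args` and the value 20 in the usage note are assumptions for illustration; the original post does not give concrete numbers.

```python
def executor_cap_args(max_executors, dynamic=True):
    """Build spark-submit --conf arguments that cap the number of executors."""
    if max_executors < 1:
        raise ValueError("max_executors must be at least 1")

    if dynamic:
        # Let Spark scale executors up and down, but never above the cap.
        conf = {
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.maxExecutors": str(max_executors),
        }
    else:
        # Fixed allocation: request exactly this many executors.
        conf = {
            "spark.dynamicAllocation.enabled": "false",
            "spark.executor.instances": str(max_executors),
        }

    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

For example, `["spark-submit"] + executor_cap_args(20) + ["job.py"]` gives a command line whose executor count never exceeds 20. A suitable cap depends on the job and the cluster, so it has to be tuned case by case.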

Reference link: Exit status: -100

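Following up on this issue later, we found one more cause of the same error:

On AWS EMR, when disk utilization on a node exceeds 90% (this threshold is configurable), the Auto Scaling mechanism is triggered and the disk is considered unhealthy. The YARN ResourceManager then gradually decommissions the node.

The logs on the affected node contain entries like the following: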

local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/yarn : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ]
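To spot a node that is getting close to this state before YARN marks it unhealthy, the same kind of check can be approximated by hand. The following is a rough sketch under stated assumptions: it compares used space to total space per directory with `shutil.disk_usage`, which may not match YARN's own calculation exactly, and the function name `dirs_over_threshold` is hypothetical. The directories and the 90% threshold come from the log above.

```python
import shutil


def dirs_over_threshold(paths, threshold_percent=90.0, usage=shutil.disk_usage):
    """Return {path: used_percent} for every path above the threshold.

    `usage` is injectable so the check can be exercised without real disks.
    """
    flagged = {}
    for path in paths:
        total, used, _free = usage(path)
        used_percent = 100.0 * used / total
        if used_percent > threshold_percent:
            flagged[path] = round(used_percent, 1)
    return flagged
```

On a core or task node one might call `dirs_over_threshold(["/mnt/yarn", "/var/log/hadoop-yarn/containers"])`. A non-empty result means the node is at risk of being taken out of service.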

Fixing this requires changing the relevant cluster configuration.
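The post does not say which settings were changed. One plausible candidate, shown here purely as an assumption, is the YARN property `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage`, whose default of 90.0 matches the threshold in the log. The sketch below builds an EMR configuration object for the `yarn-site` classification; the helper name and the value 95.0 are illustrative only.

```python
import json

# Assumption: this is the threshold behind the "above threshold of 90.0%" log.
DISK_THRESHOLD_PROPERTY = (
    "yarn.nodemanager.disk-health-checker."
    "max-disk-utilization-per-disk-percentage"
)


def yarn_disk_threshold_config(percent):
    """Build an EMR configuration list that sets the YARN disk threshold."""
    if not 0.0 < percent <= 100.0:
        raise ValueError("percent must be in the range (0, 100]")
    return [
        {
            "Classification": "yarn-site",
            "Properties": {DISK_THRESHOLD_PROPERTY: str(percent)},
        }
    ]


# Print the JSON that could be supplied as the cluster's configuration.
print(json.dumps(yarn_disk_threshold_config(95.0), indent=2))
```

Raising the threshold only postpones the problem if the disks keep filling up, so adding storage or cleaning up local and log directories may be the more durable fix.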

Reference link: Exit status: -100, official AWS EMR explanation