Spark jobs on AWS EMR failing with "Exit status: -100. Diagnostics: Container released on a *lost* node"

1. Problem Description

Recently, Spark jobs running on our AWS EMR cluster have frequently failed with errors like "Exit status: -100. Diagnostics: Container released on a lost node".

The error log is as follows:

ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a lost node

2. Cause Analysis

The overall log pattern is much the same as with the Exit code 137 error: nothing in the logs points to a cause, yet the container is simply released.

After consulting AWS support engineers, we found the cause: we were using AWS EMR spot instances, and these instances are forcibly reclaimed when the spot capacity pool runs short.

3. Solution

To reduce the chance of container processes being lost when an instance is forcibly reclaimed, what we can do on the data warehouse development side is to avoid, as far as possible, jobs that both run for a long time and occupy many nodes.

Broadly, there are two approaches (a configuration sketch follows the list):

1. Do not request too many executors; the more executors a job has, the higher the probability that some of them are allocated onto spot instance nodes.

2. Speed up the job so that the tasks allocated onto spot instances can finish quickly.
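A minimal PySpark sketch of what this could look like, assuming dynamic allocation is available on the cluster; the property names are standard Spark settings, while the app name and all numeric values are illustrative assumptions rather than values from this post:

from pyspark.sql import SparkSession

# Sketch only: cap the executor count so fewer containers can land on
# spot-instance nodes, and let dynamic allocation release idle executors early.
# All values below are illustrative.
spark = (
    SparkSession.builder
    .appName("spot-friendly-job")                                  # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "20")          # keep the footprint small
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  # give back idle executors
    .config("spark.sql.shuffle.partitions", "200")                 # smaller tasks finish sooner
    .getOrCreate()
)

On many EMR releases dynamic allocation is already enabled by default, so in practice often only the executor cap and per-task tuning (such as partition counts) need adjusting.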

Reference: Exit status: -100


Following up on this issue later, we found one more cause of the same error:

On AWS EMR, when a node's disk utilization exceeds 90% (this threshold is configurable), the Auto Scaling mechanism is triggered and the disk is treated as unhealthy; the YARN ResourceManager then gradually decommissions the node.

The logs on the affected node will contain entries like the following:

local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/yarn : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ]

Fixing this requires changing the relevant cluster configuration.
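As a hedged sketch of what that configuration change could look like: the threshold in the log message above corresponds to the YARN property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, which on EMR can be supplied through a yarn-site configuration classification, for example when creating the cluster with boto3 (the region, the 95% value, and the commented cluster details are illustrative assumptions):

import boto3

# Sketch only: raise the YARN disk-health threshold (default 90%) through the
# "yarn-site" configuration classification. Values are illustrative.
yarn_disk_config = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "95.0"
        },
    }
]

emr = boto3.client("emr", region_name="us-east-1")  # region is illustrative
# The list above would be passed as the Configurations argument when creating
# the cluster, e.g.:
# emr.run_job_flow(Name="my-cluster", Configurations=yarn_disk_config, ...)
# (remaining run_job_flow arguments omitted here)

The same classification can also be supplied in the EMR console or via aws emr create-cluster --configurations. Raising the threshold only buys headroom; if the disks are simply filling up, freeing or enlarging the storage is the more direct fix.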

Reference: Exit status: -100, official AWS EMR explanation
