Yarn在运行时报错error handling URI: /cluster org.eclipse.jetty.io.RuntimeIOException: org.eclipse.jetty.io

在Yarn运行的application的日志中发现有规律的错误日志

影响:并没有影响服务,application正常处理数据。

日志现状:

1-- 两个时间点。每个时间点往后间隔1分钟会有相同错误日志。

2-- 不是每次间隔都会有(即使1分钟的整数倍,有时候2分钟,有时候4分钟)。

日志内容:

2020-07-01 10:05:43,497 ERROR webapp.Dispatcher (Dispatcher.java:service(171)) - error handling URI: /cluster
org.eclipse.jetty.io.RuntimeIOException: org.eclipse.jetty.io.EofException
        at org.eclipse.jetty.server.ResponseWriter.isOpen(ResponseWriter.java:133)
        at org.eclipse.jetty.server.ResponseWriter.println(ResponseWriter.java:314)
        at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._p(HamletImpl.java:110)
        at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$SCRIPT.__(Hamlet.java:457)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppsBlock.renderData(RMAppsBlock.java:194)
        at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:145)
        at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
        at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
        at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
        at org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
        at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet.__(Hamlet.java:30354)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppsBlockWithMetrics.render(AppsBlockWithMetrics.java:29)
        at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
        at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
        at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
        at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
        at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
        at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
        at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
        at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
        at org.apache.hadoop.yarn.webapp.Dispatcher.render(Dispatcher.java:206)
        at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:165)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287)
        at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277)
        at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182)
        at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
        at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
        at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
        at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
        at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
        at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
        at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
        at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
        at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
        at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
        at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
        at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
        at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
        at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
        at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
        at org.apache.hadoop.security.http.CrossOriginFilter.doFilter(CrossOriginFilter.java:98)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
        at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
        at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:539)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.eclipse.jetty.io.EofException
        at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:199)
        at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:420)
        at org.eclipse.jetty.io.WriteFlusher.completeWrite(WriteFlusher.java:375)
        at org.eclipse.jetty.io.SelectChannelEndPoint$3.run(SelectChannelEndPoint.java:107)
        at org.eclipse.jetty.io.SelectChannelEndPoint.onSelected(SelectChannelEndPoint.java:193)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.processSelected(ManagedSelector.java:283)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:181)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
        ... 5 more
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:51)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:177)
        ... 12 more

排查思路:

先查看错误的信息

java.io.IOException: Connection reset by peer

查看资料发现是,会话还没返回,或者返回途中,客户端断开了连接。

根据日志信息查看异常处理,追踪源码发现,源码处理这种情况是重试再返回。所以服务没影响。

接着查看原因

step 1- 依据错误信息,error uri /cluster. 猜测是访问yarn的api。利用抓包工具tcpdump到RM所在机子抓包8088。

发现是RM所在的机子访问

 且记录的过程也符合错误日志,即

1- 有两组会话,组内会话间隔1分钟

2- 不是每个会话都是异常重置(reset),所以日志不是规律的1分钟错误日志

重置会话如下图

step 2- 点开访问 ‘/cluster’ 会话发现,是由python-urllib发送的,监控 8088并过滤python,抓取访问进程

while [ 1 ]; do netstat -apnt | grep 8088 | grep python ; done

 

抓取结果如下

[root@xxxlabxxx yarn]# while [ 1 ]; netstat -apnt | grep 8088 | grep python ; done

tcp 0 0 10.0.2.7:44136 10.0.2.7:8088 ESTABLISHED 106754/python
tcp 0 0 10.0.2.7:51550 10.0.2.4:8088 CLOSE_WAIT 106754/python


[root@xxxlabxxx yarn]# ps -ef | grep 106754
root 39973 90977 0 05:54 pts/1 00:00:00 grep --color=auto 106754
root 106754 106750 3 Aug05 ? 11:09:10 /usr/bin/python /usr/lib/ambari-agent/lib/ambari_agent/main.py start

step 3 - 根据结果,可以发现是 ambari-agent的main.py引起的,但是代码很多。不知道是那段。

step 4 - 锁定8088端口,让它报错,根据报错信息查看代码。执行

iptables -A OUTPUT -p tcp --destination-port 8088 -j DROP

查看他的日志目录/var/log/ambari-agent/下的日志,发现错误信息都是事件触发,有心跳/heartbeat, 告警/ reports/alerts_status等。 /reports/alerts_status的信息里面包含了yarn_resourcemanager_webui,以yarn的报错信息可以确定是访问yarnUI的‘/cluster’。 于是查看security.sh的135行

INFO 2020-09-01 08:44:29,327 Hardware.py:188 - Some mount points were ignored: /dev, /dev/shm, /run, /sys/fs/cgroup, /boot, /run/user/1008, /run/user/1006, /run/user/1004, /run/user/1009, /run/user/990
INFO 2020-09-01 08:44:29,328 security.py:135 - Event to server at /reports/host_status (correlation_id=15078): {'agentEnv': {'transparentHugePage': '', 'hostHealth': {'agentTimeStampAtReporting': 1598949869317, 'liveServices': [{'status': 'Healthy', 'name': 'chronyd', 'desc': ''}]}, 'reverseLookup': True, 'umask': '18', 'hasUnlimitedJcePolicy': True, 'alternatives': [], 'firewallName': 'iptables', 'stackFoldersAndFiles': [], 'existingUsers': [], 'firewallRunning': False}, 'mounts': [{'available': '95667752', 'used': '8660692', 'percent': '9%', 'device': '/dev/sda2', 'mountpoint': '/', 'type': 'xfs', 'size': '104328444'}, {'available': '104347672', 'used': '457708', 'percent': '1%', 'device': '/dev/sdc1', 'mountpoint': '/hadoop', 'type': 'xfs', 'size': '104805380'}, {'available': '29055312', 'used': '2146336', 'percent': '7%', 'device': '/dev/sdb1', 'mountpoint': '/mnt/resource', 'type': 'ext4', 'size': '32895696'}]}
INFO 2020-09-01 08:44:29,332 __init__.py:82 - Event from server at /user/ (correlation_id=15078): {u'status': u'OK'}
INFO 2020-09-01 08:44:37,080 security.py:135 - Event to server at /heartbeat (correlation_id=15079): {'id': 11330}
INFO 2020-09-01 08:44:37,080 security.py:135 - Event to server at /reports/alerts_status (correlation_id=154918): [\{'name': u'nodemanager_health_summary', 'timestamp': 1597816716108L, 'clusterId': '2', 'definitionId': 62, 'state': 'UNKNOWN', 'text': '...'}, \{'name': u'yarn_resourcemanager_webui', 'timestamp': 1597816716185L, 'clusterId': '2', 'definitionId': 55, 'state': 'CRITICAL', 'text': '...'}]

step 5 - 根据step4的错误信息,到security.py 135行报错. 追踪代码发现调用层级较多,分支很多。根据security 的报错信息 135行的方法为起点,yarn的报错信息 error uri, 以uri作为参数的方法为终点。前后查看代码,最终可以确定是/usr/lib/ambari-agent/lib/ambari_agent/alerts/web-alerts中的方法

ambari访问yarnUI(yarn_resourcemanager_webui)。它会发送请求给RM,如果发送到了standby RM,standby RM会重定向到active RM。所以tcpdump抓包看到的是两组会话。代码上看只要 response不为空,它就close掉会话。所以当会话返回内容太大,可能没有全部接受了,它就close掉。导致了开始时的日志信息。

step 6 - 根据step5的猜测,可能是返回信息太大。于是查看application的log发现,会返回application list。如果这个list,记录了所有的application,因此随着yarn运行的application越多。这个日志产生的概率越大。

step 7 - 根据 step6的猜测,需要调整application list的大小。于是查看资料发现参数 yarn.resourcemanager.max-completed-applications,在我们的环境调整为50.

问题解决

  • 6
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 5
    评论
评论 5
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值