CDH suddenly went down because one of the agent nodes crashed without warning. After that node was rebooted, the cluster looked like this.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20201009094546412.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl8yOTA1NzYxOQ==,size_16,color_FFFFFF,t_70#pic_center)
Starting CDH then predictably fails with an error reporting that the problematic agent is missing.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20201009094657891.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl8yOTA1NzYxOQ==,size_16,color_FFFFFF,t_70#pic_center)
Checking the log on that agent's node shows the following:

    [30/Sep/2020 11:42:26 +0000] 21536 MainThread agent ERROR Heartbeating to hadoop101:7182 failed.
    Traceback (most recent call last):
      File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/agent.py", line 1396, in _send_heartbeat
        response = self.requestor.request('heartbeat', heartbeat_data)
      File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 141, in request
        return self.issue_request(call_request, message_name, request_datum)
      File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 254, in issue_request
        call_response = self.transceiver.transceive(call_request)
      File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 483, in transceive
        result = self.read_framed_message()
      File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 489, in read_framed_message
        framed_message = response_reader.read_framed_message()
      File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 417, in read_framed_message
        raise ConnectionClosedException("Reader read 0 bytes.")
    ConnectionClosedException: Reader read 0 bytes.

![在这里插入图片描述](https://img-blog.csdnimg.cn/20201009094802262.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl8yOTA1NzYxOQ==,size_16,color_FFFFFF,t_70#pic_center)
In other words, the heartbeat is lost: the agent cannot get through to the server, and the server receives no heartbeat from the agent.
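Since the symptom is the agent failing to reach hadoop101:7182, one quick thing to rule out is basic TCP reachability from the agent node. The following is only a minimal Python 3 sketch, not part of Cloudera's tooling; the function name `check_tcp` and its return labels are made up for illustration.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Classify one TCP connection attempt to host:port.

    Returns one of: "open", "dns_error", "timeout", "refused",
    or "error: <details>" for any other socket-level failure.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # The hostname could not be resolved (e.g. a missing /etc/hosts entry).
        return "dns_error"
    except socket.timeout:
        # No answer within the timeout (host down or packets dropped).
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        return "error: %s" % exc
```

Run on the failing agent node, `check_tcp("hadoop101", 7182)` would return `"open"` if the port accepts connections. That result would point away from a pure address or firewall problem and toward something happening after the connection is established.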
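Analyzing the problem
The agent is able to send heartbeats and the server is trying to receive them, so the services themselves appear to be fine. My first guess was an address problem, so I went to check the agent's ini configuration.
I had forgotten where the configuration file lives, so I searched for it: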
find / -name config.ini
![在这里插入图片描述](https://img-blog.csdnimg.cn/20201009095620753.png#pic_center)
I looked at the configuration, edited it, and restarted the agent, but that did not solve the problem.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20201009101726547.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl8yOTA1NzYxOQ==,size_16,color_FFFFFF,t_70#pic_center)
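To double check which server address the agent is actually configured with, the values can be read out of config.ini programmatically. This is a rough sketch that assumes the file has a `[General]` section with `server_host` and `server_port` keys, as the stock cloudera-scm-agent config.ini does, and that 7182 is the port used when none is set; the helper name `read_agent_server` is hypothetical.

```python
import configparser


def read_agent_server(config_text):
    """Return (server_host, server_port) from the text of an agent config.ini.

    Assumes a [General] section; server_host is None when the key is absent,
    and server_port falls back to 7182 when it is not set.
    """
    # Disable interpolation so a literal '%' in any value cannot break parsing.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    host = parser.get("General", "server_host", fallback=None)
    port = parser.getint("General", "server_port", fallback=7182)
    return host, port
```

To use it, read the file at whatever path the `find` command printed and pass its contents in, then compare the result with the host name the Cloudera Manager server really runs on.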
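Thinking it over again
I spent a whole day on this and got nowhere. Everyone online said the agent was holding on to a process, but killing the superxxxx process made no difference. With the heartbeat broken and the master node down, it was a nightmare, and in a rash moment I reinstalled CDH. What a pain.
Regretting that afterwards, I thought about the possible ways out:
1. Reinstall CDH. This really is the last resort: after the reinstall, the cluster I had built on top of it was completely gone, which hurt.
2. Read the log carefully. Notice this line: ConnectionClosedException: Reader read 0 bytes.
I overlooked this detail, and it is probably the key to solving the problem. If you have run into this kind of error and tracked it down, please leave a comment and let me know.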
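In the traceback, the exception is raised in read_framed_message when the reader gets zero bytes back. That suggests the connection was opened and then closed by the other side before any reply arrived, rather than the address being unreachable. The sketch below reproduces that shape locally in Python 3 under stated assumptions: the tiny server and the names `serve_and_hang_up`, `read_reply`, and `demo` are hypothetical stand-ins, not Cloudera Manager code.

```python
import socket
import threading


def serve_and_hang_up(listener):
    """Accept one client, read its request, then close without replying."""
    conn, _ = listener.accept()
    conn.recv(4096)  # consume the request so close() sends a clean FIN
    conn.close()


def read_reply(host, port, timeout=3.0):
    """Send a dummy request and return whatever bytes come back."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"ping")
        # recv returns b"" when the peer has closed the connection.
        return sock.recv(4096)


def demo():
    """Run the hang-up server on a free local port and query it once."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(
        target=serve_and_hang_up, args=(listener,), daemon=True
    )
    worker.start()
    try:
        return read_reply("127.0.0.1", port)
    finally:
        worker.join(timeout=5)
        listener.close()
```

Here the connect step succeeds, yet the read yields an empty byte string. If the real agent behaves the same way, the place to look next would be the server side (its log around the same timestamp) rather than the agent's address settings.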