Troubleshooting HBase Node Dropouts

Environment:

Hadoop 2.7.2 + HBase 1.2.2 + ZooKeeper 3.4.10

11 servers, 1 master and 10 slaves; base spec per node: 128 GB RAM, 2 CPUs (12 cores, 48 threads)

These servers run HDFS (all 11 nodes), HBase (all 11), ZooKeeper (11 nodes, partly reusing cluster resources), and YARN (all 11, running MR and Spark jobs), plus some multi-threaded business applications.
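With this many co-located daemons, it helps to confirm exactly what is running on each node before digging into logs. A quick check, assuming the JDK's jps tool is on the PATH:

# Lists the Java daemons on a node; on this cluster the workers would typically
# show DataNode, NodeManager, HRegionServer and QuorumPeerMain, with NameNode,
# HMaster and ResourceManager additionally on the master.
jps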

Problem description:

After HBase starts, errors are reported repeatedly and the process then aborts:

RegionServer log:

java.io.EOFException: Premature EOF: no length prefix available

java.io.IOException: 断开的管道 (Broken pipe)

[RS_OPEN_META-hd16:16020-0-MetaLogRoller] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.10.38:50010, 192.168.10.48:50010], original=[192.168.10.38:50010, 192.168.10.48:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted

DataNode log:

java.io.EOFException: Premature EOF: no length prefix available

java.io.IOException: 断开的管道 (Broken pipe)

java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/192.168.10.52:50010 remote=/192.168.10.48:48482]. 60000 millis timeout left.

java.io.IOException: Premature EOF from inputStream
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch 

Analysis:

1. Suspected socket timeouts; increased the timeout values (hdfs-site.xml):

<property>
<name>dfs.client.socket-timeout</name>
<value>6000000</value>
</property>
<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>6000000</value>
</property>

A restart did not help; the errors continued.
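To confirm the new values are actually being picked up after the restart (they are resolved from the hdfs-site.xml on each node), they can be queried with the standard HDFS CLI; a minimal check:

# Prints the value the local hdfs-site.xml resolves to (note this reflects the
# on-disk configuration, not what an already-running daemon was started with).
hdfs getconf -confKey dfs.client.socket-timeout
hdfs getconf -confKey dfs.datanode.socket.write.timeout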

2. The HBase JVM options were poorly tuned, causing Full GCs that were too frequent and too long, leading to timeouts. After adjusting the JVM options, Full GC frequency dropped noticeably and pause times went from tens of seconds down to a few seconds:

export HBASE_MASTER_OPTS="-Xmx2000m -Xms2000m -Xmn750m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
export HBASE_REGIONSERVER_OPTS="-Xmx12800m -Xms12800m -Xmn1000m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
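The post does not show how the Full GC frequency and pause times were measured; one common way is to enable GC logging alongside these options. A sketch, assuming the HotSpot flags available on the JDK 7/8 builds typically used with HBase 1.2, and a hypothetical log path:

# Appended in hbase-env.sh; logs every GC with timestamps and pause durations,
# which is enough to see how often Full GCs occur and how long they take.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"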

A restart still did not help; the errors continued.

3. Increased the ZooKeeper session timeout and changed HBase to restart rather than abort when the session expires; configuration as follows:

<property>
<name>zookeeper.session.timeout</name>
<value>600000</value>
<description>ZooKeeper session timeout.
    HBase passes this to the zk quorum as suggested maximum time for a
    session.  See http://hadoop.apache.org/zooke ... sions
    "The client sends a requested timeout, the server responds with the
    timeout that it can give the client. The current implementation
    requires that the timeout be a minimum of 2 times the tickTime
    (as set in the server configuration) and a maximum of 20 times
    the tickTime." Set the zk ticktime with hbase.zookeeper.property.tickTime.
    In milliseconds.
</description>
</property>
<property>
<name>hbase.regionserver.restart.on.zk.expire</name>
<value>true</value>
<description>
    Zookeeper session expired will force regionserver exit.
    Enable this will make the regionserver restart.
</description>
</property>
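Note that zookeeper.session.timeout is only a request: as the quoted description says, the server caps the negotiated timeout at 20 times its tickTime, so with the default tickTime of 2000 ms a 600000 ms request would still be negotiated down to about 40 s unless the quorum's limits are raised (e.g. maxSessionTimeout in zoo.cfg). The server-side limits can be inspected with ZooKeeper's four-letter-word commands; a sketch, where zk-host is a placeholder for any quorum member:

# 'conf' dumps the running server configuration, including minSessionTimeout
# and maxSessionTimeout, which bound whatever timeout the client requests.
echo conf | nc zk-host 2181 | grep -i sessiontimeout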

After a restart the errors continued; the cluster merely held on a bit longer before failing, so clearly the root cause was not here.

4. With the above changes in place the investigation stalled for a while. Further analysis showed that the cluster's YARN load was heavy: terabyte-scale MR and Spark jobs run routinely, driving disk I/O very high and causing response timeouts. Modified the HDFS configuration (hdfs-site.xml):

<property>
<name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
<value>true</value>
</property>
<property>
<name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
<value>ALWAYS</value>
</property>

After a restart the errors continued; again the cluster only held on a bit longer, so the root cause was still not here.
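The disk I/O pressure blamed here is easy to confirm directly on the DataNodes; a quick sketch, assuming the sysstat package is installed:

# Extended per-device statistics, every 5 seconds, 3 samples; sustained high
# %util and await values indicate the disks are saturated.
iostat -x 5 3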

5. The final analysis was that the cluster load, and the I/O pressure in particular, was simply too high, so the next attempt was to increase the DataNode's memory and its transfer thread (connection) limit, configured as follows:

<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>8192</value>
</property>

Note: I originally set this to 16384 (some posts online claim the valid range is [1, 8192]). I have not verified that, so to be safe I changed it to 8192; if anyone can confirm the actual limit, please leave a comment.

Also increased the DataNode memory allocation (hadoop-env.sh):

export HADOOP_HEAPSIZE=16384   # previously 8192
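Both of these changes can be sanity-checked after the restart; a sketch, assuming the commands are run on a DataNode host as a user that can see the HDFS processes:

# The transfer-thread limit as resolved from hdfs-site.xml.
hdfs getconf -confKey dfs.datanode.max.transfer.threads

# HADOOP_HEAPSIZE (in MB) is turned into the daemon's -Xmx by the Hadoop 2.x
# scripts, so the running DataNode should now show -Xmx16384m in its arguments.
jps -v | grep DataNode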

After this restart HBase ran normally, and the node dropouts have not recurred.
