Troubleshooting a production outage caused by HBase

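Symptoms: memory on six production servers (8 cores, 8 GB each) filled up instantly and CPU usage spiked at the same moment, leaving the service completely unavailable (every external request returned 504).

Troubleshooting process: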

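1) Code had been deployed that morning, so the first reaction was to roll it back and, at the same time, dump the log files from production.

Result: after the rollback, memory still filled up instantly.

2) Contacted ops for an emergency scale-out of two servers, in case excessive QPS was crushing the service.

Result: the newly added servers also filled their memory instantly, and QPS was confirmed to be within limits.

3) Checked the scheduled jobs, focusing on those that ran around the time the problem appeared, to rule out infinite loops and similar bugs.

Result: nothing wrong was found, which largely ruled out a problem in our own code.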

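4) Examined the dumped log files and found a large number of threads in the BLOCKED state, mainly in HBase calls and in the real-time log stream. Changed the log level to ERROR and restarted the service; memory still filled up instantly.

Result: logging was ruled out, and HBase became the prime suspect.

5) Ops told us that the SZ Hadoop cluster had indeed gone down, but that production HBase should not be affected. Suspecting ops of deflecting blame, we commented out the HBase query code, rebuilt, and deployed.

Result: memory normal, CPU normal.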
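Step 4 hinges on spotting threads stuck in the BLOCKED state. As a rough sketch of that check, assuming the dump examined there was a thread dump, the JVM can summarize its own threads by state through the standard `java.lang.management` API. The class name `ThreadStateSummary` is made up for illustration; on a live server the same information would more commonly come from running the JDK's `jstack` against the process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the project described here.
final class ThreadStateSummary {

    /** Counts the live threads of the current JVM, grouped by Thread.State. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked-monitor and synchronizer details to keep the dump cheap.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // a thread may have exited while the dump was taken
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one line per state that has at least one thread.
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

A BLOCKED count that keeps climbing between two samples is the kind of signal that pointed at HBase in this incident.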

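At this point the production problem was temporarily resolved, but the root cause still needed digging into.

Part of the project's historical-data aggregation uses the HBase cluster in the company's data center, and the low-level operations go through spring-boot-starter-hbase.jar provided by spring4all. Because a sibling project had already been running it stably for some time, we pulled it in without studying its source, which planted a huge pit for us to fall into...

This time there was no avoiding the source. The HbaseAutoConfiguration class: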

@Configuration
@EnableConfigurationProperties({HbaseProperties.class})
@ConditionalOnClass({HbaseTemplate.class})
public class HbaseAutoConfiguration {
    private static final String HBASE_QUORUM = "hbase.zookeeper.quorum";
    private static final String HBASE_ROOTDIR = "hbase.rootdir";
    private static final String HBASE_ZNODE_PARENT = "zookeeper.znode.parent";
    private static final String HBASE_CLIENT_OPERATION_TIMEOUT = "hbase.client.operation.timeout";
    private static final String HBASE_CLIENT_RETRIES_NUMBER = "hbase.client.retries.number";
    @Autowired
    private HbaseProperties hbaseProperties;

    public HbaseAutoConfiguration() {
    }

    @Bean
    @ConditionalOnMissingBean({HbaseTemplate.class})
    public HbaseTemplate hbaseTemplate() {
        org.apache.hadoop.conf.Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", this.hbaseProperties.getQuorum());
        if (StringUtils.isNotBlank(this.hbaseProperties.getClientOperationTimeout())) {
            configuration.set("hbase.client.operation.timeout", this.hbaseProperties.getClientOperationTimeout());
        }

        if (StringUtils.isNotBlank(this.hbaseProperties.getClientRetriesNumber())) {
            configuration.set("hbase.client.retries.number", this.hbaseProperties.getClientRetriesNumber());
        }

        return new HbaseTemplate(configuration);
    }
}

The code that obtains the connection:

public Connection getConnection() {
        if (null == this.connection) {
            synchronized(this) {
                if (null == this.connection) {
                    try {
                        ThreadPoolExecutor poolExecutor = new ThreadPoolExecutor(200, 2147483647, 60L, TimeUnit.SECONDS, new SynchronousQueue());
                        poolExecutor.prestartCoreThread();
                        this.connection = ConnectionFactory.createConnection(this.configuration, poolExecutor);
                    } catch (IOException var4) {
                        LOGGER.error("hbase connection资源池创建失败");
                    }
                }
            }
        }

        return this.connection;
    }

The problems are obvious:

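1. The only timeout the starter lets you configure is hbase.client.operation.timeout; hbase.rpc.timeout and hbase.client.scanner.timeout.period are not exposed. The three parameters mean:

hbase.client.operation.timeout: the total timeout from the moment the HBase client starts a data operation until it receives a response.
hbase.rpc.timeout: the timeout for a single RPC call.
hbase.client.scanner.timeout.period: the total timeout from the moment the HBase client issues the RPC for a scan operation until it receives a response.

We had set hbase.client.operation.timeout to 10 s, but the code uses scan heavily, so those calls only timed out at the hbase.client.scanner.timeout.period default of 1 min. A careful look at the logs did indeed show a large number of scans taking 40+ s.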

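2. The thread pool's maximum size is 2147483647 (Integer.MAX_VALUE). Because the pool uses a SynchronousQueue, which holds no tasks, every submission that finds no idle thread starts a new thread, so slow HBase calls turn directly into unbounded thread growth. There is not much left to say about this one; did the author think only about concurrency and not about resource consumption?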
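Both problems can be addressed where the connection is built. The following is a minimal sketch, not the starter's code and not necessarily the fix we shipped: `HbaseClientHardening` and its method names are hypothetical, and the sizes and timeout values are placeholders to tune. It collects all three timeout keys so that each can be applied with `configuration.set(key, value)`, the same call the starter already uses for the operation timeout, and it builds a pool whose thread count and backlog are both capped. `submitStuckTasks` only simulates hung HBase calls so the pool's behavior can be observed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of spring-boot-starter-hbase.
final class HbaseClientHardening {

    /** All three client timeouts (milliseconds) as key/value pairs for configuration.set(key, value). */
    static Map<String, String> timeouts(long operationMs, long rpcMs, long scannerMs) {
        if (operationMs <= 0 || rpcMs <= 0 || scannerMs <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("hbase.client.operation.timeout", Long.toString(operationMs));
        settings.put("hbase.rpc.timeout", Long.toString(rpcMs));
        settings.put("hbase.client.scanner.timeout.period", Long.toString(scannerMs));
        return settings;
    }

    /** A pool with a capped thread count and a capped backlog; work beyond both is rejected. */
    static ThreadPoolExecutor newBoundedPool(int core, int max, int queueCapacity) {
        return new ThreadPoolExecutor(core, max, 60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Simulates hung calls: submits tasks that block on the latch and returns how many were rejected. */
    static int submitStuckTasks(ThreadPoolExecutor pool, int tasks, CountDownLatch release) {
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await(); // stands in for an HBase call that does not return
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // the caller fails fast instead of the JVM growing another thread
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        timeouts(10_000, 5_000, 10_000).forEach((k, v) -> System.out.println(k + " = " + v));

        ThreadPoolExecutor pool = newBoundedPool(2, 4, 4);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = submitStuckTasks(pool, 20, release);
        System.out.println("threads=" + pool.getPoolSize() + ", rejected=" + rejected);
        release.countDown(); // let the simulated calls finish
        pool.shutdown();
    }
}
```

With a core size of 2, a maximum of 4 and a queue of 4, twenty stuck tasks leave the pool at 4 threads and 12 submissions rejected. Under the starter's settings the same stuck tasks would each get a thread of their own, which matches the memory blow-up we saw. A pool built this way could be passed to `ConnectionFactory.createConnection(configuration, pool)`, as `getConnection()` above does with its own executor.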

Conclusion: code does not lie, and the corner you cut today may be the pit you fall into tomorrow.