1. Background
After the service had been running for about 7 days, HBase reads and writes started failing with a retries-exhausted error: a relogin was in progress and had failed, so requests could not be accepted. A colleague had been working around it with a crontab job that periodically restarted the service; I had some spare time and offered to take a look.
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
Tue Feb 23 15:49:17 CST 2021, RpcRetryingCaller{globalStartTime=1614066557307, pause=100, maxAttempts=2}, javax.security.sasl.SaslException: Call to hadoopxxx8.xxx.com/192.168.xx.xx:16020 failed on local exception: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] [Caused by javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]]
Tue Feb 23 15:49:17 CST 2021, RpcRetryingCaller{globalStartTime=1614066557307, pause=100, maxAttempts=2}, java.io.IOException: Call to hadoopxxx.xxx.com/192.168.xx.xx:16020 failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:145)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Call to hadoopxxx.xxx.com/192.168.xx.xx:16020 failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
at sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:221)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:423)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:328)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:95)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:571)
at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:42534)
at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:332)
at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:242)
at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
at org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:387)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:361)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
... 4 more
Caused by: java.io.IOException: Can not send request because relogin is in progress.
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.sendRequest(NettyRpcConnection.java:301)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:421)
... 16 more
2. Analysis
- The failing service uses the same HBase client jar as the n other projects running in production, yet only this project reports the renewal failure.
- The HBase client's authentication code lives in the UserGroupInformation class, part of the hadoop-common package.
A quick Google search turned up the key issue.
In short: applications that talk to Hadoop over Hadoop RPC do not need to worry about Kerberos ticket renewal themselves, but programs using WebHDFS, the YARN REST API, etc. (applications running outside a Hadoop environment) must start their own background thread that periodically calls UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab() to renew the ticket.
https://stackoverflow.com/questions/41453395/how-to-renew-expiring-kerberos-ticket-in-hbase
In short: there are two ways to solve the problem:
- Upgrade the Hadoop dependencies (hadoop-auth, hadoop-mapreduce-client-core, hadoop-common) to 2.6.5
- Start a thread that periodically calls UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab() to perform the renewal
3. Fix and Deployment
We went with the second option: adding a scheduled thread to perform the renewal solved the problem.
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.security.UserGroupInformation;

// Renew the Kerberos TGT on a dedicated scheduler thread, every 10 minutes.
Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable() {
    @Override
    public void run() {
        try {
            // No-op while the ticket is still fresh; relogins from the keytab
            // once the TGT is close to expiring.
            UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
            logger.info("Check Kerberos Tgt And Relogin From Keytab Finish.");
        } catch (IOException e) {
            logger.error("Check Kerberos Tgt And Relogin From Keytab Error", e);
        }
    }
}, 0, 10, TimeUnit.MINUTES);
logger.info("Start Check Keytab TGT And Relogin Job Success.");
Open questions: why don't the other projects hit this problem? And why does the fix recommend upgrading the Hadoop packages to 2.6.5?
4. Further Analysis
Since the problem is tied to the Hadoop package version, check pom.xml.
The Hadoop that solr depends on is 2.6.0, a second-level (transitive) Maven dependency; the Hadoop that lz-async-hbase depends on is 2.7.7, a third-level dependency. Under Maven's "nearest wins" mediation the shorter dependency path takes precedence, so 2.6.0 is what actually ends up on the classpath. Next step: search the Hadoop JIRA for Kerberos ticket renewal issues in 2.6.0.
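To make the resolved version explicit rather than leaving it to dependency-path depth, the Hadoop version can be pinned in the project's dependencyManagement section. A minimal sketch (the 2.7.7 version and the artifact list here are illustrative; adjust to the project's actual needs):

```xml
<!-- Force a single Hadoop version regardless of how deep in the tree it appears. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.7</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-auth</artifactId>
      <version>2.7.7</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Running `mvn dependency:tree` before and after is a quick way to confirm which version Maven actually selects.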
The main issues found:
1. https://issues.apache.org/jira/browse/HADOOP-10786
In short: on JDK 8, in Hadoop versions <= 2.6.0 the isKeytab flag always evaluates to false, so UserGroupInformation.getLoginUser().reloginFromKeytab() fails silently and the ticket is never renewed.
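The nasty part of this bug is that nothing is thrown or logged. A simplified model of the failure mode (this is not the actual Hadoop source; the class and field names are illustrative):

```java
// Simplified illustration of HADOOP-10786: reloginFromKeytab() guards on an
// isKeytab flag; when that flag is wrongly false (as on JDK 8 with Hadoop
// <= 2.6.0), the method returns silently and the TGT is never renewed.
public class ReloginSketch {
    static boolean isKeytab = false;   // the bug: stays false even for keytab logins
    static boolean reloggedIn = false;

    static void reloginFromKeytab() {
        if (!isKeytab) {
            return;                    // silent no-op: no exception, no log entry
        }
        reloggedIn = true;             // stand-in for the real re-login work
    }

    public static void main(String[] args) {
        reloginFromKeytab();
        System.out.println(reloggedIn); // prints "false": renewal silently skipped
    }
}
```

This matches the observed symptom: everything works until the initial TGT expires (~7 days in this cluster), and only then do RPC calls start failing.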
2. "Kerberos ticket isn't being renewed by Solr when storing indexes on HDFS"
In short: on JDK 8 with Solr < 6.2.0, you have to manually upgrade the Hadoop packages to 2.6.1+, otherwise automatic Kerberos ticket renewal is broken. [This is the true root cause!]