1. 起因
1.1. 昨夜下班回家后11点左右突然收到hbase监控的连接失败邮件。
1.2. 上线查看监控日志,发现均是timeout异常
1.3. 查看消费者日志,发现所有连接不可用,不断重试中,遂关闭所有消费者
2. 进一步排查
2.1. 早上九点到公司开始继续排查问题
2.2. 查看hbase thrift 日志,发现大量重试,发现thrift处理线程已满,大量等待任务
$ tail -n 1000 hbase-hbase-thrift2-idc02-argo-ds-743.log
2017-02-21 09:50:40,578 ERROR [pool-1-thread-49085] thrift.ProcessFunction: Internal error processing get
java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.util.BoundedCompletionService$QueueingFuture@6d8d5d1d rejected from java.util.concurrent.ThreadPoolExecutor@42d2ee91[Running, pool size = 256, active threads = 256, queued tasks = 25600, completed tasks = 37016]
......
Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.util.BoundedCompletionService$QueueingFuture@6d8d5d1d rejected from java.util.concurrent.ThreadPoolExecutor@42d2ee91[Running, pool size = 256, active threads = 256, queued tasks = 25600, completed tasks = 37016]
......
2017-02-21 10:01:49,958 INFO [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=26, retries=35, started=432112 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:02:10,155 INFO [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=27, retries=35, started=452309 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:02:30,189 INFO [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=28, retries=35, started=472343 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:02:50,386 INFO [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=29, retries=35, started=492540 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:03:10,495 INFO [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=30, retries=35, started=512649 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:03:30,689 INFO [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=31, retries=35, started=532843 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:03:50,717 INFO [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=32, retries=35, started=552871 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:04:10,829 INFO [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=33, retries=35, started=572983 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:04:30,857 INFO [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=34, retries=35, started=593011 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2.3. 查看服务器TCP连接发现大量CLOSE_WAIT连接
$ netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
TIME_WAIT 29
CLOSE_WAIT 16440
FIN_WAIT1 13
ESTABLISHED 62
3. 分析问题
- 客户端请求过多,导致服务器线程池无法处理,一部分请求直接处于等待完成状态。此时客户端由于设置timeout,于是超时后放弃连接并起新的连接。由于服务器还在处理其他任务,导致之前的连接无法正常close,于是这部分连接处于CLOSE_WAIT状态。
- 通过日志发现thrift自己本身会对请求失败的任务不断重试,并且重试次数高达35次,这也导致了处理速度过慢
- 服务器大量的CLOSE_WAIT说明系统CLOSE_WAIT设置时间过长
4. 解决问题
- 减少系统CLOSE_WAIT时间(不建议,影响过大
- 去除客户端timeout超时等待(可行,虽然看起来不优雅
- 减少thrift自己重试次数,并增加thrift线程池内线程数