记一次Thrift Server错误排查

1. 起因

1.1. 昨夜下班回家后11点左右突然收到hbase监控的连接失败邮件。
1.2. 上线查看监控日志,发现均是timeout异常
1.3. 查看消费者日志,发现所有连接不可用,不断重试中,遂关闭所有消费者

2. 进一步排查

2.1. 早上九点到公司开始继续排查问题
2.2. 查看hbase thrift 日志,发现大量重试,发现thrift处理线程已满,大量等待任务
$ tail -n 1000 hbase-hbase-thrift2-idc02-argo-ds-743.log
2017-02-21 09:50:40,578 ERROR [pool-1-thread-49085] thrift.ProcessFunction: Internal error processing get
java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.util.BoundedCompletionService$QueueingFuture@6d8d5d1d rejected from java.util.concurrent.ThreadPoolExecutor@42d2ee91[Running, pool size = 256, active threads = 256, queued tasks = 25600, completed tasks = 37016]
......
Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.util.BoundedCompletionService$QueueingFuture@6d8d5d1d rejected from java.util.concurrent.ThreadPoolExecutor@42d2ee91[Running, pool size = 256, active threads = 256, queued tasks = 25600, completed tasks = 37016]
......
2017-02-21 10:01:49,958 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=26, retries=35, started=432112 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:02:10,155 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=27, retries=35, started=452309 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:02:30,189 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=28, retries=35, started=472343 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:02:50,386 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=29, retries=35, started=492540 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:03:10,495 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=30, retries=35, started=512649 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:03:30,689 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=31, retries=35, started=532843 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:03:50,717 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=32, retries=35, started=552871 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:04:10,829 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=33, retries=35, started=572983 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2017-02-21 10:04:30,857 INFO  [pool-1-thread-49843] client.RpcRetryingCaller: Call exception, tries=34, retries=35, started=593011 ms ago, cancelled=false, msg=row '536051594396' on table 'resys:taobao_comment' at null
2.3. 查看服务器TCP连接发现大量CLOSE_WAIT连接
$ netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
TIME_WAIT 29
CLOSE_WAIT 16440
FIN_WAIT1 13
ESTABLISHED 62

3. 分析问题

  1. 客户端请求过多,导致服务器线程池无法处理,一部分请求直接处于等待完成状态。此时客户端由于设置timeout,于是超时后放弃连接并起新的连接。由于服务器还在处理其他任务,导致之前的连接无法正常close,于是这部分连接处于CLOSE_WAIT状态。
  2. 通过日志发现thrift自己本身会对请求失败的任务不断重试,并且重试次数高达35次,这也导致了处理速度过慢
  3. 服务器大量的CLOSE_WAIT说明系统CLOSE_WAIT设置时间过长

4. 解决问题

  1. 减少系统CLOSE_WAIT时间(不建议,影响过大
  2. 去除客户端timeout超时等待(可行,虽然看起来不优雅
  3. 减少thrift自己重试次数,并增加thrift线程池内线程数
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值