昨天联调的时候发现一个问题,请求发送后,页面上一直不返回数据,好像一直在waiting。用stack看了下,第一个线程在等待hbase RPC请求返回数据
"http-bio-8080-exec-5" daemon prio=10 tid=0x00002aaab3661000 nid=0x2bb6 in Object.wait() [0x0000000045258000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000d2067280> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:907)
- locked <0x00000000d2067280> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:157)
at $Proxy9.getProtocolVersion(Unknown Source)
at org.apache.hadoop.hbase.ipc.SecureRpcEngine.getProxy(SecureRpcEngine.java:198)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:335)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:312)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:364)
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:236)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1293)
- locked <0x00000000d1d791e0> (a java.lang.String)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1
而其他线程则处于block状态
"http-bio-8080-exec-8" daemon prio=10 tid=0x0000000041d07800 nid=0x2f5b waiting for monitor entry [0x0000000046d74000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1283)
- waiting to lock <0x00000000d1d791e0> (a java.lang.String)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1249)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1236)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:945)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:841)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:810)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:942)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:845)
因为之前看过对应的代码,请见 http://iwinit.iteye.com/blog/1815347 和 http://iwinit.iteye.com/blog/1811265
看代码是region location的时候RPC meta表所在regionserver的时候卡住了。因为系统刚起来的时候没有cache,所以请求都需要region location,.meta.只在一台rs上,所以都block住了。再看之前的RPC分析文章,第一个请求确实已经发出去了,处于wait状态,也就是等待IO线程notify。奇怪的是,jstack里并没有IO线程(IPC关键字)。也就是说第一个请求线程永远无法被唤醒了:(。Log也没有任何记录,开始各种debug,因为我们客户端是自己的client,SecureConnection,其receiveResponse方法如下
try {
int id = in.readInt(); // try to read an id
if (LOG.isDebugEnabled())
LOG.debug(getName() + " got value #" + id);
//这里把call删除掉了,后续IO线程会因为空闲而自动关闭退出
Call call = calls.remove(id);
//协议中的state字段
int state = in.readInt(); // read call status
if (LOG.isDebugEnabled()) {
LOG.debug("call #"+id+" state is " + state);
}
//对state处理,注意这里没有从的else处理,也就是说如果state不是这3种的话,直接就不处理了,但是call请求确已经被删除掉了。。。
if (state == Status.SUCCESS.state) {
Writable value = ReflectionUtils.newInstance(valueClass, conf);
value.readFields(in); // read value
if (LOG.isDebugEnabled()) {
LOG.debug("call #"+id+", response is:\n"+value.toString());
}
call.setValue(value);
} else if (state == Status.ERROR.state) {
call.setException(new RemoteException(WritableUtils.readString(in),
WritableUtils.readString(in)));
} else if (state == Status.FATAL.state) {
// Close the connection
markClosed(new RemoteException(WritableUtils.readString(in),
WritableUtils.readString(in)));
}
} catch (IOException e) {
debug的时候发现server端范围的state是5xxxxxx的数字。。。原因是sever段没有启用secure导致协议不一致了。。client没有容错导致。