Brain Split 事故
我们使用的Hadoop版本基于Hadoop 2.6.0!
ResourceManager同zookeeper通信,发生异常:
2018-10-19 09:17:49,981 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Error storing info for AMRMTokenSecretManager
java.io.IOException: Wait for ZKClient creation timed out
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1119)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1155)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:947)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:961)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1002)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeOrUpdateAMRMTokenSecretManagerState(ZKRMStateStore.java:1244)
at org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.rollMasterKey(AMRMTokenSecretManager.java:148)
at org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager$MasterKeyRoller.run(AMRMTokenSecretManager.java:134)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
这是代码发生的位置:
public synchronized void storeOrUpdateAMRMTokenSecretManagerState(
AMRMTokenSecretManagerState amrmTokenSecretManagerState,
boolean isUpdate) {
AMRMTokenSecretManagerState data =
AMRMTokenSecretManagerState.newInstance(amrmTokenSecretManagerState);
byte[] stateData = data.getProto().toByteArray();
try {
setDataWithRetries(amrmTokenSecretManagerRoot, stateData, -1);
} catch (Exception ex) { //在这里抛出异常
LOG.info("Error storing info for AMRMTokenSecretManager", ex);
notifyStoreOperationFailed(ex);
}
}
调用notifyStoreOperationFailed
处理异常:
/**
* This method is called to notify the ResourceManager that the store
* operation has failed.
* @param failureCause the exception due to which the operation failed
*/
protected void notifyStoreOperationFailed(Exception failureCause) {
if (failureCause instanceof StoreFencedException) { //如果是StoreFencedException
Thread standByTransitionThread =
new Thread(new StandByTransitionThread());
standByTransitionThread.setName("StandByTransitionThread Handler");
standByTransitionThread.start();
} else { //如果不是StoreFencedException, 异常无法处理,属于严重异常
rmDispatcher.getEventHandler().handle(
new RMFatalEvent(RMFatalEventType.STATE_STORE_OP_FAILED, failureCause));
}
}
这是我从IDE里面截取的这个方法的调用堆栈,供参考:
注意,从日志看到,上面的异常是IOException
,不是StoreFencedException
,因此走else
分支,生成一个RMFatalEvent事件,交个ResourceManager的dispatcher rmDispatcher去处理;
这里的handle()
方法并没有真正的处理事件,只是把这个事件放到AsyncDispatcher
的eventQueue
中:
class GenericEventHandler implements EventHandler<Event> {
public void handle(Event event) {
if (blockNewEvents) {
return;
}
drained = false;
/* all this method does is enqueue all the events onto the queue */
int qSize = eventQueue.size();
if (qSize !=0 && qSize %1000 == 0) {
LOG.info("Size of event-queue is " + qSize);
}
int remCapacity = eventQueue.remainingCapacity();
if (remCapacity < 1000) {
LOG.warn("Very low remaining capacity in the event-queue: "
+ remCapacity);
}
try {
eventQueue.put(event); //放入eventQueue
} catch (InterruptedException e) {
if (!stopped) {
LOG.warn("AsyncDispatcher thread interrupted", e);
}
// Need to reset drained flag to true if event queue is empty,
// otherwise dispatcher will hang on stop.
drained = eventQueue.isEmpty();
throw new YarnRuntimeException(e);
}
};
}
然后,AsyncDispatcher
创建的一个独立的线程会负责从这个eventQueue中取出事件进行处理:
Runnable createThread() {
return new Runnable() {
@Override
public void run() {
while (!stopped && !Thread.currentThread().isInterrupted()) {
drained = eventQueue.isEmpty(); //标记事件队列是否为空
// blockNewEvents is only set when dispatcher is draining to stop,
// adding this check is to avoid the overhead of acquiring the lock
// and calling notify every time in the normal run of the loop.
if (blockNewEvents) {
synchronized (waitForDrained) {
if (drained) {
waitForDrained.notify();
}
}
}
Event event;
try {
event = eventQueue.take(); //阻塞,直到取出事件
} catch(InterruptedException ie) {
if (!stopped) {
LOG.warn("AsyncDispatcher thread interrupted", ie);
}
return;
}
if (event != null) {
dispatch(event); //对事件进行处理
}
}
}
};
}
这个event的类型是RMFatalEventType.STATE_STORE_OP_FAILED
。这个eventType对应的Handler是ResourceManager.RMFatalEventDispatcher
,从堆栈可以看到,当前jvm正处在这个dispatch()
方法中:
"AsyncDispatcher event handler" #62 prio=5 os_prio=0 tid=0x00007fdb7ca8e000 nid=0x3084 in Object.wait() [0x00007fcf19fc2000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1249)
- locked <0x00007fcfd48d9da0> (a org.apache.hadoop.util.ShutdownHookManager$1)
at java.lang.Thread.join(Thread.java:1323)
at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
at java.lang.Shutdown.runHooks(Shutdown.java:123)
at java.lang.Shutdown.sequence(Shutdown.java:167)
at java.lang.Shutdown.exit(Shutdown.java:212)
- locked <0x00007fcfd48a01e8> (a java.lang.Class for java.lang.Shutdown)
at java.lang.Runtime.exit(Runtime.java:109)
at java.lang.System.exit(System.java:971)
at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:133)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:752)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:743)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
我们看它的RMFatalEventDispatcher.handle()
方法:
@Private
public static class RMFatalEventDispatcher
implements EventHandler<RMFatalEvent> {
@Override
public void handle(RMFatalEvent event) {
LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
event.getType().name() + ". Cause:\n" + event.getCause());
ExitUtil.terminate(1, event.getCause());
}
}
这里有一行fatal日志,从RM的日志里也找到了:
2018-10-19 09:17:49,983 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
这里调用了
ExitUtil.terminate(1, event.getCause());
/**
* Terminate the current process. Note that terminate is the *only* method
* that should be used to terminate the daemon processes.
*
* @param status
* exit code
* @param msg
* message used to create the {@code ExitException}
* @throws ExitException
* if System.exit is disabled for test purposes
*/
public static void terminate(int status, String msg) throws ExitException {
LOG.info("Exiting with status " + status);
if (systemExitDisabled) {
ExitException ee = new ExitException(status, msg);
LOG.fatal("Terminate called", ee);
if (null == firstExitException) {
firstExitException = ee;
}
throw ee;
}
System.exit(status);
}
我们从日志看到这行日志,证明执行路径的确如此:
2018-10-19 09:17:49,984 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
这会导致ResourceManager
启动的时候注册的Shutdown Hook
被调用:
从当时的堆栈里面可以看到,ShutDownHook
的确被调用:
"Thread-2" #25 prio=5 os_prio=0 tid=0x00007fdb5c7f2800 nid=0x1d5d waiting for monitor entry [0x00007fcf0a3c2000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.ha.ActiveStandbyElector.quitElection(ActiveStandbyElector.java:355)
- waiting to lock <0x00007fcfd3a4a450> (a org.apache.hadoop.ha.ActiveStandbyElector)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:113)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x00007fcfd40795c8> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStop(AdminService.java:155)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x00007fcfd4079538> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1121)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x00007fcfd407c228> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:65)
at org.apache.hadoop.service.CompositeService$CompositeServiceShutdownHook.run(CompositeService.java:183)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
但是这个线程被阻塞了,阻塞的原因是在等待ActiveStandbyElector
的对象锁,因为quitElection()
方法是一个synchronized
方法:
public synchronized void quitElection(boolean needFence) {
LOG.info("Yielding from election");
if (!needFence && state == State.ACTIVE) {
// If active is gracefully going back to standby mode, remove
// our permanent znode so no one fences us.
tryDeleteOwnBreadCrumbNode();
}
reset();
wantToBeInElection = false;
}
那么,问题的关键是
- 谁把这把锁抢走了?
- 抢到这把锁的线程为什么一直执行不完?
我们从堆栈可以看到谁已经拿到了这把锁:
"main-EventThread" #897510 daemon prio=5 os_prio=0 tid=0x00007fdb65be1800 nid=0x5d01 waiting for monitor entry [0x00007fcf12f41000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1066)
- waiting to lock <0x00007fcfd3b299e0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322)
- locked <0x00007fcfd3a7b930> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:131)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:918)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:429)
- locked <0x00007fcfd3a4a450> (a org.apache.hadoop.ha.ActiveStandbyElector)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
可以看到这个线程已经拿到了这把对象锁,因为processResult()
方法也是需要这个对象锁的,这个线程已经执行进来,说明已经拿到了锁:
/**
* interface implementation of Zookeeper callback for create
*/
@Override
public synchronized void processResult(int rc, String path, Object ctx,
String name) {
if (isStaleClient(ctx)) return;
LOG.debug("CreateNode result: " + rc + " for path: " + path
+ " connectionState: " + zkConnectionState +
" for " + this);
...
同时可以看到,这个线程同样处于BLOCKED模式,从代码看到,它在等ResourceManager.class
的对象锁,这是因为ResourceManager.transitionToStandby()
是一个synchronized
方法:
synchronized void transitionToStandby(boolean initialize)
throws Exception {
if (rmContext.getHAServiceState() ==
HAServiceProtocol.HAServiceState.STANDBY) {
LOG.info("Already in standby state");
return;
}
LOG.info("Transitioning to standby state");
if (rmContext.getHAServiceState() ==
HAServiceProtocol.HAServiceState.ACTIVE) {
stopActiveServices();
reinitialize(initialize);
}
rmContext.setHAServiceState(HAServiceProtocol.HAServiceState.STANDBY);
LOG.info("Transitioned to standby state");
}
那么,同样的问题,谁已经拿到了ResourceManager
的锁并且没有把锁释放呢?这也可以从堆栈里面找出来:
"StandByTransitionThread Handler" #897459 daemon prio=5 os_prio=0 tid=0x0000000004cd0000 nid=0x5cce in Object.wait() [0x00007fcf191b9000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:143)
- locked <0x00007fcfd444f200> (a java.lang.Object)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x00007fcfd444f1e8> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x00007fcfd444f1a0> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x00007fcfd4430550> (a java.lang.Object)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:612)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x00007fcfd42ca768> (a java.lang.Object)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1020)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1075)
- locked <0x00007fcfd3b299e0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandBy(ResourceManager.java:761)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StandByTransitionThread.run(RMStateStore.java:722)
at java.lang.Thread.run(Thread.java:745)
可以看到,这个线程已经进入了ResourceManager.handleTransitionToStandBy()
方法,并且这个方法是一个synchronized
方法,因此这个线程已经拿到了ResourceManager.class
的对象锁,这从日志也可以看出来:
2018-10-19 02:35:29,518 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to standby state
那么, 为什么这个线程拿到了ResourceManager.class
这个对象锁,却一直没有ResourceManager.class
的对象锁呢?从这个线程堆栈可以看到它的状态是TIMED_WAITING
,明显是在sleep
:
@Override
protected void serviceStop() throws Exception {
if (drainEventsOnStop) {
blockNewEvents = true;
LOG.info("AsyncDispatcher is draining to stop, igonring any new events.");
synchronized (waitForDrained) {
while (!drained && eventHandlingThread.isAlive()) {
waitForDrained.wait(1000);//反复循环等待,只要drained=true或者eventHandlingThread.isAlive == false,就跳出循环
LOG.info("n for AsyncDispatcher to drain. Thread state is :" +
eventHandlingThread.getState());
}
}
}
stopped = true;
if (eventHandlingThread != null) {
eventHandlingThread.interrupt();
try {
eventHandlingThread.join();
} catch (InterruptedException ie) {
LOG.warn("Interrupted Exception while stopping", ie);
}
}
// stop all the components
super.serviceStop();
}
一直处于TIMED_WAITING
,即sleep()
被反复调用, 说明这个while (!drained && eventHandlingThread.isAlive()) {
一直返回true
形成无限循环,也就是说:
drained = false
并且 eventHandlingThread.isAlive() = true
这两个谓语分别进行分析:
-
为什么
drained = false
会成立呢?drained代表队列是否已经清空, 为false表示没有清空,为true代表已经清空。 从上面已经列出的AsyncDispatcher.createThread()可以看到, 发生的时序是:
-
队列里面还有最后一个事件,就是上面的那个
RMFatalEvent
事件,所以drained = eventQueue.isEmpty();
导致drained
为false
-
通过
event = eventQueue.take();
取出事件; -
调用dispatch()来对这个事件进行处理;
注意,虽然此时队列已经为空了,但是
drained
其实还没有修改过来,需要等到下一次while到来才行。但是很可以,这个dispatch()方法被blocked了,所以这个drained一直是false
-
-
为什么
eventHandlingThread.isAlive() = true
- 这个
eventHandlingThread
就是当前的这个createThread
线程,根据上文讲解,这个方法当前正处在dispatch()
方法中,并且由于是WAITING,所以当然是alive的;
- 这个
综上所述:
问题发生的原因是:
- 由于zk连接发生无法处理的严重错误IOException,触发了
RMFatalEvent
, - 这个Event对应的处理器是
ResourceManager.RMFatalEventDispatcher
,它的处理方式很粗暴,调用System.exit(1),直接退出jvm; - 这个System.exit(1)导致JVM的shutdown hook被调用
- 这个ShutDown Hook会调用
ResourceManager.serviceStop()
来停掉ResourceManager
的相关服务,但是在获取ActiveStandbyElector.class
的对象锁的时候被阻塞,说明已经有其它线程获取了这个锁 ActiveStandbyElector.class
已经被另外一个线程拿到,这个线程是一个Zookeeper的节点监听器的回调,这个回调发生得比较早,所以拿到了ActiveStandbyElector.class
对象锁,开始切换到standby模式,但是切换的时候需要调用ResourceManager.transitionToStandby
,这是一个synchronized方法,意味着需要拿到ResourceManager.class
的对象锁,但是此时这个对象锁已经被拿走了StandByTransitionThread Handler
,并且拿走这把锁的线程通过循环反复等待eventQueue已经清空了;
从上面客户看到,整个过程抽象为这样:
- 线程A 怎么都拿不到锁L,所以一直hang住,判断条件drain 始终未false
- 已经拿到锁L的线程,通过while循环反复判断并等待某个判断条件drained为true
这样就发生了相互等待,永远无法执行结束!