Yarn ResourceManager进行主从切换时发生脑裂原因分析

Brain Split 事故
我们使用的Hadoop版本基于Hadoop 2.6.0!

ResourceManager同zookeeper通信,发生异常:

2018-10-19 09:17:49,981 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Error storing info for AMRMTokenSecretManager
java.io.IOException: Wait for ZKClient creation timed out
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1119)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1155)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:947)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:961)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1002)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeOrUpdateAMRMTokenSecretManagerState(ZKRMStateStore.java:1244)
        at org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.rollMasterKey(AMRMTokenSecretManager.java:148)
        at org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager$MasterKeyRoller.run(AMRMTokenSecretManager.java:134)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)

这是代码发生的位置:

  public synchronized void storeOrUpdateAMRMTokenSecretManagerState(
      AMRMTokenSecretManagerState amrmTokenSecretManagerState,
      boolean isUpdate) {
    AMRMTokenSecretManagerState data =
        AMRMTokenSecretManagerState.newInstance(amrmTokenSecretManagerState);
    byte[] stateData = data.getProto().toByteArray();
    try {
      setDataWithRetries(amrmTokenSecretManagerRoot, stateData, -1);
    } catch (Exception ex) { //在这里抛出异常
      LOG.info("Error storing info for AMRMTokenSecretManager", ex);
      notifyStoreOperationFailed(ex);
    }
  }

调用notifyStoreOperationFailed处理异常:

  /**
   * This method is called to notify the ResourceManager that the store
   * operation has failed.
   * @param failureCause the exception due to which the operation failed
   */
  protected void notifyStoreOperationFailed(Exception failureCause) {
    if (failureCause instanceof StoreFencedException) { //如果是StoreFencedException
      Thread standByTransitionThread =
          new Thread(new StandByTransitionThread());

      standByTransitionThread.setName("StandByTransitionThread Handler");
      standByTransitionThread.start();
    } else {  //如果不是StoreFencedException, 异常无法处理,属于严重异常
      rmDispatcher.getEventHandler().handle(
        new RMFatalEvent(RMFatalEventType.STATE_STORE_OP_FAILED, failureCause));
    }
  }

这是我从IDE里面截取的这个方法的调用堆栈,供参考:
Pasted Graphic 3.tiff

注意,从日志看到,上面的异常是IOException,不是StoreFencedException,因此走else分支,生成一个RMFatalEvent事件,交个ResourceManager的dispatcher rmDispatcher去处理;

这里的handle()方法并没有真正的处理事件,只是把这个事件放到AsyncDispatchereventQueue中:

  class GenericEventHandler implements EventHandler<Event> {
    public void handle(Event event) {
      if (blockNewEvents) {
        return;
      }
      drained = false;

      /* all this method does is enqueue all the events onto the queue */
      int qSize = eventQueue.size();
      if (qSize !=0 && qSize %1000 == 0) {
        LOG.info("Size of event-queue is " + qSize);
      }
      int remCapacity = eventQueue.remainingCapacity();
      if (remCapacity < 1000) {
        LOG.warn("Very low remaining capacity in the event-queue: "
            + remCapacity);
      }
      try {
        eventQueue.put(event); //放入eventQueue
      } catch (InterruptedException e) {
        if (!stopped) {
          LOG.warn("AsyncDispatcher thread interrupted", e);
        }
        // Need to reset drained flag to true if event queue is empty,
        // otherwise dispatcher will hang on stop.
        drained = eventQueue.isEmpty();
        throw new YarnRuntimeException(e);
      }
    };
  }

然后,AsyncDispatcher创建的一个独立的线程会负责从这个eventQueue中取出事件进行处理:


  Runnable createThread() {
    return new Runnable() {
      @Override
      public void run() {
        while (!stopped && !Thread.currentThread().isInterrupted()) {
          drained = eventQueue.isEmpty(); //标记事件队列是否为空
          // blockNewEvents is only set when dispatcher is draining to stop,
          // adding this check is to avoid the overhead of acquiring the lock
          // and calling notify every time in the normal run of the loop.
          if (blockNewEvents) {
            synchronized (waitForDrained) {
              if (drained) {
                waitForDrained.notify(); 
              }
            }
          }
          Event event;
          try {
            event = eventQueue.take(); //阻塞,直到取出事件
          } catch(InterruptedException ie) {
            if (!stopped) {
              LOG.warn("AsyncDispatcher thread interrupted", ie);
            }
            return;
          }
          if (event != null) {
            dispatch(event); //对事件进行处理
          }
        }
      }
    };
  }

这个event的类型是RMFatalEventType.STATE_STORE_OP_FAILED。这个eventType对应的Handler是ResourceManager.RMFatalEventDispatcher,从堆栈可以看到,当前jvm正处在这个dispatch()方法中:

"AsyncDispatcher event handler" #62 prio=5 os_prio=0 tid=0x00007fdb7ca8e000 nid=0x3084 in Object.wait() [0x00007fcf19fc2000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1249)
        - locked <0x00007fcfd48d9da0> (a org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1323)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0x00007fcfd48a01e8> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:971)
        at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:133)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:752)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:743)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
        at java.lang.Thread.run(Thread.java:745)

我们看它的RMFatalEventDispatcher.handle()方法:

  @Private
  public static class RMFatalEventDispatcher
      implements EventHandler<RMFatalEvent> {

    @Override
    public void handle(RMFatalEvent event) {
      LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
          event.getType().name() + ". Cause:\n" + event.getCause());

      ExitUtil.terminate(1, event.getCause());
    }
  }

这里有一行fatal日志,从RM的日志里也找到了:

2018-10-19 09:17:49,983 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:

这里调用了

ExitUtil.terminate(1, event.getCause());
  /**
   * Terminate the current process. Note that terminate is the *only* method
   * that should be used to terminate the daemon processes.
   *
   * @param status
   *          exit code
   * @param msg
   *          message used to create the {@code ExitException}
   * @throws ExitException
   *           if System.exit is disabled for test purposes
   */
  public static void terminate(int status, String msg) throws ExitException {
    LOG.info("Exiting with status " + status);
    if (systemExitDisabled) {
      ExitException ee = new ExitException(status, msg);
      LOG.fatal("Terminate called", ee);
      if (null == firstExitException) {
        firstExitException = ee;
      }
      throw ee;
    }
    System.exit(status); 
  }

我们从日志看到这行日志,证明执行路径的确如此:

2018-10-19 09:17:49,984 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

这会导致ResourceManager启动的时候注册的Shutdown Hook被调用:

从当时的堆栈里面可以看到,ShutDownHook的确被调用:

"Thread-2" #25 prio=5 os_prio=0 tid=0x00007fdb5c7f2800 nid=0x1d5d waiting for monitor entry [0x00007fcf0a3c2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.ha.ActiveStandbyElector.quitElection(ActiveStandbyElector.java:355)
        - waiting to lock <0x00007fcfd3a4a450> (a org.apache.hadoop.ha.ActiveStandbyElector)
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:113)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00007fcfd40795c8> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStop(AdminService.java:155)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00007fcfd4079538> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1121)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00007fcfd407c228> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:65)
        at org.apache.hadoop.service.CompositeService$CompositeServiceShutdownHook.run(CompositeService.java:183)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

但是这个线程被阻塞了,阻塞的原因是在等待ActiveStandbyElector的对象锁,因为quitElection()方法是一个synchronized方法:

public synchronized void quitElection(boolean needFence) {
  LOG.info("Yielding from election");
  if (!needFence && state == State.ACTIVE) {
    // If active is gracefully going back to standby mode, remove
    // our permanent znode so no one fences us.
    tryDeleteOwnBreadCrumbNode();
  }
  reset();
  wantToBeInElection = false;
}

那么,问题的关键是

  • 谁把这把锁抢走了?
  • 抢到这把锁的线程为什么一直执行不完?

我们从堆栈可以看到谁已经拿到了这把锁:

"main-EventThread" #897510 daemon prio=5 os_prio=0 tid=0x00007fdb65be1800 nid=0x5d01 waiting for monitor entry [0x00007fcf12f41000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1066)
        - waiting to lock <0x00007fcfd3b299e0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322)
        - locked <0x00007fcfd3a7b930> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:131)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:918)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:429)
        - locked <0x00007fcfd3a4a450> (a org.apache.hadoop.ha.ActiveStandbyElector)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

可以看到这个线程已经拿到了这把对象锁,因为processResult()方法也是需要这个对象锁的,这个线程已经执行进来,说明已经拿到了锁:

  /**
   * interface implementation of Zookeeper callback for create
   */
  @Override
  public synchronized void processResult(int rc, String path, Object ctx,
      String name) {
    if (isStaleClient(ctx)) return;
    LOG.debug("CreateNode result: " + rc + " for path: " + path
        + " connectionState: " + zkConnectionState +
        "  for " + this);
...

同时可以看到,这个线程同样处于BLOCKED模式,从代码看到,它在等ResourceManager.class的对象锁,这是因为ResourceManager.transitionToStandby()是一个synchronized方法:


  synchronized void transitionToStandby(boolean initialize)
      throws Exception {
    if (rmContext.getHAServiceState() ==
        HAServiceProtocol.HAServiceState.STANDBY) {
      LOG.info("Already in standby state");
      return;
    }

    LOG.info("Transitioning to standby state");
    if (rmContext.getHAServiceState() ==
        HAServiceProtocol.HAServiceState.ACTIVE) {
      stopActiveServices();
      reinitialize(initialize);
    }
    rmContext.setHAServiceState(HAServiceProtocol.HAServiceState.STANDBY);
    LOG.info("Transitioned to standby state");
  }

那么,同样的问题,谁已经拿到了ResourceManager的锁并且没有把锁释放呢?这也可以从堆栈里面找出来:

"StandByTransitionThread Handler" #897459 daemon prio=5 os_prio=0 tid=0x0000000004cd0000 nid=0x5cce in Object.wait() [0x00007fcf191b9000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:143)
        - locked <0x00007fcfd444f200> (a java.lang.Object)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00007fcfd444f1e8> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00007fcfd444f1a0> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00007fcfd4430550> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:612)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00007fcfd42ca768> (a java.lang.Object)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1020)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1075)
        - locked <0x00007fcfd3b299e0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandBy(ResourceManager.java:761)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StandByTransitionThread.run(RMStateStore.java:722)
        at java.lang.Thread.run(Thread.java:745)

可以看到,这个线程已经进入了ResourceManager.handleTransitionToStandBy()方法,并且这个方法是一个synchronized方法,因此这个线程已经拿到了ResourceManager.class的对象锁,这从日志也可以看出来:

2018-10-19 02:35:29,518 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to standby state

那么, 为什么这个线程拿到了ResourceManager.class这个对象锁,却一直没有ResourceManager.class的对象锁呢?从这个线程堆栈可以看到它的状态是TIMED_WAITING,明显是在sleep:

  @Override
  protected void serviceStop() throws Exception {
    if (drainEventsOnStop) {
      blockNewEvents = true;
      LOG.info("AsyncDispatcher is draining to stop, igonring any new events.");
      synchronized (waitForDrained) {
        while (!drained && eventHandlingThread.isAlive()) {
          waitForDrained.wait(1000);//反复循环等待,只要drained=true或者eventHandlingThread.isAlive == false,就跳出循环
          LOG.info("n for AsyncDispatcher to drain. Thread state is :" +
              eventHandlingThread.getState());
        }
      }
    }
    stopped = true;
    if (eventHandlingThread != null) {
      eventHandlingThread.interrupt();
      try {
        eventHandlingThread.join();
      } catch (InterruptedException ie) {
        LOG.warn("Interrupted Exception while stopping", ie);
      }
    }

    // stop all the components
    super.serviceStop();
  }

一直处于TIMED_WAITING,即sleep()被反复调用, 说明这个while (!drained && eventHandlingThread.isAlive()) { 一直返回true形成无限循环,也就是说:

drained = false 并且 eventHandlingThread.isAlive() = true

这两个谓语分别进行分析:

  • 为什么drained = false 会成立呢?

    drained代表队列是否已经清空, 为false表示没有清空,为true代表已经清空。 从上面已经列出的AsyncDispatcher.createThread()可以看到, 发生的时序是:

    1. 队列里面还有最后一个事件,就是上面的那个RMFatalEvent事件,所以drained = eventQueue.isEmpty();导致drainedfalse

    2. 通过event = eventQueue.take();取出事件;

    3. 调用dispatch()来对这个事件进行处理;

      注意,虽然此时队列已经为空了,但是drained其实还没有修改过来,需要等到下一次while到来才行。但是很可以,这个dispatch()方法被blocked了,所以这个drained一直是false

  • 为什么eventHandlingThread.isAlive() = true

    • 这个eventHandlingThread就是当前的这个createThread线程,根据上文讲解,这个方法当前正处在dispatch()方法中,并且由于是WAITING,所以当然是alive的;

综上所述:

问题发生的原因是:

  • 由于zk连接发生无法处理的严重错误IOException,触发了RMFatalEvent,
  • 这个Event对应的处理器是ResourceManager.RMFatalEventDispatcher,它的处理方式很粗暴,调用System.exit(1),直接退出jvm;
  • 这个System.exit(1)导致JVM的shutdown hook被调用
  • 这个ShutDown Hook会调用ResourceManager.serviceStop()来停掉ResourceManager的相关服务,但是在获取ActiveStandbyElector.class的对象锁的时候被阻塞,说明已经有其它线程获取了这个锁
  • ActiveStandbyElector.class已经被另外一个线程拿到,这个线程是一个Zookeeper的节点监听器的回调,这个回调发生得比较早,所以拿到了ActiveStandbyElector.class对象锁,开始切换到standby模式,但是切换的时候需要调用ResourceManager.transitionToStandby,这是一个synchronized方法,意味着需要拿到ResourceManager.class的对象锁,但是此时这个对象锁已经被拿走了StandByTransitionThread Handler,并且拿走这把锁的线程通过循环反复等待eventQueue已经清空了;

从上面客户看到,整个过程抽象为这样:

  1. 线程A 怎么都拿不到锁L,所以一直hang住,判断条件drain 始终未false
  2. 已经拿到锁L的线程,通过while循环反复判断并等待某个判断条件drained为true

这样就发生了相互等待,永远无法执行结束!

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值