A YARN ResourceManager crash: a case study

weixin_33785108

Tags: java, big data, jira

Original article: https://yq.aliyun.com/articles/434517


Today an alert fired for our production ResourceManager (RM).

The RM log showed the following errors:

 
      
    2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxxx:53356 Timed out after 600 secs
    2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node xxxx:53356 as it is now LOST
    2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: xxxx:53356 Node Transitioned from UNHEALTHY to LOST
    2014-07-08 13:22:54,118 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
    java.lang.NullPointerException
            at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:715)
            at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:974)
            at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:108)
            at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:378)
            at java.lang.Thread.run(Thread.java:662)
    2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
    2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
    2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000

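This is a known bug, tracked as https://issues.apache.org/jira/browse/YARN-502

According to the bug description, it can be triggered when the RM removes a NodeManager (NM) that is already marked UNHEALTHY: the node was removed from the scheduler the first time, so the second removal attempt fails.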
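To make the failure mode concrete, here is a minimal sketch of how a second removal can end in a NullPointerException. It is not the FairScheduler code: the class ToyScheduler, its fields, and the node name are all hypothetical, and the only assumption carried over from the bug report is that the scheduler drops its bookkeeping for a node on the first removal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scheduler's node bookkeeping; not Hadoop code.
class ToyScheduler {
    private final Map<String, Integer> nodeCapacityMb = new HashMap<>();
    private int clusterCapacityMb = 0;

    void addNode(String nodeId, int capacityMb) {
        nodeCapacityMb.put(nodeId, capacityMb);
        clusterCapacityMb += capacityMb;
    }

    // Unguarded removal: a second call for the same node finds no entry,
    // and unboxing the null Integer throws NullPointerException.
    void removeNode(String nodeId) {
        Integer capacity = nodeCapacityMb.remove(nodeId);
        clusterCapacityMb -= capacity; // NPE when capacity == null
    }

    // Returns true if removing the same node twice raises an NPE.
    static boolean secondRemovalThrows() {
        ToyScheduler s = new ToyScheduler();
        s.addNode("nm1:53356", 8192);
        s.removeNode("nm1:53356");      // first removal (node became UNHEALTHY)
        try {
            s.removeNode("nm1:53356");  // second removal (node became LOST)
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

Calling ToyScheduler.secondRemovalThrows() exercises both removals; the second one is the analogue of the NODE_REMOVED event in the log above.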

Following the stack trace into the code:

 
    org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler:
    protected ResourceScheduler scheduler;

    private final class EventProcessor implements Runnable { // an EventProcessor thread is started to process events
      @Override
      public void run() {
        SchedulerEvent event;
        while (!stopped && !Thread.currentThread().isInterrupted()) {
          try {
            event = eventQueue.take(); // take the next event from the event queue
          } catch (InterruptedException e) {
            LOG.error("Returning, interrupted : " + e);
            return; // TODO: Kill RM.
          }
          try {
            scheduler.handle(event); // handle the event
          } catch (Throwable t) { // catch anything thrown while handling the event
            // An error occurred, but we are shutting down anyway.
            // If it was an InterruptedException, the very act of
            // shutdown could have caused it and is probably harmless.
            if (stopped) {
              LOG.warn("Exception during shutdown: ", t);
              break;
            }
            LOG.fatal("Error in handling event type " + event.getType() // per the log, event.getType() here is NODE_REMOVED
                + " to the scheduler", t);
            if (shouldExitOnError
                && !ShutdownHookManager.get().isShutdownInProgress()) {
              LOG.info("Exiting, bbye..");
              System.exit(-1);
            }
          }
        }
      }
    }

As the code shows, shouldExitOnError controls whether the RM process exits when handling an event fails.

 
    private boolean shouldExitOnError = false; // initially false

    @Override
    public synchronized void init(Configuration conf) { // the value is read from the configuration during initialization
      this.shouldExitOnError =
          conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
              Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR); // both constants are defined in Dispatcher
      super.init(conf);
    }
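The effect of the flag is easier to see in isolation. The following is a simplified sketch of an event loop with an exit-on-error switch, not the Hadoop class: ToyEventProcessor and its methods are hypothetical, events are plain Runnables, and instead of calling System.exit(-1) the loop returns true so that the behavior can be observed from a test.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified event processor; not the Hadoop class.
class ToyEventProcessor {
    private final boolean exitOnError;
    private int handledCount = 0;

    ToyEventProcessor(boolean exitOnError) {
        this.exitOnError = exitOnError;
    }

    // Processes events in order. Returns true if processing stopped
    // early because an event failed and exitOnError is set.
    boolean drain(List<Runnable> events) {
        Queue<Runnable> queue = new ArrayDeque<>(events);
        while (!queue.isEmpty()) {
            Runnable event = queue.poll();
            try {
                event.run();
                handledCount++;
            } catch (Throwable t) {
                if (exitOnError) {
                    return true; // the code above calls System.exit(-1) at this point
                }
                // Otherwise the failure is swallowed and the loop continues.
            }
        }
        return false;
    }

    int handled() {
        return handledCount;
    }
}
```

With exitOnError set to true, one failing event stops everything queued behind it, which matches the RM exiting right after the NODE_REMOVED failure. With false, the loop keeps going, at the cost of running on after an error the scheduler never handled.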

 
The org.apache.hadoop.yarn.event.Dispatcher interface:

    public interface Dispatcher {
      // Configuration to make sure dispatcher crashes but doesn't do system-exit in
      // case of errors. By default, it should be false, so that tests are not
      // affected. For all daemons it should be explicitly set to true so that
      // daemons can crash instead of hanging around.
      public static final String DISPATCHER_EXIT_ON_ERROR_KEY =
          "yarn.dispatcher.exit-on-error"; // the controlling parameter
      public static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false; // defaults to false

      EventHandler getEventHandler();

      void register(Class<? extends Enum> eventType, EventHandler handler);
    }

In the init method of the ResourceManager class:

 
    @Override
    public synchronized void init(Configuration conf) {
      this.conf = conf;
      // The RM sets the value to true here, overriding the DEFAULT defined in Dispatcher.
      this.conf.setBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY, true);

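In other words, by default the RM exits when the dispatcher hits an error.
Whether to exit on error is governed by the yarn.dispatcher.exit-on-error parameter, although as shown above the RM sets it to true in code during init. Changing this behavior has a wide impact, so it is better left alone; applying the patch is the right fix.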
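The interplay between the false default in Dispatcher and the true set by the RM can be sketched with a toy key/value configuration. ToyConf below is hypothetical and only mimics the getBoolean/setBoolean pattern seen in the excerpts; it assumes, as a plain map would, that an explicit set replaces any earlier value for the same key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of a key/value configuration; not Hadoop's Configuration.
class ToyConf {
    static final String EXIT_ON_ERROR_KEY = "yarn.dispatcher.exit-on-error";
    static final boolean DEFAULT_EXIT_ON_ERROR = false;

    private final Map<String, String> values = new HashMap<>();

    // An explicit set replaces whatever value the key had before.
    void setBoolean(String key, boolean value) {
        values.put(key, Boolean.toString(value));
    }

    // Falls back to defaultValue only when the key was never set.
    boolean getBoolean(String key, boolean defaultValue) {
        String raw = values.get(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw);
    }
}
```

On a fresh ToyConf, getBoolean(EXIT_ON_ERROR_KEY, DEFAULT_EXIT_ON_ERROR) yields the false default, which is what a unit test would see. After setBoolean(EXIT_ON_ERROR_KEY, true), the same read yields true, which is the situation inside a running RM.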

The official patch is fairly simple: when the RM deactivates an NM, it first checks the node's state, which prevents the second removal:

 
    --- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
    +++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
    @@ -501,8 +501,13 @@ public DeactivateNodeTransition(NodeState finalState) {
         public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
           // Inform the scheduler
           rmNode.nodeUpdateQueue.clear();
    -      rmNode.context.getDispatcher().getEventHandler().handle(
    -          new NodeRemovedSchedulerEvent(rmNode));
    +      // If the current state is NodeState.UNHEALTHY
    +      // Then node is already been removed from the
    +      // Scheduler
    +      if (!rmNode.getState().equals(NodeState.UNHEALTHY)) {
    +        rmNode.context.getDispatcher().getEventHandler()
    +          .handle(new NodeRemovedSchedulerEvent(rmNode));
    +      }
           rmNode.context.getDispatcher().getEventHandler().handle(
               new NodesListManagerEvent(
                   NodesListManagerEventType.NODE_UNUSABLE, rmNode));
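The guard in the patch can be modeled with a very small state machine. This is only a sketch: ToyNode, its State enum, and the string "NODE_REMOVED" are hypothetical stand-ins for RMNodeImpl, NodeState, and NodeRemovedSchedulerEvent, and it assumes, as the patch comment states, that a node entering UNHEALTHY has already been removed from the scheduler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the node state machine; not the Hadoop classes.
class ToyNode {
    enum State { RUNNING, UNHEALTHY, LOST }

    private State state = State.RUNNING;
    final List<String> schedulerEvents = new ArrayList<>();

    // RUNNING -> UNHEALTHY: the scheduler is told to drop the node.
    void markUnhealthy() {
        schedulerEvents.add("NODE_REMOVED");
        state = State.UNHEALTHY;
    }

    // Deactivation with the patched guard: skip NODE_REMOVED when the node
    // was UNHEALTHY, because the scheduler no longer tracks it.
    void deactivate(State finalState) {
        if (state != State.UNHEALTHY) {
            schedulerEvents.add("NODE_REMOVED");
        }
        state = finalState;
    }

    State state() {
        return state;
    }
}
```

A node that goes RUNNING, UNHEALTHY, LOST produces one NODE_REMOVED entry in schedulerEvents instead of two, while a node that goes straight from RUNNING to LOST still produces one.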