Conclusion: when the Azkaban filter configuration includes MinimumFreeMemory, only executors with more than 6 GB of free memory are eligible to run jobs. If no executor has more than 6 GB free, executor assignment fails and jobs cannot run; they stay stuck until free memory climbs back above 6 GB and an executor is assigned successfully.
Analysis: The problem first appeared on August 4. Checking the logs at the time, I found this:
[ExecutorManager] [Azkaban] Reached handleNoExecutorSelectedCase stage for exec 4672839 with error count 0
Searching the logs for "Reached handleNoExecutorSelectedCase stage" turned up many occurrences on each of the previous three days, at irregular times. During that period there was no network jitter, and host iowait, CPU usage, and memory were all low.
Searching online, I found someone reporting the following:
Azkaban records each job's next scheduled execution time, so if an executor host's clock runs behind the web host's, this problem can occur. Given what I knew at that point, this seemed quite plausible. Checking the NTP configuration on one executor machine, I found it was pointed at a non-internal NTP server, 0.asia.pool.ntp.org; pinging that address showed 180+ ms latency, which made me even more convinced the problem was a failed time sync. I promptly switched it to the internal NTP service, and restarted the executor afterwards.
Two weeks later, on the morning of the 20th, a colleague reported that the delay problem had reappeared and was still ongoing. After logging in I first checked the system time, which was normal, yet the logs were again full of "Reached handleNoExecutorSelectedCase stage" errors. Restarting the webserver did not help; restarting the executor did, and things recovered. At that point I realized the two quiet weeks were probably thanks to restarting the executor, not the NTP change. Oops...
Having already investigated this issue without finding a good answer, this time I decided to dig for the root cause in the source code. I really don't know Java, and with all the interdependencies the reading was hard going, but Azkaban is widely used and plenty of people have analyzed its source. I found two good write-ups online:
http://www.acyouzi.com/2017/06/15/azkaban/
http://www.ivanopt.com/deep-research-source-code-azkaban/
I focused on the job-dispatch path, which roughly goes: the web service writes the execution record to the DB ---> the web service notifies an executor that a job needs to run (in multi-node mode it first picks the host using the configured filters and comparator weights; this is the step that failed here) ---> the executor receives the instruction, fetches the execution parameters from the DB, and launches the job.
According to those write-ups, the code that produces the error seen in the logs is:
private void selectExecutorAndDispatchFlow(ExecutionReference reference,
    ExecutableFlow exflow, Set<Executor> availableExecutors)
    throws ExecutorManagerException {
  synchronized (exflow) {
    Executor selectedExecutor = selectExecutor(exflow, availableExecutors);
    if (selectedExecutor != null) {
      try {
        dispatch(reference, exflow, selectedExecutor);
      } catch (ExecutorManagerException e) {
        logger.warn(String.format(
            "Executor %s responded with exception for exec: %d",
            selectedExecutor, exflow.getExecutionId()), e);
        handleDispatchExceptionCase(reference, exflow, selectedExecutor,
            availableExecutors);
      }
    } else {
      handleNoExecutorSelectedCase(reference, exflow);
    }
  }
}
If the if (selectedExecutor != null) condition fails, the handleNoExecutorSelectedCase error shows up in the log, so let's follow selectedExecutor:
Executor selectedExecutor = selectExecutor(exflow, availableExecutors);
Stepping into selectExecutor:
private Executor selectExecutor(ExecutableFlow exflow,
    Set<Executor> availableExecutors) {
  Executor choosenExecutor =
      getUserSpecifiedExecutor(exflow.getExecutionOptions(),
          exflow.getExecutionId());
  if (choosenExecutor == null) {
    logger.info("Using dispatcher for execution id :" + exflow.getExecutionId());
    ExecutorSelector selector =
        new ExecutorSelector(filterList, comparatorWeightsMap);
    choosenExecutor = selector.getBest(availableExecutors, exflow);
  }
  return choosenExecutor;
}
We do not pin executions to a specific executor, so we fall into the if (choosenExecutor == null) branch; the "Using dispatcher for execution id xxx" lines in the log confirm this.
The next two lines,
ExecutorSelector selector = new ExecutorSelector(filterList, comparatorWeightsMap);
choosenExecutor = selector.getBest(availableExecutors, exflow);
pick a suitable executor by combining the configured filters and comparator. My first guess at this point was that the active field in the executors table might somehow have been updated (that is where availableExecutors comes from), but the log still contained the following values, suggesting that was not the case:
2017/08/20 01:26:09.990 +0800 INFO [ExecutorManager] [Azkaban] Successfully refreshed executor: 10.10.221.204:12321 (id: 1) with executor info : {"remainingMemoryPercent": 76.44902066392035, "remainingMemoryInMB": 6104, "remainingFlowCapacity": 29, "numberOfAssignedFlows": 1, "lastDispatchedTime": 1503153050310, "cpuUsage": 0.0}
2017/08/20 01:26:09.992 +0800 INFO [ExecutorManager] [Azkaban] Successfully refreshed executor: 10.10.4.50:12321 (id: 7) with executor info : {"remainingMemoryPercent": 76.91752649941591, "remainingMemoryInMB": 6141, "remainingFlowCapacity": 22, "numberOfAssignedFlows": 8, "lastDispatchedTime": 1503163310196, "cpuUsage": 0.08}
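The filter-then-compare behavior of getBest can be illustrated with a minimal sketch. This is my own simplification, not Azkaban's actual implementation (the real ExecutorSelector also applies weighted comparators), and Exec is a hypothetical stand-in for the Executor/ExecutorInfo pair: a candidate must pass every filter, and the comparator picks the best survivor. When all candidates fail a filter, the result is null, which is exactly the handleNoExecutorSelectedCase path seen in the logs.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of filter-then-compare executor selection.
public class SelectorSketch {
    // hypothetical stand-in for Azkaban's Executor + ExecutorInfo
    public static class Exec {
        public final String host;
        public final long freeMb;
        public Exec(String host, long freeMb) {
            this.host = host;
            this.freeMb = freeMb;
        }
    }

    public static Exec getBest(Collection<Exec> candidates,
                               List<Predicate<Exec>> filters,
                               Comparator<Exec> comparator) {
        Exec best = null;
        for (Exec e : candidates) {
            // a candidate is dropped as soon as any filter rejects it
            if (!filters.stream().allMatch(f -> f.test(e))) continue;
            if (best == null || comparator.compare(e, best) > 0) best = e;
        }
        return best; // null when every candidate was filtered out
    }

    public static void main(String[] args) {
        // the two executors from the log, both under the 6144 MB threshold
        List<Exec> execs = List.of(new Exec("10.10.221.204", 6104),
                                   new Exec("10.10.4.50", 6141));
        List<Predicate<Exec>> filters = List.of(e -> e.freeMb > 6 * 1024);
        Comparator<Exec> byFreeMb = Comparator.comparingLong(e -> e.freeMb);
        Exec chosen = getBest(execs, filters, byFreeMb);
        System.out.println(chosen == null ? "no executor selected" : chosen.host);
    }
}
```

With the log's values (6104 MB and 6141 MB) every candidate fails the memory filter, so selection returns null even though both executors are alive and nearly idle.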
To be safe I also checked the binlog, which confirmed there had been no UPDATE on the executors table.
So back to tracing the code. Skipping over the long slog, I finally arrived at this spot:
/**
 * <pre>
 * function to register the static Minimum Reserved Memory filter.
 * NOTE : this is a static filter which means the filter will be filtering based on the system
 * standard which is not Coming for the passed flow.
 * This filter will filter out any executors that has the remaining memory below 6G
 * </pre>
 */
private static FactorFilter<Executor, ExecutableFlow> getMinimumReservedMemoryFilter() {
  return FactorFilter
      .create(MINIMUMFREEMEMORY_FILTER_NAME,
          new FactorFilter.Filter<Executor, ExecutableFlow>() {
            private static final int MINIMUM_FREE_MEMORY = 6 * 1024;

            @Override
            public boolean filterTarget(final Executor filteringTarget,
                final ExecutableFlow referencingObject) {
              if (null == filteringTarget) {
                logger.debug(String.format(
                    "%s : filtering out the target as it is null.",
                    MINIMUMFREEMEMORY_FILTER_NAME));
                return false;
              }
              final ExecutorInfo stats = filteringTarget.getExecutorInfo();
              if (null == stats) {
                logger.debug(String.format(
                    "%s : filtering out %s as it's stats is unavailable.",
                    MINIMUMFREEMEMORY_FILTER_NAME,
                    filteringTarget.toString()));
                return false;
              }
              return stats.getRemainingMemoryInMB() > MINIMUM_FREE_MEMORY;
            }
          });
}
"This filter will filter out any executors that has the remaining memory below 6G"
An executor only passes when its free memory exceeds 6 GB, i.e. 6 x 1024 = 6144 MB. Going back over the executor info captured at failure time (I reviewed essentially every failure log plus a sample of successful ones), both machines had under 6144 MB free whenever dispatch failed, and whenever free memory exceeded 6144 MB the executor was assigned successfully. That finally pinned down the problem. The fix is to change the original configuration
azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
to
azkaban.executorselector.filters=StaticRemainingFlowSize,CpuStatus
and restart the webserver for the new configuration to take effect.
PS: a word on the other two filters, because if handleNoExecutorSelectedCase errors ever show up again they will basically be the cause. StaticRemainingFlowSize limits each executor to 30 concurrently running flows; if every executor is full, dispatch fails. CpuStatus fails dispatch once cpuUsage reaches 95 or above.
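For reference, the conditions of the three default filters can be condensed into a few boolean checks. This is my own paraphrase of the Azkaban source read above; treat the constants as version-dependent rather than guaranteed:

```java
// Condensed paraphrase of Azkaban's three default executor filters.
// Each returns true when the executor passes that filter.
public class DefaultFilterConditions {
    // MinimumFreeMemory: pass only above 6 GB (6144 MB) of free memory
    public static boolean minimumFreeMemory(long remainingMemoryInMB) {
        return remainingMemoryInMB > 6 * 1024;
    }

    // StaticRemainingFlowSize: pass only while flow slots remain
    // (remainingFlowCapacity counts down from the per-executor limit, 30)
    public static boolean staticRemainingFlowSize(int remainingFlowCapacity) {
        return remainingFlowCapacity > 0;
    }

    // CpuStatus: fail dispatch once cpuUsage reaches 95
    public static boolean cpuStatus(double cpuUsage) {
        return cpuUsage < 95;
    }
}
```

Plugging in the logged values above: both executors pass staticRemainingFlowSize (capacities 29 and 22) and cpuStatus (0.0 and 0.08), but fail minimumFreeMemory (6104 and 6141 MB), which matches the observed dispatch failures.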
Follow-up: set up monitoring of Azkaban execution health. The simplest approach for now is a job that runs every minute with an alert on job failure, but this incident never entered the execution flow at all, so that alone cannot catch it; monitoring along other dimensions is also needed.
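One such dimension is the executor info the webserver already logs. Since dispatch fails whenever remainingMemoryInMB drops to 6144 or below, a log-scraping check could alert before jobs pile up. A rough sketch (my own code, not part of Azkaban; the threshold constant simply mirrors the filter above):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of a log-based check: extract remainingMemoryInMB from the
// "Successfully refreshed executor" lines and flag executors that the
// MinimumFreeMemory filter would reject.
public class ExecutorMemoryCheck {
    private static final Pattern MEM =
        Pattern.compile("\"remainingMemoryInMB\"\\s*:\\s*(\\d+)");
    private static final long THRESHOLD_MB = 6 * 1024; // mirrors the filter

    // returns -1 when the line carries no executor info
    public static long remainingMemoryMb(String logLine) {
        Matcher m = MEM.matcher(logLine);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    // true when this executor would be filtered out by MinimumFreeMemory
    public static boolean wouldBeFiltered(String logLine) {
        long mb = remainingMemoryMb(logLine);
        return mb >= 0 && mb <= THRESHOLD_MB;
    }
}
```

Feeding the webserver log through a check like this, and alerting when every refreshed executor would be filtered, covers exactly the failure mode of this incident.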