Conclusion: when the Azkaban filter configuration includes MinimumFreeMemory, only executors with more than 6 GB of free memory are eligible to run jobs. If no executor has more than 6 GB free, executor assignment fails and jobs cannot run; they stay stuck until free memory climbs back above 6 GB and an executor is assigned successfully.
Analysis: The problem first appeared on August 4. Checking the logs at the time, I found this:
[ExecutorManager] [Azkaban] Reached handleNoExecutorSelectedCase stage for exec 4672839 with error count 0
Searching the logs for "Reached handleNoExecutorSelectedCase stage" turned up many occurrences on each of the previous three days, at irregular times. During that period there was no network jitter, and host iowait, CPU usage, and memory were all low.
Searching online, I found someone reporting the following:
Azkaban records each job's next scheduled execution time, so if an executor host's clock runs behind the web host's, this problem can occur. Given what I knew at that point, this seemed quite plausible. Checking the NTP configuration on one executor machine, I found it was pointed at a non-internal NTP server, 0.asia.pool.ntp.org; pinging that address showed 180+ ms latency, which made me even more convinced the problem was a failed time sync. I promptly switched it to the internal NTP service, and restarted the executor afterwards.
Two weeks later, on the morning of the 20th, a colleague reported that the delay problem had reappeared and was still ongoing. After logging in I first checked the system time, which was normal, yet the logs were again full of "Reached handleNoExecutorSelectedCase stage" errors. Restarting the webserver did not help; restarting the executor did, and things recovered. At that point I realized the two quiet weeks were probably thanks to restarting the executor, not the NTP change. Oops...
Having already investigated this issue without finding a good answer, this time I decided to dig for the root cause in the source code. I really don't know Java, and with all the interdependencies the reading was hard going, but Azkaban is widely used and plenty of people have analyzed its source. I found two good write-ups online:
http://www.acyouzi.com/2017/06/15/azkaban/
http://www.ivanopt.com/deep-research-source-code-azkaban/
I focused on the job-dispatch path, which roughly goes: the web service writes the execution record to the DB ---> the web service notifies an executor that a job needs to run (in multi-node mode it first picks the host using the configured filters and comparator weights; this is the step that failed here) ---> the executor receives the instruction, fetches the execution parameters from the DB, and launches the job.
According to those write-ups, the code that produces the error seen in the logs is:
private void selectExecutorAndDispatchFlow(ExecutionReference reference,
    ExecutableFlow exflow, Set<Executor> availableExecutors)
    throws ExecutorManagerException {
  synchronized (exflow) {
    Executor selectedExecutor = selectExecutor(exflow, availableExecutors);
    if (selectedExecutor != null) {
      try {
        dispatch(reference, exflow, selectedExecutor);
      } catch (ExecutorManagerException e) {
        logger.warn(String.format(
            "Executor %s responded with exception for exec: %d",
            selectedExecutor, exflow.getExecutionId()), e);
        handleDispatchExceptionCase(reference, exflow, selectedExecutor,
            availableExecutors);
      }
    } else {
      handleNoExecutorSelectedCase(reference, exflow);
    }
  }
}
If the if (selectedExecutor != null) condition fails, the handleNoExecutorSelectedCase error shows up in the log, so let's follow selectedExecutor:
Executor selectedExecutor = selectExecutor(exflow, availableExecutors);
Stepping into selectExecutor:
private Executor selectExecutor(ExecutableFlow exflow,
    Set<Executor> availableExecutors) {
  Executor choosenExecutor =
      getUserSpecifiedExecutor(exflow.getExecutionOptions(),
          exflow.getExecutionId());
  if (choosenExecutor == null) {
    logger.info("Using dispatcher for execution id :" + exflow.getExecutionId());
    ExecutorSelector selector =
        new ExecutorSelector(filterList, comparatorWeightsMap);
    choosenExecutor = selector.getBest(availableExecutors, exflow);
  }
  return choosenExecutor;
}
We do not pin executions to a specific executor, so we fall into the if (choosenExecutor == null) branch; the "Using dispatcher for execution id xxx" lines in the log confirm this.
The next two lines,
ExecutorSelector selector = new ExecutorSelector(filterList, comparatorWeightsMap);
choosenExecutor = selector.getBest(availableExecutors, exflow);
pick a suitable executor by combining the configured filters and comparator. My first guess at this point was that the active field in the executors table might somehow have been updated (that is where availableExecutors comes from), but the log still contained the following values, suggesting that was not the case:
2017/08/20 01:26:09.990 +0800 INFO [ExecutorManager] [Azkaban] Successfully refreshed executor: 10.10.221.204:12321 (id: 1) with executor info : {"remainingMemoryPercent": 76.44902066392035, "remainingMemoryInMB": 6104, "remainingFlowCapacity": 29, "numberOfAssignedFlows": 1, "lastDispatchedTime": 1503153050310, "cpuUsage": 0.0}
2017/08/20 01:26:09.992 +0800 INFO [ExecutorManager] [Azkaban] Successfully refreshed executor: 10.10.4.50:12321 (id: 7) with executor info : {"remainingMemoryPercent": 76.91752649941591, "remainingMemoryInMB": 6141, "remainingFlowCapacity": 22, "numberOfAssignedFlows": 8, "lastDispatchedTime": 1503163310196, "cpuUsage": 0.08}
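The filter-then-compare behavior of getBest can be illustrated with a minimal sketch. This is my own simplification, not Azkaban's actual implementation (the real ExecutorSelector also applies weighted comparators), and Exec is a hypothetical stand-in for the Executor/ExecutorInfo pair: a candidate must pass every filter, and the comparator picks the best survivor. When all candidates fail a filter, the result is null, which is exactly the handleNoExecutorSelectedCase path seen in the logs.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of filter-then-compare executor selection.
public class SelectorSketch {
    // hypothetical stand-in for Azkaban's Executor + ExecutorInfo
    public static class Exec {
        public final String host;
        public final long freeMb;
        public Exec(String host, long freeMb) {
            this.host = host;
            this.freeMb = freeMb;
        }
    }

    public static Exec getBest(Collection<Exec> candidates,
                               List<Predicate<Exec>> filters,
                               Comparator<Exec> comparator) {
        Exec best = null;
        for (Exec e : candidates) {
            // a candidate is dropped as soon as any filter rejects it
            if (!filters.stream().allMatch(f -> f.test(e))) continue;
            if (best == null || comparator.compare(e, best) > 0) best = e;
        }
        return best; // null when every candidate was filtered out
    }

    public static void main(String[] args) {
        // the two executors from the log, both under the 6144 MB threshold
        List<Exec> execs = List.of(new Exec("10.10.221.204", 6104),
                                   new Exec("10.10.4.50", 6141));
        List<Predicate<Exec>> filters = List.of(e -> e.freeMb > 6 * 1024);
        Comparator<Exec> byFreeMb = Comparator.comparingLong(e -> e.freeMb);
        Exec chosen = getBest(execs, filters, byFreeMb);
        System.out.println(chosen == null ? "no executor selected" : chosen.host);
    }
}
```

With the log's values (6104 MB and 6141 MB) every candidate fails the memory filter, so selection returns null even though both executors are alive and nearly idle.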
To be safe I also checked the binlog, which confirmed there had been no UPDATE on the executors table.
So back to tracing the code. Skipping over the long slog, I finally arrived at this spot:
/**
 * <pre>
 * function to register the static Minimum Reserved Memory filter.
 * NOTE : this is a static filter which means the filter will be filtering based on the system
 * standard which is not Coming for the passed flow.
 * This filter will filter out any executors that has the remaining memory below 6G
 * </pre>
 */
private static FactorFilter<Executor, ExecutableFlow> getMinimumReservedMemoryFilter() {
  return FactorFilter
      .create(MINIMUMFREEMEMORY_FILTER_NAME,
          new FactorFilter.Filter<Executor, ExecutableFlow>() {
            private static final int MINIMUM_FREE_MEMORY = 6 * 1024;

            @Override
            public boolean filterTarget(final Executor filteringTarget,
                final ExecutableFlow referencingObject) {
              if (null == filteringTarget) {
                logger.debug(String.format(
                    "%s : filtering out the target as it is null.",
                    MINIMUMFREEMEMORY_FILTER_NAME));
                return false;
              }
              final ExecutorInfo stats = filteringTarget.getExecutorInfo();
              if (null == stats) {
                logger.debug(String.format(
                    "%s : filtering out %s as it's stats is unavailable.",
                    MINIMUMFREEMEMORY_FILTER_NAME,
                    filteringTarget.toString()));
                return false;
              }
              return stats.getRemainingMemoryInMB() > MINIMUM_FREE_MEMORY;
            }
          });
}
"This filter will filter out any executors that has the remaining memory below 6G"
An executor only passes when its free memory exceeds 6 GB, i.e. 6 x 1024 = 6144 MB. Going back over the executor info captured at failure time (I reviewed essentially every failure log plus a sample of successful ones), both machines had under 6144 MB free whenever dispatch failed, and whenever free memory exceeded 6144 MB the executor was assigned successfully. That finally pinned down the problem. The fix is to change the original configuration
azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
to
azkaban.executorselector.filters=StaticRemainingFlowSize,CpuStatus
and restart the webserver for the new configuration to take effect.
PS: a word on the other two filters, because if handleNoExecutorSelectedCase errors ever show up again they will basically be the cause. StaticRemainingFlowSize limits each executor to 30 concurrently running flows; if every executor is full, dispatch fails. CpuStatus fails dispatch once cpuUsage reaches 95 or above.
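For reference, the conditions of the three default filters can be condensed into a few boolean checks. This is my own paraphrase of the Azkaban source read above; treat the constants as version-dependent rather than guaranteed:

```java
// Condensed paraphrase of Azkaban's three default executor filters.
// Each returns true when the executor passes that filter.
public class DefaultFilterConditions {
    // MinimumFreeMemory: pass only above 6 GB (6144 MB) of free memory
    public static boolean minimumFreeMemory(long remainingMemoryInMB) {
        return remainingMemoryInMB > 6 * 1024;
    }

    // StaticRemainingFlowSize: pass only while flow slots remain
    // (remainingFlowCapacity counts down from the per-executor limit, 30)
    public static boolean staticRemainingFlowSize(int remainingFlowCapacity) {
        return remainingFlowCapacity > 0;
    }

    // CpuStatus: fail dispatch once cpuUsage reaches 95
    public static boolean cpuStatus(double cpuUsage) {
        return cpuUsage < 95;
    }
}
```

Plugging in the logged values above: both executors pass staticRemainingFlowSize (capacities 29 and 22) and cpuStatus (0.0 and 0.08), but fail minimumFreeMemory (6104 and 6141 MB), which matches the observed dispatch failures.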
Follow-up: set up monitoring of Azkaban execution health. The simplest approach for now is a job that runs every minute with an alert on job failure, but this incident never entered the execution flow at all, so that alone cannot catch it; monitoring along other dimensions is also needed.
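One such dimension is the executor info the webserver already logs. Since dispatch fails whenever remainingMemoryInMB drops to 6144 or below, a log-scraping check could alert before jobs pile up. A rough sketch (my own code, not part of Azkaban; the threshold constant simply mirrors the filter above):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of a log-based check: extract remainingMemoryInMB from the
// "Successfully refreshed executor" lines and flag executors that the
// MinimumFreeMemory filter would reject.
public class ExecutorMemoryCheck {
    private static final Pattern MEM =
        Pattern.compile("\"remainingMemoryInMB\"\\s*:\\s*(\\d+)");
    private static final long THRESHOLD_MB = 6 * 1024; // mirrors the filter

    // returns -1 when the line carries no executor info
    public static long remainingMemoryMb(String logLine) {
        Matcher m = MEM.matcher(logLine);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    // true when this executor would be filtered out by MinimumFreeMemory
    public static boolean wouldBeFiltered(String logLine) {
        long mb = remainingMemoryMb(logLine);
        return mb >= 0 && mb <= THRESHOLD_MB;
    }
}
```

Feeding the webserver log through a check like this, and alerting when every refreshed executor would be filtered, covers exactly the failure mode of this incident.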