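Over the past few days, jobs scheduled by oozie have frequently been suspended (SUSPENDED). This had happened before, but rarely, once or twice a week at most; when it did, a monitoring script reran the job and normal business was not affected. In the last day or two, however, suspensions have become very frequent, with the jobs for as many as three or four hours of a single day being suspended, which does affect normal business.
My guess was that this had something to do with the state (stability) of the hadoop cluster, but the hadoop operations staff told me that the cluster had not been changed in recent days and showed no anomalies.
Screenshot of a suspended job:
Note: loadFlash is a hive node that loads the flash logs into hive. The exception occurs here and the node's state becomes START_MANUAL. Opening the loadFlash node shows the following:
The Error Code and Error Message show that this action hit a JA009 Filesystem closed exception.
To locate the problem, start with the oozie log: go to the oozie installation directory, find the oozie.log file, and search for JA009: Filesystem closed. Sure enough, the exception appears many times: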
org.apache.oozie.action.ActionExecutorException: JA009: Filesystem closed
at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:361)
at org.apache.oozie.action.hadoop.JavaActionExecutor.prepareActionDir(JavaActionExecutor.java:390)
at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:636)
at org.apache.oozie.command.wf.ActionStartCommand.call(ActionStartCommand.java:128)
at org.apache.oozie.command.wf.ActionStartCommand.execute(ActionStartCommand.java:249)
at org.apache.oozie.command.wf.ActionStartCommand.execute(ActionStartCommand.java:47)
at org.apache.oozie.command.Command.call(Command.java:202)
at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:211)
at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:128)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:232)
at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:648)
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:255)
at org.apache.oozie.action.hadoop.JavaActionExecutor.prepareActionDir(JavaActionExecutor.java:383)
... 10 more
With this stack trace, the problem is much easier to locate. Import the oozie source project into eclipse, open JavaActionExecutor.java, and go to the prepareActionDir method:
void prepareActionDir(FileSystem actionFs, Context context) throws ActionExecutorException {
    try {
        Path actionDir = context.getActionDir();
        Path tempActionDir = new Path(actionDir.getParent(), actionDir.getName() + ".tmp");
        if (!actionFs.exists(actionDir)) {
            try {
                actionFs.copyFromLocalFile(new Path(getOozieRuntimeDir(), getLauncherJarName()), new Path(
                        tempActionDir, getLauncherJarName()));
                actionFs.rename(tempActionDir, actionDir);
            }
            catch (IOException ex) {
                actionFs.delete(tempActionDir, true);
                actionFs.delete(actionDir, true);
                throw ex;
            }
        }
    }
    catch (Exception ex) {
        throw convertException(ex);
    }
}
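The prepareActionDir method shows that a Filesystem closed exception can be thrown whenever actionFs is used: if the actionFs it receives has already been closed, any call on it throws. Judging from the stack trace, the exception surfaced in an actionFs.delete call, and the only delete calls in this method are in the inner catch block, so an earlier IOException from the copy or rename was probably masked by this one.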
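The "use after close" failure can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: ClosedClientSketch is a hypothetical stand-in, not the real DFSClient, and it merely mimics the idea behind the checkOpen frame seen in the stack trace.

```java
import java.io.IOException;

// Hypothetical stand-in for a DFS client; NOT the real org.apache.hadoop.hdfs.DFSClient.
class ClosedClientSketch {
    private volatile boolean open = true;

    // Mimics the idea of the checkOpen() frame in the stack trace:
    // refuse to do any work once the client has been closed.
    private void checkOpen() throws IOException {
        if (!open) {
            throw new IOException("Filesystem closed");
        }
    }

    boolean delete(String path) throws IOException {
        checkOpen();
        return true; // pretend the delete succeeded
    }

    void close() {
        open = false;
    }

    // Closes a client, then uses it, and returns the resulting error message.
    static String useAfterClose() {
        ClosedClientSketch client = new ClosedClientSketch();
        client.close();
        try {
            client.delete("/tmp/action.tmp"); // hypothetical path
            return "no error";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(useAfterClose());
    }
}
```

The point of the sketch is that the caller of delete has no way to tell, from its own code, that the handle it was given is already unusable.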
Next, look at how the FileSystem actionFs parameter of prepareActionDir is passed in. Here is the method that calls prepareActionDir:
@Override
public void start(Context context, WorkflowAction action) throws ActionExecutorException {
    try {
        XLog.getLog(getClass()).debug("Starting action " + action.getId() + " getting Action File System");
        FileSystem actionFs = getActionFileSystem(context, action);
        XLog.getLog(getClass()).debug("Preparing action Dir through copying " + context.getActionDir());
        prepareActionDir(actionFs, context);
        XLog.getLog(getClass()).debug("Action Dir is ready. Submitting the action ");
        submitLauncher(context, action);
        XLog.getLog(getClass()).debug("Action submit completed. Performing check ");
        check(context, action);
        XLog.getLog(getClass()).debug("Action check is done after submission");
    }
    catch (Exception ex) {
        throw convertException(ex);
    }
}
The line FileSystem actionFs = getActionFileSystem(context, action); shows that actionFs is obtained from the getActionFileSystem method. That method, together with the getAppFileSystem method it calls, looks like this:
protected FileSystem getActionFileSystem(Context context, Element actionXml) throws ActionExecutorException {
    try {
        return context.getAppFileSystem();
    }
    catch (Exception ex) {
        throw convertException(ex);
    }
}

// getAppFileSystem() as implemented by the context object used above
public FileSystem getAppFileSystem() throws HadoopAccessorException, IOException, URISyntaxException {
    WorkflowJob workflow = getWorkflow();
    XConfiguration jobConf = new XConfiguration(new StringReader(workflow.getConf()));
    Configuration fsConf = new Configuration();
    XConfiguration.copy(jobConf, fsConf);
    return Services.get().get(HadoopAccessorService.class).createFileSystem(workflow.getUser(),
            workflow.getGroup(), new URI(getWorkflow().getAppPath()), fsConf);
}
So actionFs ultimately comes from HadoopAccessorService. Here is its createFileSystem method:
public FileSystem createFileSystem(String user, String group, URI uri, Configuration conf)
        throws HadoopAccessorException {
    validateNameNode(uri.getAuthority());
    conf = createConfiguration(user, group, conf);
    try {
        return FileSystem.get(uri, conf);
    }
    catch (IOException e) {
        throw new HadoopAccessorException(ErrorCode.E0902, e);
    }
}
Now the picture is clear: oozie obtains its FileSystem by calling hadoop's FileSystem.get(uri, conf).
Next, the source of FileSystem.get(uri, conf):
public static FileSystem get(URI uri, Configuration conf) throws IOException {
    String scheme = uri.getScheme();
    String authority = uri.getAuthority();
    if (scheme == null) { // no scheme: use default FS
        return get(conf);
    }
    if (authority == null) { // no authority
        URI defaultUri = getDefaultUri(conf);
        if (scheme.equals(defaultUri.getScheme()) // if scheme matches default
                && defaultUri.getAuthority() != null) { // & default has authority
            return get(defaultUri, conf); // return default
        }
    }
    String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
    if (conf.getBoolean(disableCacheName, false)) {
        return createFileSystem(uri, conf);
    }
    return CACHE.get(uri, conf); // note: served from the cache
}
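FileSystem.get(uri, conf) uses the value of conf.getBoolean(disableCacheName, false) to decide whether to create a new FileSystem or fetch one from the cache. By default this value is false (unless disableCacheName is explicitly set to true), so the FileSystem comes from the cache. This is exactly where the problem lies. Our oozie jobs run hourly and consist of multiple action nodes, and each action node gets its FileSystem from the cache when it runs. A cached FileSystem is shared by every caller that asks for it, so the instance an action receives may already have been closed, because of a network problem or for some other reason (for example, another holder of the same instance calling close()), and the action then hits an IOException when it uses it.
With the cause located, the fix is simple: make conf.getBoolean(disableCacheName, false) return true, so that a new FileSystem is created on every call and an action can no longer pick up a stale FileSystem from the cache.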
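The effect of the cache switch can be sketched without a hadoop cluster. The code below is a simplified model, not hadoop's implementation: FsCacheSketch, Handle, the plain Map used as configuration, and the namenode:8020 authority are all hypothetical, and the cache key is reduced to scheme plus authority. Only the property name format fs.%s.impl.disable.cache is taken from the FileSystem.get source above.

```java
import java.util.HashMap;
import java.util.Map;

class FsCacheSketch {
    // Hypothetical handle standing in for a FileSystem instance.
    static final class Handle {
        private boolean closed = false;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    // Simplified cache keyed by "scheme://authority" (an assumption of this sketch).
    private static final Map<String, Handle> CACHE = new HashMap<>();

    // Same property name format as in FileSystem.get above.
    static String disableCacheKey(String scheme) {
        return String.format("fs.%s.impl.disable.cache", scheme);
    }

    // Returns a fresh handle when the cache is disabled, else a shared cached one.
    static synchronized Handle get(String scheme, String authority, Map<String, String> conf) {
        boolean cacheDisabled =
                Boolean.parseBoolean(conf.getOrDefault(disableCacheKey(scheme), "false"));
        if (cacheDisabled) {
            return new Handle();
        }
        return CACHE.computeIfAbsent(scheme + "://" + authority, key -> new Handle());
    }

    // Two callers each obtain a handle, then the first one closes its handle.
    // Returns whether the second caller's handle is now closed as well.
    static boolean secondCallerSeesClosed(boolean disableCache) {
        Map<String, String> conf = new HashMap<>();
        conf.put(disableCacheKey("hdfs"), String.valueOf(disableCache));
        Handle first = get("hdfs", "namenode:8020", conf);
        Handle second = get("hdfs", "namenode:8020", conf);
        first.close();
        return second.isClosed();
    }

    public static void main(String[] args) {
        System.out.println("cache enabled:  " + secondCallerSeesClosed(false));
        System.out.println("cache disabled: " + secondCallerSeesClosed(true));
    }
}
```

With the cache enabled both callers hold the same object, so one caller's close() breaks the other; with the cache disabled each caller owns its own handle. The trade-off is that every call now pays the cost of creating a new instance.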
Add the following configuration to the oozie workflow:
<property>
    <name>oozie.launcher.fs.hdfs.impl.disable.cache</name>
    <value>true</value>
</property>
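In addition, the source code shows that oozie has a retry mechanism for transient errors that occur while scheduling action nodes. The default is 3 retries; I changed it to 10 when submitting the job: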
oozie.wf.action.max.retries=10
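The idea behind a bounded retry on transient errors can be sketched as follows. This is not oozie's code: RetrySketch and its methods are hypothetical, IOException is simply treated as the "transient" case, and the sketch retries immediately, whereas a real scheduler would normally wait between attempts.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class RetrySketch {
    // Runs the action, retrying on IOException (treated here as "transient")
    // up to maxRetries additional times before giving up.
    static <T> T runWithRetries(Callable<T> action, int maxRetries) throws Exception {
        int retriesUsed = 0;
        while (true) {
            try {
                return action.call();
            } catch (IOException transientError) {
                if (retriesUsed >= maxRetries) {
                    throw transientError; // retries exhausted: surface the error
                }
                retriesUsed++;
            }
        }
    }

    // Simulates an action that fails failuresBeforeSuccess times, then succeeds.
    // Returns the number of calls made, or -1 if the retries ran out first.
    static int callsUntilSuccess(int failuresBeforeSuccess, int maxRetries) {
        int[] calls = {0};
        try {
            return RetrySketch.<Integer>runWithRetries(() -> {
                calls[0]++;
                if (calls[0] <= failuresBeforeSuccess) {
                    throw new IOException("Filesystem closed");
                }
                return calls[0];
            }, maxRetries);
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(callsUntilSuccess(2, 10)); // enough retries available
        System.out.println(callsUntilSuccess(2, 1));  // retries run out
    }
}
```

A higher retry limit only helps when the error really is transient; it does not remove the underlying cause, which is why it is combined with the cache setting above.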
After these changes, oozie scheduling became noticeably more robust ^_^