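Problem description
While using the new DS 1.3.4 release, we triggered a schedule and it stayed in the running state indefinitely. The api-server log looked normal, and neither master-server nor worker-server reported any error. The process instance was running, the task instance was in the submitted state, and nothing moved after that: it was stuck there.
Locating the problem
The process instance was running, which means master-server had updated the process instance state. However, the master-server log contained nothing about the Netty send step, so we went to the following source code:
Adding a task to the standby list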
private void addTaskToStandByList(TaskInstance taskInstance){
    logger.info("add task to stand by list: {}", taskInstance.getName());
    try {
        readyToSubmitTaskQueue.put(taskInstance);
    } catch (Exception e) {
        // pass the exception so the stack trace is not lost
        logger.error("add task instance to readyToSubmitTaskQueue error", e);
    }
}
The TaskPriorityQueueConsumer thread continuously takes the highest-priority tasks from the standby task queue and dispatches them.
public void run() {
    List<String> failedDispatchTasks = new ArrayList<>();
    while (Stopper.isRunning()){
        try {
            int fetchTaskNum = masterConfig.getMasterDispatchTaskNumber();
            failedDispatchTasks.clear();
            for(int i = 0; i < fetchTaskNum; i++){
                if(taskPriorityQueue.size() <= 0){
                    Thread.sleep(Constants.SLEEP_TIME_MILLIS);
                    continue;
                }
                // take the highest-priority task from the standby task queue
                String taskPriorityInfo = taskPriorityQueue.take();
                TaskPriority taskPriority = TaskPriority.of(taskPriorityInfo);
                // dispatch the task taken from the standby task queue
                boolean dispatchResult = dispatch(taskPriority.getTaskId());
                if(!dispatchResult){
                    failedDispatchTasks.add(taskPriorityInfo);
                }
            }
            if (!failedDispatchTasks.isEmpty()) {
                for (String dispatchFailedTask : failedDispatchTasks) {
                    taskPriorityQueue.put(dispatchFailedTask);
                }
                // If there are tasks in a cycle that cannot find the worker group,
                // sleep for 1 second
                if (taskPriorityQueue.size() <= failedDispatchTasks.size()) {
                    TimeUnit.MILLISECONDS.sleep(Constants.SLEEP_TIME_MILLIS);
                }
            }
        }catch (Exception e){
            logger.error("dispatcher task error",e);
        }
    }
}
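The consumer loop above boils down to one pattern: take up to N items from a priority queue, try to dispatch each, and put the failed ones back. Below is a minimal, self-contained sketch of that pattern, not the DS source. The class name DispatchLoopSketch, the method drainOnce, and the use of java.util.concurrent.PriorityBlockingQueue with a Predicate standing in for dispatch are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.function.Predicate;

// Hypothetical sketch of the take / dispatch / re-queue pattern; not DS code.
class DispatchLoopSketch {

    /**
     * Runs one round: takes at most fetchNum items in priority order,
     * dispatches each, and re-queues those whose dispatch returned false.
     * Returns the number of items that failed in this round.
     */
    static int drainOnce(PriorityBlockingQueue<String> queue,
                         int fetchNum,
                         Predicate<String> dispatch) {
        List<String> failed = new ArrayList<>();
        for (int i = 0; i < fetchNum; i++) {
            String task = queue.poll(); // non-blocking; null when the queue is empty
            if (task == null) {
                break;
            }
            if (!dispatch.test(task)) {
                failed.add(task);
            }
        }
        // Failed items go back into the queue and are retried in a later round.
        queue.addAll(failed);
        return failed.size();
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<String> queue = new PriorityBlockingQueue<>();
        queue.add("2_taskB");
        queue.add("1_taskA");
        queue.add("3_taskC");
        // Pretend that dispatching taskB always fails.
        int failed = drainOnce(queue, 3, task -> !task.endsWith("taskB"));
        System.out.println("failed=" + failed + " remaining=" + queue);
    }
}
```

In a loop of this shape, a task whose dispatch keeps failing is put back every round and never leaves the queue. From the outside that looks much like a task instance that stays in the submitted state, which is why logging inside the dispatch step helps.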
TaskPriorityQueueConsumer#dispatch, the task dispatch method, with added logging
protected boolean dispatch(int taskInstanceId) {
    logger.info("dispatch taskInstanceId:{}", taskInstanceId);
    boolean result = false;
    try {
        TaskExecutionContext context = getTaskExecutionContext(taskInstanceId);
        ExecutionContext executionContext = new ExecutionContext(context.toCommand(), ExecutorType.WORKER, context.getWorkerGroup());
        // added: log what is being dispatched, otherwise there is no way to tell
        // whether TaskPriorityQueueConsumer is dispatching tasks at all
        logger.info("dispatch executionContext:{}", JSONObject.toJSONString(executionContext));
        if (taskInstanceIsFinalState(taskInstanceId)){
            // when task finish, ignore this task, there is no need to dispatch anymore
            return true;
        }else{
            result = dispatcher.dispatch(executionContext);
        }
    } catch (ExecuteException e) {
        logger.error("dispatch error",e);
    }
    return result;
}
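ExecutorDispatcher#dispatch is the actual task dispatcher. It picks the host of a worker machine using the lowest-weight algorithm and then performs the final Netty send. We added a log line that prints the host here, so we know which machine will receive the task and can go check the logs on that machine.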
public Boolean dispatch(final ExecutionContext context) throws ExecuteException {
    /**
     * get executor manager
     */
    ExecutorManager<Boolean> executorManager = this.executorManagers.get(context.getExecutorType());
    if(executorManager == null){
        throw new ExecuteException("no ExecutorManager for type : " + context.getExecutorType());
    }
    /**
     * host select
     */
    Host host = hostManager.select(context);
    logger.info("host info:{}", JSONObject.toJSONString(host));
    if (StringUtils.isEmpty(host.getAddress())) {
        throw new ExecuteException(String.format("fail to execute : %s due to no suitable worker , " +
                "current task need to %s worker group execute",
                context.getCommand(),context.getWorkerGroup()));
    }
    context.setHost(host);
    executorManager.beforeExecute(context);
    try {
        /**
         * task execute
         */
        return executorManager.execute(context);
    } finally {
        executorManager.afterExecute(context);
    }
}
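To make the host selection step concrete, here is a small sketch of picking the host with the lowest weight. It only illustrates the idea. The class LowestWeightSketch, the WorkerHost type, and the way the weight is represented (a single double) are assumptions; how DS actually computes a worker's weight is not shown in this article.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of lowest-weight host selection; not DS code.
class LowestWeightSketch {

    /** Illustrative host record: an address plus an assumed numeric weight. */
    static final class WorkerHost {
        final String address;
        final double weight;

        WorkerHost(String address, double weight) {
            this.address = address;
            this.weight = weight;
        }
    }

    /** Returns the host with the smallest weight, or empty if no host is available. */
    static Optional<WorkerHost> select(List<WorkerHost> hosts) {
        return hosts.stream()
                .min(Comparator.comparingDouble((WorkerHost h) -> h.weight));
    }

    public static void main(String[] args) {
        List<WorkerHost> hosts = Arrays.asList(
                new WorkerHost("192.168.0.1:1234", 0.8),
                new WorkerHost("192.168.0.2:1234", 0.3),
                new WorkerHost("192.168.0.3:1234", 0.5));
        System.out.println(select(hosts).map(h -> h.address).orElse("no suitable worker"));
        List<WorkerHost> none = Collections.emptyList();
        System.out.println(select(none).map(h -> h.address).orElse("no suitable worker"));
    }
}
```

The empty result plays the same role as the empty host address check in the method above: with no candidate worker there is nothing to send to, so the dispatch has to fail.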
NettyExecutorManager#doExecute is the method where Netty finally sends the command to the worker. We log the command and the host being sent to.
private void doExecute(final Host host, final Command command) throws ExecuteException {
    /**
     * retry count,default retry 3
     */
    int retryCount = 3;
    boolean success = false;
    do {
        try {
            logger.info("send command:{} host:{}", command, host);
            nettyRemotingClient.send(host, command);
            success = true;
        } catch (Exception ex) {
            logger.error(String.format("send command : %s to %s error", command, host), ex);
            retryCount--;
            try {
                Thread.sleep(100);
            } catch (InterruptedException ignore) {}
        }
    } while (retryCount >= 0 && !success);
    if (!success) {
        throw new ExecuteException(String.format("send command : %s to %s error", command, host));
    }
}
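Checking whether master-server is faulty
After modifying master-server, we restarted it, ran a process, and checked the logs. The send log in NettyExecutorManager#doExecute was printed, including the command and the host, which shows that master-server was working correctly.
Checking the worker-server that received the command, based on the host
The worker-server still printed no logs at all.
Looking at logback-worker.xml, we found that the worker only logs events from threads whose names start with "Worker-".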
public class WorkerLogFilter extends Filter<ILoggingEvent> {
    /**
     * level
     */
    Level level;

    /**
     * Accept or reject based on thread name
     * @param event event
     * @return FilterReply
     */
    @Override
    public FilterReply decide(ILoggingEvent event) {
        if (event.getThreadName().startsWith("Worker-")){
            return FilterReply.ACCEPT;
        }
        return FilterReply.DENY;
    }

    public void setLevel(String level) {
        this.level = Level.toLevel(level);
    }
}
We manually changed the if condition to event.getThreadName().startsWith("Worker-") || event.getThreadName().startsWith("Netty"), and then named the Netty threads so that their names start with "Netty".
private final ExecutorService defaultExecutor = Executors.newFixedThreadPool(Constants.CPUS, new ThreadFactoryBuilder()
        .setDaemon(true)
        .setNameFormat("NettyRemotingServer")
        .build());
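The two changes only work together: the filter accepts a log event by thread name prefix, so the thread pool must actually produce names with that prefix. The sketch below checks this pairing using only the JDK. The class ThreadNameFilterSketch, the accepts method, and the hand-written ThreadFactory (used here instead of Guava's ThreadFactoryBuilder, with a numeric suffix added to the name) are assumptions for illustration, not the worker's real code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: thread-name prefix check plus a matching thread factory.
class ThreadNameFilterSketch {

    /** Same condition as the modified if statement in the filter. */
    static boolean accepts(String threadName) {
        return threadName.startsWith("Worker-") || threadName.startsWith("Netty");
    }

    /** Creates daemon threads named prefix-1, prefix-2, and so on. */
    static ThreadFactory namedDaemonFactory(String prefix) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(2, namedDaemonFactory("NettyRemotingServer"));
        // Ask a pool thread for its own name, then run it through the check.
        Callable<String> task = () -> Thread.currentThread().getName();
        String name = pool.submit(task).get();
        System.out.println(name + " accepted=" + accepts(name));
        System.out.println("main accepted=" + accepts(Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```

Any thread whose name matches neither prefix is still filtered out, so log lines written from other threads will keep going missing until those threads are renamed or the filter is widened again.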
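In worker-server, we added logging at the point where Netty receives commands, and changed some warn-level logs to info so that they are written to the log file and are easier to analyze.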
private void processReceived(final Channel channel, final Command msg) {
    logger.info("processReceived command:{} channel:{}", JSONObject.toJSONString(msg), channel);
    final CommandType commandType = msg.getType();
    final Pair<NettyRequestProcessor, ExecutorService> pair = processors.get(commandType);
    if (pair != null) {
        Runnable r = new Runnable() {
            @Override
            public void run() {
                try {
                    pair.getLeft().process(channel, msg);
                } catch (Throwable ex) {
                    logger.error("process msg {} error", msg, ex);
                }
            }
        };
        try {
            pair.getRight().submit(r);
        } catch (RejectedExecutionException e) {
            // changed from warn to info
            logger.info("thread pool is full, discard msg {} from {}", msg, ChannelUtils.getRemoteAddress(channel));
        }
    } else {
        // changed from warn to info
        logger.info("commandType {} not support", commandType);
    }
}
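The RejectedExecutionException branch above matters because a rejected command is simply dropped. As a rough illustration of when an executor rejects work, the sketch below uses a ThreadPoolExecutor with one thread and a queue of capacity one. The pool sizes, the class RejectionSketch, and the method countRejected are assumptions chosen to make the effect easy to see; the executor configuration used by the real worker is not shown in this article.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a saturated bounded pool rejects further submissions.
class RejectionSketch {

    /** Submits count blocking tasks to a 1-thread, 1-slot pool; returns how many were rejected. */
    static int countRejected(int count) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        int rejected = 0;
        try {
            for (int i = 0; i < count; i++) {
                try {
                    // Each task blocks until released, so the pool stays saturated.
                    pool.submit(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // Default AbortPolicy: thread busy and queue full.
                    rejected++;
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("rejected=" + countRejected(3));
    }
}
```

With one running task and one queued task, every further submission is rejected. If the log line for that case never reaches a file, dropped commands are invisible, which is the situation the logging changes above are meant to avoid.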
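Checking the worker-server log again
After restarting and running the process, the worker-server log printed normally and showed that the task had been received. It then failed with an error while obtaining the data path for the SQL-type task.
Solution
Fetching data for SQL-type tasks is a feature we added on top of DS. Once we fixed the failure to obtain the SQL-type task data path, the problem of process instances getting stuck in the running state was resolved!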