最近有同事反馈有些工作流实例在dolphinscheduler上提交后一直显示停止中,无法调度也未产生任务实例,且在工作流实例界面操作标记全部为灰色,任务实例也无法删除。
日志定位
查看master节点日志,发现有如下异常
[ERROR] 2022-06-13 17:59:45.280 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[244] - handler error:
org.mybatis.spring.MyBatisSystemException: nested exception is org.apache.ibatis.exceptions.TooManyResultsException: Expected one result (or null) to be returned by selectOne(), but found: 2
at org.mybatis.spring.MyBatisExceptionTranslator.translateExceptionIfPossible(MyBatisExceptionTranslator.java:78)
at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:440)
at com.sun.proxy.$Proxy91.selectOne(Unknown Source)
at org.mybatis.spring.SqlSessionTemplate.selectOne(SqlSessionTemplate.java:159)
at com.baomidou.mybatisplus.core.override.MybatisMapperMethod.execute(MybatisMapperMethod.java:89)
at com.baomidou.mybatisplus.core.override.MybatisMapperProxy.invoke(MybatisMapperProxy.java:61)
at com.sun.proxy.$Proxy115.queryByDefinitionCodeAndVersion(Unknown Source)
at sun.reflect.GeneratedMethodAccessor124.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.cache.interceptor.CacheInterceptor.lambda$invoke$0(CacheInterceptor.java:54)
at org.springframework.cache.interceptor.CacheAspectSupport.invokeOperation(CacheAspectSupport.java:366)
at org.springframework.cache.interceptor.CacheAspectSupport.lambda$handleSynchronizedGet$1(CacheAspectSupport.java:447)
at org.springframework.cache.support.NoOpCache.get(NoOpCache.java:76)
at org.springframework.cache.interceptor.CacheAspectSupport.handleSynchronizedGet(CacheAspectSupport.java:442)
at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:382)
at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:345)
at org.springframework.cache.interceptor.CacheInterceptor.invoke(CacheInterceptor.java:64)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
at com.sun.proxy.$Proxy116.queryByDefinitionCodeAndVersion(Unknown Source)
at org.apache.dolphinscheduler.service.process.ProcessService.findTaskDefinition(ProcessService.java:2474)
at org.apache.dolphinscheduler.service.process.ProcessService.lambda$getTaskDefineLogListByRelation$4(ProcessService.java:2465)
at java.util.HashMap.forEach(HashMap.java:1291)
at org.apache.dolphinscheduler.service.process.ProcessService.getTaskDefineLogListByRelation(ProcessService.java:2464)
at org.apache.dolphinscheduler.service.process.ProcessService$$FastClassBySpringCGLIB$$ed138739.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:689)
at org.apache.dolphinscheduler.service.process.ProcessService$$EnhancerBySpringCGLIB$$ec695045.getTaskDefineLogListByRelation(<generated>)
at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread.buildFlowDag(WorkflowExecuteThread.java:578)
at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread.startProcess(WorkflowExecuteThread.java:542)
at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread.run(WorkflowExecuteThread.java:239)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ibatis.exceptions.TooManyResultsException: Expected one result (or null) to be returned by selectOne(), but found: 2
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectOne(DefaultSqlSession.java:80)
at sun.reflect.GeneratedMethodAccessor98.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:426)
... 42 common frames omitted
关键信息 Expected one result (or null) to be returned by selectOne(), but found: 2,即sql的查询逻辑是结果为1条,但是却出现2条的情况
定位源码中报错的位置
at org.apache.dolphinscheduler.service.process.ProcessService.findTaskDefinition(ProcessService.java:2474)
源码定位
看源码找到对应的findTaskDefinition方法,返回的是TaskDefinition类
定位queryByDefinitionCodeAndVersion方法的具体实现
查询sql为
<select id="queryByDefinitionCodeAndVersion" resultType="org.apache.dolphinscheduler.dao.entity.TaskDefinitionLog">
select
<include refid="baseSql"/>
from t_ds_task_definition_log
WHERE code = #{code}
and version = #{version}
</select>
在dolphinsscheduler对应的数据库中,查询t_ds_task_definition_log表
SELECT code ,version,COUNT(*) cnt from t_ds_task_definition_log group by code ,version order by cnt desc
这是一张任务定义的日志表,理论上同一个版本的任务编码,不应该出现多条数据,而查询发现表中确实存在几个任务有重复数据,而这几个有重复数据的任务恰好都属于那个无法调度和停止的工作流。
我的做法是手动删除了这些重复数据(同时t_ds_task_definition表里也是有同样的相同数据,也一并删除),并重启master节点。
重启后工作流实例操作界面恢复正常。

此时重启调度任务,任务也可恢复正常调度。
但其实还有个问题,就是这些重复数据是怎么进去的,还需要进一步定位,但是有必要怀疑dolphinscheduler的2.0.5版本在往数据库写入任务信息时出现了事务并发问题。