[Temporal] Investigating Concurrent Task Failures

萌兰三太子

Published 2024-08-14 13:50:39

Article link: https://blog.csdn.net/m0_47495420/article/details/141205082

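Problem Description

We may occasionally see an error like this:

getState: illegal access from outside of workflow context

The message alone does not reveal the root cause, and there is very little material about it online, so let's trace through the source code to find it.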

Source Code Analysis

Source entry point: temporal-go.sdk/internal/internal_workflow.go
Locate the function that raises the error:

func getState(ctx Context) *coroutineState {
   s := ctx.Value(coroutinesContextKey)
   if s == nil {
      panic("getState: not workflow context")
   }
   state := s.(*coroutineState)
   if !state.dispatcher.IsExecuting() {
      panic(panicIllegalAccessCoroutinueState)
   }
   return state
}

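This function shows that the panic fires because the dispatcher's executing flag is false. So where does that flag get set to false?
Tracing through the source leads to the following: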

func (d *dispatcherImpl) ExecuteUntilAllBlocked(deadlockDetectionTimeout time.Duration) (err error) {
   d.mutex.Lock()
   if d.closed {
      panic("dispatcher is closed")
   }
   if d.executing {
      panic("call to ExecuteUntilAllBlocked (possibly from a coroutine) while it is already running")
   }
   d.executing = true
   d.mutex.Unlock()
   defer func() {
      d.mutex.Lock()
      d.executing = false
      d.mutex.Unlock()
   }()
   allBlocked := false
   // Keep executing until at least one goroutine made some progress
   for !allBlocked {
      // Give every coroutine chance to execute removing closed ones
      allBlocked = true
      lastSequence := d.sequence
      for i := 0; i < len(d.coroutines); i++ {
         c := d.coroutines[i]
         if !c.closed.Load() {
            // TODO: Support handling of panic in a coroutine by dispatcher.
            // TODO: Dump all outstanding coroutines if one of them panics
            c.call(deadlockDetectionTimeout)
         }
         // c.call() can close the context so check again
         if c.closed.Load() {
            // remove the closed one from the slice
            d.coroutines = append(d.coroutines[:i],
               d.coroutines[i+1:]...)
            i--
            if c.panicError != nil {
               return c.panicError
            }
            allBlocked = false


         } else {
            allBlocked = allBlocked && (c.keptBlocked || c.closed.Load())
         }
      }
      // Set allBlocked to false if new coroutines where created
      allBlocked = allBlocked && lastSequence == d.sequence
      if len(d.coroutines) == 0 {
         break
      }
   }
   return nil
}

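The logic above is roughly as follows:

1. Set the dispatcher's executing flag to true.
2. Start a loop that keeps running until every task coroutine is blocked, that is, parked and waiting to be scheduled (for example in a select).
3. Iterate over each task coroutine in the dispatcher and give it a chance to run.
4. Check whether the task coroutine has closed; if so, remove it from the dispatcher.
5. If no coroutines remain in the dispatcher, exit the loop.
6. On return from the method, the deferred function sets the dispatcher's executing flag back to false.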
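To make the role of the executing flag concrete, here is a minimal standard-library sketch. It is not the SDK's code: miniDispatcher and its methods are hypothetical names that only model the pattern above (flag set on entry, reset in a deferred function, checked by getState), and it returns an error where the SDK panics.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errIllegalAccess reuses the message from the article; everything else
// in this file is a hypothetical model, not Temporal SDK code.
var errIllegalAccess = errors.New("getState: illegal access from outside of workflow context")

// miniDispatcher is a stand-in for the SDK's dispatcherImpl.
type miniDispatcher struct {
	mu        sync.Mutex
	executing bool
}

// isExecuting reads the flag under the mutex.
func (d *miniDispatcher) isExecuting() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.executing
}

// executeUntilAllBlocked keeps the flag true only while step runs.
// The deferred reset also runs if step panics.
func (d *miniDispatcher) executeUntilAllBlocked(step func()) {
	d.mu.Lock()
	d.executing = true
	d.mu.Unlock()
	defer func() {
		d.mu.Lock()
		d.executing = false
		d.mu.Unlock()
	}()
	step()
}

// getState returns an error (the SDK panics instead) when it is
// called while the dispatcher is not executing.
func (d *miniDispatcher) getState() error {
	if !d.isExecuting() {
		return errIllegalAccess
	}
	return nil
}

func main() {
	d := &miniDispatcher{}
	d.executeUntilAllBlocked(func() {
		fmt.Println("inside dispatch:", d.getState())
	})
	fmt.Println("after dispatch:", d.getState())
}
```

In this model, any access that happens after executeUntilAllBlocked has returned is rejected, which is the same condition that produces the error message in the real SDK.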

Now let's look at call, the method that runs each task coroutine and can mark it as closed:

func (s *coroutineState) call(timeout time.Duration) {
   s.unblock <- func(status string, stackDepth int) bool {
      return false // unblock
   }


   // Defaults are populated in the worker options during worker startup, but test environment
   // may have no default value for the deadlock detection timeout, so we also need to set it here for
   // backwards compatibility.
   if timeout == 0 {
      timeout = defaultDeadlockDetectionTimeout
      if debugMode {
         timeout = unlimitedDeadlockDetectionTimeout
      }
   }
   deadlockTicker := s.dispatcher.deadlockDetector.begin(timeout)
   defer deadlockTicker.end()


   select {
   case <-s.aboutToBlock:
   case <-deadlockTicker.reached():
      s.closed.Store(true)
      panic(fmt.Sprintf("Potential deadlock detected: "+
         "workflow goroutine %q didn't yield for over a second", s.name))
   }
}

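In short: the dispatcher's DeadlockDetectionTimeout bounds how long a task coroutine may run without yielding. If the timeout is reached, the task coroutine is marked closed and call panics.
DeadlockDetectionTimeout is configurable; the default is 1 second.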

Solution

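With the analysis above, the cause of the problem becomes clear:

When we add a workflow to Temporal from code, for example: future := workflow.ExecuteChildWorkflow(mainCtx, RunTimeNodeMonitorWorkflow, req)
Temporal's dispatcher starts timing against the configured timeout.
If the workflow has still not reached the point where it yields, selector.Select(mainCtx), by the time the timeout expires, the dispatcher assumes a deadlock and panics.
That panic unwinds ExecuteUntilAllBlocked, whose deferred function resets executing to false, so workflow code that is still running fails the getState check on its next call.

So when we add too many tasks to the dispatcher at once, the default timeout is not long enough, and the error above appears.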

The fix is to increase the timeout, as follows:

w := worker.New(*c, taskQueue, worker.Options{
   MaxConcurrentWorkflowTaskPollers:   100,
   MaxConcurrentActivityTaskPollers:   100,
   MaxConcurrentActivityExecutionSize: 1000,
   DeadlockDetectionTimeout:           time.Second * 10,
})

The setting that matters is DeadlockDetectionTimeout: time.Second * 10.
