[Temporal] Investigating concurrent task failures

Problem description

Occasionally we receive an error like this:

getState: illegal access from outside of workflow context

The error message alone does not reveal the root cause, and there is almost no material about it online, so let's work through the source code to find it.

Source code analysis

Source entry point: temporal-go.sdk/internal/internal_workflow.go
Locate the function that raises the error:

func getState(ctx Context) *coroutineState {
   s := ctx.Value(coroutinesContextKey)
   if s == nil {
      panic("getState: not workflow context")
   }
   state := s.(*coroutineState)
   if !state.dispatcher.IsExecuting() {
      panic(panicIllegalAccessCoroutinueState)
   }
   return state
}

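This function shows that the panic is raised when the dispatcher's executing flag is false. So where does that flag get set to false?
Tracing through the source leads to the following: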


func (d *dispatcherImpl) ExecuteUntilAllBlocked(deadlockDetectionTimeout time.Duration) (err error) {
   d.mutex.Lock()
   if d.closed {
      panic("dispatcher is closed")
   }
   if d.executing {
      panic("call to ExecuteUntilAllBlocked (possibly from a coroutine) while it is already running")
   }
   d.executing = true
   d.mutex.Unlock()
   defer func() {
      d.mutex.Lock()
      d.executing = false
      d.mutex.Unlock()
   }()
   allBlocked := false
   // Keep executing until at least one goroutine made some progress
   for !allBlocked {
      // Give every coroutine chance to execute removing closed ones
      allBlocked = true
      lastSequence := d.sequence
      for i := 0; i < len(d.coroutines); i++ {
         c := d.coroutines[i]
         if !c.closed.Load() {
            // TODO: Support handling of panic in a coroutine by dispatcher.
            // TODO: Dump all outstanding coroutines if one of them panics
            c.call(deadlockDetectionTimeout)
         }
         // c.call() can close the context so check again
         if c.closed.Load() {
            // remove the closed one from the slice
            d.coroutines = append(d.coroutines[:i],
               d.coroutines[i+1:]...)
            i--
            if c.panicError != nil {
               return c.panicError
            }
            allBlocked = false
         } else {
            allBlocked = allBlocked && (c.keptBlocked || c.closed.Load())
         }
      }
      // Set allBlocked to false if new coroutines where created
      allBlocked = allBlocked && lastSequence == d.sequence
      if len(d.coroutines) == 0 {
         break
      }
   }
   return nil
}

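The logic of the code above is roughly as follows:

  • 1. Set the dispatcher's executing flag to true.

  • 2. Start a loop that keeps running until every task coroutine is blocked, that is, has yielded and is waiting to be scheduled again (for example in a Select).

  • 3. Iterate over every task coroutine in the dispatcher and run it via c.call.

  • 4. Check whether the coroutine has been closed; if so, remove it from the dispatcher, and return its panicError if it has one.

  • 5. If the dispatcher has no coroutines left to run, exit the loop.

  • 6. On return from the method, the deferred function sets the dispatcher's executing flag back to false.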
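The relationship between the executing flag and the getState guard can be reproduced with a small standard-library model. This is only a sketch: toyDispatcher, its methods, and demo are hypothetical names invented for illustration and are not part of the Temporal SDK. It assumes the guard only needs to know whether a dispatch loop is currently running.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// toyDispatcher is a hypothetical stand-in for the SDK's dispatcher.
// It keeps only the executing flag discussed above.
type toyDispatcher struct {
	mu        sync.Mutex
	executing bool
}

// isExecuting reports whether a dispatch loop is currently running.
func (d *toyDispatcher) isExecuting() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.executing
}

// getState mimics the guard: it fails unless a dispatch loop is running.
func (d *toyDispatcher) getState() error {
	if !d.isExecuting() {
		return errors.New("getState: illegal access from outside of workflow context")
	}
	return nil
}

// executeUntilAllBlocked sets executing for the duration of the loop
// and clears it in a deferred function, as in steps 1 and 6.
func (d *toyDispatcher) executeUntilAllBlocked(steps []func()) {
	d.mu.Lock()
	d.executing = true
	d.mu.Unlock()
	defer func() {
		d.mu.Lock()
		d.executing = false
		d.mu.Unlock()
	}()
	for _, step := range steps {
		step()
	}
}

// demo calls the guard once inside the loop and once after it has exited.
func demo() (inside, outside error) {
	d := &toyDispatcher{}
	d.executeUntilAllBlocked([]func(){
		func() { inside = d.getState() },
	})
	outside = d.getState()
	return inside, outside
}

func main() {
	inside, outside := demo()
	fmt.Println("inside loop:", inside)
	fmt.Println("after loop:", outside)
}
```

In this model the guard passes while the loop is running and fails once the loop has returned, which is the situation the error message describes.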

Next, look at call, the method the dispatcher uses to resume each task coroutine and, on timeout, mark it closed:

func (s *coroutineState) call(timeout time.Duration) {
   s.unblock <- func(status string, stackDepth int) bool {
      return false // unblock
   }

   // Defaults are populated in the worker options during worker startup, but test environment
   // may have no default value for the deadlock detection timeout, so we also need to set it here for
   // backwards compatibility.
   if timeout == 0 {
      timeout = defaultDeadlockDetectionTimeout
      if debugMode {
         timeout = unlimitedDeadlockDetectionTimeout
      }
   }
   deadlockTicker := s.dispatcher.deadlockDetector.begin(timeout)
   defer deadlockTicker.end()

   select {
   case <-s.aboutToBlock:
   case <-deadlockTicker.reached():
      s.closed.Store(true)
      panic(fmt.Sprintf("Potential deadlock detected: "+
         "workflow goroutine %q didn't yield for over a second", s.name))
   }
}

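In short: call waits up to the dispatcher's DeadlockDetectionTimeout for the coroutine to yield (signal aboutToBlock); if the timeout is reached first, the coroutine is marked closed and call panics with a potential-deadlock message.
DeadlockDetectionTimeout is configurable and defaults to 1 second.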
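The wait-or-time-out pattern in call can be sketched with plain channels and a timer. This is a simplified model and not the SDK's implementation: callWithTimeout, errNoYield, and demo are hypothetical names, the coroutine "yielding" is modelled as the work function returning, and an error is returned where the SDK panics.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoYield stands in for the SDK's "Potential deadlock detected" panic.
var errNoYield = errors.New("potential deadlock: coroutine did not yield in time")

// callWithTimeout runs work in a goroutine and waits until it either
// signals that it is about to block or the timeout expires.
func callWithTimeout(work func(), timeout time.Duration) error {
	aboutToBlock := make(chan struct{})
	go func() {
		work()
		close(aboutToBlock) // modelled yield point
	}()
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-aboutToBlock:
		return nil
	case <-timer.C:
		return errNoYield
	}
}

// demo runs a fast and a slow piece of work against different timeouts.
func demo() (quick, slow, raised error) {
	fast := func() {}
	sluggish := func() { time.Sleep(300 * time.Millisecond) }
	quick = callWithTimeout(fast, 100*time.Millisecond)
	slow = callWithTimeout(sluggish, 20*time.Millisecond)
	raised = callWithTimeout(sluggish, 3*time.Second)
	return quick, slow, raised
}

func main() {
	quick, slow, raised := demo()
	fmt.Println("fast work, short timeout:", quick)
	fmt.Println("slow work, short timeout:", slow)
	fmt.Println("slow work, longer timeout:", raised)
}
```

The same slow work fails under the short timeout and succeeds under the longer one, which is the behaviour the timeout setting controls.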

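Solution

With the analysis above, the cause of the problem is clear:

  • When our code asks temporal to run a workflow, e.g. future := workflow.ExecuteChildWorkflow(mainCtx, RunTimeNodeMonitorWorkflow, req)

  • temporal's dispatcher starts timing against the configured timeout

  • If the timeout expires before the workflow code reaches its blocking point, selector.Select(mainCtx), the dispatcher assumes a deadlock and panics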

So when too many tasks are added to the dispatcher at once, the default timeout is no longer enough, and the error above appears.
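A back-of-the-envelope check makes the scaling concrete. The sketch below assumes, purely for illustration, that scheduling each task costs a fixed amount of time before the coroutine yields; the 2ms figure is a hypothetical value, not a measurement, and fitsBeforeYield is an invented helper.

```go
package main

import (
	"fmt"
	"time"
)

// fitsBeforeYield reports whether scheduling n tasks, each taking perTask
// (a hypothetical cost), finishes within the deadlock detection timeout.
func fitsBeforeYield(n int, perTask, timeout time.Duration) bool {
	return time.Duration(n)*perTask <= timeout
}

func main() {
	const perTask = 2 * time.Millisecond // assumed, not measured

	fmt.Println(fitsBeforeYield(100, perTask, time.Second))     // true: 200ms <= 1s
	fmt.Println(fitsBeforeYield(1000, perTask, time.Second))    // false: 2s > 1s
	fmt.Println(fitsBeforeYield(1000, perTask, 10*time.Second)) // true: 2s <= 10s
}
```

Under these assumed numbers, the default 1 second is enough for a hundred tasks but not for a thousand, while 10 seconds covers both.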

The fix is to increase the timeout, as follows:

w := worker.New(*c, taskQueue, worker.Options{
   MaxConcurrentWorkflowTaskPollers:   100,
   MaxConcurrentActivityTaskPollers:   100,
   MaxConcurrentActivityExecutionSize: 1000,
   DeadlockDetectionTimeout:           time.Second * 10,
})

The relevant setting is DeadlockDetectionTimeout: time.Second * 10.
