Problem Description
Occasionally a Temporal workflow fails with an error like this:
getState: illegal access from outside of workflow context
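The message alone does not reveal the root cause, and there is almost nothing about it online, so let's trace through the source code to find it.
Source Code Analysis
Entry point in the source: temporal-go.sdk/internal/internal_workflow.go
Find the function that raises the error: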
func getState(ctx Context) *coroutineState {
	s := ctx.Value(coroutinesContextKey)
	if s == nil {
		panic("getState: not workflow context")
	}
	state := s.(*coroutineState)
	if !state.dispatcher.IsExecuting() {
		panic(panicIllegalAccessCoroutinueState)
	}
	return state
}
The function panics because the dispatcher's executing flag is false. So where is that flag set to false?
Tracing through the source leads to the following:
func (d *dispatcherImpl) ExecuteUntilAllBlocked(deadlockDetectionTimeout time.Duration) (err error) {
	d.mutex.Lock()
	if d.closed {
		panic("dispatcher is closed")
	}
	if d.executing {
		panic("call to ExecuteUntilAllBlocked (possibly from a coroutine) while it is already running")
	}
	d.executing = true
	d.mutex.Unlock()
	defer func() {
		d.mutex.Lock()
		d.executing = false
		d.mutex.Unlock()
	}()
	allBlocked := false
	// Keep executing until at least one goroutine made some progress
	for !allBlocked {
		// Give every coroutine chance to execute removing closed ones
		allBlocked = true
		lastSequence := d.sequence
		for i := 0; i < len(d.coroutines); i++ {
			c := d.coroutines[i]
			if !c.closed.Load() {
				// TODO: Support handling of panic in a coroutine by dispatcher.
				// TODO: Dump all outstanding coroutines if one of them panics
				c.call(deadlockDetectionTimeout)
			}
			// c.call() can close the context so check again
			if c.closed.Load() {
				// remove the closed one from the slice
				d.coroutines = append(d.coroutines[:i],
					d.coroutines[i+1:]...)
				i--
				if c.panicError != nil {
					return c.panicError
				}
				allBlocked = false
			} else {
				allBlocked = allBlocked && (c.keptBlocked || c.closed.Load())
			}
		}
		// Set allBlocked to false if new coroutines where created
		allBlocked = allBlocked && lastSequence == d.sequence
		if len(d.coroutines) == 0 {
			break
		}
	}
	return nil
}
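The logic of the code above is roughly:
1. Set the dispatcher's executing flag to true.
2. Loop until every coroutine is blocked, that is, each one has yielded and is waiting to be scheduled again (for example in a Select).
3. On each pass, iterate over every coroutine in the dispatcher and run it via c.call.
4. If a coroutine has closed, remove it from the dispatcher; if it closed with a panic, return that error.
5. If no coroutines remain in the dispatcher, exit the loop.
6. On return, the deferred function sets executing back to false.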
Now let's look at call, the method that runs each coroutine and can mark it as closed:
func (s *coroutineState) call(timeout time.Duration) {
	s.unblock <- func(status string, stackDepth int) bool {
		return false // unblock
	}
	// Defaults are populated in the worker options during worker startup, but test environment
	// may have no default value for the deadlock detection timeout, so we also need to set it here for
	// backwards compatibility.
	if timeout == 0 {
		timeout = defaultDeadlockDetectionTimeout
		if debugMode {
			timeout = unlimitedDeadlockDetectionTimeout
		}
	}
	deadlockTicker := s.dispatcher.deadlockDetector.begin(timeout)
	defer deadlockTicker.end()
	select {
	case <-s.aboutToBlock:
	case <-deadlockTicker.reached():
		s.closed.Store(true)
		panic(fmt.Sprintf("Potential deadlock detected: "+
			"workflow goroutine %q didn't yield for over a second", s.name))
	}
}
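In short: call unblocks the coroutine and waits for it to yield. If the coroutine has not yielded by the time the dispatcher's DeadlockDetectionTimeout elapses, the coroutine is marked as closed and call panics.
DeadlockDetectionTimeout is configurable; the default is 1 second.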
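Solution
The analysis above points to the cause.
When we ask Temporal to run a workflow from code, for example:
future := workflow.ExecuteChildWorkflow(mainCtx, RunTimeNodeMonitorWorkflow, req)
the dispatcher starts timing against the configured timeout.
If the workflow code has still not reached its yield point by the time the timeout expires:
selector.Select(mainCtx)
the SDK assumes a deadlock and panics.
So when we hand the dispatcher too many tasks at once, the default timeout is too short and the error above appears.
The fix is to raise the timeout, as follows: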
w := worker.New(*c, taskQueue, worker.Options{
	MaxConcurrentWorkflowTaskPollers:   100,
	MaxConcurrentActivityTaskPollers:   100,
	MaxConcurrentActivityExecutionSize: 1000,
	DeadlockDetectionTimeout:           time.Second * 10,
})
The setting that matters is DeadlockDetectionTimeout: time.Second * 10.