Anyone with some operating-system background knows that a Linux thread is really just a task_struct. A thread is not the entity that actually runs: it merely represents an execution flow and its state. What actually drives the flow forward is the CPU. Driven by the clock, the CPU fetches instructions and operands according to the PC register, loads data from RAM, computes, branches, and pushes the execution flow ahead. The CPU does not care whether what it is running is a thread or a coroutine: as long as the PC register, the stack pointer, and so on (collectively called the context) are set up, it will happily run either one.
A thread is not so much running as being run. When it "blocks", it is really just moved out of the scheduler's run queue so that this execution flow is no longer scheduled; once some other execution flow satisfies the condition it is waiting on, the removed flow is put back onto the run queue.
A coroutine works the same way: it too is just a data structure that records which function to run and how far execution has gotten. Go implements scheduling in user space, so Go needs a struct that represents this kind of execution flow, functions that save and restore its context, and run queues to hold it.
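To make this concrete, here is a deliberately simplified sketch (an illustration added for this article, not runtime code) of what such a user-space execution flow has to record; the real runtime.g shown next is this same idea with many more details:
package main

// context is the minimal saved state needed to resume a paused flow.
type context struct {
	pc uintptr // where to continue executing
	sp uintptr // which stack position to continue from
}

// coroutine: an execution flow is just data -- what to run, where it is,
// and what state it is in -- it does not "run by itself".
type coroutine struct {
	fn     func()  // the function this flow executes
	ctx    context // saved when the flow is switched out, restored to resume it
	status int     // runnable / running / waiting / dead ...
}

// Runnable flows wait in a queue until a thread picks one up and restores its context.
var runQueue []*coroutine

func main() {
	runQueue = append(runQueue, &coroutine{fn: func() { println("hello") }})
	println("queued flows:", len(runQueue))
}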
The goroutine struct (runtime.g)
type g struct {
// the stack space currently used by this g; stack holds the lo and hi bounds
stack stack // offset known to runtime/cgo
// stack-growth check value used by Go code
stackguard0 uintptr // offset known to liblink
// stack-growth check value used by native (cgo/C) code
stackguard1 uintptr // offset known to liblink
// the m this g is currently bound to
m *m // current m; offset known to arm liblink
// the g's scheduling data: when the goroutine is switched out its context is saved here, and restored when it resumes
sched gobuf
// the g's current status
atomicstatus uint32
// the g's id
goid int64
// the next g; guintptr's ptr/set methods read and write it. Together with sched.gfreeStack and
// sched.gfreeNoStack this is used to link free g's into a list
schedlink guintptr
// whether this g has been asked to yield (preemption flag)
preempt bool // preemption signal, duplicates stackguard0 = stackpreempt
// whether this g must resume on a specific M; a g that was interrupted sometimes has to continue on its original M
lockedm muintptr
}
type gobuf struct {
sp uintptr // stack pointer
pc uintptr // program counter: where execution resumes
g guintptr
ctxt unsafe.Pointer
ret sys.Uintreg
lr uintptr
bp uintptr // for GOEXPERIMENT=framepointer
}
Context-switch functions
1. gogo: restore a goroutine's saved context and resume running it
TEXT runtime·gogo(SB), NOSPLIT, $24-8
MOVD buf+0(FP), R5
MOVD gobuf_g(R5), g
BL runtime·save_g(SB)
MOVD 0(g), R4 // make sure g is not nil
MOVD gobuf_sp(R5), R0
MOVD R0, RSP
MOVD gobuf_lr(R5), LR
MOVD gobuf_ret(R5), R0
MOVD gobuf_ctxt(R5), R26
MOVD $0, gobuf_sp(R5)
MOVD $0, gobuf_ret(R5)
MOVD $0, gobuf_lr(R5)
MOVD $0, gobuf_ctxt(R5)
CMP ZR, ZR // set condition codes for == test, needed by stack split
MOVD gobuf_pc(R5), R6
B (R6)
2. mcall: save the current goroutine's context, switch to g0, and call the given scheduling function
TEXT runtime·mcall(SB), NOSPLIT, $-8-8
// Save caller state in g->sched
MOVD RSP, R0
MOVD R0, (g_sched+gobuf_sp)(g)
MOVD LR, (g_sched+gobuf_pc)(g)
MOVD $0, (g_sched+gobuf_lr)(g)
MOVD g, (g_sched+gobuf_g)(g)
// Switch to m->g0 & its stack, call fn.
MOVD g, R3
MOVD g_m(g), R8
MOVD m_g0(R8), g
BL runtime·save_g(SB)
CMP g, R3
BNE 2(PC)
B runtime·badmcall(SB)
MOVD fn+0(FP), R26 // context
MOVD 0(R26), R4 // code pointer
MOVD (g_sched+gobuf_sp)(g), R0
MOVD R0, RSP // sp = m->g0->sched.sp
MOVD R3, -8(RSP)
MOVD $0, -16(RSP)
SUB $16, RSP
BL (R4)
B runtime·badmcall2(SB)
Starting a goroutine is usually written as:
go func1(a1, a2) // where func1 is an ordinary function: func func1(arg1 type1, arg2 type2) { ... }
go func(arg1 type1, arg2 type2) { /* ... */ }(a1, a2) // or, equivalently, a function literal invoked in place
A goroutine represents an execution flow: it has a function to execute (func1 above), that function's arguments (a1, a2), and the current state and progress of the flow (corresponding to the CPU's PC and SP registers). It also needs a place to save this state so the flow can be resumed later.
What really represents the goroutine is the runtime.g struct. Every go statement is compiled into a call to runtime.newproc, and in the end a runtime.g object is placed on a run queue. The pointer to func1 above is stored in the startpc field of runtime.g, the arguments are copied onto the new goroutine's stack inside newproc, and sched is where the pc and stack position are saved whenever the goroutine is switched out.
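For intuition, here is a minimal, self-contained program (an added example, not from the original article) that launches one goroutine; compiling it with go build -gcflags=-S and searching the output for newproc should show the go statement lowered to a CALL runtime.newproc(SB) (the exact assembly varies by Go version and architecture):
package main

func main() {
	done := make(chan struct{})

	// The compiler lowers this statement into a call to runtime.newproc,
	// passing a *funcval for the closure; newproc then builds a runtime.g
	// whose startpc is the function's entry point and puts it on a run queue.
	go func(a, b int) {
		println(a + b)
		close(done)
	}(1, 2)

	<-done // wait, so the new goroutine is actually scheduled before main exits
}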
The GPM model
type g struct {
stack stack // offset known to runtime/cgo
stackguard0 uintptr // offset known to liblink
stackguard1 uintptr // offset known to liblink
_panic *_panic // innermost panic - offset known to liblink
_defer *_defer // innermost defer
m *m // current m; offset known to arm liblink
sched gobuf
syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
stktopsp uintptr // expected sp at top of stack, to check in traceback
param unsafe.Pointer // passed parameter on wakeup
atomicstatus uint32
stackLock uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
goid int64
waitsince int64 // approx time when the g become blocked
waitreason string // if status==Gwaiting
schedlink guintptr
preempt bool // preemption signal, duplicates stackguard0 = stackpreempt
paniconfault bool // panic (instead of crash) on unexpected fault address
preemptscan bool // preempted g does scan for gc
gcscandone bool // g has scanned stack; protected by _Gscan bit in status
gcscanvalid bool // false at start of gc cycle, true if G has not run since last scan; TODO: remove?
throwsplit bool // must not split stack
raceignore int8 // ignore race detection events
sysblocktraced bool // StartTrace has emitted EvGoInSyscall about this goroutine
sysexitticks int64 // cputicks when syscall has returned (for tracing)
traceseq uint64 // trace event sequencer
tracelastp puintptr // last P emitted an event for this goroutine
lockedm muintptr
sig uint32
writebuf []byte
sigcode0 uintptr
sigcode1 uintptr
sigpc uintptr
gopc uintptr // pc of go statement that created this goroutine
startpc uintptr // pc of goroutine function
racectx uintptr
waiting *sudog // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
cgoCtxt []uintptr // cgo traceback context
labels unsafe.Pointer // profiler labels
timer *timer // cached timer for time.Sleep
selectDone uint32 // are we participating in a select and did someone win the race?
gcAssistBytes int64
}
G: each G represents one goroutine started by user code. Its important members are listed below (the stack-dump example after this list shows how goid and the status surface at runtime):
- stack: the stack space currently used by this g, with lo and hi bounds
- stackguard0: stack-check value; when the stack pointer drops below it the stack is grown. The 0 variant is used by Go code
- stackguard1: the same kind of check value, used by native (cgo/C) code
- m: the m this g is currently running on
- sched: the g's scheduling data; when the g is switched out, its pc, rsp and related values are saved here and restored when it runs again
- atomicstatus: the g's current status
- schedlink: the next g, used when the g sits in a linked list
- preempt: whether this g has been asked to yield (preemption flag)
- lockedm: whether this g must go back to a specific M; a g that was interrupted sometimes has to resume on its original M
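As an aside that is easy to try: goid and the status are exactly what show up in a stack dump. The following small program (an added example using only the public runtime.Stack API) prints one "goroutine <goid> [<status>]:" header per g:
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	go func() { time.Sleep(time.Hour) }() // a parked goroutine shows up as [sleep]
	time.Sleep(10 * time.Millisecond)     // give it time to park

	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true) // true: dump every goroutine
	// Each block starts with "goroutine <goid> [<status>]:", which is
	// exactly g.goid plus the g's status/wait reason.
	fmt.Printf("%s", buf[:n])
}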
type m struct {
g0 *g // goroutine with scheduling stack
morebuf gobuf // gobuf arg to morestack
divmod uint32 // div/mod denominator for arm - known to liblink
// Fields not known to debuggers.
procid uint64 // for debuggers, but offset not hard-coded
gsignal *g // signal-handling g
goSigStack gsignalStack // Go-allocated signal handling stack
sigmask sigset // storage for saved signal mask
tls [6]uintptr // thread-local storage (for x86 extern register)
mstartfn func()
curg *g // current running goroutine
caughtsig guintptr // goroutine running during fatal signal
p puintptr // attached p for executing go code (nil if not executing go code)
nextp puintptr
id int64
mallocing int32
throwing int32
preemptoff string // if != "", keep curg running on this m
locks int32
softfloat int32
dying int32
profilehz int32
helpgc int32
spinning bool // m is out of work and is actively looking for work
blocked bool // m is blocked on a note
inwb bool // m is executing a write barrier
newSigstack bool // minit on C thread called sigaltstack
printlock int8
incgo bool // m is executing a cgo call
freeWait uint32 // if == 0, safe to free g0 and delete m (atomic)
fastrand [2]uint32
needextram bool
traceback uint8
ncgocall uint64 // number of cgo calls in total
ncgo int32 // number of cgo calls currently in progress
cgoCallersUse uint32 // if non-zero, cgoCallers in use temporarily
cgoCallers *cgoCallers // cgo traceback if crashing in cgo call
park note
alllink *m // on allm
schedlink muintptr
mcache *mcache
lockedg guintptr
createstack [32]uintptr // stack that created this thread.
freglo [16]uint32 // d[i] lsb and f[i]
freghi [16]uint32 // d[i] msb and f[i+16]
fflag uint32 // floating point compare flags
lockedExt uint32 // tracking for external LockOSThread
lockedInt uint32 // tracking for internal lockOSThread
nextwaitm muintptr // next m waiting for lock
waitunlockf unsafe.Pointer // todo go func(*g, unsafe.pointer) bool
waitlock unsafe.Pointer
waittraceev byte
waittraceskip int
startingtrace bool
syscalltick uint32
thread uintptr // thread handle
freelink *m // on sched.freem
// these are here because they are too large to be on the stack
// of low-level NOSPLIT functions.
libcall libcall
libcallpc uintptr // for cpu profiler
libcallsp uintptr
libcallg guintptr
syscall libcall // stores syscall parameters on windows
mOS
}
M: the OS thread that actually executes Gs. Its important members are:
- g0: a special g used for scheduling; the m switches to it while scheduling and while executing system calls
- curg: the g currently running on this m
- p: the P this m currently owns
- nextp: the P this M will take when it is woken up
- park: the note/semaphore the M sleeps on; waking the M goes through it
- schedlink: the next m, used when the m sits in a linked list
- mcache: the local allocator cache used for memory allocation, the same as p.mcache (copied over when the M acquires a P)
- lockedg: the counterpart of lockedm (see the runtime.LockOSThread example below)
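lockedm/lockedg are what the public runtime.LockOSThread API manipulates: after the call, the goroutine runs only on that M, which matters for C libraries or OS APIs with thread affinity. A minimal illustration (an added example, not from the original article):
package main

import (
	"fmt"
	"runtime"
)

func main() {
	done := make(chan struct{})
	go func() {
		// Pin this goroutine to its current M (sets g.lockedm / m.lockedg).
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()

		// Everything in here is guaranteed to run on the same OS thread,
		// which matters for thread-local C state, some GUI/GL APIs, etc.
		fmt.Println("running on a locked OS thread")
		close(done)
	}()
	<-done
}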
type p struct {
lock mutex
id int32
status uint32 // one of pidle/prunning/...
link puintptr
schedtick uint32 // incremented on every scheduler call
syscalltick uint32 // incremented on every system call
sysmontick sysmontick // last tick observed by sysmon
m muintptr // back-link to associated m (nil if idle)
mcache *mcache
racectx uintptr
deferpool [5][]*_defer // pool of available defer structs of different sizes (see panic.go)
deferpoolbuf [5][32]*_defer
// Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
goidcache uint64
goidcacheend uint64
// Queue of runnable goroutines. Accessed without lock.
runqhead uint32
runqtail uint32
runq [256]guintptr
runnext guintptr
// Available G's (status == Gdead)
gfree *g
gfreecnt int32
sudogcache []*sudog
sudogbuf [128]*sudog
tracebuf traceBufPtr
traceSweep bool
traceSwept, traceReclaimed uintptr
palloc persistentAlloc // per-P to avoid mutex
// Per-P GC state
gcAssistTime int64 // Nanoseconds in assistAlloc
gcFractionalMarkTime int64 // Nanoseconds in fractional mark worker
gcBgMarkWorker guintptr
gcMarkWorkerMode gcMarkWorkerMode
// gcMarkWorkerStartTime is the nanotime() at which this mark
// worker started.
gcMarkWorkerStartTime int64
// gcw is this P's GC work buffer cache. The work buffer is
// filled by write barriers, drained by mutator assists, and
// disposed on certain GC state transitions.
gcw gcWork
// wbBuf is this P's GC write barrier buffer.
//
// TODO: Consider caching this in the running G.
wbBuf wbBuf
runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point
pad [sys.CacheLineSize]byte
}
P: by default there are as many Ps as the machine has CPU cores; a P represents the resources needed to execute Gs (the snippet after this list shows how to inspect and change that count with runtime.GOMAXPROCS). Its important members are:
- status: the P's current status
- link: the next p, used when the p sits in a linked list
- m: the M that currently owns this P
- mcache: the local allocator cache used for memory allocation
- runqhead: dequeue index of the local run queue
- runqtail: enqueue index of the local run queue
- runq: the local run queue array, holding up to 256 Gs
- gfree: the free list of Gs; Gs that have become _Gdead are kept here for reuse
- gcBgMarkWorker: the background GC mark worker; if it is set, the M runs it with priority
- gcw: the P-local GC work queue, analyzed in detail in the next article (on GC)
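The number of Ps equals GOMAXPROCS; a quick way to read and change it from user code (an added example):
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) only queries the current value: the number of Ps.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// Setting it resizes the P list (procresize) and returns the old value.
	old := runtime.GOMAXPROCS(2)
	fmt.Println("was:", old, "now:", runtime.GOMAXPROCS(0))
}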
The relationship between G, P, and M is, in outline, as follows.
When memory allocation was covered earlier we saw the mcache that is bound to a P; p.mcache and m.mcache refer to the same allocator cache.
Every P has a local run queue, runq. A new g produced by newproc is put on the local runq first and only goes to the global runq when the local one is full. When a g is taken out to run, the local runq is also tried first; only when it is empty does the M fall back to the global queue or steal from other Ps' runqs.
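Roughly, that enqueue/dequeue policy looks like the simplified model below (an illustration only: the real runqput/runqget are lock-free, handle runnext, and move half of a full local queue to the global queue in one batch):
package main

import "fmt"

type G struct{ id int }

type P struct {
	runq               [256]*G
	runqhead, runqtail uint32
}

var globalRunq []*G // protected by sched.lock in the real runtime

// runqput: try the local ring buffer first; spill to the global queue when full.
func runqput(p *P, g *G) {
	if p.runqtail-p.runqhead < uint32(len(p.runq)) {
		p.runq[p.runqtail%uint32(len(p.runq))] = g
		p.runqtail++
		return
	}
	globalRunq = append(globalRunq, g) // the real runqputslow also moves half the local queue
}

// runqget: local queue first, then the global queue.
func runqget(p *P) *G {
	if p.runqtail != p.runqhead {
		g := p.runq[p.runqhead%uint32(len(p.runq))]
		p.runqhead++
		return g
	}
	if len(globalRunq) > 0 {
		g := globalRunq[0]
		globalRunq = globalRunq[1:]
		return g
	}
	return nil // the real scheduler would now try to steal from other Ps
}

func main() {
	p := &P{}
	for i := 0; i < 300; i++ {
		runqput(p, &G{id: i})
	}
	fmt.Println("local:", p.runqtail-p.runqhead, "global:", len(globalRunq))
	fmt.Println("first to run:", runqget(p).id)
}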
When an M enters a system call, or a P has been running the same g for too long, sysmon is responsible for separating the g/P from the M, after which schedule is called to pick the next g.
Creating a goroutine
As mentioned above, go function(...) boils down to a call to runtime.newproc:
func newproc(siz int32, fn *funcval) {
// compute the address of the arguments (argp), located right after fn
argp := add(unsafe.Pointer(&fn), sys.PtrSize)
gp := getg()
// get the caller's address (return address) pc
pc := getcallerpc()
// call newproc1 on the system stack (g0) via systemstack
systemstack(func() {
newproc1(fn, (*uint8)(argp), siz, gp, pc)
})
}
newproc collects the arguments and the caller's pc, then uses systemstack to switch from the current g to g0 and onto g0's stack, runs the closure there, and switches back to the original g and stack afterwards. The real work is done in newproc1:
func newproc1(fn *funcval, argp *uint8, narg int32, callergp *g, callerpc uintptr) {
// get the current g; this compiles down to a read of the FS register (TLS), and returns g0 here
_g_ := getg()
if fn == nil {
_g_.m.throwing = -1 // do not dump full stacks
throw("go of nil func value")
}
// increment m.locks to disable preemption
_g_.m.locks++ // disable preemption because it can be holding p in a local var
siz := narg
siz = (siz + 7) &^ 7
// if the arguments are too large for the initial 2 KB stack, throw
if siz >= _StackMin-4*sys.RegSize-sys.RegSize {
throw("newproc: function arguments too large for new goroutine")
}
// get the p owned by the current m
_p_ := _g_.m.p.ptr()
// try to get a free g; if none is available, allocate a new one and add it to allg
// gfget first looks in the p's local free list; if that is empty, it moves a batch from the global free list over to the local p
newg := gfget(_p_)
if newg == nil {
newg = malg(_StackMin)
casgstatus(newg, _Gidle, _Gdead)
// add the newly allocated g to the global allg slice (a simple append)
allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
}
// sanity check: the g must have a stack
if newg.stack.hi == 0 {
throw("newproc1: newg missing stack")
}
// sanity check: the g must be in the expected state
if readgstatus(newg) != _Gdead {
throw("newproc1: new g is not Gdead")
}
// reserve a little extra space in case of reads slightly beyond the frame
totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize // extra space in case of reads slightly beyond frame
// align the total size to spAlign
totalSize += -totalSize & (sys.SpAlign - 1) // align to spAlign
sp := newg.stack.hi - totalSize
spArg := sp
// usesLR is false on x86, so this block is skipped there (it runs on link-register architectures)
if usesLR {
// caller's LR
*(*uintptr)(unsafe.Pointer(sp)) = 0
prepGoExitFrame(sp)
spArg += sys.MinFrameSize
}
if narg > 0 {
// copy the arguments onto the new g's stack
memmove(unsafe.Pointer(spArg), unsafe.Pointer(argp), uintptr(narg))
}
// clear the area that saves the context and initialize the basic state
memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
newg.sched.sp = sp
newg.stktopsp = sp
// the address of goexit is stored here, so that goexit runs (via this pc) after the user function finishes
newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
newg.sched.g = guintptr(unsafe.Pointer(newg))
// gostartcallfn adjusts sched so that sched.pc points at fn, leaving goexit as fn's return address
gostartcallfn(&newg.sched, fn)
newg.gopc = callerpc
newg.ancestors = saveAncestors(callergp)
newg.startpc = fn.fn
if _g_.m.curg != nil {
newg.labels = _g_.m.curg.labels
}
if isSystemGoroutine(newg) {
atomic.Xadd(&sched.ngsys, +1)
}
newg.gcscanvalid = false
casgstatus(newg, _Gdead, _Grunnable)
// if the p's cached range of goids is used up, grab another batch from sched
if _p_.goidcache == _p_.goidcacheend {
// Sched.goidgen is the last allocated id,
// this batch must be [sched.goidgen+1, sched.goidgen+GoidCacheBatch].
// At startup sched.goidgen=0, so main goroutine receives goid=1.
_p_.goidcache = atomic.Xadd64(&sched.goidgen, _GoidCacheBatch)
_p_.goidcache -= _GoidCacheBatch - 1
_p_.goidcacheend = _p_.goidcache + _GoidCacheBatch
}
// assign the goid
newg.goid = int64(_p_.goidcache)
_p_.goidcache++
// put the new g onto the p's runnable queue
runqput(_p_, newg, true)
// if there is an idle P and no spinning M, wake one up (or create one) to run the g
if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 && mainStarted {
wakep()
}
_g_.m.locks--
if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack
_g_.stackguard0 = stackPreempt
}
}
What runtime.newproc1 does:
(1) Call getg to get the current g; this compiles to a read of the FS register (TLS) and returns g0 here.
(2) Take the m from that g and increment m.locks to disable preemption, then take the current p from the m, because the new g should go into this p's local runq first.
(3) Get a g: gfget first tries p.gfree, so a previously finished g can be reused; if none is available, malg allocates a new one with an initial stack of 2 KB. A freshly allocated g is first set to the dead state (_Gdead), so the GC will not scan its uninitialized stack, and is added to allg.
(4) Copy the arguments onto the g's stack and push the return address, which is goexit: after the target function returns, goexit will be called.
(5) Fill in the scheduling data (sched): sched.sp is the rsp after the arguments and return address, sched.pc is the address of the target function (see gostartcallfn and gostartcall), and sched.g is the g itself; then set the g's status to runnable (_Grunnable).
(6) Call runqput to enqueue the g: it first tries to place the g in p.runnext (pushing whatever was in runnext out into the queue), then the P's local run queue; if the local queue is full, runqputslow moves the g, together with half of the local queue, to the global run queue, so the fast local queue can keep being used next time.
(7) If there is an idle P, no spinning M (nmspinning == 0), and the main function has started, wake up or create an M so the idle P gets used. This goes through wakep, which mainly calls startm:
func startm(_p_ *p, spinning bool) {
lock(&sched.lock)
if _p_ == nil {
// if no p was specified, take an idle p from sched.pidle
_p_ = pidleget()
if _p_ == nil {
unlock(&sched.lock)
// no idle p was found: undo the nmspinning increment
if spinning {
// The caller incremented nmspinning, but there are no idle Ps,
// so it's okay to just undo the increment and give up.
if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
throw("startm: negative nmspinning")
}
}
return
}
}
// first try to take an idle m from sched.midle
mp := mget()
unlock(&sched.lock)
if mp == nil {
// no idle m available: create one with m.spinning = true, bind the p to it, and return
var fn func()
if spinning {
// The caller incremented nmspinning, so set m.spinning in the new M.
fn = mspinning
}
newm(fn, _p_)
return
}
// the idle m we got must not already be spinning
if mp.spinning {
throw("startm: m is spinning")
}
// and it must not already have a p
if mp.nextp != 0 {
throw("startm: m has p")
}
if spinning && !runqempty(_p_) {
throw("startm: p has runnable gs")
}
// The caller incremented nmspinning, so set m.spinning in the new M.
// the caller already incremented nmspinning, so just set m.spinning here and hand the p over
mp.spinning = spinning
mp.nextp.set(_p_)
// wake up the m
notewakeup(&mp.park)
}
What startm does:
(1) Call pidleget to take a P from the idle-P list.
(2) Call mget to take an M from the idle-M list.
(3) If there is no idle M, newm creates one: it builds an m instance with its own g0 and then calls newosproc to create an OS thread. newosproc uses the clone syscall; the new thread sets up TLS, stores g0 as the current g in TLS, and runs mstart.
(4) Otherwise, notewakeup(&mp.park) wakes up the existing thread.
That completes the flow of creating a goroutine.
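Because creating a goroutine only allocates (or reuses) a small g plus a 2 KB initial stack and pushes it onto a run queue, goroutines are cheap to create in large numbers. A quick demonstration (an added example; exact numbers depend on the environment):
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	block := make(chan struct{})

	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-block // park: the g leaves the run queues until it is woken
		}()
	}

	fmt.Println("goroutines:", runtime.NumGoroutine()) // roughly 10001
	fmt.Println("Ps:        ", runtime.GOMAXPROCS(0))  // just a handful of Ps
	close(block)
	wg.Wait()
}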
Goroutine scheduling
In newm1, the function that creates an m, we can see that the new thread's entry function is mstart, so a freshly created M starts executing in mstart.
func mstart() {
_g_ := getg()
osStack := _g_.stack.lo == 0
if osStack {
// Initialize stack bounds from system stack.
// Cgo may have left stack size in stack.hi.
// minit may update the stack bounds.
// carve the stack bounds directly out of the system-allocated stack
size := _g_.stack.hi
if size == 0 {
size = 8192 * sys.StackGuardMultiplier
}
_g_.stack.hi = uintptr(noescape(unsafe.Pointer(&size)))
_g_.stack.lo = _g_.stack.hi - size + 1024
}
// Initialize stack guards so that we can start calling
// both Go and C functions with stack growth prologues.
_g_.stackguard0 = _g_.stack.lo + _StackGuard
_g_.stackguard1 = _g_.stackguard0
// hand the rest of the work to mstart1
mstart1()
// Exit this thread.
if GOOS == "windows" || GOOS == "solaris" || GOOS == "plan9" || GOOS == "darwin" {
// Window, Solaris, Darwin and Plan 9 always system-allocate
// the stack, but put it in _g_.stack before mstart,
// so the logic above hasn't set osStack yet.
osStack = true
}
// exit this m; normally mstart1 never returns once it calls schedule(), so there is no constant creating and exiting of OS threads
mexit(osStack)
}
func mstart1() {
_g_ := getg()
if _g_ != _g_.m.g0 {
throw("bad runtime·mstart")
}
// Record the caller for use as the top of stack in mcall and
// for terminating the thread.
// We're never coming back to mstart1 after we call schedule,
// so other calls can reuse the current frame.
// save the caller's pc, sp and related state into g0.sched
save(getcallerpc(), getcallersp())
asminit()
// initialize this m's signal stack and signal mask
minit()
// Install signal handlers; after minit so that minit can
// prepare the thread to be able to handle the signals.
// install the signal handlers
if _g_.m == &m0 {
mstartm0()
}
// if mstartfn was set, run it first
if fn := _g_.m.mstartfn; fn != nil {
fn()
}
if _g_.m.helpgc != 0 {
_g_.m.helpgc = 0
stopm()
} else if _g_.m != &m0 {
// acquire the p stored in nextp
acquirep(_g_.m.nextp.ptr())
_g_.m.nextp = 0
}
schedule()
}
What mstart does:
(1) Call getg to get the current g, which is g0 here. It mainly checks whether g0 still needs a stack; if no stack has been allocated, the bounds are taken from the current (system) stack, which means g0 runs on the system thread's stack.
(2) Call mstart1, which first calls save to record the caller's pc and sp into g0's scheduling data, since this m may be brand new; note that this is g0, and every later trip through the scheduler starts from this stack position. mstart1 then calls schedule.
func schedule() {
_g_ := getg()
if _g_.m.locks != 0 {
throw("schedule: holding locks")
}
// if this m is locked to a g, it may only run that g
if _g_.m.lockedg != 0 {
// park this m until its locked g is runnable again, then run that g
stoplockedm()
execute(_g_.m.lockedg.ptr(), false) // Never returns.
}
// We should not schedule away from a g that is executing a cgo call,
// since the cgo call is using the m's g0 stack.
if _g_.m.incgo {
throw("schedule: in cgo")
}
top:
// wait if the GC is stopping the world
if sched.gcwaiting != 0 {
gcstopm()
goto top
}
var gp *g
var inheritTime bool
if gp == nil {
// Check the global runnable queue once in a while to ensure fairness.
// Otherwise two goroutines can completely occupy the local runqueue
// by constantly respawning each other.
// for fairness, check the global run queue once every 61 scheduler ticks
if _g_.m.p.ptr().schedtick%61 == 0 && sched.runqsize > 0 {
lock(&sched.lock)
gp = globrunqget(_g_.m.p.ptr(), 1)
unlock(&sched.lock)
}
}
if gp == nil {
// nothing from the global queue, so try the p's local run queue
gp, inheritTime = runqget(_g_.m.p.ptr())
if gp != nil && _g_.m.spinning {
throw("schedule: spinning with local work")
}
}
if gp == nil {
// still nothing: findrunnable searches the global queue, the network poller, the local queue and other Ps' queues (work stealing); analyzed in detail later
gp, inheritTime = findrunnable() // blocks until work is available
}
// This thread is going to run a goroutine and is not spinning anymore,
// so if it was marked as spinning we need to reset it now and potentially
// start a new spinning M.
if _g_.m.spinning {
// we found work, so this m stops spinning
resetspinning()
}
if gp.lockedm != 0 {
// Hands off own p to the locked m,
// then blocks waiting for a new p.
// the g is locked to another m: hand our p to that m, park this m, and start over from top when woken
startlockedm(gp)
goto top
}
// run this g
execute(gp, inheritTime)
}
The schedule function in one line:
schedule gets a g => [sleeps if there is no work] => [keeps looking after being woken] => execute runs the g => the g finishes and returns into goexit => schedule runs again. A toy model of this loop follows below.
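Stripped of every runtime detail, that loop can be modeled by the toy program below (an illustration only; a channel stands in for the run queues and for parking/waking the M):
package main

import "fmt"

type task func()

// runq stands in for p.runq / sched.runq; receiving blocks like findrunnable.
var runq = make(chan task, 256)

// schedule: take a runnable task, execute it, loop. The real scheduler gets
// back here after every goroutine via goexit -> mcall(goexit0) -> schedule.
func schedule() {
	for {
		t := <-runq // findrunnable: blocks until work is available
		execute(t)  // execute / gogo: run it to completion ("goexit" returns here)
	}
}

func execute(t task) { t() }

func main() {
	done := make(chan struct{})
	runq <- func() { fmt.Println("first task") }
	runq <- func() { fmt.Println("second task"); close(done) }
	go schedule() // one "M" driving the scheduling loop
	<-done
}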
func execute(gp *g, inheritTime bool) {
_g_ := getg()
// mark gp as running and reset its preemption state
casgstatus(gp, _Grunnable, _Grunning)
gp.waitsince = 0
gp.preempt = false
gp.stackguard0 = gp.stack.lo + _StackGuard
if !inheritTime {
// count this scheduling round
_g_.m.p.ptr().schedtick++
}
_g_.m.curg = gp
gp.m = _g_.m
// jump into the g's code
gogo(&gp.sched)
}
What execute does:
(1) Call getg to get the current g (still g0 at this point).
(2) Change the G's status from runnable (_Grunnable) to running (_Grunning).
(3) Set the G's stackguard so its stack can grow when it runs short.
(4) Increment the P's schedtick (the counter behind the "check the global run queue every 61 ticks" rule above).
(5) Set g.m.curg = g and g.m = m.
(6) Call gogo, which restores the registers from g.sched and resumes the g.
gogo first runs a write barrier for g.sched.ctxt (so the GC marks the pointer as live); ctxt usually holds a pointer to the function plus its arguments.
It then sets the g in TLS to g.sched.g (the g itself), sets rsp to g.sched.sp, rax to g.sched.ret, rdx to g.sched.ctxt (the context), and rbp to g.sched.bp, clears the saved fields in sched, and jumps to g.sched.pc. Because newproc1 set the return address to goexit when the goroutine was created, goexit is called once the function returns.
After the target function finishes, goexit is invoked; the chain is goexit -> mcall -> goexit0.
mcall saves the context at the point where the goroutine leaves off; it does the following:
- set g.sched.pc to the current return address
- set g.sched.sp to the value of the rsp register
- set g.sched.g to the current g
- set g.sched.bp to the value of the rbp register
- switch the current g in TLS to m.g0
- set rsp to g0.sched.sp, i.e. switch onto g0's stack
- pass the original g as the first argument
- set the rdx register to the pointer to the function (the context)
- call the given function, which never returns
In short, mcall saves the current context, switches to g0 and g0's stack, and then calls the given function.
Going back to g0's stack is a crucial step: at this point the g has been suspended, and if we kept using its stack while another M woke that g up and ran it, the result would be catastrophic.
Whenever a G is suspended or finishes, it returns to g0's stack via mcall to continue scheduling; the state saved by the mcall issued from goexit is actually redundant, since that G has already finished.
func goexit0(gp *g) {
_g_ := getg()
// mark the g dead so it can be put back on the free list
casgstatus(gp, _Grunning, _Gdead)
if isSystemGoroutine(gp) {
atomic.Xadd(&sched.ngsys, -1)
}
// clear out the g's fields
gp.m = nil
locked := gp.lockedm != 0
gp.lockedm = 0
_g_.m.lockedg = 0
gp.paniconfault = false
gp._defer = nil // should be true already but just in case.
gp._panic = nil // non-nil for Goexit during panic. points at stack-allocated data.
gp.writebuf = nil
gp.waitreason = 0
gp.param = nil
gp.labels = nil
gp.timer = nil
// Note that gp's stack scan is now "valid" because it has no
// stack.
gp.gcscanvalid = true
dropg()
// put the g back onto the p's free list for reuse
gfput(_g_.m.p.ptr(), gp)
// enter the scheduling loop again
schedule()
}
By the time goexit0 runs we are already back on g0's stack; it does the following:
- change the G's status from running (_Grunning) to dead (_Gdead)
- clear the G's fields
- call dropg to break the association between the M and the G
- call gfput to put the G onto the P's free list, so it can be reused the next time a G is created
- call schedule to continue scheduling
Preemption
When a goroutine gets switched out (the first few cases are illustrated by the example right after this list):
- the goroutine blocks on a channel, mutex, or other sync operation
- time.Sleep
- a network operation that is not yet ready
- GC
- running for too long, or staying too long in a system call
- others
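Most of the cases above are cooperative switch points inside runtime calls: the blocking g is parked (taken off the run queue) and the M immediately goes back into schedule to run something else. A small illustration (an added example) using a channel and time.Sleep, both of which park the current g rather than blocking the thread:
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan int)

	go func() {
		// Receiving on an empty channel parks this g (gopark); the M
		// immediately re-enters schedule() and runs something else.
		v := <-ch
		fmt.Println("woken with", v)
	}()

	// time.Sleep also parks the current g instead of blocking the thread.
	time.Sleep(10 * time.Millisecond)
	ch <- 42 // goready: the receiver is put back onto a run queue
	time.Sleep(10 * time.Millisecond)
}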
How preemption is implemented:
The sysmon function mentioned above is responsible for goroutine preemption. sysmon also performs netpoll (collecting fd events), retake (preemption), forcegc (forcing a GC after a time limit), and heap scavenging (releasing surplus items from free lists to reduce memory usage), among other things.
func retake(now int64) uint32 {
n := 0
// Prevent allp slice changes. This lock will be completely
// uncontended unless we're already stopping the world.
lock(&allpLock)
// We can't use a range loop over allp because we may
// temporarily drop the allpLock. Hence, we need to re-fetch
// allp each time around the loop.
for i := 0; i < len(allp); i++ {
_p_ := allp[i]
if _p_ == nil {
// This can happen if procresize has grown
// allp but not yet created new Ps.
continue
}
pd := &_p_.sysmontick
s := _p_.status
if s == _Psyscall {
// Retake P from syscall if it's there for more than 1 sysmon tick (at least 20us).
// pd.syscalltick (that is, _p_.sysmontick.syscalltick) is only updated here in sysmon, while _p_.syscalltick is updated on every syscall; so the first sysmon pass after a syscall never retakes the P, only the second one onwards does, giving a gap of at least 20us and at most about 10ms
t := int64(_p_.syscalltick)
if int64(pd.syscalltick) != t {
pd.syscalltick = uint32(t)
pd.syscallwhen = now
continue
}
// On the one hand we don't want to retake Ps if there is no other work to do,
// but on the other hand we want to retake them eventually
// because they can prevent the sysmon thread from deep sleep.
// don't retake if there is no local work, there are already spinning Ms or idle Ps, and the syscall has lasted less than 10ms
if runqempty(_p_) && atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0 && pd.syscallwhen+10*1000*1000 > now {
continue
}
// Drop allpLock so we can take sched.lock.
unlock(&allpLock)
// Need to decrement number of idle locked M's
// (pretending that one more is running) before the CAS.
// Otherwise the M from which we retake can exit the syscall,
// increment nmidle and report deadlock.
incidlelocked(-1)
// retake the p by switching its status to idle
if atomic.Cas(&_p_.status, s, _Pidle) {
if trace.enabled {
traceGoSysBlock(_p_)
traceProcStop(_p_)
}
n++
_p_.syscalltick++
// hand this p off to some other m (handoffp)
handoffp(_p_)
}
incidlelocked(1)
lock(&allpLock)
} else if s == _Prunning {
// Preempt G if it's running for too long.
// the p is running: preempt its g if it has been running for too long
t := int64(_p_.schedtick)
if int64(pd.schedtick) != t {
pd.schedtick = uint32(t)
pd.schedwhen = now
continue
}
// only preempt if it has been running for more than 10ms (forcePreemptNS)
if pd.schedwhen+forcePreemptNS > now {
continue
}
// request the preemption
preemptone(_p_)
}
}
unlock(&allpLock)
return uint32(n)
}
func preemptone(_p_ *p) bool {
mp := _p_.m.ptr()
if mp == nil || mp == getg().m {
return false
}
gp := mp.curg
if gp == nil || gp == mp.g0 {
return false
}
// set the preemption flag
gp.preempt = true
// Every call in a go routine checks for stack overflow by
// comparing the current stack pointer to gp->stackguard0.
// Setting gp->stackguard0 to StackPreempt folds
// preemption into the normal stack overflow check.
// set stackguard0 to stackPreempt so the next stack-overflow check is guaranteed to trip
gp.stackguard0 = stackPreempt
return true
}
retake iterates over all Ps. If a P is in a system call (_Psyscall) and a full sysmon cycle has passed (20us to 10ms), it retakes the P and calls handoffp to detach the M from the P. If a P is running (_Prunning), a full sysmon cycle has passed, and the G has been running for longer than forcePreemptNS (10ms), it calls preemptone, which sets g.preempt = true and g.stackguard0 = stackPreempt.
The prologue of a Go function compares the stack pointer against stackguard0 to decide whether the stack needs to grow. stackPreempt is a special constant, larger than any real stack address, so the check is guaranteed to take the stack-growth path.
Stack growth goes through morestack_noctxt, which zeroes the context register and calls morestack.
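Because the preemption request is only noticed at a function prologue's stack check, a loop that never calls a function cannot be preempted this way. The added example below shows the classic consequence in the Go versions this article describes (before asynchronous preemption was introduced in Go 1.14); on Go 1.14 and later the runtime delivers a signal instead and the background goroutine does get to run:
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P makes the effect obvious

	done := make(chan bool)
	go func() {
		done <- true // only runs if the main goroutine ever gives up the P
	}()

	for i := 0; i < 1e9; i++ {
		// No function calls in here: no stack-check prologue runs, so the
		// stackguard0 = stackPreempt trick never gets a chance to fire.
	}

	select {
	case <-done:
		fmt.Println("background goroutine ran") // what you see on Go 1.14+
	default:
		fmt.Println("background goroutine never got the P") // typical before Go 1.14
	}
}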
TEXT runtime·morestack(SB),NOSPLIT,$-8-0
// Cannot grow scheduler stack (m->g0).
MOVD g_m(g), R8
MOVD m_g0(R8), R4
CMP g, R4
BNE 3(PC)
BL runtime·badmorestackg0(SB)
B runtime·abort(SB)
// Cannot grow signal stack (m->gsignal).
MOVD m_gsignal(R8), R4
CMP g, R4
BNE 3(PC)
BL runtime·badmorestackgsignal(SB)
B runtime·abort(SB)
// Called from f.
// Set g->sched to context in f
MOVD RSP, R0
MOVD R0, (g_sched+gobuf_sp)(g)
MOVD LR, (g_sched+gobuf_pc)(g)
MOVD R3, (g_sched+gobuf_lr)(g)
MOVD R26, (g_sched+gobuf_ctxt)(g)
// Called from f.
// Set m->morebuf to f's callers.
MOVD R3, (m_morebuf+gobuf_pc)(R8) // f's caller's PC
MOVD RSP, R0
MOVD R0, (m_morebuf+gobuf_sp)(R8) // f's caller's RSP
MOVD g, (m_morebuf+gobuf_g)(R8)
// Call newstack on m->g0's stack.
MOVD m_g0(R8), g
BL runtime·save_g(SB)
MOVD (g_sched+gobuf_sp)(g), R0
MOVD R0, RSP
MOVD.W $0, -16(RSP) // create a call frame on g0 (saved LR; keep 16-aligned)
BL runtime·newstack(SB)
// Not reached, but make sure the return PC from the call to newstack
// is still in this function, and not the beginning of the next.
UNDEF
TEXT runtime·morestack_noctxt(SB),NOSPLIT,$-4-0
MOVW $0, R26
B runtime·morestack(SB)
morestack saves the G's state into g.sched, switches to g0's stack, and then calls newstack.
newstack is a Go function; it checks whether this stack check was triggered by a preemption request, and if so calls gopreempt_m -> goschedImpl to carry out the preemption.
func goschedImpl(gp *g) {
status := readgstatus(gp)
if status&^_Gscan != _Grunning {
dumpgstatus(gp)
throw("bad g status")
}
casgstatus(gp, _Grunning, _Grunnable)
dropg()
lock(&sched.lock)
globrunqput(gp)
unlock(&sched.lock)
schedule()
}
goschedImpl does the following:
(1) Change the G's status from running (_Grunning) back to runnable (_Grunnable).
(2) Call dropg to break the association between the M and the G.
(3) Call globrunqput to put the G onto the global run queue.
(4) Call schedule to continue scheduling.
And so we end up back in schedule, going around the loop again and again.