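This article addresses two questions:
1. Why a thread cannot be suspended while it runs at APC_LEVEL.
2. Why pageable memory can still be used at APC_LEVEL.
How a thread responds to an APC depends on the kind of APC; see the MSDN documentation for the details. While reading Microsoft's material I ran into two points that I found hard to understand, so I am pulling them out for separate discussion.
First, a look at interrupt request levels (IRQLs):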
IRQL | x86 | IA64 | AMD64 | Description |
PASSIVE_LEVEL | 0 | 0 | 0 | User threads and most kernel-mode operations |
APC_LEVEL | 1 | 1 | 1 | Asynchronous procedure calls and page faults |
DISPATCH_LEVEL | 2 | 2 | 2 | Thread scheduler and deferred procedure calls (DPCs) |
CMC_LEVEL | N/A | 3 | N/A | Correctable machine-check level (IA64 platforms only) |
Device interrupt levels (DIRQL) | 3-26 | 4-11 | 3-11 | Device interrupts |
PC_LEVEL | N/A | 12 | N/A | Performance counter (IA64 platforms only) |
PROFILE_LEVEL | 27 | 15 | 15 | Profiling timer for releases earlier than Windows 2000 |
SYNCH_LEVEL | 27 | 13 | 13 | Synchronization of code and instruction streams across processors |
CLOCK_LEVEL | N/A | 13 | 13 | Clock timer |
CLOCK2_LEVEL | 28 | N/A | N/A | Clock timer for x86 hardware |
IPI_LEVEL | 29 | 14 | 14 | Interprocessor interrupt for enforcing cache consistency |
POWER_LEVEL | 30 | 15 | 14 | Power failure |
HIGH_LEVEL | 31 | 15 | 15 | Machine checks and catastrophic errors; profiling timer for Windows XP and later releases |
Microsoft says:
When a processor is running at a given IRQL, interrupts at that IRQL and lower are masked off (blocked) on the processor.
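In practice, though, this statement only fits the range from DISPATCH_LEVEL up to HIGH_LEVEL. APC_LEVEL and PASSIVE_LEVEL need special treatment, and because threads running at these two levels can still be scheduled, the picture gets more complicated.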
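The masking rule for the processor-specific range can be written down as a one-line predicate. The following is a minimal sketch, not kernel code: the helper names `interrupt_is_masked` and `masking_rule_self_check` are hypothetical, and the numeric levels are the x86 values from the table above.

```c
#include <assert.h>
#include <stdbool.h>

/* x86 IRQL values copied from the table above. */
enum {
    PASSIVE_LEVEL  = 0,
    APC_LEVEL      = 1,
    DISPATCH_LEVEL = 2,
    CLOCK2_LEVEL   = 28,
    HIGH_LEVEL     = 31
};

/* Hypothetical helper: returns true when an interrupt arriving at level
 * `incoming` stays pending on a processor currently running at `current`.
 * This mirrors the quoted rule: interrupts at the current IRQL and lower
 * are masked off. */
bool interrupt_is_masked(int current, int incoming)
{
    return incoming <= current;
}

/* Hypothetical self-check that exercises the rule for a few levels. */
void masking_rule_self_check(void)
{
    /* At DISPATCH_LEVEL, DPC-level work is held back ... */
    assert(interrupt_is_masked(DISPATCH_LEVEL, DISPATCH_LEVEL));
    /* ... but a device interrupt at DIRQL 5 still gets through. */
    assert(!interrupt_is_masked(DISPATCH_LEVEL, 5));
    /* At HIGH_LEVEL nothing gets through. */
    assert(interrupt_is_masked(HIGH_LEVEL, CLOCK2_LEVEL));
}
```

The predicate says nothing about scheduling, which is exactly why it is not enough to describe PASSIVE_LEVEL and APC_LEVEL.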
So Microsoft goes further and divides IRQLs into processor-specific and thread-specific IRQLs.
Processor-specific IRQLs:
- DISPATCH_LEVEL
- DIRQL
- HIGH_LEVEL
Rule: When a processor is running at a given IRQL, interrupts at that IRQL and lower are masked off (blocked) on the processor.
Thread-specific IRQLs:
- PASSIVE_LEVEL
- PASSIVE_LEVEL in a critical region, an intermediate level (entered with KeEnterCriticalRegion, left with KeLeaveCriticalRegion)
- APC_LEVEL
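Threads running at any of these three levels can be scheduled by the scheduler (which itself runs at DISPATCH_LEVEL). The scheduler looks only at priority: a higher-priority thread preempts a lower-priority one. So a low-priority thread running at APC_LEVEL can be preempted by a higher-priority thread running at PASSIVE_LEVEL. In Microsoft's words: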
The thread scheduler considers only thread priority, and not IRQL, when preempting a thread. If a thread running at IRQL=APC_LEVEL blocks, the scheduler might select a new thread for the processor that was previously running at PASSIVE_LEVEL.
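To make the point concrete, here is a toy model of that selection rule. It is only a sketch under simplifying assumptions: `struct toy_thread` and `pick_next` are hypothetical names, there is a single processor, and the real scheduler's ready queues, quantum handling, and affinity are ignored.

```c
#include <assert.h>
#include <stddef.h>

enum { PASSIVE_LEVEL = 0, APC_LEVEL = 1 };

/* Hypothetical, heavily simplified thread record. */
struct toy_thread {
    const char *name;
    int priority;  /* higher value wins */
    int irql;      /* thread-specific IRQL: PASSIVE_LEVEL or APC_LEVEL */
    int ready;     /* nonzero when the thread is able to run */
};

/* Return the ready thread with the highest priority, or NULL when no
 * thread is ready. The irql field is deliberately never consulted. */
const struct toy_thread *pick_next(const struct toy_thread *threads,
                                   size_t count)
{
    const struct toy_thread *best = NULL;

    for (size_t i = 0; i < count; i++) {
        if (!threads[i].ready) {
            continue;
        }
        if (best == NULL || threads[i].priority > best->priority) {
            best = &threads[i];
        }
    }
    return best;
}
```

Given a priority-8 thread at APC_LEVEL and a priority-12 thread at PASSIVE_LEVEL, `pick_next` returns the priority-12 thread, even though its IRQL is lower.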
For thread-specific IRQLs, it helps to picture the thread as a pseudo-CPU that has only three levels: PASSIVE_LEVEL, the intermediate level, and APC_LEVEL.
IRQL PASSIVE_LEVEL in a critical region
Code that is running at PASSIVE_LEVEL in a critical region is effectively running at an intermediate level between PASSIVE_LEVEL and APC_LEVEL. Calls to KeGetCurrentIrql return PASSIVE_LEVEL. Driver code can determine whether it is operating in a critical region by calling the function KeAreApcsDisabled (available in Windows XP and later releases).
Driver code that is running above PASSIVE_LEVEL (either at PASSIVE_LEVEL in a critical region or at APC_LEVEL or higher) cannot be suspended. Almost every operation that a driver can perform at PASSIVE_LEVEL can also be performed in a critical region. Two notable exceptions are raising hard errors and opening a file on storage media.
IRQL APC_LEVEL
APC_LEVEL is a thread-specific IRQL that is most commonly associated with paging I/O. Applications cannot suspend code that is running at IRQL=APC_LEVEL. The system implements fast mutexes (a type of synchronization mechanism) at APC_LEVEL. The ExAcquireFastMutex routine raises the IRQL to APC_LEVEL, and ExReleaseFastMutex returns the IRQL to its original value.
The only difference between a thread that is running at PASSIVE_LEVEL with APCs disabled and a thread that is running at APC_LEVEL is that while running at APC_LEVEL, the thread cannot be interrupted to deliver a special kernel-mode APC.
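The three thread states can be compared side by side with a small predicate. This is a sketch of the rules quoted above, not of the kernel's actual delivery logic: `struct toy_thread_state` and `apc_deliverable` are hypothetical names, and guarded regions, APCs already in progress, and user-mode APCs are left out.

```c
#include <assert.h>
#include <stdbool.h>

enum { PASSIVE_LEVEL = 0, APC_LEVEL = 1 };

enum apc_kind { NORMAL_KERNEL_APC, SPECIAL_KERNEL_APC };

/* Hypothetical snapshot of the state that matters for APC delivery. */
struct toy_thread_state {
    int  irql;               /* PASSIVE_LEVEL or APC_LEVEL */
    bool in_critical_region; /* between KeEnterCriticalRegion and
                                KeLeaveCriticalRegion */
};

/* Returns true when a kernel-mode APC of the given kind could interrupt
 * the thread in this state, according to the quoted documentation. */
bool apc_deliverable(struct toy_thread_state s, enum apc_kind kind)
{
    if (s.irql >= APC_LEVEL) {
        return false;            /* APC_LEVEL blocks every APC */
    }
    if (kind == SPECIAL_KERNEL_APC) {
        return true;             /* still delivered in a critical region */
    }
    return !s.in_critical_region; /* normal kernel APCs are held back */
}
```

The only row where the critical region and APC_LEVEL differ is the special kernel-mode APC, which matches the sentence quoted above.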
A thread can enter APC_LEVEL in several ways:
- Calling ExAcquireFastMutex
- Delivery of an APC
- Calling KeRaiseIrql (rarely used for this), among others.
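Once a thread has entered APC_LEVEL by acquiring a fast mutex, any other thread that tries to acquire the same mutex is put into a wait state. As for the owning thread itself, Microsoft says: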
Code paths that are protected by a fast mutex run at IRQL=APC_LEVEL, thus disabling delivery of all APCs and preventing the thread from suspension.
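In other words, no APC is delivered and the thread cannot be suspended. Why can it not be suspended? Because the operating system implements thread suspension by delivering an APC whose routine waits on a semaphore (I found this answer by reading the WRK source). Since code running at APC_LEVEL disables delivery of all APCs, the suspend APC cannot run while the thread stays at that level.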
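The mechanism can be mimicked with a few lines of state. This is a toy model, assuming only what the paragraph above states (suspension is an APC, and APCs are not delivered at APC_LEVEL); the names `toy_thread`, `toy_deliver_pending`, `toy_suspend`, and `toy_set_irql` are hypothetical and do not correspond to WRK functions.

```c
#include <assert.h>

enum { PASSIVE_LEVEL = 0, APC_LEVEL = 1 };

/* Hypothetical, heavily simplified thread state. */
struct toy_thread {
    int irql;                /* PASSIVE_LEVEL or APC_LEVEL */
    int suspend_apc_pending; /* suspend APC queued but not yet delivered */
    int suspended;           /* set once the suspend APC has run */
};

/* Deliver the queued suspend APC if the current level allows it. In the
 * real system the APC routine would block on the suspend semaphore; the
 * model just records that fact. */
void toy_deliver_pending(struct toy_thread *t)
{
    if (t->suspend_apc_pending && t->irql < APC_LEVEL) {
        t->suspend_apc_pending = 0;
        t->suspended = 1;
    }
}

/* Request suspension: queue the APC, then attempt delivery. */
void toy_suspend(struct toy_thread *t)
{
    t->suspend_apc_pending = 1;
    toy_deliver_pending(t);
}

/* Change the thread's level; lowering it may release a pending APC. */
void toy_set_irql(struct toy_thread *t, int new_irql)
{
    t->irql = new_irql;
    toy_deliver_pending(t);
}
```

A thread that is asked to suspend while at APC_LEVEL keeps running with the request pending, and only becomes suspended after it drops back to PASSIVE_LEVEL, for example when it releases the fast mutex.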
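If you think of a thread as a pseudo-CPU with only three levels (PASSIVE_LEVEL, the intermediate level, and APC_LEVEL) and treat APC delivery as an interrupt, Microsoft's interrupt rule applies directly: When a processor is running at a given IRQL, interrupts at that IRQL and lower are masked off (blocked) on the processor. The pseudo-CPU at APC_LEVEL masks every interrupt at or below that level.
There are three kinds of APC: normal kernel-mode APCs, special kernel-mode APCs, and user-mode APCs. A queued APC does not necessarily run right away; whether it does depends on circumstances that are complicated and outside the scope of this article. The MSDN documentation has the answers.
The second question: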
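Pageable memory can be used at APC_LEVEL. But from what I read, when a page fault occurs in pageable memory the system delivers an APC. Would that APC then never run? Would that not produce a deadlock and a blue screen, just as at DISPATCH_LEVEL?
I later found what I thought was the key point. When a page fault occurs at APC_LEVEL, the system first suspends the thread, which waits on a kernel synchronization object. Although the thread is running at APC_LEVEL, it is suspended, so the scheduler runs other threads. When the page fault has been handled, the system delivers an APC to the thread, the thread temporarily becomes schedulable again, and the scheduler runs the APC. The APC signals the kernel synchronization object the thread is waiting on, but the waiting code does not resume immediately; it has to wait until the APC returns, because the APC executes in the context of this same thread. Once the page-fault APC returns, the object the thread is waiting on is signaled, the thread becomes schedulable, and the scheduler resumes it at a suitable time. Since the memory fault has been resolved, the memory access now succeeds.
That explanation is wrong. Once a thread is at APC_LEVEL and in a wait state, an APC queued to it will not run. Queuing an APC amounts to sending the thread a software interrupt, and at that level it gets no response. The main reason pageable memory cannot be used at DISPATCH_LEVEL is that a page fault has to wait for the paging I/O to complete, and code at DISPATCH_LEVEL cannot call the KeWaitXXX routines to wait. The reason is not the queuing of an APC; besides, synchronous paging I/O does not queue an APC at all. The IoCompleteRequest function contains the following code: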
if (Irp->Flags & (IRP_PAGING_IO | IRP_CLOSE_OPERATION | IRP_SET_USER_EVENT)) {
    if (Irp->Flags & (IRP_SYNCHRONOUS_PAGING_IO | IRP_CLOSE_OPERATION | IRP_SET_USER_EVENT)) {
        ULONG flags;

        /* Synchronous paging I/O (or close / set-user-event): no APC is
           queued. The status is copied and the waiter's event is signaled. */
        flags = Irp->Flags & (IRP_SYNCHRONOUS_PAGING_IO | IRP_PAGING_IO);
        *Irp->UserIosb = Irp->IoStatus;
        (VOID) KeSetEvent( Irp->UserEvent, PriorityBoost, FALSE );
        if (flags) {
            if (IopIsReserveIrp(Irp)) {
                IopFreeReserveIrp(PriorityBoost);
            } else {
                IoFreeIrp( Irp );
            }
        }
    } else {
        /* Remaining paging I/O: completion is finished by an APC queued to
           the originating thread. NormalRoutine is NULL, which makes this
           a special kernel-mode APC. */
        thread = Irp->Tail.Overlay.Thread;
        KeInitializeApc( &Irp->Tail.Apc,
                         &thread->Tcb,
                         Irp->ApcEnvironment,
                         IopCompletePageWrite,
                         (PKRUNDOWN_ROUTINE) NULL,
                         (PKNORMAL_ROUTINE) NULL,
                         KernelMode,
                         (PVOID) NULL );
        (VOID) KeInsertQueueApc( &Irp->Tail.Apc,
                                 (PVOID) NULL,
                                 (PVOID) NULL,
                                 PriorityBoost );
    }
    return;
}
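The branch structure of that excerpt can be isolated in a user-mode model that only classifies the flag combinations. This is a sketch for reasoning about the excerpt, not a replacement for it: `classify_completion` and the `PATH_*` names are hypothetical, the bit values are illustrative stand-ins for the real definitions in the WDK headers, and treating IRP_SET_USER_EVENT as an alias of IRP_CLOSE_OPERATION is an assumption.

```c
#include <assert.h>

/* Illustrative bit values (assumptions); the authoritative definitions
 * live in the WDK / WRK headers. */
#define IRP_PAGING_IO             0x00000002UL
#define IRP_SYNCHRONOUS_PAGING_IO 0x00000040UL
#define IRP_CLOSE_OPERATION       0x00000400UL
#define IRP_SET_USER_EVENT        IRP_CLOSE_OPERATION /* assumed alias */

/* Hypothetical labels for the three outcomes of the excerpt. */
enum completion_path {
    PATH_NOT_SHOWN, /* falls through to code outside the excerpt */
    PATH_SET_EVENT, /* KeSetEvent on Irp->UserEvent, no APC */
    PATH_QUEUE_APC  /* KeInitializeApc + KeInsertQueueApc */
};

/* Mirrors the two nested flag tests from the excerpt. */
enum completion_path classify_completion(unsigned long flags)
{
    if (flags & (IRP_PAGING_IO | IRP_CLOSE_OPERATION | IRP_SET_USER_EVENT)) {
        if (flags & (IRP_SYNCHRONOUS_PAGING_IO | IRP_CLOSE_OPERATION |
                     IRP_SET_USER_EVENT)) {
            return PATH_SET_EVENT;
        }
        return PATH_QUEUE_APC;
    }
    return PATH_NOT_SHOWN;
}
```

With these definitions, a synchronous page read (IRP_PAGING_IO together with IRP_SYNCHRONOUS_PAGING_IO) lands on the event path, which is the case a page fault depends on; only paging I/O without the synchronous flag reaches the APC path.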
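If the thread is running at DISPATCH_LEVEL, the scheduler obviously cannot run on a single-core CPU (on a multi-core CPU it can run on another core). But the main reason is that the queued APC is equivalent to an interrupt at APC_LEVEL, and it never gets to run. That is why a page fault at DISPATCH_LEVEL crashes the system. In addition, code at DISPATCH_LEVEL cannot wait on a kernel object with a nonzero timeout.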
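The waiting restriction can be summarized as a small predicate. The sketch below follows the documented IRQL requirement for KeWaitForSingleObject as I understand it (a blocking wait needs IRQL <= APC_LEVEL; at DISPATCH_LEVEL only a zero timeout is allowed); `wait_is_legal` and `page_fault_can_be_serviced` are hypothetical helper names, and the model assumes that servicing a page fault needs a blocking wait for the paging I/O.

```c
#include <assert.h>
#include <stdbool.h>

enum { PASSIVE_LEVEL = 0, APC_LEVEL = 1, DISPATCH_LEVEL = 2 };

/* Hypothetical check of whether a wait may be issued at `irql`.
 * has_timeout == false models a NULL timeout (wait indefinitely);
 * otherwise `timeout` is the requested interval and 0 means "poll,
 * do not block". */
bool wait_is_legal(int irql, bool has_timeout, long long timeout)
{
    if (irql > DISPATCH_LEVEL) {
        return false;                       /* never allowed above DISPATCH_LEVEL */
    }
    if (irql == DISPATCH_LEVEL) {
        return has_timeout && timeout == 0; /* only a non-blocking poll */
    }
    return true;                            /* PASSIVE_LEVEL or APC_LEVEL */
}

/* Assumption: resolving a page fault requires blocking until the paging
 * I/O completes, which is modeled here as an indefinite wait. */
bool page_fault_can_be_serviced(int irql)
{
    return wait_is_legal(irql, false, 0);
}
```

Under this model a page fault can be serviced at PASSIVE_LEVEL and APC_LEVEL but not at DISPATCH_LEVEL, which is the asymmetry the article set out to explain.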