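Recently I have been writing a character driver for practice, in which the read/write dispatch routines handle IRPs asynchronously through StartIo. During testing I found that after the application issued a few ReadFile requests, the driver stopped responding. Because the driver processes I/O requests asynchronously, I guessed that it had deadlocked. I tried WinDbg's !locks command to look for the deadlock, but the output was empty... As a last resort, it occurred to me that the verifier tool might be able to detect potential deadlocks in the driver.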
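There are two ways to enable verifier.exe's deadlock detection from the command line (for convenience I chose option 2):
My driver is named SampleChar.sys.
1. Taking effect after a reboot: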
verifier /flags 0x20 /driver SampleChar.sys
2. Taking effect immediately:
verifier /volatile /flags 0x20 /adddriver SampleChar.sys
Note: /flags 0x20 selects deadlock detection; in verifier.exe this option is bit 5 (0x20).
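To make the bit arithmetic in the note concrete, here is a small user-mode C sketch (not part of the driver) showing how a bit index maps to the mask passed to /flags. The helper names are made up for illustration and are not a verifier.exe API.

```c
#include <stdint.h>

/* Hypothetical helper (not a verifier.exe API): turn a bit index
   into the mask value passed to /flags. */
static uint32_t verifier_flag_from_bit(unsigned bit)
{
    return (uint32_t)1u << bit;
}

/* Bit 5 is the deadlock detection option used in this post. */
static uint32_t deadlock_detection_flag(void)
{
    return verifier_flag_from_bit(5);
}
```

Several options could be combined by OR-ing their masks together before passing the result to /flags.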
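Enabling this option does not immediately reveal potential deadlocks in the driver; a test program is still needed to exercise the driver's code paths.
Below is the code fragment that verifier flagged. The intent of the code: when the application calls ReadFile, the driver calls IoStartPacket to insert the IRP into the device queue and then returns asynchronously. The rest of the work is done by StartIo.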
#pragma code_seg()
void SampleStartIo(PDEVICE_OBJECT devObj, PIRP irp)
{
    KEVENT workEvt, completeEvt;
    KIRQL origIrql;
    NTSTATUS status = STATUS_SUCCESS;
    unsigned long readLen;
    SampleCharDevContext* devCtx = (SampleCharDevContext*)devObj->DeviceExtension;
    IO_STACK_LOCATION* curStack = IoGetCurrentIrpStackLocation(irp);
    /* Relative timeout of 3 seconds, expressed in 100-ns units. */
    LARGE_INTEGER waitTime = RtlConvertLongToLargeInteger(-10*1000*1000*3);

    KeInitializeEvent(&workEvt, SynchronizationEvent, FALSE);
    KeInitializeEvent(&completeEvt, NotificationEvent, FALSE);

    /* This is the call that Driver Verifier flags (see the analysis below). */
    KeWaitForSingleObject(&workEvt, Executive, KernelMode, FALSE, &waitTime);

    if (curStack->Parameters.Read.Length > 4096)
    {
        status = irp->IoStatus.Status = STATUS_BUFFER_OVERFLOW;
        irp->IoStatus.Information = 0;

        IoCompleteRequest(irp, IO_NO_INCREMENT);
        IoStartNextPacket(devObj, FALSE);
        return;
    }

    KeAcquireSpinLock(&devCtx->devSpinLock, &origIrql);
    if (devCtx->buffPos != 0x00UL)
    {
        /* Copy at most the number of bytes currently buffered. */
        readLen = devCtx->buffPos >= curStack->Parameters.Read.Length ? curStack->Parameters.Read.Length : devCtx->buffPos;
        RtlCopyMemory(irp->AssociatedIrp.SystemBuffer,
                      devCtx->SampleBuff,
                      readLen);
        devCtx->buffRemained += readLen;
        devCtx->buffPos -= readLen;
    }
    else
    {
        /* Nothing buffered: complete the read with zero bytes.
           Note: this path returns without calling IoStartNextPacket. */
        KeReleaseSpinLock(&devCtx->devSpinLock, origIrql);
        irp->IoStatus.Status = STATUS_SUCCESS;
        irp->IoStatus.Information = 0x00UL;
        IoCompleteRequest(irp, IO_NO_INCREMENT);
        return;
    }
    KeReleaseSpinLock(&devCtx->devSpinLock, origIrql);

    IoCopyCurrentIrpStackLocationToNext(irp);
    IoSetCompletionRoutine(irp, IrpAsyncReadCompleteRoutine, &completeEvt, TRUE, TRUE, TRUE);
    status = IoCallDriver(devCtx->lowerDev, irp);
    if (status == STATUS_PENDING)
    {
        /* Also a blocking wait (NULL timeout means wait indefinitely). */
        KeWaitForSingleObject(&completeEvt, Executive, KernelMode, FALSE, NULL);
    }

    irp->IoStatus.Status = STATUS_SUCCESS;
    irp->IoStatus.Information = readLen;
    IoCompleteRequest(irp, IO_NO_INCREMENT);
    IoStartNextPacket(devObj, FALSE);
}

NTSTATUS SampleCharReadAsync(PDEVICE_OBJECT devObj, PIRP irp)
{
    /* Queue the IRP for StartIo and return to the caller right away. */
    IoMarkIrpPending(irp);
    IoStartPacket(devObj, irp, NULL, NULL);
    return STATUS_PENDING;
}
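The queuing behavior that the listing relies on can be illustrated with a tiny user-mode model. This is only a sketch under simplifying assumptions: the struct and function names are hypothetical, nothing here is kernel code, and it captures just the rule that IoStartPacket starts a packet when the device is idle and queues it otherwise, while IoStartNextPacket hands the next queued packet to StartIo.

```c
/* User-mode model of the IoStartPacket / IoStartNextPacket queue
   discipline. All names here are hypothetical; this is not kernel code. */
typedef struct {
    int busy;     /* 1 while a packet is being processed by StartIo */
    int queued;   /* packets waiting in the device queue */
    int started;  /* how many times StartIo has been invoked */
} DeviceQueueModel;

/* Models IoStartPacket: run StartIo now if idle, otherwise queue. */
static void model_start_packet(DeviceQueueModel *q)
{
    if (q->busy) {
        q->queued++;
    } else {
        q->busy = 1;
        q->started++;
    }
}

/* Models IoStartNextPacket: hand the next queued packet to StartIo,
   or mark the device idle when the queue is empty. */
static void model_start_next_packet(DeviceQueueModel *q)
{
    if (q->queued > 0) {
        q->queued--;
        q->started++;
    } else {
        q->busy = 0;
    }
}
```

In this model, a StartIo path that returns without the "start next" step leaves every later packet sitting in the queue, which is one thing worth checking when a StartIo-based driver stops responding.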
As soon as the test program runs, it immediately triggers bugcheck 0xC4:
kd> g
*** Fatal System Error: 0x000000c4
(0x00000122,0x00000002,0xA2047BA8,0xA2047BC8)
Break instruction exception - code 80000003 (first chance)
A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.
A fatal system error has occurred.
WinDbg's !analyze -v command shows the cause of the error (only the important parts are excerpted here):
DRIVER_VERIFIER_DETECTED_VIOLATION (c4) ----> C4 is the bugcheck raised by Driver Verifier
Arguments:
Arg1: 00000122, Waiting at DISPATCH_LEVEL, with a timeout different than zero. (Arg1, 0x122, is the error code to look up in the WinDbg help)
Arg2: 00000002, IRQL value.
Arg3: a2047ba8, Object to wait on.
Arg4: a2047bc8, Address of the time out value.
FAULTING_SOURCE_CODE:
244: KeInitializeEvent(&completeEvt, NotificationEvent, FALSE);
245:
246: KeWaitForSingleObject(&workEvt, Executive, KernelMode, FALSE, &waitTime);
247:
> 248: if (curStack->Parameters.Read.Length > 4096) ----> pinpoints where the blue screen was triggered
249: {
250: status = irp->IoStatus.Status = STATUS_BUFFER_OVERFLOW;
251: irp->IoStatus.Information = 0;
252:
253: IoCompleteRequest(irp, IO_NO_INCREMENT);
The information WinDbg gives is already enough to locate the cause. For reference, here is the explanation the WinDbg help gives for error code 0x122:
The thread waits at DISPATCH_LEVEL and Timeout value is not equal to zero (0).
If the Timeout != 0, the callers of KeWaitForSingleObject or
KeWaitForMultipleObjects must run at IRQL <= APC_LEVEL.
So the problem is that a wait function was called at a high IRQL, namely this line:
KeWaitForSingleObject(&workEvt, Executive, KernelMode, FALSE, &waitTime);
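The rule quoted from the help text, applied to this call, can be written down as a small user-mode check. This is a sketch for reasoning only: the names are hypothetical, the IRQL constants simply mirror the numeric values seen in the bugcheck (Arg2 was 2, DISPATCH_LEVEL), and the timeout helper reproduces the 3-second relative value that the listing passes as waitTime.

```c
#include <stddef.h>
#include <stdint.h>

/* Numeric IRQL values; Arg2 in the bugcheck above was 2. */
enum {
    MODEL_PASSIVE_LEVEL  = 0,
    MODEL_APC_LEVEL      = 1,
    MODEL_DISPATCH_LEVEL = 2
};

/* Relative timeouts are negative counts of 100-nanosecond units. */
static int64_t model_relative_timeout(int seconds)
{
    return -(int64_t)seconds * 10 * 1000 * 1000;
}

/* Hypothetical check mirroring the documented rule. A NULL timeout
   means "wait indefinitely", so it counts as non-zero here. */
static int model_wait_allowed(int irql, const int64_t *timeout)
{
    if (irql <= MODEL_APC_LEVEL)
        return 1;                       /* blocking is allowed */
    if (irql == MODEL_DISPATCH_LEVEL)
        return timeout != NULL && *timeout == 0;
    return 0;                           /* above DISPATCH_LEVEL */
}
```

With the values from this post, model_wait_allowed(MODEL_DISPATCH_LEVEL, &t) is false for t = model_relative_timeout(3), which matches what Driver Verifier reported.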
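My code followed the StartIo implementation in chapter 9 of 《Windows驱动开发详解》, which uses a timed wait so that the driver pauses in StartIo for a while before continuing. At first I did not notice that StartIo is called at IRQL == DISPATCH_LEVEL, where functions that block the thread should not be called (in fact, without Driver Verifier the driver ran passably, or at least did not blue-screen). After removing the wait code, rebuilding, and reloading the driver, I ran the Driver Verifier deadlock test again, and this time the blue screen did not recur.
The deadlock itself is still unsolved, but uncovering a hidden bug along the way is not a bad result.
Finally, here are the related links: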