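Crashes have many causes, and they show up in just as many different ways. Rather than generalize, let's walk through how one particular crash was analyzed with WinDbg.
One day a tester reported that Sample.exe had crashed. The two developers, Xiao Cui and Xiao Ruan, promptly "crashed" as well when they heard the news. According to the dump file captured when Sample.exe crashed, the call stack of the faulting thread was: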
0:000> kL
ChildEBP RetAddr
00100350 77d193f5 ntdll!KiFastSystemCallRet
00100388 77d2688a user32!NtUserWaitMessage+0xc
001003b0 77d3b7c5 user32!InternalDialogBox+0xd0
00100670 77d3b12b user32!SoftModalMessageBox+0x938
001007c0 77d65fdf user32!MessageBoxWorker+0x2ba
00100818 77d66084 user32!MessageBoxTimeoutW+0x7a
0010084c 77d50598 user32!MessageBoxTimeoutA+0x9c
0010086c 77d50550 user32!MessageBoxExA+0x1b
00100888 102150e7 user32!MessageBoxA+0x45
WARNING: Stack unwind information not available. Following frames may be wrong.
001008a8 10215863 MSVCRTD!CrtMemDumpStatistics+0x187
001019f4 10215556 MSVCRTD!CrtDbgReport+0x663
00104a40 10213b27 MSVCRTD!CrtDbgReport+0x356
00104a78 10213901 MSVCRTD!free_dbg+0x267
00104aac 5f429a22 MSVCRTD!free_dbg+0x41
00104abc 0040dcd6 MFC42D!operator delete+0xf
00104b18 0040ce2a sample!std::allocator<unsigned long>::deallocate+0x26
00104b80 0040cc39 sample!std::vector<unsigned long,std::allocator<unsigned long> >::operator=+0x14a
00104bdc 0040c85a sample!Main_Wnd::_SVR_STATUS::operator=+0x69
0010546c 5f4317d6 sample!Main_Wnd::OnTimer+0x25a
00105568 5f4310b8 MFC42D!CWnd::OnWndMsg+0x6f4
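The two of them stared blankly at the call stack, poking at one piece of information after another with no plan and no lead. After a whole afternoon of effort, they gave up.
Several days later...
While thinking about another bug caused by memory corruption, Xiao Cui wondered: could the VFNSMgr crash described above be caused by an out-of-bounds memory write?
He and Xiao Ruan reopened the dump file and examined the call stack again.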
0:000> kbL
ChildEBP RetAddr Args to Child
00100350 77d193f5 77d3ea24 001711de 00000001 ntdll!KiFastSystemCallRet
00100388 77d2688a 00171186 001711de 00000001 user32!NtUserWaitMessage+0xc
001003b0 77d3b7c5 77d10000 00154728 001711de user32!InternalDialogBox+0xd0
00100670 77d3b12b 001007cc 00000000 ffffffff user32!SoftModalMessageBox+0x938
001007c0 77d65fdf 001007cc 00000028 001711de user32!MessageBoxWorker+0x2ba
00100818 77d66084 001711de 001545b0 001546d8 user32!MessageBoxTimeoutW+0x7a
0010084c 77d50598 001711de 001008e8 102509a0 user32!MessageBoxTimeoutA+0x9c
0010086c 77d50550 001711de 001008e8 102509a0 user32!MessageBoxExA+0x1b
00100888 102150e7 001711de 001008e8 102509a0 user32!MessageBoxA+0x45
WARNING: Stack unwind information not available. Following frames may be wrong.
001008a8 10215863 001008e8 102509a0 00012012 MSVCRTD!CrtMemDumpStatistics+0x187
001019f4 10215556 00000001 00000000 00000000 MSVCRTD!CrtDbgReport+0x663
00104a40 10213b27 00000001 00000000 00000000 MSVCRTD!CrtDbgReport+0x356
00104a78 10213901 0140a4b0 00000001 00104b18 MSVCRTD!free_dbg+0x267
00104aac 5f429a22 0140a4b0 00000001 00104b18 MSVCRTD!free_dbg+0x41
00104abc 0040dcd6 0140a4b0 00104b80 0000000d MFC42D!operator delete+0xf
00104b18 0040ce2a 0140a4b0 00000008 00104bdc VFNSMgr!std::allocator<unsigned long>::deallocate+0x26
00104b80 0040cc39 014085b0 0010545c 01408abc VFNSMgr!std::vector<unsigned long,std::allocator<unsigned long> >::operator=+0x14a
00104bdc 0040c85a 0140859c 001055e8 001436f8 VFNSMgr!Main_Wnd::_SVR_STATUS::operator=+0x69
0010546c 5f4317d6 00000001 001055e8 001436f8 sample!Main_Wnd::OnTimer+0x25a
00105568 5f4310b8 00000113 00000001 00000000 MFC42D!CWnd::OnWndMsg+0x6f4
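The stack shows that a memory block at address 0140a4b0 was being deleted at the time (0140a4b0 is the first argument passed to operator delete and free_dbg). The next step is to check whether the structure of the block at 0140a4b0 is intact.
Because sample uses the debug CRT heap, each block allocated from the heap is laid out as follows:
Ntdll!_HEAP_ENTRY | _CrtMemBlockHeader | UserData | ExtraData |
Ntdll!_HEAP_ENTRY is the 8-byte heap block header.
_CrtMemBlockHeader is 32 bytes of management information for the CRT heap.
UserData is the data area; its first address is the address returned by new.
ExtraData is a reserved area used for byte alignment and as a fence.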
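The layout above can be turned into simple address arithmetic, which is what the `0140a4b0-0n40` and `0140a4b0-0n32` expressions in the following commands rely on. This is a minimal sketch that assumes the 32-bit layout described here (8-byte _HEAP_ENTRY, 32-byte _CrtMemBlockHeader). The struct name CrtMemBlockHeader32 and the helper functions are hypothetical names for illustration, not part of the CRT.

```cpp
#include <cstdint>

// Model of the 32-bit debug CRT block header, field names as printed by
// `dt _CrtMemBlockHeader`. Pointers are stored as uint32_t so the layout
// stays 32 bytes no matter which host compiles this sketch.
struct CrtMemBlockHeader32 {
    std::uint32_t pBlockHeaderNext;
    std::uint32_t pBlockHeaderPrev;
    std::uint32_t szFileName;
    std::int32_t  nLine;
    std::uint32_t nDataSize;
    std::int32_t  nBlockUse;
    std::int32_t  lRequest;
    std::uint8_t  gap[4];  // fence in front of the user data
};
static_assert(sizeof(CrtMemBlockHeader32) == 32, "header must be 32 bytes");

// Size of ntdll!_HEAP_ENTRY in this dump (assumption taken from the text).
const std::uint32_t kHeapEntrySize = 8;

// Address of the CRT header, given the pointer returned by new.
inline std::uint32_t crtHeaderAddress(std::uint32_t userPtr) {
    return userPtr - static_cast<std::uint32_t>(sizeof(CrtMemBlockHeader32));
}

// Address of the heap entry that precedes the CRT header.
inline std::uint32_t heapEntryAddress(std::uint32_t userPtr) {
    return crtHeaderAddress(userPtr) - kHeapEntrySize;
}

// Address where the fence behind the user data is expected to start.
inline std::uint32_t trailingFenceAddress(std::uint32_t userPtr,
                                          std::uint32_t nDataSize) {
    return userPtr + nDataSize;
}
```

For the block in this dump, heapEntryAddress(0x0140a4b0) is 0x0140a488 and crtHeaderAddress(0x0140a4b0) is 0x0140a490, the same addresses that the WinDbg expressions evaluate to.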
Use WinDbg to display the block that was about to be deleted.
Its Ntdll!_HEAP_ENTRY structure contains:
0:000> dt ntdll!_HEAP_ENTRY 0140a4b0-0n40
+0x000 Size : 0xb
+0x002 PreviousSize : 0xb
+0x000 SubSegmentCode : 0x000b000b
+0x004 SmallTagIndex : 0x11 ''
+0x005 Flags : 0x1 ''
+0x006 UnusedBytes : 0x8 ''
+0x007 SegmentIndex : 0x1 ''
So the heap block being deleted occupies 0xb * 8 = 88 bytes (Size is counted in 8-byte units).
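The same fields can be decoded by hand from the first eight bytes of the block. The sketch below assumes the _HEAP_ENTRY field layout exactly as `dt` prints it for this dump (other Windows versions lay the entry out differently). HeapEntryFields, decodeHeapEntry and blockBytes are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Fields of the 8-byte ntdll!_HEAP_ENTRY, in the order `dt` prints them
// for this dump.
struct HeapEntryFields {
    std::uint16_t size;          // block size in 8-byte units
    std::uint16_t previousSize;  // size of the previous block, same units
    std::uint8_t  smallTagIndex;
    std::uint8_t  flags;
    std::uint8_t  unusedBytes;
    std::uint8_t  segmentIndex;
};

// Heap allocation granularity assumed by the text (0xb * 8 = 88).
const std::size_t kHeapGranularity = 8;

// Decode the little-endian raw bytes of a heap entry.
inline HeapEntryFields decodeHeapEntry(const std::uint8_t (&raw)[8]) {
    HeapEntryFields e;
    e.size          = static_cast<std::uint16_t>(raw[0] | (raw[1] << 8));
    e.previousSize  = static_cast<std::uint16_t>(raw[2] | (raw[3] << 8));
    e.smallTagIndex = raw[4];
    e.flags         = raw[5];
    e.unusedBytes   = raw[6];
    e.segmentIndex  = raw[7];
    return e;
}

// Total size of the heap block in bytes.
inline std::size_t blockBytes(const HeapEntryFields& e) {
    return e.size * kHeapGranularity;
}

// The eight bytes at 0140a488, copied from the db output of this dump.
const std::uint8_t kDumpedEntry[8] = {0x0b, 0x00, 0x0b, 0x00,
                                      0x11, 0x01, 0x08, 0x01};
```

As a cross-check, 0140a488 plus 88 bytes is 0140a4e0, which is where the db output shows the next heap entry (09 00 0b 00 ...), and that entry's PreviousSize is again 0xb.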
0:000> dt _CrtMemBlockHeader 0140a4b0-0n32
MFC42D!_CrtMemBlockHeader
+0x000 pBlockHeaderNext : 0x0140a438 _CrtMemBlockHeader
+0x004 pBlockHeaderPrev : 0x0140a540 _CrtMemBlockHeader
+0x008 szFileName : (null)
+0x00c nLine : 0
+0x010 nDataSize : 0x20
+0x014 nBlockUse : 1
+0x018 lRequest : 929
+0x01c gap : [4] "???"
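nDataSize shows that the data area of the block being deleted is 0x20 = 32 bytes.
The four bytes immediately before and immediately after the data area are expected to be fdfdfdfd (the fence).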
0:000> db 0140a4b0-0n40 l88
0140a488 0b 00 0b 00 11 01 08 01-38 a4 40 01 40 a5 40 01 ........8.@.@.@.
0140a498 00 00 00 00 00 00 00 00-20 00 00 00 01 00 00 00 ........ .......
0140a4a8 a1 03 00 00 fd fd fd fd-fe 01 1e ac fe 01 1e ac ................
0140a4b8 fe 01 1e ac fe 01 1e ac-fe 01 1e ac fe 01 1e ac ................
0140a4c8 fe 01 1e ac fe 01 1e ac-fe 01 1e ac dd dd dd dd ................
0140a4d8 dd dd dd dd 00 00 00 00-09 00 0b 00 1c 01 08 01 ................
0140a4e8 b8 a5 40 01 98 a6 40 01-00 00 00 00 00 00 00 00 ..@...@.........
0140a4f8 1c 00 00 00 01 00 00 00-07 ae 02 00 fd fd fd fd ................
0140a508 30 ef 14 00 ff ff ff ff 0.......
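The memory dump above, however, shows that the fence behind the data area has been destroyed: the four bytes at 0140a4d0 are fe 01 1e ac instead of fd fd fd fd. The data area has overflowed, so the debug CRT naturally reports an error when this block is deleted. The error message at the time was: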
0:000> da 001008e8
001008e8 "Debug Error!..Program: C:\Visual"
00100908 "Field3\sample.exe..DAMAGE: afte"
00100928 "r Normal block (#929) at 0x0140A"
00100948 "4B0....(Press Retry to debug the"
00100968 " application)"
This matches the guess.
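The fence check performed by eye above can be sketched in a few lines. This is only an illustration of the idea, not the debug CRT's actual code: it assumes 4-byte fences filled with 0xFD, as seen in this dump, and fenceIntact, kDumped and the two wrapper functions are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

const std::uint8_t kFenceFill = 0xFD;  // fill byte of the fences in this dump
const std::size_t  kFenceSize = 4;     // each fence is four bytes
const std::size_t  kDataSize  = 0x20;  // nDataSize from _CrtMemBlockHeader

// True when all kFenceSize bytes starting at p still hold the fill value.
inline bool fenceIntact(const std::uint8_t* p) {
    for (std::size_t i = 0; i < kFenceSize; ++i) {
        if (p[i] != kFenceFill) return false;
    }
    return true;
}

// Bytes 0140a4ac..0140a4d3 copied from the db output: the leading fence,
// the 32-byte data area, and the four bytes where the trailing fence
// should be.
const std::uint8_t kDumped[] = {
    0xfd, 0xfd, 0xfd, 0xfd,                          // 0140a4ac leading fence
    0xfe, 0x01, 0x1e, 0xac, 0xfe, 0x01, 0x1e, 0xac,  // 0140a4b0
    0xfe, 0x01, 0x1e, 0xac, 0xfe, 0x01, 0x1e, 0xac,  // 0140a4b8
    0xfe, 0x01, 0x1e, 0xac, 0xfe, 0x01, 0x1e, 0xac,  // 0140a4c0
    0xfe, 0x01, 0x1e, 0xac, 0xfe, 0x01, 0x1e, 0xac,  // 0140a4c8
    0xfe, 0x01, 0x1e, 0xac                           // 0140a4d0 fence slot
};

inline bool leadingFenceIntact() {
    return fenceIntact(kDumped);
}

inline bool trailingFenceIntact() {
    return fenceIntact(kDumped + kFenceSize + kDataSize);
}
```

With the bytes from this dump the leading fence passes and the trailing one fails, which corresponds to the "DAMAGE: after Normal block" wording in the message.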
The key questions now are which object's memory overflowed, and how the overflow came about.
The first question is easy to answer:
0:000> kbn
# ChildEBP RetAddr Args to Child
00 00100350 77d193f5 77d3ea24 001711de 00000001 ntdll!KiFastSystemCallRet
01 00100388 77d2688a 00171186 001711de 00000001 user32!NtUserWaitMessage+0xc
02 001003b0 77d3b7c5 77d10000 00154728 001711de user32!InternalDialogBox+0xd0
03 00100670 77d3b12b 001007cc 00000000 ffffffff user32!SoftModalMessageBox+0x938
04 001007c0 77d65fdf 001007cc 00000028 001711de user32!MessageBoxWorker+0x2ba
05 00100818 77d66084 001711de 001545b0 001546d8 user32!MessageBoxTimeoutW+0x7a
06 0010084c 77d50598 001711de 001008e8 102509a0 user32!MessageBoxTimeoutA+0x9c
07 0010086c 77d50550 001711de 001008e8 102509a0 user32!MessageBoxExA+0x1b
08 00100888 102150e7 001711de 001008e8 102509a0 user32!MessageBoxA+0x45
WARNING: Stack unwind information not available. Following frames may be wrong.
09 001008a8 10215863 001008e8 102509a0 00012012 MSVCRTD!CrtMemDumpStatistics+0x187
0a 001019f4 10215556 00000001 00000000 00000000 MSVCRTD!CrtDbgReport+0x663
0b 00104a40 10213b27 00000001 00000000 00000000 MSVCRTD!CrtDbgReport+0x356
0c 00104a78 10213901 0140a4b0 00000001 00104b18 MSVCRTD!free_dbg+0x267
0d 00104aac 5f429a22 0140a4b0 00000001 00104b18 MSVCRTD!free_dbg+0x41
0e 00104abc 0040dcd6 0140a4b0 00104b80 0000000d MFC42D!operator delete+0xf [afxmem.cpp @ 351]
0f 00104b18 0040ce2a 0140a4b0 00000008 00104bdc sample!std::allocator<unsigned long>::deallocate+0x26 [c:\program files\microsoft visual studio\vc98\include\xmemory @ 64]
10 00104b80 0040cc39 014085b0 0010545c 01408abc sample!std::vector<unsigned long,std::allocator<unsigned long> >::operator=+0x14a [c:\program files\microsoft visual studio\vc98\include\vector @ 76]
11 00104bdc 0040c85a 0140859c 001055e8 001436f8 sample!Main_Wnd::_SVR_STATUS::operator=+0x69
12 0010546c 5f4317d6 00000001 001055e8 001436f8 sample!Main_Wnd::OnTimer+0x25a [D:\备份\vf\branches\3.1\SOURCECODE\svr_con_center\main_wnd.cpp @ 646]
13 00105568 5f4310b8 00000113 00000001 00000000 MFC42D!CWnd::OnWndMsg+0x6f4 [wincore.cpp @ 1840]
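The call stack makes it clear that the error occurred while a vector was releasing its element array inside vector::operator=. In other words, the vector's internal storage had been corrupted, and the failure surfaced when that storage was freed.
Frustratingly, the pdb for sample contains no information about user-defined types, because the pdbtype setting of the sample project is not con. The address of the faulty vector therefore cannot be found by inspecting local variables in frame 11.
Does the investigation have to stop here?
No. Look at frame 11, sample!Main_Wnd::_SVR_STATUS::operator=.
It is a member function of Main_Wnd::_SVR_STATUS. Member functions are called with the thiscall calling convention, whose most important feature is that the this pointer is passed to the callee in the ECX register.
Look at the start of its disassembly: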
sample!Main_Wnd::_SVR_STATUS::operator=:
0040cbd0 55 push ebp
0040cbd1 8bec mov ebp,esp
0040cbd3 83ec44 sub esp,44h
0040cbd6 53 push ebx
0040cbd7 56 push esi
0040cbd8 57 push edi
0040cbd9 51 push ecx
0040cbda 8d7dbc lea edi,[ebp-44h]
0040cbdd b911000000 mov ecx,11h
0040cbe2 b8cccccccc mov eax,0CCCCCCCCh
0040cbe7 f3ab rep stos dword ptr es:[edi]
0040cbe9 59 pop ecx
0040cbea 894dfc mov dword ptr [ebp-4],ecx
0040cbed 8b45fc mov eax,dword ptr [ebp-4]
0040cbf0 8b4d08 mov ecx,dword ptr [ebp+8]
The instruction at 0040cbea, mov dword ptr [ebp-4],ecx, shows that this is saved at address ebp-4.
From frame 11
11 00104bdc 0040c85a 0140859c 001055e8 001436f8 sample!Main_Wnd::_SVR_STATUS::operator=+0x69
we know that ebp for sample!Main_Wnd::_SVR_STATUS::operator= is 00104bdc,
so:
0:000> dd 00104bdc-4 l1
00104bd8 01408acc
The Main_Wnd::_SVR_STATUS instance is therefore at address 01408acc.
Its memory contents are:
0:000> dd 01408acc
01408acc 00000001 00000000 00000000 00000000
01408adc 00062aac cdcdcdcc 0140a4b0 0140a4d4
01408aec 0140a4d0 00000000 fdfdfdfd 00000000
01408afc 00000000 000f0007 010801e0 00000000
01408b0c 00000000 00000000 fedcbabc 00000008
01408b1c 00000003 00000000 fdfdfdfd 00098000
01408b2c 00000000 fdfdfdfd dddddddd 00070007
01408b3c 010801e7 01408a90 01408b78 0046d708
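In this dump, the values starting at 01408ae4 (0140a4b0 0140a4d4 ...) are the in-memory data of the vector member of this Main_Wnd::_SVR_STATUS instance.
From 0140a4b0 and 0140a4d4 we can tell that the vector holds (0140a4d4 - 0140a4b0) / 4 = 9 elements.
Next, display the vector's element values in detail: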
0:000> dd 0140a4b0 l0n10
0140a4b0 ac1e01fe ac1e01fe ac1e01fe ac1e01fe
0140a4c0 ac1e01fe ac1e01fe ac1e01fe ac1e01fe
0140a4d0 ac1e01fe dddddddd
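At first glance this data looks fine.
But recall the earlier finding that the data area starting at 0140a4b0 is only 32 bytes long. The dword at 0140a4d0 (the ninth element, ac1e01fe) should therefore be the fence fdfdfdfd, and the vector's storage can in fact hold only 8 elements.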
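The mismatch can be stated as a check on the vector's pointers. The sketch below is an assumption-laden illustration: it treats the three dwords 0140a4b0, 0140a4d4 and 0140a4d0 at 01408ae4 as the begin, end-of-elements and end-of-storage pointers of the vector. The first two readings come from the text. The third is an assumption that happens to agree with the 32-byte nDataSize. VectorSnapshot and the helper functions are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Three pointers describing a vector's storage, as read from the dump.
struct VectorSnapshot {
    std::uint32_t first;  // start of the element array
    std::uint32_t last;   // one past the last element in use
    std::uint32_t end;    // one past the allocated storage (assumed)
};

// Element size used in the text: (0140a4d4 - 0140a4b0) / 4.
const std::size_t kElemSize = 4;

// Number of elements the vector believes it holds.
inline std::size_t snapshotSize(const VectorSnapshot& v) {
    return (v.last - v.first) / kElemSize;
}

// Number of elements the allocated storage can actually hold.
inline std::size_t snapshotCapacity(const VectorSnapshot& v) {
    return (v.end - v.first) / kElemSize;
}

// A healthy vector always satisfies first <= last <= end.
inline bool snapshotConsistent(const VectorSnapshot& v) {
    return v.first <= v.last && v.last <= v.end;
}

// Values copied from the dd output at 01408ae4, 01408ae8 and 01408aec.
const VectorSnapshot kDumpedVector = {0x0140a4b0u, 0x0140a4d4u, 0x0140a4d0u};
```

Under that assumption the vector claims 9 elements in storage sized for 8, so the ninth element lands exactly on the trailing fence.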
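Xiao Ruan then went back to the code and found that multiple threads were calling resize on the vector while it was also being assigned to through operator[], which produces exactly this kind of error.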
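The source does not show how the bug was fixed. One common remedy for this kind of race is to serialize every access to the vector behind a single lock and to bounds-check indexed writes. The sketch below illustrates that idea in modern C++ (std::mutex requires C++11; the original VC6 code base would need a Win32 or MFC lock such as a critical section instead). GuardedStatus and its members are hypothetical names, not the original Main_Wnd::_SVR_STATUS.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// A vector of status values whose every access is serialized by one mutex,
// so a resize on one thread cannot interleave with a write on another.
class GuardedStatus {
public:
    // Change the number of elements under the lock.
    void resize(std::size_t n) {
        std::lock_guard<std::mutex> lock(mutex_);
        values_.resize(n);
    }

    // Write values_[index] only when index is in range.
    // Returns true if the write happened, false if it was rejected.
    bool set(std::size_t index, unsigned long value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (index >= values_.size()) return false;
        values_[index] = value;
        return true;
    }

    // Return a copy taken under the lock, safe to read without locking.
    std::vector<unsigned long> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return values_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<unsigned long> values_;
};
```

With this wrapper, a write to index 8 of an 8-element vector is rejected instead of landing on the fence, and a concurrent resize can no longer change the storage in the middle of a write.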
And with that, the whole mystery was solved.