This series is about frame pointer omission (FPO) optimization and how it impacts the debugging experience.
- Frame pointer omission (FPO) and consequences when debugging, part 1.
- Frame pointer omission (FPO) and consequences when debugging, part 2.
Last time, I outlined the basics as to just what FPO does, and what it means in terms of generated code when you compile programs with or without FPO enabled. This article builds on the last, and lays out just what the impacts of having FPO enabled (or disabled) are when you end up having to debug a program.
For the purposes of this article, consider the following example program with several do-nothing functions that shuffle stack arguments around and call each other. (For this example, I have disabled global optimizations and function inlining.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <intrin.h>   /* __debugbreak */

    __declspec(noinline) void f3( int* c, char* b, int a )
    {
        *c = a * 3 + (int)strlen(b);
        __debugbreak();   /* hardcoded breakpoint */
    }

    __declspec(noinline) int f2( char* b, int a )
    {
        int c;
        f3( &c, b + 1, a - 3);
        return c;
    }

    __declspec(noinline) int f1( int a, char* b )
    {
        int c;
        c = f2( b, a + 10);
        c ^= (int)rand();
        return c + 2 * a;
    }

    int __cdecl wmain( int ac, wchar_t** av )
    {
        int c;
        c = f1( (int)rand(), "test");
        printf("%d\n", c);
        return 0;
    }
If we run the program and break in to the debugger at the hardcoded breakpoint, with symbols loaded, everything is as one might expect:
    0:000> k
    ChildEBP RetAddr
    0012ff3c 010015ef TestApp!f3+0x19
    0012ff4c 010015fe TestApp!f2+0x15
    0012ff54 0100161b TestApp!f1+0x9
    0012ff5c 01001896 TestApp!wmain+0xe
    0012ffa0 77573833 TestApp!__tmainCRTStartup+0x10f
    0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
    0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23
Regardless of whether FPO optimization is turned on or off, since we have symbols loaded, we’ll get a reasonable call stack either way. The story is different, however, if we do not have symbols loaded. Looking at the same program, with FPO optimizations enabled and symbols not loaded, we get somewhat of a mess if we ask for a call stack:
    0:000> k
    ChildEBP RetAddr
    WARNING: Stack unwind information not available. Following frames may be wrong.
    0012ff4c 010015fe TestApp+0x15d8
    0012ffa0 77573833 TestApp+0x15fe
    0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
    0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23
Comparing the two call stacks, we lost three of the call frames entirely in the output. The only reason we got anything slightly reasonable at all is that WinDbg’s stack trace mechanism has some intelligent heuristics to guess the location of call frames in a stack where frame pointers are used.
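To give a feel for what this kind of guessing involves, here is a minimal sketch in portable C. It is not WinDbg’s actual heuristic (which this article does not describe); it is an assumed, much cruder stand-in that scans raw stack words and keeps any value that falls inside a known code range. The function name, the code range, and any sample values you feed it are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: scan `count` stack words (innermost first) and
 * copy every value that lies in [code_start, code_end) into `out`.
 * Anything in that range is treated as a *possible* return address;
 * the scan has no way of telling a real one from a stale code pointer.
 * Returns the number of candidates written (at most `max`).
 */
size_t scan_for_code_pointers(const uintptr_t *stack, size_t count,
                              uintptr_t code_start, uintptr_t code_end,
                              uintptr_t *out, size_t max)
{
    size_t n = 0;

    assert(stack != NULL && out != NULL);

    for (size_t i = 0; i < count && n < max; i++) {
        if (stack[i] >= code_start && stack[i] < code_end)
            out[n++] = stack[i];
    }
    return n;
}
```

If the scanned region happens to contain a local variable holding a function pointer, that value is reported right alongside the genuine return addresses, which is why a trace produced by guessing should be treated with some suspicion.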
If we look back at how call stacks are set up with frame pointers (from the previous article), a program trying to walk the stack on x86 without symbols works by treating the stack as a sort of linked list of call frames. Recall the layout of the stack when a frame pointer is used:

    [ebp-01] Last byte of the last local variable
    [ebp+00] Old ebp value
    [ebp+04] Return address
    [ebp+08] First argument...
This means that if we are trying to perform a stack walk without symbols, the way to go is to assume that ebp points to a “structure” that looks something like this:
    typedef struct _CALL_FRAME
    {
        struct _CALL_FRAME* Next;
        void* ReturnAddress;
    } CALL_FRAME, * PCALL_FRAME;
Note how this corresponds to the stack layout relative to ebp that I described above.
A very simple stack walk function designed to walk frames that are compiled with frame pointer usage might then look like so (using the _AddressOfReturnAddress intrinsic to find “ebp”, assuming that the old ebp is 4 bytes before the address of the return address):
    LONG StackwalkExceptionHandler( PEXCEPTION_POINTERS ExceptionPointers )
    {
        /* Only swallow access violations; anything else keeps propagating. */
        if (ExceptionPointers->ExceptionRecord->ExceptionCode == EXCEPTION_ACCESS_VIOLATION)
            return EXCEPTION_EXECUTE_HANDLER;

        return EXCEPTION_CONTINUE_SEARCH;
    }

    void stackwalk( void* ebp )
    {
        PCALL_FRAME frame = (PCALL_FRAME)ebp;

        printf("Trying ebp %p\n", ebp);

        __try
        {
            for (unsigned i = 0; i < 100; i++)
            {
                if ((ULONG_PTR)frame & 0x3)
                {
                    printf("Misaligned frame\n");
                    break;
                }

                printf("#%02u %p [@ %p]\n", i, frame, frame->ReturnAddress);

                frame = frame->Next;
            }
        }
        __except(StackwalkExceptionHandler( GetExceptionInformation()))
        {
            /* Walked off the end of the chain into unmapped memory. */
            printf("Caught exception\n");
        }
    }

    #pragma optimize("y", off)   /* make sure printstack itself keeps a frame pointer */
    __declspec(noinline) void printstack( )
    {
        void* ebp = (ULONG*)_AddressOfReturnAddress() - 1;

        stackwalk( ebp);
    }
    #pragma optimize("", on)
If we recompile the program, disable FPO optimizations, and insert a call to printstack inside the f3 function, the console output is something like so:
    Trying ebp 0012FEB0
    #00 0012FEB0 [@ 0100185C]
    #01 0012FED0 [@ 010018B4]
    #02 0012FEF8 [@ 0100190B]
    #03 0012FF2C [@ 01001965]
    #04 0012FF5C [@ 01001E5D]
    #05 0012FFA0 [@ 77573833]
    #06 0012FFAC [@ 7740A9BD]
    #07 0012FFEC [@ 00000000]
    Caught exception
In other words, without using any symbols, we have successfully performed a stack walk on x86.
However, this all breaks down when a function somewhere in the call stack does not use a frame pointer (i.e. was compiled with FPO optimizations enabled). In this case, the assumption that ebp always points to a CALL_FRAME structure is no longer valid, and the call stack is either cut short or is completely wrong (especially if the function in question repurposed ebp for some use other than as a frame pointer). Although it is possible to use heuristics to try to guess what is really a saved-ebp/return-address record on the stack, this is nothing more than an educated guess, and it tends to be at least slightly wrong (typically missing one or more frames entirely).
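The effect is easy to reproduce on a toy model. The following is a minimal sketch in portable C that simulates an x86-style stack as an array of words; it does not touch the real machine stack, and every name in it (SimStack, sim_call, sim_walk, and so on) is invented for this illustration. A simulated call pushes a return address and, only if the callee uses a frame pointer, also pushes the old "ebp" and points "ebp" at the new record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_WORDS 64
#define NO_FRAME  ((uintptr_t)-1)   /* marks the end of the frame chain */

/* Toy stack: indices stand in for addresses, and it grows downward. */
typedef struct {
    uintptr_t words[SIM_WORDS];
    uintptr_t sp;   /* index of the most recently pushed word */
    uintptr_t bp;   /* index of the current frame record, or NO_FRAME */
} SimStack;

void sim_init(SimStack *s)
{
    s->sp = SIM_WORDS;
    s->bp = NO_FRAME;
}

static void sim_push(SimStack *s, uintptr_t value)
{
    assert(s->sp > 0);
    s->words[--s->sp] = value;
}

/* Simulate "call" followed by the callee's prologue. */
void sim_call(SimStack *s, uintptr_t return_address, int uses_frame_pointer)
{
    sim_push(s, return_address);    /* call: push return address      */
    if (uses_frame_pointer) {
        sim_push(s, s->bp);         /* push ebp                       */
        s->bp = s->sp;              /* mov ebp, esp                   */
    }                               /* FPO callee: ebp left untouched */
}

/* Walk the frame chain exactly as a symbol-less walker would. */
size_t sim_walk(const SimStack *s, uintptr_t *out, size_t max)
{
    size_t n = 0;
    uintptr_t frame = s->bp;

    while (frame != NO_FRAME && frame + 1 < SIM_WORDS && n < max) {
        out[n++] = s->words[frame + 1];   /* [ebp+04]: return address */
        frame = s->words[frame];          /* [ebp+00]: old ebp        */
    }
    return n;
}
```

Simulating the three calls wmain -> f1 -> f2 -> f3 with the return addresses from the first listing (0100161b, 010015fe, 010015ef) and a frame pointer in every callee, sim_walk returns all three addresses, innermost first. If f2 alone is marked as not using a frame pointer, the walk returns only 010015ef and 0100161b: f3’s saved ebp points straight at f1’s record, so the return address into f1 that was pushed when f2 was called is silently skipped.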
Now, you might be wondering why you might care about doing stack walk operations without symbols. After all, you have symbols for the Microsoft binaries that your program will be calling (such as kernel32) available from the Microsoft symbol server, and you (presumably) have private symbols corresponding to your own program for use when you are debugging a problem.
Well, the answer to that is that you will end up needing to record stack traces without symbols in the course of normal debugging for a wide variety of problems. The reason for this is that there is a lot of support baked into NTDLL (and NTOSKRNL) to assist in debugging a class of particularly insidious problems: handle leaks (and other problems where the wrong handle value is getting closed somewhere and you need to find out why), memory leaks, and heap corruption.
These (very useful!) debugging features offer options that allow you to configure the system to log a stack trace on each heap allocation, heap free, or each time a handle is opened or closed. Now the way these features work is that they will capture the stack trace in real time as the heap operation or handle operation happens, but instead of trying to break into the debugger to display the results of this output (which is undesirable for a number of reasons), they save a copy of the current stack trace in-memory and then continue execution normally. To display these saved stack traces, the !htrace, !heap -p, and !avrf commands have functionality that locates these saved traces in-memory and prints them out to the debugger for you to inspect.
However, NTDLL/NTOSKRNL needs a way to create these stack traces in the first place, so that it can save them for later inspection. There are a couple of requirements here:
- The functionality to capture stack traces must not rely on anything layered above NTDLL or NTOSKRNL. This already means that anything as complicated as downloading and loading symbols via DbgHelp is instantly out of the picture, as those functions are layered far above NTDLL / NTOSKRNL (and indeed, they must make calls into the same functions that would be logging stack traces in the first place in order to find symbols).
- The functionality must work when symbols for everything on the call stack are not even available to the local machine. For instance, these pieces of functionality must be deployable on a customer computer without giving that computer access to your private symbols in some fashion. As a result, even if there were a good way to locate symbols at the point where the stack trace is being captured (which there isn’t), the symbols would not be there to find.
- The functionality must work in kernel mode (for saving handle traces), as handle tracing is partially managed by the kernel itself and not just NTDLL.
- The functionality must use a minimum amount of memory to store each stack trace, as operations like heap allocation, heap deallocation, handle creation, and handle closure are extremely frequent operations throughout the lifetime of the process. As a result, options like just saving the entire thread stack for later inspection when symbols are available cannot be used, since that would be prohibitively expensive in terms of memory usage for each saved stack trace.
Given all of these restrictions, the code responsible for saving stack traces needs to operate without symbols, and it must furthermore be able to save stack traces in a very concise manner (without using a great deal of memory for each trace).
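To make the memory argument concrete, here is a minimal sketch in portable C of one way a compact trace store could look. This is an assumption for illustration only and not a description of how NTDLL or NTOSKRNL actually store their traces: each trace is reduced to a short, fixed-depth array of return addresses, and identical traces are recorded once and merely counted afterwards. All names and sizes (TraceDb, trace_db_log, TRACE_MAX_DEPTH, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRACE_MAX_DEPTH 16    /* return addresses kept per trace */
#define TRACE_DB_SLOTS  256   /* capacity of the toy database    */

typedef struct {
    uint32_t  hash;
    uint32_t  hits;           /* 0 means the slot is empty */
    uint16_t  depth;
    uintptr_t addrs[TRACE_MAX_DEPTH];
} TraceRecord;

typedef struct {
    TraceRecord slots[TRACE_DB_SLOTS];
    size_t      used;
} TraceDb;

void trace_db_init(TraceDb *db)
{
    memset(db, 0, sizeof *db);
}

/* FNV-1a over the bytes of each address; any reasonable hash would do. */
static uint32_t trace_hash(const uintptr_t *addrs, size_t depth)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < depth; i++) {
        uintptr_t a = addrs[i];
        for (size_t b = 0; b < sizeof a; b++) {
            h ^= (uint32_t)(a & 0xffu);
            h *= 16777619u;
            a >>= 8;
        }
    }
    return h;
}

/*
 * Record one trace (innermost frame first). Returns the slot index of
 * the stored trace, or -1 if the database is full. A trace that was
 * seen before reuses its slot and only bumps the hit counter.
 */
int trace_db_log(TraceDb *db, const uintptr_t *addrs, size_t depth)
{
    assert(db != NULL && addrs != NULL);

    if (depth > TRACE_MAX_DEPTH)
        depth = TRACE_MAX_DEPTH;          /* keep the innermost frames */

    uint32_t h = trace_hash(addrs, depth);

    for (size_t probe = 0; probe < TRACE_DB_SLOTS; probe++) {
        size_t i = (h + probe) % TRACE_DB_SLOTS;   /* linear probing */
        TraceRecord *r = &db->slots[i];

        if (r->hits == 0) {               /* empty slot: store new trace */
            r->hash = h;
            r->depth = (uint16_t)depth;
            memcpy(r->addrs, addrs, depth * sizeof addrs[0]);
            r->hits = 1;
            db->used++;
            return (int)i;
        }
        if (r->hash == h && r->depth == depth &&
            memcmp(r->addrs, addrs, depth * sizeof addrs[0]) == 0) {
            r->hits++;                    /* same call path seen again */
            return (int)i;
        }
    }
    return -1;
}
```

With 4-byte pointers, sixteen saved return addresses come to 64 bytes per distinct call path, and a path that is hit thousands of times still occupies a single record. Producing that short list of return addresses cheaply, without symbols, is the part that depends on being able to walk the stack.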
As a result, on x86, the stack trace saving code in NTDLL and NTOSKRNL assumes that all functions on the call stack use frame pointers. This is the only realistic option for saving stack traces on x86 without symbols, as there is insufficient information baked into each individual compiled binary to reliably perform stack traces without assuming the use of a frame pointer in each function. (The 64-bit platforms that Windows supports solve this problem with the use of extensive unwind metadata, as I have covered in a number of past articles.)
So, pageheap’s stack trace logging and handle tracing are how stack traces without symbols end up mattering to you, the developer with symbols for all of your binaries, when you are trying to debug a problem. If you make sure to disable FPO optimization on all of your code (with the Microsoft compiler, that means building with /Oy-), then you’ll be able to use tools like pageheap’s stack tracing on heap operations, UMDH (the user mode heap debugger), and handle tracing to track down heap-related and handle-related problems. The best part of these features is that you can even deploy them on a customer site without having to install a full debugger (or run your program under a debugger), only later taking a minidump of your process for examination in the lab. All of them rely on FPO optimizations being disabled (at least on x86), though, so remember to turn FPO optimizations off on your release builds for the increased debuggability of these tough-to-find problems in the field.