When the EXE calls one of the redirected functions, control goes to the CALL instruction in the stub. The DelayLoadProfileDLL_UpdateCount routine in DelayLoadProfileDLL.CPP simply increments the value of the count field of the stub. After that CALL returns, the JMP instruction transfers control to the original address that was stored in the IAT before I bashed it. Figure 2 shows the big picture after the IAT has been redirected to the stubs.
Assembler junkies might be wondering how the DelayLoadProfileDLL_UpdateCount function knows where the stub's count field is in memory. A quick look at the code shows that DelayLoadProfileDLL_UpdateCount finds the return address pushed on the stack by the CALL instruction. The return address points to the JMP XXXXXXXX instruction following the call. Since the CALL instruction is always five bytes, some pointer arithmetic yields the stub's starting address and easy access to the stub's count field.
I had one problem using the DelayLoadProfileDLL_UpdateCount code that's worth mentioning. Originally, the function didn't have the PUSHAD and POPAD instructions to save and restore all of the regular CPU registers. The code worked fine on many programs, but just blew up on others. Finally, I narrowed it down to programs that imported __CxxFrameHandler and _EH_prolog from MSVCRT.DLL. Both of these APIs expect the EAX register to be set to a given value, and DelayLoadProfileDLL_UpdateCount was trashing EAX.
Since the trashed EAX was the problem, I added PUSHAD and POPAD. Alas, the problem remained. In frustration, I examined the compiler-generated code, and then smacked my forehead. Normally when generating code for a debug build, the Visual C++ 6.0 compiler inserts code in the function prolog to set all local variables to the value 0xCC. This code was trashing EAX before my PUSHAD got a chance to execute. To get around this, I had to remove the /GZ option from the debug build settings for DelayLoadProfileDLL.
Reporting Results
As your process shuts down, the system sends the DLL_ PROCESS_DETACH notification to all loaded DLLs. DelayLoadProfileDLL uses this opportunity to harvest the information collected during the run. In a nutshell, this means scanning through all the stub arrays, counting the number of calls that were made through the stubs, and reporting what it finds.
During the setup phase when DelayLoadProfileDLL was redirecting the IATs, it stashed away the address of the EXE's IAT into a global variable (g_pFirstImportDesc). At shutdown time, ReportProfileResults uses this pointer to walk through the imports section again. For each imported DLL, it retrieves the address of the DLL's first IAT entry. If this is an IAT that I've redirected, the first pointer in the IAT should point to the first of the DLPD_IAT_STUB stubs allocated for that DLL. Of course, the code does some sanity checking to ensure that this is the case. If something doesn't look right, DelayLoadProfileDLL ignores that particular imported DLL.
Generally though, everything looks fine, and the first IAT entry points to my stubs. The code then iterates through all the stubs for the DLL. At each stub, the value of the stub's count field is added to a running total for the DLL. When the iteration completes, ReportProfileResults formats a string with the name of the DLL and how many calls were made through the stubs. The code uses OutputDebugString to broadcast its findings.
Loading and Injection
The program that loads your EXE and injects DelayLoadProfileDLL.DLL is called—you guessed it—DelayLoadProfile.EXE (the source code is available from the MSJ Web site at http://www.microsoft.com/msj). This code mainly drives the CDebugInjector class, which I'll describe shortly. Function main obtains the target EXE's command line and passes it to CDebugInjector::LoadProcess. If the process is created successfully, function main tells CDebugInjector which DLL it wants injected. In this case, it's DelayLoadProfileDLL.DLL, which should be located in the same directory as DelayLoadProfile.EXE.
The last step before letting the target run wild is to call CDebugInjector::SetOutputDebugStringCallback. When DelayLoadProfileDLL reports its results via OutputDebugString, CDebugInjector sees them and passes them to the callback you registered. This callback just printfs the strings to the console. Finally, function main calls CDebugInjector::Run. This call lets the target process begin and, when the time is right, injects the DLL into it.
Figure 3 shows The CDebugInjector class. This is where all the good stuff happens. CDebugInjector::LoadProcess creates the specified process as a debugee process. The ramifications of running as a debugee process have been discussed in many articles and in the MSDN documentation, so I won't go into all the details here.
For the purposes of this column, it's sufficient to say that the debugger process (in this case, DelayLoadProfile) has to enter a loop that calls WaitForDebugEvent and ContinueDebugEvent until the debugee terminates. Every time WaitForDebugEvent returns, something has happened in the debugee. This might be an exception (including break—points), a DLL load, a thread creation, or other event. The WaitForDebugEvent documentation covers all the events that might occur. The CDebugInjector::Run method contains the code for this loop.
So how does running the target process as a debugee help you inject a DLL? A debugger process has excellent control over the debugee process's execution. Every time a significant event occurs in the debugee, it is suspended until the debugger calls ContinueDebugEvent. Knowing this, a debugger process can add code to the debugee's address space and temporarily change the debugee's registers so that the added code executes.
In more specific terms, CDebugInjector synthesizes a small code stub that calls LoadLibrary. The DLL name parameter to LoadLibrary points to the name of the DLL to inject. CDebugInjector writes the stub (and the associated DLL name) to the debugee's address space. It then calls SetThreadContext to change the debugee's instruction pointer (EIP) to execute the LoadLibrary stub. All of this dirty work occurs within the CDebugInjector::PlaceInjectionStub method.
Immediately following the LoadLibrary call in the stub is a breakpoint instruction (INT 3). This stops the debugee and gives control back to the debugger process. The debugger then uses SetThreadContext again to restore the instruction pointer and other registers to their original values. Another call to ContinueDebugEvent and the debugee is on its way with the DLL injected, none the wiser that anything has happened.
If you don't think too hard, this injection process doesn't sound too messy. Nonetheless, a few interesting problems crop up that complicate things. For example, when is the proper time to create the stub code and redirect control to it? You can't do this immediately after the CreateProcess call because, among other reasons, the imported DLLs haven't been mapped into memory at this point and the EXE's IAT hasn't been fixed up by the Win32 loader. In other words, it's too early.
The solution I ultimately decided on was to let the debugee run until it encounters its first breakpoint. Then I set a breakpoint of my own at the entry point of the EXE. When this second breakpoint triggers, CDebugInjector knows that DLLs in the target process (including KERNEL32.DLL) have initialized, but no code in the EXE has run. This is the perfect time for injecting DelayLoadProfileDLL.DLL.
Incidentally, where does the first breakpoint come from? By definition, a Win32 process that's being debugged calls DebugBreak (also known as INT 3) very early in its execution. In my ancient APISPY32 code, I used the initial DebugBreak as the occasion to do the injection. Unfortunately in Windows 2000, this DebugBreak occurs before KERNEL32.DLL is initialized. Thus, CDebugInjector sets its own breakpoint to go off when the EXE is about to get control, and thus knows that KERNEL32.DLL has been initialized.
Earlier, I mentioned a breakpoint that occurs after the LoadLibrary call returns. This is a third breakpoint for CDebugInjector to handle. All of the mechanics for handling the different breakpoints can be seen in CDebugInjector::HandleException.
Another interesting problem to address with DLL injection is where to write the LoadLibrary stub. Under Windows NT 4.0 and later you can allocate space in another process with VirtualAllocEx, so I took that route. That leaves out Windows 9x, which doesn't support VirtualAllocEx. For this scenario, I took advantage of a unique property of Windows 9x memory-mapped files. These files are visible in all address spaces, and at the same address. I simply create a small memory-mapped file using the system page file as backing, and blast the LoadLibrary stub into it. The stub is implicitly accessible in the debugee process. For the details, see the code listing for CDebugInjector::GetMemoryForLoadLibraryStub at the link at the top of this article.
Using DelayLoadProfile
DelayLoadProfile is a command-line program that writes its results to standard output. From a command prompt, run DelayLoadProfile, specifying the target program and any arguments it needs, such as: |