Using DelayLoad to Optimize Application Performance. Intercepting APIs.

Latest recommended article published on 2023-11-27 19:52:03

sanjiang

Tags: optimization, dll, exe, function, shell, windows

Article link: https://blog.csdn.net/sanjiang/article/details/5114

Translation of <Under the Hood>, by Matt Pietrek

Source: http://www.microsoft.com/msj/0200/hood/hood0200.asp

Using DelayLoad to Optimize Application Performance. Intercepting APIs.

               -- Chinese translation by snake (http://snake12.top263.net)

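  In the December 1998 issue of MSJ, Jeffrey Richter and I each wrote a column about the DelayLoad feature of the Visual C++ 6.0 linker. That we both picked the topic shows how cool the feature is. Unfortunately, many people still know nothing about DelayLoad, or assume it is available only in the latest version of Windows NT.
  To start, let me repeat: DelayLoad is not an operating system feature, and it works on any Win32 system. I will demonstrate this with a small utility, DelayLoadProfile, which shows whether a program can benefit from DelayLoad. Many programs can.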


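A Quick Review:
  Normally, when you call a function in a DLL, the linker adds information about the DLL and the function to your executable. Collectively, the information for all imported functions is known as the imports section.
  When the program is loaded, the Win32 loader scans the imports section and loads each DLL. For each one it locates the addresses of the imported functions and writes them into the Import Address Table (IAT). Put simply, the IAT is a table of function pointers, and a call to an imported function goes through the corresponding IAT entry.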
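  So how does DelayLoad work? When you specify DelayLoad for a DLL, the linker does not emit the usual data into the imports section. Instead, it generates a small stub for each DelayLoad imported function, which records the name of the DLL and the function. The first time the function is called, the stub calls LoadLibrary to load the DLL and then GetProcAddress to get the function's address. Finally, it overwrites its own pointer so that later calls go directly to the target.
  That is a simplified description. In reality the stub is a small piece of code that calls a routine statically linked into the executable. This routine lives in DELAYIMP.LIB, which must be on the linker's library list. The code is also smart enough to call LoadLibrary only the first time a function in the DLL is used. Later calls to other functions in the same DLL do not call LoadLibrary again.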
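  The bind-on-first-call idea can be sketched portably, without the Win32 loader. This is only a simulation under stated assumptions: `g_slot` stands in for one patched pointer, `ResolveTarget` stands in for the LoadLibrary plus GetProcAddress pair, and every name here is hypothetical rather than taken from DELAYIMP.LIB.

```cpp
#include <cassert>

// Simulated export: the function the "DLL" would provide.
static int RealAdd(int a, int b) { return a + b; }

// Counts how often the expensive resolution step runs. It is a
// hypothetical stand-in for LoadLibrary + GetProcAddress.
static int g_resolveCount = 0;

using AddFn = int (*)(int, int);

static AddFn ResolveTarget() {
    ++g_resolveCount;
    return &RealAdd;
}

static int AddStub(int a, int b);

// Simulated import slot: it starts out pointing at the stub.
static AddFn g_slot = &AddStub;

// The first call lands here: resolve, patch the slot, forward the call.
static int AddStub(int a, int b) {
    g_slot = ResolveTarget();
    assert(g_slot != &AddStub);  // the slot must now bypass the stub
    return g_slot(a, b);
}

// Calling code always goes through the slot.
int CallAdd(int a, int b) { return g_slot(a, b); }

// Exposes the resolution counter for inspection.
int ResolveCount() { return g_resolveCount; }
```

  Calling `CallAdd` twice resolves the target only once: the second call goes straight to `RealAdd` through the patched slot, which is the property that keeps the cost of DelayLoad small.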
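  Compared with importing the DLL the usual way, DelayLoad adds little time or space overhead. Calling LoadLibrary is only slightly less efficient than letting the Win32 loader load the DLL. Likewise, calling GetProcAddress once for each DelayLoad imported function is only slightly slower than having the Win32 loader locate the imported functions at startup.
  The benefits of DelayLoad easily outweigh these costs. For example, if your program never calls a function in a DelayLoad imported DLL, the DLL is never loaded in the first place. This happens more often than you might think. Suppose your program contains printing code. With a normal import, the program must load WINSPOOL.DRV even if the user never prints. With DelayLoad, you avoid loading and initializing WINSPOOL.DRV in that case.
  Another benefit is that DelayLoad lets you avoid calling APIs that do not exist on some of your target platforms. For example, suppose your program wants to call AnimateWindow, which exists in Windows 2000 and Windows 98 but not in Windows 95 or Windows NT 4.0. If you import AnimateWindow the usual way, your program will not load on the earlier platforms. With DelayLoad, you can check the operating system at run time and call AnimateWindow only where it is supported, without rewriting your code to use LoadLibrary and GetProcAddress.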
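  DelayLoad is easy to use. Once you have decided which DLLs to DelayLoad, add /DELAYLOAD:DLLNAME to the linker options, where DLLNAME is the file name of the DLL. You also need to add DELAYIMP.LIB to the library list, and you still need the original import library, for example SHELL32.LIB. Putting it all together, the linker line for SHELL32.DLL contains the following: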
 SHELL32.LIB /DELAYLOAD:SHELL32.DLL DELAYIMP.LIB
  Unfortunately, the Visual Studio 6.0 IDE does not offer a simple way to specify DelayLoad for a DLL, so you must add the /DELAYLOAD:XXX option by hand under "Project Settings" -> "Link" -> "Project Options".
  
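When to Use DelayLoad:
  In a small project it is easy to list the DLLs that are good DelayLoad candidates. As a project grows and more developers join, however, it is just as easy to lose track of which DLLs are used where. I usually rely on Depends.exe from the Platform SDK. A DLL from which only a few functions are imported is a good place to start.
  I wanted a simple, automated way to track this, so I wrote the DelayLoadProfile program. It runs your EXE and monitors the calls the EXE makes into DLLs until the EXE exits. It then prints a summary of which DLLs were used and how many calls were made to each one.
  Let me emphasize that DelayLoadProfile works only against the EXE. Extending it to cover every DLL your program depends on would complicate it considerably. DelayLoadProfile only hints at which DLLs might suit /DELAYLOAD. When in doubt, keep the normal import.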

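DelayLoadProfile: The Big Picture
  The idea behind DelayLoadProfile is simple: redirect the function pointers in the EXE's IAT to a stub. The stub records that the imported function was called and then jumps to the address that the Win32 loader originally stored in the IAT. The hard part is the implementation.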
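  The redirect-and-count idea can be illustrated without touching a real IAT. The sketch below is a portable simulation, not the column's code: the table of `std::function` objects plays the role of the IAT, and `CountingStub` and `RedirectTable` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using ImportFn = std::function<int(int)>;

// One simulated stub: remembers the original target and counts calls.
struct CountingStub {
    ImportFn original;
    unsigned count = 0;
};

// Replace every table entry with a wrapper that bumps its stub's counter
// and then forwards to the original target. `stubs` must outlive `table`
// and must not be resized afterwards, because the wrappers keep pointers
// into it.
void RedirectTable(std::vector<ImportFn>& table,
                   std::vector<CountingStub>& stubs) {
    stubs.assign(table.size(), CountingStub{});
    for (std::size_t i = 0; i < table.size(); ++i) {
        stubs[i].original = table[i];      // save the target first
        CountingStub* stub = &stubs[i];
        table[i] = [stub](int arg) {       // then patch the table entry
            ++stub->count;
            return stub->original(arg);
        };
    }
}
```

  Callers keep calling through `table` exactly as before and see the same results, while `stubs[i].count` records how often entry `i` was used.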
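  First, you must decide where the code that patches the EXE's IAT entries to point at the stubs will run. One option is to do everything out of process, which avoids having to get your code into the target EXE's process. The drawback is that walking all the data structures to locate and patch the IAT, and gathering the results later, takes a great many ReadProcessMemory calls.
  The other approach is to do the hard work inside the target EXE's process. There it is almost trivial to walk the data structures, build the stubs, redirect the IAT entries, and summarize the results at the end. However, working in-process means that some of the DelayLoadProfile code must be loaded into the target process while it runs. This is the approach I chose.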
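  Having decided to run my code in the target process, the next question was how to get it there. One choice was to ask users to link with a DelayLoadProfile library. That would force them to change their source code, projects, or makefiles, so I rejected it. I needed a fully automatic method.
  That led me to a loader program that runs the target and inserts my DelayLoadProfile DLL into it. One injection technique is to use CreateRemoteThread to start a thread in the target process that calls LoadLibrary. I rejected it because Windows 9x does not provide CreateRemoteThread.
  Longtime MSJ readers may remember APISPY32, a program I wrote five years ago. It loads a process and injects a DLL to log API calls, which is similar to what DelayLoadProfile needs. However, on Windows 2000 it failed to load the DLL. I decided it was time to revisit that code and fix the problem.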
  
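Into the Trenches:
  To recap, DelayLoadProfile has two parts. The first is a loader process that injects a DLL into your program's address space. The DLL then scans the EXE's IAT and redirects the entries to stubs that the DLL creates. When your program exits, the injected DLL scans the stubs and reports how many calls were made to each imported DLL. If you have used the APIMON utility, you will recognize the technique.
  The DLL that does all the work of monitoring a program's imports is called DelayLoadProfileDLL (see Figure 1). It uses the DLL_PROCESS_ATTACH and DLL_PROCESS_DETACH notifications to start its two main phases of work.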
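  When DllMain receives the DLL_PROCESS_ATTACH notification, DelayLoadProfileDLL calls PrepareToProfile. Inside PrepareToProfile, the code locates the target EXE's IAT. For each imported DLL it finds, it calls IsModuleOKToHook to check whether the IAT can be redirected safely. In most cases it can, so PrepareToProfile then calls RedirectIAT.
  RedirectIAT is the complicated part, and it helps a great deal to understand the import-related data structures in WINNT.H. First, the function locates the IAT and the associated Import Names Table. It then counts the IAT entries by scanning the IAT until it finds a NULL pointer. With that count, it creates an array of DLPD_IAT_STUB stubs, one for each IAT entry.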
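  The counting step described above is a scan of a zero-terminated array. A minimal sketch follows, assuming a simulated IAT held in an ordinary array. `IatEntry`, `SimStub`, `CountIatEntries`, and `BuildStubs` are hypothetical names, and `SimStub` keeps only data fields rather than the real DLPD_IAT_STUB layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulated IAT entry: on Win32 this would be a pointer-sized slot.
using IatEntry = std::uintptr_t;

// Hypothetical, simplified stand-in for DLPD_IAT_STUB (data fields only).
struct SimStub {
    IatEntry originalTarget = 0;
    std::uint32_t count = 0;
};

// Count entries up to, but not including, the terminating zero entry.
std::size_t CountIatEntries(const IatEntry* iat) {
    std::size_t n = 0;
    while (iat[n] != 0) {
        ++n;
    }
    return n;
}

// Allocate one stub per entry and remember each original target.
std::vector<SimStub> BuildStubs(const IatEntry* iat) {
    std::vector<SimStub> stubs(CountIatEntries(iat));
    for (std::size_t i = 0; i < stubs.size(); ++i) {
        stubs[i].originalTarget = iat[i];
    }
    return stubs;
}
```

  Allocating all the stubs for a DLL as one contiguous array is what later allows the report phase to walk them starting from the first IAT entry.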
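  Finally, the code makes another pass through the IAT. It takes the address in each IAT entry, stores it in the JMP instruction of the corresponding stub, and replaces the IAT entry with the address of the stub. As it moves to each following IAT entry, it also moves to the next DLPD_IAT_STUB stub. I explain the stubs in more detail below.
  Two points about redirecting the IAT entries are worth mentioning. First, the IAT is often placed in a read-only section of the EXE, and writing to it would normally cause an access violation. Fortunately, VirtualProtect lets you change the protection of a target address, so the code makes the IAT read/write before patching it and restores the original protection afterwards.
  The second point is data imports. Although programmers rarely do so, it is easy to import data as well as code, and the Visual C++ runtime DLL (MSVCRT.DLL) has data exports. Redirecting an IAT entry that refers to data causes problems.
  So how do you tell that an IAT entry refers to data? A commercial product should use an accurate algorithm to determine the import type. I took a shortcut and used IsBadWritePtr: if the pointer in the IAT refers to writable memory, it is probably data, and if the memory is read-only, it is probably code. Is this a perfect test? No, but it is good enough for DelayLoadProfile.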
  Now look at the stubs. The DLPD_IAT_STUB structure defined in DelayLoadProfileDLL.h contains both code and data. Simplified, it looks like this:
  CALL DelayLoadProfileDLL_UpdateCount
  JMP xxxxxxxx // original IAT address
  DWORD count
  DWORD pszNameOrOrdinal

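  When the EXE calls one of the redirected functions, control goes to the CALL instruction in the stub, which calls the DelayLoadProfileDLL_UpdateCount function in DelayLoadProfileDLL.CPP to increment the stub's count field. When the CALL returns, the JMP instruction transfers control to the address originally stored in the IAT. Figure 2 shows the resulting structure.
  Assembly experts may wonder how DelayLoadProfileDLL_UpdateCount finds the address of the stub's count field. A quick look at the code shows that it reads the return address from the stack. The return address points at the JMP xxxxxxxx instruction, and because the CALL instruction is always five bytes long, simple pointer arithmetic gives the start of the stub and therefore the address of the count field.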
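  The pointer arithmetic can be made concrete with a packed structure. The layout below is an assumption modelled on the simplified listing: it supposes that both instructions use the 5-byte rel32 encodings (opcode 0xE8 for CALL, 0xE9 for JMP), and the real DLPD_IAT_STUB in DelayLoadProfileDLL.H may differ. `SimIatStub`, `StubFromReturnAddress`, and `Rel32` are hypothetical names, and nothing here executes the bytes as code.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)
// Hypothetical byte layout of one stub: two instructions, then data.
struct SimIatStub {
    std::uint8_t  callOpcode;     // 0xE8: CALL rel32
    std::int32_t  callRel;        // displacement to the update routine
    std::uint8_t  jmpOpcode;      // 0xE9: JMP rel32
    std::int32_t  jmpRel;         // displacement to the original target
    std::uint32_t count;          // calls made through this stub
    std::uint32_t nameOrOrdinal;  // 32-bit placeholder for the name field
};
#pragma pack(pop)

static_assert(offsetof(SimIatStub, jmpOpcode) == 5,
              "CALL rel32 occupies five bytes");

// The return address pushed by CALL points at the JMP, five bytes past
// the start of the stub, so subtracting that offset recovers the stub.
SimIatStub* StubFromReturnAddress(void* returnAddress) {
    auto* p = static_cast<std::uint8_t*>(returnAddress);
    return reinterpret_cast<SimIatStub*>(p - offsetof(SimIatStub, jmpOpcode));
}

// A rel32 displacement is measured from the end of the 5-byte instruction.
std::int32_t Rel32(std::uint32_t instructionAddress, std::uint32_t target) {
    return static_cast<std::int32_t>(target - (instructionAddress + 5));
}
```

  Given the return address, `StubFromReturnAddress(ret)->count` is the field to increment. `Rel32` shows how the displacement bytes of either instruction would be computed when a stub is filled in.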
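  One problem is worth mentioning. Originally, DelayLoadProfileDLL_UpdateCount did not use the PUSHAD and POPAD instructions to save and restore the CPU registers. The code worked on many programs but failed on others. I eventually traced the failures to programs that import __CxxFrameHandler and _EH_prolog from MSVCRT.DLL. Both functions expect the EAX register to hold a particular value, and DelayLoadProfileDLL_UpdateCount was changing EAX.
  Since EAX was the problem, I added PUSHAD and POPAD. To my dismay, the problem remained. In frustration, I examined the generated assembly. In a debug build, the Visual C++ 6.0 compiler normally inserts code that initializes all local variables to 0xCC, and this code changed EAX before PUSHAD ran. I had to remove the /GZ option.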


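Reporting Results:
  When your process shuts down, the system sends a DLL_PROCESS_DETACH notification to every loaded DLL. DelayLoadProfileDLL uses this opportunity to gather the results collected during the run. That means walking all the stubs again, collecting the counts, and printing them.
  During setup, while it redirects the IATs, DelayLoadProfileDLL saves the address of the EXE's IAT in a global variable, g_pFirstImportDesc. At shutdown, ReportProfileResults uses this pointer to walk the imports section again. If a DLL's IAT was redirected, its first IAT entry should point to the first DLPD_IAT_STUB stub allocated for that DLL. The code does some sanity checking, and if something looks wrong, DelayLoadProfileDLL ignores that particular DLL.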
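  Usually everything checks out and the first IAT entry points to my stubs. For each DLL, the code then iterates through all of its stubs and adds each stub's count field to a running total for that DLL. When the iteration finishes, ReportProfileResults formats a string with the DLL's name and the total number of calls, and the code broadcasts it with OutputDebugString.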
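  The tallying step is a plain sum followed by string formatting. The sketch below assumes the per-stub counts have already been copied into a vector. `TotalCalls` and `FormatReportLine` are hypothetical names, and the line format is taken from the sample output in the "Using DelayLoadProfile" section.

```cpp
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Sum the per-stub call counts recorded for one imported DLL.
std::uint64_t TotalCalls(const std::vector<std::uint32_t>& stubCounts) {
    return std::accumulate(stubCounts.begin(), stubCounts.end(),
                           std::uint64_t{0});
}

// Build one report line in the format shown in the sample output.
std::string FormatReportLine(const std::string& dllName,
                             const std::vector<std::uint32_t>& stubCounts) {
    return "DelayLoadProfile: " + dllName + " was called " +
           std::to_string(TotalCalls(stubCounts)) + " times";
}
```

  A DLL whose stubs all hold zero yields a "was called 0 times" line, which is the signal the rest of the article uses to pick DelayLoad candidates.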

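Loading and Injection:
  The program that loads your EXE and injects DelayLoadProfileDLL.dll is called (you guessed it) DelayLoadProfile.exe. The source is available from the MSJ Web site, http://www.microsoft.com/msj. This code mainly drives the CDebugInjector class, which I describe shortly. Function main takes the target EXE's command line and passes it to CDebugInjector::LoadProcess. If the process is created successfully, main tells CDebugInjector which DLL to inject. Here that is DelayLoadProfileDLL.DLL, which should be in the same directory as DelayLoadProfile.exe.
  The last step before running the target is to call CDebugInjector::SetOutputDebugStringCallback. When DelayLoadProfileDLL reports its results with OutputDebugString, CDebugInjector sees the strings and passes them to the callback you registered. This callback simply uses printf to write the strings to the console. Finally, main calls CDebugInjector::Run, which starts the target process and injects the DLL when the time is right.
  Figure 3 (hoodtextfigs.htm#fig3) shows the CDebugInjector class, which is where the real work happens. CDebugInjector::LoadProcess creates the target process as a debuggee. The consequences of running as a debuggee are covered in many MSDN documents, so I will not go into detail here.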
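  The debugger process (here, DelayLoadProfile) then enters a loop that calls WaitForDebugEvent and ContinueDebugEvent until the debuggee terminates. Each time WaitForDebugEvent returns, something has happened in the debuggee: an exception (including a breakpoint), a DLL load, a thread creation, or some other event. The WaitForDebugEvent documentation lists all the possible events. The CDebugInjector::Run method contains the code for this loop.
  So how does running the target as a debuggee help you inject a DLL? A debugger controls the execution of its debuggee. Each time a significant event occurs in the debuggee, it is suspended until the debugger calls ContinueDebugEvent. Knowing this, a debugger can add code to the debuggee's address space and temporarily change the debuggee's registers so that the added code runs.
  More specifically, CDebugInjector synthesizes a small code stub that calls LoadLibrary. The DLL name parameter to LoadLibrary points to the name of the DLL to inject. CDebugInjector writes the stub and the associated DLL name into the debuggee's address space, then calls SetThreadContext to change the debuggee's instruction pointer so that the LoadLibrary stub runs. All of this code is in the CDebugInjector::PlaceInjectionStub method.
  Immediately after the LoadLibrary call in the stub is a breakpoint (INT 3). It stops the debuggee and returns control to the debugger, which uses SetThreadContext to restore the instruction pointer and the other registers to their original values. After another call to ContinueDebugEvent, the debuggee continues running with the DLL injected, unaware that anything happened.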
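  If you do not think about it too hard, the injection process does not seem difficult, but a few interesting problems complicate it. For example, when is the right time to create the stub and redirect execution to it? You cannot do it immediately after CreateProcess, because the imported DLLs have not been mapped into memory yet and the Win32 loader has not fixed up the EXE's IAT. In other words, it is too early.
  In the end I decided to let the debuggee run until it hits its first breakpoint, and then set a breakpoint of my own at the EXE's entry point. When this second breakpoint triggers, CDebugInjector knows that the DLLs in the target process (including KERNEL32.DLL) have been initialized but that no code in the EXE has run. That is the right time to inject DelayLoadProfileDLL.DLL.
  Incidentally, where does the first breakpoint come from? By definition, a Win32 process that is being debugged calls DebugBreak (that is, INT 3) early in its execution. In my earlier APISPY32 code I used this initial DebugBreak as the moment to inject. Unfortunately, on Windows 2000 this DebugBreak occurs before KERNEL32.DLL is initialized. CDebugInjector therefore sets its own breakpoint at the point where the EXE is about to get control, by which time KERNEL32.DLL has been initialized.
  Earlier I mentioned a breakpoint that occurs after the LoadLibrary call. This is the third breakpoint that CDebugInjector has to handle. The handling of the different breakpoints can be seen in CDebugInjector::HandleException.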
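  The three breakpoints form a small state machine. The sketch below is a guess at its shape, not the code of CDebugInjector::HandleException, which is not reproduced here. All names are hypothetical, and the actions are returned as values instead of being carried out with SetThreadContext and the other debugging APIs.

```cpp
// Hypothetical phases of the injection sequence described above.
enum class InjectPhase {
    WaitingForInitialBreak,  // breakpoint raised early in a debugged process
    WaitingForEntryBreak,    // our own breakpoint at the EXE entry point
    WaitingForStubBreak,     // INT 3 that follows the LoadLibrary call
    Injected                 // DLL loaded, original context restored
};

// What the debugger should do for the breakpoint it has just seen.
enum class InjectAction {
    SetEntryPointBreakpoint,
    PlaceInjectionStub,
    RestoreOriginalContext,
    PassToDebuggee
};

struct InjectorState {
    InjectPhase phase = InjectPhase::WaitingForInitialBreak;
};

// Advance the state machine by one breakpoint event.
InjectAction OnBreakpoint(InjectorState& s) {
    switch (s.phase) {
    case InjectPhase::WaitingForInitialBreak:
        s.phase = InjectPhase::WaitingForEntryBreak;
        return InjectAction::SetEntryPointBreakpoint;
    case InjectPhase::WaitingForEntryBreak:
        s.phase = InjectPhase::WaitingForStubBreak;
        return InjectAction::PlaceInjectionStub;
    case InjectPhase::WaitingForStubBreak:
        s.phase = InjectPhase::Injected;
        return InjectAction::RestoreOriginalContext;
    case InjectPhase::Injected:
        break;
    }
    // Any later breakpoint belongs to the target program itself.
    return InjectAction::PassToDebuggee;
}
```

  Keeping the phase in one place makes the ordering constraint explicit: the stub is placed only after the entry-point breakpoint, and the saved context is restored only after the stub's INT 3.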
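  Another interesting problem with DLL injection is where to write the LoadLibrary stub. On Windows NT 4.0 and later you can use VirtualAllocEx to allocate memory in another process, so I took that route. That leaves Windows 9x, which does not support VirtualAllocEx. There I took advantage of a special property of Windows 9x memory-mapped files: they are visible in every address space, and at the same address. I create a small memory-mapped file backed by the system page file and write the LoadLibrary stub into it, which makes the stub visible to the debuggee. For details, see CDebugInjector::GetMemoryForLoadLibraryStub in the code linked at the top of the article.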

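Using DelayLoadProfile:
  DelayLoadProfile is a command-line program that writes its results to standard output. From a command prompt, run DelayLoadProfile and specify the target program and any arguments it needs, for example: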
  DelayLoadProfile notepad c:/autoexec.bat
Here are the results of running DelayLoadProfile against CALC.EXE from Windows 2000 Release Candidate 2:
   [d:/column/col66/debug]delayloadprofile calc
   DelayLoadProfile: SHELL32.dll was called 0 times
   DelayLoadProfile: MSVCRT.dll was called 9 times
   DelayLoadProfile: ADVAPI32.dll was called 0 times
   DelayLoadProfile: GDI32.dll was called 60 times
   DelayLoadProfile: USER32.dll was called 691 times
I simply started CALC and closed it immediately. Note that SHELL32.DLL and ADVAPI32.DLL received no calls at all. These two DLLs are prime candidates for CALC to DelayLoad.
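  Picking out the zero-call DLLs from a report can be automated. The sketch below assumes the report text has been captured into a string and that each line follows the format shown above. `ZeroCallDlls` is a hypothetical helper, not part of DelayLoadProfile.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Return the DLL names that a DelayLoadProfile report lists with 0 calls.
// Lines that do not match the expected shape are skipped.
std::vector<std::string> ZeroCallDlls(const std::string& report) {
    const std::string prefix = "DelayLoadProfile: ";
    const std::string middle = " was called ";
    std::vector<std::string> result;
    std::istringstream in(report);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t start = line.find(prefix);
        if (start == std::string::npos) continue;
        const std::size_t nameBegin = start + prefix.size();
        const std::size_t nameEnd = line.find(middle, nameBegin);
        if (nameEnd == std::string::npos) continue;
        // Parse "<number> times" from the rest of the line.
        std::istringstream rest(line.substr(nameEnd + middle.size()));
        unsigned long calls = 0;
        std::string word;
        if ((rest >> calls >> word) && word == "times" && calls == 0) {
            result.push_back(line.substr(nameBegin, nameEnd - nameBegin));
        }
    }
    return result;
}
```

  The returned names are only candidates. Each one still needs the dependency check described in the following paragraphs before /DELAYLOAD is worth applying.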
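  You may wonder why CALC loads SHELL32.DLL without calling it. If you run DumpBin /IMPORTS or Depends.exe against CALC, you will see that the only function CALC imports from SHELL32.DLL is ShellAboutW. Put simply, unless you select the Help | About Calculator menu item in CALC, loading SHELL32.DLL is a waste of time and memory. This is a clear example of where /DELAYLOAD shows its value. Incidentally, SHELL32.DLL implicitly links against SHLWAPI.DLL and COMCTL32.DLL, which are therefore also loaded and initialized for no reason.
  Just because DelayLoadProfile reports that a DLL received few or no calls does not mean you should automatically DelayLoad it. Check whether one of your other implicitly linked DLLs also links against the DLL you are considering. If so, /DELAYLOAD gains nothing, because the DLL will be loaded and initialized anyway through that other dependency. Depends.exe from the Platform SDK is a useful tool for seeing how a DLL is used.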
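  How much of your program you exercise during testing also deserves thought. If you exercise every feature, every imported DLL will most likely be called. Personally, I think the goal should be to minimize startup time, which may mean simply starting your program and then closing it. To speed up initialization, load DLLs only as they are needed. Users judge a program's speed subjectively, largely by how long it takes to start.
  I have found several DLLs that often benefit from /DELAYLOAD. As described above, SHELL32.DLL is one. Another is WINSPOOL.DRV, which supports printing: since many users rarely print, it is a good candidate. OLE32.DLL and OLEAUT32.DLL are similar. When a program uses COM and OLE only in a minor capacity, those DLLs are also candidates. For example, CDPLAYER.EXE from Windows 2000 links against OLE32.DLL for the CreateStreamOnHGlobal function, but in normal use I never saw that function called.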
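  DelayLoadProfile is not without flaws. Although I tested DelayLoadProfileDLL's IAT redirection successfully against many programs, you may still meet programs that do not run correctly under it. Fully solving this is beyond the scope of this column. However, if you track down and fix one of these problems, please let me know, and I will update DelayLoadProfile at some point.
  I know that some programs that import MFC42.DLL and MFC42U.DLL conflict with DelayLoadProfile. As a workaround, the IsModuleOKToHook function in DelayLoadProfileDLL.cpp lists MFC42.DLL, MFC42U.DLL, and KERNEL32.DLL as modules to skip. (You cannot use /DELAYLOAD with KERNEL32.DLL, so profiling it would be pointless.) If some other DLL causes problems, add it to IsModuleOKToHook.
  I hope DelayLoadProfile helps you adopt /DELAYLOAD in your programs. I expect to find time later to polish the code, and I would like to hear your success stories.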

If you have suggestions for <Under the Hood>, mail Matt at matt@wheaty.net, or visit http://www.wheaty.net

Excerpted from <Microsoft Systems Journal>, February 2000 issue

Original article attached:

 
 In the December 1998 issue of MSJ, Jeffrey Richter and I wrote dueling columns on the DelayLoad feature of the Microsoft® Visual C++® 6.0 linker. The fact that both Jeff and I jumped on this topic is testimony to how cool this feature is. Unfortunately, I still find people who don't know anything about DelayLoad or they think it's some feature that's available only in the latest version of Windows NT®. 
For starters, let me scream from the highest rooftop that DelayLoad is not an operating system feature. It works on any Win32®-based system. With that off my chest, I'll demonstrate this month's utility, DelayLoadProfile, which makes it almost trivial to determine whether your program can benefit from DelayLoad. As I'll show, even some of Microsoft's own programs can benefit from it.
A Quick Review 
If you're wondering "What's this thing Matt's gone off the deep end over?" a quick recap of DelayLoad is in order. Here's how it works. Normally, when calling an imported function in a DLL, the linker adds information about the imported DLL and function to your executable. Collectively, the information for all the imported functions is known as the imports section. 
The Win32 loader scans through the imports section at load time and loads each DLL. For each DLL loaded, the loader iterates through all the imported functions and locates their addresses in the imported DLL. These addresses are written back to the imports section in a location known as the Import Address Table (IAT). A simple way to think of an IAT is as an array of function pointers. When calling an imported function, the call uses one of the function pointers from the IAT. 
How does the picture change with DelayLoad? When you specify DelayLoad for a DLL, the linker doesn't emit the usual data it would put in the imports section. Instead, it generates a small stub for each DelayLoad imported function. This stub points to the imported DLL and function name. Upon calling an imported function for the first time, the stub calls LoadLibrary to load the DLL. Next, it calls GetProcAddress to get the address of the called function. Finally, the stub overwrites part of itself so that subsequent calls to the function go directly to the target code. 
What I've just described is a slight simplification. In reality, the stub is a small bit of code that calls a routine statically linked into your executable. This routine resides in DELAYIMP.LIB, which must be included in the list of libraries that the linker uses. Also, the stubs and DELAYIMP.LIB code are smart enough to call LoadLibrary only the first time a function in the DLL is used. Subsequent calls to other functions in the same DelayLoad imported DLL don't call LoadLibrary.
All things considered, DelayLoad doesn't add much time or space overhead compared to importing the DLL the usual way. Calling LoadLibrary is only slightly less efficient than letting the Win32 loader load the DLL. Likewise, calling GetProcAddress once for each DelayLoad imported function is only slightly slower than having the Win32 loader locate the imported functions at startup. 
However, the benefits of DelayLoad can easily make up for these small speed penalties. For starters, if you never call a function in a DelayLoad imported DLL, the DLL isn't loaded in the first place. This comes in handy more often than you may think. Consider the situation in which you have printing code in your program. If the user doesn't print something during a program session, you've loaded WINSPOOL.DRV for no reason. In this case, using DelayLoad is actually faster since you never loaded and initialized WINSPOOL.DRV. 
Another benefit of using DelayLoad is that you avoid calling APIs that are not available on one of your target platforms. For instance, say you want to call AnimateWindow, which is supported in Windows® 98 and Windows 2000, but not Windows 95 or Windows NT 4.0. If you were to call AnimateWindow the usual way, your code wouldn't load on the earlier platforms. However, with DelayLoad you can make a runtime check of which operating system you're on and only call AnimateWindow if it's supported. There's no need for you to muck up your code with calls to LoadLibrary and GetProcAddress. 
Using DelayLoad is incredibly easy. Once you know which DLLs you want to use DelayLoad with, simply add /DELAYLOAD:DLLNAME, where DLLNAME is the name of the DLL. You'll also need to add DELAYIMP.LIB to the linker's library list, and you'll still need the original import library, for example, SHELL32.LIB. Putting everything together, to DelayLoad against SHELL32.DLL your linker line would need the following: 
 
 
 SHELL32.LIB /DELAYLOAD:SHELL32.DLL DELAYIMP.LIB

 
 Unfortunately, the Visual Studio® 6.0 IDE doesn't have an easy way for you to specify DelayLoading for DLLs. In Visual Studio 6.0, you'll have to add the /DELAYLOAD:XXX command-line fragment manually to the Project Settings | Link | Project Options edit field. 
When to Use DelayLoad 
When you have a small project, it's easy to come up with a list of DLLs that are good DelayLoad candidates. However, because projects may grow and can involve many developers, it's just as easy to lose track of who uses which DLL. In the past, I've relied on gut instinct and Depends.EXE from the Platform SDK. A DLL from which only a few functions are imported is a good place to start.
However, I wanted a way to automate and simplify the process. Thus was born the DelayLoadProfile program. DelayLoadProfile is a tool that runs your EXE and monitors the DLLs and functions that your EXE calls. After your program terminates, DelayLoadProfile spits out a summary of which DLLs were used and how many calls were made to each DLL. A DLL that's imported, but which had no calls made to it, is a good candidate for DelayLoad importing. 
Let me emphasize one point before continuing: DelayLoadProfile works only against your EXE. While it could be extended to recurse into all of your imported DLLs and their dependencies, that would significantly complicate its code. As I'll explain later, DelayLoadProfile just gives you hints about which DLLs you might consider using /DELAYLOAD on. You still have to use that neuron-based processing unit between your ears to make sure it makes sense to do so. 
DelayLoadProfile: The Big Picture 
The concept behind DelayLoadProfile is simple. Redirecting the function pointers in the EXE's IAT to point to a stub is all that's needed. The stub simply notes that the imported function has been called, then jumps to the address that the Win32 loader originally stored in the IAT. However, the devil is in the details. 
First, you must decide where the code will run that locates and modifies the EXE's IAT entries to point to the stubs. Doing the work out-of-process in some sort of control program is one option. This avoids the work involved in getting your code into the target EXE's process. The downside is that it's more work to traverse all the data structures necessary to locate and patch the IAT entries, as well as gather the results later. I'd be swimming in ReadProcessMemory calls. 
The other approach is to do the hard work in the same process space as the target EXE. This makes it almost trivial to march through the data structures, build stubs, redirect the IAT entries, and summarize the results at the end. However, doing the work in-process requires that some of the DelayLoadProfile code be loaded into the target EXE's process as it runs. This is the path I took. 
Having committed to running my code in-process with the target, the next problem was figuring out how to get my code into the target process. One choice would have been to ask the user to link with the DelayLoadProfile code. Knowing it would require some effort by the target audience, I discarded this option. If a DelayLoadProfile user needed to modify their source, project, or makefile, many would pass. I needed to make DelayLoadProfile a complete no-brainer. 
At this point, I had boxed myself into some sort of loader program that would run the target EXE and inject my DelayLoadProfile DLL into it. One technique for DLL injection is to use CreateRemoteThread to start a thread in the target process that calls LoadLibrary on your DLL. I discarded this approach because CreateRemoteThread isn't available on Windows 9x, which I wanted to support.
Longtime MSJ readers may remember a program I wrote more than five years ago called APISPY32. It loads a process and injects a DLL into it for the purposes of logging API calls. That sounds similar to what I needed DelayLoadProfile to do. Alas, when I ran APISPY32 on Windows 2000, it failed to load the DLL. A little digging revealed the source of the problem, and I decided it was time to revamp this code for a whole new generation of programmers. 
Into the Trenches 
To review quickly, DelayLoadProfile is a two-part system. A loader process runs your program. Early on in your program, the loader process injects a DLL into your program's address space. This DLL scans through your EXE's IAT and redirects the imported functions to point to stubs that the DLL creates. When your program shuts down, the injected DLL scans through the stubs it has created and summarizes how many calls were made to each imported DLL. If you've ever used the APIMON utility from the Platform SDK, you'll recognize the similarities. 
The DLL that does all the work of monitoring a program's use of imports is called DelayLoadProfileDLL (see Figure 1). DelayLoadProfileDLL uses the DLL_PROCESS_ATTACH and DLL_PROCESS_DETACH notifications sent to its DllMain procedure to initiate the two primary phases of the DLL's work. 
When its DllMain gets the DLL_PROCESS_ATTACH notification, DelayLoadProfileDLL calls PrepareToProfile. Inside PrepareToProfile, the code locates the target EXE's IAT. For each imported DLL it finds, the code determines if it's a DLL that's safe for IAT redirection. It does this by calling the IsModuleOKToHook function. Most of the time, it's OK to redirect the IAT, so PrepareToProfile invokes the RedirectIAT function. 
RedirectIAT is where things get dirty, and it really helps if you understand the import-related data structures in WINNT.H. First, the function locates the IAT and the associated Import Names Table. The code then counts how many IAT entries there are by scanning through the IAT, looking for a NULL pointer. With this count, an array of DLPD_IAT_STUB stubs is created, with one stub for each IAT entry. 
Finally, it's time for meatball surgery. The code makes yet another pass through the IAT. This time it grabs the address in each IAT entry, stuffs it into a JMP instruction in the stub, and redirects the IAT entry to point to the stub. As the code advances through each subsequent IAT entry, it also advances to the next DLPD_IAT_STUB stub in the allocated array. I'll explain DLPD_IAT_STUB stubs a little later in this column. 
Two aspects of redirecting the IAT entries to the allocated stubs are worth mentioning. First, the IAT is often placed in a read-only section of the EXE. Ordinarily, an attempt to modify such an IAT pointer would result in an access violation. Luckily, the VirtualProtect API comes to the rescue and enables you to modify the attributes of a target address, in this case, the IAT. Read-write is the attribute you're looking to modify. When it's finished, the code restores the original memory protection attributes. 
The other tricky part of redirecting the IAT occurs when you encounter a data import. Although programmers don't frequently do so, it's relatively easy to import data in addition to code. The Visual C++ runtime library DLL (MSVCRT.DLL) has data exports. Redirecting an IAT entry that refers to data in an imported DLL is almost certainly a recipe for problems. 
So how do you determine whether an import is a normal code import or a data import? A commercial product could implement a sophisticated algorithm to determine the import type of an IAT entry. However, I took a shortcut and used IsBadWritePtr. If the IAT points to memory that's writeable, it's probably pointing to data. Likewise, if it points to read-only memory, odds are that it's pointing to code. Is this a perfect test? No, but it's good enough for DelayLoadProfile's needs. 
Now let's take a look at the stubs. The DLPD_IAT_STUB structure in DelayLoadProfileDLL.H contains the layout, which is a mixture of code and data. Simplifying this structure, a DLPD_IAT_STUB stub looks like this: 
 
 
 CALL    DelayLoadProfileDLL_UpdateCount 
 JMP     XXXXXXXX // original IAT address 
 DWORD   count 
 DWORD   pszNameOrOrdinal 

 
 When the EXE calls one of the redirected functions, control goes to the CALL instruction in the stub. The DelayLoadProfileDLL_UpdateCount routine in DelayLoadProfileDLL.CPP simply increments the value of the count field of the stub. After that CALL returns, the JMP instruction transfers control to the original address that was stored in the IAT before I bashed it. Figure 2 shows the big picture after the IAT has been redirected to the stubs. 
Assembler junkies might be wondering how the DelayLoadProfileDLL_UpdateCount function knows where the stub's count field is in memory. A quick look at the code shows that DelayLoadProfileDLL_UpdateCount finds the return address pushed on the stack by the CALL instruction. The return address points to the JMP XXXXXXXX instruction following the call. Since the CALL instruction is always five bytes, some pointer arithmetic yields the stub's starting address and easy access to the stub's count field. 
I had one problem using the DelayLoadProfileDLL_UpdateCount code that's worth mentioning. Originally, the function didn't have the PUSHAD and POPAD instructions to save and restore all of the regular CPU registers. The code worked fine on many programs, but just blew up on others. Finally, I narrowed it down to programs that imported __CxxFrameHandler and _EH_prolog from MSVCRT.DLL. Both of these APIs expect the EAX register to be set to a given value, and DelayLoadProfileDLL_UpdateCount was trashing EAX. 
Since the trashed EAX was the problem, I added PUSHAD and POPAD. Alas, the problem remained. In frustration, I examined the compiler-generated code, and then smacked my forehead. Normally when generating code for a debug build, the Visual C++ 6.0 compiler inserts code in the function prolog to set all local variables to the value 0xCC. This code was trashing EAX before my PUSHAD got a chance to execute. To get around this, I had to remove the /GZ option from the debug build settings for DelayLoadProfileDLL. 
Reporting Results 
As your process shuts down, the system sends the DLL_PROCESS_DETACH notification to all loaded DLLs. DelayLoadProfileDLL uses this opportunity to harvest the information collected during the run. In a nutshell, this means scanning through all the stub arrays, counting the number of calls that were made through the stubs, and reporting what it finds. 
During the setup phase when DelayLoadProfileDLL was redirecting the IATs, it stashed away the address of the EXE's IAT into a global variable (g_pFirstImportDesc). At shutdown time, ReportProfileResults uses this pointer to walk through the imports section again. For each imported DLL, it retrieves the address of the DLL's first IAT entry. If this is an IAT that I've redirected, the first pointer in the IAT should point to the first of the DLPD_IAT_STUB stubs allocated for that DLL. Of course, the code does some sanity checking to ensure that this is the case. If something doesn't look right, DelayLoadProfileDLL ignores that particular imported DLL. 
Generally though, everything looks fine, and the first IAT entry points to my stubs. The code then iterates through all the stubs for the DLL. At each stub, the value of the stub's count field is added to a running total for the DLL. When the iteration completes, ReportProfileResults formats a string with the name of the DLL and how many calls were made through the stubs. The code uses OutputDebugString to broadcast its findings. 
Loading and Injection 
The program that loads your EXE and injects DelayLoadProfileDLL.DLL is called—you guessed it—DelayLoadProfile.EXE (the source code is available from the MSJ Web site at http://www.microsoft.com/msj). This code mainly drives the CDebugInjector class, which I'll describe shortly. Function main obtains the target EXE's command line and passes it to CDebugInjector::LoadProcess. If the process is created successfully, function main tells CDebugInjector which DLL it wants injected. In this case, it's DelayLoadProfileDLL.DLL, which should be located in the same directory as DelayLoadProfile.EXE. 
The last step before letting the target run wild is to call CDebugInjector::SetOutputDebugStringCallback. When DelayLoadProfileDLL reports its results via OutputDebugString, CDebugInjector sees them and passes them to the callback you registered. This callback just printfs the strings to the console. Finally, function main calls CDebugInjector::Run. This call lets the target process begin and, when the time is right, injects the DLL into it. 
 Figure 3 shows the CDebugInjector class. This is where all the good stuff happens. CDebugInjector::LoadProcess creates the specified process as a debugee process. The ramifications of running as a debugee process have been discussed in many articles and in the MSDN documentation, so I won't go into all the details here. 
For the purposes of this column, it's sufficient to say that the debugger process (in this case, DelayLoadProfile) has to enter a loop that calls WaitForDebugEvent and ContinueDebugEvent until the debugee terminates. Every time WaitForDebugEvent returns, something has happened in the debugee. This might be an exception (including breakpoints), a DLL load, a thread creation, or other event. The WaitForDebugEvent documentation covers all the events that might occur. The CDebugInjector::Run method contains the code for this loop. 
So how does running the target process as a debugee help you inject a DLL? A debugger process has excellent control over the debugee process's execution. Every time a significant event occurs in the debugee, it is suspended until the debugger calls ContinueDebugEvent. Knowing this, a debugger process can add code to the debugee's address space and temporarily change the debugee's registers so that the added code executes. 
In more specific terms, CDebugInjector synthesizes a small code stub that calls LoadLibrary. The DLL name parameter to LoadLibrary points to the name of the DLL to inject. CDebugInjector writes the stub (and the associated DLL name) to the debugee's address space. It then calls SetThreadContext to change the debugee's instruction pointer (EIP) to execute the LoadLibrary stub. All of this dirty work occurs within the CDebugInjector::PlaceInjectionStub method. 
Immediately following the LoadLibrary call in the stub is a breakpoint instruction (INT 3). This stops the debugee and gives control back to the debugger process. The debugger then uses SetThreadContext again to restore the instruction pointer and other registers to their original values. Another call to ContinueDebugEvent and the debugee is on its way with the DLL injected, none the wiser that anything has happened. 
If you don't think too hard, this injection process doesn't sound too messy. Nonetheless, a few interesting problems crop up that complicate things. For example, when is the proper time to create the stub code and redirect control to it? You can't do this immediately after the CreateProcess call because, among other reasons, the imported DLLs haven't been mapped into memory at this point and the EXE's IAT hasn't been fixed up by the Win32 loader. In other words, it's too early. 
The solution I ultimately decided on was to let the debugee run until it encounters its first breakpoint. Then I set a breakpoint of my own at the entry point of the EXE. When this second breakpoint triggers, CDebugInjector knows that DLLs in the target process (including KERNEL32.DLL) have initialized, but no code in the EXE has run. This is the perfect time for injecting DelayLoadProfileDLL.DLL. 
Incidentally, where does the first breakpoint come from? By definition, a Win32 process that's being debugged calls DebugBreak (also known as INT 3) very early in its execution. In my ancient APISPY32 code, I used the initial DebugBreak as the occasion to do the injection. Unfortunately in Windows 2000, this DebugBreak occurs before KERNEL32.DLL is initialized. Thus, CDebugInjector sets its own breakpoint to go off when the EXE is about to get control, and thus knows that KERNEL32.DLL has been initialized. 
Earlier, I mentioned a breakpoint that occurs after the LoadLibrary call returns. This is a third breakpoint for CDebugInjector to handle. All of the mechanics for handling the different breakpoints can be seen in CDebugInjector::HandleException. 
Another interesting problem to address with DLL injection is where to write the LoadLibrary stub. Under Windows NT 4.0 and later you can allocate space in another process with VirtualAllocEx, so I took that route. That leaves out Windows 9x, which doesn't support VirtualAllocEx. For this scenario, I took advantage of a unique property of Windows 9x memory-mapped files. These files are visible in all address spaces, and at the same address. I simply create a small memory-mapped file using the system page file as backing, and blast the LoadLibrary stub into it. The stub is implicitly accessible in the debugee process. For the details, see the code listing for CDebugInjector::GetMemoryForLoadLibraryStub at the link at the top of this article. 
Using DelayLoadProfile 
DelayLoadProfile is a command-line program that writes its results to standard output. From a command prompt, run DelayLoadProfile, specifying the target program and any arguments it needs, such as: 
 
 
 DelayLoadProfile notepad c:\autoexec.bat

 
 Here are the results of running DelayLoadProfile against CALC.EXE from Windows 2000 Release Candidate 2: 
 
 
 [d:\column\col66\debug]delayloadprofile calc
 DelayLoadProfile: SHELL32.dll was called 0 times
 DelayLoadProfile: MSVCRT.dll was called 9 times
 DelayLoadProfile: ADVAPI32.dll was called 0 times
 DelayLoadProfile: GDI32.dll was called 60 times
 DelayLoadProfile: USER32.dll was called 691 times

 
 I simply started CALC and immediately shut it down. Note that SHELL32.DLL and ADVAPI32.DLL both had no calls to them. These two DLLs are prime candidates for CALC to DelayLoad. 
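For longer reports, picking out the zero-call DLLs by eye gets tedious. The following is a minimal sketch, not part of the DelayLoadProfile sources, that scans captured output for lines of the form shown above and collects the DLLs that were never called. The names DllCallCount, ParseProfileLine, and UncalledDlls are hypothetical, and the code assumes the line format is exactly "DelayLoadProfile: <dll> was called <N> times".

```cpp
#include <sstream>
#include <string>
#include <vector>

// One parsed line of DelayLoadProfile output (hypothetical helper type).
struct DllCallCount {
    std::string dll;
    unsigned long calls;
};

// Parses a line such as
//   "DelayLoadProfile: SHELL32.dll was called 0 times"
// Returns false for any line that does not have this shape
// (for example, the command prompt line itself).
bool ParseProfileLine(const std::string& line, DllCallCount& out) {
    std::istringstream in(line);
    std::string tag, dll, was, called, times;
    unsigned long calls = 0;
    if (!(in >> tag >> dll >> was >> called >> calls >> times)) {
        return false;
    }
    if (tag != "DelayLoadProfile:" || was != "was" || called != "called") {
        return false;
    }
    out.dll = dll;
    out.calls = calls;
    return true;
}

// Collects the DLLs that received no calls during the profiled run.
std::vector<std::string> UncalledDlls(const std::string& report) {
    std::vector<std::string> result;
    std::istringstream in(report);
    std::string line;
    while (std::getline(in, line)) {
        DllCallCount parsed;
        if (ParseProfileLine(line, parsed) && parsed.calls == 0) {
            result.push_back(parsed.dll);
        }
    }
    return result;
}
```

Fed the CALC report above, UncalledDlls would return SHELL32.dll and ADVAPI32.dll, in the order they appear in the output.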
You may be wondering why CALC loads SHELL32.DLL, yet doesn't call it. It would be easy enough to run DumpBin /IMPORTS or Depends.EXE against CALC. In doing so, you'd see that the only function CALC imports from SHELL32.DLL is ShellAboutW. Simply put, unless you select the Help | About Calculator menu item in CALC, it's a complete waste of time and memory to load SHELL32.DLL. This is a fabulous example of where /DELAYLOAD can really show its worth. Incidentally, SHELL32.DLL implicitly links against SHLWAPI.DLL and COMCTL32.DLL—two additional DLLs that are brought into memory and initialized for no reason. 
Just because DelayLoadProfile reports that a DLL is receiving few or no calls at all doesn't mean you should automatically DelayLoad it. Be sure to consider whether one of your implicitly linked DLLs also links against the DLL you're considering using DelayLoad with. If this is the case, it's not worth using /DELAYLOAD in your EXE since the DLL is still going to be loaded and initialized because of some other dependency. Depends.EXE from the Platform SDK is a great tool for quickly determining the scope of a DLL's usage. 
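The dependency check described above can be sketched as a small graph walk. This is an illustration under stated assumptions, not something DelayLoadProfile does for you: the import map is assumed to be filled in by hand from Depends.EXE or DumpBin /IMPORTS output, module names are assumed to be normalized to one case already, and ImportMap and LoadedThroughOtherDependency are hypothetical names.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps a module name to the modules it implicitly links against.
// Names are assumed to be normalized (e.g. all uppercase) by the caller.
typedef std::map<std::string, std::vector<std::string> > ImportMap;

// Returns true if `candidate` would still be loaded at startup because some
// other implicitly linked module of `exe` imports it, directly or indirectly.
// In that case, adding /DELAYLOAD for `candidate` to the EXE gains nothing.
bool LoadedThroughOtherDependency(const ImportMap& imports,
                                  const std::string& exe,
                                  const std::string& candidate) {
    ImportMap::const_iterator root = imports.find(exe);
    if (root == imports.end()) {
        return false;
    }

    // Start from every direct import of the EXE except the candidate itself.
    std::vector<std::string> pending;
    for (size_t i = 0; i < root->second.size(); ++i) {
        if (root->second[i] != candidate) {
            pending.push_back(root->second[i]);
        }
    }

    std::set<std::string> visited;
    while (!pending.empty()) {
        std::string module = pending.back();
        pending.pop_back();
        if (!visited.insert(module).second) {
            continue;  // already examined this module
        }
        ImportMap::const_iterator found = imports.find(module);
        if (found == imports.end()) {
            continue;  // no import information recorded for this module
        }
        for (size_t i = 0; i < found->second.size(); ++i) {
            if (found->second[i] == candidate) {
                return true;
            }
            pending.push_back(found->second[i]);
        }
    }
    return false;
}
```

A DLL that shows zero calls and for which this function returns false is a reasonable /DELAYLOAD candidate. One for which it returns true will be mapped and initialized regardless of what the EXE's own import table says.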
Another thing to consider when using DelayLoadProfile is how much of your app you'll exercise during your test. Obviously, if you exercise all aspects of your app, all the DLLs you import in the EXE will be invoked. Personally, I think minimal load time is a good target to shoot for. This might mean just starting your program and then closing it down. By spreading the work of loading and initializing your DLLs throughout your application as it runs, you can speed the initial load sequence. Users often subjectively judge the speed of your application by its startup time. 
I've found a few DLLs that will benefit from using /DELAYLOAD. As you saw earlier, SHELL32.DLL is one of them. Another is WINSPOOL.DRV, which is used for printing support. Since most users don't print frequently, it's a good candidate. So are OLE32.DLL and OLEAUT32.DLL, because a variety of programs use COM and OLE only in some minimal capacity. For example, the Windows 2000 CDPLAYER.EXE links against OLE32.DLL and imports its CreateStreamOnHGlobal API. Yet in ordinary usage, I didn't observe this function being called.
DelayLoadProfile is not without its faults (literally). While I've tested it successfully with a large number of applications, you may still run into the occasional program that doesn't work so well when DelayLoadProfileDLL interfaces with its IAT. Tracking down all of these odd scenarios is beyond the scope of this column. However, if you locate and fix one of these problems, please let me know. I may update DelayLoadProfile at some future date.
I know that programs that import MFC42.DLL and MFC42U.DLL can crash with DelayLoadProfile. For that reason I've provided an escape hatch. In DelayLoadProfileDLL.CPP it's the IsModuleOKToHook function. I've placed MFC42.DLL, MFC42U.DLL, and KERNEL32.DLL in it. (You can't use /DELAYLOAD with KERNEL32.DLL anyhow, so it's no loss.) If a particular DLL seems to be giving you problems, first try adding it to IsModuleOKToHook. 
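To give a feel for the escape hatch, here is a rough, portable sketch of what such a filter might look like. It is a reconstruction for illustration only: the real IsModuleOKToHook lives in DelayLoadProfileDLL.CPP, and its actual signature and comparison method may differ. The std::string parameter and the kSkippedModules table name are assumptions.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Illustrative sketch only; see DelayLoadProfileDLL.CPP for the real code.
// Returns false for modules whose IAT entries should be left untouched.
bool IsModuleOKToHook(const std::string& moduleName) {
    // Modules known to cause trouble, plus KERNEL32.DLL, which cannot be
    // used with /DELAYLOAD anyway. Add a problem DLL here to skip it.
    static const char* const kSkippedModules[] = {
        "MFC42.DLL",
        "MFC42U.DLL",
        "KERNEL32.DLL",
    };

    // Module names are case-insensitive on Windows, so compare in uppercase.
    std::string upper(moduleName);
    std::transform(upper.begin(), upper.end(), upper.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::toupper(c));
                   });

    for (const char* skipped : kSkippedModules) {
        if (upper == skipped) {
            return false;
        }
    }
    return true;
}
```

With this shape, excluding another troublesome DLL is a one-line addition to the table.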
I hope DelayLoadProfile's ease of use will inspire you to tune your applications to make use of /DELAYLOAD. I certainly had a good time updating some classic code, and I'd enjoy hearing your success stories, too.
 

 
 Have a suggestion for Under The Hood? Send it to Matt at matt@wheaty.net or http://www.wheaty.net.