Optimizing DLL Load Time Performance

Matt Pietrek

Download the code for this article: UnderTheHood0500.exe (264KB)

Over the years, one of the dominant themes in my columns and seminars has been the benefits of techniques that optimize your executables. Typically this means basing and binding, but might also include importing functions by ordinal or changing executable page alignment. Intuitively, these strategies should make your code load faster. However, I've always had a nagging question about this topic—namely, just how much of an improvement can be expected with these techniques? This month, I've put my money where my mouth is and come up with some concrete numbers.
In this column, my goal is to measure the load time speed effects of several scenarios:

  • A program using DLLs that have conflicting load addresses versus having each DLL load at its preferred address. This tests the effects of the REBASE program.
  • A program that imports hundreds of APIs the normal way versus having all the APIs bound to their target DLL. This tests the effects of the BIND program.
  • Importing functions by ordinal rather than by name. Doing this requires some work on your part, typically creating and maintaining a .DEF file.

Shortly, I'll explain why the second scenario depends on the first. That is, binding your executables against DLLs that aren't loading at their preferred address is a waste of time. Binding assumes that you have a properly based system where DLLs load according to their header specification.

A Quick Review of Load Time Performance Tuning

Before getting into the performance improvement details, a quick synopsis of basing, binding, and importing by ordinal is in order. Basing is a good place to start. When creating a DLL, the linker assumes that the DLL will load at a particular address. Certain pieces of the code and data contain hardcoded addresses that are only correct if the DLL loads at the preferred address. However, at runtime it's possible that the operating system may have to load the DLL at a different memory location.

To handle the situation where the OS has to move the DLL, the linker adds base relocations to the DLL. Base relocations are addresses that require modification so that they contain the correct address for where the DLL loaded in memory. The more base relocations a DLL has, the more time the OS needs to process them and to load the DLL. A properly based DLL loads at its preferred address, and can skip processing the base relocation records.
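As a rough illustration of what "processing base relocations" means, here is a minimal, portable sketch. It models an image as a byte buffer and assumes a simplified relocation list (plain offsets of 32-bit absolute addresses). The real PE format groups relocations into per-page blocks with type codes, which this sketch leaves out. The names ApplyRelocations and relocOffsets are invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Patch every 32-bit absolute address listed in relocOffsets so that it is
// valid for actualBase instead of preferredBase. Returns the number of
// fixups applied (zero when the image loaded at its preferred address).
std::size_t ApplyRelocations(std::vector<std::uint8_t>& image,
                             std::uint32_t preferredBase,
                             std::uint32_t actualBase,
                             const std::vector<std::size_t>& relocOffsets)
{
    if (actualBase == preferredBase)
        return 0;                        // properly based: nothing to do

    // Unsigned arithmetic wraps, so this also works when moving downward.
    const std::uint32_t delta = actualBase - preferredBase;

    std::size_t applied = 0;
    for (std::size_t offset : relocOffsets) {
        if (offset + sizeof(std::uint32_t) > image.size())
            continue;                    // skip malformed entries in this sketch
        std::uint32_t address;
        std::memcpy(&address, &image[offset], sizeof address);
        address += delta;
        std::memcpy(&image[offset], &address, sizeof address);
        ++applied;
    }
    return applied;
}
```

The early return is the whole point of basing: when the two addresses match, the loop never runs, no matter how many relocation records the DLL carries.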

When not given explicit directions, the Microsoft® linker creates each DLL with a preferred load address of 0x10000000. If your program uses more than one DLL of your own devising, you'll have multiple DLLs with the same preferred load address. The result is that every DLL but the first one will be relocated by the operating system at load time. This is sometimes referred to as a load address collision. However, if you intervene you'll be able to prevent this from happening.

The REBASE program that comes with Microsoft Visual Studio® and the Platform SDK is a handy tool for getting rid of load address collisions. You supply REBASE with a list of all the modules that make up your program (not counting system DLLs), and it picks new load addresses for the DLLs and modifies them accordingly.
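The idea behind picking new load addresses can be sketched in a few lines. This is not REBASE's actual algorithm, which the column does not describe. It is a simplified model that lays modules out back to back on 64 KB boundaries and counts how many modules overlap an earlier one. The Module structure, the function names, and the starting address passed to AssignBases are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Module {
    std::string   name;
    std::uint32_t imageSize;      // size of the image in memory, in bytes
    std::uint32_t preferredBase;  // preferred load address
};

// Round a size up to the next multiple of 64 KB.
std::uint32_t RoundUp64K(std::uint32_t size)
{
    const std::uint32_t granularity = 0x10000;
    return (size + granularity - 1) & ~(granularity - 1);
}

// Give each module its own address range, starting at firstBase.
void AssignBases(std::vector<Module>& modules, std::uint32_t firstBase)
{
    std::uint32_t next = firstBase;
    for (Module& m : modules) {
        m.preferredBase = next;
        next += RoundUp64K(m.imageSize);
    }
}

// Count modules whose preferred range overlaps an earlier module's range.
// These are the ones that would have to be relocated at load time.
std::size_t CountCollisions(const std::vector<Module>& modules)
{
    std::size_t collisions = 0;
    for (std::size_t i = 1; i < modules.size(); ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            const std::uint64_t aStart = modules[i].preferredBase;
            const std::uint64_t aEnd   = aStart + modules[i].imageSize;
            const std::uint64_t bStart = modules[j].preferredBase;
            const std::uint64_t bEnd   = bStart + modules[j].imageSize;
            if (aStart < bEnd && bStart < aEnd) {
                ++collisions;
                break;
            }
        }
    }
    return collisions;
}
```

With ten modules that all prefer 0x10000000, CountCollisions reports nine, matching the "every DLL but the first one" rule described above. After AssignBases it reports zero.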

Binding an executable builds on the premise that all DLLs can be tweaked to load at their preferred address. When you import a function from a DLL, the information necessary for the Windows® loader to find the imported function is stored in your executable. Typically, this information is the imported DLL's name and the name of the imported function. When the loader resolves an imported function, it's essentially executing the same code that GetProcAddress uses.

Normally at startup time, the loader spins through all the imported functions and looks up their addresses. However, if the imported DLLs don't change from run to run, the addresses that the loader gets back don't change either. An easy optimization is to write the target function's address to the importing executable, which is exactly what the BIND program does.

Normal Win32® executables have two identical copies of the information needed to look up an imported function. One is called the import address table (IAT), while the other is called the import name table. However, only one copy (the IAT) is required by the Win32 loader. The BIND program takes advantage of the fact that there are two copies of this information and overwrites the IAT entries with the imported function's actual addresses.

At load time, the loader checks to see if everything is kosher, and if so, uses the address that BIND has stored in your IAT. This eliminates the need to look up the function by its name. What if something is different from when the executable was bound? For instance, perhaps the imported DLL got loaded elsewhere. In this case, the loader uses the import name table information to do a normal lookup.
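The decision the loader makes can be modeled as follows. This is a simplified sketch, not the Windows loader's code. It assumes the validity check consists of two conditions (the DLL's timestamp matches the one recorded at bind time, and the DLL loaded at its preferred address). All structure and function names here are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct LoadedDll {
    std::uint32_t timeDateStamp;   // identifies the exact build of the DLL
    std::uint32_t preferredBase;
    std::uint32_t actualBase;      // where the DLL really ended up
    std::map<std::string, std::uint32_t> exportRvas;  // name -> offset from base
};

struct ImportEntry {
    std::string   name;            // kept in the import name table
    std::uint32_t iatSlot;         // IAT entry; holds BIND's address if bound
};

struct ImportTable {
    std::uint32_t boundTimeDateStamp;  // 0 means "never bound" in this model
    std::vector<ImportEntry> entries;
};

// Returns how many by-name lookups were needed. Zero means the addresses
// written at bind time were still valid and were used as is.
std::size_t ResolveImports(ImportTable& imports, const LoadedDll& dll)
{
    const bool bindingValid =
        imports.boundTimeDateStamp != 0 &&
        imports.boundTimeDateStamp == dll.timeDateStamp &&
        dll.actualBase == dll.preferredBase;
    if (bindingValid)
        return 0;

    // Fall back to the normal path: look each import up by name.
    std::size_t lookups = 0;
    for (ImportEntry& entry : imports.entries) {
        auto it = dll.exportRvas.find(entry.name);
        if (it == dll.exportRvas.end())
            continue;              // a real loader would fail the load here
        entry.iatSlot = dll.actualBase + it->second;
        ++lookups;
    }
    return lookups;
}
```

The model also shows why binding depends on basing: if the DLL is relocated, the second condition fails and every import goes back through the slow path.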

BIND.EXE is the best-known way to bind an executable. However, it optimizes your executables based upon your system DLLs. If you distribute your program to users, they probably will have different system DLLs, so you'll want to bind your executables on their system. The Windows Installer has the BindImage action, which looks pain-free to use (although I must confess I've never written an installation script). Alternatively, you can use the BindImageEx API that's part of IMAGEHLP.DLL.

The final performance optimization under the microscope this month is importing by ordinal. Normally when you import a function, your binary contains the name of the imported functions. When the Win32 loader looks up the imported names, it has to do string comparisons to match the names you're importing to the names exported by the DLLs you're importing from.

When a Portable Executable binary exports functions, it contains an array of offsets to the exported functions. When you import by ordinal, the importing binary contains an array index (here, called the ordinal) into this array. This method of finding the imported function's address is a simple array lookup, so it's very fast. Importing by name is a lot more work, since the Win32 loader takes the name and does a search to find the corresponding ordinal value. From there, the loader continues as if you had specified an ordinal in the first place. Importing by name just adds an additional layer of code on top of importing by ordinal.
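The two lookup paths can be compared in a small model of an export table. This is a sketch rather than loader source. It assumes the exported names are kept sorted so that a binary search can be used (the column only says the loader "does a search"), and it returns 0 for "not found" as a convention of the example. The ExportTable layout is a simplification of the real export directory.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified model of a PE export table.
struct ExportTable {
    std::uint32_t ordinalBase;                // ordinal of functions[0]
    std::vector<std::uint32_t> functions;     // offsets of the exported functions
    std::vector<std::string>   names;         // exported names, sorted
    std::vector<std::uint16_t> nameOrdinals;  // names[i] -> index into functions
};

// Import by ordinal: one subtraction and one array index.
std::uint32_t LookupByOrdinal(const ExportTable& t, std::uint32_t ordinal)
{
    const std::uint32_t index = ordinal - t.ordinalBase;  // wraps if too small
    return index < t.functions.size() ? t.functions[index] : 0;
}

// Import by name: search the name table first, then finish exactly as a
// by-ordinal lookup would.
std::uint32_t LookupByName(const ExportTable& t, const std::string& name)
{
    auto it = std::lower_bound(t.names.begin(), t.names.end(), name);
    if (it == t.names.end() || *it != name)
        return 0;
    const std::size_t i = static_cast<std::size_t>(it - t.names.begin());
    return LookupByOrdinal(t, t.ordinalBase + t.nameOrdinals[i]);
}
```

Note that LookupByName ends by calling LookupByOrdinal, which mirrors the point made above: the by-name path is the by-ordinal path plus a string search.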

So why does Win32 allow importing and exporting by name? There's a variety of reasons, but two immediately come to mind. First, it can be a pain to keep the same export ordinal assigned to a given API if its DLL evolves over time, as the Win32 system DLLs tend to do. Keeping track of hundreds of functions in a .DEF file can be tedious. In addition, there's more than just one KERNEL32.DLL; there's one for Windows 2000, one for Windows 98, and so on. Second, exporting by name allows you to use the function's name with GetProcAddress, rather than its ordinal value. If you dump a random selection of import libraries from the Platform SDK or Visual Studio, you'll find that most DLLs export by name, but a few export by ordinal.

How do you import and export by ordinal? The importing part is actually done automatically for you by the Microsoft linker. However, in exchange you must export the APIs by ordinal. When the linker generates the import library corresponding to a DLL, it generates import records that will tell the linker how the APIs should be imported. The best way to export by ordinal is to explicitly tell the linker through a .DEF file. For instance, in a .DEF file, you'd have:

EXPORTS 
    MyExportedAPI    @1

If you don't use the @1 modifier, the Microsoft linker exports the API by name.

Besides the faster load time associated with importing and exporting by ordinal, there's another more subtle benefit. When exporting an API by ordinal, you can tell the linker not to store the exported API name in the exporting DLL. This means a smaller export section, and potentially a smaller binary with less data to demand page in. To eliminate the API name, use the NONAME modifier when exporting by ordinal.

EXPORTS 
    MyExportedAPI    @1 NONAME

If you look at MFC42.DLL, you'll see that it exports almost all of its 6000+ APIs by ordinal, and with the NONAME modifier. Imagine the added bulk if MFC42.DLL had to store all 6000 mangled C++ names in its exports!

Creating the Optimization Tests

To test the effects of proper basing, binding, and importing by ordinal, I wrote a program called MakeLoadTimeTest.EXE that generates the program to be benchmarked. The program I created allows me to easily tweak things like the number of imported functions, how many DLLs it calls, and the number of relocations each DLL has. The source for MakeLoadTimeTest is in this month's download files.

The MakeLoadTimeTest.CPP code isn't particularly pretty. However, you don't have to read it too closely, since the things you might want to change are isolated at the top—particularly these three lines:

const unsigned nDLLs = 10; 
const unsigned nExportedFunctions = 100; 
const unsigned nGlobalVariablesPerFunction = 5;

Based on these constants, the generated program (LoadTimeTest) will import 10 DLLs (in addition to KERNEL32.DLL). Each of these DLLs will export 100 functions, and the main executable will import all 100 functions. Finally, each exported function references five global variables. Why reference a global variable? It's an easy way to force a base relocation to be generated. The more base relocations, the more work the loader needs to do when a DLL doesn't load at its preferred address.

Figure 1 shows the code generated for one exported function. It starts out with five global variable declarations (for example, g_var_n2_0). Next is a #ifdef block that lets you decide at compile time whether the function is exported by name or by ordinal. Finally, the function itself (in this case, LoadTimeDLL_10_func_2) simply stores a value into previously declared variables. Why the funny variable and function names? The numbers at the end of the names make them all unique, so I avoid naming collisions.

void * g_var_n2_0; 
void * g_var_n2_1; 
void * g_var_n2_2; 
void * g_var_n2_3; 
void * g_var_n2_4; 
#ifdef ORDINAL_EXPORTS 
#pragma comment( linker, "/EXPORT:_LoadTimeDLL_10_func_2,@2,NONAME") 
#else 
#pragma comment( linker, "/EXPORT:_LoadTimeDLL_10_func_2") 
#endif 
extern "C" void LoadTimeDLL_10_func_2(void) 
{
   g_var_n2_0 = LoadTimeDLL_10_func_2;
   g_var_n2_1 = LoadTimeDLL_10_func_2;
   g_var_n2_2 = LoadTimeDLL_10_func_2;
   g_var_n2_3 = LoadTimeDLL_10_func_2;
   g_var_n2_4 = LoadTimeDLL_10_func_2;
}

Figure 1 A LoadTimeTest Function
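To give a feel for how a generator can produce code like Figure 1, here is a small sketch. It is not the MakeLoadTimeTest source, which is only available in the download. The function name GenerateExportedFunction is hypothetical, and using the function index as the export ordinal is inferred from Figure 1, where func_2 is exported as @2.

```cpp
#include <sstream>
#include <string>

// Emit the source text for one exported function in the style of Figure 1.
std::string GenerateExportedFunction(unsigned dllIndex, unsigned funcIndex,
                                     unsigned nGlobalVariablesPerFunction)
{
    std::ostringstream nameBuilder;
    nameBuilder << "LoadTimeDLL_" << dllIndex << "_func_" << funcIndex;
    const std::string func = nameBuilder.str();

    std::ostringstream out;

    // Global variables; referencing them forces base relocations.
    for (unsigned v = 0; v < nGlobalVariablesPerFunction; ++v)
        out << "void * g_var_n" << funcIndex << "_" << v << ";\n";

    // Export by ordinal or by name, selected at compile time.
    out << "#ifdef ORDINAL_EXPORTS\n"
        << "#pragma comment( linker, \"/EXPORT:_" << func
        << ",@" << funcIndex << ",NONAME\")\n"
        << "#else\n"
        << "#pragma comment( linker, \"/EXPORT:_" << func << "\")\n"
        << "#endif\n";

    // The function body stores its own address into each global.
    out << "extern \"C\" void " << func << "(void)\n{\n";
    for (unsigned v = 0; v < nGlobalVariablesPerFunction; ++v)
        out << "   g_var_n" << funcIndex << "_" << v << " = " << func << ";\n";
    out << "}\n";

    return out.str();
}
```

Calling GenerateExportedFunction(10, 2, 5) yields text with the same shape as Figure 1. Raising the third argument is the knob for adding relocations per function.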

As for the LoadTimeTest executable, it only needs a simple function main that references all the functions exported by the generated DLLs. The entire function is over 1000 lines long, but here's the relevant snippet:

#include "LoadTimeTest.H"
#include <windows.h> 

int main(int argc) 
{
     TerminateProcess( GetCurrentProcess(), 0 );
     LoadTimeDLL_1_func_1();
     LoadTimeDLL_1_func_2();
     LoadTimeDLL_1_func_3();
     LoadTimeDLL_1_func_4();
     // ...
}

You might be wondering why I used the TerminateProcess call. In an ideal world, I would be able to time just how long it takes to load my target process, right up to the point where its entry point is invoked. However, I couldn't come up with a simple way to do exactly this.

The hack I eventually decided on is to make the main function call TerminateProcess. This kills the process immediately without sending the DLL_PROCESS_DETACH notifications to the DLLs. In addition, because I don't actually call all the generated APIs (such as LoadTimeDLL_1_func_1), I don't incur the overhead of demand paging the code in. However, because I referenced all the generated exported functions in function main, the loader is forced to load the DLLs and potentially apply the base relocations.

Besides the code generator (MakeLoadTimeTest), and the generated program (LoadTimeTest), there's one more program in this benchmarking suite. The LoadTimer executable runs the LoadTimeTest program and times how long it takes to execute. Because Windows is a preemptive multitasking system, I expended a fair amount of effort to get fairly reliable timing information. The source code in the download files contains all the details, but it's worth summarizing here.

For starters, LoadTimer doesn't rely on a single run of LoadTimeTest. The first time LoadTimeTest runs, disk overhead adds to its load time. Subsequent runs are usually faster because the operating system has cached the pages from the EXE and DLLs. I time 30 invocations of LoadTimeTest and use the fastest run—the one with the least amount of external overhead from factors such as thread switching and interrupt processing.

In order to minimize the overhead of these external events, LoadTimer sets its priority to REALTIME_PRIORITY_CLASS. In addition, when it starts LoadTimeTest.EXE, the code specifies REALTIME_PRIORITY_CLASS for the process. This ensures that both the CreateProcess code executed in the LoadTimer process, as well as the code in the LoadTimeTest process, are run with the highest possible priority. As a result, the effect of external happenings should be minimized.

For the timing, I used the QueryPerformanceCounter API. I originally had tinkered with using the x86 architecture's RDTSC instruction, which can be as accurate as a single CPU clock cycle. However, it requires knowing the CPU's speed to calculate an actual time. While you can read the CPU's speed from the registry in Windows NT® (HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor\0), this number isn't exact. For instance, on my 550 MHz machine, the registry reports a speed of 548 MHz. (Who can I sue for the missing 2 MHz?) In the end, the granularity of QueryPerformanceCounter seemed perfectly adequate for the durations under consideration.
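The "time many runs, keep the fastest" approach can be sketched portably. This is not the LoadTimer source. To stay platform-neutral it swaps QueryPerformanceCounter for std::chrono::steady_clock, and the task parameter stands in for launching the target program and waiting for it to exit. TicksToSeconds shows the tick-to-time conversion that a raw counter requires. Both function names are made up for the example.

```cpp
#include <chrono>
#include <functional>
#include <limits>

// Convert a raw tick count to seconds, given the counter's frequency.
double TicksToSeconds(long long ticks, long long ticksPerSecond)
{
    return static_cast<double>(ticks) / static_cast<double>(ticksPerSecond);
}

// Run task `runs` times and return the shortest wall-clock duration, in
// seconds. The fastest run is the one least disturbed by outside activity.
// Returns the largest double if runs is zero or negative.
double FastestRunSeconds(const std::function<void()>& task, int runs)
{
    using Clock = std::chrono::steady_clock;
    double fastest = std::numeric_limits<double>::max();
    for (int i = 0; i < runs; ++i) {
        const Clock::time_point start = Clock::now();
        task();
        const std::chrono::duration<double> elapsed = Clock::now() - start;
        if (elapsed.count() < fastest)
            fastest = elapsed.count();
    }
    return fastest;
}
```

Taking the minimum rather than the average is a deliberate choice: interference from the rest of the system can only make a run slower, never faster, so the minimum is the best estimate of the undisturbed cost.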

To make my benchmark code flexible, I made LoadTimer usable on any program that will exit on its own without user intervention. LoadTimer takes a command-line argument, specifying the file name to run. In my test, the command-line argument was LoadTimeTest.EXE.

The LoadTimeTest Benchmark Process

Here are the steps needed to reproduce my results from the code files in the download. First, build MakeLoadTimeTest and LoadTimer from the project files. Next, run MakeLoadTimeTest.EXE. The results will be 11 .CPP and 11 .H files. Running the BuildLoadTimeTest.BAT file compiles these files into LoadTimeTest.EXE and 10 associated DLLs. If you specify "ORDINAL" as the argument to BuildLoadTimeTest.BAT, LoadTimeTest.EXE imports the APIs by ordinal; otherwise, they're imported by name.

If you look at BuildLoadTimeTest.BAT, you'll see that it uses the linker defaults for the 10 DLLs. Thus, all of the generated DLLs will have the same preferred load address: 0x10000000. This is intentional, as it starts the benchmark with nine DLLs that will be relocated at runtime.

Now, on to the actual testing. First, run the command

LoadTimer.EXE LoadTimeTest.EXE

several times, and record the lowest time. This is the worst-case-scenario timing.

Now, let's fix the problem of all those DLLs needing to be relocated. Run the RebaseLoadTimeTest.BAT file, which uses REBASE.EXE on the EXE and the 10 generated DLLs, so that each one has a unique load address. Rerun the timing sequence, and record the lowest time. This gives you a feel for how much rebasing can affect loading times.

Now that the EXE and all the DLLs are loading at their preferred load address, it's worthwhile to see what additional gains can be had by binding them. Run BindLoadTimeTest.BAT and then rerun the timing sequence, again recording the lowest time.

At this point, you should have three load times: the default time without any intervention, the time after rebasing the executables, and the time after basing and binding. To see the effect of importing by ordinal, rerun the previous tests, but with one change: at the beginning, specify "ORDINAL" as the argument when running BuildLoadTimeTest.BAT.

LoadTimeTest Results

Before getting to the actual numbers, let me first say that I was amazed at how fast programs can load. I intentionally created LoadTimeTest.EXE to make a lot of work for the Win32 loader. It has a fair number of DLLs and lots of exported functions and relocations. Even under the slowest scenario, my machine still loaded the program under Windows 2000 in less than 1/50th of a second. If your program takes a long time to load, don't blame the loader. The problem is almost certainly that somebody's initialization code is taking too long.

Figure 2 shows the results I obtained. The test machine was a Dell XPS T550 with a single Pentium III CPU running at 550 MHz. The only visible process running (other than the Explorer shell) was a command prompt. The tests were run from a FAT16 partition so that I could test on both Windows 2000 and Windows 9x.

================================================
Windows 2000 By Name, FAT16

Default:
Fastest time: 58286 ticks, 0.016283 seconds
Ticks: 58286
Ticks/second: 3579545

Based:
Fastest time: 52611 ticks, 0.014698 seconds
Ticks: 52611
Ticks/second: 3579545

Bound:
Fastest time: 49006 ticks, 0.013691 seconds
Ticks: 49006
Ticks/second: 3579545

================================================
Windows 2000 By Ordinal, FAT16

Default:
Fastest time: 57824 ticks, 0.016154 seconds
Ticks: 57824
Ticks/second: 3579545

Based:
Fastest time: 50609 ticks, 0.014138 seconds
Ticks: 50609
Ticks/second: 3579545

Bound:
Fastest time: 49251 ticks, 0.013759 seconds
Ticks: 49251
Ticks/second: 3579545

================================================
Windows 98 SE By Name, FAT16

Default:
Fastest time: 32738 ticks, 0.027438 seconds
Ticks: 32738
Ticks/second: 1193180

Based:
Fastest time: 30150 ticks, 0.025269 seconds
Ticks: 30150
Ticks/second: 1193180

Bound:
Fastest time: 28944 ticks, 0.024258 seconds
Ticks: 28944
Ticks/second: 1193180

================================================
Windows 98 SE By Ordinal, FAT16

Default:
Fastest time: 31812 ticks, 0.026662 seconds
Ticks: 31812
Ticks/second: 1193180

Based:
Fastest time: 29569 ticks, 0.024782 seconds
Ticks: 29569
Ticks/second: 1193180

Bound:
Fastest time: 29513 ticks, 0.024735 seconds
Ticks: 29513
Ticks/second: 1193180

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

================================================
Windows 2000 By Name, Compressed NTFS

Default:
Fastest time: 61087 ticks, 0.017066 seconds
Ticks: 61087
Ticks/second: 3579545

Based:
Fastest time: 53834 ticks, 0.015039 seconds
Ticks: 53834
Ticks/second: 3579545

Bound:
Fastest time: 50388 ticks, 0.014077 seconds
Ticks: 50388
Ticks/second: 3579545


================================================
Windows 2000 By Ordinal, Compressed NTFS

Default:
Fastest time: 58607 ticks, 0.016373 seconds
Ticks: 58607
Ticks/second: 3579545

Based:
Fastest time: 51755 ticks, 0.014459 seconds
Ticks: 51755
Ticks/second: 3579545

Bound:
Fastest time: 50204 ticks, 0.014025 seconds
Ticks: 50204
Ticks/second: 3579545

Figure 2 LoadTimeTest Results

You can slice the data in a variety of ways, but here are the numbers I found interesting. On Windows 2000, properly basing the DLLs improved the load time by roughly 12 percent. Basing and binding the EXE and the DLLs improved the load time by around 18 percent. Importing by ordinal versus importing by name provided a 4 percent improvement. On Windows 98, Second Edition, properly basing the DLLs improved the load time by roughly 8 percent. Basing and binding the EXE and the DLLs improved the load time by around 12 percent. Importing by ordinal versus importing by name provided a mere 2 percent improvement.
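For readers who want to slice the data themselves, the arithmetic is a simple relative difference of tick counts. The column does not say exactly which rows of Figure 2 each quoted percentage comes from, so the helper below is only a sketch. Fed the Windows 98 SE by-name rows, it gives about 7.9 percent for basing (32738 to 30150 ticks) and about 11.6 percent for basing plus binding (32738 to 28944 ticks). Comparing the based by-name and by-ordinal rows (30150 to 29569 ticks) gives about 1.9 percent.

```cpp
// Percentage by which `after` improves on `before`. Both arguments are
// tick counts taken from the same counter, so the units cancel out.
double PercentImprovement(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

Because both operands come from the same machine and counter, there is no need to convert ticks to seconds first. The same call works on any pair of rows in Figure 2 that share a Ticks/second value.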

Importing by name instead of ordinal doesn't affect the load time very much when the program is properly based and bound. When everything is properly configured, the loader can read the address of the imported function directly from the DLL doing the importing. The loader doesn't need to bother looking up names or indexing into arrays to get the address of the function.

Random Observations

While working through the code, I was thinking about the effects of DLLs loading at a nonpreferred address. In my test code, each of the DLLs exports 100 APIs, yet those APIs are never called. Because of demand paging, it's conceivable that the pages containing those APIs might not be brought into memory. As such, the overhead of applying the base relocations might not apply. With this in mind, I contacted a performance expert at Microsoft Research. He told me that under Windows 9x, my hypothesis about not needing to apply the base relocations was correct.

However, under Windows NT and Windows 2000 the pages are set temporarily to read/write, are modified, and then returned to their original permission. As a result, any pages modified in this way are no longer shared between processes. Essentially, under Windows NT and Windows 2000, any executable that does not load at its preferred load address will have all of its code and data demand paged in when the executable first loads.

On a different note, I was surprised at the results of QueryPerformanceFrequency under Windows 2000. Looking at Figure 2, notice that under Windows 2000 the timer operates at 3.57 MHz. This is exactly three times the 1.19 MHz frequency used on Windows 98. If you're a PC old-timer, you may recall that motherboards traditionally have a 1.19 MHz oscillator that drives the 8254 chip.

Intrigued, I ran my tests on a 200 MHz Pentium Pro running Windows 2000 and got the 1.19 MHz frequency. I then ran LoadTimer on some dual-processor machines and found that the frequency matched the CPU speed. The conclusion I've drawn from this is that Windows 2000 observes what the motherboard is capable of doing and uses the best available timer frequency for the QueryPerformanceXXX APIs. Yes, I admit to being a nerd and spending way too much time on this minor detail, but I enjoyed the experimentation and the opportunity to ponder things at the hardware level.


Matt Pietrek performs advanced research for the NuMega Labs of Compuware Corporation, and is the author of several books. His Web site, at http://www.wheaty.net, has a FAQ page and information on previous columns and articles.


From the May 2000 issue of MSDN Magazine.

 

 

Reposted from the Chinese-language MSDN page (published 7/28/2004, updated 7/28/2004): http://www.microsoft.com/china/MSDN/library/enterprisedevelopment/softwaredev/OptimDLLLoadTimePer.mspx?mfr=true
