Michael Abrash's Black Book -- Chapter 1 -- Section 1.3


1.3 Rules for Building High-Performance Code
 

We've got the following rules for creating high-performance software:
-- Know where you're going (understand the objective of the software).
-- Make a big map (have an overall program design firmly in mind, so the various parts of the program and the data structures work well together).
-- Make lots of little maps (design an algorithm for each separate part of the overall design).
-- Know the territory (understand exactly how the computer carries out each task).
-- Know when it matters (identify the portions of your programs where performance matters, and don't waste your time optimizing the rest).
-- Always consider the alternatives (don't get stuck on a single approach; odds are there's a better way, if you're clever and inventive enough).
-- Know how to turn on the juice (optimize the code as best you know how when it does matter).

(Blogger's note: "turn on the juice" is an idiom meaning, roughly, "switch on the power"; it is hard to translate neatly.)

Making rules is easy; the hard part is figuring out how to apply them in the real world. For my money, examining some actual working code is always a good way to get a handle on programming concepts, so let's look at some of the performance rules in action.

1.3.1 Know Where You're Going

If we're going to create high-performance code, first we have to know what that code is going to do. As an example, let's write a program that generates a 16-bit checksum of the bytes in a file. In other words, the program will add each byte in a specified file in turn into a 16-bit value. This checksum value might be used to make sure that a file hasn't been corrupted, as might occur during transmission over a modem or if a Trojan horse virus rears its ugly head. We're not going to do anything with the checksum value other than print it out, however; right now we're only interested in generating that checksum value as rapidly as possible.

(Blogger's note: in the 1990s people went online through modems, which were far slower than today's broadband.)

1.3.2 Make a Big Map

How are we going to generate a checksum value for a specified file? The logical approach is to get the file name, open the file, read the bytes out of the file, add them together, and print the result. Most of those actions are straightforward; the only tricky part lies in reading the bytes and adding them together.

1.3.3 Make Lots of Little Maps

Actually, we're only going to make one little map, because we only have one program section that requires much thought -- the section that reads the bytes and adds them up. What's the best way to do this?

It would be convenient to load the entire file into memory and then sum the bytes in one loop. Unfortunately, there's no guarantee that any particular file will fit in the available memory; in fact, it's a sure thing that many files won't fit into memory, so that approach is out.

Well, if the whole file won't fit into memory, one byte sure will. If we read the file one byte at a time, adding each byte to the checksum value before reading the next byte, we'll minimize memory requirements and be able to handle any size file at all.

Sounds good, eh? Listing 1.1 shows an implementation of this approach. Listing 1.1 uses C's read() function to read a single byte, adds the byte into the checksum value, and loops back to handle the next byte until the end of the file is reached. The code is compact, easy to write, and functions perfectly -- with one slight hitch: It's slow.

(Blogger's note: "one slight hitch" is presumably meant ironically.)

LISTING 1.1 L1-1.C

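/*
* Program to calculate the 16-bit checksum of all bytes in the
* specified file. Obtains the bytes one at a time via read(),
* letting DOS perform all data buffering.
*/
#include <stdio.h>
#include <stdlib.h>  /* exit() */
#include <fcntl.h>   /* O_RDONLY, O_BINARY */
#include <io.h>      /* open(), read() on the DOS compilers */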

main(int argc, char *argv[]) {
  int Handle;
  unsigned char Byte;
  unsigned int Checksum;
  int ReadLength;
  
  if(argc != 2) {
    printf("usage: checksum filename\n");    
    exit(1);
  }
  if( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1) {
      printf("Can't open file: %s\n", argv[1]);
      exit(1);
  }
  
  /* Initialize the checksum accumulator */
  Checksum = 0;
  
  /* Add each byte in turn into the checksum accumulator */
  while( (ReadLength = read(Handle, &Byte, sizeof(Byte))) > 0) {
      Checksum += (unsigned int) Byte;
  }
  if( ReadLength == -1) {
      printf("Error reading file %s\n", argv[1]);
      exit(1);
  }
  
  /*  Report the result */
  printf("The checksum is: %u\n", Checksum);
  exit(0);
}

Table 1.1 shows the time taken for Listing 1.1 to generate a checksum of the WordPerfect version 4.2 thesaurus file, TH.WP (362,293 bytes in size), on a 10 MHz AT machine of no special parentage. Execution times are given for Listing 1.1 compiled with Borland and Microsoft compilers, with optimization both on and off; all four times are pretty much the same, however, and all are much too slow to be acceptable. Listing 1.1 requires over two and one-half minutes to checksum one file!

(Blogger's note: AT-compatible machines were common at the time.)

Warning: Listings 1.2 and 1.3 form the C/assembly equivalent to Listing 1.1, and Listings 1.6 and 1.7 form the C/assembly equivalent to Listing 1.5.
 

These results make it clear that it's folly to rely on your compiler's optimization to make your programs fast. Listing 1.1 is simply poorly designed, and no amount of compiler optimization will compensate for that failing. To drive home the point, consider Listings 1.2 and 1.3, which together are equivalent to Listing 1.1 except that the entire checksum loop is written in tight assembly code. The assembly language implementation is indeed faster than any of the C versions, as shown in Table 1.1, but it's less than 10 percent faster, and it's still unacceptably slow.

Table 1.1 (execution times in seconds)

Listing  Borland   Microsoft  Borland  Microsoft  Assembly  Optimization
         (no opt)  (no opt)   (opt)    (opt)                Ratio
1        166.9     166.8      167.0    165.8      155.1     1.08
4         13.5      13.6       13.5     13.5      ...       1.01
5          4.7       5.5        3.8      3.4        2.7     2.04
Ratio     35.51     30.33      43.95    48.76      57.44
 

Note: The execution times (in seconds) for this chapter's listings were measured by running the compiled listings on the WordPerfect 4.2 thesaurus file TH.WP (362,293 bytes in size), as compiled in the small model with Borland and Microsoft compilers with optimization on (opt) and off (no opt). All times were measured with Paradigm Systems' TIMER program on a 10 MHz 1-wait-state AT clone with a 28-ms hard disk, with disk caching turned off.

(Blogger's note: performance results only mean something when the specific data, machine configuration, and compiler options are stated. Performance figures quoted without those preconditions are worthless.)
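The ratios in Table 1.1 can be reproduced from the timings. The sketch below rests on an inference from the numbers rather than on anything the text states: that the bottom row divides the Listing 1 time by the Listing 5 time in each column, and that the Optimization Ratio column divides the slowest time in a row by the fastest. The helper names speedup and optimization_ratio are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the slower time to the faster time; both in seconds. */
double speedup(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}

/* Slowest time in a row divided by the fastest. This is the assumed
 * meaning of the "Optimization Ratio" column, not a stated definition. */
double optimization_ratio(const double *times, size_t count)
{
    double lo, hi;
    size_t i;

    assert(times != NULL && count > 0);
    lo = hi = times[0];
    for (i = 1; i < count; i++) {
        if (times[i] < lo) lo = times[i];
        if (times[i] > hi) hi = times[i];
    }
    return speedup(hi, lo);
}
```

Under that reading, the Listing 1 row gives 167.0 / 155.1, which rounds to 1.08, and speedup(166.9, 4.7) rounds to 35.51, both matching the table.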

LISTING 1.2 L1-2.C
/*
 * Program to calculate the 16-bit checksum of the stream of bytes
 *  from the specified file. Obtains the bytes one at a time in 
 * assembler, via direct calls to DOS.
*/
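#include <stdio.h>
#include <stdlib.h>  /* exit() */
#include <fcntl.h>   /* O_RDONLY, O_BINARY */
#include <io.h>      /* open() on the DOS compilers */

/* Assembly subroutine defined in Listing 1.3 */
int ChecksumFile(unsigned int Handle, unsigned int *Checksum);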

main(int argc, char *argv[]) {
    int Handle;
    unsigned char Byte;
    unsigned int Checksum;
    int ReadLength;
    
    if( argc != 2 ) {
        printf("usage: checksum filename\n");
        exit(1);
    }
    
    if( (Handle = open(argv[1], O_RDONLY|O_BINARY)) == -1) {
        printf("Can't open file: %s\n", argv[1]);
        exit(1);
    }
    
    if(!ChecksumFile(Handle, &Checksum)) {
        printf("Error reading file: %s\n", argv[1]);
        exit(1);
    }
    
    /*  Report the result*/
    printf("The checksum is: %u\n", Checksum);
    exit(0);
}

LISTING 1.3 L1-3.ASM

; Assembler subroutine to perform a 16-bit checksum on the file
; opened on the passed-in handle. Stores the result in the
; passed-in checksum variable. Returns 1 for success, 0 for error.
;
; Call as:
;       int ChecksumFile(unsigned int Handle, unsigned int *Checksum);
;
; Where:
;       Handle = handle # under which file to checksum is open
;       Checksum = pointer to unsigned int variable checksum is
;               to be stored in
;
; Parameter structure:
;
Parms   struc
                dw      ?       ;pushed BP
                dw      ?       ;return address
Handle          dw      ?
Checksum        dw      ?
Parms   ends
;
        .model small
        .data
TempWord label  word
TempByte        db      ?       ;each byte read by DOS will be stored here
                db      0       ;high byte of TempWord is always 0
                                ;for 16-bit adds
;
        .code
        public _ChecksumFile
_ChecksumFile proc near
        push    bp
        mov     bp,sp
        push    si              ;save C's register variable
;
        mov     bx,[bp+Handle]  ;get file handle
        sub     si,si           ;zero the checksum accumulator
        mov     cx,1            ;request one byte on each read
        mov     dx,offset TempByte ;point DX to the byte in
                                ;which DOS should store
                                ;each byte read
ChecksumLoop:
        mov     ah,3fh          ;DOS read file function #
        int     21h             ;read the byte
        jc      ErrorEnd        ;an error occurred
        and     ax,ax           ;any bytes read?
        jz      Success         ;no-end of file reached-we're done
        add     si,[TempWord]   ;add the byte into the
                                ;checksum total
        jmp     ChecksumLoop
ErrorEnd:
        sub     ax,ax           ;error
        jmp     short Done
Success:
        mov     bx,[bp+Checksum] ;point to the checksum variable
        mov     [bx],si         ;save the new checksum
        mov     ax,1            ;success
;
Done:
        pop     si              ;restore C's register variable
        pop     bp
        ret
_ChecksumFile endp
        end

(Blogger's note: I have forgotten most of my assembly language by now.)

The lesson is clear: Optimization makes code faster, but without proper design, optimization just creates fast slow code.

Well, then, how are we going to improve our design? Before we can do that we have to understand what's wrong with the current design.

1.3.4 Know the Territory

Just why is Listing 1.1 so slow? In a word: overhead. The C library implements the read() function by calling DOS to read the desired number of bytes. (I figured this out by watching the code execute with a debugger, but you can buy library source code from both Microsoft and Borland.) That means that Listing 1.1 (and Listing 1.3 as well) executes one DOS function per byte processed -- and DOS functions, especially this one, come with a lot of overhead.

For starters, DOS functions are invoked with interrupts, and interrupts are among the slowest instructions of the x86 family CPUs. Then, DOS has to set up internally and branch to the desired function, expending more cycles in the process. Finally, DOS has to search its own buffers to see if the desired byte has already been read, read it from the disk if not, store the byte in the specified location, and return. All of that takes a long time -- far, far longer than the rest of the main loop in Listing 1.1. In short, Listing 1.1 spends virtually all of its time executing read(), and most of that time is spent somewhere down in DOS.

You can verify this for yourself by watching the code with a debugger or using a code profiler, but take my word for it: There's a great deal of overhead to DOS calls, and that's what's draining the life out of Listing 1.1.
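The figures already quoted give a rough sense of that per-call cost. The sketch below is only back-of-the-envelope arithmetic: it assumes the whole run time of Listing 1.1 is spent servicing read requests, which overstates the cost slightly, and the helper names read_requests and ms_per_request are hypothetical.

```c
#include <assert.h>

/* Number of read requests that return data when a file of file_size
 * bytes is read chunk_size bytes at a time (rounds up). The final
 * request that returns 0 at end of file is not counted. */
unsigned long read_requests(unsigned long file_size, unsigned long chunk_size)
{
    assert(chunk_size > 0);
    return (file_size + chunk_size - 1) / chunk_size;
}

/* Average cost of one request in milliseconds, assuming (roughly) that
 * the whole run time is spent servicing read requests. */
double ms_per_request(double total_seconds, unsigned long requests)
{
    assert(requests > 0);
    return total_seconds * 1000.0 / (double)requests;
}
```

For the 362,293-byte TH.WP file, reading one byte at a time means 362,293 requests; at 166.9 seconds in total that works out to a little under half a millisecond per request on the test machine. Reading in 32K (0x8000-byte) chunks instead would take only 12 requests.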
 

How can we speed up Listing 1.1? It should be clear that we must somehow avoid invoking DOS for every byte in the file, and that means reading more than one byte at a time, then buffering the data and parceling it out for examination one byte at a time. By gosh, that's a description of C's stream I/O feature, whereby C reads files in chunks and buffers the bytes internally, doling them out to the application as needed by reading them from memory rather than calling DOS. Let's try using stream I/O and see what happens.

Listing 1.4 is similar to Listing 1.1, but uses fopen() and getc() (rather than open() and read()) to access the file being checksummed. The results confirm our theories splendidly, and validate our new design. As shown in Table 1.1, Listing 1.4 runs more than an order of magnitude faster than even the assembly version of Listing 1.1, even though Listing 1.1 and Listing 1.4 look almost the same. To the casual observer, read() and getc() would seem slightly different but pretty much interchangeable, and yet in this application the performance difference between the two is about the same as that between a 4.77 MHz PC and a 16 MHz 386.

(Blogger's note: there is not much "design" involved here, really. Today's students probably won't feel the force of the PC-versus-386 comparison, but I do: a 386 was a fine machine back then. Most machines in my school's computer lab were 286s, and everyone fought over the few 386s.)

Warning: Make sure you understand what really goes on when you insert a seemingly-innocuous function call into the time-critical portions of your code.

(Blogger's note: at work I often meet people who take this for granted: "I only changed one line, it can't matter much." Then something breaks, the root cause is hard to find, and a lot of time is wasted.)

In this case that means knowing how DOS and the C/C++ file-access libraries do their work. In other words, know the territory!

LISTING 1.4 L1-4.C

/*
* Program to calculate the 16-bit checksum of the stream of bytes
* from the specified file. Obtains the bytes one at a time via
* getc(), allowing C to perform data buffering.
*/
#include <stdio.h>
#include <stdlib.h>  /* exit() */
main(int argc, char *argv[]) {
    FILE *CheckFile;
    int Byte;
    unsigned int Checksum;
    
    if(argc != 2){
        printf("usage: checksum filename\n");
        exit(1);
    }
    
    if( (CheckFile = fopen(argv[1], "rb")) == NULL){
        printf("Can't open file: %s\n", argv[1]);
        exit(1);
    }
    
    /* Initialize the checksum accumulator */
    Checksum = 0;
    
    /*  Add each byte in turn into the checksum accumulator  */
    while( (Byte =getc(CheckFile)) != EOF){
        Checksum += (unsigned int) Byte;
    }
    
    /*  Report the result  */
    printf("The checksum is: %u\n", Checksum);
    exit(0);
}
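Listing 1.4 relies on unsigned int being 16 bits wide, which held for the DOS compilers used here but does not hold on most current platforms. The following is a minimal portable sketch of the same getc() approach, assuming a C99 compiler with <stdint.h>; the function name checksum_stream is hypothetical and not part of the original listings. It also uses ferror() to tell a read error apart from end of file, something Listing 1.4 does not do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Sum every byte of an open binary stream into a 16-bit checksum.
 * Returns 1 on success and stores the result in *checksum;
 * returns 0 if a read error occurred. */
int checksum_stream(FILE *fp, uint16_t *checksum)
{
    uint16_t sum = 0;
    int byte;

    assert(fp != NULL && checksum != NULL);
    while ((byte = getc(fp)) != EOF) {
        sum = (uint16_t)(sum + (unsigned)byte);   /* wraps modulo 65536 */
    }
    if (ferror(fp)) {
        return 0;
    }
    *checksum = sum;
    return 1;
}
```

A caller would open the file with fopen(name, "rb"), pass the stream in, and print the result with %u after converting it to unsigned int.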

1.3.5 Know When It Matters

The last section contained a particularly interesting phrase: the time-critical portions of your code. Time-critical portions of your code are those portions in which the speed of the code makes a significant difference in the overall performance of your program -- and by "significant", I don't mean that it makes the code 100 percent faster, or 200 percent, or any particular amount at all, but rather that it makes the program more responsive and/or usable from the user's perspective.

Don't waste time optimizing non-time-critical code: set-up code, initialization code, and the like. Spend your time improving the performance of the code inside heavily-used loops and in the portions of your programs that directly affect response time. Notice, for example, that I haven't bothered to implement a version of the checksum program entirely in assembly; Listings 1.2 and 1.6 call assembly subroutines that handle the time-critical operations, but C is still used for checking command-line parameters, opening files, printing, and the like.

If you were to implement any of the listings in this chapter entirely in hand-optimized assembly, I suppose you might get a performance improvement of a few percent -- but I rather doubt you'd get even that much, and you'd sure as heck spend an awful lot of time for whatever meager improvement does result. Let C do what it does well, and use assembly only when it makes a perceptible difference.

(Blogger's note: in other words, the gain would not be worth the effort.)

Besides, we don't want to optimize until the design is refined to our satisfaction, and that won't be the case until we've thought about other approaches.
 

1.3.6 Always Consider the Alternatives

Listing 1.4 is good, but let's see if there are other -- perhaps less obvious -- ways to get the same results faster. Let's start by considering why Listing 1.4 is so much better than Listing 1.1. Like read(), getc() calls DOS to read from the file; the speed improvement of Listing 1.4 over Listing 1.1 occurs because getc() reads many bytes at once via DOS, then manages those bytes for us. That's faster than reading them one at a time using read() -- but there's no reason to think that it's faster than having our program read and manage blocks itself. Easier, yes, but not faster.

(Blogger's note: this paragraph is a little hard to follow.)

Consider this: Every invocation of getc() involves pushing a parameter, executing a call to the C library function, getting the parameter (in the C library code), looking up information about the desired stream, unbuffering the next byte from the stream, and returning to the calling code. That takes a considerable amount of time, especially by contrast with simply maintaining a pointer to a buffer and whizzing through the data in the buffer inside a single loop.

There are four reasons that many programmers would give for not trying to improve on Listing 1.4:
1. The code is already fast enough.
2. The code works, and some people are content with code that works, even when it's slow enough to be annoying.
3. The C library is written in optimized assembly, and it's likely to be faster than any code that the average programmer could write to perform essentially the same function.
4. The C library conveniently handles the buffering of file data, and it would be a nuisance to have to implement that capability.

(Blogger's note: the fourth reason really comes down to laziness.)

I'll ignore the first reason, both because performance is no longer an issue if the code is fast enough and because the current application does not run fast enough -- 13 seconds is a long time. (Stop and wait for 13 seconds while you're doing something intense, and you'll see just how long it is.)

The second reason is the hallmark of the mediocre programmer. Know when optimization matters -- and then optimize when it does!

The third reason is often fallacious. C library functions are not always written in assembly, nor are they always particularly well-optimized. (In fact, they're often written for portability, which has nothing to do with optimization.) What's more, they're general-purpose functions, and often can be outperformed by well-but-not-brilliantly-written code that is well-matched to a specific task. As an example, consider Listing 1.5, which uses internal buffering to handle blocks of bytes at a time. Table 1.1 shows that Listing 1.5 is 2.5 to 4 times faster than Listing 1.4 (and as much as 49 times faster than Listing 1.1!), even though it uses no assembly at all.

Clearly, you can do well by using special-purpose C code in place of a C library function -- if you have a thorough understanding of how the C library function operates and exactly what your application needs done. Otherwise, you'll end up rewriting C library functions in C, which makes no sense at all.

(Blogger's note: optimize only on the basis of understanding.)

LISTING 1.5 L1-5.C
/*
* Program to calculate the 16-bit checksum of the stream of bytes
* from the specified file. Buffers the bytes internally, rather
* than letting C or DOS do the work.
*/
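#include <stdio.h>
#include <stdlib.h>  /* exit() */
#include <fcntl.h>   /* O_RDONLY, O_BINARY */
#include <io.h>      /* open(), read() on the DOS compilers */
#include <alloc.h>   /* alloc.h for Borland, malloc.h for Microsoft */
#define BUFFER_SIZE 0x8000   /* 32K data buffer */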

main(int argc, char *argv[]) {
    int Handle;
    unsigned int Checksum;
    unsigned char *WorkingBuffer, *WorkingPtr;
    int WorkingLength, LengthCount;
    
    if ( argc != 2 ) {
        printf("usage: checksum filename\n");
        exit(1);
    }
    if ( (Handle = open(argv[1], O_RDONLY | O_BINARY )) == -1 ) {
        printf("Can't open file: %s\n", argv[1]);
        exit(1);
    }
    
    /*  Get memory in which to buffer the data */
    if ( (WorkingBuffer = malloc(BUFFER_SIZE)) == NULL ) {
        printf("Can't get enough memory\n");
        exit(1);
    }
    
    /*  Initialize the checksum accumulator */
    Checksum = 0;
    
    /* Process the file in BUFFER_SIZE chunks */
    do {
        if ( (WorkingLength = read(Handle, WorkingBuffer, BUFFER_SIZE)) == -1 ) {
            printf("Error reading file %s\n", argv[1]);
            exit(1);
        }
        /*  Checksum this chunk */
        WorkingPtr = WorkingBuffer;
        LengthCount = WorkingLength;
        while( LengthCount-- ) {
            /*  Add each byte in turn into the checksum accumulator */
            Checksum += (unsigned int) *WorkingPtr++;
        }
    } while( WorkingLength );
    
    /*  Report the result */
    printf("The checksum is: %u\n", Checksum);
    exit(0);
}
 

That brings us to the fourth reason: avoiding an internal-buffered implementation like Listing 1.5 because of the difficulty of coding such an approach. True, it is easier to let a C library function do the work, but it's not all that hard to do the buffering internally. The key is the concept of handling data in restartable blocks; that is, reading a chunk of data, operating on the data until it runs out, suspending the operation while more data is read in, and then continuing as though nothing had happened.

(Blogger's note: in practice this is just a two-level loop; it takes more effort to describe than to write.)

In Listing 1.5 the restartable block implementation is pretty simple because checksumming works with one byte at a time, forgetting about each byte immediately after adding it into the total. Listing 1.5 reads in a block of bytes from the file, checksums the bytes in the block, and gets another block, repeating the process until the entire file has been processed. In Chapter 5, we'll see a more complex restartable block implementation, involving searching for text strings.
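The restartable-block idea can be sketched in portable C as follows. This is an illustration under stated assumptions, not the book's code: it uses the standard fread() in place of the DOS-era read(), assumes C99 <stdint.h> for an explicit 16-bit type, and the names BLOCK_SIZE, checksum_update, and checksum_file_blocks are hypothetical. The point to notice is that the running sum is the only state that has to survive from one block to the next.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 0x8000u   /* 32K, the same buffer size as Listing 1.5 */

/* Fold one block into the running checksum and return the new value.
 * The running sum is the only state carried from block to block. */
uint16_t checksum_update(uint16_t sum, const unsigned char *block, size_t length)
{
    const unsigned char *p = block;

    while (length--) {
        sum = (uint16_t)(sum + *p++);   /* wraps modulo 65536 */
    }
    return sum;
}

/* Read the stream a block at a time, resuming the checksum after each
 * refill. Returns 1 on success, 0 on a read error. */
int checksum_file_blocks(FILE *fp, uint16_t *checksum)
{
    static unsigned char buffer[BLOCK_SIZE];
    uint16_t sum = 0;
    size_t got;

    assert(fp != NULL && checksum != NULL);
    while ((got = fread(buffer, 1, sizeof buffer, fp)) > 0) {
        sum = checksum_update(sum, buffer, got);
    }
    if (ferror(fp)) {
        return 0;
    }
    *checksum = sum;
    return 1;
}
```

Because checksum_update() takes the previous sum and returns the new one, splitting the data at any point gives the same result as processing it in one piece, which is what makes the block boundaries invisible to the caller.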
 

At any rate, Listing 1.5 isn't much more complicated than Listing 1.4 -- and it's a lot faster. Always consider the alternatives; a bit of clever thinking and program redesign can go a long way.

 

1.3.7 Know How to Turn On the Juice

I have said time and again that optimization is pointless until the design is settled. When that time comes, however, optimization can indeed make a significant difference. Table 1.1 indicates that the optimized version of Listing 1.5 produced by Microsoft C outperforms an unoptimized version of the same code by more than 60 percent. What's more, a mostly assembly version of Listing 1.5, shown in Listings 1.6 and 1.7, outperforms even the best optimized C version of Listing 1.5 by 26 percent. These are considerable improvements, well worth pursuing -- once the design has been maxed out.
 

LISTING 1.6 L1-6.C
/*
* Program to calculate the 16-bit checksum of the stream of bytes
* from the specified file. Buffers the bytes internally, rather
* than letting C or DOS do the work, with the time-critical
* portion of the code written in optimized assembler.
*/
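#include <stdio.h>
#include <stdlib.h>  /* exit() */
#include <fcntl.h>   /* O_RDONLY, O_BINARY */
#include <io.h>      /* open(), read() on the DOS compilers */
#include <alloc.h>   /* alloc.h for Borland, malloc.h for Microsoft */
#define BUFFER_SIZE 0x8000   /* 32K data buffer */

/* Assembly subroutine defined in Listing 1.7 */
void ChecksumChunk(unsigned char *Buffer, unsigned int BufferLength,
                   unsigned int *Checksum);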

main(int argc, char *argv[]) {
    int Handle;
    unsigned int Checksum;
    unsigned char *WorkingBuffer;
    int WorkingLength;
    
    if ( argc != 2 ) {
        printf("usage: checksum filename\n");
        exit(1);
    }
    if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
        printf("Can't open file: %s\n", argv[1]);
        exit(1);
    }
    
    /* Get memory in which to buffer the data */
    if ( (WorkingBuffer = malloc(BUFFER_SIZE)) == NULL ) {
        printf("Can't get enough memory\n");
        exit(1);
    }
    
    /* Initialize the checksum accumulator */
    Checksum = 0;
    
    /* Process the file in 32K chunks */
    do {
        if ( (WorkingLength = read(Handle, WorkingBuffer, BUFFER_SIZE)) == -1 ) {
            printf("Error reading file %s\n", argv[1]);
            exit(1);
        }
        /* Checksum this chunk if  there's anything in it */
        if(WorkingLength)
            ChecksumChunk(WorkingBuffer, WorkingLength, &Checksum);
    } while(WorkingLength);
    
    /* Report the result */
    printf("The checksum is: %u\n", Checksum);
    exit(0);
}

LISTING 1.7 L1-7.ASM
; Assembler subroutine to perform a 16-bit checksum on a block of
; bytes 1 to 64K in size. Adds checksum for block into passed-in
; checksum.
;
; Call as:
;       void ChecksumChunk(unsigned char *Buffer,
;       unsigned int BufferLength, unsigned int *Checksum);
;
; Where:
;       Buffer = pointer to start of block of bytes to checksum
;       BufferLength = # of bytes to checksum (0 means 64K, not 0)
;       Checksum = pointer to unsigned int variable checksum is
;               stored in
;
; Parameter structure:
;
Parms   struc
                dw      ?       ;pushed BP
                dw      ?       ;return address
Buffer          dw      ?
BufferLength    dw      ?
Checksum        dw      ?
Parms   ends
;
        .model small
        .code
        public _ChecksumChunk
_ChecksumChunk proc near
        push    bp
        mov     bp,sp
        push    si              ;save C's register variable
;
        cld                     ;make LODSB increment SI
;
        mov     si,[bp+Buffer]  ;point to buffer
        mov     cx,[bp+BufferLength] ;get buffer length
        mov     bx,[bp+Checksum] ;point to checksum variable
        mov     dx,[bx]         ;get the current checksum
        sub     ah,ah           ;so AX will be a 16-bit value after LODSB
ChecksumLoop:
        lodsb                   ;get the next byte
        add     dx,ax           ;add it into the checksum total
        loop    ChecksumLoop    ;continue for all bytes in block
        mov     [bx],dx         ;save the new checksum
;
        pop     si              ;restore C's register variable
        pop     bp
        ret
_ChecksumChunk endp
        end

Note that in Table 1.1, optimization makes little difference except in the case of Listing 1.5, where the design has been refined considerably. Execution time in the other cases is dominated by time spent in DOS and/or the C library, so optimization of the code you write is pretty much irrelevant. What's more, while the approximately two-times improvement we got by optimizing is not to be sneezed at, it pales against the up-to-50-times improvement we got by redesigning.
 

By the way, the execution times even of Listings 1.6 and 1.7 are dominated by DOS disk access times. If a disk cache is enabled and the file to be checksummed is already in the cache, the assembly version is three times as fast as the C version. In other words, the inherent nature of this application limits the performance improvement that can be obtained via assembly. In applications that are more CPU-intensive and less disk-bound, particularly those applications in which string instructions and/or unrolled loops can be used effectively, assembly tends to be considerably faster relative to C than it is in this very specific case.

(Blogger's note: string instructions and loop unrolling are both assembly language techniques.)
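As an illustration of what an unrolled loop looks like, here is a sketch in C rather than assembly. The function name checksum_unrolled4 and the unroll factor of four are arbitrary choices for the example, and C99 <stdint.h> is assumed. Whether unrolling by hand helps at all depends on the compiler and CPU; optimizing compilers may already unroll a simple loop like this on their own, so treat it as a demonstration of the technique, not a recommendation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Checksum a block with the inner loop unrolled four times, so the
 * loop-control overhead is paid once per four bytes. */
uint16_t checksum_unrolled4(uint16_t sum, const unsigned char *p, size_t length)
{
    size_t groups = length / 4;
    size_t leftover = length % 4;

    assert(p != NULL || length == 0);
    while (groups--) {
        sum = (uint16_t)(sum + p[0] + p[1] + p[2] + p[3]);
        p += 4;
    }
    while (leftover--) {        /* the 0 to 3 bytes that remain */
        sum = (uint16_t)(sum + *p++);
    }
    return sum;
}
```

The second loop handles block lengths that are not a multiple of four, so the function accepts any length, just like the straightforward byte-at-a-time loop.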
 
Don't get hung up on optimizing compilers or assembly language -- the best optimizer is between your ears.

All this is basically a way of saying: Know where you're going, know the territory, and know when it matters.
 

1.4 Where We've Been, What We've Seen
 
What have we learned? Don't let other people's code -- even DOS -- do the work for you when speed matters, at least not without knowing what that code does and how well it performs.

 

Optimization only matters after you've done your part on the program design end. Consider the ratios on the vertical axis of Table 1.1, which show that optimization is almost totally wasted in the checksumming application without an efficient design. Optimization is no panacea. Table 1.1 shows a two-times improvement from optimization -- and a 50-times-plus improvement from redesign. The longstanding debate about which C compiler optimizes code best doesn't matter quite so much in light of Table 1.1, does it? Your organic optimizer matters much more than your compiler's optimizer, and there's always assembly for those usually small sections of code where performance really matters.

Where We're Going

This chapter has presented a quick step-by-step overview of the design process. I'm not claiming that this is the only way to create high-performance code; it's just an approach that works for me. Create code however you want, but never forget that design matters more than detailed optimization. Never stop looking for inventive ways to boost performance -- and never waste time speeding up code that doesn't need to be sped up.

I'm going to focus on specific ways to create high-performance code from now on. In Chapter 5, we'll continue to look at restartable blocks and internal buffering, in the form of a program that searches files for text strings.

 

 

 


 

 

 

 

 
