Delphi ASM Tutorial (1)

ttch

This article comes from

http://dennishomepage.gugs-cats.dk/BASM-filer/BASMForBeginners.htm

Original author: Dennis Kjaer Christensen, Denmark

Introduction to BASM for Beginners

The series of articles named “BASM for Beginners” currently consists of 7 articles, and numbers 8 and 9 are in progress. What the existing and upcoming articles have in common is that they explain some BASM issues by means of an example function. Most often this function is first implemented in Pascal; the compiler-generated assembler code is then copied from the CPU view in Delphi, analyzed, and optimized. Sometimes optimization involves the use of MMX, SSE or SSE2 instructions.

By starting from the code the compiler generates for a Pascal function, the beginner first meets the most commonly used instructions from the large 32-bit Intel Architecture instruction set. Seeing which code the compiler generates gives valuable insight into the effectiveness of compiler-generated code in general and of the Delphi compiler specifically.

As specific assembly code optimizations are introduced, they will be generalized where suitable. These general optimizations are suitable for implementation in compilers, and most compilers, including Delphi, have them. At some point in the future a tool that automatically optimizes assembler code will be developed.

Knowledge about the target processor is often needed when optimizing code, so a lot of CPU details, such as pipelines, are explained in the series too.

As far as I know, very little literature is available that explains all these issues at a level beginners can follow. I hope this series will help fill this void.

Best regards
Dennis Kjaer Christensen

Lesson 1

The first little example gets us started. It is a simple Pascal function which multiplies an integer by the constant 2.

function MulInt2(I : Integer) : Integer;
begin
  Result := I * 2;
end;

Let's steal the BASM from the CPU view. I compiled with optimizations turned on.

function MulInt2_BASM(I : Integer) : Integer;
begin
  Result := I * 2;
  {
  add eax,eax
  ret
  }
end;

From this we see that I is passed to the function in eax and that the result is returned to the caller in eax too. This is how the register calling convention, which is the default in Delphi, works. The actual code is very simple: the multiplication by 2 is obtained by adding I to itself, I + I = 2I. The ret instruction returns execution to the instruction following the call that invoked the function.
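To make the register calling convention a little more concrete, here is a minimal sketch with two parameters. It assumes a 32-bit Delphi target, where the register convention passes the first parameter in eax and the second in edx. The function AddInts is a hypothetical helper and does not appear in the original article.

```pascal
program RegisterConventionDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper, not from the original article.
// Assumes 32-bit Delphi and the default register calling convention:
// A arrives in eax, B arrives in edx, the Integer result leaves in eax.
function AddInts(A, B: Integer): Integer;
asm
  add eax, edx   // eax := A + B
end;

begin
  WriteLn(AddInts(20, 22));  // prints 42
end.
```

As in the lesson's function, no explicit ret is written here; the inline assembler supplies it.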

Let's code the function as a pure asm function.

function MulInt2_BASM2(I : Integer) : Integer;
asm
  //Result := I * 2;
  add eax,eax
  //ret
end;

Observe that the ret instruction is supplied by the inline assembler.

Let us take a look at the calling code. This is the Pascal code:

procedure TForm1.Button1Click(Sender: TObject);
var
  I, J : Integer;
begin
  I := StrToInt(IEdit.Text);
  J := MulInt2_BASM2(I);
  JEdit.Text := IntToStr(J);
end;

The important line is

J := MulInt2_BASM2(I);

From the CPU view:

call StrToInt
call MulInt2_BASM2
mov esi,eax

After the call to StrToInt, which comes from the line before the one that calls our function, I is in eax (StrToInt also follows the register calling convention). MulInt2_BASM2 is then called and returns its result in eax, which is copied to esi in the next line.

Optimization issues: multiplication by 2 can be done in two more ways: use the mul instruction, or shift left by one. mul is described in the Intel IA32 SW developers manual 2, page 536. It multiplies the value in eax by another register, and the result is returned in the register pair edx:eax. A register pair is needed because multiplying two 32-bit numbers gives a result of up to 64 bits, just as 9*9=81 shows that two one-digit numbers can give a two-digit result.
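The edx:eax split can be illustrated in plain Pascal. This is only a sketch: it assumes Int64 is available and that the product fits in Int64, which holds for the sample values chosen here. The names HiPart and LoPart are illustrative and stand for what mul would leave in edx and eax.

```pascal
program WideProductDemo;
{$APPTYPE CONSOLE}

var
  A, B, HiPart, LoPart: Cardinal;
  P: Int64;
begin
  A := $FFFFFFFF;                 // largest unsigned 32-bit value
  B := 2;
  P := Int64(A) * Int64(B);       // full product: $1FFFFFFFE (33 bits)
  HiPart := Cardinal(P shr 32);               // upper 32 bits, what mul puts in edx
  LoPart := Cardinal(P and Int64($FFFFFFFF)); // lower 32 bits, what mul puts in eax
  WriteLn(HiPart);                // prints 1
  WriteLn(LoPart);                // prints 4294967294
end.
```

The nonzero upper half shows why a single 32-bit register is not always enough to hold the product.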

This raises the issue of which registers must be preserved by a function and which can be used freely. This is explained in the Delphi help.

"An asm statement must preserve the EDI, ESI, ESP, EBP, and EBX registers, but can freely modify the EAX, ECX, and EDX registers."
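As a hedged illustration of the rule, the sketch below uses ebx as scratch space and therefore saves and restores it explicitly. MulInt2_ViaEbx is a hypothetical variant that is not in the original article, and the code assumes a 32-bit Delphi target. Using ebx here is deliberately wasteful; it only serves to show the push/pop pattern.

```pascal
program PreserveEbxDemo;
{$APPTYPE CONSOLE}

// Hypothetical variant, not from the original article; assumes 32-bit Delphi.
function MulInt2_ViaEbx(I: Integer): Integer;
asm
  push ebx        // ebx must be preserved, so save the caller's value
  mov  ebx, eax   // ebx := I
  add  eax, ebx   // eax := I + I
  pop  ebx        // restore the caller's ebx before returning
end;

begin
  WriteLn(MulInt2_ViaEbx(21));  // prints 42
end.
```

Had the scratch register been ecx or edx instead, the push and pop would not be needed.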

We can conclude that it is no problem that edx is modified by the mul instruction, so our function can also be implemented like this.

function MulInt2_BASM3(I : Integer) : Integer;
asm
  //Result := I * 2;
  mov ecx, 2
  mul ecx
end;

ecx is used as well, but this is also OK. As long as the result fits within the range of Integer, it is returned correctly in eax. If I is bigger than half the range of Integer, overflow will occur and the result is incorrect.
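The overflow condition can be written down as a small check in plain Pascal. This is a sketch under the assumption that Integer is a 32-bit signed type; DoublingOverflows is a hypothetical helper, not part of the original article.

```pascal
program DoublingOverflowDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: True when I * 2 does not fit in a 32-bit Integer.
function DoublingOverflows(I: Integer): Boolean;
begin
  Result := (I > High(Integer) div 2) or (I < Low(Integer) div 2);
end;

begin
  WriteLn(DoublingOverflows(1073741823));   // FALSE: 2147483646 still fits
  WriteLn(DoublingOverflows(1073741824));   // TRUE:  2147483648 is too large
  WriteLn(DoublingOverflows(-1073741824));  // FALSE: -2147483648 still fits
  WriteLn(DoublingOverflows(-1073741825));  // TRUE:  -2147483650 is too small
end.
```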

Implementation with shift

function MulInt2_BASM4(I : Integer) : Integer;
asm
  //Result := I * 2;
  shl eax,1
end;
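That the three approaches agree can be checked in plain Pascal. This is a small sketch, not code from the original article; the sample values are arbitrary, non-negative, and small enough that doubling cannot overflow.

```pascal
program DoubleThreeWays;
{$APPTYPE CONSOLE}

const
  // Arbitrary sample values, chosen so that doubling cannot overflow.
  Samples: array[0..3] of Integer = (0, 1, 21, 1000000);
var
  K, I: Integer;
begin
  for K := Low(Samples) to High(Samples) do
  begin
    I := Samples[K];
    // multiply, add to itself, shift left by one: all three columns match
    WriteLn(I * 2, ' ', I + I, ' ', I shl 1);
  end;
end.
```

For the sample 21, for instance, the line printed is 42 42 42.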

Timing can reveal which implementation is fastest. We can also consult the Intel or AMD documents with latency and throughput tables. add and mov have 0.5 cycles latency and throughput, mul has 14-18 cycles latency and 5 cycles throughput, and shl has 4 cycles latency and 1 cycle throughput. The version chosen by Delphi is the most efficient on P4, and this will probably also be the case on Athlon and P3.
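A very rough way to time one implementation is sketched below. It is not a careful benchmark. It assumes a Windows target where the Windows unit provides GetTickCount (newer Delphi versions may need Winapi.Windows instead). The Runs constant and the checksum variable are illustrative choices, not from the original article.

```pascal
program TimingSketch;
{$APPTYPE CONSOLE}
uses
  Windows;  // assumption: Windows target; provides GetTickCount

function MulInt2(I: Integer): Integer;
begin
  Result := I * 2;
end;

const
  Runs = 100000000;  // illustrative repeat count
var
  K: Integer;
  Sum: Int64;        // checksum, so the calls have a visible effect
  StartTick, Elapsed: Cardinal;
begin
  Sum := 0;
  StartTick := GetTickCount;
  for K := 1 to Runs do
    Sum := Sum + MulInt2(K and $FF);
  Elapsed := GetTickCount - StartTick;   // elapsed milliseconds
  WriteLn('Elapsed ms: ', Elapsed, ' (checksum ', Sum, ')');
end.
```

Swapping MulInt2 for one of the asm versions and comparing the elapsed times gives a first impression only; the loop overhead and the cost of call and ret are included in every measurement.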

Issues not covered: mul versus imul and range checking, other calling conventions, benchmarking, clock count on other processors, clock count for call + ret, location of the return address for ret, etc.