red zone

Source: http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/function-perilogues.html

The gen on function perilogues.

Yes, "perilogue" is a real word, sort of. It's only ever been used as a technical term in computing, and was first used by Edward Anton Schneider of Carnegie-Mellon University in 1976 to mean the start or finish of an operation. Clearly this is a useful term for the combination of a prologue and an epilogue, which are inseparable from each other when it comes to discussions of compiled functions in computer languages, and lack another word for their combination.

As "prologue" comes from the Greek "προ", meaning "before", and as "epilogue" comes from the Greek "επι", meaning "after", so "perilogue" comes from the Greek "περι", meaning "around/about". Indeed, the word "περιλεγειν" actually exists in Classical Greek, in the writings of Hermippus, meaning "to talk around" something.

A function perilogue comprises a function prologue and a function epilogue, which bracket the actual body of a function fore and aft. The prologue and epilogue are strongly coupled to one another, and are effectively a single unit, the perilogue. The prologue sets up an execution environment for the function body, and the epilogue tears it down again.

Activation records and red zones

Function perilogues are, by their natures, specific to instruction set architectures. The standard perilogues for an architecture set up a standard stack frame for the architecture in the prologue and tear it back down again in the epilogue.

Formally, a stack frame is part of a function's overall activation record. An activation record formally comprises:

  • a parameter area where the stack-stored parameters (or, in some function calling conventions such as IBM's Optlink calling convention, space reserved for register-stored parameters to be spilled into) are held
  • a return address where the address of the next instruction to execute within the calling function is stored
  • one or more saved frame pointers holding the calling functions'/function's saved frame pointer(s)
  • a save area where non-volatile registers that the function code modifies are saved across the function body
  • a locals area where function-local variables are stored, the address of the start of which is the current stack pointer
  • a red zone that is below the stack pointer

Strictly speaking, function activation records don't have to be stored on call stacks. Modern processor architectures employ what is known as dynamic allocation for activation records, where activation records are created and destroyed on the fly, being pushed onto and popped off the top of a call stack. Another possibility, not employed in mainstream architectures nowadays, is static allocation, where it is known that functions are not reëntrant and programs are not multi-threaded. With static allocation, activation records are simply stored in fixed portions of a program's read/write data area, determined at compile/link time.

A red zone is the area immediately below the stack pointer, which can be overwritten at any time as a consequence of asynchronous events occurring during execution of the function: signals, exceptions, or interrupts. Because it is liable to be destroyed at any moment, functions should not attempt to use it for storage.

In many processor architectures, the stack pointer register always points to the bottom of the standard stack frame. When a function needs itself to call other functions (i.e. it is not a so-called leaf function) it has to construct the parameter area for the called function's stack frame. There are two common techniques:

  • On the x86 architecture, code temporarily modifies the stack pointer to make space for the new parameter area, on the fly, by pushing and popping (or otherwise manipulating) the stack. This is not terribly efficient, requiring specialized optimization hardware in x86 CPUs just to reduce its gross serial dependencies upon a single processor register. As explained later, some optimizing compilers for x86 take a more RISC-like approach.

  • On more RISC-like architectures such as MIPS, the convention is for space large enough to hold the parameter area for any function that will be called (which size the compiler obviously can work out at code generation time, by simply taking the maximum of the sizes of the individual parameter areas) to be pre-allocated by the function prologue, below the locals area, with the stack pointer register pointing to its base. Thus a standard stack frame contains an extra area, a calling parameter area, below the locals area, part or all of which overlaps the parameter areas of the activation records of called functions.
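
The sizing rule in the second technique is just a maximum. A minimal sketch in C++, assuming the compiler has already collected the parameter area size of every call site in the function (the function name and the vector argument are illustrative, not taken from any real compiler):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Size in bytes of the calling parameter area that a prologue would
// pre-allocate: the largest parameter area of any function called.
// A leaf function calls nothing, so it needs no such area.
// (Hypothetical helper; real conventions may also impose a minimum size.)
std::size_t calling_parameter_area_size(const std::vector<std::size_t>& callee_param_sizes) {
    if (callee_param_sizes.empty()) {
        return 0;
    }
    return *std::max_element(callee_param_sizes.begin(), callee_param_sizes.end());
}
```

A function that makes calls needing 8, 24, and 16 bytes of parameters would thus reserve 24 bytes once, in its prologue, and reuse that space for every call.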

Many processor architectures also have a frame pointer register. Sometimes this points to the saved frame pointers area. Sometimes it points elsewhere within a standard stack frame. The frame pointer isn't, strictly speaking, used to locate the stack frame itself. The stack pointer does that quite happily, after all. The frame pointer is used to provide simple access, using the shortest/quickest instruction forms, to the locals area and parameters area of a stack frame. For example:

  • On the x86 architecture, code temporarily modifies the stack pointer to make space for new calling parameter areas, on the fly, by pushing and popping (or otherwise manipulating) the stack. However, the frame pointer register, (E)BP, remains fixed, pointing at the middle of the stack frame. Compilers thus generate code that references function parameters and function-local variables using register-relative addressing via the frame pointer, and that references calling parameters using register-relative addressing via the stack pointer.

  • With the x86-64 standard function perilogue, the frame pointer is, by convention, exactly 128 bytes above the base of the stack frame. This doesn't necessarily point to any definite part of the stack frame, because stack frames for different functions (of course) have different sizes of locals areas, save areas, calling parameter areas, and so forth, meaning that whatever is at offset 128 is going to depend from exactly what function is called. The purpose of this offset is so that the frame pointer register can be used in preference to the stack pointer register when accessing the stack frame. The 128 byte offset means that short instructions using a register-relative addressing mode with signed 8-bit offsets can access 256 bytes of the stack frame via the frame pointer register, as opposed to only 128 bytes of the stack frame via the stack pointer register.
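
The 256-versus-128 arithmetic can be checked with a small sketch. It assumes only that a signed 8-bit displacement spans -128 to +127 and that bytes below the base of the frame do not count; the function name disp8_reach is made up for the illustration:

```cpp
#include <cstddef>

// Number of stack frame bytes reachable with a signed 8-bit displacement
// (-128..+127) from a register pointing `bias` bytes above the frame base.
// Displacements that would land below the frame base are not counted.
constexpr std::size_t disp8_reach(std::size_t bias) {
    return (bias + 127) - (bias >= 128 ? bias - 128 : 0) + 1;
}

// Via the stack pointer (bias 0) only displacements 0..+127 are useful.
static_assert(disp8_reach(0) == 128, "stack pointer reaches 128 bytes");
// Via a frame pointer biased by 128, all of -128..+127 lands in the frame.
static_assert(disp8_reach(128) == 256, "biased frame pointer reaches 256 bytes");
```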

The standard x86 function perilogue

The standard function perilogue on the x86 architecture comprises the ENTER and LEAVE instructions:

enter N,M
…
leave
ret

This standard x86 perilogue sets up and tears down a standard x86 stack frame. The base of the stack frame is pointed to by the ESP register. Working upwards from there, the frame comprises N bytes for function-local variables, followed by M+1 saved frame pointers, pointing to the stack frame(s) of the calling function(s), followed by the return address into the function's caller and the (stack-stored) parameters to the function.

The EBP register points to the first (i.e. immediately enclosing caller's) saved frame pointer, which in turn was the value of the EBP register within that function's standard perilogue.

Although this is the smallest size for such a perilogue, it is not necessarily the fastest to execute, especially in the case where there is only the 1 enclosing stack frame pointer to be saved. Moreover, the ENTER and LEAVE instructions did not exist on the 8086. So traditionally, and in some cases for speed, the standard perilogue (for a non-nested function, where M will be zero) has an alternative form, that does exactly the same thing:

push ebp
mov ebp, esp
sub esp, N
…
mov esp, ebp
pop ebp
ret
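
The equivalence of the two forms can be checked on a toy machine. The following C++ sketch models only ESP, EBP, and a sparse stack memory; the Cpu structure and the function names are inventions for the illustration, and enter() follows the operation of ENTER as I understand it from Intel's instruction reference, so treat it as an approximation rather than a definitive model:

```cpp
#include <cstdint>
#include <map>

// Toy 32-bit machine: two registers and a sparse stack memory.
struct Cpu {
    std::uint32_t esp = 0;
    std::uint32_t ebp = 0;
    std::map<std::uint32_t, std::uint32_t> mem;
};

inline void push(Cpu& c, std::uint32_t value) { c.esp -= 4; c.mem[c.esp] = value; }
inline std::uint32_t pop(Cpu& c) { std::uint32_t v = c.mem[c.esp]; c.esp += 4; return v; }

// ENTER size,level: saves EBP, copies level-1 enclosing frame pointers,
// pushes the new frame pointer when level > 0, then reserves the locals.
void enter(Cpu& c, std::uint32_t size, unsigned level) {
    push(c, c.ebp);
    const std::uint32_t frame = c.esp;
    if (level > 0) {
        for (unsigned i = 1; i < level; ++i) {
            c.ebp -= 4;
            push(c, c.mem[c.ebp]);
        }
        push(c, frame);
    }
    c.ebp = frame;
    c.esp -= size;
}

// LEAVE: mov esp,ebp followed by pop ebp.
void leave(Cpu& c) { c.esp = c.ebp; c.ebp = pop(c); }

// The alternative prologue: push ebp / mov ebp,esp / sub esp,N.
void long_prologue(Cpu& c, std::uint32_t size) {
    push(c, c.ebp);
    c.ebp = c.esp;
    c.esp -= size;
}
```

Starting both forms from the same register values, enter(c, N, 0) and long_prologue(c, N) leave ESP, EBP, and the stack contents identical, and leave() restores the original ESP and EBP.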

Interestingly, this is not optimal. The execution of each PUSH or POP instruction depends from the result of execution of all preceding ones, since each instruction has to read, modify, and write the stack pointer register. Consider the case where the function uses several non-volatile registers (for the sake of exposition, assume the 32-bit OS/2 system API calling convention and a function body that does something like an optimized memcpy() using MOVSD and thus requiring EDI and ESI), and as a consequence its perilogue has to save and restore several non-volatile register values on the call stack in the save area:

push ebp
mov ebp, esp
sub esp, N
push edi
push esi
push ebx
…
pop ebx
pop esi
pop edi
mov esp, ebp
pop ebp
ret

Each PUSH instruction (and, indeed, the RET instruction at the end) has a sequential dependency from the instructions that immediately precede it, because they all need to know the ESP value from their predecessors. Similarly, the second and subsequent POP instructions all have sequential dependencies on their immediate predecessors. Worse still, the incremented ESP register value is then entirely discarded anyway (the MOV ESP, EBP that follows simply overwrites it). This does not provide the processor with the ability to schedule multiple operations in parallel, internally, as many x86 processors are nowadays capable of.

The Intel Pentium M and the Intel Atom have a mechanism called "ESP folding" that ameliorates this to an extent. ESP folding handles the implicit accesses to the ESP register, by PUSH, POP, CALL, and RET, in the Address Generation Unit (AGU). This reduces the impact of the multiple successive PUSH and POP instructions in the aforegiven non-optimal perilogue. However, as Intel's IA-32 Software Optimization Reference Manual explains (see §2.4.1, §3.4.2.6, §12.3.2.2, and §12.3.3.6), this doesn't solve all of the execution speed problems with this perilogue, because it still contains explicit accesses to the ESP register, in the SUB and MOV instructions. As the Reference Manual states, mixing the Arithmetic Logic Unit (ALU) with the AGU causes execution stalls. And that's exactly what mixing SUB/MOV with PUSH/POP does.

Moreover: These are but two x86 processors from one vendor that even have ESP folding in the first place. Other processors have no such amelioration even for the sequential dependencies of the PUSH, POP, and RET instructions.

All of these problems go away by employing a more optimal perilogue, whose speedy execution isn't limited to just a certain few Intel x86 processors with special-case support. A more optimal perilogue has just two read-modify-write operations on ESP, manipulating it once each in prologue/epilogue and then using instructions that have no serial dependencies upon one another to place/retrieve the saved non-volatile registers and saved frame pointer onto/from the stack. (On other, more RISC-like, processor architectures, this is the approach taken by standard function perilogues. Even on x86 architectures, this is the approach taken by some optimizing compilers when setting up the parameter area before calling a function.) Further optimization can be enabled by following Intel's recommendation to not mix the ALU with the AGU, using LEA to calculate the new effective ESP address (which is, after all, what LEA is there for). Yet more optimizations still can be performed by taking advantage of the write-combining properties of the processor's L1 cache, and scheduling the MOV instructions accordingly.
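
The difference in serial dependency can be made concrete with a deliberately crude model. The C++ sketch below counts only the longest chain of register read-after-write dependencies in an instruction sequence; it ignores memory dependencies, latencies, ESP folding, and everything else a real processor does, and the Instr structure and function names are made up for the illustration:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// One instruction, described only by the registers it reads and writes.
struct Instr {
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

// Length of the longest chain of register read-after-write dependencies.
int dependency_depth(const std::vector<Instr>& code) {
    std::map<std::string, int> ready;  // chain depth that produced each register
    int longest = 0;
    for (const Instr& ins : code) {
        int depth = 0;
        for (const std::string& r : ins.reads) {
            const auto it = ready.find(r);
            if (it != ready.end()) depth = std::max(depth, it->second);
        }
        ++depth;
        for (const std::string& w : ins.writes) ready[w] = depth;
        longest = std::max(longest, depth);
    }
    return longest;
}

// `count` successive PUSH instructions: each one reads and writes ESP.
std::vector<Instr> push_sequence(int count) {
    return std::vector<Instr>(count, Instr{{"esp"}, {"esp"}});
}

// One LEA that adjusts ESP, then `count` ESP-relative MOV stores that
// read ESP but write only memory.
std::vector<Instr> lea_mov_sequence(int count) {
    std::vector<Instr> code;
    code.push_back(Instr{{"esp"}, {"esp"}});
    for (int i = 0; i < count; ++i) code.push_back(Instr{{"esp"}, {}});
    return code;
}
```

In this model four PUSH instructions form a chain four deep, whereas one LEA followed by four MOV stores is only two deep however many registers are saved, because every store depends only on the single LEA.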

Combining all these results in a standard perilogue that looks like:

LOCALS_SIZE equ …
SAVE_SIZE   equ 16

lea esp, [esp-LOCALS_SIZE-SAVE_SIZE]
mov [esp+LOCALS_SIZE+SAVE_SIZE-16], ebx
mov [esp+LOCALS_SIZE+SAVE_SIZE-12], esi
mov [esp+LOCALS_SIZE+SAVE_SIZE-8], edi
mov [esp+LOCALS_SIZE+SAVE_SIZE-4], ebp
lea ebp, [esp+LOCALS_SIZE+SAVE_SIZE-4]
…
mov ebx, [esp+LOCALS_SIZE+SAVE_SIZE-16]
mov esi, [esp+LOCALS_SIZE+SAVE_SIZE-12]
mov edi, [esp+LOCALS_SIZE+SAVE_SIZE-8]
mov ebp, [esp+LOCALS_SIZE+SAVE_SIZE-4]
lea esp, [esp+LOCALS_SIZE+SAVE_SIZE]
ret

As mentioned, when setting up the parameter area of a new stack frame, in order to call a function, some optimizing compilers (when optimizing for time, not space) use the above approach, thus:

; … code that calculates parameters in local variables …
mov edx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_3]
mov ecx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_2]
mov eax,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_1]
lea esp, [esp-PARAMS_SIZE]
mov [esp+8], edx
mov [esp+4], ecx
mov [esp+0], eax
call function
lea esp, [esp+PARAMS_SIZE]

However, other compilers generate code that simply PUSHes and POPs the stack within the body of the function to set up parameter areas. To add insult to injury, the code also uses an ALU instruction, ADD, on the ESP register immediately after the called function has issued an AGU instruction that is "ESP foldable", RET, causing an execution stall.

; … code that calculates parameters in local variables …
mov edx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_3]
mov ecx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_2]
mov eax,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_1]
push edx
push ecx
push eax
call function
add esp, PARAMS_SIZE

Both approaches, PUSH and MOV, modify ESP on the fly, allocating and deallocating call stack space for the new parameter area of each individual called function as it is called. In general, therefore, in x86 programming the stack pointer register is not always at a fixed position at the base of the function's standard stack frame (as it is by convention on other instruction set architectures). This in turn means that accessing local variables and function parameters is generally done via the frame pointer, using positive constant offsets for function parameters and negative constant offsets for local variables, not the stack pointer.
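
As a worked example of those offsets, here is a small C++ sketch. It assumes a 32-bit function reached by a near call, the PUSH EBP / MOV EBP, ESP / SUB ESP, N prologue, and 4-byte slots, so that [ebp] holds the saved frame pointer and [ebp+4] the return address; the function names are hypothetical:

```cpp
// Size of one stack slot in 32-bit code.
constexpr int kSlotSize = 4;

// EBP-relative offset of the index-th (0-based) stack-stored parameter.
// Two slots are skipped: the saved EBP at [ebp] and the return address at [ebp+4].
constexpr int param_offset(int index) {
    return 2 * kSlotSize + index * kSlotSize;
}

// EBP-relative offset of the index-th (0-based) 4-byte local variable,
// counting downwards from just below the saved EBP.
constexpr int local_offset(int index) {
    return -(index + 1) * kSlotSize;
}

static_assert(param_offset(0) == 8, "first parameter is at [ebp+8]");
static_assert(local_offset(0) == -4, "first local is at [ebp-4]");
```

A far call, or a save area placed between the saved frame pointer and the locals (as in the LEA/MOV perilogue above), would shift these offsets accordingly.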

The standard MIPS function perilogue

The standard function perilogue on the MIPS architecture is fairly similar to the perilogue on the x86 architecture:

subu sp,frame_size
sw ra,frame_size-8(sp)
…
lw ra,frame_size-8(sp)
addu sp,frame_size
jr ra

There are several notable differences:

  • The MIPS standard perilogue allows greater instruction parallelism by only writing to the stack pointer register once each in the prologue and the epilogue, allowing all register saves/loads to be overlapped since they don't depend from each other's results as a sequence of x86 PUSH and POP instructions do.

    Saving a non-volatile register, that the function wants to use for itself, in the save area of the stack frame is merely a matter of an additional sp-relative save/load pair:

    sw s0,frame_size-16(sp)
    …
    lw s0,frame_size-16(sp)

  • The MIPS standard perilogue involves saving the return address from a register onto the call stack and then retrieving it from the call stack before returning. The x86 architecture's CALL and RET instructions do this implicitly. In the MIPS architecture, the return address is in the ra ("return address") register, and how (and indeed whether) it is saved into a stack frame is up to each individual function. ra is less of a special case register and more like a simple non-volatile register, that one handles like any other non-volatile register: saving it in the save area if the function needs to re-use it itself (as a non-leaf function will, but a leaf function will not).

  • By convention, the stack frame on MIPS includes enough space to construct the largest parameters area of any called function, and constructing a parameters area again is a sequence of parallelizable stores that are not interdependent. For example:

    sw s0,16(sp) # set parameter #5
    sw s1,20(sp) # set parameter #6
    jal called_function

There are also similarities. There are stack pointer (sp) and frame pointer (fp) registers. The frame pointer points to the middle of a stack frame, and the stack pointer points to the bottom.

Stack walking

On x86 architectures, because the frame pointer register (EBP) points to the saved frame pointer of the immediate caller, it is possible to walk the stack, as long as every function employs a standard stack frame. Such a process starts with the current frame pointer register value and follows the saved stack frame pointers until the top of the stack area (or an invalid saved frame pointer value) is reached.
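
The walk can be sketched in C++ over a simulated stack. The sketch assumes 32-bit near calls, so that each frame holds the saved frame pointer at [ebp] and the return address at [ebp+4]; the Memory alias, the Frame structure, and walk_stack are illustrative names, and "invalid" is taken to mean zero, unmapped, outside the stack area, or not strictly above the current frame:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Sparse model of stack memory: address -> 32-bit value.
using Memory = std::map<std::uint32_t, std::uint32_t>;

struct Frame {
    std::uint32_t frame_pointer;   // value EBP had inside this function
    std::uint32_t return_address;  // where this function returns to
};

// Follow the chain of saved frame pointers, innermost frame first.
std::vector<Frame> walk_stack(const Memory& mem, std::uint32_t ebp, std::uint32_t stack_top) {
    std::vector<Frame> frames;
    while (ebp != 0 && ebp < stack_top) {
        const auto saved = mem.find(ebp);
        const auto ret = mem.find(ebp + 4);
        if (saved == mem.end() || ret == mem.end()) break;  // unreadable frame
        frames.push_back({ebp, ret->second});
        if (saved->second <= ebp) break;  // callers must lie higher up the stack
        ebp = saved->second;
    }
    return frames;
}
```

A function compiled without a standard stack frame breaks the chain: the walk either stops early or, worse, follows whatever value happened to be in EBP.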

Many compilers targeting x86 platforms provide options for disabling the generation of a standard perilogue for functions that can do without one, or disabling just the part that involves saving and restoring the caller's frame pointer and setting up and tearing down a new one for the called function. However, walking the stack is employed by debuggers, exception handlers, and postmortem analysis utilities. So although many compilers provide these options, one should be aware of the effect that they will have upon debugging, exception handling, and postmortem analyses of a program's execution.

One trick with frame pointers used on 16:32 and 16:16 x86 code is to signal the calling distance of a function, by setting a flag in its saved frame pointer for the calling function. This allows anything walking the stack to know whether the return address in the stack frame is a near (0:16 or 0:32) or a far (16:16 or 16:32) address (and hence where the parameter area begins relative to the frame pointer area).

This marking is done by taking advantage of the fact that in normal use a stack frame will never be aligned to an odd address. (Compilers and operating systems conspire to enforce this. Compilers ensure that they only ever change the value of ESP by an even number of bytes, and operating systems ensure that stack tops are always aligned to a multiple of 2 bytes.) A far function is marked by the simple expedient of incrementing the frame pointer by 1 before it is saved and decrementing it again after it is restored. Bit #0 of the saved frame pointer thus becomes a "far function" flag. Such a perilogue generally looks like this:

inc ebp
enter N,M
…
leave
dec ebp
ret

Stack walking code has, of course, to be aware of whether this convention is used by the libraries on the target platform that the program is running on, and whether the program itself was compiled to employ it.
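
What a stack walker does with the flag can be sketched as follows. This C++ fragment assumes 16:16 code, where the saved BP occupies 2 bytes, a near return address 2 bytes, and a far return address 4 bytes; the structure and function names are hypothetical:

```cpp
#include <cstdint>

// A saved frame pointer split into the caller's real BP and the far flag.
struct SavedFramePointer {
    std::uint16_t caller_bp;  // the caller's BP with the marker bit removed
    bool far_call;            // true if bit #0 was set in the saved value
};

// Decode the word found at [bp] in a frame that uses the marking convention.
constexpr SavedFramePointer decode_saved_bp(std::uint16_t saved) {
    return SavedFramePointer{static_cast<std::uint16_t>(saved & 0xFFFEu),
                             (saved & 1u) != 0};
}

// Offset from BP at which the parameter area starts: past the 2-byte
// saved BP and a 2-byte (near) or 4-byte (far) return address.
constexpr unsigned parameter_area_offset(bool far_call) {
    return 2u + (far_call ? 4u : 2u);
}
```

A saved value of 0x1235 thus decodes to a caller BP of 0x1234 in a far function, whose parameters begin at [bp+6] rather than [bp+4].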

Stack probes

Both 32-bit OS/2 and Win32 provide application software with the capability of having what are known as decommitted stacks. A decommitted stack is a thread's stack area that starts off allocated but not committed. In other words: The range of virtual address space is allocated to the stack, but there are no virtual memory pages committed into that range of addresses. Decommitted stacks allow applications to specify large stack sizes without incurring (unless actually required as the program executes) the overhead of all of the virtual memory pages that would be necessary were the stack area fully committed. This is done via a mechanism that involves guard pages.

The details of operation of the guard page mechanism can be found in the 32-bit OS/2 Developers' Toolkit documentation and the MSDN documentation for Win32. Simply put: All pages in a thread's stack down to some point are committed; the next page is a committed page marked as a guard page; and all pages below that are not committed. Accessing a guard page causes a page fault and an application-mode exception, in response to which the operating system automatically turns the guard page into an ordinary committed page (by removing its guard page attribute), commits the next lower page in the stack (if it can), and (if that succeeds) marks that next lower page as a guard page in its turn.

Thus any application that guarantees that it will access its stack space more-or-less as an actual stack, pushing things onto the top serially, will have its initially entirely decommitted stack automatically committed for it by the operating system as the stack grows downwards. (When the operating system hits the bottom of the stack, usually an exception is raised by the operating system. But that's another discussion, with complexities and subtleties of its own, tangential to this one.)
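
The mechanism can be modelled in a few lines of C++. This is a simplified sketch, not either operating system's actual behaviour: pages are numbered from 0 (lowest) upwards, only the top page starts off as an ordinary committed page, the guard page is always the page directly below the lowest ordinary committed page, and what happens at the very bottom of the stack is left out. The class and enumeration names are made up for the illustration:

```cpp
#include <cstdint>

enum class Access { Ok, GuardHit, Fault };

// Toy model of a decommitted stack with a single moving guard page.
class DecommittedStack {
public:
    // Assumes page_count >= 2. Only the top page starts off committed.
    explicit DecommittedStack(std::uint32_t page_count)
        : page_count_(page_count), lowest_committed_(page_count - 1) {}

    // Touch one byte somewhere in the given page.
    Access touch(std::uint32_t page) {
        if (page >= page_count_) return Access::Fault;     // outside the stack
        if (page >= lowest_committed_) return Access::Ok;  // ordinary committed page
        if (page + 1 == lowest_committed_) {
            // The guard page: it becomes an ordinary page, and the page
            // below it implicitly becomes the new guard page.
            lowest_committed_ = page;
            return Access::GuardHit;
        }
        return Access::Fault;  // skipped over the guard into uncommitted space
    }

private:
    std::uint32_t page_count_;
    std::uint32_t lowest_committed_;
};
```

Touching pages 7, 6, 5 of an 8-page stack in that order succeeds, with the guard page moving down each time; touching page 3 next, without first touching page 4, faults. That last case is precisely what a large stack frame can provoke.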

The x86 standard function perilogue does access its stack space more-or-less like a stack. It pushes activation records onto the call stack, and pops them back off again. The problem ensues when an activation record is larger than a memory page.

More particularly, the problem ensues when the perilogue is using the SUB or LEA instructions to decrement the stack pointer. A PUSH or ENTER instruction decrements the stack pointer, but also touches the memory at the stack pointer address. The write access to the memory location will trigger the guard page mechanism. A perilogue that used PUSH or ENTER to decrement the stack pointer exclusively would never suffer a problem. But perilogues don't always use PUSH or ENTER. Indeed, as already discussed, for the greatest speed optimization a perilogue doesn't use PUSH, POP, ENTER, or LEAVE anywhere, but rather uses LEA to modify the stack pointer register and ESP-relative MOV instructions to save and restore registers and frame pointers. It doesn't necessarily incorporate those MOV instructions in strictly descending order of stack location, either.

This is where stack probes come in. Stack probes are a simple idea. The compiler generates extra code in the function perilogue if there's a danger that it might skip over more than one page of the stack without actually touching it when pushing the function's activation record.

Usually, compilers simply note when the size of the locals area is larger than a page, and spit out extra dummy memory references, immediately after the SUB instruction that decrements ESP, that touch the intervening stack pages in the right, descending, order. There are several complications to this scheme:

  • ESP cannot be decremented more than 1 page past the lowest stack page access at any point. This is because an asynchronous exception, whose handler would of course immediately start pushing things onto the stack at the current ESP address, may arrive at any point during the stack probe sequence itself. So compilers take one of two approaches:

    • They do all stack probe operations before decrementing ESP, by writing into the red zone. It's safe to touch the red zone, just not safe to rely upon it retaining the values that one may have written to it. Stack probes don't care about values of the memory locations touched. Indeed, some stack probe mechanisms write completely arbitrary data themselves (such as zero, or 0xdeadbeef, or whatever happens to be in the EAX register at the time).

    • They perform progressive stack probes, decrementing ESP one page at a time. For example, a 12KiB stack probe, unrolled, would look like:

      lea esp,[esp - 1000h]
      mov [esp],eax
      lea esp,[esp - 1000h]
      mov [esp],eax
      lea esp,[esp - 1000h]
      mov [esp],eax

  • The range of stack locations that the stack probe has to cover is not simply the size of the locals area. If the function perilogue doesn't use PUSH to save non-volatile registers into the save area, the size of the save area must be added to the probe total as well. Similarly with saved stack frame pointers, if they are stored with MOV rather than PUSH. Basically, however much space is to be skipped over with SUB or LEA is however much space needs to be probed.

  • An often forgotten part of the function activation record, when it comes to stack probes, is the parameters area. Unless a compiler always uses PUSH to place parameter values onto the call stack, it must also perform stack probes when it calls functions that take more than one page's worth of parameters. Here is a simple C++ program that illustrates a case where stack probes of the calling parameters area are also required:

    struct s { char b[16384] ; } ;
    
    int f ( s p ) { return p.b[0] ; }
    
    s d ;
    
    int main () { return f(d) ; }

    Unfortunately, many compilers forget that stack probes are also required for setting up a calling parameters area properly, and don't employ any stack probe generation logic when generating calls to functions. Borland C/C++, MetaWare High C/C++, and EMX C/C++ all overlook this necessity. Watcom C/C++ does not.
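
The touches made by a progressive probe are easy to enumerate. The following C++ sketch assumes 4KiB pages by default and returns, as byte distances below the original stack pointer, the locations a progressive probe would touch, one per page and in descending address order, finishing at the full size of the allocation; probe_offsets is a made-up name, not any compiler's helper routine:

```cpp
#include <cstddef>
#include <vector>

// Distances below the original stack pointer that a progressive stack
// probe touches: one per whole page, then the end of the allocation.
std::vector<std::size_t> probe_offsets(std::size_t total_bytes,
                                       std::size_t page_size = 4096) {
    std::vector<std::size_t> offsets;
    for (std::size_t off = page_size; off < total_bytes; off += page_size) {
        offsets.push_back(off);
    }
    if (total_bytes != 0) {
        offsets.push_back(total_bytes);
    }
    return offsets;
}
```

For a 12KiB allocation this yields 4096, 8192, and 12288, matching the three LEA/MOV pairs of the unrolled example above. The 16384-byte parameter in the C++ program needs four touches, whether the space is being set aside for locals or for a calling parameters area.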

It is worth noting that the latter two points are strong arguments in favour of the MIPS standard function perilogue design over the (non-optimal) x86 standard function perilogue design. In the MIPS approach, setting up enough room to hold the largest parameters area of any called function is a part of the standard function perilogue itself, rather than being deferred to on-the-fly modifications to the stack pointer within the function body. As such, a single stack probe operation can be done in the perilogue that encompasses the sizes of the locals area, the saved frame pointer(s) area, the save area, and the calling parameters areas, all in one gulp.

Callback functions in Win16

In Win16, the standard function calling convention for callback functions requires that a function perform additional set up and teardown, within the standard perilogue. This nested perilogue ensures that the DS register within the function has whatever value the AX register had on entry to the function, and takes the form:

push ds
mov ds,ax
…
pop ds

In addition, the prologue must be prefixed with one of two 3-byte prefixes:

mov ax,ds
nop

or:

push ds
pop ax
nop

Seemingly, this is a very long-winded way of doing nothing, setting the DS register to what it already was and splatting the AX register along the way. If the function is not exported and used as a callback, that is exactly what it is.

The point of the extra perilogue that modifies the DS register during function execution is to allow an instance thunk to be set up with the MakeProcInstance() call in order to use the function as a callback. This instance thunk loads a target DS selector value into AX and calls the function. The Windows loader collaborates in this. For every function exported from an EXE or a DLL, it scans its first 3 bytes, and overwrites either of the aforegiven sequences with three nop instructions. (This of course makes it impossible to call them except through an instance thunk.)
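
The loader's part in this can be sketched in C++. The byte values are the usual 16-bit encodings of the two prefixes (8C D8 for mov ax,ds; 1E and 58 for push ds and pop ax; 90 for nop); they are supplied here for the illustration rather than taken from the text above, and patch_exported_prologue is a hypothetical name, not the Windows loader's own routine:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// If the code starts with either 3-byte prefix, overwrite the prefix
// with three NOPs and report that the function was patched.
bool patch_exported_prologue(std::vector<std::uint8_t>& code) {
    static const std::array<std::uint8_t, 3> mov_ax_ds_nop = {{0x8C, 0xD8, 0x90}};
    static const std::array<std::uint8_t, 3> push_ds_pop_ax_nop = {{0x1E, 0x58, 0x90}};
    if (code.size() < 3) {
        return false;
    }
    const bool matches =
        std::equal(mov_ax_ds_nop.begin(), mov_ax_ds_nop.end(), code.begin()) ||
        std::equal(push_ds_pop_ax_nop.begin(), push_ds_pop_ax_nop.end(), code.begin());
    if (!matches) {
        return false;
    }
    std::fill_n(code.begin(), 3, std::uint8_t{0x90});
    return true;
}
```

After patching, nothing in the function sets AX before the mov ds,ax in the nested perilogue, which is why the function then only works when entered through a thunk that has loaded AX first.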

The total function perilogue for a Win16 callback function, with the standard perilogue (in "optimize for time" form), the far function call marker, and the Win16 mechanism for making a function instance thunkable, was quite hefty. In 1989, Michael Geary discovered a trick that did away with a lot of this, by observing that the whole instance thunk mechanism wasn't necessary. In EXEs, the SS register already held the proper data segment selector; and in DLLs, one could simply perform a load of a constant which the program image loader would fixup to point to DGROUP for the DLL.

Comparison of Win16 function perilogues

Instance thunkable function:

mov ax,ds
nop
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

"Smart callback" in an EXE:

mov ax,ss
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

"Smart callback" in a DLL:

mov ax,DGROUP
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

Non-callback far function:

inc ebp
enter N,M
…
leave
dec ebp
ret

© Copyright 2010 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.

The gen on function perilogues.

Yes, "perilogue" is a real word — sort of.It's only ever been used as a technical term in computing,and was first used byEdward Anton Schneider of Carnegie-Mellon University in 1976to mean the start or finish of an operation. Clearly this is a usefulterm for the combination of a prologue and an epilogue, which areinseparable from each other when it comes to discussions of compiledfunctions in computer languages, and lack another word for theircombination.

As "prologue" comes from the Greek "προ", meaning "before",and as "epilogue" comes from the Greek "επι", meaning"after", so "perilogue" comes from the Greek "περι",meaning "around/about". Indeed, the word"περιλεγειν"actually exists in Classical Greek, in the writings of Hermippus, meaning"to talk around" something.

A function perilogue comprises a functionprologue and a function epilogue, which bracket the actualbody of a function fore and aft. The prologue and epilogue are stronglycoupled to one another, and are effectively a single unit, the perilogue.The prologue sets up an exection environment for the function body, andthe epilogue tears it down again.

Activation records and red zones

Function perilogues are, by their natures, specific to instruction setarchitectures. The standard perilogues for an architecture set up astandard stack frame for the architecture in the perilogue andtear it back down again in the epilogue.

Formally, a stack frame is part of a function's overallactivation record. An activation record formallycomprises:

  • a parameter area where the stack-stored parameters (or, insome function calling conventions such asIBM's Optlink calling convention,space reserved for register-stored parameters to be spilled into) areheld
  • a return address where the address of the next instruction toexecute within the calling function is stored
  • one or more saved frame pointers holding the callingfunctions'/function's saved frame pointer(s)
  • a save area where non-volatile registers that the functioncode modifies are saved across the function body
  • a locals area where function-local variables are stored, theaddress of the start of which is the current stack pointer
  • a red zone that is below the stack pointer

Strictly speaking, function activation records don't have to be stored oncall stacks. Modern processor architectures employ what is known asdynamic allocation for activation records, where activation records arecreated and destroyed on the fly, being pushed onto and popped off the topof a call stack. Another possibility, not employed in mainstreamarchitectures nowadays, is static allocation, where it is knownthat functions are not reëntrant and programs are not multi-threaded.With static allocation, activation records are simply stored in fixedportions of a program's read/write data area, determined at compile/linktime.

A red zone is the area immediately below the stack, that can beoverwritten at any time as a consequence of asynchronous events occuringduring execution of the function: signals, exceptions, or interrupts.Because it is liable to be destroyed at any moment, functions should notattempt to use it for storage.

In many processor architectures, the stack pointer registeralways points to the bottom of the standard stack frame. When a functionneeds itself to call other functions (i.e. it is not a so-called leaffunction) it has to construct the parameter area for the calledfunction's stack frame. There are two common techniques:

  • On the x86 architecture, code temporarily modifies the stackpointer to make space for the new parameter area, on the fly, by pushingand popping (or otherwise manipulating) the stack. This is not terriblyefficient, requiring specialized optimization hardware in x86 CPUs justto reduce its gross serial depencies upon a single processor register.As explained later, some optimizing compilers for x86 take a moreRISC-like approach.

  • On more RISC-like architectures such as MIPS, the convention is forspace large enough to hold the parameter area for any function that willbe called (which size the compiler obviously can work out at codegeneration time, by simply taking the maximum of the sizes of theindividual parameter areas) to be pre-allocated by the function prologue,below the locals area, with the stack pointer register pointing to itsbase. Thus a standard stack frame contains an extra area, a callingparameter area, below the locals area, part or all of which overlapsthe parameter areas of the activation records of called functions.

Many processor architectures also have a frame pointer register.Sometimes this points to the saved frame pointers area. Sometimes itpoints elsewhere within a standard stack frame. The frame pointer isn'tstrictly speaking used to locate the stack frame itself. Thestack pointer does that quite happily, after all. The frame pointer isused to provide simple access, using the shortest/quickest instructionforms, to the locals area and parameters area of a stack frame.For example:

  • On the x86 architecture, code temporarily modifies the stackpointer to make space for new calling parameter areas, on the fly, bypushing and popping (or otherwise manipulating) the stack. However, Theframe pointer register — (E)BP — remains fixed,pointing at the middle of the stack frame. Compilers thus generate codethat references function parameters and function-local variables usingregister-relative addressing via the frame pointer, and that referencescalling parameters using register-relative addressing via the stackpointer.

  • With the x86-64 standard function perilogue, the frame pointer is, by convention, exactly 128 bytes above the base of the stack frame. This doesn't necessarily point to any definite part of the stack frame, because stack frames for different functions (of course) have different sizes of locals areas, save areas, calling parameter areas, and so forth, meaning that whatever is at offset 128 is going to depend from exactly what function is called. The purpose of this offset is so that the frame pointer register can be used in preference to the stack pointer register when accessing the stack frame. The 128 byte offset means that short instructions using a register-relative addressing mode with signed 8-bit offsets can access 256 bytes of the stack frame via the frame pointer register, as opposed to only 128 bytes of the stack frame via the stack pointer register.
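The 256-versus-128 arithmetic follows directly from the range of a signed 8-bit displacement, which is -128 to +127. A minimal C++ sketch (C++14 or later) of that reasoning follows; the function name and the frame size used in the usage note are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>

// Number of bytes of a stack frame reachable with a signed 8-bit
// displacement (-128..+127) from a register that points
// `register_offset` bytes above the base of the frame.
// (Hypothetical helper, for illustration only.)
constexpr std::size_t disp8_reach(long register_offset, long frame_size)
{
    // Frame-relative byte range that the displacement can address,
    // clamped to the frame itself, which spans [0, frame_size).
    const long lowest  = std::max(register_offset - 128, 0L);
    const long highest = std::min(register_offset + 127, frame_size - 1);
    return highest >= lowest ? static_cast<std::size_t>(highest - lowest + 1) : 0;
}
```

For a 1024-byte frame, a register at the base (offset 0, like the stack pointer) reaches 128 bytes, because the negative half of the displacement range points below the frame. A register 128 bytes above the base reaches 256 bytes, because both halves of the range land inside the frame.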

The standard x86 function perilogue

The standard function perilogue on the x86 architecture comprises the ENTER and LEAVE instructions:

enter N,M
…
leave
ret

This standard x86 perilogue sets up and tears down a standard x86 stack frame. The base of the stack frame is pointed to by the ESP register. From there upwards, the frame comprises N bytes for function-local variables, followed by M+1 saved frame pointers, pointing to the stack frame(s) of the calling function(s), followed by the return address into the function's caller and the (stack-stored) parameters to the function.

The EBP register points to the first (i.e. immediately enclosing caller's) saved frame pointer, which in turn was the value of the EBP register within that function's standard perilogue.
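The layout just described can be written down as ESP-relative offsets. The following C++ sketch assumes 32-bit code, so that each saved frame pointer and the near return address occupy 4 bytes apiece. The structure and function names are hypothetical, introduced only for this illustration.

```cpp
#include <cstddef>

// ESP-relative byte offsets of the areas that "enter N,M" lays out in
// 32-bit code, where every saved frame pointer occupies 4 bytes.
// (Hypothetical names, for illustration only.)
struct EnterFrameLayout {
    std::size_t locals;               // N bytes of locals start at ESP
    std::size_t saved_frame_pointers; // the M+1 saved frame pointers start here
    std::size_t frame_pointer;        // where EBP points: the caller's saved EBP
    std::size_t return_address;       // near (4-byte) return address
    std::size_t parameters;           // first stack-stored parameter
};

constexpr EnterFrameLayout enter_frame_layout(std::size_t n, std::size_t m)
{
    return EnterFrameLayout{
        0,
        n,
        n + 4 * m,            // the caller's EBP is pushed first, so lies highest
        n + 4 * (m + 1),
        n + 4 * (m + 1) + 4,
    };
}
```

With `enter 16,0`, for example, the saved frame pointer sits 16 bytes above ESP, the return address 20 bytes above, and the first parameter 24 bytes above, which is the familiar 8 bytes above where EBP points.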

Although this is the smallest size for such a perilogue, it is not necessarily the fastest to execute, especially in the case where there is only the 1 enclosing stack frame pointer to be saved. Moreover, the ENTER and LEAVE instructions did not exist on the 8086. So traditionally, and in some cases for speed, the standard perilogue (for a non-nested function, where M will be zero) has an alternative form, that does exactly the same thing:

push ebp
mov ebp, esp
sub esp, N
…
mov esp, ebp
pop ebp
ret

Interestingly, this is not optimal. The execution of each PUSH or POP instruction depends from the result of execution of all preceding ones, since each instruction has to read, modify, and write the stack pointer register. Consider the case where the function uses several non-volatile registers (for the sake of exposition, assume the 32-bit OS/2 system API calling convention and a function body that does something like an optimized memcpy() using MOVSD and thus requiring EDI and ESI), and as a consequence its perilogue has to save and restore several non-volatile register values on the call stack in the save area:

push ebp
mov ebp, esp
sub esp, N
push edi
push esi
push ebx
…
pop ebx
pop esi
pop edi
mov esp, ebp
pop ebp
ret

Each PUSH instruction (and, indeed, the RET instruction at the end) has a sequential dependency from the instructions that immediately precede it, because they all need to know the ESP value from their predecessors. Similarly, the second and subsequent POP instructions all have sequential dependencies on their immediate predecessors. Worse still, the incremented ESP register value is then entirely discarded anyway. This does not provide the processor with the ability to schedule multiple operations in parallel, internally, as many x86 processors are nowadays capable of.

The Intel Pentium M and the Intel Atom have a mechanism called "ESP folding" that ameliorates this to an extent. ESP folding handles the implicit accesses to the ESP register, by PUSH, POP, CALL, and RET, in the Address Generation Unit (AGU). This reduces the impact of the multiple successive PUSH and POP instructions in the aforegiven non-optimal perilogue. However, as Intel's IA-32 Software Optimization Reference Manual explains (see §2.4.1, §3.4.2.6, §12.3.2.2, and §12.3.3.6), this doesn't solve all of the execution speed problems with this perilogue, because it still contains explicit accesses to the ESP register, in the SUB and MOV instructions. As the Reference Manual states, mixing the Arithmetic Logic Unit (ALU) with the AGU causes execution stalls. And that's exactly what mixing SUB/MOV with PUSH/POP does.

Moreover: These are but two x86 processors from one vendor that even have ESP folding in the first place. Other processors have no such amelioration even for the sequential dependencies of the PUSH, POP, and RET instructions.

All of these problems go away by employing a more optimal perilogue, whose speedy execution isn't limited to just a certain few Intel x86 processors with special-case support. A more optimal perilogue has just two read-modify-write operations on ESP, manipulating it once each in prologue/epilogue and then using instructions that have no serial dependencies upon one another to place/retrieve the saved non-volatile registers and saved frame pointer onto/from the stack. (On other, more RISC-like, processor architectures, this is the approach taken by standard function perilogues. Even on x86 architectures, this is the approach taken by some optimizing compilers when setting up the parameter area before calling a function.) Further optimization can be enabled by following Intel's recommendation to not mix the ALU with the AGU, using LEA to calculate the new effective ESP address (which is, after all, what LEA is there for). Yet more optimizations still can be performed by taking advantage of the write-combining properties of the processor's L1 cache, and scheduling the MOV instructions accordingly.

Combining all these results in a standard perilogue that looks like:

LOCALS_SIZE equ …
SAVE_SIZE   equ 16

lea esp, [esp-LOCALS_SIZE-SAVE_SIZE]
mov [esp+LOCALS_SIZE+SAVE_SIZE-16], ebx
mov [esp+LOCALS_SIZE+SAVE_SIZE-12], esi
mov [esp+LOCALS_SIZE+SAVE_SIZE-8], edi
mov [esp+LOCALS_SIZE+SAVE_SIZE-4], ebp
lea ebp, [esp+LOCALS_SIZE+SAVE_SIZE-4]
…
mov ebx, [esp+LOCALS_SIZE+SAVE_SIZE-16]
mov esi, [esp+LOCALS_SIZE+SAVE_SIZE-12]
mov edi, [esp+LOCALS_SIZE+SAVE_SIZE-8]
mov ebp, [esp+LOCALS_SIZE+SAVE_SIZE-4]
lea esp, [esp+LOCALS_SIZE+SAVE_SIZE]
ret

As mentioned, when setting up the parameter area of a new stack frame, in order to call a function, some optimizing compilers (when optimizing for time, not space) use the above approach, thus:

; … code that calculates parameters in local variables …
mov edx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_3]
mov ecx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_2]
mov eax,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_1]
lea esp, [esp-PARAMS_SIZE]
mov [esp+8], edx
mov [esp+4], ecx
mov [esp+0], eax
call function
lea esp, [esp+PARAMS_SIZE]

However, other compilers generate code that simply PUSHes and POPs the stack within the body of the function to set up parameter areas. To add insult to injury, the code also uses an ALU instruction, ADD, on the ESP register immediately after the called function has issued an AGU instruction that is "ESP foldable", RET, causing an execution stall.

; … code that calculates parameters in local variables …
mov edx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_3]
mov ecx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_2]
mov eax,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_1]
push edx
push ecx
push eax
call function
add esp, PARAMS_SIZE

Both approaches, PUSH and MOV, modify ESP on the fly, allocating and deallocating call stack space for the new parameter area of each individual called function as it is called. In general, therefore, in x86 programming the stack pointer register is not always at a fixed position at the base of the function's standard stack frame (as it is by convention on other instruction set architectures). This in turn means that accessing local variables and function parameters is generally done via the frame pointer, using positive constant offsets for function parameters and negative constant offsets for local variables, not the stack pointer.
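The signs of those frame-pointer-relative offsets can be illustrated with a short C++ sketch. It assumes the traditional `push ebp` / `mov ebp, esp` / `sub esp, N` prologue shown earlier, a near (4-byte) return address, and parameters that each occupy one 4-byte stack slot. The function names are hypothetical.

```cpp
#include <cstddef>

// EBP-relative displacement of the index-th 4-byte stack parameter
// (0-based) after "push ebp / mov ebp, esp": the saved EBP sits at
// [ebp+0] and the near return address at [ebp+4].
// (Hypothetical helper, for illustration only.)
constexpr long parameter_displacement(std::size_t index)
{
    return 8 + 4 * static_cast<long>(index);
}

// EBP-relative displacement of a local that lies `offset_in_locals`
// bytes above the bottom of an N-byte locals area allocated by
// "sub esp, N". (Hypothetical helper, for illustration only.)
constexpr long local_displacement(std::size_t locals_size, std::size_t offset_in_locals)
{
    return static_cast<long>(offset_in_locals) - static_cast<long>(locals_size);
}
```

Under these assumptions the first parameter is addressed as `[ebp+8]`, and with a 16-byte locals area the locals occupy `[ebp-16]` up to `[ebp-1]`, whatever ESP happens to be doing at the time.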

The standard MIPS function perilogue

The standard function perilogue on the MIPS architecture is fairly similar to the perilogue on the x86 architecture:

subu sp,frame_size
sw ra,frame_size-8(sp)
…
lw ra,frame_size-8(sp)
addu sp,frame_size
jr ra

There are several notable differences:

  • The MIPS standard perilogue allows greater instruction parallelism by only writing to the stack pointer register once, allowing all register saves/loads to be overlapped since they don't depend from each other's results as a sequence of x86 PUSH and POP instructions do.

    Saving a non-volatile register, that the function wants to use for itself, in the save area of the stack frame is merely a matter of an additional sp-relative save/load pair:

    sw s0,frame_size-16(sp)
    …
    lw s0,frame_size-16(sp)

  • The MIPS standard perilogue involves saving the return address from a register onto the call stack and then retrieving it from the call stack before returning. The x86 architecture's CALL and RET instructions do this implicitly. In the MIPS architecture, the return address is in the ra ("return address") register, and how (and indeed whether) it is saved into a stack frame is up to each individual function. ra is less of a special case register and more like a simple non-volatile register, that one handles like any other non-volatile register: saving it in the save area if the function needs to re-use it itself (as a non-leaf function will, but a leaf function will not).

  • By convention, the stack frame on MIPS includes enough space to construct the largest parameters area of any called function, and constructing a parameters area again is a sequence of parallelizable stores that are not interdependent. For example:

    sw s0,16(sp) # set parameter #5
    sw s1,20(sp) # set parameter #6
    jal called_function

There are also similarities. There are stack pointer (sp) and frame pointer (fp) registers. The frame pointer points to the middle of a stack frame, and the stack pointer points to the bottom.

Stack walking

On x86 architectures, because the frame pointer register (EBP) points to the saved frame pointer of the immediate caller, it is possible to walk the stack, as long as every function employs a standard stack frame. Such a process starts with the current frame pointer register value and follows the saved stack frame pointers until the top of the stack area (or an invalid saved frame pointer value) is reached.
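The walk can be sketched in C++ against a toy model of memory. Everything here is an assumption made for illustration: the `Memory` map standing in for a 32-bit address space, the function name, and the rule that a saved frame pointer not above the current one counts as invalid. A real stack walker reads the target's memory directly and has more validity checks to make.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// A toy 32-bit address space: address -> 32-bit word stored there.
using Memory = std::map<std::uint32_t, std::uint32_t>;

// Follow the chain of saved frame pointers starting from the current
// EBP, collecting the return address stored just above each one.
// Stops at the top of the stack area or at an invalid saved value.
std::vector<std::uint32_t> walk_stack(const Memory& memory,
                                      std::uint32_t ebp,
                                      std::uint32_t stack_top)
{
    assert(ebp < stack_top);              // must start inside the stack area
    std::vector<std::uint32_t> return_addresses;
    while (ebp < stack_top) {
        const auto saved = memory.find(ebp);      // [ebp+0]: caller's EBP
        const auto ret   = memory.find(ebp + 4);  // [ebp+4]: return address
        if (saved == memory.end() || ret == memory.end())
            break;                        // unreadable frame: stop
        return_addresses.push_back(ret->second);
        if (saved->second <= ebp)
            break;                        // callers live at higher addresses
        ebp = saved->second;
    }
    return return_addresses;
}
```

Note that the sketch depends entirely upon every function on the chain having stored its caller's EBP where EBP points, which is exactly the standard stack frame requirement stated above.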

Many compilers targeting x86 platforms provide options for disabling the generation of a standard perilogue for functions that can do without one, or disabling just the part that involves saving and restoring the caller's frame pointer and setting up and tearing down a new one for the called function. However, walking the stack is employed by debuggers, exception handlers, and postmortem analysis utilities. So although many compilers provide these options, one should be aware of the effect that they will have upon debugging, exception handling, and postmortem analyses of a program's execution.

One trick with frame pointers used on 16:32 and 16:16 x86 code is to signal the calling distance of a function, by setting a flag in its saved frame pointer for the calling function. This allows anything walking the stack to know whether the return address in the stack frame is a near (0:16 or 0:32) or a far (16:16 or 16:32) address (and hence where the parameter area begins relative to the frame pointer area).

This marking is done by taking advantage of the fact that in normal use a stack frame will never be aligned to an odd address. (Compilers and operating systems conspire to enforce this. Compilers ensure that they only ever change the value of ESP by an even number of bytes, and operating systems ensure that stack tops are always aligned to a multiple of 2 bytes.) A far function is marked by the simple expedient of incrementing and decrementing the saved frame pointer by 1 byte. Bit #0 of the saved frame pointer thus becomes a "far function" flag. Such a perilogue generally looks like this:

inc ebp
enter N,M
…
leave
dec ebp
ret

Stack walking code has, of course, to be aware of whether this convention is used by the libraries on the target platform that the program is running on, and whether the program itself was compiled to employ it.
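A stack walker that knows the convention to be in use has to split each saved frame pointer into its flag and its real value before following it. A minimal C++ sketch, with hypothetical names and assuming 32-bit frame pointers:

```cpp
#include <cstdint>

// A saved frame pointer as found on the stack when the "inc ebp"
// far-function marking convention is in use.
// (Hypothetical names, for illustration only.)
struct SavedFramePointer {
    std::uint32_t frame_pointer; // the caller's real (even) frame pointer
    bool far_function;           // bit #0 was set: a far return address follows
};

constexpr SavedFramePointer decode_saved_frame_pointer(std::uint32_t raw)
{
    return SavedFramePointer{raw & ~std::uint32_t{1}, (raw & 1u) != 0};
}
```

The walker then continues from `frame_pointer`, and uses `far_function` to decide whether the return address above the saved frame pointer includes a segment selector, and hence where the parameter area starts.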

Stack probes

Both 32-bit OS/2 and Win32 provide application software with the capability of having what are known as decommitted stacks. A decommitted stack is a thread's stack area that starts off allocated but not committed. In other words: The range of virtual address space is allocated to the stack, but there are no virtual memory pages committed into that range of addresses. Decommitted stacks allow applications to specify large stack sizes without incurring (unless actually required as the program executes) the overhead of all of the virtual memory pages that would be necessary were the stack area fully committed. This is done via a mechanism that involves guard pages.

The details of operation of the guard page mechanism can be found in the 32-bit OS/2 Developers' Toolkit documentation and the MSDN documentation for Win32. Simply put: All pages in a thread's stack down to some point are committed; the next page is a committed page marked as a guard page; and all pages below that are not committed. Accessing a guard page causes a page fault and an application-mode exception, in response to which the operating system automatically turns the guard page into an ordinary committed page (by removing its guard page attribute), commits the next lower page in the stack (if it can), and (if committed) marks that next lower page as a guard page in its turn.

Thus any application that guarantees that it will access its stack space more-or-less as an actual stack, pushing things onto the top serially, will have its initially entirely decommitted stack automatically committed for it by the operating system as the stack grows downwards. (When the operating system hits the bottom of the stack, usually an exception is raised by the operating system. But that's another discussion, with complexities and subtleties of its own, tangential to this one.)
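The behaviour can be modelled with a toy C++ class that counts in whole pages. This is a deliberately simplified sketch, not how either operating system implements the mechanism: the class name, its page numbering, and the decision to report a skipped guard page as a plain `false` are all assumptions made for illustration, and the bottom-of-stack exception is left out.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a decommitted stack, counted in whole pages. Page 0 is
// the lowest page of the stack area; the stack grows downwards, from
// higher page numbers towards page 0. (Hypothetical class.)
class DecommittedStack {
public:
    // Start with only the topmost page ordinarily committed and the
    // page below it acting as the guard page.
    explicit DecommittedStack(std::size_t total_pages)
        : total_pages_(total_pages), lowest_ordinary_page_(total_pages - 1)
    {
        assert(total_pages >= 2);
    }

    // Simulate a memory access to `page`. Returns false for an access
    // to a page that is neither committed nor the guard page.
    bool touch(std::size_t page)
    {
        assert(page < total_pages_);
        if (page >= lowest_ordinary_page_)
            return true;                  // ordinary committed page
        if (page + 1 == lowest_ordinary_page_) {
            --lowest_ordinary_page_;      // guard page becomes ordinary; the
            return true;                  // page below it is the new guard
        }
        return false;                     // skipped past the guard page
    }

    // Number of ordinary committed pages, not counting the guard page.
    std::size_t ordinary_page_count() const
    {
        return total_pages_ - lowest_ordinary_page_;
    }

private:
    std::size_t total_pages_;
    std::size_t lowest_ordinary_page_;
};
```

Touching pages in descending order, one at a time, always succeeds in this model. Touching a page two or more below the lowest ordinary page does not, which is the failure that the rest of this section is concerned with.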

The x86 standard function perilogue does access its stack space more-or-less like a stack. It pushes activation records onto the call stack, and pops them back off again. The problem ensues when an activation record is larger than a memory page.

More particularly, the problem ensues when the perilogue is using the SUB or LEA instructions to decrement the stack pointer. A PUSH or ENTER instruction decrements the stack pointer, but also touches the memory at the stack pointer address. The write access to the memory location will trigger the guard page mechanism. A perilogue that used PUSH or ENTER to decrement the stack pointer exclusively would never suffer a problem. But perilogues don't always use PUSH or ENTER. Indeed, as already discussed, for the greatest speed optimization a perilogue doesn't use PUSH, POP, ENTER, or LEAVE anywhere, but rather uses LEA to modify the stack pointer register and ESP-relative MOV instructions to save and restore registers and frame pointers. It doesn't necessarily incorporate those MOV instructions in strictly descending order of stack location, either.

This is where stack probes come in. Stack probes are a simple idea. The compiler generates extra code in the function perilogue if there's a danger that it might skip over more than one page of the stack without actually touching it when pushing the function's activation record.

Usually, compilers simply note when the size of the locals area is larger than a page, and spit out extra dummy memory references, immediately after the SUB instruction that decrements ESP, that touch the intervening stack pages in the right, descending, order. There are several complications to this scheme:

  • ESP cannot be decremented more than 1 page past the lowest stack page access at any point. This is because an asynchronous exception, whose handler would of course immediately start pushing things onto the stack at the current ESP address, may arrive at any point during the stack probe sequence itself. So compilers take one of two approaches:

    • They do all stack probe operations before decrementing ESP, by writing into the red zone. It's safe to touch the red zone, just not safe to rely upon it retaining the values that one may have written to it. Stack probes don't care about values of the memory locations touched. Indeed, some stack probe mechanisms write completely arbitrary data themselves (such as zero, or 0xdeadbeef, or whatever happens to be in the EAX register at the time).

    • They perform progressive stack probes, decrementing ESP one page at a time. For example, a 12KiB stack probe, unrolled, would look like:

      lea esp,[esp - 1000h]
      mov [esp],eax
      lea esp,[esp - 1000h]
      mov [esp],eax
      lea esp,[esp - 1000h]
      mov [esp],eax

  • The range of stack locations that the stack probe has to cover is not simply the size of the locals area. If the function perilogue doesn't use PUSH to save non-volatile registers into the save area, the size of the save area must be added to the probe total as well. Similarly with saved stack frame pointers, if they are stored with MOV rather than PUSH. Basically, however much space is to be skipped over with SUB or LEA is however much space needs to be probed.

  • An often forgotten part of the function activation record, when it comes to stack probes, is the parameters area. Unless a compiler always uses PUSH to place parameter values onto the call stack, it must also perform stack probes when it calls functions that take more than one page's worth of parameters. Here is a simple C++ program that illustrates a case where stack probes of the calling parameters area are also required:

    struct s { char b[16384] ; } ;
    
    int f ( s p ) { return p.b[0] ; }
    
    s d ;
    
    int main () { return f(d) ; }

    Unfortunately, many compilers forget that stack probes are also required for setting up a calling parameters area properly, and don't employ any stack probe generation logic when generating calls to functions. Borland C/C++, MetaWare High C/C++, and EMX C/C++ all overlook this necessity. Watcom C/C++ does not.
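Whichever of the two approaches a compiler takes, it has to work out which locations to touch for a given amount of space. The following C++ sketch computes them as distances below the original stack pointer, nearest first, so that no two successive probes are more than a page apart. The function name is hypothetical, and the page size is a parameter rather than a fixed 4KiB because the sketch makes no assumption about the target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Distances, in bytes below the original stack pointer, of the
// locations that a stack probe should touch, in descending address
// order, when `allocation_size` bytes are about to be skipped over
// with SUB or LEA. (Hypothetical helper, for illustration only.)
std::vector<std::size_t> probe_distances(std::size_t allocation_size,
                                         std::size_t page_size)
{
    assert(page_size > 0);
    std::vector<std::size_t> distances;
    // One probe per whole page strictly inside the allocation ...
    for (std::size_t d = page_size; d < allocation_size; d += page_size)
        distances.push_back(d);
    // ... then one at the new base of the stack itself.
    if (allocation_size > 0)
        distances.push_back(allocation_size);
    return distances;
}
```

For the 12KiB case with 4KiB pages this gives three probes, at 4096, 8192, and 12288 bytes below the original stack pointer, matching the unrolled sequence shown earlier. The same calculation applies to a calling parameters area such as the 16384-byte one in the C++ program above, and the total passed in should include the save area and saved frame pointers whenever those are stored with MOV.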

It is worth noting that the latter two points are strong arguments in favour of the MIPS standard function perilogue design over the (non-optimal) x86 standard function perilogue design. In the MIPS approach, setting up enough room to hold the largest parameters area of any called function is a part of the standard function perilogue itself, rather than being deferred to on-the-fly modifications to the stack pointer within the function body. As such, a single stack probe operation can be done in the perilogue that encompasses the sizes of the locals area, the saved frame pointer(s) area, the save area, and the calling parameters areas, all in one gulp.

Callback functions in Win16

In Win16, the standard function calling convention for callback functions requires that a function perform additional set up and teardown, within the standard perilogue. This nested perilogue ensures that the DS register within the function has whatever value the AX register had on entry to the function, and takes the form:

push ds
mov ds,ax
…
pop ds

In addition, the prologue must be prefixed with one of two 3-byte prefixes:

mov ax,ds
nop

or:

push ds
pop ax
nop

Seemingly, this is a very long-winded way of doing nothing, setting the DS register to what it already was and splatting the AX register along the way. If the function is not exported and used as a callback, that is exactly what it is.

The point of the extra perilogue that modifies the DS register during function execution is to allow an instance thunk to be set up with the MakeProcInstance() call in order to use the function as a callback. This instance thunk loads a target DS selector value into AX and calls the function. The Windows loader collaborates in this. For every function exported from an EXE or a DLL, it scans its first 3 bytes, and overwrites either of the aforegiven sequences with three nop instructions. (This of course makes it impossible to call them except through an instance thunk.)
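The loader's part of the collaboration amounts to a three-byte pattern match and overwrite. The C++ sketch below illustrates the idea only; it is not the Windows loader's actual code. The function name and the byte-vector representation of a code segment are assumptions, and the byte values are the usual 16-bit encodings of the two prefixes (8C D8 90 for `mov ax,ds` / `nop`, and 1E 58 90 for `push ds` / `pop ax` / `nop`).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Overwrite a recognised 3-byte prefix at the entry point of an
// exported function with three NOPs. Returns true if one of the two
// prefixes was found and patched. (Hypothetical helper.)
bool patch_export_prefix(std::vector<std::uint8_t>& code, std::size_t entry)
{
    assert(entry + 3 <= code.size());     // need three bytes to inspect
    constexpr std::uint8_t nop = 0x90;
    // 16-bit encodings of "mov ax,ds / nop" and "push ds / pop ax / nop".
    constexpr std::array<std::array<std::uint8_t, 3>, 2> prefixes{{
        {{0x8C, 0xD8, nop}},
        {{0x1E, 0x58, nop}},
    }};
    for (const auto& prefix : prefixes) {
        if (std::equal(prefix.begin(), prefix.end(), code.data() + entry)) {
            std::fill_n(code.data() + entry, 3, nop);
            return true;
        }
    }
    return false;                         // not a thunkable prologue
}
```

After patching, the function no longer loads AX from DS itself, so the `mov ds,ax` that follows picks up whatever the caller put into AX. That is why such a function can thereafter only sensibly be called through an instance thunk.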

The total function perilogue for a Win16 callback function, with the standard perilogue (in "optimize for time" form), the far function call marker, and the Win16 mechanism for making a function instance thunkable, was quite hefty. In 1989, Michael Geary discovered a trick that did away with a lot of this, by observing that the whole instance thunk mechanism wasn't necessary. In EXEs, the SS register already held the proper data segment selector; and in DLLs, one could simply perform a load of a constant which the program image loader would fixup to point to DGROUP for the DLL.

Comparison of Win16 function perilogues

Instance thunkable function:

mov ax,ds
nop
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

"Smart callback" in an EXE:

mov ax,ss
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

"Smart callback" in a DLL:

mov ax,DGROUP
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

Non-callback far function:

inc ebp
enter N,M
…
leave
dec ebp
ret

© Copyright 2010 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.

The gen on function perilogues.

Yes, "perilogue" is a real word — sort of.It's only ever been used as a technical term in computing,and was first used byEdward Anton Schneider of Carnegie-Mellon University in 1976to mean the start or finish of an operation. Clearly this is a usefulterm for the combination of a prologue and an epilogue, which areinseparable from each other when it comes to discussions of compiledfunctions in computer languages, and lack another word for theircombination.

As "prologue" comes from the Greek "προ", meaning "before",and as "epilogue" comes from the Greek "επι", meaning"after", so "perilogue" comes from the Greek "περι",meaning "around/about". Indeed, the word"περιλεγειν"actually exists in Classical Greek, in the writings of Hermippus, meaning"to talk around" something.

A function perilogue comprises a functionprologue and a function epilogue, which bracket the actualbody of a function fore and aft. The prologue and epilogue are stronglycoupled to one another, and are effectively a single unit, the perilogue.The prologue sets up an exection environment for the function body, andthe epilogue tears it down again.

Activation records and red zones

Function perilogues are, by their natures, specific to instruction setarchitectures. The standard perilogues for an architecture set up astandard stack frame for the architecture in the perilogue andtear it back down again in the epilogue.

Formally, a stack frame is part of a function's overallactivation record. An activation record formallycomprises:

  • a parameter area where the stack-stored parameters (or, insome function calling conventions such asIBM's Optlink calling convention,space reserved for register-stored parameters to be spilled into) areheld
  • a return address where the address of the next instruction toexecute within the calling function is stored
  • one or more saved frame pointers holding the callingfunctions'/function's saved frame pointer(s)
  • a save area where non-volatile registers that the functioncode modifies are saved across the function body
  • a locals area where function-local variables are stored, theaddress of the start of which is the current stack pointer
  • a red zone that is below the stack pointer

Strictly speaking, function activation records don't have to be stored oncall stacks. Modern processor architectures employ what is known asdynamic allocation for activation records, where activation records arecreated and destroyed on the fly, being pushed onto and popped off the topof a call stack. Another possibility, not employed in mainstreamarchitectures nowadays, is static allocation, where it is knownthat functions are not reëntrant and programs are not multi-threaded.With static allocation, activation records are simply stored in fixedportions of a program's read/write data area, determined at compile/linktime.

A red zone is the area immediately below the stack, that can beoverwritten at any time as a consequence of asynchronous events occuringduring execution of the function: signals, exceptions, or interrupts.Because it is liable to be destroyed at any moment, functions should notattempt to use it for storage.

In many processor architectures, the stack pointer registeralways points to the bottom of the standard stack frame. When a functionneeds itself to call other functions (i.e. it is not a so-called leaffunction) it has to construct the parameter area for the calledfunction's stack frame. There are two common techniques:

  • On the x86 architecture, code temporarily modifies the stackpointer to make space for the new parameter area, on the fly, by pushingand popping (or otherwise manipulating) the stack. This is not terriblyefficient, requiring specialized optimization hardware in x86 CPUs justto reduce its gross serial depencies upon a single processor register.As explained later, some optimizing compilers for x86 take a moreRISC-like approach.

  • On more RISC-like architectures such as MIPS, the convention is forspace large enough to hold the parameter area for any function that willbe called (which size the compiler obviously can work out at codegeneration time, by simply taking the maximum of the sizes of theindividual parameter areas) to be pre-allocated by the function prologue,below the locals area, with the stack pointer register pointing to itsbase. Thus a standard stack frame contains an extra area, a callingparameter area, below the locals area, part or all of which overlapsthe parameter areas of the activation records of called functions.

Many processor architectures also have a frame pointer register.Sometimes this points to the saved frame pointers area. Sometimes itpoints elsewhere within a standard stack frame. The frame pointer isn'tstrictly speaking used to locate the stack frame itself. Thestack pointer does that quite happily, after all. The frame pointer isused to provide simple access, using the shortest/quickest instructionforms, to the locals area and parameters area of a stack frame.For example:

  • On the x86 architecture, code temporarily modifies the stackpointer to make space for new calling parameter areas, on the fly, bypushing and popping (or otherwise manipulating) the stack. However, Theframe pointer register — (E)BP — remains fixed,pointing at the middle of the stack frame. Compilers thus generate codethat references function parameters and function-local variables usingregister-relative addressing via the frame pointer, and that referencescalling parameters using register-relative addressing via the stackpointer.

  • With the x86-64 standard function perilogue, the frame pointer is, byconvention, exactly 128 bytes above the base of the stack frame. Thisdoesn't necessarily point to any definite part of the stack frame, becausestack frames for different functions (of course) have different sizes oflocals areas, save areas, calling parameter areas, and so forth, meaningthat whatever is at offset 128 is going to depend from exactly whatfunction is called. The purpose of this offset is so that the framepointer register can be used in preference to the stack pointer registerwhen accessing the stack frame. The 128 byte offset means that shortinstructions using a register-relative addressing mode with signed 8-bitoffsets can access 256 bytes of the stack frame via the frame pointerregister, as opposed to only 128 bytes of the stack frame via the stackpointer register.

The standard x86 function perilogue

The standard function perilogue on the x86architecture comprises the ENTER and LEAVEinstructions:

enter N,M
…
leave
ret

This standard x86 perilogue sets up and tears down a standard x86stack frame. The base of the stack frame is pointed to by theESP register, and comprises N bytes for function-localvariables, followed by M+1 saved frame pointers, pointing to thestack frame(s) of the calling function(s), followed by the caller to thefunction's return address and the (stack-stored) parameters to thefunction.

The EBP register points to the first (i.e. immediatelyenclosing caller's) saved frame pointer, which in turn was the value ofthe EBP register within that function's standard perilogue.

Although this is the smallest size for such a perilogue, it is notnecessarily the fastest to execute, especially in the case where there isonly the 1 enclosing stack frame pointer to be saved. Moreover, theENTER and LEAVE instructions did not exist onthe 8086. So traditionally, and in some cases for speed, the standardperilogue (for a non-nested function, where M will be zero)has an alternative form, that does exactly the same thing:

push ebp
mov ebp, esp
sub esp, N
…
mov esp, ebp
pop ebp
ret

Interestingly, this is not optimal. The execution of eachPUSH or POP instruction depends from the resultof execution of all preceding ones, since each instruction has to read,modify, and write the stack pointer register. Consider the case where thefunction uses several non-volatile registers (for the sake of exposition,assumethe 32-bit OS/2 system API calling convention and a function body thatdoes something like an optimized memcpy() usingMOVSD and thus requiring EDI andESI), and as a consequence its perilogue has to save andrestore several non-volatile register values on the call stack in the savearea:

push ebp
mov ebp, esp
sub esp, N
push edi
push esi
push ebx
…
pop ebx
pop esi
pop edi
mov esp, ebp
pop ebp
ret

Each PUSH instruction (and, indeed, the RETinstruction at the end) has a sequential dependency from the instructionsthat immediately precede it, because they all need to know theESP value from their predecessors. Similarly, the second andsubsequent POP instructions all have sequential dependencieson their immediate predecessors. Worse still, the incrementedESP register value is then entirely discardedanyway. This does not provide the processor with the ability toschedule multiple operations in parallel, internally, as many x86processors are nowadays capable of.

The Intel Pentium M and the Intel Pentium Atom have a mechanism called"ESP folding" that ameliorates this to an extent. ESP folding handles theimplicit accesses to the ESP register, by PUSH,POP, CALL, and RET, in the AddressGeneration Unit (AGU). This reduces the impact of the multiple successivePUSH and POP instructions in the aforegivennon-optimal perilogue. However, asIntel's IA-32 Software Optimization Reference Manualexplains (see §2.4.1, §3.4.2.6, §12.3.2.2, and§12.3.3.6), this doesn't solve all of the execution speed problemswith this perilogue, because it still contains explicitaccesses to the ESP register, in the SUB andMOV instructions. As the Reference Manual statesmixing the Arithmetic Logic Unit (ALU) with the AGU causes executionstalls. And that's exactly what mixing SUB/MOVwith PUSH/POP does.

Moreover: These are but two x86 processors from one vendor that even haveESP folding in the first place. Other processors have no suchamelioration even for the sequential dependencies of thePUSH, POP, and RET instructions.

All of these problems go away by employing a more optimal perilogue, whosespeedy execution isn't limited to just a certain few Intel x86 processorswith special-case support. A more optimal perilogue has just tworead-modify-write operations on ESP, manipulating it onceeach in prologue/epilogue and then using instructions that have no serialdependencies upon one another to place/retrieve the saved non-volatileregisters and saved frame pointer onto/from the stack. (On other, moreRISC-like, processor architectures, this is the approach taken by standardfunction perilogues. Even on x86 architectures, this is the approachtaken by some optimizing compilers when setting up the parameter areabefore calling a function.) Further optimization can be enabled byfollowing Intel's recommendation to not mix the ALU with the AGU, usingLEA to calculate the new effective ESP address(which is, after all, what LEA is there for). Yet moreoptimizations still can be performed by taking advantage of thewrite-combining properties of the processor's L1 cache, and scheduling theMOV instructions accordingly.

Combining all these results in a standard perilogue that looks like:

LOCALS_SIZE equ …
SAVE_SIZE   equ 16

lea esp, [esp-LOCALS_SIZE-SAVE_SIZE]
mov [esp+LOCALS_SIZE+SAVE_SIZE-16], ebx
mov [esp+LOCALS_SIZE+SAVE_SIZE-12], esi
mov [esp+LOCALS_SIZE+SAVE_SIZE-8], edi
mov [esp+LOCALS_SIZE+SAVE_SIZE-4], ebp
lea ebp, [esp+LOCALS_SIZE+SAVE_SIZE-4]
…
mov ebx, [esp+LOCALS_SIZE+SAVE_SIZE-16]
mov esi, [esp+LOCALS_SIZE+SAVE_SIZE-12]
mov edi, [esp+LOCALS_SIZE+SAVE_SIZE-8]
mov ebp, [esp+LOCALS_SIZE+SAVE_SIZE-4]
lea esp, [esp+LOCALS_SIZE+SAVE_SIZE]
ret
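
The offset arithmetic in the listing above can be checked with a small model. The following Python sketch is only an illustration: the function name perilogue_layout and the example values (an entry ESP of 0x1000 and a LOCALS_SIZE of 32) are assumptions, while SAVE_SIZE is 16 as in the listing.

```python
SAVE_SIZE = 16  # four 32-bit registers: EBX, ESI, EDI, EBP (as in the listing)

def perilogue_layout(entry_esp, locals_size):
    """Model the addresses used by the LEA/MOV prologue (hypothetical helper)."""
    # lea esp,[esp-LOCALS_SIZE-SAVE_SIZE]: the single write to ESP
    esp = entry_esp - locals_size - SAVE_SIZE
    # Every save slot is addressed relative to the new ESP, so the four
    # MOV instructions do not depend upon one another.
    base = esp + locals_size + SAVE_SIZE  # same address as entry_esp
    slots = {
        "ebx": base - 16,
        "esi": base - 12,
        "edi": base - 8,
        "ebp": base - 4,
    }
    # lea ebp,[esp+LOCALS_SIZE+SAVE_SIZE-4]: EBP points at the saved EBP
    ebp = base - 4
    return esp, ebp, slots

# Example values are assumptions chosen for illustration.
esp, ebp, slots = perilogue_layout(0x1000, 32)
for name, address in slots.items():
    print(f"{name} saved at {address:#x} (esp+{address - esp})")
```

Under these assumptions the saved frame pointer lands immediately below the return address, and the new EBP points at it, which is the property that stack walking relies upon.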

As mentioned, when setting up the parameter area of a new stack frame, in order to call a function, some optimizing compilers (when optimizing for time, not space) use the above approach, thus:

; … code that calculates parameters in local variables …
mov edx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_3]
mov ecx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_2]
mov eax,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_1]
lea esp, [esp-PARAMS_SIZE]
mov [esp+8], edx
mov [esp+4], ecx
mov [esp+0], eax
call function
lea esp, [esp+PARAMS_SIZE]

However, other compilers generate code that simply PUSHes and POPs the stack within the body of the function to set up parameter areas. To add insult to injury, the code also uses an ALU instruction, ADD, on the ESP register immediately after the called function has issued an AGU instruction that is "ESP foldable", RET, causing an execution stall.

; … code that calculates parameters in local variables …
mov edx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_3]
mov ecx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_2]
mov eax,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_1]
push edx
push ecx
push eax
call function
add esp, PARAMS_SIZE

Both approaches, PUSH and MOV, modify ESP on the fly, allocating and deallocating call stack space for the new parameter area of each individual called function as it is called. In general, therefore, in x86 programming the stack pointer register is not always at a fixed position at the base of the function's standard stack frame (as it is by convention on other instruction set architectures). This in turn means that accessing local variables and function parameters is generally done via the frame pointer, not the stack pointer, using positive constant offsets for function parameters and negative constant offsets for local variables.

The standard MIPS function perilogue

The standard function perilogue on the MIPS architecture is fairly similar to the perilogue on the x86 architecture:

subu sp,frame_size
sw ra,frame_size-8(sp)
…
lw ra,frame_size-8(sp)
addu sp,frame_size
jr ra

There are several notable differences:

  • The MIPS standard perilogue allows greater instruction parallelism by only writing to the stack pointer register once, allowing all register saves/loads to be overlapped since they don't depend upon each other's results as a sequence of x86 PUSH and POP instructions does.

    Saving a non-volatile register, that the function wants to use for itself, in the save area of the stack frame is merely a matter of an additional sp-relative save/load pair:

    sw s0,frame_size-16(sp)
    …
    lw s0,frame_size-16(sp)
  • The MIPS standard perilogue involves saving the return address from a register onto the call stack and then retrieving it from the call stack before returning. The x86 architecture's CALL and RET instructions do this implicitly. In the MIPS architecture, the return address is in the ra ("return address") register, and how (and indeed whether) it is saved into a stack frame is up to each individual function. ra is less of a special case register and more like a simple non-volatile register, that one handles like any other non-volatile register: saving it in the save area if the function needs to re-use it itself (as a non-leaf function will, but a leaf function will not).

  • By convention, the stack frame on MIPS includes enough space to construct the largest parameters area of any called function, and constructing a parameters area again is a sequence of parallelizable stores that are not interdependent. For example:

    sw s0,16(sp) # set parameter #5
    sw s1,20(sp) # set parameter #6
    jal called_function

There are also similarities. There are stack pointer (sp) and frame pointer (fp) registers. The frame pointer points to the middle of a stack frame, and the stack pointer points to the bottom.

Stack walking

On x86 architectures, because the frame pointer register (EBP) points to the saved frame pointer of the immediate caller, it is possible to walk the stack, as long as every function employs a standard stack frame. Such a process starts with the current frame pointer register value and follows the saved stack frame pointers until the top of the stack area (or an invalid saved frame pointer value) is reached.
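
As a rough illustration of that process, the following Python sketch walks a chain of saved frame pointers held in a dictionary that stands in for stack memory. The function name walk_stack, the dictionary representation, and all of the addresses are assumptions made for the example; a real stack walker reads process memory and validates pointers far more carefully.

```python
def walk_stack(memory, ebp, stack_top):
    """Follow saved frame pointers; return a list of (frame_pointer, return_address).

    `memory` maps addresses to 32-bit values (a stand-in for real stack memory).
    In a standard frame, [EBP] holds the caller's saved EBP and [EBP+4] holds
    the return address.
    """
    frames = []
    while ebp in memory and ebp < stack_top:
        saved_ebp = memory[ebp]
        return_address = memory.get(ebp + 4)
        frames.append((ebp, return_address))
        # Callers' frames live at higher addresses; anything else is treated
        # as an invalid saved frame pointer and ends the walk.
        if saved_ebp <= ebp:
            break
        ebp = saved_ebp
    return frames

# Three nested frames; all addresses are made up for illustration.
memory = {
    0x0FD0: 0x0FE8, 0x0FD4: 0x401234,  # innermost function
    0x0FE8: 0x0FF8, 0x0FEC: 0x401100,  # its caller
    0x0FF8: 0x0000, 0x0FFC: 0x401000,  # outermost frame, saved EBP of 0
}
for frame_pointer, return_address in walk_stack(memory, 0x0FD0, 0x1000):
    print(f"frame at {frame_pointer:#06x}, returns to {return_address:#x}")
```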

Many compilers targeting x86 platforms provide options for disabling the generation of a standard perilogue for functions that can do without one, or disabling just the part that involves saving and restoring the caller's frame pointer and setting up and tearing down a new one for the called function. However, walking the stack is employed by debuggers, exception handlers, and postmortem analysis utilities. So although many compilers provide these options, one should be aware of the effect that they will have upon debugging, exception handling, and postmortem analyses of a program's execution.

One trick with frame pointers used on 16:32 and 16:16 x86 code is to signal the calling distance of a function, by setting a flag in its saved frame pointer for the calling function. This allows anything walking the stack to know whether the return address in the stack frame is a near (0:16 or 0:32) or a far (16:16 or 16:32) address (and hence where the parameter area begins relative to the frame pointer area).

This marking is done by taking advantage of the fact that in normal use a stack frame will never be aligned to an odd address. (Compilers and operating systems conspire to enforce this. Compilers ensure that they only ever change the value of ESP by an even number of bytes, and operating systems ensure that stack tops are always aligned to a multiple of 2 bytes.) A far function is marked by the simple expedient of incrementing and decrementing the saved frame pointer by 1 byte. Bit #0 of the saved frame pointer thus becomes a "far function" flag. Such a perilogue generally looks like this:

inc ebp
enter N,M
…
leave
dec ebp
ret

Stack walking code has, of course, to be aware of whether this convention is used by the libraries on the target platform that the program is running on, and whether the program itself was compiled to employ it.
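
A stack walker that knows the convention is in use can decode the flag along the following lines. This Python sketch is an illustration only: the function name decode_saved_frame_pointer is hypothetical, and it assumes that a far return address occupies twice the space of a near one (offset plus selector slot) and that the saved frame pointer occupies one offset-sized slot.

```python
def decode_saved_frame_pointer(saved_fp, offset_size):
    """Split a saved frame pointer into (caller_fp, is_far, params_offset).

    offset_size is 2 for 16:16 code and 4 for 16:32 code. params_offset is
    where the parameter area begins, relative to the current frame pointer.
    """
    is_far = bool(saved_fp & 1)        # bit #0 is the "far function" flag
    caller_fp = saved_fp & ~1          # the real, even, frame pointer value
    # Assumption: a far return address takes an offset slot plus a selector slot.
    return_address_size = offset_size * 2 if is_far else offset_size
    params_offset = offset_size + return_address_size  # skip saved FP + return address
    return caller_fp, is_far, params_offset

print(decode_saved_frame_pointer(0x0FE9, 4))  # flag set: far function
print(decode_saved_frame_pointer(0x0FE8, 4))  # flag clear: near function
```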

Stack probes

Both 32-bit OS/2 and Win32 provide application software with the capability of having what are known as decommitted stacks. A decommitted stack is a thread's stack area that starts off allocated but not committed. In other words: The range of virtual address space is allocated to the stack, but there are no virtual memory pages committed into that range of addresses. Decommitted stacks allow applications to specify large stack sizes without incurring (unless actually required as the program executes) the overhead of all of the virtual memory pages that would be necessary were the stack area fully committed. This is done via a mechanism that involves guard pages.

The details of operation of the guard page mechanism can be found in the 32-bit OS/2 Developers' Toolkit documentation and the MSDN documentation for Win32. Simply put: All pages in a thread's stack down to some point are committed; the next page is a committed page marked as a guard page; and all pages below that are not committed. Accessing a guard page causes a page fault and an application-mode exception, in response to which the operating system automatically turns the guard page into an ordinary committed page (by removing its guard page attribute), commits the next lower page in the stack (if it can), and (if that page was committed) marks that next lower page as a guard page in its turn.

Thus any application that guarantees that it will access its stack space more-or-less as an actual stack, pushing things onto the top serially, will have its initially entirely decommitted stack automatically committed for it by the operating system as the stack grows downwards. (When the operating system hits the bottom of the stack, usually an exception is raised by the operating system. But that's another discussion, with complexities and subtleties of its own, tangential to this one.)
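
The guard page behaviour just described can be modelled in a few lines. The following Python sketch is a toy model, not a description of how either operating system implements it: the class name DecommittedStack, the 4KiB page size, the return strings, and the addresses are all assumptions for illustration.

```python
PAGE = 0x1000  # assumed 4KiB page size

class DecommittedStack:
    """Toy model: committed pages at the top, one guard page below them."""

    def __init__(self, top, size, committed_pages=1):
        self.top = top
        self.bottom = top - size
        # Pages are identified by their base addresses.
        self.committed = {top - PAGE * (i + 1) for i in range(committed_pages)}
        self.guard = top - PAGE * (committed_pages + 1)

    def touch(self, address):
        page = address & ~(PAGE - 1)
        if page in self.committed:
            return "ok"
        if page == self.guard:
            # The guard page becomes an ordinary committed page, and the
            # next lower page (if there is one) becomes the new guard page.
            self.committed.add(page)
            next_page = page - PAGE
            self.guard = next_page if next_page >= self.bottom else None
            return "guard page fault handled"
        # Skipping past the guard page lands on an uncommitted page.
        return "access violation"

stack = DecommittedStack(top=0x100000, size=0x10000)
print(stack.touch(0xFFFFC))  # within the committed page
print(stack.touch(0xFEFFC))  # hits the guard page; the stack grows by one page
print(stack.touch(0xFBFFC))  # skips over the new guard page
```

The last access in the example is what a large, unprobed adjustment of the stack pointer amounts to: it lands beyond the guard page, so the automatic commitment never happens.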

The x86 standard function perilogue does access its stack space more-or-less like a stack. It pushes activation records onto the call stack, and pops them back off again. The problem ensues when an activation record is larger than a memory page.

More particularly, the problem ensues when the perilogue is using the SUB or LEA instructions to decrement the stack pointer. A PUSH or ENTER instruction decrements the stack pointer, but also touches the memory at the stack pointer address. The write access to the memory location will trigger the guard page mechanism. A perilogue that used PUSH or ENTER to decrement the stack pointer exclusively would never suffer a problem. But perilogues don't always use PUSH or ENTER. Indeed, as already discussed, for the greatest speed optimization a perilogue doesn't use PUSH, POP, ENTER, or LEAVE anywhere, but rather uses LEA to modify the stack pointer register and ESP-relative MOV instructions to save and restore registers and frame pointers. It doesn't necessarily incorporate those MOV instructions in strictly descending order of stack location, either.

This is where stack probes come in. Stack probes are a simple idea. The compiler generates extra code in the function perilogue if there's a danger that it might skip over more than one page of the stack without actually touching it when pushing the function's activation record.

Usually, compilers simply note when the size of the locals area is larger than a page, and spit out extra dummy memory references, immediately after the SUB instruction that decrements ESP, that touch the intervening stack pages in the right, descending, order. There are several complications to this scheme:

  • ESP cannot be decremented more than 1 page past the lowest stack page access at any point. This is because an asynchronous exception, whose handler would of course immediately start pushing things onto the stack at the current ESP address, may arrive at any point during the stack probe sequence itself. So compilers take one of two approaches:

    • They do all stack probe operations before decrementing ESP, by writing into the red zone. It's safe to touch the red zone, just not safe to rely upon it retaining the values that one may have written to it. Stack probes don't care about values of the memory locations touched. Indeed, some stack probe mechanisms write completely arbitrary data themselves (such as zero, or 0xdeadbeef, or whatever happens to be in the EAX register at the time).

    • They perform progressive stack probes, decrementing ESP one page at a time. For example, a 12KiB stack probe, unrolled, would look like:

      lea esp,[esp - 1000h]
      mov [esp],eax
      lea esp,[esp - 1000h]
      mov [esp],eax
      lea esp,[esp - 1000h]
      mov [esp],eax
  • The range of stack locations that the stack probe has to cover is not simply the size of the locals area. If the function perilogue doesn't use PUSH to save non-volatile registers into the save area, the size of the save area must be added to the probe total as well. Similarly with saved stack frame pointers, if they are stored with MOV rather than PUSH. Basically, however much space is to be skipped over with SUB or LEA is however much space needs to be probed.

  • An often forgotten part of the function activation record, when it comes to stack probes, is the parameters area. Unless a compiler always uses PUSH to place parameter values onto the call stack, it must also perform stack probes when it calls functions that take more than one page's worth of parameters. Here is a simple C++ program that illustrates a case where stack probes of the calling parameters area are also required:

    struct s { char b[16384] ; } ;
    
    int f ( s p ) { return p.b[0] ; }
    
    s d ;
    
    int main () { return f(d) ; }

    Unfortunately, many compilers forget that stack probes are also required for setting up a calling parameters area properly, and don't employ any stack probe generation logic when generating calls to functions. Borland C/C++, MetaWare High C/C++, and EMX C/C++ all overlook this necessity. Watcom C/C++ does not.

It is worth noting that the latter two points are strong arguments in favour of the MIPS standard function perilogue design over the (non-optimal) x86 standard function perilogue design. In the MIPS approach, setting up enough room to hold the largest parameters area of any called function is a part of the standard function perilogue itself, rather than being deferred to on-the-fly modifications to the stack pointer within the function body. As such, a single stack probe operation can be done in the perilogue that encompasses the sizes of the locals area, the saved frame pointer(s) area, the save area, and the calling parameters areas, all in one gulp.
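
The progressive probing approach described in the list above can be sketched as a small model that computes which addresses get touched. In this Python sketch the function name progressive_probe, the 4KiB page size, and the example sizes are assumptions; the point is only that the total to be probed is the whole amount skipped over, and that no step exceeds one page.

```python
PAGE = 0x1000  # assumed 4KiB page size

def progressive_probe(entry_esp, skip_size):
    """Lower ESP at most one page at a time, touching memory after each step.

    Returns the final ESP and the list of addresses written, in order.
    """
    esp = entry_esp
    touched = []
    remaining = skip_size
    while remaining > 0:
        step = min(PAGE, remaining)
        esp -= step           # lea esp,[esp - step]
        touched.append(esp)   # mov [esp],eax
        remaining -= step
    return esp, touched

# The probe total is everything skipped over with SUB or LEA, not just the
# locals area. These sizes are made up for illustration.
locals_size, save_size, saved_frame_pointer_size = 0x2800, 12, 4
total = locals_size + save_size + saved_frame_pointer_size
final_esp, touched = progressive_probe(0x100000, total)
print([hex(address) for address in touched])
```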

Callback functions in Win16

In Win16, the standard function calling convention for callback functions requires that a function perform additional set up and teardown, within the standard perilogue. This nested perilogue ensures that the DS register within the function has whatever value the AX register had on entry to the function, and takes the form:

push ds
mov ds,ax
…
pop ds

In addition, the prologue must be prefixed with one of two 3-byte prefixes:

mov ax,ds
nop

or:

push ds
pop ax
nop

Seemingly, this is a very long-winded way of doing nothing, setting the DS register to what it already was and splatting the AX register along the way. If the function is not exported and used as a callback, that is exactly what it is.

The point of the extra perilogue that modifies the DS register during function execution is to allow an instance thunk to be set up with the MakeProcInstance() call in order to use the function as a callback. This instance thunk loads a target DS selector value into AX and calls the function. The Windows loader collaborates in this. For every function exported from an EXE or a DLL, it scans its first 3 bytes, and overwrites either of the aforegiven sequences with three nop instructions. (This of course makes it impossible to call them except through an instance thunk.)
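
The loader's patching step can be illustrated with a byte-level sketch. The following Python is an assumption-laden model rather than the actual loader logic: the function name patch_exported_prologue is hypothetical, and the byte values are the usual 16-bit encodings of the two prefixes (8C D8 90 for mov ax,ds / nop, and 1E 58 90 for push ds / pop ax / nop).

```python
NOP = 0x90

# The two 3-byte prefixes, in their usual 16-bit encodings (an assumption
# about how the assembler encoded them).
THUNK_PREFIXES = (
    bytes([0x8C, 0xD8, NOP]),  # mov ax,ds ; nop
    bytes([0x1E, 0x58, NOP]),  # push ds ; pop ax ; nop
)

def patch_exported_prologue(code):
    """Overwrite a recognised 3-byte prefix with three NOPs; else leave as-is."""
    code = bytes(code)
    if code[:3] in THUNK_PREFIXES:
        return bytes([NOP, NOP, NOP]) + code[3:]
    return code

# Prefix followed by the start of the nested perilogue: push ds ; mov ds,ax
function_start = bytes([0x8C, 0xD8, 0x90, 0x1E, 0x8E, 0xD8])
print(patch_exported_prologue(function_start).hex(" "))
```

After patching, AX is no longer reloaded from DS on entry, so the nested mov ds,ax picks up whatever selector the instance thunk placed in AX.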

The total function perilogue for a Win16 callback function, with the standard perilogue (in "optimize for time" form), the far function call marker, and the Win16 mechanism for making a function instance thunkable, was quite hefty. In 1989, Michael Geary discovered a trick that did away with a lot of this, by observing that the whole instance thunk mechanism wasn't necessary. In EXEs, the SS register already held the proper data segment selector; and in DLLs, one could simply perform a load of a constant which the program image loader would fix up to point to DGROUP for the DLL.

Comparison of Win16 function perilogues

Instance thunkable function:

mov ax,ds
nop
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

"Smart callback" in an EXE:

mov ax,ss
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

"Smart callback" in a DLL:

mov ax,DGROUP
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

Non-callback far function:

inc ebp
enter N,M
…
leave
dec ebp
ret

© Copyright 2010 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.
