ARM Stack Unwinding

August 2007

Introduction

Languages like C++ and Java have very useful facilities that allow a stack trace to be collected and displayed in a variety of ways. In Java, a snapshot of the current stack trace can be taken simply by constructing a Throwable object, and the trace can be displayed using the printStackTrace() method.

    Throwable t = new Throwable();

    t.printStackTrace(System.out);

Example 1: Displaying the current call stack in Java.

C++ offers similar facilities, and because of this, both these languages can provide useful information when an unexpected failure occurs and is detected. For example, assertions may be placed in the code with the failure action being to display the current stack trace to help the programmer debug the cause.

Unfortunately C offers no such inbuilt luxury, and as such, debugging without a debugger or other logging mechanism may be a little more difficult in the first instance. This interests me as I've worked as an embedded engineer creating software for consumer devices which are extensively field tested and contain code that makes extensive use of assertion macros. I decided to investigate stack unwinding in order to enable better information to be gathered from devices that have failed in the field, with the goal of supporting stack tracing without the need for expensive or cumbersome supporting hardware.

Therefore I decided to make an ARM stack unwinder that would be suitable to run on an embedded target to provide stack trace capabilities for C, similar to those already enjoyed by other languages.

Design Consideration

Since the target is an embedded processor in a consumer device, there are some restrictions on how the solution can be engineered and what options are available. At the same time, knowing the target processor is likely to be an ARM7 or similar processor using the ARM and Thumb Instruction Set Architectures allows the solution to be targeted specifically to this family of RISC processors.

The restrictions imposed by the embedded target are as follows:

  • There is little free storage (RAM or ROM).
  • Large amounts of code will run from FLASH and cannot be modified.
  • The form factor device cannot be directly connected to a debugger.

Since storage is at a premium on these devices, I decided to forgo use of debug tables as these would need embedding on the target and would be large, even if compressed. Since code runs from FLASH, I also have to be careful that the solution does not try to patch any code - one stack unwinding approach described in the ARM APCS document is to 'patch' function entries and exits and to then execute code to cause the stack adjustments to be made and stack frames unwound. (Patching code in FLASH is not easy since FLASH can generally only be erased in blocks and then written once, i.e. it is not truly random access. Additionally erasing any block of FLASH could potentially leave the device inoperable until it is reprogrammed, so that is a risk and complexity best avoided if at all possible.)

Finally the form factor devices usually don't bring out connectors for debugging such as JTAG. This may sound weird, but all the pins on the package inside the device have a use and JTAG is usually multiplexed with something that is generally more useful, and exposing debug interfaces is also considered a security issue and frowned upon in the industry. Additionally having fewer external connectors makes Electrostatic Discharge (ESD) protection simpler as well as having other small benefits. And in any case, Java requires no debugger to grab a stack trace, so why should C?

Method

Given the above restrictions, I decided that the best approach is to write a small model of the ARM processor which can interpret the code and look for the tell-tale signs of functions returning to divine the call stack. Yup, I decided to write a model ARM processor to run on the ARM to interpret the code with the aim of unwinding the stack, as opposed to producing a bit-exact interpretation of the code. Implementing the model ARM presents a couple of challenges that are laid out in the following subsections.

Function Epilogues

All functions look pretty much the same. They have a prologue, a main body and an epilogue. The compiler generates the prologue and epilogue on almost all functions (IRQ handling functions can be an explicit exception to this) to set up and tear down the stack frame used during the function body for things such as local variables. Debug tables can give details about the location of function prologues and epilogues to assist debuggers in interpreting the stack at any time, but I don't have these, so have to locate them automatically.

Since this is targeted at ARM processors, only ARM functions need to be considered. Looking at a few epilogues, it can be seen that they generally look the same and take the format shown in the following examples:

ADD      sp,#0x28
POP      {r4-r6,pc}

Example 2: Function epilog in Thumb.

ADD      sp,sp,#0x28
LDMFD    sp!,{r4-r6,pc}

Example 3: Function epilog in ARM.

The examples show a stack adjustment to remove storage allocated for locals, then a restoration of the corrupted registers as required by the ARM ABI, and the restoration of the program counter (PC) from the stack, effectively executing the return. The return is very similar and can be detected easily regardless of processor operating mode (Thumb or ARM), although this raises an interesting question. The examples show a return that is only suitable if the function is always called from code in the same operating mode as the function itself, i.e. the Thumb return code can only be used if returning to Thumb code, and the same is also true for ARM code. Practically it is sometimes the case that ARM code may call Thumb functions and vice versa, and this is known as interworking. Fortunately the ARM ABI also describes interworking and more examples can be generated to show function epilogues used when interworking is required.

ADD      sp,#0x28
POP      {r4-r6}
POP      {r3}
BX       r3

Example 4: Function epilog in Thumb with interworking.

ADD      sp,sp,#0x28
LDMFD    sp!,{r4-r6,lr}
BX       lr

Example 5: Function epilog in ARM with interworking.

With interworking, the BX instruction is used to enable the processor mode to be changed at the same time as returning from the function. In each case, the return address is restored to a register before being used with the BX instruction to cause the return, where the least significant bit of the return address is used to indicate the desired processor mode once the branch has been taken. So we now have a good idea of what we need to detect in order to determine where a function exits - the reading of a return address from the stack, it being loaded into the PC either directly by the load, or via a BX instruction.

The first thing in the model ARM is therefore to store not only register contents, but a bit of data about where the contents originated from. To do this, I've made a couple of types for my model ARM to use.

typedef enum
{
    REG_VAL_INVALID      = 0x00,
    REG_VAL_FROM_STACK   = 0x01,
    REG_VAL_FROM_MEMORY  = 0x02,
    REG_VAL_FROM_CONST   = 0x04,
    REG_VAL_ARITHMETIC   = 0x80
}
RegValOrigin;

typedef struct
{
    Int32              v;
    RegValOrigin       o;
}
RegData;

Code 1: Representation of a register in the model ARM.

Creating an array of RegData structures then allows the register file to be emulated. The Program Counter (PC) and Stack Pointer (SP) can be added to the register file to give the model a basis to interpret code.

At this point, two basic loops are added to the model - one to interpret Thumb code, and one to interpret ARM code - and decoding for the POP and LDMFD instructions is added to each respectively. When a value is loaded to a register, the RegValOrigin can now be updated to indicate that the data originated on the stack, REG_VAL_FROM_STACK, making it easy to spot a function returning when a BX is encountered for such a tracked register. Interpretation of BX is therefore added to both the ARM and Thumb modes, faithfully checking the LSB of the branch address to change the interpretation mode between ARM and Thumb as needed.
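As a rough illustration of this origin tracking, the sketch below decodes a Thumb POP and tags every loaded register as REG_VAL_FROM_STACK, then treats a BX to such a register as a return. It is not taken from the unwinder source: the ModelState layout, the fake stack image behind readWord(), and the names thumbPop() and bxIsReturn() are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    REG_VAL_INVALID    = 0x00,
    REG_VAL_FROM_STACK = 0x01
} RegValOrigin;

typedef struct { uint32_t v; RegValOrigin o; } RegData;

enum { REG_SP = 13, REG_PC = 15 };

typedef struct {
    RegData regData[16];   /* r0-r12, SP, LR, PC */
    bool    thumb;         /* true while interpreting Thumb code */
} ModelState;

/* Hypothetical stack image for the example, lowest address first. */
#define STACK_BASE 0x1000u
static const uint32_t stackImage[4] = { 0x44u, 0x55u, 0x66u, 0x8001u };

static bool readWord(uint32_t addr, uint32_t *val)
{
    if ((addr & 3u) != 0u || addr < STACK_BASE ||
        addr - STACK_BASE >= sizeof(stackImage))
        return false;
    *val = stackImage[(addr - STACK_BASE) / 4u];
    return true;
}

/* Load one register from the model stack and record where it came from. */
static bool popReg(ModelState *s, int reg)
{
    uint32_t val;
    if (!readWord(s->regData[REG_SP].v, &val))
        return false;
    s->regData[reg].v = val;
    s->regData[reg].o = REG_VAL_FROM_STACK;
    s->regData[REG_SP].v += 4u;
    return true;
}

/* Thumb POP {rlist[,pc]} is encoded as 1011 110R rrrrrrrr. */
bool thumbPop(ModelState *s, uint16_t instr)
{
    assert(s != NULL);
    if ((instr & 0xFE00u) != 0xBC00u)
        return false;                      /* not a POP */
    for (int r = 0; r < 8; r++)
        if ((instr & (1u << r)) && !popReg(s, r))
            return false;
    if ((instr & 0x0100u) && !popReg(s, REG_PC))
        return false;
    return true;
}

/* BX to a register whose value came from the stack is treated as a return.
 * Bit 0 of the target selects Thumb (1) or ARM (0) interpretation. */
bool bxIsReturn(ModelState *s, int reg)
{
    assert(s != NULL && reg >= 0 && reg < 16);
    if (s->regData[reg].o != REG_VAL_FROM_STACK)
        return false;
    s->thumb = (s->regData[reg].v & 1u) != 0u;
    s->regData[REG_PC].v = s->regData[reg].v & ~1u;
    s->regData[REG_PC].o = REG_VAL_FROM_STACK;
    return true;
}
```

Feeding this the interworking epilogue of Example 4 (POP {r4-r6} is 0xBC70, POP {r3} is 0xBC08) leaves r3 tagged as stack-sourced, so the following BX r3 is recognised as a return.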

At this point, the model ARM is capable of detecting the return from a function. This is a good start, but it needs to handle the stack adjustment if it is to be able to unwind more than one stack frame. In the examples seen so far, the stack adjust has just been the addition of 0x28 to the stack pointer, although this will not always be the case. Depending on the amount of stack data utilised by a function, the stack adjust may be for a different value, and unfortunately not all adjustments can be accommodated by a single instruction. The following shows an odd C function and the generated assembly.

int testStackResize(void)
{
    char biggie[0x81111];
    char *c = biggie;
    int  t;

    sprintf(biggie, "Hello");

    t = 0;

    while(*c)
    {
        t += *c;
        c++;
    }

    runFunc();
    return t;
}

Example 6a: Function with odd stack usage.

testStackResize PROC
        LDR      r3,|L1.364| + 56
        PUSH     {r4,r5,lr}
        ADD      sp,r3
        MOV      r0,sp
        MOV      r4,sp
        ADR      r1,|L1.364| + 60
        BL       __0sprintf
        MOV      r5,#0
        B        |L1.202|
|L1.198|
        ADD      r5,r0,r5
        ADD      r4,#1
|L1.202|
        LDRB     r0,[r4,#0]
        CMP      r0,#0
        BNE      |L1.198|
        LDR      r0,|L1.364| + 68
        LDR      r0,[r0,#0]  ; runFunc
        BL       __call_via_r0
        LDR      r3,|L1.364| + 56
        MOV      r0,r5
        NEG      r3,r3
        ADD      sp,r3
        POP      {r4,r5}
        POP      {r3}
        BX       r3

Example 6b: Assembly listing of function with odd stack usage.

This fictional function is far from pretty, and the epilogue is somewhat more complicated. The important instructions are the LDR into r3 from a constant memory address, then the NEG operation before the stack adjust. Interlaced with this is an instruction to move the function's return value into r0 in accordance with the ABI. This requires the model ARM not only to interpret a number of new instructions, but also provokes thought about how the compiler generates awkward constant values.

In ARM and Thumb mode there are a number of ways in which to generate a constant value, and the model ARM will need to be able to interpret them all. Worse still is the possibility that the desired constant, or some part of it, has already been created for use by the function body and that the compiler will use this already constructed value. This means that the model not only needs to be able to interpret any instructions that can be used to generate constant values, but that it needs to be able to look outside the function epilogue and into the function body too. Therefore the model ARM becomes yet more sophisticated and now attempts to interpret every instruction of the program, starting at the PC and SP values from where the stack trace is required, and stopping when some to-be-determined criterion is met.
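One of the constant forms the model has to understand is the immediate Operand 2 of ARM Data Processing instructions, which holds an 8-bit value rotated right by twice a 4-bit rotate field. The following is a minimal sketch of that decode; the function names are made up for the example and are not from the unwinder source.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits. */
static uint32_t ror32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v >> n) | (v << (32u - n)) : v;
}

/* Decode the immediate form of Operand 2 (bits 11..0 of the instruction):
 * bits 11..8 give a rotate count (applied doubled), bits 7..0 the value. */
uint32_t armImmediateOperand(uint32_t instr)
{
    uint32_t imm8 = instr & 0xFFu;
    unsigned rot  = (unsigned)((instr >> 8) & 0xFu);
    return ror32(imm8, 2u * rot);
}
```

For ADD sp,sp,#0x28 (encoded as 0xE28DD028) this yields 0x28. A value such as 0x81111 has set bits spread across more than eight positions, so it cannot be expressed this way and has to be built from several instructions or loaded from memory.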

Clearly it is desirable for the model ARM to remain small, and so only the subset of instructions that are needed for stack unwinding should be implemented. The required instructions that have been identified so far are those that are involved in constant value generation, stack adjusting and returning. The question is what to do when an instruction that is not understood is encountered. The solution to this is simple - invalidate all state and continue the interpretation! As seen earlier, the registers also have a status attached to their value, one status being REG_VAL_INVALID. Upon an uninterpreted instruction being found, all register values, with the exception of the PC and SP, are therefore invalidated.
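The invalidation step can be sketched in a few lines. This assumes a 16-entry register file with SP at index 13 and PC at index 15; the ModelState type and the invalidateRegisters() name are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    REG_VAL_INVALID     = 0x00,
    REG_VAL_FROM_STACK  = 0x01,
    REG_VAL_FROM_MEMORY = 0x02,
    REG_VAL_FROM_CONST  = 0x04,
    REG_VAL_ARITHMETIC  = 0x80
} RegValOrigin;

typedef struct { uint32_t v; RegValOrigin o; } RegData;

/* Hypothetical container for the model's register file. */
typedef struct { RegData regData[16]; } ModelState;

enum { REG_SP = 13, REG_PC = 15 };

/* Called when an instruction is not understood: every register except
 * SP and PC loses its validity, and interpretation simply carries on. */
void invalidateRegisters(ModelState *state)
{
    assert(state != NULL);
    for (int r = 0; r < 16; r++) {
        if (r != REG_SP && r != REG_PC)
            state->regData[r].o = REG_VAL_INVALID;
    }
}
```

Only the origin field is cleared; the stale value is left in place because nothing will trust it until the register is written again.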

Now that register values can be invalid, the rules of arithmetic also have to change to propagate this metadata. For example, a simple addition of two registers to yield a value in a third should produce a result with status REG_VAL_INVALID if either of the inputs is invalid. Additionally a register MOV should copy not only the register value, but the status too. The following code fragment from the ARM interpreting loop shows how the propagation of the register status data is handled for Data Processing instructions.

/* Propagate register validity */
switch(arithOp)
{
    case  0: /* AND: Rd := Op1 AND Op2 */
    case  1: /* EOR: Rd := Op1 EOR Op2 */
    case  2: /* SUB: Rd:= Op1 - Op2 */
    case  3: /* RSB: Rd:= Op2 - Op1 */
    case  4: /* ADD: Rd:= Op1 + Op2 */
    case 12: /* ORR: Rd:= Op1 OR Op2 */
    case 14: /* BIC: Rd:= Op1 AND NOT Op2 */
        if(!M_IsOriginValid(state->regData[rn].o) ||
           !M_IsOriginValid(op2origin))
        {
            state->regData[rd].o = REG_VAL_INVALID;
        }
        else
        {
            state->regData[rd].o = state->regData[rn].o;
            state->regData[rd].o |= op2origin;
        }
        break;
    case  5: /* ADC: Rd:= Op1 + Op2 + C */
    case  6: /* SBC: Rd:= Op1 - Op2 + C - 1 */
    case  7: /* RSC: Rd:= Op2 - Op1 + C - 1 */
        /* CPSR is not tracked */
        state->regData[rd].o = REG_VAL_INVALID;
        break;

    case  8: /* TST: set condition codes on Op1 AND Op2 */
    case  9: /* TEQ: set condition codes on Op1 EOR Op2 */
    case 10: /* CMP: set condition codes on Op1 - Op2 */
    case 11: /* CMN: set condition codes on Op1 + Op2 */
        break;


    case 13: /* MOV: Rd:= Op2 */
    case 15: /* MVN: Rd:= NOT Op2 */
        state->regData[rd].o = op2origin;
        break;
}

Code 2: Propagation of register state in interpretation of ARM Data Processing instruction.

Now that the model is attempting to interpret all instructions, the handling of conditional code needs to be considered. Specifically the model must meet the following requirements:

  • It must find the function epilogue for any function.
  • It should not get stuck in loops.
  • Infinite loops should be detectable.
  • There should be no significant overhead on the interpretation.

ARM instructions employ conditional guarding, meaning that condition codes can be attached to most instructions such that they are only executed if the condition is met. Thumb mode uses conditional branch instructions, BNE, BEQ etc., to achieve a similar goal, and in both cases the Status Register (SR) holds the condition flags which determine if a branch is taken or an ARM instruction executed. Tracking the SR would add overhead and make the model more complex, as would any sort of branch analysis to find infinite loops and function exits. Therefore I make the following assumptions to simplify the model ARM.

  • All conditional code can be ignored.
  • Conditional branches never need to be taken.
  • Unconditional branches must always be taken.
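For Thumb code, these assumptions could be sketched as below: the unconditional B (encoded as 11100 followed by a signed 11-bit halfword offset) is followed, while conditional branches and everything else fall through to the next halfword. The function name thumbNextPc() is hypothetical, and 32-bit BL pairs are deliberately left out of the sketch.

```c
#include <stdint.h>

/* Return the address of the next Thumb instruction to interpret.
 * pc is the address of the instruction being decoded. */
uint32_t thumbNextPc(uint32_t pc, uint16_t instr)
{
    if ((instr & 0xF800u) == 0xE000u) {
        /* Unconditional B: always taken. The offset is relative to
         * the instruction address plus 4 and counts halfwords. */
        int32_t off = (int32_t)(instr & 0x07FFu);
        if (off & 0x400)
            off -= 0x800;               /* sign extend 11 bits */
        return pc + 4u + (uint32_t)(off * 2);
    }

    /* Conditional branches (1101 cccc oooooooo) are never taken, so
     * they are treated like any other 16-bit instruction. */
    return pc + 2u;
}
```

As a check against hand-assembled values: a B at offset 20 that targets offset 2 encodes as 0xE7F5 and is followed back to 2, whereas a BEQ at offset 8 (0xD00A) simply advances to offset 10.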

It seems highly unlikely that the function epilogue will contain conditional code and the 'stack moves once' rule of the ABI means that there is no risk of needing to conditionally correct the stack depending on the path taken through the function. Unconditional branches must always be taken since without them it is possible that interpretation could wander into another function or data area. Ignoring conditional branches also greatly simplifies the interpreter, but introduces a risk that some loops may appear infinite, as the following example shows.

int loop()
{
    while(1)
    {
        int v = getch();

        if(v == EOF)      { break; }
        else if(v == 10)  { printf("\n"); }
        else              { printf("%c", v); }
    }

    return funcB();
}

Example 7a: Example loop where the exit condition is tested within the loop.

loop PROC
        PUSH     {r4,lr}
|L1.2|
        BL       getch
        CMP      r0,#0
        BEQ      |L1.32|
        CMP      r0,#0xa
        BNE      |L1.22|
        ADR      r0,|L1.36|
        BL       __0printf
        B        |L1.2|
|L1.22|
        MOV      r1,r0
        ADR      r0,|L1.36| + 4
        BL       __0printf
        B        |L1.2|
|L1.32|
        POP      {r4,pc}
        DCW      0000
|L1.36| DATA
        DCB      "\n\0\0\0"
        DCB      "%c\0\0"
        ENDP

Example 7b: Thumb assembly showing compiler output.

In the above example, the BEQ needs to be taken in order to reach the function epilogue. Without understanding of the Status Register, the model ARM cannot do this, so instead gets stuck in the loop - this is the first caveat of the scheme. Accepting for the moment that this type of construct may occur (although I personally would try to avoid writing such C code as it misses the purpose of the while statement!), a scheme for detecting an infinite loop is required. I opt simply to count the number of instructions since a function return was discovered, and to stop the interpretation if some predefined limit is exceeded. This is very simple, and has little overhead, although it may take longer to determine that unwinding is stuck than a more analytical approach would allow.
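The counting scheme might look something like the following sketch. The limit of 100 instructions and the LoopGuard names are placeholders chosen for the example, not values from the unwinder.

```c
#include <stdbool.h>
#include <stdint.h>

#define UNW_MAX_INSTR_COUNT 100u   /* hypothetical limit */

typedef struct {
    uint32_t instrSinceReturn;     /* instructions since the last return */
} LoopGuard;

/* Reset the count whenever a function return has been found. */
void loopGuardReturnFound(LoopGuard *g)
{
    g->instrSinceReturn = 0u;
}

/* Call once per interpreted instruction. Returns false once the limit
 * is exceeded, at which point unwinding should be abandoned. */
bool loopGuardStep(LoopGuard *g)
{
    g->instrSinceReturn++;
    return g->instrSinceReturn <= UNW_MAX_INSTR_COUNT;
}
```

The cost is one increment and one compare per instruction, which is why the approach suits a small interpreter even though it cannot tell a long function from a genuinely stuck one.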

At this point, the model ARM should be capable of unwinding most stacks, has the ability to interpret all the code that could appear in a function epilogue and can interpret and detect returning to a register value sourced from the stack. The interpreter has a simple method to blunder into most function epilogues and also has a primitive method of detecting when it is stuck in an infinite loop. A little polish can be applied to trap cases such as the branching to a register whose value is invalid and the scheme is basically working. However, there are a couple of surprises yet...

Function Prologues

So far the model ARM has been built to concentrate on unwinding the stack frames by interpretation of code leading up to and including the function epilogues - the prologue has not needed consideration. However, there is an optimisation that can be applied by a compiler that causes prologues to become significant. The optimisation is 'tail calling', which is not specific to ARM architectures.

Tail calling is when a function always calls another function as the last thing that it does before returning. The compiler can spot this pattern and instead of generating return code from the first function, it can call the second function in such a way that its return code will return to the original caller.

void tailCall(int v)
{
    v *= v;
    printf("%d", v);
    tailFunc(v);
}

Example 8a: Function that makes a tail call.

tailCall PROC
        STMFD    sp!,{r4,lr}
        MUL      r4,r0,r0
        MOV      r1,r4
        ADR      r0,|L1.524|
        BL       __0printf
        MOV      r0,r4
        LDMFD    sp!,{r4,lr}
        B        tailFunc
        ENDP

Example 8b: ARM assembly listing of tail call function.

In this case, the Link Register (LR) is restored from the stack, but instead of the commonly seen BX, an unconditional branch is made to tailFunc(), such that tailFunc() will instead return to the value placed in the LR. It's a small optimisation that saves a word and a few cycles, but it complicates the interpretation performed by the model ARM.

To accommodate this, the model ARM must either be aware of tail calling and ignore it, or must additionally be able to interpret function prologues. Detecting the tail call would not be impossible, as the pattern of restoring the Link Register from the stack and then unconditionally branching is detectable, but there could be a small risk of misdetection if the LR were used in a function body for any purpose such as temporary storage or arithmetic.

Interpreting a function prologue is much the same as interpreting an epilogue and the same instructions will be used to generate the constant value for the stack adjust. However, whereas previously all values were being read from memory and the stack, some values are now stored to the stack to save state before the function executes. This could potentially damage the stack on an executing system, so a small hash table to store memory addresses and their values is implemented to store stack data instead; before reading from memory the hash table is inspected, and if a value is found it is used in place of the value from the device memory. Since function prologues are likely to start with a PUSH or STMFD instruction, there is also the possibility of an invalid register value being stored to memory, so the memory hash has to allow for storage of some state data to prevent an invalid register value becoming valid if it is PUSHed and then POPed from the stack. Finally, to prevent the memory hash needing to be large or risk overflowing, it is periodically purged of data stored at addresses that are above the current top of the stack.
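A small store of this kind could be sketched as follows. The table size, the structure layout and all function names are assumptions for illustration, and the purge rule reads 'above the top of the stack' as addresses numerically below SP on a full-descending stack, which is an interpretation rather than something the text spells out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MEM_HASH_SIZE 31u   /* hypothetical table size */

typedef struct {
    bool     used;    /* slot holds an entry */
    bool     valid;   /* stored register value was valid when written */
    uint32_t addr;
    uint32_t val;
} MemSlot;

typedef struct { MemSlot slot[MEM_HASH_SIZE]; } MemHash;

void memHashInit(MemHash *h) { memset(h, 0, sizeof *h); }

/* Find the slot for addr, or a free slot if it is absent and wantFree
 * is set. Probing starts at the hashed index and wraps around. */
static MemSlot *memHashFind(MemHash *h, uint32_t addr, bool wantFree)
{
    MemSlot *freeSlot = NULL;
    unsigned start = (unsigned)((addr >> 2) % MEM_HASH_SIZE);

    for (unsigned i = 0; i < MEM_HASH_SIZE; i++) {
        MemSlot *s = &h->slot[(start + i) % MEM_HASH_SIZE];
        if (s->used && s->addr == addr)
            return s;
        if (!s->used && freeSlot == NULL)
            freeSlot = s;
    }
    return wantFree ? freeSlot : NULL;
}

/* Record a store made by interpreted code instead of touching memory. */
bool memHashWrite(MemHash *h, uint32_t addr, uint32_t val, bool valid)
{
    MemSlot *s = memHashFind(h, addr, true);
    if (s == NULL)
        return false;                 /* table full */
    s->used  = true;
    s->valid = valid;
    s->addr  = addr;
    s->val   = val;
    return true;
}

/* Returns true if addr was written earlier; when it returns false the
 * caller falls back to reading device memory. */
bool memHashRead(MemHash *h, uint32_t addr, uint32_t *val, bool *valid)
{
    MemSlot *s = memHashFind(h, addr, false);
    if (s == NULL)
        return false;
    *val   = s->val;
    *valid = s->valid;
    return true;
}

/* Drop entries beyond the top of a full-descending stack (addr < sp). */
void memHashPurge(MemHash *h, uint32_t sp)
{
    for (unsigned i = 0; i < MEM_HASH_SIZE; i++) {
        if (h->slot[i].used && h->slot[i].addr < sp)
            h->slot[i].used = false;
    }
}
```

Keeping the validity flag next to each value is what stops an invalid register from coming back as valid after a PUSH followed by a POP.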

Caveats

And there we have it - a scheme performing a kind of abstract interpretation of ARM or Thumb code in order to unwind the stack frames. However, while small, this method of abstract interpretation is not quite perfect. There are a number of situations where it will not work, although the general case so far suggests that it works very well in practice. Still, there are limitations, and these are best listed.

  • It is possible for the compiler to construct loops that appear as infinite due to the lack of interpretation of conditional code or branches.
  • The unwinder interprets the return path of the code. While this is generally the same or very similar to the calling path, there are circumstances where the two can be subtly different.
  • It is easy to construct code by hand that fools the interpretation.
  • If the stack has already been corrupted, unwinding cannot succeed.

The problem of infinite loops could be dealt with by adding a random element to interpretation. For example, if the model suspects itself to be stuck in an infinite loop (a large number of instructions have been interpreted with no function epilogue being found), it may start randomly taking conditional branches in an attempt to 'chance upon' the function epilogue. While more sophisticated methods of exiting infinite loops, such as tracking the PC and marking branch history, could be implemented, they would require more memory and complexity for something that has rarely been found to cause a problem in practical usage of the unwinder.

A greater problem with no solution is that of tail calling masking functions from the unwound stack. If the interpretation is started from a function that was tail called, the function that made the tail call will be omitted from the unwound stack. In example 8, unwinding started from tailFunc() or a sub-function thereof would omit to report tailCall() since the return path would not pass through that function. In hindsight it may have been better to run the model ARM backwards, although this presents different problems.

Implementation

The following shows the amount of code and data occupied by the unwinder when compiled using RVCT2.1 to Thumb code with -O2. The totals show that under 3k of ROM is used to implement the model ARM, which is very acceptable for my application.

Code   (inc. data)   RO Data   RW Data   ZI Data   Library Member Name
  44        0           0         0         0      unwarminder.o
 182        0           0         0         0      unwarm.o
 904       68           0         0         0      unwarm_arm.o
1198       60           0         0         0      unwarm_thumb.o
 300        0           0         0         0      unwarmmem.o
2628      128           0         0         0      Totals

Table 1: Unwinder code size (using RVCT2.1, TCC -O2).

The unwinding code is implemented in a handful of files, and a single header file named unwarminder.h needs to be included to access the functionality. Access to the system memory from the unwinder is abstracted through callbacks that must be implemented by the 'client' code, and this allows reads to be validated such that unwinding is stopped if the alignment or the address being read is at fault. The client code passes a small structure of function pointers to the unwinder to equip it with the callbacks required to read memory and report return addresses.

The function that starts the unwinding is given as follows:

UnwResult UnwindStart(Int32                  spValue,
                      const UnwindCallbacks *cb,
                      void                  *data);

Code 3: The function to start unwinding.

The cb structure gives the callbacks to allow memory accesses and return address reporting, while the data pointer can take any value and is passed to the reporting function (cb->report()) such that it may store state if required.

The spValue gives the stack pointer value at which to start unwinding; the PC value is determined automatically as it is effectively passed to the function via the Link Register and so can be retrieved. When using RVCT, the compiler intrinsic function __current_sp() allows the SP value to be read into a variable, so a call to start unwinding typically looks something like the following:

const UnwindCallbacks cliCallbacks = { ... };
CliStack              results;
Int8                  t;
UnwResult             r;

results.frameCount = 0;
r = UnwindStart(__current_sp(), &cliCallbacks, &results);

Code 4: Typical call to start unwinding, passing a pointer to some structure that lists all the callbacks, as well as a pointer to local storage.
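To give an idea of how the data pointer can be used, here is a sketch of a reporting callback that appends each return address to a CliStack. Only the frameCount field appears in the call above; the address array, its capacity, the cliReport() name, its exact signature and the meaning of its return value are assumptions for this example and should be checked against unwarminder.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define CLI_MAX_FRAMES 8u          /* hypothetical capacity */

typedef struct {
    uint16_t frameCount;
    uint32_t address[CLI_MAX_FRAMES];
} CliStack;

/* Assumed shape of a report callback: receives the opaque data pointer
 * given to the unwinder plus one return address. Returning false is
 * taken here to mean "stop unwinding". */
bool cliReport(void *data, uint32_t address)
{
    CliStack *s = (CliStack *)data;

    if (s->frameCount >= CLI_MAX_FRAMES)
        return false;              /* no room left, ask to stop */

    s->address[s->frameCount++] = address;
    return true;
}
```

Because the unwinder only ever sees a void pointer, the client is free to store the trace in whatever form suits it, for example a fixed array like this one that can later be written to a log.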

Finally, the implementation is not aware of any OS or memory protection or management schemes that may be in use on the target. The system on which this has been tested is simply configured with a flat memory map and has few restrictions on memory access, and the RTOS used poses no restrictions either. It may be the case that to use this on other targets the MMU or MPU has to be reconfigured or disabled before unwinding is started, or the functions used by the unwinder to access the memory specially constructed to ensure that memory protections will not cause a problem. Should the unwinder request access to addresses that are genuinely invalid, the client functions for memory access can return FALSE to indicate that the memory cannot be accessed, and unwinding will terminate.
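A client read function that validates its input might be sketched as below. The RAM window, the backing array (used so the sketch can run on a PC) and the clientReadWord() name and signature are all hypothetical; a real target would check against its own memory map and read through the address directly.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical readable window standing in for target RAM. */
#define RAM_BASE  0x20000000u
#define RAM_WORDS 4u
static const uint32_t ram[RAM_WORDS] = { 1u, 2u, 3u, 4u };

/* Read one word for the unwinder. Returns false for a misaligned or
 * out-of-range address so that unwinding can be stopped cleanly. */
bool clientReadWord(uint32_t address, uint32_t *value)
{
    if (value == NULL)
        return false;
    if ((address & 3u) != 0u)
        return false;                          /* misaligned */
    if (address < RAM_BASE ||
        address - RAM_BASE >= RAM_WORDS * 4u)
        return false;                          /* outside the window */

    *value = ram[(address - RAM_BASE) / 4u];
    return true;
}
```

Rejecting bad addresses in the callback keeps the unwinder itself free of any knowledge of the memory map, which is the point of the abstraction.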

Licence and Download

The source code for the stack unwinder is available for free download and I'm making it PUBLIC DOMAIN. This means that there is no copyright and anyone is able to take a copy for free and use it as they wish, with or without modifications, and in any context they like, commercially or otherwise. The only limitation is that I don't guarantee that the software is fit for any purpose or accept any liability for its use or misuse - the software is without warranty.

Having said all this, the software has been run under Valgrind and tested both in PC simulations (using ARMSD) and on ARM7TDMI and ARM920T targets.

The download package is available here (right-click, Save As...):

This package contains the source code for the unwinder as well as two 'clients' that allow the unwinder to be exercised. The first client is contained in two files, client.c and client.h, and can be built to produce an image that can be executed either on an ARM target or in an emulator and demonstrates the unwinding of the stack from which the unwinder is called. The second client is the 'simulation' client, simclient.c and simclient.h, which uses two memory images that are also supplied and contain a snapshot of a call stack and executable code which allows interpretation by the unwinder on a PC, where PC tools can also be used to debug the unwinder. The memory images supplied cannot however be run on a target since I've zeroed the areas of code that are not needed to demonstrate the unwinder such that the ARM runtime is not present in binary form.

Addendum

29/02/2012: Thomas Jarosch kindly provided this small bugfix.


This page is maintained by Michael McTernan
