By Hai Shalom
Original article: http://www.rt-embedded.com/blog/archives/writing-efficient-c-code-for-embedded-systems/
The traditional definition of efficiency has two aspects: speed and size. In most cases, optimizing for one causes a degradation in the other, and it's a matter of balancing between the two according to the specific needs. For each embedded system, or even each software module, the appropriate strategy must balance the two. Nowadays there is a third dimension to this definition, power, but in this article I am going to discuss the traditional aspects. In the years I've been working as a software engineer, I have gained a lot of experience with C code efficiency. I have seen how changing a few lines of code makes the difference, whether in performance, final size or memory consumption. The examples shown here were written in C and tested on an ARM platform; you should expect similar behavior on other processors. I might update this article from time to time with more tips, so it is recommended to bookmark it and catch up.
Global variables
The use of global variables is not recommended in most cases, but there are cases where we must use them, for example when declaring large tables/arrays which are used later by the software. When using global variables, follow these guidelines:
- Tables/arrays which never change should be defined as const. When defined as constant, they are moved to the read-only code section. If such a table were defined in a shared library without the const qualifier, it would be copied per each instance of a linked process; with const, only a single instance exists in memory.
- Reading and writing global variables requires additional opcodes (load the global's address, load the value from that address, then store the value back). If possible, use local variables, or work on a local copy.
- Declare all globals as static, unless other C files need to see them (which is not recommended at all; you can provide static read and write accessor functions instead).
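A short sketch of these guidelines (the table contents and accessor names here are illustrative, not from any real project):

```c
#include <stddef.h>

/* Never modified: const moves the table to the read-only section,
   so a shared library keeps a single copy in memory */
static const int seed_table[] = { 0x1021, 0x8005, 0x04C1, 0xA001 };

/* Writable state stays static; other files go through accessors */
static int last_error = 0;

int get_last_error( void )
{
    return last_error;
}

void set_last_error( int err )
{
    last_error = err;
}

int get_seed( size_t i )
{
    if( i >= sizeof( seed_table ) / sizeof( seed_table[0] ) ) {
        return 0;
    }
    return seed_table[i];
}
```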
Variable types, signedness and scope
It is very important to select the appropriate variable types. There are a few rules of thumb in this case:
- Try to use variables in the native word size of the processor (like int for 32-bit processors). On a typical 32-bit processor, the short int and char types are less recommended: some processors must read a full 32-bit word anyway and then shift and mask just to extract the right value, and even pipelined processors that have 16-bit or 8-bit access opcodes may read 32 bits anyway.
- If you don't require negative values, use unsigned variables. Unsigned arithmetic yields better performance, especially for division and multiplication.
- Declare variables only where you really need them, not at the beginning of the function. This allows the compiler to better utilize the internal registers and avoid assigning a register to a variable that is used only later.
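A small illustration of the signedness rule (the function names are mine): dividing an unsigned value by a power of two compiles to a single logical shift, while the signed version needs extra fix-up opcodes, because C division must round toward zero:

```c
/* Unsigned: compiles to a single logical shift right */
unsigned int half_u( unsigned int n )
{
    return n / 2;
}

/* Signed: needs extra opcodes, because -7 / 2 must yield -3
   (round toward zero), not the -4 a plain arithmetic shift gives */
int half_s( int n )
{
    return n / 2;
}
```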
Division and Modulus
The division and modulus arithmetic operations require extensive CPU cycles. On ARM platforms they are done in software, and on other processors as well, these operations are much slower than other arithmetic operations.
In some cases, the modulus operation can be implemented much more efficiently using a simple counter:
// Bad example: costly modulus on every call
int tick_after_100_cycles_mod( void )
{
    static unsigned int i = 0;
    if( ++i % 100 == 0 ) {
        return 1;
    }
    return 0;
}
// Good example: same behavior with a simple counter
int tick_after_100_cycles_cnt( void )
{
    static unsigned int i = 0;
    if( ++i >= 100 ) {
        i = 0;
        return 1;
    }
    return 0;
}
The division operation is extremely efficient if the divisor is a power of two (2, 4, 8, 16, 32…). The reason is the binary representation of numbers inside the processor, where each division by two is equivalent to a single right shift. If you are dividing a variable by a hard-coded power of two, the compiler will automatically convert the division to a shift operation. If you are dividing by a variable, the compiler cannot know which values it may hold and will not optimize the division, so you'll have to help it by doing the shift operation manually. Here are a few examples and the corresponding output for ARM:
/* Results in a call to division opcode or function */
int test_div( unsigned num, unsigned div )
{
    return num / div;
}

/* Results in an optimized automatic shift right */
int test_div_hardcoded_power_2( unsigned num )
{
    return num / 8;
}

/* Results in an optimized automatic shift */
int test_div_shift_right( unsigned num, unsigned powerof_two )
{
    return num >> powerof_two;
}
00000000 <test_div>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e24dd004 sub sp, sp, #4 ; 0x4
8: ebfffffe bl 0 <__aeabi_uidiv>
c: e28dd004 add sp, sp, #4 ; 0x4
10: e8bd8000 pop {pc}
00000014 <test_div_hardcoded_power_2>:
14: e1a001a0 lsr r0, r0, #3
18: e12fff1e bx lr
0000001c <test_div_shift_right>:
1c: e1a00130 lsr r0, r0, r1
20: e12fff1e bx lr
We can see that the functions that use a shift operation are very efficient, while the function with the real division calls the external __aeabi_uidiv( ) function to calculate the result.
For loops
For loops are widely used in code. Natural human thinking makes us write for loops counting from 0 up to the maximum value. However, when the order of iteration is not important, it is more efficient to count down to 0. The reason is an extra comparison opcode that is required when not comparing with 0: per each incrementing iteration, the CPU needs to compare the index against the maximum value to decide when to break. If the loop is decrementing, that comparison is not required, because the decrement itself sets the Z (zero) flag, and the CPU knows when to break just by looking at it.
extern void foo( int );

void test_incrementing_for_loop( void )
{
    int i;
    for( i = 0; i < 100; i++ ) {
        foo(i);
    }
}

void test_decrementing_for_loop( void )
{
    int i;
    for( i = 100; i; i-- ) {
        foo(i);
    }
}
00000000 <test_decrementing_for_loop>:
0: e92d4010 push {r4, lr}
4: e3a00064 mov r0, #100 ; 0x64
8: ebfffffe bl 0 <foo>
c: e3a04063 mov r4, #99 ; 0x63
10: e1a00004 mov r0, r4
14: ebfffffe bl 0 <foo>
18: e2544001 subs r4, r4, #1 ; 0x1
1c: 1afffffb bne 10 <test_decrementing_for_loop+0x10>
20: e8bd8010 pop {r4, pc}
00000024 <test_incrementing_for_loop>:
24: e92d4010 push {r4, lr}
28: e3a00000 mov r0, #0 ; 0x0
2c: ebfffffe bl 0 <foo>
30: e3a04001 mov r4, #1 ; 0x1
34: e1a00004 mov r0, r4
38: e2844001 add r4, r4, #1 ; 0x1
3c: ebfffffe bl 0 <foo>
40: e3540064 cmp r4, #100 ; 0x64
44: 1afffffa bne 34 <foo+0x34>
48: e8bd8010 pop {r4, pc}
We can see that the decrementing for loop needs one less opcode per iteration.
In some cases, when short loops are required, it can be more efficient (in terms of speed) to unroll the loop and save the loop overhead. The compiler can be configured to do this automatically using the -O3 optimization flag.
extern void foo( int );

void for_loop( void )
{
    int i;
    for( i = 4; i; i-- ) {
        foo(i);
    }
}

void unrolled_loop( void )
{
    foo(4);
    foo(3);
    foo(2);
    foo(1);
    foo(0);
}
00000000 <unrolled_loop>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e3a00004 mov r0, #4 ; 0x4
8: e24dd004 sub sp, sp, #4 ; 0x4
c: ebfffffe bl 0 <foo>
10: e3a00003 mov r0, #3 ; 0x3
14: ebfffffe bl 0 <foo>
18: e3a00002 mov r0, #2 ; 0x2
1c: ebfffffe bl 0 <foo>
20: e3a00001 mov r0, #1 ; 0x1
24: ebfffffe bl 0 <foo>
28: e3a00000 mov r0, #0 ; 0x0
2c: e28dd004 add sp, sp, #4 ; 0x4
30: e49de004 pop {lr} ; (ldr lr, [sp], #4)
34: eafffffe b 0 <foo>
00000038 <for_loop>:
38: e92d4010 push {r4, lr}
3c: e3a00004 mov r0, #4 ; 0x4
40: ebfffffe bl 0 <foo>
44: e3a04003 mov r4, #3 ; 0x3
48: e1a00004 mov r0, r4
4c: ebfffffe bl 0 <foo>
50: e2544001 subs r4, r4, #1 ; 0x1
54: 1afffffb bne 48 <foo+0x48>
58: e8bd8010 pop {r4, pc}
In the result we can see that the unrolled loop is longer. However, if we take a closer look, we can see that the unrolled loop runs faster, because each call is done with 2 opcodes instead of 4. In the unrolled version we have 2 opcodes per call: loading a value and calling foo( ). In the regular for loop we have 4 opcodes per iteration: loading a value, calling foo( ), decrementing the register by 1, and branching on the comparison with 0 (with an incrementing for loop there could be even another opcode).
If-Else and Switch
In many cases, we use branching in our code. We all know the cases where more conditions are added as the code grows or new features arrive, and we might end up with multiple if-else-if clauses. Long if-else chains are not efficient, because in the worst case scenario, where the last comparison is the one that matches, the CPU must check all the other possibilities first. In such cases it is more efficient to use a switch clause, which the compiler can implement as a jump table: it generates a list of addresses, and the CPU jumps directly to the correct one using the switch index.
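As a minimal sketch of such a dense switch (the constant return values here just stand in for the hypothetical do_somethingN( ) handlers used in the examples below):

```c
/* Dense case values (1..8) let the compiler emit a jump table:
   one range check, then a single indirect jump to the right case */
int dispatch( int i )
{
    switch( i ) {
    case 1: return 100;
    case 2: return 200;
    case 3: return 300;
    case 4: return 400;
    case 5: return 500;
    case 6: return 600;
    case 7: return 700;
    case 8: return 800;
    default: return 0;
    }
}
```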
If it is not possible to use a switch, it may be possible to apply a binary search strategy: if the range is large, we can divide it into sub-ranges and increase performance by reducing the number of tests in the worst case scenario.
/* Bad in worst case scenario */
if( i == 1 ) {
    do_something1();
} else if( i == 2 ) {
    do_something2();
} else if( i == 3 ) {
    do_something3();
} else if( i == 4 ) {
    do_something4();
} else if( i == 5 ) {
    do_something5();
} else if( i == 6 ) {
    do_something6();
} else if( i == 7 ) {
    do_something7();
} else if( i == 8 ) {
    do_something8();
}
/* Improved version */
if( i <= 4 ) {
    if( i == 1 ) {
        do_something1();
    } else if( i == 2 ) {
        do_something2();
    } else if( i == 3 ) {
        do_something3();
    } else if( i == 4 ) {
        do_something4();
    }
} else {
    if( i == 5 ) {
        do_something5();
    } else if( i == 6 ) {
        do_something6();
    } else if( i == 7 ) {
        do_something7();
    } else if( i == 8 ) {
        do_something8();
    }
}
We can see that for i == 5, version 1 takes 5 comparisons while version 2 takes 2, and for i == 8, version 1 takes 8 comparisons while version 2 takes 5. The binary split can be made even finer if required.
Lazy Evaluation
Another principle of conditional evaluation in C is short-circuit (sometimes called "lazy") evaluation. In clauses with multiple conditions, the generated code evaluates, from left to right, only the minimum number of conditions required to decide the clause. In a multiple-OR clause it is sufficient for one condition to be true to satisfy the clause, and in a multiple-AND clause it is sufficient for one condition to be false to negate it. We can exploit this property by putting the "easy" or "trivial" checks first, and thus avoid some work.
int check_something_1( int i )
{
    if( i > 99 && i % 100 == 0 ) {
        return 1;
    }
    return 0;
}

int check_something_2( int i )
{
    if( i == 0 || i % 100 == 0 ) {
        return 1;
    }
    return 0;
}
In the first function, we first check whether i is bigger than 99, and only then perform the modulus calculation, so the modulus is skipped every time the function is called with values 0-99. In the second function, the modulus is skipped when i equals 0. In both cases it would be wrong to swap the conditions, because the "easy" condition should come first.
Return Value checking
Another thing that can slightly optimize the code is to check function return values by comparing against 0. Since comparison with zero is native to the CPU, it usually saves one or two opcodes. If a function returns 0 on success and -1 on failure, don't check whether the return value equals -1; instead, check whether it is less than 0. Another example: if a function returns OK or NOK, compare the return value against the enumeration value which equals 0.
/* returns 0 on success, -1 on failure */
extern int foo( int i );

/* Good example 1: Good path */
if( foo( 7 ) == 0 ) {
    /* Good path: do something */
}

/* Good example 2: Bad path */
if( foo( 7 ) != 0 ) {
    /* Bad path: do something */
}

/* Good example 3: Bad path */
if( foo( 7 ) < 0 ) {
    /* Bad path: do something */
}

/* Bad example */
if( foo( 7 ) == -1 ) {
    /* Bad path */
}
Lookup tables
Lookup tables can speed up processing, usually at the cost of size, by holding all the answers in advance without the need to calculate them. Here's an example that wins on both size and speed:
char get_char( unsigned int i )
{
    switch( i ) {
    case 0:
        return 'r';
    case 1:
        return 't';
    case 2:
        return '-';
    case 3:
        return 'e';
    case 4:
        return 'm';
    case 5:
        return 'b';
    default:
        return '\0';
    }
}

static const char lookup_table[] = { 'r', 't', '-', 'e', 'm', 'b' };

char get_char_lookup( unsigned int i )
{
    if( i >= sizeof( lookup_table ) ) {
        return '\0';
    }
    return lookup_table[i];
}
00000000 <get_char>:
0: e3500005 cmp r0, #5 ; 0x5
4: 979ff100 ldrls pc, [pc, r0, lsl #2]
8: ea000005 b 24 <get_char+0x24>
c: 0000002c .word 0x0000002c
10: 00000034 .word 0x00000034
14: 0000003c .word 0x0000003c
18: 00000044 .word 0x00000044
1c: 0000004c .word 0x0000004c
20: 00000054 .word 0x00000054
24: e3a00000 mov r0, #0 ; 0x0
28: e12fff1e bx lr
2c: e3a00072 mov r0, #114 ; 0x72
30: e12fff1e bx lr
34: e3a00074 mov r0, #116 ; 0x74
38: e12fff1e bx lr
3c: e3a0002d mov r0, #45 ; 0x2d
40: e12fff1e bx lr
44: e3a00065 mov r0, #101 ; 0x65
48: e12fff1e bx lr
4c: e3a0006d mov r0, #109 ; 0x6d
50: e12fff1e bx lr
54: e3a00062 mov r0, #98 ; 0x62
58: e12fff1e bx lr
0000005c <get_char_lookup>:
5c: e3500005 cmp r0, #5 ; 0x5
60: 959f3008 ldrls r3, [pc, #8] ; 70 <get_char_lookup+0x14>
64: 83a00000 movhi r0, #0 ; 0x0
68: 97d30000 ldrbls r0, [r3, r0]
6c: e12fff1e bx lr
70: 00000000 .word 0x00000000
We can see that the lookup table implementation is fast and small.
Data caching and avoiding calling other functions/system calls
I am not referring to the actual CPU cache here. Sometimes we call functions or system calls in order to perform a calculation or fetch some data. If possible, try to reduce or eliminate these calls, especially system calls, which cause a context switch and data copying (relatively slow actions).
In this example, we can skip calling getpid( ) from the second call onwards:
#include <unistd.h>
#include <stdlib.h>

pid_t my_getpid( void )
{
    pid_t pid = getpid();
    return pid;
}

pid_t my_getpid_optimized( void )
{
    static pid_t pid = -1;
    if( pid < 0 ) {
        pid = getpid();
    }
    return pid;
}
00000000 <my_getpid_optimized>:
0: e92d4010 push {r4, lr}
4: e59f4020 ldr r4, [pc, #32] ; 2c <my_getpid_optimized+0x2c>
8: e5943000 ldr r3, [r4]
c: e3530000 cmp r3, #0 ; 0x0
10: ba000001 blt 1c <my_getpid_optimized+0x1c>
14: e5940000 ldr r0, [r4]
18: e8bd8010 pop {r4, pc}
1c: ebfffffe bl 0 <getpid>
20: e5840000 str r0, [r4]
24: e5940000 ldr r0, [r4]
28: e8bd8010 pop {r4, pc}
2c: 00000000 .word 0x00000000
00000030 <my_getpid>:
30: eafffffe b 0 <getpid>
While the my_getpid( ) function looks a lot shorter (both in C and in assembly), it is actually much slower than the other function. The reason is that my_getpid( ) initiates a system call each time it is called, whereas my_getpid_optimized( ) makes the system call only once and afterwards returns the stored process id. In this example, we can see the tradeoff between speed and size.
Efficient search and sort functions
The uClibc library (and probably other standard C libraries) provides very efficient search and sort functions, so if you need to search or sort elements of any type, you don't need to write an algorithm from scratch; use the existing ones. The binary search function is called bsearch( ), and the quick sort function is called qsort( ). When using these functions, you specify the size of your element and a comparison function which the algorithms use to order elements. Here's an example of quick sorting an array of elements. Each element consists of an index and a name; we define the sort criterion as the index number and write the comparison function accordingly.
#include <stdlib.h>

/* Entry element */
struct entry_t {
    int index;
    char name[ 10 ];
};

static int mycmp( const void *a, const void *b )
{
    /* Compare the elements by index; qsort( ) expects a
       negative, zero or positive result */
    const struct entry_t *ea = a;
    const struct entry_t *eb = b;
    if( ea->index < eb->index ) {
        return -1;
    }
    if( ea->index > eb->index ) {
        return 1;
    }
    return 0;
}

/* Maximum elements */
#define MAX_ENTRIES 100

/* The array to be sorted */
struct entry_t entries_array[ MAX_ENTRIES ];

int do_sort( void )
{
    /* Sort the array */
    qsort( entries_array, MAX_ENTRIES, sizeof( struct entry_t ), mycmp );
    return 0;
}
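The companion bsearch( ) works on the same principle once the array is sorted: pass a key element with only the sort field filled in, plus the same style of comparison function. A sketch (find_entry( ) and cmp_index( ) are my names; the struct is repeated so the sketch is self-contained):

```c
#include <stdlib.h>

/* Same entry element as above, repeated for self-containment */
struct entry_t {
    int index;
    char name[ 10 ];
};

/* Returns negative, zero or positive, as qsort( )/bsearch( ) expect */
static int cmp_index( const void *a, const void *b )
{
    int ia = ((const struct entry_t *)a)->index;
    int ib = ((const struct entry_t *)b)->index;
    return ( ia > ib ) - ( ia < ib );
}

/* Binary search in an array already sorted by index */
struct entry_t *find_entry( struct entry_t *arr, size_t n, int index )
{
    struct entry_t key;
    key.index = index; /* only the sort field matters */
    return bsearch( &key, arr, n, sizeof( arr[0] ), cmp_index );
}
```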
Inline functions
Inline functions are functions whose body is copied into the calling function, avoiding the function call overhead. They can be used to increase the performance of a specific critical section. Note that inlining duplicates the code per each call site, so use it wisely.
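For example (a sketch; clamp_u8( ) is a made-up helper), marking a tiny hot-path function static inline lets the compiler paste its body at each call site instead of emitting a call:

```c
/* Small enough that the call overhead would dominate the work,
   so inlining it in a hot loop is usually a win */
static inline unsigned int clamp_u8( unsigned int v )
{
    return v > 255 ? 255 : v;
}
```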
Bitmaps
A bitmap is usually a register-sized (32-bit) variable used to represent data compactly. The 32 bits can be divided to represent up to 32 pieces of data, with 1 bit per piece representing 2 states (on/off, enabled/disabled). The fewer pieces represented, the more states can be stored per piece (for 16 pieces, each piece gets 2 bits, which represent 4 different states). Bitmaps can be very efficient compared to other data structures, mainly because of the single memory access and the fact that the data can be manipulated using bit-wise operations (&, |, ~, ^). For example, suppose we need to keep a list of objects (like interfaces, devices, etc.) and track which of them has a specific property enabled (like link or connectivity). Usually we would use an array or another data structure to store this information, but if the list is 32 entries long or less, we can store it in a single word and mark the property with a single bit. Each object has its own bit offset: enabling the property is a single OR operation at the object's offset, and disabling it is a single AND operation with the inverted mask.
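The interface/link example can be sketched like this (the names are illustrative):

```c
#include <stdint.h>

/* One bit per interface: bit i tracks whether interface i has link */
static uint32_t link_map = 0;

static void set_link_up( unsigned int i )
{
    link_map |= ( 1u << i );   /* single OR at the object's offset */
}

static void set_link_down( unsigned int i )
{
    link_map &= ~( 1u << i );  /* single AND with the inverted mask */
}

static int is_link_up( unsigned int i )
{
    return ( link_map >> i ) & 1u;
}
```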
Compiler optimizations
Don't forget to enable the appropriate compiler optimizations. You can read about them in the Using gcc article listed below.
Resources
Quick sort: http://linux.die.net/man/3/qsort
Binary search: http://linux.die.net/man/3/bsearch
Using gcc: http://www.rt-embedded.com/blog/archives/using-gcc-part-2/