By Hai Shalom
Original article: http://www.rt-embedded.com/blog/archives/writing-efficient-c-code-for-embedded-systems/
The traditional definition of efficiency has two aspects: speed and size. In most cases, optimizing for one causes a degradation in the other, and it's a matter of balancing between the two according to the specific needs. For each embedded system, or even each software module, the appropriate strategy must balance the two. Nowadays there is a third dimension to this definition, power, but in this article I am going to discuss the traditional aspects. In the years I've been working as a software engineer, I have gained a lot of experience with C code efficiency. I have seen how changing a few lines of code makes the difference, whether in performance, final size or memory consumption. The examples shown here were written in C and tested on an ARM platform; you should expect similar behavior on other processors. I might update this article from time to time with more tips, so it is recommended to bookmark it and catch up.
Global variables
The use of global variables is not recommended in most cases, but there are cases where we must use them, for example when declaring large tables/arrays which are used later by the software. When using global variables, follow these guidelines:
- Tables/arrays which never change should be defined as const. When defined as constant, they are moved to the read-only code section. If such a table were defined in a shared library without the const qualifier, it would be copied per each instance of a linked process; with const, only a single instance exists in memory.
- Reading and writing global variables requires additional opcodes (load the global's address, load the value from that address, then store the value back). If possible, use local variables, or work on a local copy.
- Declare all globals as static, unless other C files need to see them (which is not recommended at all; you can provide static read and write accessor functions instead).
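A short sketch of these guidelines (the table contents and accessor names here are illustrative, not from any real project):

```c
#include <stddef.h>

/* Never modified: const moves the table to the read-only section,
   so a shared library keeps a single copy in memory */
static const int seed_table[] = { 0x1021, 0x8005, 0x04C1, 0xA001 };

/* Writable state stays static; other files go through accessors */
static int last_error = 0;

int get_last_error( void )
{
    return last_error;
}

void set_last_error( int err )
{
    last_error = err;
}

int get_seed( size_t i )
{
    if( i >= sizeof( seed_table ) / sizeof( seed_table[0] ) ) {
        return 0;
    }
    return seed_table[i];
}
```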
Variable types, signedness and scope
It is very important to select the appropriate variable types. There are a few rules of thumb in this case:
- Try to use variables in the native word size of the processor (like int for 32-bit processors). On a typical 32-bit processor, the short int and char types are less recommended: some processors must read a full 32-bit word anyway and then shift and mask just to extract the right value, and even pipelined processors that have 16-bit or 8-bit access opcodes may read 32 bits anyway.
- If you don't require negative values, use unsigned variables. Unsigned arithmetic yields better performance, especially for division and multiplication.
- Declare variables only where you really need them, not at the beginning of the function. This allows the compiler to better utilize the internal registers and avoid assigning a register to a variable that is used only later.
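A small illustration of the signedness rule (the function names are mine): dividing an unsigned value by a power of two compiles to a single logical shift, while the signed version needs extra fix-up opcodes, because C division must round toward zero:

```c
/* Unsigned: compiles to a single logical shift right */
unsigned int half_u( unsigned int n )
{
    return n / 2;
}

/* Signed: needs extra opcodes, because -7 / 2 must yield -3
   (round toward zero), not the -4 a plain arithmetic shift gives */
int half_s( int n )
{
    return n / 2;
}
```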
Division and Modulus
The division and modulus arithmetic operations require extensive CPU cycles. On ARM platforms they are done in software, and on other processors as well, these operations are much slower than other arithmetic operations.
In some cases, the modulus operation can be implemented much more efficiently using a simple counter:
// Bad example: costly modulus on every call
int tick_after_100_cycles_mod( void )
{
    static unsigned int i = 0;
    if( ++i % 100 == 0 ) {
        return 1;
    }
    return 0;
}
// Good example: same behavior with a simple counter
int tick_after_100_cycles_cnt( void )
{
    static unsigned int i = 0;
    if( ++i >= 100 ) {
        i = 0;
        return 1;
    }
    return 0;
}
The division operation is extremely efficient if the divisor is a power of two (2, 4, 8, 16, 32…). The reason is the binary representation of numbers inside the processor, where each division by two is equivalent to a single right shift. If you are dividing a variable by a hard-coded power of two, the compiler will automatically convert the division to a shift operation. If you are dividing by a variable, the compiler cannot know which values it may hold and will not optimize the division, so you'll have to help it by doing the shift operation manually. Here are a few examples and the corresponding output for ARM:
/* Results in a call to division opcode or function */
int test_div( unsigned num, unsigned div )
{
    return num / div;
}

/* Results in an optimized automatic shift right */
int test_div_hardcoded_power_2( unsigned num )
{
    return num / 8;
}

/* Results in an optimized automatic shift */
int test_div_shift_right( unsigned num, unsigned powerof_two )
{
    return num >> powerof_two;
}
00000000 <test_div>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e24dd004 sub sp, sp, #4 ; 0x4
8: ebfffffe bl 0 <__aeabi_uidiv>
c: e28dd004 add sp, sp, #4 ; 0x4
10: e8bd8000 pop {pc}
00000014 <test_div_hardcoded_power_2>:
14: e1a001a0 lsr r0, r0, #3
18: e12fff1e bx lr
0000001c <test_div_shift_right>:
1c: e1a00130 lsr r0, r0, r1
20: e12fff1e bx lr
We can see that the functions that use a shift operation are very efficient, while the function with the real division calls the external __aeabi_uidiv( ) function to calculate the result.
For loops
For loops are widely used in code. Natural human thinking makes us write for loops counting from 0 up to the maximum value. However, when the order of iteration is not important, it is more efficient to count down to 0. The reason is an extra comparison opcode that is required when not comparing with 0: per each incrementing iteration, the CPU needs to compare the index against the maximum value to decide when to break. If the loop is decrementing, that comparison is not required, because the decrement itself sets the Z (zero) flag, and the CPU knows when to break just by looking at it.
extern void foo( int );

void test_incrementing_for_loop( void )
{
    int i;
    for( i = 0; i < 100; i++ ) {
        foo(i);
    }
}

void test_decrementing_for_loop( void )
{
    int i;
    for( i = 100; i; i-- ) {
        foo(i);
    }
}
00000000 <test_decrementing_for_loop>:
0: e92d4010 push {r4, lr}
4: e3a00064 mov r0, #100 ; 0x64
8: ebfffffe bl 0 <foo>
c: e3a04063 mov r4, #99 ; 0x63
10: e1a00004 mov r0, r4
14: ebfffffe bl 0 <foo>
18: e2544001 subs r4, r4, #1 ; 0x1
1c: 1afffffb bne 10 <test_decrementing_for_loop+0x10>
20: e8bd8010 pop {r4, pc}
00000024 <test_incrementing_for_loop>:
24: e92d4010 push {r4, lr}
28: e3a00000 mov r0, #0 ; 0x0
2c: ebfffffe bl 0 <foo>
30: e3a04001 mov r4, #1 ; 0x1
34: e1a00004 mov r0, r4
38: e2844001 add r4, r4, #1 ; 0x1
3c: ebfffffe bl 0 <foo>
40: e3540064 cmp r4, #100 ; 0x64
44: 1afffffa bne 34 <foo+0x34>
48: e8bd8010 pop {r4, pc}
We can see that the decrementing for loop needs one less opcode per iteration.
In some cases, when short loops are required, it can be more efficient (in terms of speed) to unroll the loop and save the loop overhead. The compiler can be configured to do this automatically using the -O3 optimization flag.
extern void foo( int );

void for_loop( void )
{
    int i;
    for( i = 4; i; i-- ) {
        foo(i);
    }
}

void unrolled_loop( void )
{
    foo(4);
    foo(3);
    foo(2);
    foo(1);
    foo(0);
}
00000000 <unrolled_loop>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e3a00004 mov r0, #4 ; 0x4
8: e24dd004 sub sp, sp, #4 ; 0x4
c: ebfffffe bl 0 <foo>
10: e3a00003 mov r0, #3 ; 0x3
14: ebfffffe bl 0 <foo>
18: e3a00002 mov r0, #2 ; 0x2
1c: ebfffffe bl 0 <foo>
20: e3a00001 mov r0, #1 ; 0x1
24: ebfffffe bl 0 <foo>
28: e3a00000 mov r0, #0 ; 0x0
2c: e28dd004 add sp, sp, #4 ; 0x4
30: e49de004 pop {lr} ; (ldr lr, [sp], #4)
34: eafffffe b 0 <foo>
00000038 <for_loop>:
38: e92d4010 push {r4, lr}
3c: e3a00004 mov r0, #4 ; 0x4
40: ebfffffe bl 0 <foo>
44: e3a04003 mov r4, #3 ; 0x3
48: e1a00004 mov r0, r4
4c: ebfffffe bl 0 <foo>
50: e2544001 subs r4, r4, #1 ; 0x1
54: 1afffffb bne 48 <foo+0x48>
58: e8bd8010 pop {r4, pc}
In the result we can see that the unrolled loop is longer. However, if we take a closer look, we can see that the unrolled loop runs faster, because each call is done with 2 opcodes instead of 4. In the unrolled version we have 2 opcodes per call: loading a value and calling foo( ). In the regular for loop we have 4 opcodes per iteration: loading a value, calling foo( ), decrementing the register by 1, and branching on the comparison with 0 (with an incrementing for loop there could be even another opcode).
If-Else and Switch
In many cases, we use branching in our code. We all know the cases where more conditions are added as the code grows or new features arrive, and we might end up with multiple if-else-if clauses. Long if-else chains are not efficient, because in the worst case scenario, where the last comparison is the one that matches, the CPU must check all the other possibilities first. In such cases it is more efficient to use a switch clause, which the compiler can implement as a jump table: it generates a list of addresses, and the CPU jumps directly to the correct one using the switch index.
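As a minimal sketch of such a dense switch (the constant return values here just stand in for the hypothetical do_somethingN( ) handlers used in the examples below):

```c
/* Dense case values (1..8) let the compiler emit a jump table:
   one range check, then a single indirect jump to the right case */
int dispatch( int i )
{
    switch( i ) {
    case 1: return 100;
    case 2: return 200;
    case 3: return 300;
    case 4: return 400;
    case 5: return 500;
    case 6: return 600;
    case 7: return 700;
    case 8: return 800;
    default: return 0;
    }
}
```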
If it is not possible to use a switch, it may be possible to apply a binary search strategy: if the range is large, we can divide it into sub-ranges and increase performance by reducing the number of tests in the worst case scenario.
/* Bad in worst case scenario */
if( i == 1 ) {
    do_something1();
} else if( i == 2 ) {
    do_something2();
} else if( i == 3 ) {
    do_something3();
} else if( i == 4 ) {
    do_something4();
} else if( i == 5 ) {
    do_something5();
} else if( i == 6 ) {
    do_something6();
} else if( i == 7 ) {
    do_something7();
} else if( i == 8 ) {
    do_something8();
}
/* Improved version */
if( i <= 4 ) {
    if( i == 1 ) {
        do_something1();
    } else if( i == 2 ) {
        do_something2();
    } else if( i == 3 ) {
        do_something3();
    } else if( i == 4 ) {
        do_something4();
    }
} else {
    if( i == 5 ) {
        do_something5();
    } else if( i == 6 ) {
        do_something6();
    } else if( i == 7 ) {
        do_something7();
    } else if( i == 8 ) {
        do_something8();
    }
}
We can see that for i == 5, version 1 takes 5 comparisons while version 2 takes 2, and for i == 8, version 1 takes 8 comparisons while version 2 takes 5. The binary split can be made even finer if required.
Lazy Evaluation
Another principle of conditional evaluation in C is short-circuit (sometimes called "lazy") evaluation. In clauses with multiple conditions, the generated code evaluates, from left to right, only the minimum number of conditions required to decide the clause. In a multiple-OR clause it is sufficient for one condition to be true to satisfy the clause, and in a multiple-AND clause it is sufficient for one condition to be false to negate it. We can exploit this property by putting the "easy" or "trivial" checks first, and thus avoid some work.
int check_something_1( int i )
{
    if( i > 99 && i % 100 == 0 ) {
        return 1;
    }
    return 0;
}

int check_something_2( int i )
{
    if( i == 0 || i % 100 == 0 ) {
        return 1;
    }
    return 0;
}
In the first function, we first check whether i is bigger than 99, and only then perform the modulus calculation, so the modulus is skipped every time the function is called with values 0-99. In the second function, the modulus is skipped when i equals 0. In both cases it would be wrong to swap the conditions, because the "easy" condition should come first.
Return Value checking
Another thing that can slightly optimize the code is to check function return values by comparing against 0. Since comparison with zero is native to the CPU, it usually saves one or two opcodes. If a function returns 0 on success and -1 on failure, don't check whether the return value equals -1; instead, check whether it is less than 0. Another example: if a function returns OK or NOK, compare the return value against the enumeration value which equals 0.
/* returns 0 on success, -1 on failure */
extern int foo( int i );

/* Good example 1: Good path */
if( foo( 7 ) == 0 ) {
    /* Good path: do something */
}

/* Good example 2: Bad path */
if( foo( 7 ) != 0 ) {
    /* Bad path: do something */
}

/* Good example 3: Bad path */
if( foo( 7 ) < 0 ) {
    /* Bad path: do something */
}

/* Bad example */
if( foo( 7 ) == -1 ) {
    /* Bad path */
}
Lookup tables
Lookup tables can speed up processing, usually at the cost of size, by holding all the answers in advance without the need to calculate them. Here's an example that wins on both size and speed:
char get_char( unsigned int i )
{
    switch( i ) {
    case 0:
        return 'r';
    case 1:
        return 't';
    case 2:
        return '-';
    case 3:
        return 'e';
    case 4:
        return 'm';
    case 5:
        return 'b';
    default:
        return '\0';
    }
}

static const char lookup_table[] = { 'r', 't', '-', 'e', 'm', 'b' };

char get_char_lookup( unsigned int i )
{
    if( i >= sizeof( lookup_table ) ) {
        return '\0';
    }
    return lookup_table[i];
}
00000000 <get_char>:
0: e3500005 cmp r0, #5 ; 0x5
4: 979ff100 ldrls pc, [pc, r0, lsl #2]
8: ea000005 b 24 <get_char+0x24>
c: 0000002c .word 0x0000002c
10: 00000034 .word 0x00000034
14: 0000003c .word 0x0000003c
18: 00000044 .word 0x00000044
1c: 0000004c .word 0x0000004c
20: 00000054 .word 0x00000054
24: e3a00000 mov r0, #0 ; 0x0
28: e12fff1e bx lr
2c: e3a00072 mov r0, #114 ; 0x72
30: e12fff1e bx lr
34: e3a00074 mov r0, #116 ; 0x74
38: e12fff1e bx lr
3c: e3a0002d mov r0, #45 ; 0x2d
40: e12fff1e bx lr
44: e3a00065 mov r0, #101 ; 0x65
48: e12fff1e bx lr
4c: e3a0006d mov r0, #109 ; 0x6d
50: e12fff1e bx lr
54: e3a00062 mov r0, #98 ; 0x62
58: e12fff1e bx lr
0000005c <get_char_lookup>:
5c: e3500005 cmp r0, #5 ; 0x5
60: 959f3008 ldrls r3, [pc, #8] ; 70 <get_char_lookup+0x14>
64: 83a00000 movhi r0, #0 ; 0x0
68: 97d30000 ldrbls r0, [r3, r0]
6c: e12fff1e bx lr
70: 00000000 .word 0x00000000
We can see that the lookup table implementation is fast and small.
Data caching and avoiding calling other functions/system calls
I am not referring to the actual CPU cache here. Sometimes we call functions or system calls in order to perform a calculation or fetch some data. If possible, try to reduce or eliminate these calls, especially system calls, which cause a context switch and data copying (relatively slow actions).
In this example, we can skip calling getpid( ) from the second call onwards:
#include <unistd.h>
#include <stdlib.h>

pid_t my_getpid( void )
{
    pid_t pid = getpid();
    return pid;
}

pid_t my_getpid_optimized( void )
{
    static pid_t pid = -1;
    if( pid < 0 ) {
        pid = getpid();
    }
    return pid;
}
00000000 <my_getpid_optimized>:
0: e92d4010 push {r4, lr}
4: e59f4020 ldr r4, [pc, #32] ; 2c <my_getpid_optimized+0x2c>
8: e5943000 ldr r3, [r4]
c: e3530000 cmp r3, #0 ; 0x0
10: ba000001 blt 1c <my_getpid_optimized+0x1c>
14: e5940000 ldr r0, [r4]
18: e8bd8010 pop {r4, pc}
1c: ebfffffe bl 0 <getpid>
20: e5840000 str r0, [r4]
24: e5940000 ldr r0, [r4]
28: e8bd8010 pop {r4, pc}
2c: 00000000 .word 0x00000000
00000030 <my_getpid>:
30: eafffffe b 0 <getpid>
While the my_getpid( ) function looks a lot shorter (both in C and in assembly), it is actually much slower than the other function. The reason is that my_getpid( ) initiates a system call each time it is called, whereas my_getpid_optimized( ) makes the system call only once and afterwards returns the stored process id. In this example, we can see the tradeoff between speed and size.
Efficient search and sort functions
The uClibc library (and probably other standard C libraries) provides very efficient search and sort functions, so if you need to search or sort elements of any type, you don't need to write an algorithm from scratch; use the existing ones. The binary search function is called bsearch( ), and the quick sort function is called qsort( ). When using these functions, you specify the size of your element and a comparison function which the algorithms use to order elements. Here's an example of quick sorting an array of elements. Each element consists of an index and a name; we define the sort criterion as the index number and write the comparison function accordingly.
#include <stdlib.h>

/* Entry element */
struct entry_t {
    int index;
    char name[ 10 ];
};

static int mycmp( const void *a, const void *b )
{
    /* Compare the elements by index; qsort( ) expects a
       negative, zero or positive result */
    const struct entry_t *ea = a;
    const struct entry_t *eb = b;
    if( ea->index < eb->index ) {
        return -1;
    }
    if( ea->index > eb->index ) {
        return 1;
    }
    return 0;
}

/* Maximum elements */
#define MAX_ENTRIES 100

/* The array to be sorted */
struct entry_t entries_array[ MAX_ENTRIES ];

int do_sort( void )
{
    /* Sort the array */
    qsort( entries_array, MAX_ENTRIES, sizeof( struct entry_t ), mycmp );
    return 0;
}
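The companion bsearch( ) works on the same principle once the array is sorted: pass a key element with only the sort field filled in, plus the same style of comparison function. A sketch (find_entry( ) and cmp_index( ) are my names; the struct is repeated so the sketch is self-contained):

```c
#include <stdlib.h>

/* Same entry element as above, repeated for self-containment */
struct entry_t {
    int index;
    char name[ 10 ];
};

/* Returns negative, zero or positive, as qsort( )/bsearch( ) expect */
static int cmp_index( const void *a, const void *b )
{
    int ia = ((const struct entry_t *)a)->index;
    int ib = ((const struct entry_t *)b)->index;
    return ( ia > ib ) - ( ia < ib );
}

/* Binary search in an array already sorted by index */
struct entry_t *find_entry( struct entry_t *arr, size_t n, int index )
{
    struct entry_t key;
    key.index = index; /* only the sort field matters */
    return bsearch( &key, arr, n, sizeof( arr[0] ), cmp_index );
}
```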
Inline functions
Inline functions are functions whose body is copied into the calling function, avoiding the function call overhead. They can be used to increase the performance of a specific critical section. Note that inlining duplicates the code per each call site, so use it wisely.
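For example (a sketch; clamp_u8( ) is a made-up helper), marking a tiny hot-path function static inline lets the compiler paste its body at each call site instead of emitting a call:

```c
/* Small enough that the call overhead would dominate the work,
   so inlining it in a hot loop is usually a win */
static inline unsigned int clamp_u8( unsigned int v )
{
    return v > 255 ? 255 : v;
}
```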
Bitmaps
A bitmap is usually a register-sized (32-bit) variable used to represent data compactly. The 32 bits can be divided to represent up to 32 pieces of data, with 1 bit per piece representing 2 states (on/off, enabled/disabled). The fewer pieces represented, the more states can be stored per piece (for 16 pieces, each piece gets 2 bits, which represent 4 different states). Bitmaps can be very efficient compared to other data structures, mainly because of the single memory access and the fact that the data can be manipulated using bit-wise operations (&, |, ~, ^). For example, suppose we need to keep a list of objects (like interfaces, devices, etc.) and track which of them has a specific property enabled (like link or connectivity). Usually we would use an array or another data structure to store this information, but if the list is 32 entries long or less, we can store it in a single word and mark the property with a single bit. Each object has its own bit offset: enabling the property is a single OR operation at the object's offset, and disabling it is a single AND operation with the inverted mask.
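The interface/link example can be sketched like this (the names are illustrative):

```c
#include <stdint.h>

/* One bit per interface: bit i tracks whether interface i has link */
static uint32_t link_map = 0;

static void set_link_up( unsigned int i )
{
    link_map |= ( 1u << i );   /* single OR at the object's offset */
}

static void set_link_down( unsigned int i )
{
    link_map &= ~( 1u << i );  /* single AND with the inverted mask */
}

static int is_link_up( unsigned int i )
{
    return ( link_map >> i ) & 1u;
}
```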
Compiler optimizations
Don't forget to enable the appropriate compiler optimizations. You can read about them in the Using gcc article listed below.
Resources
Quick sort: http://linux.die.net/man/3/qsort
Binary search: http://linux.die.net/man/3/bsearch
Using gcc: http://www.rt-embedded.com/blog/archives/using-gcc-part-2/