architecture is a RISC load/store architecture. In other words you must load values from
memory into registers before acting on them. There are no arithmetic or logical instructions
that manipulate values in memory directly.
Loads that act on 8- or 16-bit values extend the value to 32 bits before writing
to an ARM register. Unsigned values are zero-extended, and signed values sign-extended.
This means that the cast of a loaded value to an int type does not cost extra instructions.
Similarly, a store of an 8- or 16-bit value selects the lowest 8 or 16 bits of the register. The
cast of an int to a smaller type does not cost extra instructions on a store.
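The following is a minimal sketch of the load and store casts described above. The function names load_sample, load_byte, and store_sample are hypothetical and chosen only for illustration. On an ARMv4 target each function would typically reduce to a single load or store, though the exact output depends on the compiler.

```c
#include <stdint.h>

/* Hypothetical helpers, named here only for illustration. */

/* Widen a signed 16-bit value to int. The sign extension is expected to be
   performed by the halfword load itself, so the cast should cost nothing. */
int load_sample(const int16_t *p)
{
    return (int)*p;
}

/* Widen an unsigned 8-bit value to unsigned int. The zero extension is
   expected to be performed by the byte load itself. */
unsigned int load_byte(const uint8_t *p)
{
    return (unsigned int)*p;
}

/* Narrow an int to 16 bits on the way out to memory. The halfword store
   writes only the lowest 16 bits of the register. This sketch assumes the
   value fits in 16 bits. */
void store_sample(int16_t *p, int value)
{
    *p = (int16_t)value;
}
```

In each case the cast sits on the memory access rather than inside an expression, which is the situation the text describes as free.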
The Efficient Use of C Types
■ For local variables held in registers, don’t use a char or short type unless 8-bit or
16-bit modular arithmetic is necessary. Use the signed or unsigned int types instead.
Unsigned types are faster when you use divisions.
■ For array entries and global variables held in main memory, use the type with the
smallest size possible to hold the required data. This saves memory footprint. The
ARMv4 architecture is efficient at loading and storing all data widths provided you
traverse arrays by incrementing the array pointer. Avoid using offsets from the base of
the array with short type arrays, as LDRH does not support the shifted register offset that such indexing needs.
■ Use explicit casts when reading array entries or global variables into local variables, or
writing local variables out to array entries. The casts make it clear that for fast operation
you are taking a narrow width type stored in memory and expanding it to a wider type
in the registers. Switch on implicit narrowing cast warnings in the compiler to detect
implicit casts.
■ Avoid implicit or explicit narrowing casts in expressions because they usually cost extra
cycles. Casts on loads or stores are usually free because the load or store instruction
performs the cast for you.
■ Avoid char and short types for function arguments or return values. Instead use the
int type even if the range of the parameter is smaller. This prevents the compiler
performing unnecessary casts.
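As an illustration of these guidelines taken together, here is a minimal sketch that sums an array of 16-bit values. The names checksum, data, and count are assumptions made for this example. The data stays narrow in memory, the locals and the function interface use int types, the array is traversed by incrementing the pointer, and the widening cast is written explicitly on the load.

```c
#include <stdint.h>

/* Hypothetical example: sum `count` 16-bit entries.
   - Array entries use the smallest type that holds the data (int16_t).
   - The argument, the local accumulator, and the return value are
     int types, so no narrowing casts are needed in the arithmetic.
   - The pointer is incremented rather than indexed from the array base. */
int checksum(const int16_t *data, unsigned int count)
{
    int sum = 0;

    for (; count != 0; count--) {
        sum += (int)*data++;   /* explicit widening cast on the load */
    }
    return sum;
}
```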
The compiler is not being inefficient. It must be careful about the case when
i = -0x80000000 because the two sections of code generate different answers in this case.
For the first piece of code the SUBS instruction compares i with 1 and then decrements i.
Since -0x80000000 < 1, the loop terminates. For the second piece of code, we decrement
i and then compare with 0. Modulo arithmetic means that i now has the value
+0x7fffffff, which is greater than zero. Thus the loop continues for many iterations.
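The arithmetic in this argument can be checked with a small model. The sketch below is an illustration only, with hypothetical function names. It does not overflow a signed integer, which would be undefined behavior in C. Instead it performs the decrement on the unsigned bit pattern and then asks whether the result would be positive when read as a signed value.

```c
#include <stdint.h>

/* Model of "compare i with 1, then decrement": the loop continues
   only when i is greater than 1. */
int continues_compare_then_decrement(int32_t i)
{
    return i > 1;
}

/* Model of "decrement i, then compare with 0", using modulo 2^32
   arithmetic on the bit pattern. The decremented value is positive as
   a signed number when it is nonzero and its top bit is clear. */
int continues_decrement_then_compare(int32_t i)
{
    uint32_t wrapped = (uint32_t)i - 1u;
    return wrapped != 0u && wrapped <= 0x7fffffffu;
}
```

The two models agree for ordinary values of i. For i = -0x80000000 the first reports that the loop terminates, while the second reports that it continues, because the decrement wraps to +0x7fffffff.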
Of course, in practice, i rarely takes the value -0x80000000. The compiler can’t usually
determine this, especially if the loop starts with a variable number of iterations (see
Section 5.3.2).
Therefore you should use the termination condition i!=0 for signed or unsigned loop
counters. It saves one instruction over the condition i>0 for signed i.
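A loop written with this termination condition might look like the following sketch. The names sum_words, data, and n are hypothetical. With an unsigned counter and the condition i != 0, the decrement and the test can typically be combined, though the exact instructions depend on the compiler.

```c
/* Hypothetical example: sum n words using a counter that counts down
   to zero and terminates on i != 0. */
int sum_words(const int *data, unsigned int n)
{
    int sum = 0;
    unsigned int i;

    for (i = n; i != 0; i--) {
        sum += *data++;
    }
    return sum;
}
```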
Writing Loops Efficiently
■ Use loops that count down to zero. Then the compiler does not need to allocate
a register to hold the termination value, and the comparison with zero is free.
■ Use unsigned loop counters by default and the continuation condition i!=0 rather than
i>0. This will ensure that the loop overhead is only two instructions.
■ Use do-while loops rather than for loops when you know the loop will iterate at least
once. This saves the compiler checking to see if the loop count is zero.
■ Unroll important loops to reduce the loop overhead. Do not overunroll. If the loop
overhead is small as a proportion of the total, then unrolling will increase code size and
hurt the performance of the cache.
■ Try to arrange that the number of elements in an array is a multiple of four or eight. You
can then unroll loops easily by two, four, or eight times without worrying about the
leftover array elements.
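The points above can be combined in one loop. The sketch below is an illustration under stated assumptions, not a definitive implementation: the name sum_unrolled is hypothetical, and the caller is assumed to pass a count that is a nonzero multiple of four. That assumption is what allows the do-while form and removes the need to handle leftover elements.

```c
/* Hypothetical example: sum n words with the loop unrolled four times.
   Assumption: n is a nonzero multiple of four, so the body can run
   before the first test and no leftover elements remain. */
int sum_unrolled(const int *data, unsigned int n)
{
    int sum = 0;

    do {
        sum += *data++;
        sum += *data++;
        sum += *data++;
        sum += *data++;
        n -= 4;            /* counts down to zero */
    } while (n != 0);

    return sum;
}
```

Unrolling by four spreads the loop overhead across four additions. Whether further unrolling helps depends on how large the overhead is relative to the loop body.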