Optimizing Code for ARM Cortex-A8 with NEON SIMD
Architecture
-
ARMv7-A Instruction set
The right C programming style for NEON optimization
<box green 80% round center>Following these guidelines will significantly speed up your code; in most cases the result performs as well as hand-crafted assembler. Read the Official ARM NEON Optimization Examples for the details of why this holds.</box>
-
Use the restrict keyword for pointers in function signatures: It tells the compiler that, inside the function, no other pointer refers to the same block of data. This allows the compiler to apply NEON SIMD optimization on its own. The keyword was introduced with C99, but the variant __restrict should be usable with any C standard.
-
Tell the compiler to unroll loops (this is the ARM compiler syntax; GCC 8 and later use #pragma GCC unroll n instead):
#pragma unroll(n)
-
Tell the compiler the natural iteration count: If, for example, you iterate over an array whose size is always a multiple of four, tell the compiler so by writing the loop like this:
for (n = 0; n < (limit/4)*4; n++)
-
Use the natural data types: 8, 16, 32 and 64 bit
-
Do not use array grouping
-
Do not use padding for your data, as the fourth member of this struct does:
struct pixel { char r; char g; char b; char padding; };
The NEON engine can load unaligned data, so padding (e.g. to 32 bit) is not necessary.
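The restrict and iteration-count guidelines above can be combined in one small routine. The following is a minimal sketch, not taken from the ARM examples: the function name add_arrays and its parameters are hypothetical, and whether the loop is actually vectorized depends on the compiler version and optimization flags.

```c
#include <stdint.h>

/* Hypothetical example: element-wise sum of two arrays.
 * restrict promises the compiler that dst, a and b never overlap,
 * so it is free to load and store several elements at once. */
void add_arrays(int32_t *restrict dst,
                const int32_t *restrict a,
                const int32_t *restrict b,
                int limit)
{
    int n;

    /* The caller guarantees that limit is a multiple of four.
     * Writing the bound as (limit / 4) * 4 makes that visible to the
     * compiler, so no scalar loop for leftover elements is needed. */
    for (n = 0; n < (limit / 4) * 4; n++) {
        dst[n] = a[n] + b[n];
    }
}
```

Note that the restrict promise is the caller's responsibility: calling add_arrays with overlapping buffers is undefined behaviour.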
Using ARM NEON instructions
<box red 80% round center>In 90% of the cases the compiler will generate better code than hand-crafted assembler instructions. Use the NEON intrinsics only for well-defined cases.</box>
You can use the special NEON instruction set to speed up your code. Use the right #ifdef statements to make sure the instructions are only compiled in when the code is built for the ARM Cortex-A8 with NEON.
-
Include the NEON header file to use the intrinsic functions:
#include <arm_neon.h>
-
All functions are documented in the ARM NEON instruction set reference
Preprocessor ifdef example
This example shows how to keep NEON code in the same file as code for other processors (e.g. x86). The optimized routines are only used if the '-mfpu=neon -mfloat-abi=softfp' flags are set (these flags enable the '__ARM_NEON__' define).
/// At the beginning of the file:
#ifdef __ARM_NEON__
#include <arm_neon.h>
#endif

/// In the main function space:

/**
 * @brief This function multiplies two floats.
 *
 * This function is optimized for the ARM NEON instruction set. However
 * a standard C fallback version is present as well (e.g. for x86 systems).
 *
 * @param f1 The first float number
 * @param f2 The second float number
 * @return The two floats multiplied
 **/
float optimized_function(float f1, float f2)
{
#ifdef __ARM_NEON__
    /// ARM NEON Code implementation
#else
    /// Standard implementation
    return f1 * f2;
#endif
}
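The NEON branch in the example above is left as a placeholder. As an illustration of what such a branch might contain, here is a minimal sketch that multiplies two float arrays element by element, since NEON only pays off when several values are processed at once. The function name multiply_arrays and its parameters are hypothetical; the intrinsics vld1q_f32, vmulq_f32 and vst1q_f32 come from arm_neon.h.

```c
#include <stddef.h>

#ifdef __ARM_NEON__
#include <arm_neon.h>
#endif

/* Hypothetical example: dst[i] = a[i] * b[i] for count floats. */
void multiply_arrays(float *dst, const float *a, const float *b, size_t count)
{
    size_t i = 0;

#ifdef __ARM_NEON__
    /* NEON path: multiply four floats per iteration. */
    for (; i + 4 <= count; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(dst + i, vmulq_f32(va, vb));  /* multiply and store   */
    }
#endif

    /* Standard C: the whole array when NEON is not available, or the
     * leftover elements (fewer than four) after the NEON loop. */
    for (; i < count; i++) {
        dst[i] = a[i] * b[i];
    }
}
```

Sharing the scalar loop between both builds keeps the fallback and the tail handling in one place, so the function behaves the same on x86 and on the Cortex-A8.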
VFP3 (Standard ARM floating point unit)
Cycles per instruction
Instruction | Single precision cycles | Double precision cycles | Subnormal penalty |
---|---|---|---|
FADD | 9-10 | 9-10 | operand/result |
FSUB | 9-10 | 9-10 | operand/result |
FMUL | 10-12 | 11-17 | operand/result |
FNMUL | 10-12 | 11-17 | operand/result |
FMAC | 18-21 | 19-26 | operand/intermediate/result |
FDIV | 20-37 | 29-65 | operand/result |
FSQRT | 19-33 | 29-60 | operand |
NEON Architecture
Reference
Overview
-
Zero instruction load from L1 cache
-
32 x 64 bit registers
-
Support for unaligned data (no padding to 32 bit chunks necessary)
-
128 / 2x 64 bit integer pipeline (ALU, Shift, MAC)
-
Parallel dual, single-precision (float) floating point pipeline (2x FADD, 2x FMUL, etc.)
-
Double precision floating point is handled by the VFP3 engine (SLOW!)
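Because double-precision arithmetic is handled by the VFP3 engine, it helps to keep a computation in single precision from start to finish. In C this is easy to get wrong: an unsuffixed constant such as 1.0 has type double and silently promotes the whole expression. The sketch below shows the single-precision form; the function name blend is hypothetical.

```c
/* Hypothetical example: linear blend between a and b, with t in [0, 1]. */
float blend(float a, float b, float t)
{
    /* The f suffix keeps the constant, and therefore the whole
     * expression, in single precision. Writing 1.0 instead would turn
     * the subtraction and the first multiplication into double
     * operations. */
    return (1.0f - t) * a + t * b;
}
```

The same applies to the math library: the f-suffixed variants such as sqrtf work on float, whereas sqrt works on double. With GCC, the -Wdouble-promotion warning can help to find accidental promotions.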
NEON 128 bit register
The registers can be processed by the NEON engine as:
-
32-bit single-precision floating-point numbers (there is no double-precision format, which is why double is slow)
-
8-bit, 16-bit, 32-bit, or 64-bit signed or unsigned integers
-
8-bit, 16-bit, 32-bit, or 64-bit bitfields
-
8-bit or 16-bit polynomials with 1-bit coefficients.
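One way to picture these views is as a C union over the same 128 bits. This is only a mental model under the assumption of a plain 16-byte block: the type q_register_model and the helper lanes_per_register are hypothetical, and real code would use the vector types from arm_neon.h (for example uint8x16_t, uint16x8_t, uint32x4_t, uint64x2_t and float32x4_t).

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit NEON register. The real vector
 * types come from <arm_neon.h>; this union only mirrors the lane counts. */
typedef union {
    uint8_t  u8[16];   /* sixteen 8-bit lanes               */
    uint16_t u16[8];   /* eight 16-bit lanes                */
    uint32_t u32[4];   /* four 32-bit lanes                 */
    uint64_t u64[2];   /* two 64-bit lanes                  */
    float    f32[4];   /* four single-precision float lanes */
} q_register_model;

/* Number of lanes that fit into the register for a given element
 * size in bytes. */
static inline int lanes_per_register(int element_bytes)
{
    return (int)sizeof(q_register_model) / element_bytes;
}
```

The smaller the element type, the more values one instruction processes, which is why the guidelines recommend the narrowest natural data type that fits the data.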