ARM Cortex-A8 Overview & Introduction
Architecture Overview
Available Tools
There are a variety of tools:
- Microsoft Visual Studio 2005 + Platform Builder plugin for the IDE debugger ( WinCE application Debug)
- Lauterbach (good for low-level ARM and DSP debug, experienced with previous OMAP/AM products)
- GreenHills's MULTI (good for low-level ARM and DSP debug)
- MontaVista's Devrocket (good for Linux application debug)
If you need a tool that understands Linux, CodeSourcery and MontaVista are the way to go; at present CodeSourcery tool-chain has better support for Cortex-A8 found in OMAP 35x / AM 35x devices.
What is Neon?
According to ARM, the Neon block of the Cortex-A8 core includes both the Neon and VFP accelerators. Neon is a SIMD (Single Instruction Multiple Data) accelerator processor integrated in as part of the ARM Cortex-A8. What does SIMD mean? It means that during the execution of one instruction the same operation will occur on up to 16 data sets in parallel. It is also synonymous with the term vector processor. Since there is parallelism inside the Neon, you can get more MIPS or FLOPS out of Neon than you can a standard SISD processor running at the same clock rate. Many Neon benchmarks are shown as ARM takes N instructions while Neon takes less than N instructions. This shows how much parallelism can be achieved for that benchmark. Reducing instruction count will reduce the number of clocks used to perform the same task. A simple rule of thumb for how fast Neon will speed up a specific loop is to look at the data size of the operation. Since the largest Neon register is 128 bits, if you are performing an operation on 8-bit values you can perform 16 operations simultaneously. On the other end of the spectrum, if you are using 32 bit data, then you can perform 4 operation simultaneously. However remember that there are always other considerations that affect execution speed such as memory throughput and loop overhead. Neon instructions are mainly for numerical, load/store, and some logical operations. Neon operations will be executing in the NEON pipline while other instruction such as branching will occur in the main ARM core pipeline. (See reference to Cortex-A8 Architecture above for a description of the ARM Cortex-A8 and NEON pipelines)
What are the advantages of Neon
- Aligned and unaligned data access allows for efficient vectorization of SIMD operations.
- Support for both integer and floating point operations ensures adaptability to a broad range of applications, from compression decoding to 3D graphics.
- Tight coupling to the ARM core provides a single instruction stream and a unified view of memory, presenting a single development platform target with a simpler tool flow.
- The large Neon register file with its multiple views enables efficient handling of data and minimizes access to memory, enhancing data throughput performance.
How to develop code for Neon
Unfortunately, in most cases you can not simply compile general C code and get a huge speed up using Neon. But if you truly want to utilize the power of Neon there are some basic steps you can follow. You need some basic understanding of what it means to vectorize the code. You need to know how to enable Neon in the Cortex-A8. Also, you need to have L2 cache enabled to get appreciable speed increases.
- Compiler Options - you can direct the compiler to auto-vectorize: The compiler generates Neon code. See the Neon auto vectorization example.
- Neon intrinsics - Compileable macros that give low level access to Neon operations
- Assembly Code - Write your own assembly or link highly optimized libraries.
For Intrinsics, ARM has created a NEON support library called NE10 to make your jump into NEON easier: http://blogs.arm.com/software-enablement/703-ne10-a-new-open-source-library-to-accelerate-your-applications-with-neon
What does a Neon assembly instruction look like
A Neon instruction would look like one of the following:
VMUL.I16 q0,q0,q1
- VMUL - multiply assembly instruction
- .I16 - Indicates this instruction operates on 16 bit integers. This would be a "short int" in C code.
- q0,q1 - Neon registers. A 'q' register is 128 bits wide and will hold 8 short ints.
This Neon instruction would simultaneously multiply the 8 operands in q0 with the 8 operands in q1 and store the 8 results in q0.
How to enable NEON
The NEON/VFP unit comes up disabled on power on reset. To enable Neon requires some co-processor commands. Below is the assembly code required to enable NEON/VFP. It is in the gcc type syntax. ARM code tools use a slightly different syntax.
MRC p15, #0, r1, c1, c0, #2 ; r1 = Access Control Register ORR r1, r1, #(0xf << 20) ; enable full access for p10,11 MCR p15, #0, r1, c1, c0, #2 ; Access Control Register = r1 MOV r1, #0 MCR p15, #0, r1, c7, c5, #4 ; flush prefetch buffer because of FMXR below ; and CP 10 & 11 were only just enabled ; Enable VFP itself MOV r0,#0x40000000 FMXR FPEXC, r0 ; FPEXC = r0
Neon Auto Vectorization Compiler directives and Example
Compiler tools | Autovectorization compiler directives |
Code Composer Studio | "-o3 -mv7a8 --neon -mf " |
CodeSourcery (gcc) | "-march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp" |
Realview | "--cpu=Cortex-A8 -O3 -Otime --vectorize" |
Here is a simple example of autovectorizing a small C function using the Realview compiler
void NeonTest(short int * __restrict a, short int * __restrict b, short int * __restrict z) { int i; for (i = 0; i < 200; i++) { z[i] = a[i] * b[i]; } }
ARM only code
generated by ARM/Thumb C/C++ Compiler, RVCT3.1 [Build 616]
commandline armcc [-c --asm --interleave --cpu=Cortex-A8 itest.c]
loop iterations: 200
000000 e92d4030 PUSH {r4,r5,lr}
000004 e3a03000 MOV r3,#0
|L1.8|
000008 e0804083 ADD r4,r0,r3,LSL #1
00000c e0815083 ADD r5,r1,r3,LSL #1
000010 e1d440b0 LDRH r4,[r4,#0]
000014 e1d550b0 LDRH r5,[r5,#0]
000018 e1640584 SMULBB r4,r4,r5
00001c e0825083 ADD r5,r2,r3,LSL #1
000020 e2833001 ADD r3,r3,#1
000024 e35300c8 CMP r3,#0xc8
000028 e1c540b0 STRH r4,[r5,#0]
00002c bafffff5 BLT |L1.8|
000030 e8bd8030 POP {r4,r5,pc}
ARM + Neon code
generated by ARM/Thumb NEON C/C++ Compiler with Crescent Bay VAST 10.7z8 ARM NEON, RVCT3.1 [Build 616]
commandline armcc [-c --asm --interleave --cpu=Cortex-A8 -O3 -Otime --vectorize itest.c]
loop iterations: 25
Neon instructions
000000 e3a03019 MOV r3,#0x19 |L1.4| 000004 f4200a4d VLD1.16 {d0,d1},[r0]! 000008 e2533001 SUBS r3,r3,#1 00000c f4212a4d VLD1.16 {d2,d3},[r1]! 000010 f2100952 VMUL.I16 q0,q0,q1 000014 f4020a4d VST1.16 {d0,d1},[r2]! 000018 1afffff9 BNE |L1.4| 00001c e12fff1e BX lr
Neon Intrinsics
You may find that using the autovectorizing compiler does not always work well for more complex functions. Intrinsics are a combination of assembly code and C code. They give you direct control over the Neon SIMD functionality similar to coding in assembly. They also give you C level compiler errors to warn you if you are not matching type inputs and output consistently.
Here is a small function in C which adds together the 200 corresponding elements from arrays x and y and stores each result in array z:void NeonTest(int * x, int * y, int * z) { int i; for(i=0;i<200;i++) { z[i] = x[i] + y[i]; } }
Here is the equivalent function using intrinsics:
#include "arm_neon.h" void intrinsics(uint32_t *x, uint32_t *y, uint32_t *z) { int i; uint32x4_t x4,y4; // These 128 bit registers will contain 4 values from the x array and 4 values from the y array uint32x4_t z4; // This 128 bit register will contain the 4 results from the add intrinsic uint32_t *ptra = x; // pointer to the x array data uint32_t *ptrb = y; // pointer to the y array data uint32_t *ptrz = z; // pointer to the z array data for(i=0; i < 200/4; i++) { x4 = vld1q_u32(ptra); // intrinsic to load x4 with 4 values from x y4 = vld1q_u32(ptrb); // intrinsic to load y4 z4=vaddq_u32(x4,y4); // intrinsic to add z4=x4+y4 vst1q_u32(ptrz, z4); // store the 4 results to z ptra+=4; // increment pointers ptrb+=4; ptrz+=4; } }
Here is the output of the intrinsic function compiled with: GCC: (CodeSourcery Sourcery G++ Lite 2007q3-51) 4.2.1
22 0000 323E81E2 add r3, r1, #800 23 .L2: 24 0004 8F4A21F4 vld1.32 {d4-d5}, [r1] 25 0008 101081E2 add r1, r1, #16 26 000c 8F6A20F4 vld1.32 {d6-d7}, [r0] 27 0010 030051E1 cmp r1, r3 28 0014 446826F2 vadd.i32 q3, q3, q2 29 0018 8F6A02F4 vst1.32 {d6-d7}, [r2] 30 001c 100080E2 add r0, r0, #16 31 0020 102082E2 add r2, r2, #16 32 0024 F6FFFF1A bne .L2 33 0028 1EFF2FE1 bx lr
Assembly
Coding in Assembly is a last resort. If autovectorization and intrinsics are not getting the desired results, then hand coding in assembly can be the way to maximize a functions performance. Coding in assembly that will improve on compiled code is a skill that may require a considerable learning curve.
Compiler Comparison
Compiler capabilities | Autovectorization | Intrinsics | Assembly |
Code Composer Studio | Yes | Yes | Yes |
CodeSourcery (gcc) | Yes | Yes | Yes |
Realview | Yes | Yes | Yes |
Note: For Code Composer you need version 4.6.x or greater of the TMS470 Compiler Tools
What is VFP?
VFP is a floating point hardware accelerator. It is not a parallel architecture like Neon. Basically it performs one operation on one set of inputs and returns one output. It's purpose is to speed up floating point calculations. If a processor like ARM does not have floating hardware, then it relies on software math libraries which can prohibitively slow down floating point calculations. The VFP supports both single and double precision floating point calculations compliant with IEEE754. Further, the VFP is not fully pipelined like Neon, so it will not have equivalent performance to Neon.
How to compile and run VFP code
What is the relationship between Neon and VFP?
Neon and VFP share the same large register file inside of the Cortex-A8. These registers are separate from the ARM core registers. The Neon/VFP register file is 256 bytes as shown in the diagram.
The Neon Register file has a dual view:
- 32 - 64 bit registers (The Dx registers)
- 16 - 128 bit registers (The Qx registers)
The VFP Register file also has a dual view:
- 32 - 64 bit registers (The Dx registers)
- 32 - 32 bit registers (The Sx registers - Only 1/2 of the registers may be viewed as 32 bit)
From the Neon point of view: register Q0 may be accessed as either Q0 or D0:D1
From the VFP point of view: register D0 may be accessed as either D0 or S0:S1
There are 2 paths or pipelines through Neon:
- Integer and fixed point (supports 8 bit, 16 bit, 32 bit integers)
- Single precision floating point (supports 32 bit floating point)
VFP has a single path:
- Single or double precision floating point (supports 32 bit and 64 bit floating point)
(Note that Neon does not support double precision floating point operations)
(Note that Neon and VFP both support single precision floating point operations)
Neon and VFP both support floating point, which should I use?
- The VFPv3 is fully compliant with IEEE 754
- Neon is not fully compliant with IEEE 754, so it is mainly targeted for multimedia applications
Here is an example of showing how Neon pipelining will outperform VFP:
Taking the same C function from earlier, but using floating point types instead:
void NeonTest(float * __restrict a, float * __restrict b, float * __restrict z) { int i; for(i=0;i<200;i++) { z[i] = a[i] * b[i]; } }
Compile the above code using CodeSourcery: GCC: (CodeSourcery Sourcery G++ Lite 2007q3-51) 4.2.1
Compile the above function for both Neon and VFP and compare results:
- arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
- arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -ftree-vectorize -mfloat-abi=softfp
Running on OMAP3EVM under Linux with a Cortex-A8 clock speed of 600MHz
VFP/NEON | Time to execute this function 500,000 times |
VFP | 7.36 seconds |
Neon | 0.94 seconds |
Useful documentation
- The TRM is a large document, but contains good information to answer many questions. You can get the TRM on the ARM website: ARM's Main WebsiteHowever you need to know what revision of the Cortex-A8 you have. You can find out how to read that information here: How to Find the Cortex-A8 Revision of your OMAP35x
- There are many useful App Notes by ARM
- Other valuable ARM documents
Useful links
TI Open Source Projects - Cutting edge information
ARM Cortex-A8 Terminology
- SIMD - A processor capable of Single Instruction Multiple Data. For example during one single operation such as an "add", up to 16 sets of data will be added in parallel.
- Superscalar - An architecture which employs instruction level parallelism. The Cortex-A8 has dual in-order instruction issue.
- Vector Processor - Synonymous with SIMD processor