Optimizing Code for ARM Cortex-A8 with NEON SIMD

最新推荐文章于 2022-01-10 16:14:10 发布

yuyin86

最新推荐文章于 2022-01-10 16:14:10 发布

阅读量1.6k

点赞数 1

分类专栏： linux学习视频音频 arm

linux学习同时被 3 个专栏收录

636 篇文章 4 订阅

订阅专栏

视频音频

117 篇文章 4 订阅

订阅专栏

arm

9 篇文章 0 订阅

订阅专栏

本文档提供了一套详细的指南，用于优化 ARM Cortex-A8 处理器上的 NEON SIMD 代码。包括如何采用正确的 C 编程风格以实现优化，使用 ARM NEON 指令的方法，以及 VFP3 浮点单元的周期性能概述。

摘要由CSDN通过智能技术生成

Optimizing Code for ARM Cortex-A8 with NEON SIMD

Optimizing Code for ARM Cortex-A8 with NEON SIMD

Architecture

ARMv7-A Instruction set
ARM Cortex-A8 Technical Reference Manual
ARM Architecture Overview
ARM Assembler Tutorial

The right C programming style for NEON optimization

<box green 80% round center>Obeying these guidelines will significantly speed up your code and will perform equally to hand crafted assembler in most cased. Read the Official ARM NEON Optimization Examples to learn the details why this statement holds.</box>

Use the restrict keyword for pointers in function signatures: It signals to the compiler that no other pointer inside the function points at the same data block. This allows NEON SIMD optimization to be used internally by the compiler. The keyword is specific to C99, but the variant _ _restrict should be usable in any C-standard
Tell the compiler to unroll loops:

#pragma unroll(n)

Tell the compiler the natural iteration steps: If you for example iterate through an array, but you know that the array size will be always a multiple of four, tell the compiler this by using this loop syntax:

for (n = 0; n < (limit/4)*4; n++)

Use the natural data types: 8, 16, 32 and 64 bit
Do not use array grouping
Do not use padding for your data: ~~struct pixel {char r, char g, char b, char padding}~~. The NEON engine can load unaligned data, so padding (e.g. to 32 bit) is not necessary.

Using ARM NEON instructions

<box red 80% round center>In 90% of the cases the compiler will generate better code than hand-crafted assembler instructions. Use the NEON intrinsics only for well-defined cases.</box>

You can use the special NEON instruction set to speed up your code. Just use the right #ifdef statements to make sure the instructions are only used when the code is actually running on the ARM Cortex-A8 NEON.

Include the NEON header file: #include <arm_neon.h> to use these functions
All functions are documented in the ARM NEON instruction set reference

Preprocessor ifdef example

This example allows to put NEON code in the same file as for other processors (e.g. x86). The optimized routines will only be used if the '-fpu=neon -mfloatabi=softfp' flags are set (These flags toggle the 'ARM_NEON' define).

/// At the beginning of the file:
#ifdef __ARM_NEON__
#include <arm_neon.h>
#endif
 
 
/// In the main function space:
 
/**
 * @brief This function multiplies two floats.
 *
 * This function is optimized for the ARM NEON instruction set. However
 * a standard C fallback version is present as well (e.g. for x86 systems).
 *
 * @param f1 The first float number
 * @param f2 The second float number
 * @return The two floats multiplied
 **/
 
float optimized_function(float f1, float f2) {
#ifdef __ARM_NEON__
 
/// ARM NEON Code implementation
 
#else
 
/// Standard implementation
return f1 * f2;
 
#endif
 
}

VFP3 (Standard ARM floating point unit)

Cycles per instruction

Instruction	Single precision cycles	Double precision cycles	Subnormal penalty
FADD	9-10	9-10	operand/result
FSUB	9-10	9-10	operand/result
FMUL	10-12	11-17	operand/result
FNMUL	10-12	11-17	operand/result
FMAC	18-21	19-26	operand/intermediate/result
FDIV	20-37	29-65	operand/result
FSQRT	19-33	29-60	operand

Zero instruction load from L1 cache
32 x 64 bit registers
Support for unaligned data ( No padding to 32 bit chunks necessary)
128 / 2x 64 bit integer pipeline (ALU, Shift, MAC)
Parallel dual, single-precision (float) floating point pipeline (2x FADD, 2x FMUL, etc.)
Double precision floating point is handled by VFP3 engine ( SLOW!)

NEON 128 bit register

The registers can be processed by the NEON engine as:

32-bit single-precision floating-point numbers → Which is why double is slow
8-bit, 16-bit, 32-bit, or 64-bit signed or unsigned integers
8-bit, 16-bit, 32-bit, or 64-bit bitfields
8-bit or 16-bit polynomials with 1-bit coefficients.