Optimizing Code for ARM Cortex-A8 with NEON SIMD

117 篇文章 4 订阅
9 篇文章 0 订阅
本文档提供了一套详细的指南,用于优化 ARM Cortex-A8 处理器上的 NEON SIMD 代码。包括如何采用正确的 C 编程风格以实现优化,使用 ARM NEON 指令的方法,以及 VFP3 浮点单元的周期性能概述。
摘要由CSDN通过智能技术生成

Optimizing Code for ARM Cortex-A8 with NEON SIMD

Architecture

The right C programming style for NEON optimization

<box green 80% round center>Obeying these guidelines will significantly speed up your code and will perform equally to hand crafted assembler in most cased. Read the Official ARM NEON Optimization Examples to learn the details why this statement holds.</box>

  • Use the  restrict keyword for pointers in function signatures: It signals to the compiler that no other pointer inside the function points at the same data block. This allows  NEON SIMD optimization to be used internally by the compiler. The keyword is specific to C99, but the variant _ _restrict should be usable in any C-standard
  • Tell the compiler to unroll loops:
#pragma unroll(n)
  • Tell the compiler the natural iteration steps: If you for example iterate through an array, but you know that the array size will be always a multiple of four, tell the compiler this by using this loop syntax:
for (n = 0; n < (limit/4)*4; n++)
  • Use the natural data types: 8, 16, 32 and 64 bit
  • Do not use array grouping
  • Do not use padding for your data:  struct pixel {char r, char g, char b, char padding}. The  NEON engine can load unaligned data, so padding (e.g. to 32 bit) is not necessary.

Using ARM NEON instructions

<box red 80% round center>In 90% of the cases the compiler will generate better code than hand-crafted assembler instructions. Use the NEON intrinsics only for well-defined cases.</box>

You can use the special NEON instruction set to speed up your code. Just use the right #ifdef statements to make sure the instructions are only used when the code is actually running on the ARM Cortex-A8 NEON.

  1. Include the  NEON header file:  #include <arm_neon.h> to use these functions
  2. All functions are documented in the  ARM NEON instruction set reference
Preprocessor ifdef example

This example allows to put NEON code in the same file as for other processors (e.g. x86). The optimized routines will only be used if the '-fpu=neon -mfloatabi=softfp' flags are set (These flags toggle the 'ARM_NEON' define).

/// At the beginning of the file:
#ifdef __ARM_NEON__
#include <arm_neon.h>
#endif
 
 
/// In the main function space:
 
/**
 * @brief This function multiplies two floats.
 *
 * This function is optimized for the ARM NEON instruction set. However
 * a standard C fallback version is present as well (e.g. for x86 systems).
 *
 * @param f1 The first float number
 * @param f2 The second float number
 * @return The two floats multiplied
 **/
 
float optimized_function(float f1, float f2) {
#ifdef __ARM_NEON__
 
/// ARM NEON Code implementation
 
#else
 
/// Standard implementation
return f1 * f2;
 
#endif
 
}

VFP3 (Standard ARM floating point unit)

Cycles per instruction
Instruction Single precision cycles Double precision cycles Subnormal penalty
FADD 9-10 9-10 operand/result
FSUB 9-10 9-10 operand/result
FMUL 10-12 11-17 operand/result
FNMUL 10-12 11-17 operand/result
FMAC 18-21 19-26 operand/intermediate/result
FDIV 20-37 29-65 operand/result
FSQRT 19-33 29-60 operand

NEON Architecture

Reference

Overview

  • Zero instruction load from L1 cache
  • 32 x 64 bit registers
  • Support for unaligned data ( No padding to 32 bit chunks necessary)
  • 128 / 2x 64 bit integer pipeline (ALU, Shift, MAC)
  • Parallel dual, single-precision (float) floating point pipeline (2x FADD, 2x FMUL, etc.)
  • Double precision floating point is handled by VFP3 engine ( SLOW!)
NEON 128 bit register

The registers can be processed by the NEON engine as:

  • 32-bit single-precision floating-point numbers →  Which is why double is slow
  • 8-bit, 16-bit, 32-bit, or 64-bit signed or unsigned integers
  • 8-bit, 16-bit, 32-bit, or 64-bit bitfields
  • 8-bit or 16-bit polynomials with 1-bit coefficients.

ARM Cortex-A8 to NEON pipeline

NEON pipeline

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值