1. Introduction to ARM NEON
ARM NEON is Arm's SIMD (single instruction, multiple data) extension. At its core are two kinds of data registers: 64-bit D registers and 128-bit Q registers.
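2. Data types
- ARM NEON has two kinds of data types, vectors and arrays of vectors. Vector types are named as follows (array types add an `x<length_of_array>` part before `_t`):

`<type><size>x<number_of_lanes>_t`

Data type | Meaning |
---|---|
uint8x8_t | Vector type holding 8 uint8 lanes |
uint8x8x2_t | Array type holding 2 vectors, each with 8 uint8 lanes |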

The vector types available for each register width:

64-bit type (D-register) | 128-bit type (Q-register) |
---|---|
int8x8_t | int8x16_t |
int16x4_t | int16x8_t |
int32x2_t | int32x4_t |
int64x1_t | int64x2_t |
uint8x8_t | uint8x16_t |
uint16x4_t | uint16x8_t |
uint32x2_t | uint32x4_t |
uint64x1_t | uint64x2_t |
float16x4_t | float16x8_t |
float32x2_t | float32x4_t |
poly8x8_t | poly8x16_t |
poly16x4_t | poly16x8_t |
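
3. NEON intrinsic functions
Intrinsics are named as follows:

`<opname><flags>_<type>`

Intrinsic | Meaning |
---|---|
uint8x8_t vadd_u8(uint8x8_t a, uint8x8_t b) | Lane-wise addition of two 64-bit D-register vectors |
uint8x16_t vaddq_u8(uint8x16_t a, uint8x16_t b) | Lane-wise addition of two 128-bit Q-register vectors |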

- Typical workflow in C code
  - step 1: define the NEON vectors
  - step 2: load the data
  - step 3: process the data
  - step 4: write the results back
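
The four steps above can be sketched as follows. This is a minimal illustration rather than code from the original post: the helper name `add_u8_arrays` is hypothetical, and the scalar loop is included only so the snippet also builds on targets without NEON (it doubles as the tail handler on NEON targets).

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: out[i] = a[i] + b[i] (wrapping modulo 256) for n bytes.
void add_u8_arrays(const uint8_t* a, const uint8_t* b, uint8_t* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 16 <= n; i += 16) {
        // step 1 + step 2: define the vectors and load 16 bytes from each input
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        // step 3: process, 16 lane-wise additions at once
        uint8x16_t vsum = vaddq_u8(va, vb);
        // step 4: write the result back to memory
        vst1q_u8(out + i, vsum);
    }
#endif
    // Scalar loop: handles the tail, or everything when NEON is unavailable.
    for (; i < n; ++i) {
        out[i] = static_cast<uint8_t>(a[i] + b[i]);
    }
}
```

Processing 16 bytes per iteration matches the lane count of `uint8x16_t`; any length that is not a multiple of 16 falls through to the scalar loop.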
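
4. Load functions

Function | Meaning |
---|---|
Result_t vldN_type(const Scalar_t* ptr, …) | Loads N vectors (N = 1 to 4) from memory at ptr into 64-bit D registers; for N > 1 the elements are de-interleaved with stride N |
Result_t vldNq_type(const Scalar_t* ptr, …) | Same as above, using 128-bit Q registers |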

For the D-register (64-bit) forms:

Result_t | type | Scalar_t |
---|---|---|
int8x8_t | s8 | int8_t |
int16x4_t | s16 | int16_t |
int32x2_t | s32 | int32_t |
int64x1_t | s64 | int64_t |
uint8x8_t | u8 | uint8_t |
uint16x4_t | u16 | uint16_t |
uint32x2_t | u32 | uint32_t |
uint64x1_t | u64 | uint64_t |
float16x4_t | f16 | float16_t |
float32x2_t | f32 | float32_t |
poly8x8_t | p8 | poly8_t |
poly16x4_t | p16 | poly16_t |

For the Q-register (128-bit) forms:

Result_t | type | Scalar_t |
---|---|---|
int8x16_t | s8 | int8_t |
int16x8_t | s16 | int16_t |
int32x4_t | s32 | int32_t |
int64x2_t | s64 | int64_t |
uint8x16_t | u8 | uint8_t |
uint16x8_t | u16 | uint16_t |
uint32x4_t | u32 | uint32_t |
uint64x2_t | u64 | uint64_t |
float16x8_t | f16 | float16_t |
float32x4_t | f32 | float32_t |
poly8x16_t | p8 | poly8_t |
poly16x8_t | p16 | poly16_t |
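
To make the role of N concrete: `vld1` loads consecutive elements into a single vector, while `vld2`, `vld3` and `vld4` de-interleave memory into N vectors. The sketch below uses `vld3_u8` to split 8 packed RGB pixels into separate channel planes. The helper name `split_rgb8` is hypothetical, and the scalar branch is an assumption added only so the snippet builds without NEON.

```cpp
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: splits 8 packed RGB pixels (24 bytes) into three
// 8-byte planes r, g and b.
void split_rgb8(const uint8_t* rgb, uint8_t* r, uint8_t* g, uint8_t* b) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // vld3_u8 reads 24 bytes and de-interleaves them with stride 3:
    // val[0] gets bytes 0, 3, 6, ...; val[1] gets 1, 4, 7, ...; val[2] gets 2, 5, 8, ...
    uint8x8x3_t px = vld3_u8(rgb);
    vst1_u8(r, px.val[0]);
    vst1_u8(g, px.val[1]);
    vst1_u8(b, px.val[2]);
#else
    // Portable fallback with the same memory layout.
    for (int i = 0; i < 8; ++i) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
#endif
}
```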
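
5. Store functions

Function | Meaning |
---|---|
void vstN_type(Scalar_t* ptr, Vector_t M) | Stores the N D-register vectors held in M to memory at ptr, interleaving them with stride N |
void vstNq_type(Scalar_t* ptr, Vector_t M) | Stores the N Q-register vectors held in M to memory at ptr, interleaving them with stride N |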
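
As an illustration of the stride-N store, the sketch below interleaves two 8-byte arrays with `vst2_u8`. The helper name `interleave2_u8` is hypothetical, and the scalar branch is an assumption added so the snippet also builds on targets without NEON.

```cpp
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: writes a0 b0 a1 b1 ... a7 b7 (16 bytes) to out.
void interleave2_u8(const uint8_t* a, const uint8_t* b, uint8_t* out) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8x2_t pair;
    pair.val[0] = vld1_u8(a);  // 8 consecutive bytes from a
    pair.val[1] = vld1_u8(b);  // 8 consecutive bytes from b
    // vst2_u8 stores both vectors interleaved with stride 2.
    vst2_u8(out, pair);
#else
    // Portable fallback producing the same layout.
    for (int i = 0; i < 8; ++i) {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
#endif
}
```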

6. Hands-on example
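The function below converts an RGBA frame buffer into a BGR cv::Mat, 16 pixels per iteration:

    #include <arm_neon.h>
    #include <opencv2/core.hpp>

    // frameBufferHeight, frameBufferWidth, channelNum (4 for RGBA) and pixelNum_
    // are assumed to be defined elsewhere, e.g. as class members.
    void GetBGRImageFromGPUNeon(uint8_t *renderPtr, cv::Mat &img) {
        if (!img.data)
            img.create(frameBufferHeight, frameBufferWidth, CV_8UC3);
        uint8_t *data = static_cast<uint8_t *>(img.data);
        const int stridePixels = 16;                          // pixels per iteration
        const int srcStrideByte = stridePixels * channelNum;  // RGBA source bytes
        const int destStrideByte = stridePixels * 3;          // BGR destination bytes
        int remainderPixels = pixelNum_ % stridePixels;
        int dividePixels = pixelNum_ - remainderPixels;
        uint8x16x4_t rgba;
        uint8x16x3_t bgr;
        // Vector path: de-interleave 16 RGBA pixels, swap R and B, drop A.
        for (int i = 0; i < dividePixels; i += stridePixels) {
            rgba = vld4q_u8(renderPtr);
            bgr.val[0] = rgba.val[2];  // B
            bgr.val[1] = rgba.val[1];  // G
            bgr.val[2] = rgba.val[0];  // R
            vst3q_u8(data, bgr);
            data += destStrideByte;
            renderPtr += srcStrideByte;
        }
        // Scalar tail for the pixels that do not fill a whole 16-pixel block.
        for (int i = dividePixels; i < pixelNum_; ++i) {
            *data++ = *(renderPtr + 2);  // B
            *data++ = *(renderPtr + 1);  // G
            *data++ = *(renderPtr + 0);  // R
            renderPtr += 4;              // skip the alpha byte
        }
    }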