Resize 8-bit image by 2 with ARM NEON


I have an 8-bit 640x480 image that I would like to shrink to a 320x240 image:

void reducebytwo(uint8_t *dst, uint8_t *src)
//src is 640x480, dst is 320x240

What would be the best way to do that using ARM SIMD NEON? Any sample code somewhere?

As a starting point, I simply would like to do the equivalent of:

for (int h = 0; h < 240; h++)
    for (int w = 0; w < 320; w++)
        dst[h * 320 + w] = (src[640 * h * 2 + w * 2] + src[640 * h * 2 + w * 2 + 1] + src[640 * h * 2 + 640 + w * 2] + src[640 * h * 2 + 640 + w * 2 + 1]) / 4; 
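For checking any SIMD version against known-good output, a scalar reference that takes the dimensions as parameters can be handy. The following is only a minimal sketch and not part of the original question: the name reduce_by_two_ref and its parameters are made up for illustration, and it assumes a tightly packed grayscale buffer with even width and height.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: averages each 2x2 block of src into one dst pixel.
 * Assumes src is src_w x src_h, tightly packed, with even dimensions. */
static void reduce_by_two_ref(uint8_t *dst, const uint8_t *src,
                              size_t src_w, size_t src_h)
{
    assert(src_w % 2 == 0 && src_h % 2 == 0);
    const size_t dst_w = src_w / 2;
    const size_t dst_h = src_h / 2;

    for (size_t y = 0; y < dst_h; y++) {
        const uint8_t *row0 = src + (2 * y) * src_w;  /* upper source row */
        const uint8_t *row1 = row0 + src_w;           /* lower source row */
        for (size_t x = 0; x < dst_w; x++) {
            /* The sum is at most 4 * 255 = 1020, so it cannot overflow. */
            unsigned sum = row0[2 * x] + row0[2 * x + 1]
                         + row1[2 * x] + row1[2 * x + 1];
            dst[y * dst_w + x] = (uint8_t)(sum / 4);  /* truncating divide */
        }
    }
}
```

Calling reduce_by_two_ref(dst, src, 640, 480) would then correspond to the loop above.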

image image-processing arm simd neon 
asked Jul 23 '13 at 16:34 by gregoiregentil, edited Jul 24 '13 at 1:27

Comments:

"Best" needs to be defined. Fastest, highest quality, minimum size, etc.? For highest quality, there are different tradeoffs in image reduction. Preserving low-frequency content is important in some cases and high-frequency content in others. What is 8-bit? Grayscale, colour-mapped, or something else? – artless noise Jul 23 '13 at 17:20

It's a grayscale input. Best = fastest. – gregoiregentil Jul 23 '13 at 18:08

3 Answers

Here is the asm version of the resize_line function that @Nils Pipenbrinck suggested:

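static void reduce2_neon_line(uint8_t* __restrict src1, uint8_t* __restrict src2, uint8_t* __restrict dest, int width) {
    // width is the source line width in pixels; it must be a multiple of 16
    for(int i=0; i<width; i+=16) {
        asm volatile (
             "pld [%[line1], #0xc00]     \n"   // prefetch upcoming source data
             "pld [%[line2], #0xc00]     \n"
             "vldm %[line1]!, {d0,d1}  \n"     // load 16 pixels of the upper line
             "vldm %[line2]!, {d2,d3}  \n"     // load 16 pixels of the lower line
             "vpaddl.u8 q0, q0         \n"     // add horizontal neighbours, widen to 16 bit
             "vpaddl.u8 q1, q1         \n"
             "vadd.u16  q0, q1         \n"     // add upper and lower sums
             "vshrn.u16 d0, q0, #2     \n"     // divide by 4 and narrow back to 8 bit

             "vst1.8 {d0}, [%[dst]]! \n"       // store 8 output pixels

             // the asm post-increments all three pointers, so they must be
             // read-write ("+r") operands rather than plain inputs
             : [line1] "+r"(src1), [line2] "+r"(src2), [dst] "+r"(dest)
             :
             : "q0", "q1", "memory"
             );
    }
}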

It is about 4 times faster than the C version (tested on an iPhone 5).


answered Oct 8 '13 at 12:50 by Max

This is a one-to-one translation of your code to ARM NEON intrinsics:

#include <arm_neon.h>
#include <stdint.h>

static void resize_line (uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict dest)
{
  int i;
  for (i=0; i<640; i+=16)
  {
    // load upper line and add neighbor pixels:
    uint16x8_t a = vpaddlq_u8 (vld1q_u8 (src1));

    // load lower line and add neighbor pixels:
    uint16x8_t b = vpaddlq_u8 (vld1q_u8 (src2));

    // sum of upper and lower line: 
    uint16x8_t c = vaddq_u16 (a,b);

    // divide by 4, convert to char and store:
    vst1_u8 (dest, vshrn_n_u16 (c, 2));

    // move pointers to next chunk of data
    src1+=16;
    src2+=16;
    dest+=8;
   }
}   

void resize_image (uint8_t * src, uint8_t * dest)
{
  int h;    
  for (h = 0; h < 240; h++)  // all 240 output rows; source rows 2h and 2h+1 stay within 0..479
  {
    resize_line (src+640*(h*2+0), 
                 src+640*(h*2+1), 
                 dest+320*h);
  }
}

It processes 32 source-pixels and generates 8 output pixels per iteration.

I had a quick look at the assembler output and it looks okay. You can get better performance if you write the resize_line function in assembler, unroll the loop and eliminate pipeline stalls. By my estimate, that would give you roughly a threefold performance boost.

Even without hand-written assembler, it should be a lot faster than your implementation.

Note: I haven't tested the code...
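The function above hard-codes a 640-pixel line, which happens to be a multiple of 16. For other widths, one possible approach is a vector body followed by a scalar tail. The sketch below is an illustration only: the name resize_line_any_width and the src_w parameter are hypothetical, and the NEON path is guarded by the compiler's __ARM_NEON macro so that the same file falls back to plain C elsewhere. It is meant to produce the same truncated average as the original loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: reduces two source rows of src_w pixels (src_w even)
 * into one destination row of src_w / 2 pixels. */
static void resize_line_any_width(const uint8_t *src1, const uint8_t *src2,
                                  uint8_t *dest, size_t src_w)
{
    assert(src_w % 2 == 0);
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Vector body: 16 pixels from each row per iteration, 8 outputs. */
    for (; i + 16 <= src_w; i += 16) {
        uint16x8_t a = vpaddlq_u8(vld1q_u8(src1 + i));
        uint16x8_t b = vpaddlq_u8(vld1q_u8(src2 + i));
        vst1_u8(dest + i / 2, vshrn_n_u16(vaddq_u16(a, b), 2));
    }
#endif

    /* Scalar tail (or the whole row when NEON is not available). */
    for (; i + 2 <= src_w; i += 2) {
        unsigned sum = src1[i] + src1[i + 1] + src2[i] + src2[i + 1];
        dest[i / 2] = (uint8_t)(sum / 4);
    }
}
```

The tail uses the same arithmetic as the vector body (sum of four pixels, shifted right by two), so both paths should agree pixel for pixel.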


answered Jul 24 '13 at 5:56 by Nils Pipenbrinck, edited Oct 8 '13 at 15:28 by Max

Comments:

Great! Do you think that having the whole resize_image function in assembler would be much faster, or do you think that with your suggestion I already have 90% of the time saving? – gregoiregentil Jul 24 '13 at 17:40

It would be faster.. no doubt about that. – Nils Pipenbrinck Jul 24 '13 at 19:43

If you're not too concerned with precision then this inner loop should give you twice the compute throughput compared to the more accurate algorithm:

for (i=0; i<640; i+= 32)
{
    uint8x16x2_t a, b;
    uint8x16_t c, d;

    /* load upper row, splitting even and odd pixels into a.val[0]
     * and a.val[1] respectively. */
    a = vld2q_u8(src1);

    /* as above, but for lower row */
    b = vld2q_u8(src2);

    /* compute average of even and odd pixel pairs for upper row */
    c = vrhaddq_u8(a.val[0], a.val[1]);
    /* compute average of even and odd pixel pairs for lower row */
    d = vrhaddq_u8(b.val[0], b.val[1]);

    /* compute average of upper and lower rows, and store result */
    vst1q_u8(dest, vrhaddq_u8(c, d));

    src1+=32;
    src2+=32;
    dest+=16;
}

It works by using the vrhadd (rounding halving add) operation, which has a result the same size as the input. This way you don't have to shift the final sum back down to 8-bit, and all of the arithmetic throughout is eight-bit, which means you can perform twice as many operations per instruction.

However, it is less accurate, because each intermediate average is quantised to 8 bits before the next one is taken. Also, GCC 4.7 does a terrible job of generating code for it; GCC 4.8 does just fine.
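To make the accuracy trade-off concrete, here is a small scalar model of the two approaches. It is a sketch for illustration only; all function names are made up, and rhadd_u8 models a single lane of vrhadd.u8 as (x + y + 1) >> 1. In this model the cascaded result is either equal to the truncated exact average or one grey level higher.

```c
#include <assert.h>
#include <stdint.h>

/* Exact 2x2 average with truncation, as in the widening (vpaddl) version. */
static uint8_t avg4_exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint8_t)((a + b + c + d) / 4);
}

/* Scalar model of one lane of vrhadd.u8: rounding halving add. */
static uint8_t rhadd_u8(uint8_t x, uint8_t y)
{
    return (uint8_t)((x + y + 1) >> 1);
}

/* Cascaded version: average each row's pair, then average the two results. */
static uint8_t avg4_cascaded(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return rhadd_u8(rhadd_u8(a, b), rhadd_u8(c, d));
}

/* Largest amount by which the cascaded result exceeds the exact one,
 * over all four-pixel inputs with values in [0, limit). */
static int max_cascade_error(int limit)
{
    assert(limit >= 1 && limit <= 256);
    int worst = 0;
    for (int a = 0; a < limit; a++)
        for (int b = 0; b < limit; b++)
            for (int c = 0; c < limit; c++)
                for (int d = 0; d < limit; d++) {
                    int diff = avg4_cascaded((uint8_t)a, (uint8_t)b,
                                             (uint8_t)c, (uint8_t)d)
                             - avg4_exact((uint8_t)a, (uint8_t)b,
                                          (uint8_t)c, (uint8_t)d);
                    if (diff > worst)
                        worst = diff;
                }
    return worst;
}
```

For example, the pixels 0, 0, 0, 1 average to 0 with the exact method but to 1 with the cascaded one, because the rounding is applied twice.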

The whole operation has a good chance of being I/O bound, though. The loop should be unrolled to maximise separation between loads and arithmetic, and __builtin_prefetch() (or PLD) should be used to hoist the incoming data into caches before it's needed.
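As a rough illustration of the prefetch suggestion, the sketch below adds __builtin_prefetch() hints to the halving-add loop. It assumes GCC or Clang; the name resize_line_prefetch, the src_w parameter and the PREFETCH_AHEAD distance of 256 bytes are all assumptions rather than measured values, and unrolling is left out to keep it short. A scalar loop with the same rounding handles the tail and any build without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Assumed prefetch distance in bytes; tune by measurement on the target. */
#define PREFETCH_AHEAD 256

/* Hypothetical helper: approximate 2x2 reduction of two rows of src_w pixels
 * using rounding halving adds, with software prefetch of upcoming data. */
static void resize_line_prefetch(const uint8_t *src1, const uint8_t *src2,
                                 uint8_t *dest, size_t src_w)
{
    assert(src_w % 2 == 0);
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 32 <= src_w; i += 32) {
        /* Hint only; stay inside the rows so the pointer arithmetic is valid. */
        if (i + PREFETCH_AHEAD < src_w) {
            __builtin_prefetch(src1 + i + PREFETCH_AHEAD);
            __builtin_prefetch(src2 + i + PREFETCH_AHEAD);
        }
        uint8x16x2_t a = vld2q_u8(src1 + i);  /* even / odd pixels, upper row */
        uint8x16x2_t b = vld2q_u8(src2 + i);  /* even / odd pixels, lower row */
        uint8x16_t c = vrhaddq_u8(a.val[0], a.val[1]);
        uint8x16_t d = vrhaddq_u8(b.val[0], b.val[1]);
        vst1q_u8(dest + i / 2, vrhaddq_u8(c, d));
    }
#endif

    /* Scalar tail with the same rounding, so results match the vector path. */
    for (; i + 2 <= src_w; i += 2) {
        unsigned c = (src1[i] + src1[i + 1] + 1u) >> 1;
        unsigned d = (src2[i] + src2[i + 1] + 1u) >> 1;
        dest[i / 2] = (uint8_t)((c + d + 1u) >> 1);
    }
}
```

Whether the hints help at all depends on the core and its hardware prefetcher, so this is something to benchmark rather than assume.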


answered Aug 7 '13 at 12:09 by sh1
