Solved: Resize 8-bit image by 2 with ARM NEON

Latest recommended article published 2021-10-11 17:31:24

approlaro

Article link: https://blog.csdn.net/soralaro/article/details/98975662

I have an 8-bit 640x480 image that I would like to shrink to a 320x240 image:

void reducebytwo(uint8_t *dst, uint8_t *src)
//src is 640x480, dst is 320x240

What would be the best way to do that using ARM SIMD NEON? Any sample code somewhere?

As a starting point, I simply would like to do the equivalent of:

for (int h = 0; h < 240; h++)
    for (int w = 0; w < 320; w++)
        dst[h * 320 + w] = (src[640 * h * 2 + w * 2] + src[640 * h * 2 + w * 2 + 1] + src[640 * h * 2 + 640 + w * 2] + src[640 * h * 2 + 640 + w * 2 + 1]) / 4;
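
For checking any SIMD version against a known-good result, the loop above can be wrapped in a parameterised scalar function. This is only a sketch: the name reducebytwo_ref and the src_w/src_h parameters are illustrative additions, and it assumes tightly packed rows with even dimensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: average each 2x2 block of an 8-bit grayscale image.
 * src is src_w x src_h (both assumed even); dst is (src_w/2) x (src_h/2).
 * The function name and the size parameters are hypothetical. */
void reducebytwo_ref(uint8_t *dst, const uint8_t *src, size_t src_w, size_t src_h)
{
    assert(src_w % 2 == 0 && src_h % 2 == 0);
    const size_t dst_w = src_w / 2;
    for (size_t y = 0; y < src_h / 2; y++) {
        const uint8_t *row0 = src + (2 * y) * src_w;   /* upper source row */
        const uint8_t *row1 = row0 + src_w;            /* lower source row */
        for (size_t x = 0; x < dst_w; x++) {
            /* Operands promote to int, so the sum (at most 1020) cannot overflow. */
            unsigned sum = (unsigned)(row0[2 * x] + row0[2 * x + 1]
                                    + row1[2 * x] + row1[2 * x + 1]);
            dst[y * dst_w + x] = (uint8_t)(sum / 4);   /* truncating division */
        }
    }
}
```

Calling reducebytwo_ref(dst, src, 640, 480) should reproduce the fixed-size loop in the question.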

Tags: image, image-processing, arm, simd, neon

(question asked Jul 23 '13 at 16:34 by gregoiregentil, edited Jul 24 '13 at 1:27)

Comments:
"Best" needs to be defined: fastest, highest quality, minimum size, etc.? For highest quality, there are different tradeoffs in image reduction. Preserving low-frequency content is important in some cases and high-frequency content in others. What is 8-bit? Grayscale, colour-mapped, or something else? – artless noise Jul 23 '13 at 17:20
It's a grey scale input. Best = fastest. – gregoiregentil Jul 23 '13 at 18:08

3 Answers
Solution

Here is an inline-asm version of the line-reduction function (resize_line) that @Nils Pipenbrinck suggested:

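static void reduce2_neon_line(uint8_t* __restrict src1, uint8_t* __restrict src2, uint8_t* __restrict dest, int width) {
    // width is the source row width in pixels, assumed to be a multiple of 16
    for(int i=0; i<width; i+=16) {
        // volatile: the asm is executed for its memory side effects
        asm volatile (
             "pld [%[line1], #0xc00]     \n"   // prefetch 3 KB ahead in each row
             "pld [%[line2], #0xc00]     \n"
             "vldm %[line1]!, {d0,d1}  \n"     // q0 = 16 pixels of the upper row
             "vldm %[line2]!, {d2,d3}  \n"     // q1 = 16 pixels of the lower row
             "vpaddl.u8 q0, q0         \n"     // pairwise add to 8 x 16-bit sums
             "vpaddl.u8 q1, q1         \n"
             "vadd.u16  q0, q1         \n"     // add the two rows
             "vshrn.u16 d0, q0, #2     \n"     // divide by 4, narrow to 8 bits

             "vst1.8 {d0}, [%[dst]]! \n"       // store 8 output pixels

             // the pointers are post-incremented by the asm, so they must be
             // read-write ("+r") operands rather than plain inputs
             : [line1] "+r"(src1), [line2] "+r"(src2), [dst] "+r"(dest)
             :
             : "q0", "q1", "memory"
             );
    }
}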

It is about 4 times faster than the C version (tested on iPhone 5).

(answered Oct 8 '13 at 12:50 by Max)

This is a one-to-one translation of your code to ARM NEON intrinsics:

#include <arm_neon.h>
#include <stdint.h>

static void resize_line (uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict dest)
{
  int i;
  for (i=0; i<640; i+=16)
  {
    // load upper line and add neighbor pixels:
    uint16x8_t a = vpaddlq_u8 (vld1q_u8 (src1));

    // load lower line and add neighbor pixels:
    uint16x8_t b = vpaddlq_u8 (vld1q_u8 (src2));

    // sum of upper and lower line: 
    uint16x8_t c = vaddq_u16 (a,b);

    // divide by 4, convert to char and store:
    vst1_u8 (dest, vshrn_n_u16 (c, 2));

    // move pointers to next chunk of data
    src1+=16;
    src2+=16;
    dest+=8;
  }
}

void resize_image (uint8_t * src, uint8_t * dest)
{
  int h;
  // one output row per pair of source rows: all 240 rows, as in the original loop
  for (h = 0; h < 240; h++)
  {
    resize_line (src+640*(h*2+0),
                 src+640*(h*2+1),
                 dest+320*h);
  }
}

It processes 32 source pixels (16 from each of the two rows) and generates 8 output pixels per iteration.
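
To make the lane arithmetic concrete, here is a plain-C model of one such iteration. It is a sketch for reasoning about the intrinsics, not NEON code: model_resize_chunk is a made-up name, and each commented step mirrors what the corresponding intrinsic computes per lane.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one resize_line iteration: 16 pixels from each of two
 * rows in, 8 pixels out. The function name is illustrative only. */
void model_resize_chunk(const uint8_t upper[16], const uint8_t lower[16], uint8_t out[8])
{
    for (int lane = 0; lane < 8; lane++) {
        /* vpaddlq_u8: add adjacent 8-bit pairs into a 16-bit lane (max 510) */
        uint16_t a = (uint16_t)(upper[2 * lane] + upper[2 * lane + 1]);
        uint16_t b = (uint16_t)(lower[2 * lane] + lower[2 * lane + 1]);
        /* vaddq_u16: sum the two rows (max 1020, still fits in 16 bits) */
        uint16_t c = (uint16_t)(a + b);
        assert(c <= 1020);
        /* vshrn_n_u16(c, 2): shift right by 2 and narrow back to 8 bits */
        out[lane] = (uint8_t)(c >> 2);
    }
}
```

Because the sums are widened to 16 bits before the shift, the result matches the truncating division by 4 in the question's scalar loop.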

I took a quick look at the assembler output and it looks okay. You can get better performance if you write the resize_line function in assembler, unroll the loop and eliminate pipeline stalls. I'd estimate that would give you roughly a factor of three performance boost.

It should be a lot faster than your implementation without assembler changes though.

Note: I haven't tested the code...
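
Since the code above is untested and hard-codes 640x480, a more defensive variant might take the dimensions as parameters and finish each row with a scalar tail. The sketch below makes several assumptions: resize_image_any is a hypothetical name, rows are tightly packed, both dimensions are even, and the NEON path is only compiled when the compiler defines __ARM_NEON (otherwise the scalar loop handles everything).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical generalisation of resize_image: src is src_w x src_h,
 * dest is (src_w/2) x (src_h/2), rows tightly packed, dimensions even. */
void resize_image_any(const uint8_t *src, uint8_t *dest, size_t src_w, size_t src_h)
{
    assert(src_w % 2 == 0 && src_h % 2 == 0);
    for (size_t y = 0; y < src_h / 2; y++) {
        const uint8_t *src1 = src + (2 * y) * src_w;  /* upper source row */
        const uint8_t *src2 = src1 + src_w;           /* lower source row */
        uint8_t *out = dest + y * (src_w / 2);
        size_t x = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Vector body: same arithmetic as resize_line, 16 source columns per step. */
        for (; x + 16 <= src_w; x += 16) {
            uint16x8_t a = vpaddlq_u8(vld1q_u8(src1 + x));
            uint16x8_t b = vpaddlq_u8(vld1q_u8(src2 + x));
            vst1_u8(out + x / 2, vshrn_n_u16(vaddq_u16(a, b), 2));
        }
#endif
        /* Scalar tail (or the whole row when NEON is unavailable). */
        for (; x < src_w; x += 2)
            out[x / 2] = (uint8_t)((src1[x] + src1[x + 1] + src2[x] + src2[x + 1]) / 4);
    }
}
```

With src_w = 640 and src_h = 480 this should do the same work as resize_image; the tail loop only matters for widths that are not a multiple of 16.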

(answered Jul 24 '13 at 5:56 by Nils Pipenbrinck, edited Oct 8 '13 at 15:28 by Max)

Comments:
Great! Do you think that having the whole resize_image function in assembler would be much faster, or do you think that with your suggestion I already have 90% of the time savings? – gregoiregentil Jul 24 '13 at 17:40
It would be faster, no doubt about that. – Nils Pipenbrinck Jul 24 '13 at 19:43

If you're not too concerned with precision then this inner loop should give you twice the compute throughput compared to the more accurate algorithm:

for (i=0; i<640; i+= 32)
{
    uint8x16x2_t a, b;
    uint8x16_t c, d;

    /* load upper row, splitting even and odd pixels into a.val[0]
     * and a.val[1] respectively. */
    a = vld2q_u8(src1);

    /* as above, but for lower row */
    b = vld2q_u8(src2);

    /* compute average of even and odd pixel pairs for upper row */
    c = vrhaddq_u8(a.val[0], a.val[1]);
    /* compute average of even and odd pixel pairs for lower row */
    d = vrhaddq_u8(b.val[0], b.val[1]);

    /* compute average of upper and lower rows, and store result */
    vst1q_u8(dest, vrhaddq_u8(c, d));

    src1+=32;
    src2+=32;
    dest+=16;
}

It works by using the vrhadd (rounding halving add) operation, whose result is the same size as its inputs. This way you don't have to shift the final sum back down to 8 bits, and all of the arithmetic stays 8-bit throughout, which means each instruction operates on twice as many pixels.

However, it is less accurate, because each intermediate average is rounded to 8 bits before the next one is taken. Also, GCC 4.7 does a terrible job of generating code for it; GCC 4.8 does just fine.
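
The size of that accuracy loss can be checked with a scalar model. In this sketch rhadd_u8 imitates a single lane of vrhaddq_u8, computing (a + b + 1) >> 1 without overflow; all function names here are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one vrhaddq_u8 lane: rounding halving add. */
static uint8_t rhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Cascaded rounding averages, as in the 8-bit loop above. */
uint8_t avg4_cascaded(uint8_t p00, uint8_t p01, uint8_t p10, uint8_t p11)
{
    return rhadd_u8(rhadd_u8(p00, p01), rhadd_u8(p10, p11));
}

/* Truncating average of the widened sum, as in the 16-bit versions. */
uint8_t avg4_exact(uint8_t p00, uint8_t p01, uint8_t p10, uint8_t p11)
{
    return (uint8_t)((p00 + p01 + p10 + p11) / 4);
}

/* Largest absolute difference between the two over pixel values in [0, limit). */
int max_avg4_diff(int limit)
{
    assert(limit >= 1 && limit <= 256);
    int worst = 0;
    for (int a = 0; a < limit; a++)
        for (int b = 0; b < limit; b++)
            for (int c = 0; c < limit; c++)
                for (int d = 0; d < limit; d++) {
                    int diff = avg4_cascaded((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d)
                             - avg4_exact((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d);
                    if (diff < 0) diff = -diff;
                    if (diff > worst) worst = diff;
                }
    return worst;
}
```

For example, the block (0, 1, 0, 0) averages to 0 with the widened sum but to 1 with the cascaded version. The cascaded result is never lower than the truncated one and exceeds it by at most one grey level.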

The whole operation has a good chance of being I/O bound, though. The loop should be unrolled to maximise separation between loads and arithmetic, and __builtin_prefetch() (or PLD) should be used to hoist the incoming data into caches before it's needed.
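
As a rough illustration of the prefetching idea, the sketch below adds __builtin_prefetch() hints to a scalar line loop. The function name, the 64-byte cache-line size and the 256-byte prefetch distance are all assumptions that would need tuning on the actual target. The hint is skipped near the end of the row so that no pointer past the buffer is formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE     64   /* assumed cache-line size in bytes */
#define PREFETCH_AHEAD 256  /* assumed prefetch distance in bytes; tune per target */

/* Hypothetical scalar line reducer with prefetch hints.
 * src_w is the source row width in pixels and must be even. */
void reduce_line_prefetch(const uint8_t *src1, const uint8_t *src2,
                          uint8_t *dest, size_t src_w)
{
    assert(src_w % 2 == 0);
    for (size_t x = 0; x < src_w; x += 2) {
#if defined(__GNUC__) || defined(__clang__)
        /* Once per cache line, hint that data further along both rows
         * will be read soon (rw = 0) and not reused (locality = 0). */
        if (x % CACHE_LINE == 0 && x + PREFETCH_AHEAD < src_w) {
            __builtin_prefetch(src1 + x + PREFETCH_AHEAD, 0, 0);
            __builtin_prefetch(src2 + x + PREFETCH_AHEAD, 0, 0);
        }
#endif
        dest[x / 2] = (uint8_t)((src1[x] + src1[x + 1] + src2[x] + src2[x + 1]) / 4);
    }
}
```

The prefetch calls are only hints and do not change the result, so the output can be compared directly against an unhinted loop when measuring whether they help.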

(answered Aug 7 '13 at 12:09 by sh1)
