Vector implementation of the weight_pp function in x265

Latest recommended article published on 2021-01-04 18:02:18

XX_bai

Tags: x265 weight_pp vector SIMD

Article link: https://blog.csdn.net/XX_bai/article/details/89164494

Abstract

The C implementation of weight_pp is:

static void weight_pp_c(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset)
{
    int x, y;
    const int correction = (IF_INTERNAL_PREC - X265_DEPTH); //6
    X265_CHECK(!(width & 15), "weightp alignment error\n");
    X265_CHECK(!((w0 << 6) > 32767), "w0 using more than 16 bits, asm output will mismatch\n");
    X265_CHECK(!(round > 32767), "round using more than 16 bits, asm output will mismatch\n");
    X265_CHECK((shift >= correction), "shift must be include factor correction, please update ASM ABI\n");
    X265_CHECK(!(round & ((1 << correction) - 1)), "round must be include factor correction, please update ASM ABI\n");

    for (y = 0; y <= height - 1; y++)
    {
        for (x = 0; x <= width - 1; )
        {
            // simulating pixel to short conversion
            int16_t val = src[x] << correction;
            dst[x] = x265_clip(((w0 * (val) + round) >> shift) + offset);  // shift actually equals shift + correction  
            // round is the rounding term used together with shift
            x++;
        }

        src += stride;
        dst += stride;
    }
}

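This function produces the weighted reference frame for P-frames: each pixel is weighted, shifted, and offset. The logic is fairly simple.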
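To make the arithmetic concrete, here is a minimal sketch of the per-pixel computation. It assumes 8-bit pixels (so correction is 6), and the helper name weight_one_pixel is hypothetical, not part of x265.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: weights a single 8-bit pixel the same way the inner
 * loop of weight_pp_c does, assuming X265_DEPTH == 8 (correction == 6). */
static uint8_t weight_one_pixel(uint8_t src, int w0, int round, int shift, int offset)
{
    const int correction = 6;
    assert(shift >= correction);              /* shift already includes correction */

    int16_t val = (int16_t)(src << correction);   /* pixel to short conversion */
    int v = ((w0 * val + round) >> shift) + offset;

    /* Same effect as x265_clip for 8-bit pixels: clamp to [0, 255]. */
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```

With w0 = 64, shift = 12 and round = 2048 the weight is exactly 1.0, so the pixel is returned unchanged apart from the offset.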

Main text

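The first step is to vectorize the scalar fixed-point parameters so they can be used in the vector arithmetic that follows. To make assembly optimization easier, the C code already constrains these values: width is a multiple of 16, and both $w0 \ll 6$ and round fit in 16 bits.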

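First, w0 and round are combined into a single value:
$(w0 \ll 6) \mid (round \ll 16)$
This gives a 32-bit number whose high 16 bits hold round and whose low 16 bits hold $w0 \ll 6$. The reason for this layout is explained below.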

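This 32-bit number is then placed in the low 32 bits of a vector register and broadcast with a shuffle, so that all four 32-bit elements of the register hold the constant.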
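The packing and broadcast can be sketched in plain C. This is a scalar emulation, not the actual vector code; a four-element array stands in for the 128-bit register, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Packs (w0 << 6) into the low 16 bits and round into the high 16 bits.
 * The asserts mirror the 16-bit range limits checked in weight_pp_c. */
static uint32_t pack_weight_round(int w0, int round)
{
    assert(w0 >= 0 && (w0 << 6) <= 32767);
    assert(round >= 0 && round <= 32767);
    return ((uint32_t)round << 16) | (uint32_t)(w0 << 6);
}

/* Emulates the shuffle that copies one 32-bit value into all four lanes. */
static void splat_u32(uint32_t value, uint32_t lanes[4])
{
    for (int i = 0; i < 4; i++)
        lanes[i] = value;
}
```

For w0 = 64 and round = 2048 the packed constant is 0x08001000: 0x0800 (round) in the high half and 0x1000 (w0 << 6) in the low half.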

shift and offset are broadcast in the same way.

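Note that width and height are passed in as parameters, so their values are not known in advance, and the function still needs two nested loops. Because width is guaranteed to be a multiple of 16, the inner loop can be unrolled to process 16 horizontally adjacent pixels per iteration.

The outer loop therefore runs height times and the inner loop width/16 times.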
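The loop structure described above might look like the following sketch. The 16-pixel body is written as a scalar loop that stands in for the vector instructions; 8-bit pixels are assumed, and weight_rows_by_16 and clip_u8 are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the 8-bit pixel range, as x265_clip does for 8-bit builds. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

static void weight_rows_by_16(const uint8_t *src, uint8_t *dst, intptr_t stride,
                              int width, int height,
                              int w0, int round, int shift, int offset)
{
    assert(width % 16 == 0);                  /* same guarantee as weight_pp_c */

    for (int y = 0; y < height; y++) {        /* runs height times */
        for (int x = 0; x < width; x += 16) { /* runs width / 16 times */
            /* Stand-in for the vector body: one block of 16 pixels. */
            for (int i = 0; i < 16; i++) {
                int val = src[x + i] << 6;
                dst[x + i] = clip_u8(((w0 * val + round) >> shift) + offset);
            }
        }
        src += stride;
        dst += stride;
    }
}
```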

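Inside the loop, 16 pixels are loaded first, with each group of 8 pixels (64 bits) placed in the low 64 bits of a vector register. Each group is then widened from 8 bits to 16 bits per element, which fills the 128-bit register.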

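A constant vector of 16-bit elements, all set to 1, is prepared here. The ilvr and ilvl instructions take this constant vector and one of the registers loaded above as sources. Each instruction takes the right (ilvr) or left (ilvl) 64 bits of both source registers and interleaves their 16-bit elements into alternating even and odd positions of the destination register, which gives the following layout: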

$\{1, col_{n+3}, 1, col_{n+2}, 1, col_{n+1}, 1, col_n\}$
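The widening and interleaving steps can be emulated in scalar C as a rough sketch. Arrays of eight 16-bit values stand in for a 128-bit register, with index 0 as the lowest lane, so the layout above (written highest lane first) appears reversed in array order. The function names and the `first` parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extends 8 pixels to eight 16-bit lanes. */
static void widen_u8_to_u16(const uint8_t in[8], uint16_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = in[i];
}

/* Interleaves four lanes of px with four lanes of the all-ones constant.
 * first == 0 takes the low (right) half, like ilvr;
 * first == 4 takes the high (left) half, like ilvl.
 * Even output lanes hold pixels, odd output lanes hold the constant. */
static void interleave_half(const uint16_t ones[8], const uint16_t px[8],
                            int first, uint16_t out[8])
{
    assert(first == 0 || first == 4);
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = px[first + i];
        out[2 * i + 1] = ones[first + i];
    }
}
```

Each 32-bit pair now holds a pixel in its low half and the constant 1 in its high half, which lines up with the packed constant that keeps $w0 \ll 6$ in the low half and round in the high half.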

This result is then multiplied and summed with $(w0 \ll 6) \mid (round \ll 16)$, whose vector form is

$\{round, (w0 \ll 6), round, (w0 \ll 6), round, (w0 \ll 6), round, (w0 \ll 6)\}$

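Corresponding elements are multiplied, and adjacent products are added and widened to 32 bits. Each output element is therefore a 32-bit result of the form: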

$round + col \cdot (w0 \ll 6)$, which is equivalent to $round + (col \ll 6) \cdot w0$
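The multiply-and-add-adjacent step can be checked with a small scalar emulation. This is a sketch under the assumption of signed 16-bit lanes and non-negative w0; dot_pairs_s16 and weighted_sums are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplies matching 16-bit lanes and adds each adjacent pair of products
 * into one 32-bit lane. */
static void dot_pairs_s16(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
}

/* Builds the interleaved {col, 1} data and the {w0 << 6, round} constant,
 * then computes col * (w0 << 6) + round for four pixels. */
static void weighted_sums(const uint8_t col[4], int w0, int round, int32_t out[4])
{
    assert(w0 >= 0 && (w0 << 6) <= 32767);
    assert(round >= 0 && round <= 32767);

    int16_t data[8], coef[8];
    for (int i = 0; i < 4; i++) {
        data[2 * i]     = col[i];             /* pixel in the low half   */
        data[2 * i + 1] = 1;                  /* constant 1 in the high  */
        coef[2 * i]     = (int16_t)(w0 << 6); /* multiplies the pixel    */
        coef[2 * i + 1] = (int16_t)round;     /* multiplies the 1        */
    }
    dot_pairs_s16(data, coef, out);
}
```

Because the constant 1 is paired with round, the rounding term is added by the same instruction that performs the multiplication, so no separate add is needed for it.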

This matches the C computation exactly. Next come the vector shift and vector add, which use the srl and addv instructions.
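A scalar sketch of the shift and add steps follows. It assumes that srl is a logical right shift and that every lane is non-negative at this point (true when w0 and round are non-negative), in which case the logical shift gives the same result as the `>>` in the C code. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Shifts each 32-bit lane right by `shift`, then adds `offset`. */
static void shift_then_offset(const int32_t in[4], int shift, int offset,
                              int32_t out[4])
{
    assert(shift >= 0 && shift < 32);
    for (int i = 0; i < 4; i++) {
        assert(in[i] >= 0);                            /* assumption stated above */
        uint32_t shifted = (uint32_t)in[i] >> shift;   /* logical shift, like srl */
        out[i] = (int32_t)shifted + offset;            /* lane-wise add, like addv */
    }
}
```

The offset may be negative, so the lanes can become negative only after this add, which is why the saturation that follows has to treat its input as signed.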

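Then comes the x265_clip operation: the 32-bit elements must be saturated down to 8 bits, which takes two narrowing steps. One point needs care here. The initial 32-bit value is signed and is not guaranteed to be positive, so the first step saturates from signed 32-bit to unsigned 16-bit. The second step saturates from unsigned 16-bit to unsigned 8-bit, which ensures the result is correct in all cases.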
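The two narrowing steps can be written out in scalar C to confirm that together they behave like a direct clamp to [0, 255]. This is an illustrative sketch for 8-bit pixels; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Step 1: signed 32-bit to unsigned 16-bit, with saturation. */
static uint16_t sat_s32_to_u16(int32_t v)
{
    return (uint16_t)(v < 0 ? 0 : (v > 65535 ? 65535 : v));
}

/* Step 2: unsigned 16-bit to unsigned 8-bit, with saturation. */
static uint8_t sat_u16_to_u8(uint16_t v)
{
    return (uint8_t)(v > 255 ? 255 : v);
}

static uint8_t clip_two_stage(int32_t v)
{
    return sat_u16_to_u8(sat_s32_to_u16(v));
}

/* Returns 1 if the two-stage path equals a direct clamp to [0, 255]
 * for every value in [lo, hi], and 0 otherwise. */
static int two_stage_matches_direct(int32_t lo, int32_t hi)
{
    assert(lo <= hi && hi < INT32_MAX);       /* keeps v++ from overflowing */
    for (int32_t v = lo; v <= hi; v++) {
        int32_t direct = v < 0 ? 0 : (v > 255 ? 255 : v);
        if (clip_two_stage(v) != direct)
            return 0;
    }
    return 1;
}
```

Negative inputs are caught by the first step, which maps them to 0 before the value is ever interpreted as unsigned.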

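Finally, the narrowed results are combined in storage order, which fills exactly one vector register, and that register is stored to the destination address to complete the operation.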
