How do you sum the four floats in a 128-bit register?

lychee007

Tags: float, arrays, function, optimization, basic, compiler

Article link: https://blog.csdn.net/lychee007/article/details/6194384

Question (zlw):

A register of type __m128 holds four single-precision floats (xx3, xx2, xx1, xx0).

The goal is to compute xx0 + xx1 + xx2 + xx3.

The asker suggested this approach:
xx = _mm_hadd_ps(xx, _mm_setzero_ps()); // to get (0, 0, xx3+xx2, xx1+xx0)

xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, _MM_SHUFFLE(0, 0, 0, 1)));
_mm_store_ss(&temp, xx);

Alternatively:

xx = _mm_add_ps(xx, _mm_movehl_ps(xx, xx));

xx = _mm_add_ss(xx, _mm_shuffle_ps(xx, xx, 1));

_mm_store_ss(&temp, xx);
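The second variant needs nothing beyond SSE/SSE2, so it should build on any x86-64 compiler without extra flags. As a minimal sketch, it can be wrapped in a self-contained function. The name hsum_sse2 is hypothetical and does not come from the original thread, and _mm_cvtss_f32 is used here in place of _mm_store_ss to return the low lane directly.

```c
#include <emmintrin.h>  /* SSE2 header; also provides the SSE float intrinsics */

/* Horizontal sum of the four floats in v (hypothetical helper name). */
float hsum_sse2(__m128 v)
{
    /* hi = (v3, v2, v3, v2): the upper pair copied into the lower pair */
    __m128 hi = _mm_movehl_ps(v, v);
    /* pairs: lane 0 = v0+v2, lane 1 = v1+v3 (upper lanes are not needed) */
    __m128 pairs = _mm_add_ps(v, hi);
    /* move lane 1 (v1+v3) down into lane 0 */
    __m128 lane1 = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(0, 0, 0, 1));
    /* lane 0 = (v0+v2) + (v1+v3) */
    return _mm_cvtss_f32(_mm_add_ss(pairs, lane1));
}
```

A call such as hsum_sse2(_mm_loadu_ps(data)) sums four floats from an unaligned array. Note that the additions are grouped as (v0+v2) + (v1+v3), so rounding may differ slightly from a left-to-right scalar sum.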

Is there a better way?

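Answer 1 (jimdempseyatthecove):

Apply hadd twice to the same register (source and destination are the same register):
// xx = { xx3, xx2, xx1, xx0 }
xx = _mm_hadd_ps(xx, xx);
// xx = { xx3+xx2, xx1+xx0, xx3+xx2, xx1+xx0 }
xx = _mm_hadd_ps(xx, xx);
// xx = { xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0, xx3+xx2+xx1+xx0 }

Every 32-bit lane ends up holding the sum of all four floats, but this adds no extra cost.
One caveat: on older processors, a load, a store, or a non-SSE operation has to be inserted between the two horizontal adds (why?).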
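As a rough sketch, the double-hadd sequence can be packaged as a function. _mm_hadd_ps is an SSE3 intrinsic, so this assumes either compiling with -msse3 (or a higher target) or, as done here for GCC and Clang, enabling SSE3 for the one function through a target attribute. The name hsum_hadd is hypothetical.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Assumption: GCC/Clang-style target attribute, so that the file does not
   need -msse3 on the command line. Other compilers may not need it. */
#if defined(__GNUC__)
__attribute__((target("sse3")))
#endif
float hsum_hadd(__m128 v)
{
    v = _mm_hadd_ps(v, v);   /* (v3+v2, v1+v0, v3+v2, v1+v0) */
    v = _mm_hadd_ps(v, v);   /* every lane = v3+v2+v1+v0 */
    return _mm_cvtss_f32(v); /* read lane 0 */
}
```

The function only runs correctly on a CPU that supports SSE3; the sketch does no runtime feature check.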

Answer 2 (Brandon Hewitt):

The "best" method depends on the target platform.

float foo(float in[4]) { return __sec_reduce_add(in[:]); }

(This uses Intel® Cilk™ Plus.)
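For readers whose compiler does not accept Cilk Plus array notation, the following is a plain C sketch of what the reduction computes. The names reduce_add and foo_portable are hypothetical, and the loop adds strictly left to right, which is an assumption of this sketch; a vectorizing compiler may group the additions differently.

```c
#include <stddef.h>

/* Sum n floats left to right (hypothetical stand-in for __sec_reduce_add). */
float reduce_add(const float *in, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += in[i];
    return sum;
}

/* Plain C counterpart of foo() from the answer. */
float foo_portable(const float in[4])
{
    return reduce_add(in, 4);
}
```

Whether a compiler turns this loop into SIMD code depends on its floating-point settings, since reordering float additions can change the rounded result.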

With the Intel C++ Composer XE Update 2 compiler, the default Intel® SSE2 code generation (i.e. icc -S -c test.c) produces:

movups (%rdi), %xmm0

movaps %xmm0, %xmm1

movhlps %xmm0, %xmm1

addps %xmm1, %xmm0

movaps %xmm0, %xmm2

shufps $245, %xmm0, %xmm2

addss %xmm2, %xmm0

ret
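To make the listing easier to follow, here is a hand-written intrinsics sketch that mirrors it instruction by instruction. It is a reading aid, not compiler output, and the name foo_sse2 is hypothetical. The shufps immediate 245 is 0xF5, which is the same value as _MM_SHUFFLE(3, 3, 1, 1): it copies lane 1 into lane 0, which is all the final addss needs.

```c
#include <emmintrin.h>  /* SSE2 header; also provides the SSE float intrinsics */

float foo_sse2(const float in[4])
{
    __m128 x0 = _mm_loadu_ps(in);            /* movups (%rdi), %xmm0          */
    __m128 x1 = _mm_movehl_ps(x0, x0);       /* movaps + movhlps: (x3,x2,x3,x2) */
    x0 = _mm_add_ps(x0, x1);                 /* addps: lane0 = in0+in2, lane1 = in1+in3 */
    __m128 x2 = _mm_shuffle_ps(x0, x0, 245); /* movaps + shufps $245: lane0 = old lane1 */
    x0 = _mm_add_ss(x0, x2);                 /* addss: lane0 = full sum       */
    return _mm_cvtss_f32(x0);                /* float return value is lane 0 of xmm0 */
}
```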

Targeting a platform that supports Intel SSE4.2 (icc -xSSE4.2 -S -c test.c) instead gives:
movups (%rdi), %xmm0

haddps %xmm0, %xmm0

haddps %xmm0, %xmm0

ret

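Inspecting the assembly that the compiler generates for a Cilk reduction like this is a handy technique.

Intel presumably puts real effort into optimizing for its own processors, so hand-written code that reaches this level can be considered quite good.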

P.S.//20110807

http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/mac/optaps/common/optaps_par_cean_prog.htm

C/C++ Extensions for Array Notations Programming Model

Reductions

A reduction combines array section elements to generate a scalar result. Intel® Cilk™ Plus supports reductions on array sections. It defines a generic reduction function that applies a user-defined dyadic function. It also has nine built-in common reduction functions. The built-in functions are polymorphic functions that accept int, float, and other C basic data type arguments. The names and descriptions of reduction functions are summarized in the table below.

Reduction Function Prototypes
Function Prototypes	Descriptions
`__sec_reduce(fun, identity, a[:])`	Generic reduction function. Reduces `fun` across the array `a[:]` using `identity` as the initial value.
`__sec_reduce_add(a[:])`	Built-in reduction function. Adds values passed as arrays
`__sec_reduce_mul(a[:])`	Built-in reduction function. Multiplies values passed as arrays
`__sec_reduce_all_zero(a[:])`	Built-in reduction function. Tests that array elements are all zero
`__sec_reduce_all_nonzero(a[:])`	Built-in reduction function. Tests that array elements are all non-zero
`__sec_reduce_any_nonzero(a[:])`	Built-in reduction function. Tests for any array element that is non-zero
`__sec_reduce_min(a[:])`	Built-in reduction function. Determines the minimum value of array elements
`__sec_reduce_max(a[:])`	Built-in reduction function. Determines the maximum value of array elements
`__sec_reduce_min_ind(a[:])`	Built-in reduction function. Determines the index of minimum value of array elements
`__sec_reduce_max_ind(a[:])`	Built-in reduction function. Determines the index of maximum value of array elements

The reduction operation can reduce on multiple ranks. The number of ranks reduced depends on the execution context. For a given execution context of rank m and a reduction array section argument with rank n, where n>m, the last n-m ranks of the array section argument are reduced.

Example

sum = __sec_reduce_add(a[:][:]); // sum across the whole array a

sum_of_column[:] = __sec_reduce_add(a[:][:]); // sum across the column of a
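As a rough illustration of the rank rule, the two statements can be written out as plain C loops. This sketch assumes a small fixed-size array and reads "the last n-m ranks are reduced" as summing over the last index; ROWS, COLS, sum_all and sum_last_rank are hypothetical names.

```c
#define ROWS 2
#define COLS 3

/* Rank-0 context: both ranks are reduced, giving one scalar. */
float sum_all(float a[ROWS][COLS])
{
    float sum = 0.0f;
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            sum += a[i][j];
    return sum;
}

/* Rank-1 context: only the last rank (index j) is reduced,
   so out[i] receives the sum of a[i][0..COLS-1]. */
void sum_last_rank(float a[ROWS][COLS], float out[ROWS])
{
    for (int i = 0; i < ROWS; ++i) {
        out[i] = 0.0f;
        for (int j = 0; j < COLS; ++j)
            out[i] += a[i][j];
    }
}
```

For the array {{1, 2, 3}, {4, 5, 6}}, sum_all returns 21 and sum_last_rank fills out with 6 and 15.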
