Programming trivia: 4x4 integer matrix transpose in SSE2

From http://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html

It is a good example of using SSE2 for a matrix transpose.

The Intel SSE2 intrinsics have a macro, _MM_TRANSPOSE4_PS, which performs a matrix transposition on a 4x4 array represented by elements in 4 SSE registers. However, it doesn’t work with integer registers because Intel intrinsics make a distinction between integer and floating point SSE registers. Theoretically one could cast and use the floating point operations, but it seems quite plausible that this will not round trip properly; for instance, if one of your integer values happens to have the same bit pattern as a 32-bit IEEE denormal.
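The round-trip concern can be checked directly rather than guessed at. The following is a minimal sketch of the cast route, assuming the macro expands to shuffle/unpack-style data moves, which copy bits without interpreting them as floats. The name transpose4x4_via_cast is hypothetical, and the behaviour should be verified on the compiler and CPU actually in use.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_castsi128_ps, _mm_castps_si128 */
#include <xmmintrin.h>  /* SSE: __m128, _MM_TRANSPOSE4_PS */
#include <stdint.h>

/* Hypothetical helper (not from the original article): transpose a 4x4
   matrix of 32-bit integers in place by reinterpreting each row as four
   floats and applying _MM_TRANSPOSE4_PS. The casts only change the type
   seen by the compiler; they do not convert values. */
static void transpose4x4_via_cast(uint32_t m[4][4])
{
    __m128 r0 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)m[0]));
    __m128 r1 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)m[1]));
    __m128 r2 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)m[2]));
    __m128 r3 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)m[3]));

    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  /* rewrites r0..r3 in place */

    _mm_storeu_si128((__m128i *)m[0], _mm_castps_si128(r0));
    _mm_storeu_si128((__m128i *)m[1], _mm_castps_si128(r1));
    _mm_storeu_si128((__m128i *)m[2], _mm_castps_si128(r2));
    _mm_storeu_si128((__m128i *)m[3], _mm_castps_si128(r3));
}
```

Feeding it integers whose bit patterns match denormals or NaNs and comparing against the expected transpose shows whether the values survive. Even where they do, mixing integer and floating point operations on the same registers may carry a performance cost on some CPUs, so the integer unpack instructions remain the cleaner route.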

However, it is easy to do with the punpckldq, punpckhdq, punpcklqdq, and punpckhqdq instructions; code and diagrams ahoy.
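To make the four instructions concrete, here is a small sketch that applies the corresponding intrinsics to two input vectors and stores the results. The helper name unpack_demo and its parameter names are made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: apply each of the four unpack operations to a and b.
     lo32 = { a[0], b[0], a[1], b[1] }   (punpckldq)
     hi32 = { a[2], b[2], a[3], b[3] }   (punpckhdq)
     lo64 = { a[0], a[1], b[0], b[1] }   (punpcklqdq)
     hi64 = { a[2], a[3], b[2], b[3] }   (punpckhqdq) */
static void unpack_demo(const uint32_t a[4], const uint32_t b[4],
                        uint32_t lo32[4], uint32_t hi32[4],
                        uint32_t lo64[4], uint32_t hi64[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    _mm_storeu_si128((__m128i *)lo32, _mm_unpacklo_epi32(va, vb));
    _mm_storeu_si128((__m128i *)hi32, _mm_unpackhi_epi32(va, vb));
    _mm_storeu_si128((__m128i *)lo64, _mm_unpacklo_epi64(va, vb));
    _mm_storeu_si128((__m128i *)hi64, _mm_unpackhi_epi64(va, vb));
}
```

With a = {0, 1, 2, 3} and b = {10, 11, 12, 13}, the 32-bit forms interleave single elements from the two inputs, while the 64-bit forms interleave pairs of elements.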

If we name the 4 input registers I0, I1, I2, and I3, then label their corresponding elements as 0{0,1,2,3} and so on, then the transpose operation looks like this:

[Figure: diagram of the SSE2 transpose operation (sse2_transpose.png)]

When we are done, O{0,1,2,3} contain all of the first, second, third, or fourth elements (respectively) of the input vectors.
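Stated without any SIMD, the desired result is simply out[j][i] = in[i][j]. A plain scalar version, with the hypothetical name transpose4x4_scalar, can serve as a reference to compare a vectorized implementation against:

```c
#include <stdint.h>

/* Hypothetical scalar reference: row j of the output collects element j
   of every input row, i.e. out[j][i] = in[i][j]. */
static void transpose4x4_scalar(uint32_t in[4][4], uint32_t out[4][4])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            out[j][i] = in[i][j];
        }
    }
}
```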

In Intel’s intrinsics (also usable in at least GNU C++ and Visual C++), this can be expressed as:

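/* Step 1: interleave 32-bit elements from pairs of rows */
__m128i T0 = _mm_unpacklo_epi32(I0, I1); /* I0[0] I1[0] I0[1] I1[1] */
__m128i T1 = _mm_unpacklo_epi32(I2, I3); /* I2[0] I3[0] I2[1] I3[1] */
__m128i T2 = _mm_unpackhi_epi32(I0, I1); /* I0[2] I1[2] I0[3] I1[3] */
__m128i T3 = _mm_unpackhi_epi32(I2, I3); /* I2[2] I3[2] I2[3] I3[3] */

/* Step 2: combine 64-bit halves, assigning transposed values back into I[0-3] */
I0 = _mm_unpacklo_epi64(T0, T1); /* I0[0] I1[0] I2[0] I3[0] */
I1 = _mm_unpackhi_epi64(T0, T1); /* I0[1] I1[1] I2[1] I3[1] */
I2 = _mm_unpacklo_epi64(T2, T3); /* I0[2] I1[2] I2[2] I3[2] */
I3 = _mm_unpackhi_epi64(T2, T3); /* I0[3] I1[3] I2[3] I3[3] */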
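The fragment above assumes I0 through I3 already hold the four rows. One way to wrap it into a callable function is sketched here, assuming the matrix lives in memory as a row-major 4x4 array of 32-bit integers. The function name transpose4x4_epi32 and the use of unaligned loads and stores are choices made for this illustration, not part of the original article.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical wrapper: transpose a row-major 4x4 matrix of 32-bit
   integers in place using the unpack sequence from the text. */
static void transpose4x4_epi32(uint32_t m[4][4])
{
    /* Unaligned loads, so the caller does not need 16-byte alignment. */
    __m128i I0 = _mm_loadu_si128((const __m128i *)m[0]);
    __m128i I1 = _mm_loadu_si128((const __m128i *)m[1]);
    __m128i I2 = _mm_loadu_si128((const __m128i *)m[2]);
    __m128i I3 = _mm_loadu_si128((const __m128i *)m[3]);

    /* Interleave 32-bit elements from pairs of rows. */
    __m128i T0 = _mm_unpacklo_epi32(I0, I1);
    __m128i T1 = _mm_unpacklo_epi32(I2, I3);
    __m128i T2 = _mm_unpackhi_epi32(I0, I1);
    __m128i T3 = _mm_unpackhi_epi32(I2, I3);

    /* Combine 64-bit halves into the transposed rows. */
    I0 = _mm_unpacklo_epi64(T0, T1);
    I1 = _mm_unpackhi_epi64(T0, T1);
    I2 = _mm_unpacklo_epi64(T2, T3);
    I3 = _mm_unpackhi_epi64(T2, T3);

    _mm_storeu_si128((__m128i *)m[0], I0);
    _mm_storeu_si128((__m128i *)m[1], I1);
    _mm_storeu_si128((__m128i *)m[2], I2);
    _mm_storeu_si128((__m128i *)m[3], I3);
}
```

If the rows are known to be 16-byte aligned, _mm_load_si128 and _mm_store_si128 could be used instead of the unaligned variants.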

The diagram was done with latex2png, a handy little tool for generating images with LaTeX inputs.

