How to use SSE2 instructions to improve the performance of memory copy?
How to transfer data faster is an ageless topic among programmers. For the common scenario, the memcpy provided by the C/C++ library is enough to handle the boring duplicating tasks. However, what if the data is larger than the last-level data cache of your system? On my system, for example, the last-level data cache is the L2 data cache, whose size is 1 MB; what happens when 10 MB of data are copied? Obviously, the copy becomes slow, because the cache is polluted after invoking memcpy.
As a result, we often ask whether there is a good solution that reduces the degree of cache pollution and improves performance at the same time. Fortunately, there is one in the SSEx instruction sets. This article focuses only on the SSE2 instructions; the more advanced SSEx instructions will be discussed in following articles.
Before showing the source code, we had better understand several basic instructions of SSE:
a. PREFETCHNTA.
Non-temporal data—fetch data into a location close to the processor, minimizing cache pollution:
• Pentium III processor—1st-level cache
• Pentium 4 and Intel Xeon processor—2nd-level cache
b. MOVNTDQ.
The MOVNTDQ (store double quadword using non-temporal hint) instruction stores packed integer data from an XMM register to memory, using a non-temporal hint.
c. MOVDQA.
The MOVDQA (move aligned double quadword) instruction transfers a double quadword operand from memory to an XMM register or vice versa; or between XMM registers. The memory address must be aligned to a 16-byte boundary; otherwise, a general-protection exception (#GP) is generated.
d. MOVDQU.
The MOVDQU (move unaligned double quadword) instruction performs the same operations as the MOVDQA instruction, except that 16-byte alignment of a memory address is not required. It is, however, less efficient than MOVDQA.
e. SFENCE.
The SFENCE (Store Fence) instruction controls write ordering by creating a fence for memory store operations. This instruction guarantees that the result of every store instruction that precedes the store fence in program order is globally visible before any store instruction that follows the fence. The SFENCE instruction provides an efficient way of ensuring ordering between procedures that produce weakly-ordered data and procedures that consume that data.
The biggest advantage of the SSE instructions is the increased throughput for transferring data: one instruction can move 16 bytes of data, while a plain 32-bit MOV only moves 4 bytes. Moreover, by cooperating with PREFETCHNTA, an SSE-based memcpy has little effect on the data-cache hit rate compared with the traditional way.
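Before diving into the inline assembly, here is how the same core loop might look with the SSE2 compiler intrinsics that map to the instructions above. This is only a minimal sketch under assumptions the full routine below removes: the helper name sse2_copy_aligned64 is mine, and it requires both pointers to be 16-byte aligned and len to be a multiple of 64.

#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in _mm_prefetch/_mm_sfence) */

/* Minimal sketch: both pointers 16-byte aligned, len a multiple of 64. */
static void sse2_copy_aligned64( void *pDst, const void *pSrc, size_t len )
{
    const __m128i *s = (const __m128i *) pSrc;
    __m128i       *d = (__m128i *) pDst;
    size_t         i;

    for ( i = 0; i < len; i += 64, s += 4, d += 4 )
    {
        /* PREFETCHNTA: pull the next 64-byte chunk close to the CPU,
           minimizing cache pollution. */
        _mm_prefetch( (const char *) (s + 4), _MM_HINT_NTA );

        /* MOVDQA (aligned load) + MOVNTDQ (non-temporal store). */
        _mm_stream_si128( d,     _mm_load_si128( s     ) );
        _mm_stream_si128( d + 1, _mm_load_si128( s + 1 ) );
        _mm_stream_si128( d + 2, _mm_load_si128( s + 2 ) );
        _mm_stream_si128( d + 3, _mm_load_si128( s + 3 ) );
    }

    _mm_sfence();  /* SFENCE: make the non-temporal stores globally visible */
}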
OK, we have spent enough time on the preliminaries; let's get to the main part.
[The block of source code]
void* sse2_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    void *pBegin = pDst;
    int offset;

    /* Bytes needed to bring pDst up to the next 16-byte boundary. */
    if ( (offset = ((unsigned long) pDst) & SSE_ALIGNMENT_MASK) > 0 )
        offset = SSE_ALIGNMENT_VAL - offset;

    /* Too small for the SSE path: fall back to the library memcpy. */
    if ( len < (size_t) offset + 16 )
    {
        return memcpy( pDst, pSrc, len );
    }

    /* Copy the unaligned head so that pDst becomes 16-byte aligned. */
    if ( offset > 0 )
    {
        memcpy( pDst, pSrc, offset );
        len -= offset;
        pDst = ((char *) pDst) + offset;
        pSrc = ((char *) pSrc) + offset;
    }

    if ( SSE_CAN_ALIGN( pDst, pSrc ) )
    {
        _asm
        {
            mov     ecx, len
            mov     esi, pSrc
            mov     edi, pDst

            cmp     ecx, 128
            jb      LA2
            prefetchnta [esi]
LA1:        prefetchnta XMMWORD PTR [esi + 16 * 4]
            movdqa  xmm0, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm0
            movdqa  xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm1
            movdqa  xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq XMMWORD PTR [edi + 16 * 2], xmm2
            movdqa  xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq XMMWORD PTR [edi + 16 * 3], xmm3

            prefetchnta XMMWORD PTR [esi + 16 * 8]
            movdqa  xmm4, XMMWORD PTR [esi + 16 * 4]
            movntdq XMMWORD PTR [edi + 16 * 4], xmm4
            movdqa  xmm5, XMMWORD PTR [esi + 16 * 5]
            movntdq XMMWORD PTR [edi + 16 * 5], xmm5
            movdqa  xmm6, XMMWORD PTR [esi + 16 * 6]
            movntdq XMMWORD PTR [edi + 16 * 6], xmm6
            movdqa  xmm7, XMMWORD PTR [esi + 16 * 7]
            movntdq XMMWORD PTR [edi + 16 * 7], xmm7

            add     esi, 128
            add     edi, 128
            sub     ecx, 128
            cmp     ecx, 128
            jae     LA1

LA2:        cmp     ecx, 64
            jb      LA3
            prefetchnta XMMWORD PTR [esi]
            sub     ecx, 64
            movdqa  xmm0, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm0
            movdqa  xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm1
            movdqa  xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq XMMWORD PTR [edi + 16 * 2], xmm2
            movdqa  xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq XMMWORD PTR [edi + 16 * 3], xmm3
            add     esi, 64
            add     edi, 64

LA3:        prefetchnta XMMWORD PTR [esi]
            cmp     ecx, 32
            jb      LA4
            sub     ecx, 32
            movdqa  xmm4, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm4
            movdqa  xmm5, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm5
            add     esi, 32
            add     edi, 32

LA4:        cmp     ecx, 16
            jb      LA5
            sub     ecx, 16
            movdqa  xmm6, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm6
            //add   esi, 16
            //add   edi, 16

LA5:        sfence
        }
    }
    else    // Unaligned source
    {
        _asm
        {
            mov     ecx, len
            mov     esi, pSrc
            mov     edi, pDst

            cmp     ecx, 128
            jb      LB2
            prefetchnta [esi]
LB1:        prefetchnta XMMWORD PTR [esi + 16 * 4]
            movdqu  xmm0, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm0
            movdqu  xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm1
            movdqu  xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq XMMWORD PTR [edi + 16 * 2], xmm2
            movdqu  xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq XMMWORD PTR [edi + 16 * 3], xmm3

            prefetchnta XMMWORD PTR [esi + 16 * 8]
            movdqu  xmm4, XMMWORD PTR [esi + 16 * 4]
            movntdq XMMWORD PTR [edi + 16 * 4], xmm4
            movdqu  xmm5, XMMWORD PTR [esi + 16 * 5]
            movntdq XMMWORD PTR [edi + 16 * 5], xmm5
            movdqu  xmm6, XMMWORD PTR [esi + 16 * 6]
            movntdq XMMWORD PTR [edi + 16 * 6], xmm6
            movdqu  xmm7, XMMWORD PTR [esi + 16 * 7]
            movntdq XMMWORD PTR [edi + 16 * 7], xmm7

            add     esi, 128
            add     edi, 128
            sub     ecx, 128
            cmp     ecx, 128
            jae     LB1

LB2:        cmp     ecx, 64
            jb      LB3
            prefetchnta XMMWORD PTR [esi]
            sub     ecx, 64
            movdqu  xmm0, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm0
            movdqu  xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm1
            movdqu  xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq XMMWORD PTR [edi + 16 * 2], xmm2
            movdqu  xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq XMMWORD PTR [edi + 16 * 3], xmm3
            add     esi, 64
            add     edi, 64

LB3:        prefetchnta XMMWORD PTR [esi]
            cmp     ecx, 32
            jb      LB4
            sub     ecx, 32
            movdqu  xmm4, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm4
            movdqu  xmm5, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm5
            add     esi, 32
            add     edi, 32

LB4:        cmp     ecx, 16
            jb      LB5
            sub     ecx, 16
            movdqu  xmm6, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm6
            //add   esi, 16
            //add   edi, 16

LB5:        sfence
        }
    }   // End if ( SSE_CAN_ALIGN( pDst, pSrc ) )

    /* Copy the remaining tail (fewer than 16 bytes). */
    offset = len & 0x0F;
    if ( offset > 0 )
    {
        memcpy( ((char *) pDst) + (len - offset),
                ((char *) pSrc) + (len - offset),
                offset );
    }

    return pBegin;
}
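As a quick illustration before the detailed comments, the function could be exercised as follows. This demo is mine, not part of the original listing; _aligned_malloc is the MSVC routine for allocating 16-byte-aligned buffers.

#include <malloc.h>   /* _aligned_malloc / _aligned_free (MSVC) */
#include <string.h>

void copy_demo( void )
{
    const size_t SIZE = 10 * 1024 * 1024;   /* 10 MB, larger than my 1 MB L2 */
    char *src = (char *) _aligned_malloc( SIZE, 16 );
    char *dst = (char *) _aligned_malloc( SIZE, 16 );

    if ( src != NULL && dst != NULL )
    {
        memset( src, 0x5A, SIZE );
        sse2_fast_memcpy( dst, src, SIZE );
    }

    _aligned_free( src );
    _aligned_free( dst );
}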
[Comments on the source code]
1) The system memcpy is adopted when the data length is small. If you prefer to do everything by yourself, the small_fast_memcpy listed below may be an ideal choice for handling small blocks of data.
void* small_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    _asm
    {
        mov ecx, len        ; byte count
        mov edi, pDst
        mov esi, pSrc
        rep movsb           ; copy ECX bytes from [ESI] to [EDI]
    }

    return pDst;
}
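If a single entry point is preferred, the two routines can be combined behind a simple size check. This dispatcher is only a sketch of mine; the cut-off SMALL_COPY_LIMIT is a hypothetical value that should be tuned by profiling on the target machine.

#define SMALL_COPY_LIMIT  (256)   /* hypothetical cut-off; tune by profiling */

void* fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    if ( len < SMALL_COPY_LIMIT )
        return small_fast_memcpy( pDst, pSrc, len );  /* tiny blocks */

    return sse2_fast_memcpy( pDst, pSrc, len );       /* large blocks */
}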
2) To make it easier to focus on the core of the code, the simple logical checks are implemented in C syntax. In fact, they are very easy to convert into ASM, too; if you do not like the hybrid style, why not do it yourself right now? :)
3) Macros are used in the source code to avoid modifications if the instructions ever require 64-byte alignment (some of the SSE4x instructions need 64-byte alignment). Please refer to the block below.
#define SSE_ALIGNMENT_VAL   (16)
#define SSE_ALIGNMENT_MASK  (SSE_ALIGNMENT_VAL - 1)
#define SSE_CAN_ALIGN( addr1, addr2 ) \
    ((((unsigned long) (addr1)) & SSE_ALIGNMENT_MASK) == (((unsigned long) (addr2)) & SSE_ALIGNMENT_MASK))
4) The XMMx registers are interleaved to reduce read/write dependences and to improve instruction-level parallelism.
5) The read and write operations are interleaved to add enough delay to hide the latency of the outstanding prefetch.
6) The prefetch size is 64 bytes, which is the data-cache line length on my PC. It should be adjusted to the real environment, but it is an appropriate size in the normal case.
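If you do not want to hard-code the 64-byte figure, the cache-line size can be queried at run time. Below is a minimal sketch of mine using MSVC's __cpuid intrinsic; CPUID leaf 1 reports the CLFLUSH line size in bits 15:8 of EBX, in units of 8 bytes.

#include <intrin.h>   /* __cpuid (MSVC) */

/* Returns the CLFLUSH cache-line size in bytes, or 0 if unavailable. */
unsigned int cache_line_size( void )
{
    int info[4];                            /* EAX, EBX, ECX, EDX */

    __cpuid( info, 1 );
    if ( (info[3] & (1 << 19)) == 0 )       /* EDX bit 19: CLFSH feature */
        return 0;

    return ((info[1] >> 8) & 0xFF) * 8;     /* EBX[15:8] * 8 bytes */
}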
[Summary]
1. A speed improvement is only obtained when the data size is larger than the last-level data cache; otherwise, the performance becomes worse. In my tests with big data, there is about a +50% improvement (a minimal timing sketch follows this list).
2. Be careful when using the prefetch instructions. Abusing them can heavily disturb the performance of the whole system.
3. Using SSEx instructions is not the ultimate solution. If the higher SSEx instruction sets are used, about another +1% gain is available compared with SSE2.
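For reference, here is roughly how such a copy can be timed on Windows. This is a sketch of mine, not the original test harness; it uses QueryPerformanceCounter, and in practice several runs should be averaged, since a single run is noisy.

#include <windows.h>  /* QueryPerformanceCounter / QueryPerformanceFrequency */
#include <stdio.h>

/* Times one copy of `len` bytes and prints the throughput in MB/s. */
void time_copy( void *dst, const void *src, size_t len )
{
    LARGE_INTEGER freq, t0, t1;

    QueryPerformanceFrequency( &freq );
    QueryPerformanceCounter( &t0 );
    sse2_fast_memcpy( dst, src, len );
    QueryPerformanceCounter( &t1 );

    printf( "%.1f MB/s\n",
            (len / (1024.0 * 1024.0)) * (double) freq.QuadPart
            / (double) (t1.QuadPart - t0.QuadPart) );
}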
Anyway, I hope readers will give me feedback to improve/correct this usage of SSEx. I think it will help me understand the internals more deeply.