How to use SSE2 instructions to improve the performance of memory copy?
How to transfer data faster is an ageless topic among programmers. For the common scenario, the memcpy provided by the C/C++ library is enough to handle the boring duplicating tasks. However, what if the data is larger than the last-level data cache of your system? On my system, for example, the last-level data cache is the L2 data cache, whose size is 1 MB; what happens when 10 MB of data are copied? Obviously, the copy becomes slow, because the cache is polluted after invoking memcpy.
As a result, we often ask whether there is a good solution that reduces the degree of cache pollution and improves performance at the same time. Fortunately, there is one in the SSEx instruction sets. This article focuses only on the SSE2 instructions; the more advanced SSEx instructions will be discussed in following articles.
Before showing the source code, we had better understand several basic instructions of SSE:
a. PREFETCHNTA.
Non-temporal data—fetch data into a location close to the processor, minimizing cache pollution:
• Pentium III processor—1st-level cache
• Pentium 4 and Intel Xeon processor—2nd-level cache
b. MOVNTDQ.
The MOVNTDQ (store double quadword using non-temporal hint) instruction stores packed integer data from an XMM register to memory, using a non-temporal hint.
c. MOVDQA.
The MOVDQA (move aligned double quadword) instruction transfers a double quadword operand from memory to an XMM register or vice versa; or between XMM registers. The memory address must be aligned to a 16-byte boundary; otherwise, a general-protection exception (#GP) is generated.
d. MOVDQU.
The MOVDQU (move unaligned double quadword) instruction performs the same operations as the MOVDQA instruction, except that 16-byte alignment of a memory address is not required. It is, however, less efficient than MOVDQA.
e. SFENCE.
The SFENCE (Store Fence) instruction controls write ordering by creating a fence for memory store operations. This instruction guarantees that the result of every store instruction that precedes the store fence in program order is globally visible before any store instruction that follows the fence. The SFENCE instruction provides an efficient way of ensuring ordering between procedures that produce weakly-ordered data and procedures that consume that data.
The biggest advantage of the SSE instructions is the increased throughput for transferring data: one instruction can move 16 bytes of data, while a plain 32-bit MOV only moves 4 bytes. Moreover, by cooperating with PREFETCHNTA, an SSE-based memcpy has little effect on the data-cache hit rate compared with the traditional way.
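Before diving into the inline assembly, here is how the same core loop might look with the SSE2 compiler intrinsics that map to the instructions above. This is only a minimal sketch under assumptions the full routine below removes: the helper name sse2_copy_aligned64 is mine, and it requires both pointers to be 16-byte aligned and len to be a multiple of 64.

#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in _mm_prefetch/_mm_sfence) */

/* Minimal sketch: both pointers 16-byte aligned, len a multiple of 64. */
static void sse2_copy_aligned64( void *pDst, const void *pSrc, size_t len )
{
    const __m128i *s = (const __m128i *) pSrc;
    __m128i       *d = (__m128i *) pDst;
    size_t         i;

    for ( i = 0; i < len; i += 64, s += 4, d += 4 )
    {
        /* PREFETCHNTA: pull the next 64-byte chunk close to the CPU,
           minimizing cache pollution. */
        _mm_prefetch( (const char *) (s + 4), _MM_HINT_NTA );

        /* MOVDQA (aligned load) + MOVNTDQ (non-temporal store). */
        _mm_stream_si128( d,     _mm_load_si128( s     ) );
        _mm_stream_si128( d + 1, _mm_load_si128( s + 1 ) );
        _mm_stream_si128( d + 2, _mm_load_si128( s + 2 ) );
        _mm_stream_si128( d + 3, _mm_load_si128( s + 3 ) );
    }

    _mm_sfence();  /* SFENCE: make the non-temporal stores globally visible */
}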
OK, we have spent enough time on the preliminaries; let's get to the main part.
[The block of source code]
void* sse2_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    void *pBegin = pDst;
    int offset;

    /* Bytes needed to bring pDst up to the next 16-byte boundary. */
    if ( (offset = ((unsigned long) pDst) & SSE_ALIGNMENT_MASK) > 0 )
        offset = SSE_ALIGNMENT_VAL - offset;

    /* Too small for the SSE path: fall back to the library memcpy. */
    if ( len < (size_t) offset + 16 )
    {
        return memcpy( pDst, pSrc, len );
    }

    /* Copy the unaligned head so that pDst becomes 16-byte aligned. */
    if ( offset > 0 )
    {
        memcpy( pDst, pSrc, offset );
        len -= offset;
        pDst = ((char *) pDst) + offset;
        pSrc = ((char *) pSrc) + offset;
    }

    if ( SSE_CAN_ALIGN( pDst, pSrc ) )
    {
        _asm
        {
            mov     ecx, len
            mov     esi, pSrc
            mov     edi, pDst

            cmp     ecx, 128
            jb      LA2
            prefetchnta [esi]
LA1:        prefetchnta XMMWORD PTR [esi + 16 * 4]
            movdqa  xmm0, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm0
            movdqa  xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm1
            movdqa  xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq XMMWORD PTR [edi + 16 * 2], xmm2
            movdqa  xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq XMMWORD PTR [edi + 16 * 3], xmm3

            prefetchnta XMMWORD PTR [esi + 16 * 8]
            movdqa  xmm4, XMMWORD PTR [esi + 16 * 4]
            movntdq XMMWORD PTR [edi + 16 * 4], xmm4
            movdqa  xmm5, XMMWORD PTR [esi + 16 * 5]
            movntdq XMMWORD PTR [edi + 16 * 5], xmm5
            movdqa  xmm6, XMMWORD PTR [esi + 16 * 6]
            movntdq XMMWORD PTR [edi + 16 * 6], xmm6
            movdqa  xmm7, XMMWORD PTR [esi + 16 * 7]
            movntdq XMMWORD PTR [edi + 16 * 7], xmm7

            add     esi, 128
            add     edi, 128
            sub     ecx, 128
            cmp     ecx, 128
            jae     LA1

LA2:        cmp     ecx, 64
            jb      LA3
            prefetchnta XMMWORD PTR [esi]
            sub     ecx, 64
            movdqa  xmm0, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm0
            movdqa  xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm1
            movdqa  xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq XMMWORD PTR [edi + 16 * 2], xmm2
            movdqa  xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq XMMWORD PTR [edi + 16 * 3], xmm3
            add     esi, 64
            add     edi, 64

LA3:        prefetchnta XMMWORD PTR [esi]
            cmp     ecx, 32
            jb      LA4
            sub     ecx, 32
            movdqa  xmm4, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm4
            movdqa  xmm5, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm5
            add     esi, 32
            add     edi, 32

LA4:        cmp     ecx, 16
            jb      LA5
            sub     ecx, 16
            movdqa  xmm6, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm6
            //add   esi, 16
            //add   edi, 16

LA5:        sfence
        }
    }
    else    // Unaligned source
    {
        _asm
        {
            mov     ecx, len
            mov     esi, pSrc
            mov     edi, pDst

            cmp     ecx, 128
            jb      LB2
            prefetchnta [esi]
LB1:        prefetchnta XMMWORD PTR [esi + 16 * 4]
            movdqu  xmm0, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm0
            movdqu  xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm1
            movdqu  xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq XMMWORD PTR [edi + 16 * 2], xmm2
            movdqu  xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq XMMWORD PTR [edi + 16 * 3], xmm3

            prefetchnta XMMWORD PTR [esi + 16 * 8]
            movdqu  xmm4, XMMWORD PTR [esi + 16 * 4]
            movntdq XMMWORD PTR [edi + 16 * 4], xmm4
            movdqu  xmm5, XMMWORD PTR [esi + 16 * 5]
            movntdq XMMWORD PTR [edi + 16 * 5], xmm5
            movdqu  xmm6, XMMWORD PTR [esi + 16 * 6]
            movntdq XMMWORD PTR [edi + 16 * 6], xmm6
            movdqu  xmm7, XMMWORD PTR [esi + 16 * 7]
            movntdq XMMWORD PTR [edi + 16 * 7], xmm7

            add     esi, 128
            add     edi, 128
            sub     ecx, 128
            cmp     ecx, 128
            jae     LB1

LB2:        cmp     ecx, 64
            jb      LB3
            prefetchnta XMMWORD PTR [esi]
            sub     ecx, 64
            movdqu  xmm0, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm0
            movdqu  xmm1, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm1
            movdqu  xmm2, XMMWORD PTR [esi + 16 * 2]
            movntdq XMMWORD PTR [edi + 16 * 2], xmm2
            movdqu  xmm3, XMMWORD PTR [esi + 16 * 3]
            movntdq XMMWORD PTR [edi + 16 * 3], xmm3
            add     esi, 64
            add     edi, 64

LB3:        prefetchnta XMMWORD PTR [esi]
            cmp     ecx, 32
            jb      LB4
            sub     ecx, 32
            movdqu  xmm4, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm4
            movdqu  xmm5, XMMWORD PTR [esi + 16 * 1]
            movntdq XMMWORD PTR [edi + 16 * 1], xmm5
            add     esi, 32
            add     edi, 32

LB4:        cmp     ecx, 16
            jb      LB5
            sub     ecx, 16
            movdqu  xmm6, XMMWORD PTR [esi]
            movntdq XMMWORD PTR [edi], xmm6
            //add   esi, 16
            //add   edi, 16

LB5:        sfence
        }
    }   // End if ( SSE_CAN_ALIGN( pDst, pSrc ) )

    /* Copy the remaining tail (fewer than 16 bytes). */
    offset = len & 0x0F;
    if ( offset > 0 )
    {
        memcpy( ((char *) pDst) + (len - offset),
                ((char *) pSrc) + (len - offset),
                offset );
    }

    return pBegin;
}
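As a quick illustration before the detailed comments, the function could be exercised as follows. This demo is mine, not part of the original listing; _aligned_malloc is the MSVC routine for allocating 16-byte-aligned buffers.

#include <malloc.h>   /* _aligned_malloc / _aligned_free (MSVC) */
#include <string.h>

void copy_demo( void )
{
    const size_t SIZE = 10 * 1024 * 1024;   /* 10 MB, larger than my 1 MB L2 */
    char *src = (char *) _aligned_malloc( SIZE, 16 );
    char *dst = (char *) _aligned_malloc( SIZE, 16 );

    if ( src != NULL && dst != NULL )
    {
        memset( src, 0x5A, SIZE );
        sse2_fast_memcpy( dst, src, SIZE );
    }

    _aligned_free( src );
    _aligned_free( dst );
}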
[Comments on the source code]
1) The system memcpy is adopted when the data length is small. If you prefer to do everything by yourself, the small_fast_memcpy listed below may be an ideal choice for handling small blocks of data.
void* small_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    _asm
    {
        mov ecx, len        ; byte count
        mov edi, pDst
        mov esi, pSrc
        rep movsb           ; copy ECX bytes from [ESI] to [EDI]
    }

    return pDst;
}
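If a single entry point is preferred, the two routines can be combined behind a simple size check. This dispatcher is only a sketch of mine; the cut-off SMALL_COPY_LIMIT is a hypothetical value that should be tuned by profiling on the target machine.

#define SMALL_COPY_LIMIT  (256)   /* hypothetical cut-off; tune by profiling */

void* fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    if ( len < SMALL_COPY_LIMIT )
        return small_fast_memcpy( pDst, pSrc, len );  /* tiny blocks */

    return sse2_fast_memcpy( pDst, pSrc, len );       /* large blocks */
}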
2) To make it easier to focus on the core of the code, the simple logical checks are implemented in C syntax. In fact, they are very easy to convert into ASM, too; if you do not like the hybrid style, why not do it yourself right now? :)
3) Macros are used in the source code to avoid modifications if the instructions ever require 64-byte alignment (some of the SSE4x instructions need 64-byte alignment). Please refer to the block below.
#define SSE_ALIGNMENT_VAL   (16)
#define SSE_ALIGNMENT_MASK  (SSE_ALIGNMENT_VAL - 1)
#define SSE_CAN_ALIGN( addr1, addr2 ) \
    ((((unsigned long) (addr1)) & SSE_ALIGNMENT_MASK) == (((unsigned long) (addr2)) & SSE_ALIGNMENT_MASK))
4) The XMMx registers are interleaved to reduce read/write dependences and to improve instruction-level parallelism.
5) The read and write operations are interleaved to add enough delay to hide the latency of the outstanding prefetch.
6) The prefetch size is 64 bytes, which is the data-cache line length on my PC. It should be adjusted to the real environment, but it is an appropriate size in the normal case.
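If you do not want to hard-code the 64-byte figure, the cache-line size can be queried at run time. Below is a minimal sketch of mine using MSVC's __cpuid intrinsic; CPUID leaf 1 reports the CLFLUSH line size in bits 15:8 of EBX, in units of 8 bytes.

#include <intrin.h>   /* __cpuid (MSVC) */

/* Returns the CLFLUSH cache-line size in bytes, or 0 if unavailable. */
unsigned int cache_line_size( void )
{
    int info[4];                            /* EAX, EBX, ECX, EDX */

    __cpuid( info, 1 );
    if ( (info[3] & (1 << 19)) == 0 )       /* EDX bit 19: CLFSH feature */
        return 0;

    return ((info[1] >> 8) & 0xFF) * 8;     /* EBX[15:8] * 8 bytes */
}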
[Summary]
1. A speed improvement is only obtained when the data size is larger than the last-level data cache; otherwise, the performance becomes worse. In my tests with big data, there is about a +50% improvement (a minimal timing sketch follows this list).
2. Be careful when using the prefetch instructions. Abusing them can heavily disturb the performance of the whole system.
3. Using SSEx instructions is not the ultimate solution. If the higher SSEx instruction sets are used, about another +1% gain is available compared with SSE2.
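For reference, here is roughly how such a copy can be timed on Windows. This is a sketch of mine, not the original test harness; it uses QueryPerformanceCounter, and in practice several runs should be averaged, since a single run is noisy.

#include <windows.h>  /* QueryPerformanceCounter / QueryPerformanceFrequency */
#include <stdio.h>

/* Times one copy of `len` bytes and prints the throughput in MB/s. */
void time_copy( void *dst, const void *src, size_t len )
{
    LARGE_INTEGER freq, t0, t1;

    QueryPerformanceFrequency( &freq );
    QueryPerformanceCounter( &t0 );
    sse2_fast_memcpy( dst, src, len );
    QueryPerformanceCounter( &t1 );

    printf( "%.1f MB/s\n",
            (len / (1024.0 * 1024.0)) * (double) freq.QuadPart
            / (double) (t1.QuadPart - t0.QuadPart) );
}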
Anyway, I hope readers will give me feedback to improve/correct this usage of SSEx. I think it will help me understand the internals more deeply.