How to use SSE2 instructions to improve the performance of memory copy?

 


How to improve the performance of data transfers is an ageless topic among programmers. For the common scenario, the memcpy provided by the C/C++ library is enough to handle routine copying tasks. However, what happens if the size of the data is larger than the last-level data cache in your system? On my machine, for example, the last-level data cache is the L2 data cache, whose size is 1 MB; what happens when copying 10 MB of data? Obviously, the copy slows down, because the cache is polluted after invoking memcpy.

As a result, we often ask whether there is a good solution that reduces the degree of cache pollution and improves performance at the same time. Fortunately, one exists in the SSEx instruction sets. This article focuses only on SSE2 instructions; the more advanced SSEx extensions will be discussed in following articles.

Before showing the source code, we should first understand several basic SSE instructions (a short intrinsics sketch follows the list):

a. PREFETCHNTA.

Non-temporal data: fetch data into a location close to the processor, minimizing cache pollution.

• Pentium III processor: 1st-level cache

• Pentium 4 and Intel Xeon processors: 2nd-level cache

b. MOVNTDQ.

The MOVNTDQ (store double quadword using non-temporal hint) instruction stores packed integer data from an XMM register to memory, using a non-temporal hint.

c. MOVDQA.

The MOVDQA (move aligned double quadword) instruction transfers a double quadword operand from memory to an XMM register or vice versa; or between XMM registers. The memory address must be aligned to a 16-byte boundary; otherwise, a general-protection exception (#GP) is generated.

d. MOVDQU.

The MOVDQU (move unaligned double quadword) instruction performs the same operation as the MOVDQA instruction, except that 16-byte alignment of the memory address is not required. It is generally less efficient than MOVDQA.

e. SFENCE.

The SFENCE (Store Fence) instruction controls write ordering by creating a fence for memory store operations. This instruction guarantees that the result of every store instruction that precedes the store fence in program order is globally visible before any store instruction that follows the fence. The SFENCE instruction provides an efficient way of ensuring ordering between procedures that produce weakly-ordered data and procedures that consume that data.
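
For readers who prefer compiler intrinsics to raw assembly, every instruction above has an intrinsic equivalent in the SSE/SSE2 headers. The sketch below shows a single 16-byte copy step only; the function name and the prefetch distance are illustrative, not part of the routine that follows.

#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <xmmintrin.h>  /* SSE:  _mm_prefetch, _mm_sfence */

/* Illustrative only: one 16-byte block copied with the same
   instruction sequence (PREFETCHNTA + MOVDQA + MOVNTDQ + SFENCE).
   MOVDQU would be _mm_loadu_si128 for unaligned sources. */
static void copy16_nontemporal( void *pDst, const void *pSrc )
{
    __m128i v;

    _mm_prefetch( ((const char *) pSrc) + 64, _MM_HINT_NTA ); /* PREFETCHNTA */
    v = _mm_load_si128( (const __m128i *) pSrc );             /* MOVDQA      */
    _mm_stream_si128( (__m128i *) pDst, v );                  /* MOVNTDQ     */
    _mm_sfence();                                             /* SFENCE      */
}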

 

The biggest advantage is that SSE instructions increase data-transfer throughput: a single instruction moves 16 bytes of data, whereas an ordinary MOV carries at most 4 bytes. Moreover, by cooperating with PREFETCHNTA, the data cache hit rate is barely affected by the SSE memcpy, in contrast to the traditional approach.

OK, we have spent enough time on the background; let's move on to the main part.

 

[The source code]

 

void* sse2_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    void *pBegin = pDst;
    int  offset;

    /* Bytes needed to bring pDst up to the next 16-byte boundary. */
    if ( (offset = ((unsigned long) pDst) & SSE_ALIGNMENT_MASK) > 0 )
        offset = SSE_ALIGNMENT_VAL - offset;

    /* Too small for even one aligned 16-byte store: just fall back. */
    if ( len < (size_t) offset + 16 )
    {
        return memcpy( pDst, pSrc, len );
    }

    /* Copy the misaligned head so that pDst becomes 16-byte aligned. */
    if ( offset > 0 )
    {
        memcpy( pDst, pSrc, offset );
        len -= offset;
        pDst = ((char *) pDst) + offset;
        pSrc = ((char *) pSrc) + offset;
    }

    if ( SSE_CAN_ALIGN( pDst, pSrc ) )
    {
        /* Source and destination now share the same alignment,
           so the aligned load MOVDQA can be used. */
        _asm
        {
            mov ecx, len
            mov esi, pSrc
            mov edi, pDst

            cmp ecx, 128
            jb LA2
            prefetchnta [esi]
LA1:
            ; Main loop: 128 bytes per iteration, prefetching ahead of the loads.
            prefetchnta XMMWORD PTR[esi + 16 * 4]
            movdqa xmm0, XMMWORD PTR[esi]
            movntdq XMMWORD PTR[edi], xmm0
            movdqa xmm1, XMMWORD PTR[esi + 16 * 1]
            movntdq XMMWORD PTR[edi + 16 * 1], xmm1
            movdqa xmm2, XMMWORD PTR[esi + 16 * 2]
            movntdq XMMWORD PTR[edi + 16 * 2], xmm2
            movdqa xmm3, XMMWORD PTR[esi + 16 * 3]
            movntdq XMMWORD PTR[edi + 16 * 3], xmm3

            prefetchnta XMMWORD PTR[esi + 16 * 8]
            movdqa xmm4, XMMWORD PTR[esi + 16 * 4]
            movntdq XMMWORD PTR[edi + 16 * 4], xmm4
            movdqa xmm5, XMMWORD PTR[esi + 16 * 5]
            movntdq XMMWORD PTR[edi + 16 * 5], xmm5
            movdqa xmm6, XMMWORD PTR[esi + 16 * 6]
            movntdq XMMWORD PTR[edi + 16 * 6], xmm6
            movdqa xmm7, XMMWORD PTR[esi + 16 * 7]
            movntdq XMMWORD PTR[edi + 16 * 7], xmm7

            add esi, 128
            add edi, 128
            sub ecx, 128
            cmp ecx, 128
            jae LA1
LA2:
            ; Remainder: one 64-byte block, if any.
            cmp ecx, 64
            jb  LA3
            prefetchnta XMMWORD PTR[esi]
            sub ecx, 64
            movdqa xmm0, XMMWORD PTR[esi]
            movntdq XMMWORD PTR[edi], xmm0
            movdqa xmm1, XMMWORD PTR[esi + 16 * 1]
            movntdq XMMWORD PTR[edi + 16 * 1], xmm1
            movdqa xmm2, XMMWORD PTR[esi + 16 * 2]
            movntdq XMMWORD PTR[edi + 16 * 2], xmm2
            movdqa xmm3, XMMWORD PTR[esi + 16 * 3]
            movntdq XMMWORD PTR[edi + 16 * 3], xmm3

            add esi, 64
            add edi, 64
LA3:
            ; Remainder: one 32-byte block, if any.
            prefetchnta XMMWORD PTR[esi]
            cmp ecx, 32
            jb  LA4
            sub ecx, 32
            movdqa xmm4, XMMWORD PTR[esi]
            movntdq XMMWORD PTR[edi], xmm4
            movdqa xmm5, XMMWORD PTR[esi + 16 * 1]
            movntdq XMMWORD PTR[edi + 16 * 1], xmm5

            add esi, 32
            add edi, 32
LA4:
            ; Remainder: the final 16-byte block, if any.
            cmp ecx, 16
            jb  LA5
            sub ecx, 16
            movdqa xmm6, XMMWORD PTR[esi]
            movntdq XMMWORD PTR[edi], xmm6

            ; Pointer updates are not needed after the last block:
            ;add esi, 16
            ;add edi, 16
LA5:
            ; Make the non-temporal stores globally visible.
            sfence
        }
    }
    else // Unaligned source: use MOVDQU for the loads.
    {
        _asm
        {
            mov ecx, len
            mov esi, pSrc
            mov edi, pDst

            cmp ecx, 128
            jb LB2
            prefetchnta [esi]
LB1:
            prefetchnta XMMWORD PTR[esi + 16 * 4]
            movdqu xmm0, XMMWORD PTR[esi]
            movntdq XMMWORD PTR[edi], xmm0
            movdqu xmm1, XMMWORD PTR[esi + 16 * 1]
            movntdq XMMWORD PTR[edi + 16 * 1], xmm1
            movdqu xmm2, XMMWORD PTR[esi + 16 * 2]
            movntdq XMMWORD PTR[edi + 16 * 2], xmm2
            movdqu xmm3, XMMWORD PTR[esi + 16 * 3]
            movntdq XMMWORD PTR[edi + 16 * 3], xmm3

            prefetchnta XMMWORD PTR[esi + 16 * 8]
            movdqu xmm4, XMMWORD PTR[esi + 16 * 4]
            movntdq XMMWORD PTR[edi + 16 * 4], xmm4
            movdqu xmm5, XMMWORD PTR[esi + 16 * 5]
            movntdq XMMWORD PTR[edi + 16 * 5], xmm5
            movdqu xmm6, XMMWORD PTR[esi + 16 * 6]
            movntdq XMMWORD PTR[edi + 16 * 6], xmm6
            movdqu xmm7, XMMWORD PTR[esi + 16 * 7]
            movntdq XMMWORD PTR[edi + 16 * 7], xmm7

            add esi, 128
            add edi, 128
            sub ecx, 128
            cmp ecx, 128
            jae LB1
LB2:
            cmp ecx, 64
            jb  LB3
            prefetchnta XMMWORD PTR[esi]
            sub ecx, 64
            movdqu xmm0, XMMWORD PTR[esi]
            movntdq XMMWORD PTR[edi], xmm0
            movdqu xmm1, XMMWORD PTR[esi + 16 * 1]
            movntdq XMMWORD PTR[edi + 16 * 1], xmm1
            movdqu xmm2, XMMWORD PTR[esi + 16 * 2]
            movntdq XMMWORD PTR[edi + 16 * 2], xmm2
            movdqu xmm3, XMMWORD PTR[esi + 16 * 3]
            movntdq XMMWORD PTR[edi + 16 * 3], xmm3

            add esi, 64
            add edi, 64
LB3:
            prefetchnta XMMWORD PTR[esi]
            cmp ecx, 32
            jb  LB4
            sub ecx, 32
            movdqu xmm4, XMMWORD PTR[esi]
            movntdq XMMWORD PTR[edi], xmm4
            movdqu xmm5, XMMWORD PTR[esi + 16 * 1]
            movntdq XMMWORD PTR[edi + 16 * 1], xmm5

            add esi, 32
            add edi, 32
LB4:
            cmp ecx, 16
            jb  LB5
            sub ecx, 16
            movdqu xmm6, XMMWORD PTR[esi]
            movntdq XMMWORD PTR[edi], xmm6

            ;add esi, 16
            ;add edi, 16
LB5:
            sfence
        }
    } // End if ( SSE_CAN_ALIGN( pDst, pSrc ) )

    /* Copy the tail (fewer than 16 bytes) with plain memcpy. */
    offset = len & 0x0F;
    if ( offset > 0 )
    {
        memcpy( ((char *) pDst) + (len - offset), ((char *) pSrc) + (len - offset), offset );
    }

    return pBegin;
}
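
A minimal usage sketch (the buffer size and fill pattern are illustrative):

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative usage: copy 10 MB and verify the result. */
void test_sse2_fast_memcpy( void )
{
    size_t len = 10 * 1024 * 1024;
    char  *src = (char *) malloc( len );
    char  *dst = (char *) malloc( len );

    memset( src, 0xA5, len );
    sse2_fast_memcpy( dst, src, len );
    assert( memcmp( dst, src, len ) == 0 );

    free( src );
    free( dst );
}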

 

[Comments on the source code]

 

1) The system memcpy is used in the source code when the amount of data is small. If you prefer to do everything yourself, the small_fast_memcpy listed in the next section may be an ideal choice for handling small blocks of data.

void* small_fast_memcpy( void *pDst, const void *pSrc, size_t len )
{
    _asm
    {
        mov ecx, len
        mov edi, pDst
        mov esi, pSrc
        rep movsb   ; copy ECX bytes from [ESI] to [EDI]
    }

    return pDst;
}
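
On MSVC, the same rep movsb sequence is also exposed as the __movsb intrinsic in <intrin.h>, which has the advantage of working in x64 builds, where __asm blocks are not supported; a minimal sketch:

#include <intrin.h>

/* Equivalent of small_fast_memcpy via the compiler intrinsic;
   __movsb compiles to rep movsb. */
void* small_fast_memcpy_intrin( void *pDst, const void *pSrc, size_t len )
{
    __movsb( (unsigned char *) pDst, (const unsigned char *) pSrc, len );
    return pDst;
}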

 

2) To make the emphasis of the source code easier to follow, the simple logical judgements are implemented in C syntax. In fact, they are very easy to convert into assembly too; if you do not like the hybrid approach, why not do it yourself now? :)

 

3) Macros are used in the source code to eliminate modifications if a different alignment is ever required (some SSE4x instructions need 64-byte alignment). Please refer to the block below; a hypothetical 64-byte variant follows it.

#define SSE_ALIGNMENT_VAL               (16)
#define SSE_ALIGNMENT_MASK              (SSE_ALIGNMENT_VAL - 1)
#define SSE_CAN_ALIGN( addr1, addr2 )   \
    ((((unsigned long) (addr1)) & SSE_ALIGNMENT_MASK) == (((unsigned long) (addr2)) & SSE_ALIGNMENT_MASK))
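
For example, retargeting the helpers to 64-byte alignment would only require changing one value; a hypothetical variant:

/* Hypothetical: retarget the alignment helpers to 64 bytes
   (e.g., for instructions that demand 64-byte alignment). */
#define SSE_ALIGNMENT_VAL               (64)
#define SSE_ALIGNMENT_MASK              (SSE_ALIGNMENT_VAL - 1)   /* == 0x3F */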

 

4) Interleaving the XMMx registers decreases read/write dependences and improves instruction-level parallelism.

 

5) Interleaving the read and write operations adds delay that helps cover the latency of the in-flight prefetches.

 

6) The prefetch stride is 64 bytes, which is the data cache line size on my PC. It should be adjusted for the real environment, but it is an appropriate size in the normal case; a sketch for querying the line size at run time follows.
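
One way to obtain the cache line size at run time is CPUID leaf 1: bits 15:8 of EBX report the CLFLUSH line size in 8-byte units. A minimal sketch using the MSVC __cpuid intrinsic:

#include <intrin.h>

/* Returns the CLFLUSH cache line size in bytes: EBX[15:8] of
   CPUID leaf 1, scaled by 8. Typically 64 on modern x86 CPUs. */
static int cache_line_size( void )
{
    int regs[4];    /* EAX, EBX, ECX, EDX */

    __cpuid( regs, 1 );
    return ((regs[1] >> 8) & 0xFF) * 8;
}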

 

[Summary]

1. A speed improvement is only obtained when the amount of data is larger than the last-level data cache; otherwise, the performance becomes worse. In my tests with large data, the improvement is about +50% (see the timing sketch after this list).

2. Be careful when using the prefetch instructions; abusing them can heavily disturb the performance of the system.

3. SSE2 is not the last word here: if the later SSEx instructions are used, roughly another +1% gain is available compared with SSE2.
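
As a rough way to reproduce the measurement in point 1, time both routines on a buffer far above the last-level cache size; a sketch assuming clock() resolution is good enough (the buffer size and repeat count are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Rough timing harness: memcpy vs. sse2_fast_memcpy on 64 MB,
   far above a 1 MB last-level cache. */
void benchmark_memcpy( void )
{
    size_t  len = 64u * 1024 * 1024;
    char   *src = (char *) malloc( len );
    char   *dst = (char *) malloc( len );
    clock_t t0, t1, t2;
    int     i;

    memset( src, 1, len );

    t0 = clock();
    for ( i = 0; i < 16; i++ )
        memcpy( dst, src, len );
    t1 = clock();
    for ( i = 0; i < 16; i++ )
        sse2_fast_memcpy( dst, src, len );
    t2 = clock();

    printf( "memcpy: %ld ticks, sse2_fast_memcpy: %ld ticks\n",
            (long) (t1 - t0), (long) (t2 - t1) );
    free( src );
    free( dst );
}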

 

Anyway, I hope readers will give me feedback to improve or correct this usage of SSEx; I think it will help me understand the internals more deeply.

 
