Introduction to SSE Programming

Latest recommended article published on 2022-06-21 11:24:36

zhuliting

Categories: SSE, C/C++, Linux. Tags: assembly, float, components, arrays, compiler, function

Article link: https://blog.csdn.net/zhuliting/article/details/6007286

A Chinese translation is available at: http://dev.gameres.com/Program/Other/sseintro.htm

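SSE is the CPU instruction set that Intel introduced after MMX (admittedly some years ago now). It first appeared in the Pentium III line and is supported by the Intel Pentium III, Pentium 4, Celeron and Xeon families and by the AMD Athlon and Duron families. When this introduction was written, the newer SSE2 instruction set was supported only by the Pentium 4 family.

Why is SSE faster than conventional floating-point arithmetic? It works on 128-bit storage units, each of which holds four 32-bit floating-point numbers, so a packed SSE computation operates on four floats at once, and this batching is what improves throughput. SSE stands for Streaming SIMD Extensions, and SIMD stands for single instruction, multiple data. The name itself describes how SSE works.

Although one SSE operation does the work of four and is faster than executing a conventional floating-point operation four times, a single SSE operation is not as fast as one might expect. To see the benefit there must be a stream, that is, a large amount of data flowing through, so that SIMD can show its strength. The data type SSE works on is a set of four 32-bit floating-point numbers (128 bits in total), which is float[4] in C and C++, and it must be aligned on a 16-byte boundary.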

Introduction

The Intel Streaming SIMD Extensions technology enhances the performance of floating-point operations. Visual Studio .NET 2003 supports a set of SSE Intrinsics which allow the use of SSE instructions directly from C++ code, without writing the Assembly instructions. The MSDN SSE topics [2] may be confusing for programmers who are not familiar with SSE Assembly programming. However, reading the Intel Software manuals [1] together with MSDN makes it possible to understand the basics of SSE programming.

SIMD stands for single instruction, multiple data, and names an execution model. Consider the following programming task: computing the square root of each element in a long floating-point array. The algorithm for this task may be written this way:

for each  f in array
    f = sqrt(f)

Let's be more specific:

for each  f in array
{
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}

A processor with Intel SSE support has eight 128-bit registers, each of which can hold four single-precision floating-point numbers. SSE is a set of instructions that load floating-point numbers into the 128-bit registers, perform arithmetic and logical operations on them, and write the results back to memory. Using SSE technology, the algorithm may be written as:

for each  4 members in array
{
    load 4 members to the SSE register
    calculate 4 square roots in one operation
    write the result from the register to memory
}

A C++ programmer writing a program with SSE Intrinsics does not need to care about registers. The programmer works with the 128-bit __m128 type and a set of functions that perform arithmetic and logical operations. It is up to the C++ compiler to decide which SSE register to use and to optimize the code. SSE technology is useful when the same operation is applied to each element of a long floating-point array.
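The square-root loop sketched in pseudocode above could look like this with intrinsics. This is a minimal sketch rather than code from the demo projects: the function name sqrt_in_place_sse is hypothetical, and it assumes the array is 16-byte aligned and its length is a multiple of 4.

```cpp
#include <xmmintrin.h>  // SSE intrinsics and the __m128 type
#include <cassert>
#include <cstddef>

// Replaces each element of data[0..n) with its square root, four at a time.
// Assumptions of this sketch: data is 16-byte aligned and n is a multiple of 4.
void sqrt_in_place_sse(float* data, std::size_t n)
{
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4)
    {
        __m128 v = _mm_load_ps(data + i);   // load 4 floats into an SSE register
        v = _mm_sqrt_ps(v);                 // compute 4 square roots in one operation
        _mm_store_ps(data + i, v);          // write the 4 results back to memory
    }
}
```

A caller would pass an aligned buffer, for example one declared as alignas(16) float a[8], together with its element count.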

SSE Programming Details

Include Files

All SSE Intrinsics and the __m128 data type are declared in the xmmintrin.h header file:

#include <xmmintrin.h>

Since the SSE Intrinsics are built into the compiler rather than being library functions, there are no lib-files to link.

Data Alignment

Each float array processed by SSE instructions should have 16-byte alignment. A static array is declared using the __declspec(align(16)) keyword:

__declspec(align(16)) float m_fArray[ARRAY_SIZE];

A dynamic array should be allocated using the _aligned_malloc function:

m_fArray = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);

An array allocated by the _aligned_malloc function is released using the _aligned_free function:

_aligned_free(m_fArray);
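The declarations above are specific to Visual C++. As a hedged alternative, the sketch below uses the C++11 alignas specifier for fixed-size arrays and _mm_malloc/_mm_free for dynamic ones, assuming a compiler that provides those two functions (GCC and Clang expose them through xmmintrin.h, Visual C++ through malloc.h). The helper names is_aligned_16, alloc_floats_16 and free_floats_16 are hypothetical.

```cpp
#include <xmmintrin.h>  // _mm_malloc / _mm_free on GCC and Clang
#if defined(_MSC_VER)
#include <malloc.h>     // _mm_malloc / _mm_free on Visual C++
#endif
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p lies on a 16-byte boundary.
bool is_aligned_16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Allocates count floats on a 16-byte boundary (returns nullptr on failure).
// Memory obtained here must be released with free_floats_16, not free/delete.
float* alloc_floats_16(std::size_t count)
{
    float* p = static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
    assert(p == nullptr || is_aligned_16(p));
    return p;
}

void free_floats_16(float* p)
{
    _mm_free(p);
}
```

A fixed-size array gets the same guarantee with a declaration such as alignas(16) float fixedArray[8];.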

__m128 Data Type

Variables of this type are used as operands of the SSE Intrinsics. Their contents should not be accessed directly. Variables of type __m128 are automatically aligned on 16-byte boundaries.
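One way to inspect a __m128 value without touching its members is to store it into an ordinary float array. The sketch below does that with _mm_storeu_ps and also checks the element order used by _mm_set_ps and _mm_setr_ps. The helper names lanes_of and check_lane_order are hypothetical.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Copies the four floats held in v into out[0..3].
void lanes_of(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);   // unaligned store: out needs no special alignment
}

// Self-check of the element order produced by _mm_set_ps and _mm_setr_ps.
void check_lane_order()
{
    float out[4];

    // _mm_set_ps takes its arguments from element 3 down to element 0.
    lanes_of(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f), out);
    assert(out[0] == 1.0f && out[1] == 2.0f && out[2] == 3.0f && out[3] == 4.0f);

    // _mm_setr_ps takes them in memory order, element 0 first.
    lanes_of(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f), out);
    assert(out[0] == 1.0f && out[1] == 2.0f && out[2] == 3.0f && out[3] == 4.0f);
}
```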

Detection of SSE Support

SSE instructions may be used only if the processor supports them. The Visual C++ CPUID sample [4] shows how to detect support for SSE, MMX and other processor features. Detection is done with the cpuid Assembly instruction. See the details in that sample and in the Intel Software manuals [1].
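As a rough illustration of the detection step (not the code of the CPUID sample [4]), the sketch below reads CPUID leaf 1 and tests the EDX feature bits for MMX (bit 23), SSE (bit 25) and SSE2 (bit 26). It assumes an x86 target and either the Visual C++ __cpuid intrinsic or the GCC/Clang __get_cpuid helper. The names SimdSupport and detect_simd_support are hypothetical.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid (Visual C++)
#else
#include <cpuid.h>      // __get_cpuid (GCC and Clang)
#endif

struct SimdSupport
{
    bool mmx;
    bool sse;
    bool sse2;
};

// Queries CPUID leaf 1 and decodes three feature bits from EDX.
SimdSupport detect_simd_support()
{
    unsigned int edx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    edx = static_cast<unsigned int>(regs[3]);
#else
    unsigned int eax = 0, ebx = 0, ecx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        edx = 0;                         // leaf 1 unavailable: report no features
#endif
    SimdSupport s;
    s.mmx  = ((edx >> 23) & 1u) != 0;
    s.sse  = ((edx >> 25) & 1u) != 0;
    s.sse2 = ((edx >> 26) & 1u) != 0;
    return s;
}
```

A program would typically call detect_simd_support() once at startup and fall back to the plain C++ code path when sse is false.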

SSETest Demo Project

The SSETest project is a dialog-based application that performs the following calculation on three float arrays:

fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5

i = 0, 1, 2 ... ARRAY_SIZE-1

ARRAY_SIZE is defined as 30000. The source arrays are filled using the sin and cos functions. The Waterfall chart control written by Kris Jearakul [3] is used to show the source arrays and the result of the calculation. The calculation time (ms) is shown in the dialog. The calculation can be done in one of three ways:

C++ code;
C++ code with SSE Intrinsics;
Inline Assembly with SSE instructions.

C++ function:

void CSSETestDlg::ComputeArrayCPlusPlus(
          float* pArray1,                   // [in] first source array
          float* pArray2,                   // [in] second source array
          float* pResult,                   // [out] result array
          int nSize)                        // [in] size of all arrays
{
    int i;

    float* pSource1 = pArray1;
    float* pSource2 = pArray2;
    float* pDest = pResult;

    for ( i = 0; i < nSize; i++ )
    {
        *pDest = (float)sqrt((*pSource1) * (*pSource1) + (*pSource2)
                 * (*pSource2)) + 0.5f;

        pSource1++;
        pSource2++;
        pDest++;
    }
}

Now let's rewrite this function using the SSE Intrinsics. To find the required SSE Intrinsics, I use the following procedure:

Find the Assembly SSE instruction in the Intel Software manuals [1]. I first look for the instruction in Volume 1, Chapter 9, and then find its detailed description in Volume 2. That description also gives the name of the corresponding C++ Intrinsic.
Search for the SSE Intrinsic name in the MSDN Library.

Some SSE Intrinsics are composite and cannot be found this way. They have to be looked up directly in the MSDN Library (the descriptions are very short but readable). The results of such a search are shown in the following table:

Required Function	Assembly Instruction	SSE Intrinsic
Assign float value to 4 components of 128-bit value	movss + shufps	_mm_set_ps1 (composite)
Multiply 4 float components of 2 128-bit values	mulps	_mm_mul_ps
Add 4 float components of 2 128-bit values	addps	_mm_add_ps
Compute the square root of 4 float components in 128-bit values	sqrtps	_mm_sqrt_ps

C++ function with SSE Intrinsics:

// Assumes all three arrays are 16-byte aligned. Elements beyond the last
// full group of four (nSize % 4 of them) are not processed.
void CSSETestDlg::ComputeArrayCPlusPlusSSE(
          float* pArray1,                   // [in] first source array
          float* pArray2,                   // [in] second source array
          float* pResult,                   // [out] result array
          int nSize)                        // [in] size of all arrays
{
    int nLoop = nSize/ 4;

    __m128 m1, m2, m3, m4;

    __m128* pSrc1 = (__m128*) pArray1;
    __m128* pSrc2 = (__m128*) pArray2;
    __m128* pDest = (__m128*) pResult;

    __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5

    for ( int i = 0; i < nLoop; i++ )
    {
        m1 = _mm_mul_ps(*pSrc1, *pSrc1);        // m1 = *pSrc1 * *pSrc1
        m2 = _mm_mul_ps(*pSrc2, *pSrc2);        // m2 = *pSrc2 * *pSrc2
        m3 = _mm_add_ps(m1, m2);                // m3 = m1 + m2
        m4 = _mm_sqrt_ps(m3);                   // m4 = sqrt(m3)
        *pDest = _mm_add_ps(m4, m0_5);          // *pDest = m4 + 0.5

        pSrc1++;
        pSrc2++;
        pDest++;
    }
}
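The intrinsics function above relies on 16-byte aligned arrays and processes only whole groups of four elements. As a hedged variation (not part of the demo project), the standalone sketch below performs the same calculation with unaligned loads and stores and finishes any leftover elements with scalar code. The name hypot_plus_half_sse is hypothetical.

```cpp
#include <xmmintrin.h>
#include <cmath>
#include <cstddef>

// result[i] = sqrt(a[i]*a[i] + b[i]*b[i]) + 0.5 for every i in [0, n).
// Accepts any n and any alignment, at the cost of unaligned memory access.
void hypot_plus_half_sse(const float* a, const float* b, float* result, std::size_t n)
{
    const __m128 half = _mm_set_ps1(0.5f);
    std::size_t i = 0;

    for (; i + 4 <= n; i += 4)           // full groups of four elements
    {
        __m128 va  = _mm_loadu_ps(a + i);
        __m128 vb  = _mm_loadu_ps(b + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(va, va), _mm_mul_ps(vb, vb));
        _mm_storeu_ps(result + i, _mm_add_ps(_mm_sqrt_ps(sum), half));
    }

    for (; i < n; ++i)                   // scalar tail: the last n % 4 elements
        result[i] = std::sqrt(a[i] * a[i] + b[i] * b[i]) + 0.5f;
}
```

When the arrays are known to be aligned, the aligned forms _mm_load_ps and _mm_store_ps can be used in the main loop instead.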

The function using inline Assembly is not shown here; anyone who is interested may read it in the demo project. Calculation times on my computer:

C++ code - 26 ms
C++ with SSE Intrinsics - 9 ms
Inline Assembly with SSE instructions - 9 ms

Execution time should be estimated in the Release configuration, with compiler optimizations.
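For readers who want to reproduce such measurements outside the demo dialogs, a minimal timing helper might look like the sketch below. It assumes C++11 and std::chrono::steady_clock; the name time_ms is hypothetical and this is not how the demo projects measure time.

```cpp
#include <chrono>

// Runs work() once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double time_ms(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The work passed in would be a lambda that calls one of the compute functions. In an optimized build the results should be used afterwards (printed or summed, for instance) so that the compiler cannot discard the measured work.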

SSESample Demo Project

The SSESample project is a dialog-based application that performs the following calculation on a float array:

fResult[i] = sqrt(fSource[i]*2.8)

i = 0, 1, 2 ... ARRAY_SIZE-1

The program also calculates the minimum and maximum values in the result array. ARRAY_SIZE is defined as 100000. The result array is shown in the listbox. The calculation time (ms) for each method is shown in the dialog:

C++ code - 6 ms on my computer;
C++ code with SSE Intrinsics - 3 ms;
Inline Assembly with SSE instructions - 2 ms.

Assembly code performs better here because of its intensive use of the SSE registers. However, C++ code with SSE Intrinsics usually performs as well as Assembly code or better, because it is difficult to write Assembly code that runs faster than the optimized code generated by a C++ compiler.

C++ function:

// Input: m_fInitialArray
// Output: m_fResultArray, m_fMin, m_fMax
void CSSESampleDlg::OnBnClickedButtonCplusplus()
{
    // FLT_MAX comes from <float.h>. FLT_MIN is the smallest positive float,
    // not the most negative one, so the maximum starts from -FLT_MAX.
    m_fMin = FLT_MAX;
    m_fMax = -FLT_MAX;

    int i;

    for ( i = 0; i < ARRAY_SIZE; i++ )
    {
        m_fResultArray[i] = sqrt(m_fInitialArray[i]  * 2.8f);

        if ( m_fResultArray[i] < m_fMin )
            m_fMin = m_fResultArray[i];

        if ( m_fResultArray[i] > m_fMax )
            m_fMax = m_fResultArray[i];
    }
}

C++ function with SSE Intrinsics:

// Input: m_fInitialArray
// Output: m_fResultArray, m_fMin, m_fMax
void CSSESampleDlg::OnBnClickedButtonSseC()
{
    __m128 coeff = _mm_set_ps1(2.8f);       // coeff[0, 1, 2, 3] = 2.8
    __m128 tmp;

    // FLT_MIN is the smallest positive float, so the running maximum
    // starts from -FLT_MAX instead.
    __m128 min128 = _mm_set_ps1(FLT_MAX);   // min128[0, 1, 2, 3] = FLT_MAX
    __m128 max128 = _mm_set_ps1(-FLT_MAX);  // max128[0, 1, 2, 3] = -FLT_MAX

    __m128* pSource = (__m128*) m_fInitialArray;
    __m128* pDest = (__m128*) m_fResultArray;

    for ( int i = 0; i < ARRAY_SIZE/4; i++ )
    {
        tmp = _mm_mul_ps(*pSource, coeff);      // tmp = *pSource * coeff
        *pDest = _mm_sqrt_ps(tmp);              // *pDest = sqrt(tmp)

        min128 =  _mm_min_ps(*pDest, min128);   // per-element running minimum
        max128 =  _mm_max_ps(*pDest, max128);   // per-element running maximum

        pSource++;
        pDest++;
    }

    // extract minimum and maximum values from min128 and max128
    union u
    {
        __m128 m;
        float f[4];
    } x;

    x.m = min128;
    m_fMin = min(x.f[0], min(x.f[1], min(x.f[2], x.f[3])));

    x.m = max128;
    m_fMax = max(x.f[0], max(x.f[1], max(x.f[2], x.f[3])));
}
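The final extraction step above goes through a union. A hedged alternative is to reduce the four elements inside the register with shuffles and read only the lowest element with _mm_cvtss_f32, as sketched below. The names horizontal_min and horizontal_max are hypothetical; this is not the demo project's code.

```cpp
#include <xmmintrin.h>

// Smallest of the four floats in v.
float horizontal_min(__m128 v)
{
    __m128 high = _mm_movehl_ps(v, v);       // high = (v2, v3, v2, v3)
    __m128 pair = _mm_min_ps(v, high);       // pair0 = min(v0, v2), pair1 = min(v1, v3)
    __m128 odd  = _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast pair1
    return _mm_cvtss_f32(_mm_min_ss(pair, odd));   // lowest element = min(pair0, pair1)
}

// Largest of the four floats in v (same pattern with max).
float horizontal_max(__m128 v)
{
    __m128 high = _mm_movehl_ps(v, v);
    __m128 pair = _mm_max_ps(v, high);
    __m128 odd  = _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_max_ss(pair, odd));
}
```

With these helpers the last lines of the function would reduce to calls such as horizontal_min(min128) and horizontal_max(max128).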

Sources

[1] Intel Software manuals. Volume 1: Basic Architecture, Chapter 9, Programming with the Streaming SIMD Extensions; Volume 2: Instruction Set Reference. http://developer.intel.com/design/archives/processors/mmx/index.htm
[2] MSDN, Streaming SIMD Extensions (SSE). http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vcrefstreamingsimdextensions.asp
[3] Waterfall chart control written by Kris Jearakul. http://www.codeguru.com/controls/Waterfall.shtml
[4] Microsoft Visual C++ CPUID sample. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcsample/html/vcsamcpuiddeterminecpucapabilities.asp
[5] Matt Pietrek. Under The Hood. February 1998 issue of Microsoft Systems Journal. http://www.microsoft.com/msj/0298/hood0298.aspx

The article above is from: http://www.codeproject.com/KB/recipes/sseintro.aspx

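1. To define a __m128 variable and assign it four float values, you can write:

__m128 S1 = { 1.0f, 2.0f, 3.0f, 4.0f };

To change the element at index 2 (counting from 0), you can write:

S1.m128_f32[2] = 6.0f;

We will also use a few assignment intrinsics that make this data structure more convenient to work with:

S1 = _mm_set_ps1( 2.0f );

This sets all four elements of S1.m128_f32 to 2.0f, which is much faster than assigning them one by one.

S1 = _mm_setzero_ps();

This sets all four floating-point numbers in S1 to zero.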
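The m128_f32 member used above is specific to the Visual C++ definition of __m128. As a hedged, more portable sketch, a single element can be changed by storing the value to a temporary array, editing it, and loading it back. The helper name with_element is hypothetical.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Returns a copy of v in which the element at `index` (0..3) is replaced by `value`.
__m128 with_element(__m128 v, int index, float value)
{
    assert(index >= 0 && index < 4);
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, v);      // spill the four floats to aligned memory
    tmp[index] = value;        // edit one of them
    return _mm_load_ps(tmp);   // load the modified group back
}
```

For example, S1 = with_element(S1, 2, 6.0f); has the same effect as the S1.m128_f32[2] = 6.0f assignment.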

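2. In general, the name of every SSE intrinsic function consists of three parts separated by underscores:

_mm_set_ps1

mm stands for the multimedia extension instruction set.

set is an abbreviation of what the function does.

ps1 describes how the function affects the result variable. It is built from two letters. The first letter gives the mode: p (packed) means the result is treated as a group of values, every element of which takes part in the operation; s (scalar) means only the first element of the result takes part. The second letter gives the data type of the operands: s is a 32-bit floating-point number, d is a 64-bit floating-point number, i32 is a 32-bit integer and i64 is a 64-bit integer. Because SSE only supports operations on 32-bit floating-point numbers, you may not find any of these intrinsics with a type modifier other than s, but you will meet the others in the MMX and SSE2 instruction sets.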
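The difference between the p and s prefixes can be seen by comparing _mm_add_ps with _mm_add_ss. In the sketch below the wrapper names add_packed and add_scalar are hypothetical; the intrinsics themselves are standard SSE.

```cpp
#include <xmmintrin.h>

// Packed add: out[i] = a[i] + b[i] for all four elements.
void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// Scalar add: out[0] = a[0] + b[0]; out[1..3] are copied unchanged from a.
void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_packed yields {11, 22, 33, 44}, while add_scalar yields {11, 2, 3, 4}.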
