Introduction to MMX Programming


Duwchy

Introduction

The Intel MMX™ technology improves performance in many applications, such as image processing and 2D and 3D graphics. It typically applies where the same operation is repeated over large arrays of data elements such as bytes, words or double-words.

Visual Studio .NET 2003 supports a set of MMX Intrinsics which allow the use of the MMX instructions directly from C++ code, without writing the Assembly instructions. The MSDN MMX topics [2], read together with the Intel Software manuals [1], cover the basics of MMX programming.

MMX technology implements the SIMD (single-instruction, multiple-data) execution model. Consider the following programming task: adding some value to each element in a BYTE array. The algorithm for this task may be written this way:

for each  b in array
    b = b + n

In more detail:

for each  b in array
{
    load b to the register
    add n to the register
    write the result from the register back to memory
}

Processors with Intel MMX support have eight 64-bit registers, each of which may contain 8 bytes, or 4 words, or 2 double-words. MMX is a set of instructions that load numeric data (bytes, words, double-words) into the MMX registers, perform arithmetic and logical operations on them and write the results back to memory. Using the MMX technology, the algorithm may be written this way:

for each  8 members in array
{
    load 8 members to the MMX register
    add n to each byte in one operation
    write the result from the register back into memory
}
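The pseudocode above can be sketched in C++ roughly as follows. This is a minimal illustration, not code from the demo project: the function name AddToEachByte and the SKETCH_HAVE_MMX macro are hypothetical, the addition uses wraparound arithmetic, and a plain loop handles the bytes left over when the array length is not a multiple of 8 (and the whole array when MMX intrinsics are not available to the compiler).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__MMX__) || defined(_M_IX86)
#include <mmintrin.h>
#define SKETCH_HAVE_MMX 1   // hypothetical macro used only in this sketch
#endif

// Adds n to every byte of data with wraparound arithmetic.
// AddToEachByte is a hypothetical name chosen for this illustration.
void AddToEachByte(std::uint8_t* data, std::size_t count, std::uint8_t n)
{
    std::size_t i = 0;
#ifdef SKETCH_HAVE_MMX
    const __m64 n8 = _mm_set1_pi8(static_cast<char>(n)); // n in all 8 bytes
    for (; i + 8 <= count; i += 8)
    {
        __m64 v;
        std::memcpy(&v, data + i, 8);   // load 8 members
        v = _mm_add_pi8(v, n8);         // add n to each byte in one operation
        std::memcpy(data + i, &v, 8);   // write the result back into memory
    }
    _mm_empty();                        // emms: leave the MMX state
#endif
    for (; i < count; ++i)              // leftover bytes, one at a time
        data[i] = static_cast<std::uint8_t>(data[i] + n);
}
```

Copying through memcpy avoids any assumption about the alignment of the array.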

A C++ programmer writing a program using MMX Intrinsics doesn't work with the MMX registers directly. Instead, the programmer uses the 64-bit __m64 type and a set of functions that perform arithmetic and logical operations. The C++ compiler takes care of registers and code optimizations.

Visual C++ MMXSwarm sample [4] shows the use of the MMX technology in image processing. It contains a set of wrapper classes simplifying work with MMX Intrinsics, and shows how to perform image processing operations on various types of images (monochrome, RGB 24 bits, RGB 32 bits etc.). This article is a simple introduction to C++ MMX programming. Everyone who is interested in this technology is strongly encouraged to read the MMXSwarm sample.

MMX Programming Details

Include Files

The MMX intrinsics are declared in mmintrin.h. The demo includes emmintrin.h, the SSE2 header, which includes the SSE and MMX headers in turn:

#include <emmintrin.h>

Since the MMX intrinsics are built into the compiler rather than implemented as library functions, there are no lib-files to link.

__m64 Data Type

Variables of this type are used as MMX instruction operands. Their contents should not be accessed directly. Variables of type __m64 are automatically aligned on 8-byte boundaries.
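As a small illustration of working with __m64 through intrinsics rather than through its fields, the sketch below builds a value with _mm_set_pi8 and copies its 8 bytes out with memcpy. BytesOfM64 and SKETCH_HAVE_MMX are hypothetical names. When MMX intrinsics are not available to the compiler, the function simply returns the same bytes without using __m64.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

#if defined(__MMX__) || defined(_M_IX86)
#include <mmintrin.h>
#define SKETCH_HAVE_MMX 1   // hypothetical macro used only in this sketch
#endif

// Returns the in-memory byte order of a __m64 built with _mm_set_pi8.
// BytesOfM64 is a hypothetical helper for this illustration.
std::array<std::uint8_t, 8> BytesOfM64()
{
    std::array<std::uint8_t, 8> out = {{0, 1, 2, 3, 4, 5, 6, 7}};
#ifdef SKETCH_HAVE_MMX
    static_assert(sizeof(__m64) == 8, "__m64 holds 64 bits");
    // _mm_set_pi8 takes the highest byte first and the lowest byte last.
    const __m64 v = _mm_set_pi8(7, 6, 5, 4, 3, 2, 1, 0);
    std::memcpy(out.data(), &v, sizeof v);  // copy the 8 bytes out
    _mm_empty();                            // emms
#endif
    return out;
}
```

The last argument of _mm_set_pi8 ends up at the lowest address, which matters when a value is prepared for byte-wise operations on pixel data.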

Detection of MMX Support

MMX instructions may be used only if the processor supports them. The Visual C++ CPUID sample [3] shows how to detect support of the SSE, MMX and other processor features. Detection is done with the cpuid Assembly instruction. See details in this sample and in the Intel Software manuals [1].
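A minimal detection sketch, separate from the CPUID sample [3], is shown below. It reads CPUID leaf 1 and tests the EDX feature bits documented in the Intel manuals (bit 23 for MMX, bit 25 for SSE, bit 26 for SSE2). The names FeatureBit, CpuFeatures, DetectCpuFeatures and SKETCH_X86 are hypothetical; the code uses the __cpuid intrinsic with Visual C++ and __get_cpuid with GCC or Clang, and reports no features on other targets.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SKETCH_X86 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define SKETCH_X86 1
#endif

// True if the given bit is set in a CPUID feature word.
bool FeatureBit(std::uint32_t word, unsigned bit)
{
    return ((word >> bit) & 1u) != 0;
}

struct CpuFeatures { bool mmx; bool sse; bool sse2; };

// Queries CPUID leaf 1 and decodes three EDX feature bits.
CpuFeatures DetectCpuFeatures()
{
    std::uint32_t edx = 0;
#if defined(SKETCH_X86) && defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);                       // regs = {eax, ebx, ecx, edx}
    edx = static_cast<std::uint32_t>(regs[3]);
#elif defined(SKETCH_X86)
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(1, &a, &b, &c, &d))     // returns 0 if leaf 1 is missing
        edx = d;
#endif
    return { FeatureBit(edx, 23), FeatureBit(edx, 25), FeatureBit(edx, 26) };
}
```

A program would call DetectCpuFeatures() once at startup and choose between the plain C++ path and the MMX path accordingly.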

Saturation Arithmetic and Wraparound Mode

The MMX technology supports an arithmetic capability known as saturating arithmetic. In saturation mode, a result that overflows or underflows is clipped (saturated) to the range limit of its data type [1]. Saturation mode is widely used in image processing. A simple example shows the difference between saturation and wraparound mode. Consider adding 1 to a BYTE variable that holds 255. In wraparound mode the result is 0 (the carry bit is ignored). In saturation mode the result is 255. The same applies at the low end of the range: for the BYTE type in saturation mode, 1 - 2 = 0. The MMX addition and subtraction instructions come in two variants: saturating and wraparound. The demo project from this article uses only saturating instructions.
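The two modes can be modelled for a single byte in plain C++. The sketch below is only a scalar illustration of the arithmetic described above, not MMX code; the function names are hypothetical.

```cpp
#include <cstdint>

// Wraparound: the carry or borrow out of the byte is discarded.
std::uint8_t AddWrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

std::uint8_t SubWrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a - b);
}

// Saturation: the result is clipped to the 0..255 range of an unsigned byte.
std::uint8_t AddSat(std::uint8_t a, std::uint8_t b)
{
    const unsigned sum = static_cast<unsigned>(a) + b;
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

std::uint8_t SubSat(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a > b ? a - b : 0);
}
```

For pixel data the saturating variants are usually the ones wanted: a bright pixel stays white instead of wrapping around to black.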

MMX8 Demo Project

MMX8 is an SDI application that performs simple processing on a monochrome, 8 bits per pixel image. The source image or the result of its processing is shown in the window. The new ATL class CImage is used to load the image from resources and to show it in the window. Two operations are performed on the image: inversion and a change of brightness. Each operation may be done in one of the following ways:

C++ code;
C++ code with MMX Intrinsics;
Inline Assembly with MMX instructions.

Calculation time is shown in the status bar.

C++ image inversion function:

void CImg8Operations::InvertImageCPlusPlus(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels)
{
    for ( int i = 0; i < nNumberOfPixels; i++ )
    {
        *pDest++ = 255 - *pSource++;
    }
}

The best way to find the required MMX instruction is to read the Intel Software manuals [1]. The name of the required Assembly MMX instruction can be found in the short MMX technology overview (Volume 1, Chapter 8). The detailed instruction definition is in Volume 2. This definition also gives the name of the corresponding C++ compiler intrinsic. Some C++ MMX intrinsics are composite (translated to more than one Assembly instruction). They should be looked up directly in the MSDN documentation [2].

The summary of all MMX instructions used in the MMX8 sample is shown in the following table:

Required Function	Assembly Instruction	MMX Intrinsic
Empty MMX state (prevents collisions with floating-point operations)	emms	_mm_empty
Unsigned subtraction with saturation of each byte in two 64-bit operands	psubusb	_mm_subs_pu8
Unsigned addition with saturation of each byte in two 64-bit operands	paddusb	_mm_adds_pu8

Image inversion function in C++ with MMX Intrinsics:

void CImg8Operations::InvertImageC_MMX(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels)
{
    __int64 allOnes = 0;
    allOnes = ~allOnes;                     // 0xffffffffffffffff    

    // 8 pixels are processed in one loop;
    // the last nNumberOfPixels % 8 pixels are not processed
    int nLoop = nNumberOfPixels/8;

    __m64* pIn = (__m64*) pSource;          // input pointer
    __m64* pOut = (__m64*) pDest;           // output pointer

    __m64 tmp;                              // work variable

    _mm_empty();                            // emms

    __m64 n1 = Get_m64(allOnes);            // 255 in each of the 8 bytes

    for ( int i = 0; i < nLoop; i++ )
    {
        tmp = _mm_subs_pu8 (n1 , *pIn);     // Unsigned subtraction with 
                                            // saturation.
                                            // tmp = n1 - *pIn  for each byte

        *pOut = tmp;

        pIn++;                              // next 8 pixels
        pOut++;
    }

    _mm_empty();                            // emms
}

__m64 CImg8Operations::Get_m64(__int64 n)
{
    union __m64__m64
    {
        __m64 m;
        __int64 i;
    } mi;

    mi.i = n;
    return mi.m;
}

Because each function runs in a very short time, I call it many times to make the difference measurable. Calculation times on my computer:

C++ code - 43 ms
C++ with MMX Intrinsics - 26 ms
Inline Assembly with MMX instructions - 26 ms

Execution time should be measured in the Release configuration, with compiler optimizations enabled.

Brightness is changed in the simplest way: by adding some value to, or subtracting it from, each pixel in the image. The brightness functions are slightly more complicated than the inversion functions because they need two different branches, one for positive and one for negative changes.
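A timing harness along these lines might look as follows. It is a sketch, not the demo's status-bar code: InvertScalar and TimeInvertMs are hypothetical names, the 640 x 480 image size is an arbitrary assumption, and std::chrono stands in for whatever timer the demo uses.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain C++ inversion with the same logic as InvertImageCPlusPlus.
void InvertScalar(const std::uint8_t* src, std::uint8_t* dst, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = static_cast<std::uint8_t>(255 - src[i]);
}

// Calls InvertScalar `repeats` times on a synthetic image and returns the
// total elapsed time in milliseconds. The default image size is an assumption.
double TimeInvertMs(int repeats, std::size_t pixels = 640 * 480)
{
    std::vector<std::uint8_t> src(pixels, 10), dst(pixels, 0);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        InvertScalar(src.data(), dst.data(), pixels);
    const auto stop = std::chrono::steady_clock::now();

    // Read the output so the work is not trivially dead code.
    volatile std::uint8_t sink = dst[0];
    (void)sink;

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Timing the MMX variant the same way, with the same repeat count, gives numbers that can be compared directly.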

C++ function for changing an image brightness:

void CImg8Operations::ChangeBrightnessCPlusPlus(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels, 
    int nChange)
{
    if ( nChange > 255 )
        nChange = 255;
    else if ( nChange < -255 )
        nChange = -255;

    BYTE b = (BYTE) abs(nChange);

    int i, n;

    if ( nChange > 0 )
    {
        for ( i = 0; i < nNumberOfPixels; i++ )
        {
            n = (int)(*pSource++ + b);

            if ( n > 255 )
                n = 255;

            *pDest++ = (BYTE) n;
        }
    }
    else
    {
        for ( i = 0; i < nNumberOfPixels; i++ )
        {
            n = (int)(*pSource++ - b);

            if ( n < 0 )
                n = 0;
            *pDest++ = (BYTE) n;
        }
    }
}

Changing an image brightness using C++ with MMX Intrinsics:

void CImg8Operations::ChangeBrightnessC_MMX(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels, 
    int nChange)
{
    if ( nChange > 255 )
        nChange = 255;
    else if ( nChange < -255 )
        nChange = -255;

    BYTE b = (BYTE) abs(nChange);

    // make 64 bits value with b in each byte
    __int64 c = b;

    int i;                                  // also used by the loops below
    for ( i = 1; i <= 7; i++ )
    {
        c = c << 8;
        c |= b;
    }

    // 8 pixels are processed in one loop
    int nNumberOfLoops = nNumberOfPixels / 8;

    __m64* pIn = (__m64*) pSource;          // input pointer
    __m64* pOut = (__m64*) pDest;           // output pointer

    __m64 tmp;                              // work variable


    _mm_empty();                            // emms

    __m64 nChange64 = Get_m64(c);

    if ( nChange > 0 )
    {
        for ( i = 0; i < nNumberOfLoops; i++ )
        {
            tmp = _mm_adds_pu8(*pIn, nChange64); // Unsigned addition 
                                                 // with saturation.
                                                 // tmp = *pIn + nChange64
                                                 // for each byte

            *pOut = tmp;

            pIn++;                               // next 8 pixels
            pOut++;
        }
    }
    else
    {
        for ( i = 0; i < nNumberOfLoops; i++ )
        {
            tmp = _mm_subs_pu8(*pIn, nChange64); // Unsigned subtraction 
                                                 // with saturation.
                                                 // tmp = *pIn - nChange64
                                                 // for each byte

            *pOut = tmp;

            pIn++;                                      // next 8 pixels
            pOut++;
        }
    }

    _mm_empty();                            // emms
}

Notice that the sign of the nChange parameter is checked once, outside the loop, and not thousands of times inside it. Calculation times on my computer:

C++ code - 49 ms
C++ with MMX Intrinsics - 26 ms
Inline Assembly with MMX instructions - 26 ms
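The shift-and-OR loop in ChangeBrightnessC_MMX that copies b into every byte of a 64-bit value can also be written as a single multiplication. The sketch below shows both forms side by side on an unsigned 64-bit integer, which avoids shifting bits into the sign position; the function names are hypothetical. The _mm_set1_pi8 intrinsic produces the same pattern directly as an __m64.

```cpp
#include <cstdint>

// Same idea as the loop in ChangeBrightnessC_MMX, on an unsigned type.
std::uint64_t BroadcastByteLoop(std::uint8_t b)
{
    std::uint64_t c = b;
    for (int i = 1; i <= 7; ++i)
    {
        c <<= 8;
        c |= b;
    }
    return c;
}

// Multiplying by 0x0101...01 places a copy of b in each of the 8 bytes.
std::uint64_t BroadcastByteMul(std::uint8_t b)
{
    return b * 0x0101010101010101ULL;
}
```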

MMX32 Demo Project

The MMX32 project operates on a 32 bits per pixel RGB image. The operations are inversion and changing the image color balance (multiplying each color by some value).

MMX multiplication is more complicated than addition or subtraction, because the result of a multiplication does not have the same size as its operands. For example, if the operands have a BYTE type, the result needs a WORD type. This requires additional conversions, and the difference between C++ and MMX execution times is minimal (5-10%).
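A scalar sketch makes the size problem concrete. The product of two bytes can be as large as 255 * 255 = 65025, which fits in a WORD but not in a BYTE. The function names below are hypothetical and the code is plain C++, not MMX.

```cpp
#include <cstdint>

// Keeps only the low 8 bits of the product: the high part is lost.
std::uint8_t MulTruncated(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a * b);
}

// Keeps the full product: two bytes multiplied always fit in 16 bits.
std::uint16_t MulWidened(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint16_t>(a * b);
}
```

For example, 16 * 16 = 256 becomes 0 when truncated to a byte, which is why the pixels are first expanded to words.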

Changing an image color balance using C++ with MMX Intrinsics:

void CImg32Operations::ColorsC_MMX(
    BYTE* pSource, 
    BYTE* pDest, 
    int nNumberOfPixels, 
    float fRedCoefficient, 
    float fGreenCoefficient, 
    float fBlueCoefficient)
{
    int nRed = (int)(fRedCoefficient * 256.0f);
    int nGreen = (int)(fGreenCoefficient * 256.0f);
    int nBlue = (int)(fBlueCoefficient * 256.0f);

    // make multiplication coefficient
    __int64 c = 0;
    c = nRed;
    c = c << 16;
    c |= nGreen;
    c = c << 16;
    c |= nBlue;

    __m64 nNull = _m_from_int(0);           // null
    __m64 tmp = _m_from_int(0);             // work variable

    _mm_empty();                            // emms

    __m64 nCoeff = Get_m64(c);

    DWORD* pIn = (DWORD*) pSource;          // input pointer
    DWORD* pOut = (DWORD*) pDest;           // output pointer

    for ( int i = 0; i < nNumberOfPixels; i++ )
    {
        tmp = _m_from_int(*pIn);                // tmp = *pIn (write to low
                                                // 32 bits)

        tmp = _mm_unpacklo_pi8(tmp, nNull );    // convert low 4 bytes of
                                                // tmp to 4 words
                                                // high byte for each word
                                                // is taken from nNull

        tmp =  _mm_mullo_pi16 (tmp , nCoeff);   // multiply each word in
                                                // tmp by word in nCoeff
                                                // get low word of each
                                                // result

        tmp = _mm_srli_pi16 (tmp , 8);          // shift each word in tmp
                                                // right by 8 bits (/256)

        tmp = _mm_packs_pu16 (tmp, nNull);      // Pack with unsigned
                                                // saturation.
                                                // Convert 4 words from tmp
                                                // to 4 bytes and write them
                                                // to low 32 bits of tmp.
                                                // Convert 4 words from nNull
                                                // to 4 bytes and write them
                                                // to high 32 bits of tmp.

        *pOut = _m_to_int(tmp);                 // *pOut = tmp (low 32 bits)
        
        pIn++;
        pOut++;

    }

    _mm_empty();                          // emms
}

See additional details in the demo project source code.
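To follow the arithmetic of ColorsC_MMX step by step, the per-pixel pipeline can be modelled in plain C++. This is a scalar sketch with hypothetical function names, assuming non-negative coefficients and the channel order implied by the coefficient packing in the listing (blue in the lowest byte, then green, then red; the fourth byte is multiplied by zero).

```cpp
#include <cstdint>

// Scalar model of one channel of the MMX pipeline.
std::uint8_t ScaleChannelModel(std::uint8_t value, float coefficient)
{
    // Coefficient in 8.8 fixed point, as in the listing.
    const int coeff = static_cast<int>(coefficient * 256.0f);

    // _mm_unpacklo_pi8 with zero: the byte becomes a 16-bit word.
    const std::uint16_t widened = value;

    // _mm_mullo_pi16: only the low 16 bits of the product are kept.
    const std::uint16_t low = static_cast<std::uint16_t>(widened * coeff);

    // _mm_srli_pi16 by 8: divide by 256, leaving a value in 0..255.
    const std::uint16_t shifted = static_cast<std::uint16_t>(low >> 8);

    // _mm_packs_pu16 saturates to 0..255; shifted is already in range.
    return static_cast<std::uint8_t>(shifted);
}

// Applies the model to the three color bytes of a 32-bit pixel.
std::uint32_t ScalePixelModel(std::uint32_t pixel, float r, float g, float b)
{
    const std::uint8_t blue  = static_cast<std::uint8_t>(pixel & 0xFF);
    const std::uint8_t green = static_cast<std::uint8_t>((pixel >> 8) & 0xFF);
    const std::uint8_t red   = static_cast<std::uint8_t>((pixel >> 16) & 0xFF);

    return (static_cast<std::uint32_t>(ScaleChannelModel(red, r)) << 16) |
           (static_cast<std::uint32_t>(ScaleChannelModel(green, g)) << 8) |
            static_cast<std::uint32_t>(ScaleChannelModel(blue, b));
}
```

Because only the low word of each product is kept, the model suggests that coefficients should stay at or below about 1.0: with a coefficient of 2.0 and a channel value of 200 the product no longer fits in 16 bits and the result wraps to 144 instead of saturating at 255.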

SSE2 Technology

SSE2 technology contains a set of integer MMX-like Intrinsics operating on the 128-bit SSE registers. Changing an image color balance using the SSE2 technology, for example, can be executed significantly faster than with pure C++ code. SSE2 technology also extends the SSE technology by adding operations on the double-precision floating-point data type. The MMXSwarm C++ sample works with both MMX and integer SSE2 instructions.
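As an illustration of the MMX-like integer intrinsics, the brightness increase from the MMX8 project can be sketched with SSE2, processing 16 pixels per step instead of 8. This is not code from MMXSwarm or the demo projects: BrightenSse2 and SKETCH_HAVE_SSE2 are hypothetical names, unaligned loads and stores are used so that no alignment is assumed, and a scalar loop handles the leftover pixels (or everything, when SSE2 is not enabled in the compiler).

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1  // hypothetical macro used only in this sketch
#endif

// Adds `amount` to every pixel with unsigned saturation.
void BrightenSse2(const std::uint8_t* src, std::uint8_t* dst,
                  std::size_t count, std::uint8_t amount)
{
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    const __m128i amount16 = _mm_set1_epi8(static_cast<char>(amount));
    for (; i + 16 <= count; i += 16)
    {
        __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + i));   // load 16 pixels
        v = _mm_adds_epu8(v, amount16);                   // saturating add
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
#endif
    for (; i < count; ++i)                                // leftover pixels
    {
        const unsigned sum = static_cast<unsigned>(src[i]) + amount;
        dst[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}
```

The SSE2 integer instructions use the XMM registers, so no _mm_empty call is needed here.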