C++ AMP: Introduction and Best Practices

Introduction

46 simple examples showing different C++ AMP applications and best practices, from device acquisition, to array and array_view, to exception handling, to correct performance measurement. All examples are thoroughly commented, but I will explain some of the concepts in this article as well.

Building the Sample

There are no special requirements. This is a Console application.
Description
The first step in working with AMP is selecting the device on which you want to run your code. Ideally you want a device that supports DirectX 11 and is not dedicated to rendering to the display. Notice in the block below that Has display is true; ideally it should be false, which it will be if you have a GPU that is not dedicated to rendering to the display. Here is the output of the available accelerator properties on my machine. Please make sure you have an up-to-date GPU driver - I had a problem creating the NVIDIA accelerator until I installed the latest driver.
C++
/* 
Accelerators and their properties. 
   -- Description:               NVIDIA Quadro 5000M 
      Device path:               PCI\VEN_10DE&DEV_06DA&SUBSYS_1520103C&REV_A3\4&ADCCE93&0&0018 
      Version:                   11.0 
      Dedicated memory:          2047424 KB 
      Supports double precision: true 
      Limited double precision:  true 
      Has display:               true 
      Is emulated:               false 
      Is debug:                  true 
*/
 
Here is the code that produces the output above. It enumerates all devices available on my machine and prints the properties of each one.
 
C++
void AmpExamples::AcceleratorProperties() 
{ 
    cout << "\nAccelerators and their properties.\n"; 
 
    vector<accelerator> list = accelerator::get_all(); 
 
    for_each(list.begin(), list.end(), [](const accelerator& a) { 
        wcout << "   -- Description:               " << a.description << endl; 
        wcout << "      Device path:               " << a.device_path << endl; 
        wcout << "      Version:                   " << (a.version >> 16) << '.' << (a.version & 0xFFFF) << endl; 
        wcout << "      Dedicated memory:          " << a.dedicated_memory << " KB" << endl; 
        wcout << "      Supports double precision: " << ((a.supports_double_precision) ? "true" : "false") << endl;                // Note that full double precision is required by the concurrency::precise_math functions in <amp_math.h> 
        wcout << "      Limited double precision:  " << ((a.supports_limited_double_precision) ? "true" : "false") << endl; 
        wcout << "      Has display:               " << ((a.has_display) ? "true" : "false") << endl; 
        wcout << "      Is emulated:               " << ((a.is_emulated) ? "true" : "false") << endl; 
        wcout << "      Is debug:                  " << ((a.is_debug) ? "true" : "false") << endl; 
        wcout << endl; 
    }); 
 
#ifndef _DEBUG 
    // Requires ref accelerator for debugging on GPU!!! 
    bool r = PickAccelerator(); 
#endif 
 
    // Now that we have a GPU accelerator, we can create views to other accelerators 
    accelerator_view warp = accelerator(accelerator::direct3d_warp).default_view; 
    wcout << L"\n   Acquired another accelerator: " << warp.accelerator.description << endl; 
 
    // While default view is on the gpu 
    accelerator_view gpu = accelerator().default_view; 
    wcout << L"   Default view:                " << gpu.accelerator.description << endl; 
} 
 
bool AmpExamples::PickAccelerator() 
{ 
    bool success = false; 
 
    vector<accelerator> list = accelerator::get_all(); 
 
    auto result = find_if(list.begin(), list.end(), [](const accelerator& a) { 
        return    !a.is_emulated 
                && a.supports_double_precision  
                //&& !a.has_display 
                ; 
    }); 
 
    if (result != list.end()) 
    { 
        accelerator gpu = *result; 
        success = accelerator::set_default(gpu.device_path); 
 
        if (success) 
        { 
            wcout << "\n   Accelerator for the process: " << gpu.description << endl; 
            return true; 
        } 
    } 
 
    accelerator warp(L"direct3d\\warp"); 
    success = accelerator::set_default(warp.device_path); 
 
    wcout << "\n   Accelerator for the process: " << warp.description << endl;     
 
    return success; 
} 
If you are targeting specific hardware, an accelerator can be created directly by passing a system-wide unique device path to the accelerator constructor, if you know it (i.e. the "Device Instance Path" property for the device in Device Manager), e.g. accelerator a(L"PCI\\VEN_10DE&DEV_06DA&SUBSYS_1520103C&REV_A3\\4&ADCCE93&0&0018") - note the escaped backslashes in the string literal.
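Since construction with a path that does not exist on the current machine throws concurrency::runtime_exception, it is worth guarding the call. A small sketch of mine (the path is simply the one from the output above; substitute your own):

C++
try 
{ 
    // Construct an accelerator from a known device instance path and make it the process default. 
    accelerator a(L"PCI\\VEN_10DE&DEV_06DA&SUBSYS_1520103C&REV_A3\\4&ADCCE93&0&0018"); 
    accelerator::set_default(a.device_path); 
} 
catch (const runtime_exception& e) 
{ 
    cerr << "   Failed to create accelerator: " << e.what() << endl; 
} 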
 
It is important to note that in order to debug GPU code (to set breakpoints in AMP code) you must run on the ref (software reference) accelerator.
C++
/* 
   -- Description:               Software Adapter 
      Device path:               direct3d\ref 
      Version:                   11.1 
      Dedicated memory:          0 KB 
      Supports double precision: true 
      Limited double precision:  true 
      Has display:               true 
      Is emulated:               true 
      Is debug:                  true 
*/
Hence I wrapped the device selection in:
C++
#ifndef _DEBUG 
    // Requires ref accelerator for debugging on GPU!!! 
    bool r = PickAccelerator(); 
#endif 
 
You can also print messages from AMP code to the Output window - I have an example in the solution.
 
accelerator_view is the device view on which your AMP code is executed. It is optional, but you should specify it explicitly in your code.
 
C++
void AmpExamples::AcceleratorViewProperties() 
{ 
    cout << "\nAccelerator Views and their properties.\n"; 
 
    vector<accelerator> list = accelerator::get_all(); 
 
    for_each(list.begin(), list.end(), [](const accelerator& a) { 
        accelerator_view av = a.create_view(); 
 
        wcout << "   -- Description:               " << av.accelerator.description << endl; 
        wcout << "      Version:                   " << (av.version >> 16) << '.' << (av.version & 0xFFFF) << endl; 
        wcout << "      Is debug:                  " << ((av.is_debug) ? "true" : "false") << endl; 
        wcout << "      Queing mode:               " << ((av.queuing_mode == queuing_mode::queuing_mode_automatic) ? "automatic" : "immediate") << endl; 
        wcout << endl; 
    }); 
}
The following examples demonstrate basic AMP concepts. Please note that although I time the execution of some of the code, the intention is only to understand what AMP is doing under the hood and where it spends time. You will notice a long pause the first time AMP is accessed; this is the runtime initializing the AMP framework. In addition, each kernel (restrict(amp)) must be compiled, causing an initial performance penalty. The last two examples in the solution show how to measure kernel code correctly: before measuring performance, the code executes a small warm-up routine to force JIT compilation of the kernel.
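The warm-up routine itself does not need to do anything meaningful; it only has to run a trivial kernel on the target accelerator_view so that runtime initialization and JIT compilation happen outside the timed region. A minimal sketch of such a routine (my approximation of the solution's WarmUp helper, not its exact code):

C++
void AmpExamples::WarmUp(accelerator_view device) 
{ 
    vector<float> v(256, 1.0f); 
    array_view<float, 1> av(static_cast<int>(v.size()), v); 

    // A trivial kernel; its only purpose is to trigger runtime initialization and JIT compilation. 
    parallel_for_each(device, av.extent, [=](index<1> idx) restrict(amp) { 
        av[idx] += 1.0f; 
    }); 

    device.wait();        // Make sure the warm-up actually completed before any timing starts 
} 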
 
But let's get back to basics. Most of the time you will be working with array_view, which is a lightweight view over the underlying host data rather than a copy of it. The array type is used mostly when you want to measure performance, interop with DirectX, or keep data resident on the device. AMP automatically handles transferring data between the host and the device when you use array_view.
 
You can call functions from within the kernel. In that case the function must be amp-restricted, i.e. declared with restrict(amp).
 
C++
void AmpExamples::AddElementsInternal( 
    index<1>                        idx,  
    array_view<int, 1>              sum,  
    const array_view<const int, 1>  a,  
    const array_view<const int, 1>  b 
    ) restrict(amp) 
{ 
    sum[idx] = a[idx] + b[idx]; 
} 
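For context, the caller of such a helper might look like the sketch below (my own illustration; the vectors and the AddElements name are hypothetical, and it assumes AddElementsInternal is a static member or free function, since this cannot be captured in amp-restricted code). Because the data is wrapped in array_view objects, the runtime copies it to the accelerator before the kernel runs and copies the result back when you synchronize or read it on the host.

C++
void AmpExamples::AddElements() 
{ 
    vector<int> ha(4, 1);                          // e.g. four elements, all 1 
    vector<int> hb(4, 2);                          // e.g. four elements, all 2 
    vector<int> hsum(ha.size()); 

    const int size = static_cast<int>(ha.size()); 

    array_view<const int, 1> a(size, ha); 
    array_view<const int, 1> b(size, hb); 
    array_view<int, 1>       sum(size, hsum); 
    sum.discard_data();                            // We only write sum, so do not copy it to the device 

    parallel_for_each(sum.extent, [=](index<1> idx) restrict(amp) { 
        AmpExamples::AddElementsInternal(idx, sum, a, b);    // Call into the amp-restricted helper above 
    }); 

    sum.synchronize();                             // Copy the results back into hsum 
} 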
 
Several examples starting with ArrayViewOps show what should not be done in amp-restricted code. The compiler will not let you cast away const, for example, but in some cases you can also run into trouble with pointers.
 
C++
void AmpExamples::PointerRestrictions() 
{ 
    int p[] = { 12345 }; 
    const int size = ARRAYSIZE(p); 
 
    array_view<int, 1> a(size, p); 
 
    parallel_for_each(a.extent, [=](index<1> idx) restrict(amp) 
        { 
            struct A 
            { 
                bool flag; 
                int  data; 
            }; 
 
            A a; 
 
            bool* p1 = &(a.flag); 
            //bool* p2 = p1++;                // error C3599: '++' : cannot perform pointer arithmetic on pointer to bool in amp restricted code 
 
            //bool b = *(p2); 
 
            // Compiler Error: 
            // base class, data member or array element must be at least 4-byte aligned for amp-restricted function  
            // 
            /* 
            struct B 
            { 
                bool flag; 
                bool data; 
            }; 
 
            B b; 
            */ 
 
            // The solution for B is to align the member 
            struct C 
            { 
                bool flag; 
                __declspec(align(4)) bool data;            // Note that the alignment is only applied to the field 
            }; 
 
            C c; 
 
            // To align a structure 
            typedef __declspec(align(4))  
            struct D  
            { 
                bool flag; 
            }  
            ALIGNED_BOOL; 
 
            ALIGNED_BOOL d[10];                            // Now we can create an array of aligned fields 
        } 
    ); 
} 
Please note that your kernel code must execute fast. On most hardware the limit is 2 seconds; on my machine it is 7 seconds. You can see the configured value in the registry, for example with PowerShell:
 
C++
/* 
PS HKLM:\> dir 
 
    Hive: HKEY_LOCAL_MACHINE 
 
Name                           Property                                                                                                                                                  
----                           --------                                                                                                                                                  
BCD00000000                                                                                                                                                                              
HARDWARE                                                                                                                                                                                 
SAM                                                                                                                                                                                      
dir : Requested registry access is not allowed. 
At line:1 char:1 
+ dir 
+ ~~~ 
    + CategoryInfo          : PermissionDenied: (HKEY_LOCAL_MACHINE\SECURITY:String) [Get-ChildItem], SecurityException 
    + FullyQualifiedErrorId : System.Security.SecurityException,Microsoft.PowerShell.Commands.GetChildItemCommand 
  
SOFTWARE                                                                                                                                                                                 
SYSTEM                                                                                                                                                                                   
 
PS HKLM:\> cd System\CurrentControlSet\Control\GraphicsDrivers 
 
PS HKLM:\System\CurrentControlSet\Control\GraphicsDrivers> dir 
 
    Hive: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers 
 
Name                           Property                                                                                                                                                  
----                           --------                                                                                                                                                  
AdditionalModeLists                                                                                                                                                                      
Configuration                                                                                                                                                                            
Connectivity                                                                                                                                                                             
DCI                            Timeout : 7                                                                                                                                               
UseNewKey                                          
*/ 
This behavior is controlled by Windows. If you have a GPU that is not dedicated to rendering the display, you can disable Timeout Detection and Recovery (TDR) by creating the device from a DirectX 11 ID3D11Device. To do that, you must pass the D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT flag to the CreateDevice function.
 
C++
void AmpExamples::DisableTDR() 
{ 
    cout << "\nDisable TDR.\n"; 
 
    unsigned int flags = D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT;            // DISABLE TDR!!! 
     
#if _DEBUG 
    flags |= D3D11_CREATE_DEVICE_DEBUG; 
#endif 
 
    ID3D11Device*            device  = nullptr; 
    ID3D11DeviceContext*    context = nullptr; 
 
    D3D_DRIVER_TYPE driverTypes[] =  
    { 
        D3D_DRIVER_TYPE_HARDWARE, 
        D3D_DRIVER_TYPE_WARP, 
        D3D_DRIVER_TYPE_REFERENCE 
    }; 
 
    D3D_FEATURE_LEVEL featureLevels[] = 
    { 
        D3D_FEATURE_LEVEL_11_0, 
        D3D_FEATURE_LEVEL_10_1, 
        D3D_FEATURE_LEVEL_10_0 
    }; 
 
    D3D_FEATURE_LEVEL        feature; 
 
    // http://msdn.microsoft.com/en-us/library/windows/desktop/ff476877(v=vs.85).aspx 
    //IDXGIAdapter* adapter = nullptr; 
 
    HRESULT hr = S_OK; 
 
    for (UINT i = 0; i < ARRAYSIZE(driverTypes); ++i) 
    { 
        D3D_DRIVER_TYPE driverType = driverTypes[i]; 
 
        hr = D3D11CreateDevice( 
            nullptr,                        // dxgi adapter 
            driverType,                        // driver type 
            nullptr,                        // software rasterizer 
            flags,                            // flags 
            featureLevels,                    // feature levels 
            ARRAYSIZE(featureLevels),        // number of feature levels 
            D3D11_SDK_VERSION,                // sdk version 
            &device, 
            &feature, 
            &context 
            ); 
 
        if (SUCCEEDED(hr)) 
        { 
            break; 
        } 
    } 
 
    if (FAILED(hr) ||  
        ((feature != D3D_FEATURE_LEVEL_11_1) && (feature != D3D_FEATURE_LEVEL_11_0)) 
        ) 
    { 
        cerr << "   Failed to create Direct3D 11 device." << endl; 
        return; 
    } 
 
    // This accelerator_view will not time-out 
    accelerator_view av = create_accelerator_view(device); 
 
    wcout << "   -- Description:               " << av.accelerator.description << endl; 
    wcout << "      Version:                   " << (av.version >> 16) << '.' << (av.version & 0xFFFF) << endl; 
    wcout << "      Is debug:                  " << ((av.is_debug) ? "true" : "false") << endl; 
    wcout << "      Queing mode:               " << ((av.queuing_mode == queuing_mode::queuing_mode_automatic) ? "automatic" : "immediate") << endl; 
    wcout << endl; 
} 
I have already mentioned that you can output information from the amp kernel to the debug Output window. You do this using the direct3d_printf function. An example follows:
C++
// void direct3d_printf(const char *_Format_string, …) restrict(amp) 
// Parameters: _Format_string: the format string; ...: an optional list of parameters of variable count. 
// This function prints formatted output from a kernel to the Visual Studio Output window. 
// 
// D3D11 MESSAGE: Reference Rasterizer:    view[0,0] = 2 [ SHADER MESSAGE #2097410: SHADER_MESSAGE] 
// D3D11 MESSAGE: Reference Rasterizer:    view[0,1] = 3 [ SHADER MESSAGE #2097410: SHADER_MESSAGE] 
// D3D11 MESSAGE: Reference Rasterizer:    view[1,0] = 4 [ SHADER MESSAGE #2097410: SHADER_MESSAGE] 
// D3D11 MESSAGE: Reference Rasterizer:    view[1,1] = 5 [ SHADER MESSAGE #2097410: SHADER_MESSAGE] 
// 
// 
// void direct3d_errorf(const char *_Format_string, …) restrict(amp) 
// This function has identical characteristics and usage to direct3d_printf, in that a message is 
// printed to the Output window. Additionally the C++ AMP runtime raises a runtime_exception on the 
// host with the same error message passed to the direct3d_errorf call. 
// 
// D3D11 ERROR: Reference Rasterizer:    errorf: av[idx] = 2 [ SHADER ERROR #2097411: SHADER_ERROR] 
// D3D11 ERROR: Reference Rasterizer:    errorf: av[idx] = 3 [ SHADER ERROR #2097411: SHADER_ERROR] 
// D3D11 ERROR: Reference Rasterizer:    errorf: av[idx] = 4 [ SHADER ERROR #2097411: SHADER_ERROR] 
// D3D11 ERROR: Reference Rasterizer:    errorf: av[idx] = 5 [ SHADER ERROR #2097411: SHADER_ERROR] 
// 
// 
// void direct3d_abort() restrict(amp) 
// This function aborts the execution of a kernel. When the abort is detected by the runtime, 
// it raises a runtime_exception on the host with the error message 
// "Reference Rasterizer: Shader abort instruction hit". 
// 
// D3D11 ERROR: Reference Rasterizer: Shader abort instruction hit at IP 462 [ EXECUTION ERROR #2097409: SHADER_ABORT] 
// D3D11 ERROR: Reference Rasterizer: Shader abort instruction hit at IP 462 [ EXECUTION ERROR #2097409: SHADER_ABORT] 
// D3D11 ERROR: Reference Rasterizer: Shader abort instruction hit at IP 462 [ EXECUTION ERROR #2097409: SHADER_ABORT] 
// D3D11 ERROR: Reference Rasterizer: Shader abort instruction hit at IP 462 [ EXECUTION ERROR #2097409: SHADER_ABORT] 
// 
void AmpExamples::DebugHelpers() 
{ 
    cout << "\nDebugging Support in AMP.\n"; 
 
    const int width = 2; 
    const int height = 2; 
    const int size = width * height; 
 
    vector<int> data(size); 
 
    int i = 0; 
    generate(data.begin(), data.end(), [&i]{ return ++i; }); 
 
    // In DEBUG mode with GPU only selected, av will be ref! 
    // In other build configurations, these helpers will be replaced with NOOP. 
    //accelerator_view av = accelerator().create_view(); 
    accelerator_view av = accelerator(accelerator::direct3d_ref).default_view; 
 
    wcout << L"\n   device: " << av.get_accelerator().description << endl; 
 
    concurrency::extent<2> ext(width, height); 
    array_view<int, 2> view(ext, data); 
 
    // printf 
    parallel_for_each(av, ext, [=](index<2> idx) restrict(amp) { 
        view[idx]++; 
        direct3d_printf("   view[%d,%d] = %d\n", idx[0], idx[1], view[idx]);                // Limit is 7 parameters, will throw exception in RELEASE 
    }); 
 
    view.synchronize(); 
 
    // errorf 
    try 
    { 
        parallel_for_each(av, ext, [=](index<2> idx) restrict(amp) { 
            direct3d_errorf("   errorf: av[idx] = %d\n", view[idx]); 
            view[idx] *= 10; 
        }); 
 
        view.synchronize(); 
    } 
    catch(runtime_exception& e) 
    { 
        cout << "\n   errorf caused runtime exception: " << e.what() << endl; 
    } 
 
    // abort 
    try 
    { 
        parallel_for_each(av, ext, [=](index<2> idx) restrict(amp) { 
            view[idx] *= 10; 
            direct3d_abort();                            // This will terminate the program when debugging in GPU only mode 
        }); 
 
        view.synchronize(); 
    } 
    catch(runtime_exception& e) 
    { 
        cout << "\n   aborted: " << e.what() << endl; 
    } 
} 
Please note that the direct3d_printf calls must be commented out or removed once you are done debugging - when I ran my solution in RELEASE mode they caused the process to crash.
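One way to avoid sprinkling #ifdefs through kernels is to route the call through a small macro that compiles away outside debug builds. This is only a suggestion of mine, not something used in the sample solution:

C++
// Hypothetical helper (not part of the sample): expands to direct3d_printf in debug builds 
// and to a no-op otherwise, so release kernels contain no printf calls at all. 
#ifdef _DEBUG 
#define AMP_TRACE(...) concurrency::direct3d::direct3d_printf(__VA_ARGS__) 
#else 
#define AMP_TRACE(...) ((void)0) 
#endif 

// Usage inside a kernel: 
// parallel_for_each(av, ext, [=](index<2> idx) restrict(amp) { 
//     view[idx]++; 
//     AMP_TRACE("   view[%d,%d] = %d\n", idx[0], idx[1], view[idx]); 
// }); 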
 
We've made it to measuring AMP performance. You should either use array, or call wait() on the accelerator_view (or synchronize() on the array_view), to get proper results. Still, I got some odd numbers when I ran my code. One important thing to understand is that parallel_for_each is asynchronous, even though it looks synchronous to the host: once you invoke it, execution is scheduled on the device and control returns to the host, and you are only guaranteed to see the results after the kernel execution completes and the data has been synchronized. I skipped the topic of synchronization between the device and the host, but you will find well-commented examples in the solution.
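As a rough illustration of the pattern (my own sketch, not code from the solution, assuming the same using declarations as the rest of the examples), timing a kernel with array_view looks like this - the wait() call is what makes the measured interval include the actual kernel execution rather than just its scheduling. The full array-based example follows below.

C++
accelerator_view device = accelerator().default_view; 

vector<float> data(1024 * 1024, 1.0f); 
array_view<float, 1> av(static_cast<int>(data.size()), data); 

time_point<system_clock> start = system_clock::now(); 

parallel_for_each(device, av.extent, [=](index<1> idx) restrict(amp) { 
    av[idx] *= 2.0f; 
}); 

device.wait();            // Without this we would only measure how long it took to schedule the kernel 

time_point<system_clock> stop = system_clock::now(); 
long long us = duration_cast<microseconds>(stop - start).count(); 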
C++
/* 
Measuring Performance 2. (In release mode) 
 
   device: NVIDIA Quadro 5000M 
 
   0 executed in  66991us ( 66ms) : copy-in    0us, kernel  66033us, copy 2  958us 
   1 executed in  19012us ( 19ms) : copy-in  994us, kernel  17023us, copy 2  994us 
   2 executed in  30010us ( 30ms) : copy-in    0us, kernel  29039us, copy 2  970us 
   3 executed in  48035us ( 48ms) : copy-in 1003us, kernel  45059us, copy 2 1972us 
   4 executed in  68055us ( 68ms) : copy-in  996us, kernel  66019us, copy 2 1039us 
   5 executed in  96026us ( 96ms) : copy-in  999us, kernel  93051us, copy 2 1975us 
   6 executed in 140032us (140ms) : copy-in 2008us, kernel 136021us, copy 2 2002us 
   7 executed in 192019us (192ms) : copy-in 1997us, kernel 187037us, copy 2 2984us 
   8 executed in 230028us (230ms) : copy-in 2996us, kernel 224043us, copy 2 2989us 
   9 executed in 281015us (281ms) : copy-in 3000us, kernel 275015us, copy 2 2998us 
 
Measuring Performance 2. (One more run) 
 
   device: NVIDIA Quadro 5000M 
 
   0 executed in  10997us ( 10ms) : copy-in  975us, kernel   9048us, copy 2  973us 
   1 executed in  18967us ( 18ms) : copy-in 1003us, kernel  16987us, copy 2  976us 
   2 executed in  31041us ( 31ms) : copy-in  999us, kernel  29009us, copy 2 1033us 
   3 executed in  48008us ( 48ms) : copy-in    0us, kernel  45996us, copy 2 2011us 
   4 executed in  68004us ( 68ms) : copy-in    0us, kernel  67023us, copy 2  980us 
   5 executed in  96010us ( 96ms) : copy-in 1000us, kernel  92998us, copy 2 2011us 
   6 executed in 140985us (140ms) : copy-in 2002us, kernel 137009us, copy 2 1973us 
   7 executed in 194028us (194ms) : copy-in 2003us, kernel 190041us, copy 2 1984us 
   8 executed in 228027us (228ms) : copy-in 3025us, kernel 223019us, copy 2 1983us 
   9 executed in 280027us (280ms) : copy-in 2996us, kernel 274053us, copy 2 2977us 
 
Measuring Performance 2. (Ran outside of VS) 
 
   device: NVIDIA Quadro 5000M 
 
   0 executed in  15565us ( 15ms) : copy-in 0us, kernel  15565us, copy 2     0us 
   1 executed in  15615us ( 15ms) : copy-in 0us, kernel  15615us, copy 2     0us 
   2 executed in  31177us ( 31ms) : copy-in 0us, kernel  31177us, copy 2     0us 
   3 executed in  46790us ( 46ms) : copy-in 0us, kernel  46790us, copy 2     0us 
   4 executed in  77949us ( 77ms) : copy-in 0us, kernel  62401us, copy 2 15547us 
   5 executed in 109174us (109ms) : copy-in 0us, kernel  93627us, copy 2 15546us 
   6 executed in 140438us (140ms) : copy-in 0us, kernel 124847us, copy 2 15591us 
   7 executed in 187180us (187ms) : copy-in 0us, kernel 187180us, copy 2     0us 
   8 executed in 234037us (234ms) : copy-in 0us, kernel 218435us, copy 2 15602us 
   9 executed in 280803us (280ms) : copy-in 0us, kernel 265210us, copy 2 15593us 
*/ 
void AmpExamples::MeasurePerformance2() 
{ 
    cout << "\nMeasuring Performance 2.\n\n"; 
 
    time_point<system_clock> start = system_clock::now(); 
    time_point<system_clock> stop = system_clock::now(); 
    time_point<system_clock> tmStart = system_clock::now(); 
 
    accelerator_view device = accelerator().default_view; 
    wcout << L"   device: " << device.accelerator.description << endl << endl; 
 
    WarmUp(device); 
 
    // 10 samples increasing amount of data by i in each loop 
    for (int i = 0; i < 10; ++i) 
    { 
        // Number of rows and columns for both matrices 
        const int r1 = 300 + 100 * i; 
        const int c1 = 500 + 100 * i; 
        const int r2 = c1; 
        const int c2 = 400 + 100 * i; 
 
        assert(c1 == r2);                    // columns in m1 == rows in m2 
 
        vector<float> va(r1 * c1); 
        vector<float> vb(r2 * c2); 
        vector<float> vc(r1 * c2);            // resultant matrix 
 
        RandomFill(va); 
        RandomFill(vb); 
 
        concurrency::extent<2> ea(r1, c1); 
        concurrency::extent<2> eb(r2, c2); 
        concurrency::extent<2> ec(r1, c2); 
 
        // Using array only to measure performance. If we used array_view, we would have 
        // to force synchronization manually because parallel_for_each is asynchronous: 
        // when parallel_for_each returns, the computation has only been scheduled on the device. 
        // To force execution you need to call wait() on the accelerator_view. 
        array<float, 2> a(ea); 
        array<float, 2> b(eb); 
        array<float, 2> c(ec); 
 
        // Copy underlying data to the device 
        tmStart = system_clock::now(); 
        start = system_clock::now(); 
 
        copy(va.begin(), a); 
        copy(vb.begin(), b); 
 
        stop = system_clock::now(); 
        long long tmCopy1 = duration_cast<microseconds>(stop - start).count(); 
 
        // Run kernel 
        start = system_clock::now(); 
 
        MultiplyMatrices(device, c, a, b); 
 
        device.wait();                    // Ensure that kernel completed execution!!! 
 
        stop = system_clock::now(); 
        long long tmKernel = duration_cast<microseconds>(stop - start).count(); 
 
        // Copy data back to the host 
        start = system_clock::now(); 
 
        copy(c, vc.begin()); 
 
        stop = system_clock::now(); 
        long long tmCopy2 = duration_cast<microseconds>(stop - start).count(); 
 
        long long ms = duration_cast<milliseconds>(stop - tmStart).count(); 
        long long us = duration_cast<microseconds>(stop - tmStart).count(); 
        cout << "   " << i << " executed in " << us << "us (" << ms << "ms)"  
             << " : copy-in " << tmCopy1  
             << "us, kernel " << tmKernel 
             << "us, copy 2 " << tmCopy2  
             << "us" <<  endl; 
    } 
} 
Enjoy!
 

Source Code Files

  • AmpExamples - contains 46 functions, each demonstrating a different C++ AMP concept