C++ AMP Basics

C++ AMP (Accelerated Massive Parallelism) is a parallel library plus a small language-level extension that enables heterogeneous computing in C++ applications. Visual Studio 2012 and later provide tools for debugging and profiling C++ AMP applications, including GPU debugging and GPU parallelism visualization. For applications that suit data-parallel computation, it can deliver significant speedups.

Microsoft's documentation has a getting-started example in the C++ AMP Overview:

https://msdn.microsoft.com/en-us/library/hh265137.aspx

A slide deck introducing basic AMP syntax and how to debug GPU code with Visual Studio:

http://www.gregcons.com/KateBlog/content/binary/GregoryCppAMP.pdf

Googling "c++ amp accelerated massive parallelism with microsoft visual c++ pdf" turns up a downloadable e-book.

A Chinese translation of the book is available:

https://download.csdn.net/download/qq_18521747/8906191

Book reference: Kate Gregory, Ade Miller. C++ AMP: 用Visual C++加速大规模并行计算 (C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++) [M]. 人民邮电出版社, 2014.

The basic idea is to move loops whose iterations are independent, which would normally run on the CPU, onto the GPU for acceleration. The GPU assigns a thread to each element of the computation; for c[1000] = a[1000] + b[1000], for example, it launches 1000 threads that compute simultaneously. AMP supports one-, two-, and three-dimensional array computations, the basic math library functions, and even FFT. (I could not get the C++ AMP FFT library to work, so I did not use it.)


For example, you might want to add {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10} to obtain {7, 9, 11, 13, 15}.

Without C++ AMP, the usual approach is to loop over the arrays and compute each value:




#include <iostream>  
  
void StandardMethod() {  
  
    int aCPP[] = {1, 2, 3, 4, 5};  
    int bCPP[] = {6, 7, 8, 9, 10};  
    int sumCPP[5];  
  
    for (int idx = 0; idx < 5; idx++)  
    {  
        sumCPP[idx] = aCPP[idx] + bCPP[idx];  
    }  
  
    for (int idx = 0; idx < 5; idx++)  
    {  
        std::cout << sumCPP[idx] << "\n";  
    }  
}  

 

Using C++ AMP, you might write the following code instead:

#include <amp.h>  
#include <iostream>  
using namespace concurrency;  
  
const int size = 5;  
  
void CppAmpMethod() {  
    int aCPP[] = {1, 2, 3, 4, 5};  
    int bCPP[] = {6, 7, 8, 9, 10};  
    int sumCPP[size];  
  
    // Create C++ AMP objects.  
    array_view<const int, 1> a(size, aCPP);  
    array_view<const int, 1> b(size, bCPP);  
    array_view<int, 1> sum(size, sumCPP);  
    sum.discard_data();  
  
    parallel_for_each(
        // Define the compute domain, which is the set of threads that are created.
        sum.extent,
        // Define the code to run on each thread on the accelerator.
        [=](index<1> idx) restrict(amp)
        {
            sum[idx] = a[idx] + b[idx];
        }
    );
  
    // Print the results. The expected output is "7, 9, 11, 13, 15".  
    for (int i = 0; i < size; i++) {  
        std::cout << sum[i] << "\n";  
    }  
}  
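
One point worth noting about the MSDN sample above: the array_view named sum copies its results back to sumCPP lazily, on the first CPU-side read in the print loop. As a minimal sketch (the function name explicit_sync_example is mine, not from the sample), the device-to-host copy can also be requested explicitly with array_view::synchronize():

#include <amp.h>
#include <vector>
using namespace concurrency;

void explicit_sync_example() {
    std::vector<int> data(5, 1);
    array_view<int, 1> nums_view(5, data);   // wraps the host vector

    parallel_for_each(nums_view.extent, [=](index<1> idx) restrict(amp)
    {
        nums_view[idx] *= 2;                 // runs on the accelerator
    });

    // Force the device-to-host copy now instead of waiting for the first CPU read.
    nums_view.synchronize();
    // data now holds {2, 2, 2, 2, 2}.
}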

The notes below are excerpted from the book: Kate Gregory, Ade Miller. C++ AMP: 用Visual C++加速大规模并行计算 [M]. 人民邮电出版社, 2014.

Chapter 3: C++ AMP Basics

  1. array<T,N>
  2. accelerator 与 accelerator_view
  3. index<N>
  4. extent<N>
  5. array_view<T,N>
  6. parallel_for_each
  7. restrict(amp)

These are all class templates and keywords; the chapter also covers copying data between the CPU and the GPU, and the math library functions.

3.1 array<T,N>

The array template lives in the concurrency namespace and takes two parameters, T and N. T is the element type of the collection; N is a positive integer giving the number of dimensions (the rank), typically 1, 2, or 3.

An array holds a collection of elements of the same type on the GPU, essentially a dense matrix. Each array is bound to a view of an accelerator (GPU), an accelerator_view. Every accelerator has at least one such view, and each accelerator has its own default view.

array<int,1> a(5); // declares a one-dimensional int array of 5 elements

Constructing the array also allocates the corresponding storage.

array<float,2> b(4,2);

array<int,3> c(4,3,2);

The three arrays declared above do not contain any values yet; the constructors only create empty (uninitialized) arrays. We can write elements into them afterwards, or copy the elements in when the array is created:

std::vector<int> v(5);

array<int,1> a(5,v.begin(),v.end());

The memory layout of array is fixed: all elements are stored in order in one contiguous block of memory (GPU device memory).

Retrieving data from an array (i.e., copying from GPU device memory back to CPU memory):

copy(a, v); // copy the array's data from GPU memory into the CPU-side vector v

An array is bound to a particular view of a particular accelerator. If the system has only one accelerator, the array uses that accelerator's default view. If multiple accelerators are installed, the code can be directed to a specific accelerator by passing an accelerator_view av to the constructor, which specifies where the array is created:

array<float,1> m(n,v.begin(),av);
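
Putting the pieces of this section together, here is a minimal sketch (the function name square_on_accelerator is mine, not from the book) that constructs an array on a chosen accelerator_view, runs a kernel on it, and copies the results back to the host with copy():

#include <amp.h>
#include <vector>
using namespace concurrency;

void square_on_accelerator(std::vector<float>& v) {
    const int n = static_cast<int>(v.size());

    // Bind the array to the default accelerator's default view explicitly.
    accelerator_view av = accelerator(accelerator::default_accelerator).default_view;
    array<float, 1> m(n, v.begin(), v.end(), av);   // host data is copied to the GPU here

    // A concurrency::array must be captured by reference in the kernel lambda.
    parallel_for_each(m.extent, [&m](index<1> idx) restrict(amp)
    {
        m[idx] = m[idx] * m[idx];
    });

    copy(m, v.begin());   // copy the results from GPU memory back into the host vector
}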


3.2 accelerator and accelerator_view

The accelerator class lives in the concurrency namespace; it can represent not only a GPU but also a virtual (emulated) accelerator.

An accelerator's memory can hold one or more arrays, and computations can be run on those arrays, optimized for data-parallel operations.


The function accelerator::get_all() returns a vector of the accelerators available at run time, so we can choose different code paths depending on how the target machine is configured.

For example, we can inspect accelerator properties to tell whether a device is an emulator or a real GPU, and query its capabilities, such as whether it supports double-precision computation.
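
As a sketch of such a capability check (the helper name pick_double_precision_accelerator is mine, not from the book), the list returned by accelerator::get_all() can be filtered and one of the devices made the process-wide default:

#include <amp.h>
#include <algorithm>
#include <vector>
using namespace concurrency;

bool pick_double_precision_accelerator() {
    std::vector<accelerator> accls = accelerator::get_all();

    // Look for a real (non-emulated) device with full double-precision support.
    auto it = std::find_if(accls.begin(), accls.end(), [](const accelerator& a)
    {
        return !a.is_emulated && a.supports_double_precision;
    });

    if (it == accls.end())
        return false;                                  // no suitable device found

    // Must be called before any other C++ AMP work in the process.
    return accelerator::set_default(it->device_path);
}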


An accelerator built with the default constructor represents the best accelerator selected by the runtime.

An accelerator is usually a physical device, and such a device may have several logical views. The views are isolated from one another: an accelerator_view is a unit of computation with its own isolated resources and execution context. Multiple threads can share one view, or separate views can be used on the same accelerator, which eliminates problems caused by shared state.

Every accelerator has a default view.


accelerator device(accelerator::default_accelerator);

accelerator_view av = device.default_view;

array<float,1> C(n,av);


The three lines of code above are equivalent to the single line below:

array<float,1> C(n);
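
To illustrate the earlier point about multiple isolated views on one device, here is a small sketch (the function and variable names are mine) that creates separate accelerator_view objects with accelerator::create_view():

#include <amp.h>
using namespace concurrency;

void make_isolated_views() {
    accelerator device(accelerator::default_accelerator);

    // Each call returns a new, independent view (execution context) on the same device.
    accelerator_view view_a = device.create_view();
    accelerator_view view_b = device.create_view();

    // Arrays created on different views do not share an execution context,
    // so work submitted through one view cannot interfere with the other.
    array<float, 1> data_a(1024, view_a);
    array<float, 1> data_b(1024, view_b);
}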

The following program prints configuration information about the C++ AMP-capable GPU accelerators on the local machine:


//===============================================================================
//
// Microsoft Press
// C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++
//
//===============================================================================
// Copyright (c) 2012-2013 Ade Miller & Kate Gregory.  All rights reserved.
// This code released under the terms of the 
// Microsoft Public License (Ms-PL), http://ampbook.codeplex.com/license.
//
// THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF
// ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO
// THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
// PARTICULAR PURPOSE.
//===============================================================================

#include <tchar.h>
#include <SDKDDKVer.h>
#include <iostream>
#include <iomanip>
#include <vector>
#include <amp.h>

using namespace concurrency;

// Note: This code is somewhat different from the code described in the book. It produces a more detailed
// output and accepts a /a switch that will show the REF and CPU accelerators. If you want the original 
// output, as shown on page 22, then the /o switch will produce that.

int _tmain(int argc, _TCHAR* argv[])
{
    bool show_all = false;
    bool old_format = false;
    if (argc > 1) 
    {
        if (std::wstring(argv[1]).compare(L"/a") == 0)
        {
            show_all = true;
        }
        if (std::wstring(argv[1]).compare(L"/o") == 0)
        {
            show_all = false;
            old_format = true;
        }
    }

    std::vector<accelerator> accls = accelerator::get_all();
    if (!show_all)
    {
        accls.erase(std::remove_if(accls.begin(), accls.end(), [](accelerator& a) 
        { 
            return (a.device_path == accelerator::cpu_accelerator) || (a.device_path == accelerator::direct3d_ref); 
        }), accls.end());
    }

    if (accls.empty())
    {
        std::wcout << "No accelerators found that are compatible with C++ AMP" << std::endl << std::endl;
        return 0;
    }
    std::cout << "Show " << (show_all ? "all " : "") << "AMP Devices (";
#if defined(_DEBUG)
    std::cout << "DEBUG";
#else
    std::cout << "RELEASE";
#endif
    std::cout <<  " build)" << std::endl;
    std::wcout << "Found " << accls.size() 
        << " accelerator device(s) that are compatible with C++ AMP:" << std::endl;
    int n = 0;
    if (old_format)
    {
        std::for_each(accls.cbegin(), accls.cend(), [=, &n](const accelerator& a)
        {
            std::wcout << "  " << ++n << ": " << a.description 
                << ", has_display=" << (a.has_display ? "true" : "false") 
                << ", is_emulated=" << (a.is_emulated ? "true" : "false")
                << std::endl;
        });
        std::wcout << std::endl;
        return 1;
    }

    std::for_each(accls.cbegin(), accls.cend(), [=, &n](const accelerator& a)
    {
        std::wcout << "  " << ++n << ": " << a.description << " "  
            << std::endl << "       device_path                       = " << a.device_path
            << std::endl << "       dedicated_memory                  = " << std::setprecision(4) << float(a.dedicated_memory) / (1024.0f * 1024.0f) << " Mb"
            << std::endl << "       has_display                       = " << (a.has_display ? "true" : "false") 
            << std::endl << "       is_debug                          = " << (a.is_debug ? "true" : "false") 
            << std::endl << "       is_emulated                       = " << (a.is_emulated ? "true" : "false") 
            << std::endl << "       supports_double_precision         = " << (a.supports_double_precision ? "true" : "false") 
            << std::endl << "       supports_limited_double_precision = " << (a.supports_limited_double_precision ? "true" : "false") 
            << std::endl;
    });
    std::wcout << std::endl;
	system("pause");
	return 1;
}

/*
Show AMP Devices (DEBUG build)
Found 3 accelerator device(s) that are compatible with C++ AMP:
  1: Intel(R) HD Graphics 4600
       device_path                       = PCI\VEN_8086&DEV_0416&SUBSYS_380117AA&REV_06\3&11583659&0&10
       dedicated_memory                  = 0.1099 Mb
       has_display                       = true
       is_debug                          = true
       is_emulated                       = false
       supports_double_precision         = true
       supports_limited_double_precision = true
  2: AMD Radeon HD 8570M
       device_path                       = PCI\VEN_1002&DEV_6663&SUBSYS_380117AA&REV_00\4&57D6125&0&0008
       dedicated_memory                  = 1.988 Mb
       has_display                       = false
       is_debug                          = true
       is_emulated                       = false
       supports_double_precision         = true
       supports_limited_double_precision = true
  3: Microsoft Basic Render Driver
       device_path                       = direct3d\warp
       dedicated_memory                  = 0 Mb
       has_display                       = false
       is_debug                          = true
       is_emulated                       = true
       supports_double_precision         = true
       supports_limited_double_precision = true
*/







