thrust(并行STL算法库)使用指南

最新推荐文章于 2024-05-20 23:52:50 发布

妈妈做的脆皮鱼

最新推荐文章于 2024-05-20 23:52:50 发布

阅读量407

点赞数

文章标签： c++ 算法人工智能

原文链接：https://docs.nvidia.com/cuda/thrust/#vectors

版权

THRUST算法库使用指南

1. 数据结构的相关属性和基本操作
- Vector的使用案例（host--CPU, device--GPU ）
2. 各种算法使用的通用模式
3. 更多的官方案例代码地址

简介：
借用官方的介绍： Thrust 是一个强大的并行算法和数据结构库。Thrust 为 GPU 编程提供了灵活的高级接口，大大提高了开发人员的工作效率。使用 Thrust情况下，C++ 开发人员只需编写几行代码即可执行 GPU 加速的排序、扫描、转换和缩减操作，其速度比最新的多核 CPU 快几个数量级。例如，thrust：：sort 算法的分拣性能比 STL 和 TBB 快 5 到 100 倍。
官方文档地址，关于该库的安装，在安装CUDA Toolkit时已经被默认安装。

1. 数据结构的相关属性和基本操作

Vector的使用案例（host–CPU, device–GPU ）

thrust::host_vector<>和thrust::device_vector<>相关的操作：
1. 初始化操作：
	thrust::device_vector<int> D(10, 1);    // 初始device_vector长度为10，值全为1。
	thrust::fill(D.begin(), D.begin() + 7, 9);   // 将device_vector的前7个元素全部设置为9.
	thrust::host_vector<int> H(D.begin(), D.begin() + 5);  // 使用D的前5个元素初始化一个host_vector变量
	thrust::sequence(H.begin(), H.end());        // 设置H的元素值为从0到n递增：0，1，2，3，4，...........
	thrust::copy(H.begin(), H.end(), D.begin());   // 将H中的元素全部复制到D中，也可以实现将host_vector<>中的数据复制到device_vector<>中。
	
2. 元素访问与修改操作：
	允许我们像STL中的vector一样遍历和修改元素。
3. size()和resize(n)操作:
	允许像STL中的vector一样，获取当前容器的大小以及更改容器的大小。
4. 内存自动释放操作

5. 指针操作（与原始指针之间的数据交换----thrust中指针与原始指针之间的相互转换）：
	5.1 原始指针被用作thrust算法的输入时，应当使用thrust::device_ptr对原始指针进行包装，再传入到thrust算法中。（如何将我们在GPU上的数据传递给thrust进行处理）
		size_t N = 10;
		// step1: 在device memory上申请原始指针
		int * raw_ptr; cudaMalloc((void **) &raw_ptr, N * sizeof(int));
		// step2: 使用device_ptr包装设备指针
		thrust::device_ptr<int> dev_ptr(raw_ptr);
		// step3: 在thrust algorithms中使用device_ptr
		thrust::fill(dev_ptr, dev_ptr + N, (int) 0);
	
	5.2 从device_ptr中获取原始指针，使用thrust::raw_pointer_cast：
		size_t N = 10;
		// 创建device_ptr
		thrust::device_ptr<int> dev_ptr = thrust::device_malloc<int>(N);
		// 从device_ptr中提取原始指针
		int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);		

6. STL中容器迭代器与thrust中迭代器的兼容。
	std::list<int> stl_list;
    stl_list.push_back(10); stl_list.push_back(20); 
    stl_list.push_back(30); stl_list.push_back(40);
    
	6.1 使用STL中的迭代器，初始化一个thrust::device_vector<>变量
		 thrust::device_vector<int> D(stl_list.begin(), stl_list.end());
		 
	6.2 将一个thrust::device_vector<>复制到STL的一个容器中
		std::vector<int> stl_vector(D.size());
  		thrust::copy(D.begin(), D.end(), stl_vector.begin());

2. 各种算法使用的通用模式

链接: 算法接口地址

2.1 Transformations

2.1.1 transformations的基本使用方法

定义： 转换是将算法应用于输入的所有数据，并将结果数据存储到指定的目标范围内。基本分为三大模块：Filling, Modifying, Replacing 。

// transformations的使用方法
thrust::device_vector<int> X(10);
thrust::device_vector<int> Y(10);
thrust::device_vector<int> Z(10);

// initialize X to 0,1,2,3, ....
thrust::sequence(X.begin(), X.end());

// compute Y = -X ，取反运算
thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

// fill Z with twos， 使用2填充Z
thrust::fill(Z.begin(), Z.end(), 2);

// 执行加法计算 Y = X + Y
 thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(),thrust::plus<float>());
 
// compute Y = X mod 2， 取余运算
thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());

// replace all the ones in Y with tens， 使用10替换所有为1的元素
thrust::replace(Y.begin(), Y.end(), 1, 10);

// print Y  ， 执行窗口输出
thrust::copy(Y.begin(), Y.end(), std::ostream_iterator<int>(std::cout, "\n"));

2.1.2 自定义transformations的方法（限制于输入与输出参数数量小于等于2）

*在某些情况下，thrust中提供的单个transformations方法无法实现我们的转换规则，可能需要多个transformations函数子结合使用才可以，但是这种方法执行的效率较低。为了能够更方便、高效的达到转换目的，*这里引入自定义转换函数。 但是transformations中自定义函数子不是任意的，定义的函数子必须保证，输入参数的数量只能是1个或者是2个。

// 自定义transformations中的转换规则
struct saxpy_functor
{
    const float a;

    saxpy_functor(float _a) : a(_a) {}

    __host__ __device__
        float operator()(const float& x, const float& y) const {
            return a * x + y;
        }
};

// 使用自定义的transformations结合的方式实现Y = A * X + Y
void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    // Y = A * X + Y
    thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A));
}

// 多个transformations结合的方式实现Y = A * X + Y
void saxpy_slow(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    thrust::device_vector<float> temp(X.size());
    // temp <- A
    thrust::fill(temp.begin(), temp.end(), A);
    // temp <- A * X
    thrust::transform(X.begin(), X.end(), temp.begin(), temp.begin(), thrust::multiplies<float>());
    // Y <- A * X + Y
    thrust::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(), thrust::plus<float>());
}

2.1.3 自定义transformations的方法（使用thrust::zip_iterator和thrust::for_each实现任意输入输出参数数量）

案例介绍： 该案例展示了自定义任意输入数量的transformations的函数子的方法。函数的形式为：output[i] = F(first[i], second[i], third[i], … ).
使用zip_iterator将四个vector "zipped"成一个元组序列。函数F接收任意输入数量的一个元组序列。并使用如下的方法去访问每个元组序列中的vector元素，其中t就是元组序列：

					get<0>(t) returns a reference to A[i],
					get<1>(t) returns a reference to B[i],
					get<2>(t) returns a reference to C[i],
					get<3>(t) returns a reference to D[i].
		本案例实现的转换为：D[i] = A[i] + B[i] * C[i];   三个输入和一个输出

#include <thrust/for_each.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>

#include <thrust/detail/config.h>

#if THRUST_CPP_DIALECT >= 2011 && !defined(THRUST_LEGACY_GCC)
#include <thrust/zip_function.h>
#endif // >= C++11

struct arbitrary_functor1
{
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        // D[i] = A[i] + B[i] * C[i];
        thrust::get<3>(t) = thrust::get<0>(t) + thrust::get<1>(t) * thrust::get<2>(t);
    }
};

int main(void)
{
    // allocate storage
    thrust::device_vector<float> A(5);
    thrust::device_vector<float> B(5);
    thrust::device_vector<float> C(5);
    thrust::device_vector<float> D1(5);

    // initialize input vectors
    A[0] = 3;  B[0] = 6;  C[0] = 2; 
    A[1] = 4;  B[1] = 7;  C[1] = 5; 
    A[2] = 0;  B[2] = 2;  C[2] = 7; 
    A[3] = 8;  B[3] = 1;  C[3] = 4; 
    A[4] = 2;  B[4] = 8;  C[4] = 3; 

    // apply the transformation
    thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin(), C.begin(), D1.begin())),
                     thrust::make_zip_iterator(thrust::make_tuple(A.end(),   B.end(),   C.end(),   D1.end())),
                     arbitrary_functor1());

    // print the output
    std::cout << "Tuple functor" << std::endl;
    for(int i = 0; i < 5; i++)
        std::cout << A[i] << " + " << B[i] << " * " << C[i] << " = " << D1[i] << std::endl;
#endif // >= C++11
}

2.2 Reductions操作

定义： 规约算法-----规约可以认为就是计算 x=x0x1x2x3…* xn，其中*可以是加法，乘法等运算。以加法为例，就相当于给定一个数组，对数组求和。

1. 规约求和运算，一下三行等效：第三个参数为初始值，第四个参数为规约运算操作。
int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());
int sum = thrust::reduce(D.begin(), D.end(), (int) 0);
int sum = thrust::reduce(D.begin(), D.end())

2. 计算一个vector中给定值的数量: thrust::count();
3. thrust::count_if;
4. thrust::min_element / thrust::max_element / thrust::is_sorted/ thrust::inner_product

2.3.1 基于核函数融合的方式去实现reductions操作（实现自定义的规约方法）

#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cmath>

// square<T> computes the square of a number f(x) -> x*x
template <typename T>
struct square
{
    __host__ __device__
        T operator()(const T& x) const {
            return x * x;
        }
};
int main(void)
{
    // initialize host array
    float x[4] = {1.0, 2.0, 3.0, 4.0};
    // transfer to device
    thrust::device_vector<float> d_x(x, x + 4);
    // setup arguments
    square<float>        unary_op;
    thrust::plus<float> binary_op;
    float init = 0;
    // compute norm
    float norm = std::sqrt( thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op) );
    std::cout << norm << std::endl;
    return 0;
}

2.4 Prefix-Sums扫描操作（元素保存的位置不变，但是元素值会发生变化）

1. 使用默认的扫描求和： 每一个元素等于其前面元素的累加和
#include <thrust/scan.h>
int data[6] = {1, 0, 2, 2, 1, 3};
thrust::inclusive_scan(data, data + 6, data); // in-place scan
// data is now {1, 1, 3, 5, 6, 9}

2. 带偏移量的扫描求和
int data[6] = {1, 0, 2, 2, 1, 3};
thrust::exclusive_scan(data, data + 6, data); // in-place scan
// data is now {0, 1, 1, 3, 5, 6}  0=0， 1=1+0， 1=0+1， 3=2+1， 5=2+3， 6=1+5.

2.5 reordering

copy_if : 通过判断条件决定是否进行复制
partition : 重新分区数据，根据bool值，true值在前，false置后。
remove and remove_if : 删除元素
unique: 移除连续的重复数据。

2.6 排序算法

2.6.1 thrust::sort和thrust::stable_sort（后者的区别在于，对于连续相等的元素，在排序后，其位置保持不变）

#include <thrust/sort.h>
const int N = 6;
int A[N] = {1, 4, 2, 8, 5, 7};
thrust::sort(A, A + N);
// A is now {1, 2, 4, 5, 7, 8}

2.6.2 thrust::sort_by_key和thrust::stable_sort_by_key

#include <thrust/sort.h>
const int N = 6;
int    keys[N] = {  1,   4,   2,   8,   5,   7};
char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};

thrust::sort_by_key(keys, keys + N, values);

// keys is now   {  1,   2,   4,   5,   7,   8}
// values is now {'a', 'c', 'b', 'e', 'f', 'd'}

2.6.3 用户自定义排序规则

#include <thrust/sort.h>
#include <thrust/functional.h>
const int N = 6;
int A[N] = {1, 4, 2, 8, 5, 7};
thrust::stable_sort(A, A + N, thrust::greater<int>());
// A is now {8, 7, 5, 4, 2, 1}

2.7 thrust迭代器家族（Fancy iterators）

2.7.1 常量迭代器constant_iterator

#include <thrust/iterator/constant_iterator.h>
...
// create iterators
thrust::constant_iterator<int> first(10);
thrust::constant_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 10
first[100] // returns 10

// sum of [first, last)
thrust::reduce(first, last);   // returns 30 (i.e. 3 * 10)

2.7.2 递增迭代器counting_iterator

#include <thrust/iterator/counting_iterator.h>
...
// create iterators
thrust::counting_iterator<int> first(10);
thrust::counting_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 11
first[100] // returns 110

// sum of [first, last)
thrust::reduce(first, last);   // returns 33 (i.e. 10 + 11 + 12)

2.7.3 转换操作transform_iterator（允许面向迭代器的transorformations操作）

#include <thrust/iterator/transform_iterator.h>
// initialize vector
thrust::device_vector<int> vec(3);
vec[0] = 10; vec[1] = 20; vec[2] = 30;

// create iterator (type omitted)
first = thrust::make_transform_iterator(vec.begin(), negate<int>());
last  = thrust::make_transform_iterator(vec.end(),   negate<int>());
first[0]   // returns -10
first[1]   // returns -20
first[2]   // returns -30
// sum of [first, last)
thrust::reduce(first, last);   // returns -60 (i.e. -10 + -20 + -30)

// 上面代码的另一种简化方式
// sum of [first, last)
thrust::reduce(thrust::make_transform_iterator(vec.begin(), negate<int>()),
               thrust::make_transform_iterator(vec.end(),   negate<int>()));

2.7.4 支持混合操作的迭代器permutation_iterator

#include <thrust/iterator/permutation_iterator.h>

...

// gather locations
thrust::device_vector<int> map(4);
map[0] = 3;
map[1] = 1;
map[2] = 0;
map[3] = 5;

// array to gather from
thrust::device_vector<int> source(6);
source[0] = 10;
source[1] = 20;
source[2] = 30;
source[3] = 40;
source[4] = 50;
source[5] = 60;

// fuse gather with reduction:
//   sum = source[map[0]] + source[map[1]] + ...
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
                         thrust::make_permutation_iterator(source.begin(), map.end()));

2.7.5 多序列压缩为一个元组序列（在前面的transformations中的自定义多个输入和输出中有用到）zip_iterator

#include <thrust/iterator/zip_iterator.h>
...
// initialize vectors
thrust::device_vector<int>  A(3);
thrust::device_vector<char> B(3);
A[0] = 10;  A[1] = 20;  A[2] = 30;
B[0] = 'x'; B[1] = 'y'; B[2] = 'z';

// create iterator (type omitted)
first = thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin()));
last  = thrust::make_zip_iterator(thrust::make_tuple(A.end(),   B.end()));

first[0]   // returns tuple(10, 'x')
first[1]   // returns tuple(20, 'y')
first[2]   // returns tuple(30, 'z')

// maximum of [first, last)
thrust::maximum< tuple<int,char> > binary_op;
thrust::tuple<int,char> init = first[0];
thrust::reduce(first, last, init, binary_op); // returns tuple(30, 'z')