[Untitled]

不喝汽水的小甜甜

Published 2022-03-18 15:01:39

Tags: C language

Article link: https://blog.csdn.net/thj_2017720/article/details/123575144

Using the Thrust Library in CUDA

GaryW666

Last modified 2022-02-20 22:43:33

Category: CUDA study notes. Tags: cuda

First published 2021-01-05 21:02:20

Article link: https://blog.csdn.net/qq_43707919/article/details/112251190

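Anyone who has worked with CUDA has probably met reduction algorithms: reductions for sums, maxima and minima, variance and standard deviation, and so on. To keep the running time down, we often spend a lot of effort tuning a hand-written reduction, dealing with how work is spread across threads, thread divergence, bank conflicts, and similar issues. With small inputs it is still possible to calmly analyze every remaining optimization opportunity, but with large inputs it often feels as though the tuning is never quite enough. Don't panic: this is where the CUDA toolkit shows its strength. The Thrust library ships ready-made, well-optimized implementations of these operations, so in most cases you no longer need to write them by hand.
This post records the Thrust functions I have used recently, including reduce, sort, unique, and others.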

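1. vector

Before getting to the functions, a note on the vector container types that Thrust provides. Thrust defines two vectors, one for the host and one for the device, declared in thrust/host_vector.h and thrust/device_vector.h respectively. Declaring them is simple:

thrust::host_vector<type> hvec;
thrust::device_vector<type> dvec;
dvec = hvec; // a host vector can be assigned directly to a device vector (and vice versa); the assignment does the copy you would otherwise write with cudaMemcpy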

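Thrust also defines the device_ptr pointer type. When a raw pointer passed to a Thrust function points to device memory, it has to be wrapped in a device_ptr:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

float array[6] = { 3, 1, 2, 3, 5, 4 };
float *dev_array = 0;
cudaMalloc(&dev_array, sizeof(float) * 6);
cudaMemcpy(dev_array, array, sizeof(float) * 6, cudaMemcpyHostToDevice);
thrust::device_ptr<float> dev_ptr(dev_array);
thrust::reduce(dev_ptr, dev_ptr + 6); // dev_array points to device memory, so it cannot be passed directly and must be wrapped first

thrust::host_vector<type> hvec;
thrust::device_vector<type> dvec;
dvec = hvec;
thrust::reduce(dvec.begin(), dvec.end()); // the arguments are already device iterators, so no device_ptr wrapper is needed

// Both call styles also have host-side versions, where the pointers or iterators refer to host data
thrust::reduce(array, array + 6);
thrust::reduce(hvec.begin(), hvec.end());

// To get the raw pointer back out of a device_ptr, use raw_pointer_cast
float *raw_ptr = thrust::raw_pointer_cast(dev_ptr);

The code above shows the Thrust data types and the four ways of calling a function. In the examples that follow, each function is shown with only one calling style; the other styles work analogously.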

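2. reduce

thrust::reduce performs reduction operations and is declared in thrust/reduce.h. It returns a single value, and calling it is simple:

// dev_xvalue is a float* pointing to N values in device memory
thrust::device_ptr<float> dev_xptr(dev_xvalue);
double sum = thrust::reduce(dev_xptr, dev_xptr + N);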

You can also use thrust::transform_reduce to define your own reduction. The example below computes a variance:

// Define the per-element transformation to apply before reducing
struct variance
{
	variance(float m): mean(m) { }
	const float mean;
	__host__ __device__ float operator()(float data) const
	{
		return (data - mean) * (data - mean);
	}
};
// The mean must be computed beforehand (sum the data with reduce, then divide by N)
float var = thrust::transform_reduce(dev_ptr, dev_ptr + N, variance(mean), 0.0f, thrust::plus<float>()) / N;

3. sort

The sort function sorts a range of data and is declared in thrust/sort.h. It is also simple to use:

int A[6] = {1, 4, 2, 8, 5, 7};
thrust::sort(A, A + 6);
// result: {1, 2, 4, 5, 7, 8}; the default order is ascending

// sort_by_key reorders the values according to their keys
const int N = 6;
int keys[N] = {1, 4, 2, 8, 5, 7};
char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};
thrust::sort_by_key(keys, keys + N, values);
// result: keys   {1, 2, 4, 5, 7, 8}
//         values {'a', 'c', 'b', 'e', 'f', 'd'}

sort also accepts a user-defined comparison. The example below sorts 3D points in ascending order, comparing x first, then y, then z:

struct compRule
{
	__host__ __device__
	bool operator()(const float3 &p1, const float3 &p2) const
	{
		if (p1.x != p2.x)
			return p1.x < p2.x;
		if (p1.y != p2.y)
			return p1.y < p2.y;
		return p1.z < p2.z; // false for equal points, as a strict ordering requires
	}
};
thrust::host_vector<float3> p(10);
thrust::sort(p.begin(), p.end(), compRule());
// input                  output
// p[0] = { 0, 0, 0 };    p[0] = { 0, 0, 0 };
// p[1] = { 0, 1, 0 };    p[1] = { 0, 0, 1 };
// p[2] = { 1, 0, 0 };    p[2] = { 0, 1, 0 };
// p[3] = { 1, 0, 1 };    p[3] = { 0, 1, 1 };
// p[4] = { 1, 1, 1 };    p[4] = { 0, 1, 1 };
// p[5] = { 0, 1, 1 };    p[5] = { 1, 0, 0 };
// p[6] = { 0, 0, 1 };    p[6] = { 1, 0, 1 };
// p[7] = { 1, 1, 0 };    p[7] = { 1, 0, 1 };
// p[8] = { 0, 1, 1 };    p[8] = { 1, 1, 0 };
// p[9] = { 1, 0, 1 };    p[9] = { 1, 1, 1 };

Finally, there is stable_sort. Unlike sort, it preserves the relative order of elements that compare equal.

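4. max_element (min_element)

As the names suggest, max_element and min_element find the maximum and minimum of a range. They are declared in thrust/extrema.h and are called as follows:

thrust::device_vector<type>::iterator iter = thrust::max_element(dvec.begin(), dvec.end());
// the return value is an iterator
int position = iter - dvec.begin(); // position of the maximum (minimum)
type max_val = *iter; // the maximum (minimum) value itself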

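5. unique

The unique function, declared in thrust/unique.h, collapses each run of consecutive equal elements down to its first element; you can also supply your own equality predicate. Because only adjacent elements are compared, the data should be sorted first if every duplicate is to be removed. The example below removes duplicate points from the sorted vector p:

struct is_sam
{
	__host__ __device__
	bool operator()(const float3 &p1, const float3 &p2) const
	{
		return (p1.x == p2.x) && (p1.y == p2.y) && (p1.z == p2.z);
	}
};

// unique only moves the retained elements to the front and returns an iterator to the new end;
// it does not shrink the container
thrust::host_vector<float3>::iterator new_end = thrust::unique(p.begin(), p.end(), is_sam());
// combine it with the vector's erase function to actually delete the leftover tail
p.erase(new_end, p.end());

// input                  after unique           after erase
// p[0] = { 0, 0, 0 }     p[0] = { 0, 0, 0 }     p[0] = { 0, 0, 0 }
// p[1] = { 0, 0, 1 }     p[1] = { 0, 0, 1 }     p[1] = { 0, 0, 1 }
// p[2] = { 0, 1, 0 }     p[2] = { 0, 1, 0 }     p[2] = { 0, 1, 0 }
// p[3] = { 0, 1, 1 }     p[3] = { 0, 1, 1 }     p[3] = { 0, 1, 1 }
// p[4] = { 0, 1, 1 }     p[4] = { 1, 0, 0 }     p[4] = { 1, 0, 0 }
// p[5] = { 1, 0, 0 }     p[5] = { 1, 0, 1 }     p[5] = { 1, 0, 1 }
// p[6] = { 1, 0, 1 }     p[6] = { 1, 1, 0 }     p[6] = { 1, 1, 0 }
// p[7] = { 1, 0, 1 }     p[7] = { 1, 1, 1 }     p[7] = { 1, 1, 1 }
// p[8] = { 1, 1, 0 }     p[8] = { 0, 1, 1 }
// p[9] = { 1, 1, 1 }     p[9] = { 1, 0, 1 }
// (the values left in p[8] and p[9] after unique are unspecified leftovers)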

These are the only functions I have used so far; I will keep adding others as I come to use them.
