Pinned Memory: Multi-Device Asynchronous Copy

USST_Coder

Category: CUDA. Tags: cuda, memory optimization, gpgpu

Article link: https://blog.csdn.net/yan31415/article/details/109998927


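Calling cudaMallocHost allocates pinned (page-locked) memory on the host. Compared with pageable memory, pinned memory copies faster. It also supports asynchronous copies to multiple devices at the same time, so the transfers can overlap.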

Using cudaMallocHost

cudaMallocHost // allocate pinned memory
cudaFreeHost // free pinned memory

Copy performance: pinned vs. pageable memory

Allocate 400 MiB of pinned memory and 400 MiB of pageable memory, copy each buffer to the device, and measure the execution time with nsys.
The code:

#include <cstdlib>
#include <cuda_runtime.h>

void TestHostPinnedMem()
{
	// Compare pinned and pageable host-to-device memcpy.
	const int size = 1024 * 1024 * 100;       // number of ints
	const size_t bytes = size * sizeof(int);  // 419,430,400 bytes (400 MiB)
	int* h1 = (int*)malloc(bytes);            // pageable host buffer
	int* d1;
	cudaMalloc(&d1, bytes);
	cudaMemcpy(d1, h1, bytes, cudaMemcpyHostToDevice);  // pageable -> device
	int* h2;
	cudaMallocHost(&h2, bytes);               // pinned host buffer
	cudaMemcpy(d1, h2, bytes, cudaMemcpyHostToDevice);  // pinned -> device
	cudaFree(d1);
	free(h1);
	cudaFreeHost(h2);
}

Copy timings. The pageable-memory copy:

Begins: 0.291734s
Ends: 0.37133s (+79.596 ms)
HtoD memcpy 419,430,400 bytes
Source memory kind: Pageable
Destination memory kind: Device
Throughput: 5.26951 GiB/s
Correlation ID: 207
Stream: Default stream (7)

The pinned-memory copy:

Begins: 0.512981s
Ends: 0.547286s (+34.305 ms)
HtoD memcpy 419,430,400 bytes
Source memory kind: Pinned
Destination memory kind: Device
Throughput: 12.2266 GiB/s
Correlation ID: 209
Stream: Default stream (7)

Comparing the two, the pinned copy is about 2.3 times faster in both throughput and elapsed time (34.305 ms versus 79.596 ms).
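The comparison above can be checked by hand from the byte count and the two durations. The sketch below is plain host-side C++ (it should compile with nvcc or any C++ compiler); the helper names `throughput_gb_per_s` and `speedup` are hypothetical and not part of CUDA or nsys. It assumes throughput is simply bytes divided by elapsed time, expressed in units of 10^9 bytes per second, which appears to be consistent with the figures nsys printed.

```cpp
#include <cassert>

// Throughput in units of 1e9 bytes per second, from a byte count and a
// duration in milliseconds. Assumption: this plain ratio is how the
// reported throughput relates to the reported duration.
double throughput_gb_per_s(double bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return bytes / (milliseconds * 1e-3) / 1e9;
}

// Ratio of the slower duration to the faster one.
double speedup(double slow_ms, double fast_ms) {
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}
```

For example, `throughput_gb_per_s(419430400.0, 79.596)` and `throughput_gb_per_s(419430400.0, 34.305)` correspond to the pageable and pinned copies, and `speedup(79.596, 34.305)` gives the ratio between them.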

Asynchronous memory copy

Allocate pinned memory and asynchronously copy 4 MiB of data to two different devices. The code:

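#include <cuda_runtime.h>

void TestAsyncPinnedMem()
{
	const int size = 1024 * 1024;             // number of ints (4 MiB)
	int* d0,*d1,*h1;
	cudaMallocHost(&h1, sizeof(int) * size);  // pinned host buffer
	// Device 0: create its stream and destination buffer.
	cudaSetDevice(0);
	cudaStream_t s0;
	cudaStreamCreate(&s0);
	cudaMalloc(&d0, size * sizeof(int));
	// Device 1: create its stream and destination buffer.
	cudaSetDevice(1);
	cudaStream_t s1;
	cudaStreamCreate(&s1);
	cudaMalloc(&d1, size * sizeof(int));
	// Issue both copies back to back, each in its own device's stream.
	cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
	cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
	// Wait for both copies to finish before releasing resources.
	cudaStreamSynchronize(s0);
	cudaStreamSynchronize(s1);
	cudaStreamDestroy(s0);
	cudaStreamDestroy(s1);
	cudaFree(d0);
	cudaFree(d1);
	cudaFreeHost(h1);
}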

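The nsys timeline of the two asynchronous copies shows that they overlap on the time axis, which means overlap was achieved.
[Figure: nsys timeline of the two pinned-memory copies]
The two copies took: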

Begins: 0.981878s
Ends: 0.982224s (+346.270 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pinned
Destination memory kind: Device
Throughput: 12.1128 GiB/s
Correlation ID: 213
Stream: Stream 15

Begins: 0.981892s
Ends: 0.982237s (+344.959 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pinned
Destination memory kind: Device
Throughput: 12.1588 GiB/s
Correlation ID: 214
Stream: Stream 25

Allocate pageable memory and asynchronously copy 4 MiB of data to two different devices. The code:

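#include <cstdlib>
#include <cuda_runtime.h>

void TestAsyncPagableMem()
{
	const int size = 1024 * 1024;             // number of ints (4 MiB)
	int* d0, * d1, * h1;
	h1 = (int*)malloc(sizeof(int) * size);    // pageable host buffer
	// Device 0: create its stream and destination buffer.
	cudaSetDevice(0);
	cudaStream_t s0;
	cudaStreamCreate(&s0);
	cudaMalloc(&d0, size * sizeof(int));
	// Device 1: create its stream and destination buffer.
	cudaSetDevice(1);
	cudaStream_t s1;
	cudaStreamCreate(&s1);
	cudaMalloc(&d1, size * sizeof(int));
	// Same calls as the pinned version; only the host buffer kind differs.
	cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
	cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
	cudaStreamSynchronize(s0);
	cudaStreamSynchronize(s1);
	cudaStreamDestroy(s0);
	cudaStreamDestroy(s1);
	cudaFree(d0);
	cudaFree(d1);
	free(h1);
}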

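The nsys timeline of the two asynchronous copies shows no overlap on the time axis: they execute serially. This is expected, since the CUDA runtime documentation notes that cudaMemcpyAsync from pageable host memory is staged through an intermediate buffer and is not fully asynchronous with respect to the host.
[Figure: nsys timeline of the two pageable-memory copies]
The two copies took the times listed below. Compared with the pinned-memory copies, each one takes roughly 1.8 to 2.4 times longer (about 624 μs and 843 μs versus about 345 μs). The overlapped pinned copies also kept about the same throughput as the single pinned copy measured earlier (about 12.1 to 12.2 GiB/s as reported by nsys), which indicates that overlapping copies to multiple devices did not lengthen either copy: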

Begins: 0.503774s
Ends: 0.504617s (+843.035 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pageable
Destination memory kind: Device
Throughput: 4.97524 GiB/s
Correlation ID: 212
Stream: Stream 15

Begins: 0.504744s
Ends: 0.505368s (+623.935 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pageable
Destination memory kind: Device
Throughput: 6.72234 GiB/s
Correlation ID: 213
Stream: Stream 25
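The overlap claims can also be checked numerically from the Begins/Ends timestamps in the nsys output, without reading the timeline image. The sketch below is plain host-side C++; the `Interval` struct and the `overlap_us` helper are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Start and end time of one copy, in seconds, as printed by nsys.
struct Interval {
    double begin_s;
    double end_s;
};

// Time in microseconds during which both copies were in flight.
// Returns 0 when the two intervals do not intersect.
double overlap_us(Interval a, Interval b) {
    assert(a.begin_s <= a.end_s && b.begin_s <= b.end_s);
    const double start = std::max(a.begin_s, b.begin_s);
    const double end = std::min(a.end_s, b.end_s);
    return std::max(0.0, end - start) * 1e6;
}
```

With the pinned timestamps, `overlap_us({0.981878, 0.982224}, {0.981892, 0.982237})` comes out at about 332 μs, so the two copies were in flight together for nearly their whole duration. With the pageable timestamps, `overlap_us({0.503774, 0.504617}, {0.504744, 0.505368})` is 0, because the second copy begins only after the first has ended.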
