Pinned Memory: Multi-Device Asynchronous Copies

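Calling cudaMallocHost allocates pinned (page-locked) memory on the host. Compared with pageable memory, pinned memory copies to the device faster. It also lets asynchronous copies to multiple devices run at the same time, so the transfers overlap.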

Using cudaMallocHost

cudaMallocHost // allocate pinned memory
cudaFreeHost // free pinned memory

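Copy Performance: Pinned vs. Pageable Memory

Allocate 400 MiB of pinned memory and 400 MiB of pageable memory, copy each buffer to the device, and read the execution times from nsys.
Code: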

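#include <cstdlib>
#include <cuda_runtime.h>

void TestHostPinnedMem()
{
	// Compare pinned and pageable host-to-device memcpy.
	const int size = 1024 * 1024 * 100;	// 100 Mi ints = 400 MiB
	// Pageable host buffer from malloc; its contents do not matter for timing.
	int* h1 = (int*)malloc(size * sizeof(int));
	int* d1;
	cudaMalloc(&d1, size * sizeof(int));
	// Copy 1: pageable host memory -> device.
	cudaMemcpy(d1, h1, size * sizeof(int), cudaMemcpyHostToDevice);
	// Pinned (page-locked) host buffer from cudaMallocHost.
	int* h2;
	cudaMallocHost(&h2, size * sizeof(int));
	// Copy 2: pinned host memory -> device.
	cudaMemcpy(d1, h2, size * sizeof(int), cudaMemcpyHostToDevice);
	cudaFree(d1);
	free(h1);
	cudaFreeHost(h2);
}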

Copy timings. The pageable copy:

Begins: 0.291734s
Ends: 0.37133s (+79.596 ms)
HtoD memcpy 419,430,400 bytes
Source memory kind: Pageable
Destination memory kind: Device
Throughput: 5.26951 GiB/s
Correlation ID: 207
Stream: Default stream (7)

The pinned copy:

Begins: 0.512981s
Ends: 0.547286s (+34.305 ms)
HtoD memcpy 419,430,400 bytes
Source memory kind: Pinned
Destination memory kind: Device
Throughput: 12.2266 GiB/s
Correlation ID: 209
Stream: Default stream (7)

Comparing the two runs, the pinned copy is about 2.3x faster, in both throughput and execution time (34.3 ms versus 79.6 ms).
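The figures above can be re-derived from the byte count and the durations in the two trace records. The sketch below is a host-only illustration; `ThroughputGBps` and `Speedup` are hypothetical helper names, not part of CUDA or nsys. It assumes the reported throughput is simply bytes divided by elapsed time in units of 10^9 bytes per second, which is what the numbers in the records appear to match.

```cpp
#include <cassert>

// Hypothetical helper: throughput in units of 10^9 bytes per second,
// given a byte count and a duration in milliseconds.
double ThroughputGBps(double bytes, double milliseconds)
{
	assert(milliseconds > 0.0);
	return bytes / (milliseconds * 1e-3) / 1e9;
}

// Hypothetical helper: how many times faster the fast copy is than the slow one.
double Speedup(double slow_ms, double fast_ms)
{
	assert(fast_ms > 0.0);
	return slow_ms / fast_ms;
}
```

Calling `ThroughputGBps(419430400.0, 79.596)` gives roughly 5.27 and `ThroughputGBps(419430400.0, 34.305)` roughly 12.23, in line with the two records, while `Speedup(79.596, 34.305)` comes out at about 2.32.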

Asynchronous Memory Copies

Allocate pinned memory and asynchronously copy 4 MiB of data to two different devices. Code:

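void TestAsyncPinnedMem()
{
	const int size = 1024 * 1024;	// 1 Mi ints = 4 MiB
	int *d0, *d1, *h1;
	// One pinned host buffer is the source for both devices.
	cudaMallocHost(&h1, sizeof(int) * size);
	// Device 0: create a stream and allocate the destination buffer.
	cudaSetDevice(0);
	cudaStream_t s0;
	cudaStreamCreate(&s0);
	cudaMalloc(&d0, size * sizeof(int));
	// Device 1: same setup.
	cudaSetDevice(1);
	cudaStream_t s1;
	cudaStreamCreate(&s1);
	cudaMalloc(&d1, size * sizeof(int));
	// Issue both copies back to back, each on its own device's stream.
	cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
	cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
	// Wait for both copies to finish.
	cudaStreamSynchronize(s0);
	cudaStreamSynchronize(s1);
	// Release the streams and buffers.
	cudaStreamDestroy(s0);
	cudaStreamDestroy(s1);
	cudaFree(d0);
	cudaFree(d1);
	cudaFreeHost(h1);
}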

The nsys timeline for the two asynchronous copies shows that they overlap on the time axis, which confirms that overlap was achieved.
The two copies executed as follows:

Begins: 0.981878s
Ends: 0.982224s (+346.270 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pinned
Destination memory kind: Device
Throughput: 12.1128 GiB/s
Correlation ID: 213
Stream: Stream 15

Begins: 0.981892s
Ends: 0.982237s (+344.959 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pinned
Destination memory kind: Device
Throughput: 12.1588 GiB/s
Correlation ID: 214
Stream: Stream 25
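The overlap can also be checked numerically from the Begins/Ends timestamps instead of reading it off the timeline. This is a minimal host-only sketch; `Interval` and `OverlapSeconds` are hypothetical names introduced here for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical record type: begin/end timestamps of one copy, in seconds.
struct Interval { double begin; double end; };

// Length in seconds of the time during which both copies are active;
// returns 0 when the intervals do not intersect.
double OverlapSeconds(Interval a, Interval b)
{
	assert(a.begin <= a.end && b.begin <= b.end);
	double start = std::max(a.begin, b.begin);
	double stop = std::min(a.end, b.end);
	return std::max(0.0, stop - start);
}
```

For the two records above, `OverlapSeconds({0.981878, 0.982224}, {0.981892, 0.982237})` is about 0.000332 s, so the copies run concurrently for roughly 332 μs of their ~345 μs durations.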

Allocate pageable memory and asynchronously copy 4 MiB of data to two different devices. Code:

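void TestAsyncPageableMem()
{
	const int size = 1024 * 1024;	// 1 Mi ints = 4 MiB
	int *d0, *d1, *h1;
	// Pageable host buffer from malloc is the source for both devices.
	h1 = (int*)malloc(sizeof(int) * size);
	// Device 0: create a stream and allocate the destination buffer.
	cudaSetDevice(0);
	cudaStream_t s0;
	cudaStreamCreate(&s0);
	cudaMalloc(&d0, size * sizeof(int));
	// Device 1: same setup.
	cudaSetDevice(1);
	cudaStream_t s1;
	cudaStreamCreate(&s1);
	cudaMalloc(&d1, size * sizeof(int));
	// Same asynchronous calls as the pinned version; only the source differs.
	cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
	cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
	// Wait for both copies to finish.
	cudaStreamSynchronize(s0);
	cudaStreamSynchronize(s1);
	// Release the streams and buffers.
	cudaStreamDestroy(s0);
	cudaStreamDestroy(s1);
	cudaFree(d0);
	cudaFree(d1);
	free(h1);
}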

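The nsys timeline for the two asynchronous copies shows no overlap on the time axis: the copies run serially. This is consistent with the CUDA runtime staging pageable transfers through an internal pinned buffer, which can make cudaMemcpyAsync behave synchronously with respect to the host.
The two copies executed as shown below. Compared with the pinned copies, each one takes roughly 1.8x to 2.4x as long. The overlapped pinned copies each ran at about 12.1 GiB/s, close to the 12.2 GiB/s of the standalone pinned copy, so overlapping the multi-device asynchronous copies did not lengthen either copy: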

Begins: 0.503774s
Ends: 0.504617s (+843.035 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pageable
Destination memory kind: Device
Throughput: 4.97524 GiB/s
Correlation ID: 212
Stream: Stream 15

Begins: 0.504744s
Ends: 0.505368s (+623.935 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pageable
Destination memory kind: Device
Throughput: 6.72234 GiB/s
Correlation ID: 213
Stream: Stream 25
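Per-copy durations understate the difference, because the pageable copies also run one after the other. A rough way to compare the two experiments end to end is the span from the earliest Begins to the latest Ends in each pair of records. The sketch below is host-only; `CopySpan` and `MakespanMicros` are hypothetical names used for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical record type: begin/end timestamps of one copy, in seconds.
struct CopySpan { double begin; double end; };

// Wall-clock time in microseconds from the earliest begin to the latest end
// of two copies, regardless of whether they overlap.
double MakespanMicros(CopySpan a, CopySpan b)
{
	assert(a.begin <= a.end && b.begin <= b.end);
	double first = std::min(a.begin, b.begin);
	double last = std::max(a.end, b.end);
	return (last - first) * 1e6;
}
```

With the timestamps from the traces, the pageable pair `{0.503774, 0.504617}` and `{0.504744, 0.505368}` spans about 1594 μs, while the pinned pair `{0.981878, 0.982224}` and `{0.981892, 0.982237}` spans about 359 μs, so delivering the same data to both devices takes roughly 4.4x as long from pageable memory.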