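Calling cudaMallocHost allocates pinned (page-locked) host memory. Compared with pageable memory, pinned memory gives faster host-to-device copies. It also allows asynchronous copies to several devices to be in flight at the same time, so the transfers overlap.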
Using cudaMallocHost
cudaMallocHost // allocate pinned memory
cudaFreeHost // free pinned memory
Copy performance: pinned vs. pageable memory
Allocate 400 MiB of pinned memory and 400 MiB of pageable memory, copy each to the device, and read the execution times from nsys.
Code:
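#include <cstdlib>
#include <cuda_runtime.h>

void TestHostPinnedMem()
{
    // Compare pinned and pageable host-to-device memcpy.
    const int size = 1024 * 1024 * 100;  // 100 Mi ints = 400 MiB

    // Pageable host buffer (its contents do not matter for the timing).
    int* h1 = (int*)malloc(size * sizeof(int));
    int* d1;
    cudaMalloc(&d1, size * sizeof(int));
    cudaMemcpy(d1, h1, size * sizeof(int), cudaMemcpyHostToDevice);

    // Pinned host buffer, copied into the same device buffer.
    int* h2;
    cudaMallocHost(&h2, size * sizeof(int));
    cudaMemcpy(d1, h2, size * sizeof(int), cudaMemcpyHostToDevice);

    cudaFree(d1);
    free(h1);
    cudaFreeHost(h2);
}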
Copy timings follow. For pageable memory:
Begins: 0.291734s
Ends: 0.37133s (+79.596 ms)
HtoD memcpy 419,430,400 bytes
Source memory kind: Pageable
Destination memory kind: Device
Throughput: 5.26951 GiB/s
Correlation ID: 207
Stream: Default stream (7)
For pinned memory:
Begins: 0.512981s
Ends: 0.547286s (+34.305 ms)
HtoD memcpy 419,430,400 bytes
Source memory kind: Pinned
Destination memory kind: Device
Throughput: 12.2266 GiB/s
Correlation ID: 209
Stream: Default stream (7)
Comparing the two runs, the pinned copy is about 2.3 times faster, in both throughput and elapsed time (34.305 ms vs. 79.596 ms).
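As a quick check of the comparison above, the speedup and throughput can be recomputed from the byte count and the two durations in the listings. The following is a minimal sketch in plain host-side C++. The helper names ThroughputGBps, Speedup, and PrintHostToDeviceComparison are made up for this illustration and are not part of CUDA or nsys. The divisor 10^9 is an assumption chosen only because, for these two listings, it reproduces the throughput figures printed by nsys.

```cpp
#include <cassert>
#include <cstdio>

// Throughput in units of 10^9 bytes per second (assumed unit, see the note above).
double ThroughputGBps(double bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return bytes / (milliseconds * 1e-3) / 1e9;
}

// How many times faster the second transfer is than the first.
double Speedup(double slow_ms, double fast_ms) {
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}

// Prints the figures for the two 400 MiB copies from the listings.
void PrintHostToDeviceComparison() {
    const double bytes = 419430400.0;   // 100 Mi ints * 4 bytes
    const double pageable_ms = 79.596;  // duration from the Pageable listing
    const double pinned_ms = 34.305;    // duration from the Pinned listing
    std::printf("pageable throughput: %.4f\n", ThroughputGBps(bytes, pageable_ms));
    std::printf("pinned throughput:   %.4f\n", ThroughputGBps(bytes, pinned_ms));
    std::printf("speedup:             %.2f\n", Speedup(pageable_ms, pinned_ms));
}
```

Calling PrintHostToDeviceComparison() from a main function gives a speedup of roughly 2.3, in line with the conclusion above.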
Asynchronous memory copy
Allocate pinned memory and asynchronously copy 4 MiB of data to each of two devices. Code:
void TestAsyncPinnedMem()
{
    const int size = 1024 * 1024;  // 1 Mi ints = 4 MiB
    int *d0, *d1, *h1;
    cudaMallocHost(&h1, sizeof(int) * size);

    // Device 0: stream and destination buffer.
    cudaSetDevice(0);
    cudaStream_t s0;
    cudaStreamCreate(&s0);
    cudaMalloc(&d0, size * sizeof(int));

    // Device 1: stream and destination buffer.
    cudaSetDevice(1);
    cudaStream_t s1;
    cudaStreamCreate(&s1);
    cudaMalloc(&d1, size * sizeof(int));

    // Issue both copies from the same pinned buffer, one per device stream.
    cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    // Release the streams as well as the buffers.
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d0);
    cudaFree(d1);
    cudaFreeHost(h1);
}
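The nsys timeline for the two asynchronous copies shows that they overlap on the time axis, which confirms that overlap was achieved.
The execution times of the two copies are: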
Begins: 0.981878s
Ends: 0.982224s (+346.270 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pinned
Destination memory kind: Device
Throughput: 12.1128 GiB/s
Correlation ID: 213
Stream: Stream 15
Begins: 0.981892s
Ends: 0.982237s (+344.959 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pinned
Destination memory kind: Device
Throughput: 12.1588 GiB/s
Correlation ID: 214
Stream: Stream 25
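The overlap seen on the timeline can also be read off the Begins/Ends timestamps: two copies overlap when the later start comes before the earlier end. The following is a small host-side C++ sketch of that check. The Interval type and the functions OverlapSeconds and PinnedCopiesOverlapSeconds are hypothetical names introduced here for illustration, and the timestamps are the ones from the two listings above.

```cpp
#include <algorithm>
#include <cassert>

// A copy's time span in seconds, as shown by the Begins/Ends lines.
struct Interval {
    double begin_s;
    double end_s;
};

// Time in seconds during which both intervals are active; 0 if they are disjoint.
double OverlapSeconds(Interval a, Interval b) {
    assert(a.begin_s <= a.end_s && b.begin_s <= b.end_s);
    const double latest_begin = std::max(a.begin_s, b.begin_s);
    const double earliest_end = std::min(a.end_s, b.end_s);
    return std::max(0.0, earliest_end - latest_begin);
}

// Overlap of the two pinned copies, using the timestamps from the listings.
double PinnedCopiesOverlapSeconds() {
    const Interval copy_to_device0{0.981878, 0.982224};  // Stream 15
    const Interval copy_to_device1{0.981892, 0.982237};  // Stream 25
    return OverlapSeconds(copy_to_device0, copy_to_device1);
}
```

For these timestamps the overlap is about 332 μs, so the two copies, each roughly 345 μs long, run concurrently for almost their whole duration.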
Allocate pageable memory and asynchronously copy 4 MiB of data to each of two devices. Code:
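void TestAsyncPageableMem()
{
    const int size = 1024 * 1024;  // 1 Mi ints = 4 MiB
    int *d0, *d1, *h1;
    h1 = (int*)malloc(sizeof(int) * size);  // pageable host buffer

    // Device 0: stream and destination buffer.
    cudaSetDevice(0);
    cudaStream_t s0;
    cudaStreamCreate(&s0);
    cudaMalloc(&d0, size * sizeof(int));

    // Device 1: stream and destination buffer.
    cudaSetDevice(1);
    cudaStream_t s1;
    cudaStreamCreate(&s1);
    cudaMalloc(&d1, size * sizeof(int));

    // Same calls as in the pinned test, but the source is pageable memory.
    cudaMemcpyAsync(d0, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d1, h1, sizeof(int) * size, cudaMemcpyHostToDevice, s1);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    // Release the streams as well as the buffers.
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d0);
    cudaFree(d1);
    free(h1);
}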
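The nsys timeline for the two asynchronous copies shows no overlap on the time axis: they execute serially.
The execution times of the two copies are listed below. Compared with the pinned copies (about 346 μs and 345 μs), they take roughly 2.4 and 1.8 times as long. The overlapped pinned copies each ran at about 12.1 GiB/s, close to the 12.2 GiB/s of the single 400 MiB pinned copy, so overlapping the multi-device asynchronous copies did not lengthen either individual copy: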
Begins: 0.503774s
Ends: 0.504617s (+843.035 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pageable
Destination memory kind: Device
Throughput: 4.97524 GiB/s
Correlation ID: 212
Stream: Stream 15
Begins: 0.504744s
Ends: 0.505368s (+623.935 μs)
HtoD memcpy 4,194,304 bytes
Source memory kind: Pageable
Destination memory kind: Device
Throughput: 6.72234 GiB/s
Correlation ID: 213
Stream: Stream 25
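Beyond the per-copy durations, the timestamps also give the total wall-clock time needed to feed both devices, which is where overlap pays off most. The following host-side C++ sketch compares that total for the pageable and pinned runs. CopySpan, MakespanSeconds, GapSeconds, and PageableToPinnedMakespanRatio are hypothetical names used only for this illustration, and all timestamps are copied from the four listings in this section.

```cpp
#include <algorithm>
#include <cassert>

// A copy's time span in seconds, as shown by the Begins/Ends lines.
struct CopySpan {
    double begin_s;
    double end_s;
};

// Wall-clock time from the start of the earliest copy to the end of the latest.
double MakespanSeconds(CopySpan a, CopySpan b) {
    assert(a.begin_s <= a.end_s && b.begin_s <= b.end_s);
    return std::max(a.end_s, b.end_s) - std::min(a.begin_s, b.begin_s);
}

// Idle time between the end of the first copy and the start of the second.
// A negative value means the two copies overlap.
double GapSeconds(CopySpan first, CopySpan second) {
    return second.begin_s - first.end_s;
}

// Ratio of the pageable total time to the pinned total time.
double PageableToPinnedMakespanRatio() {
    const CopySpan pageable0{0.503774, 0.504617};  // Stream 15, pageable
    const CopySpan pageable1{0.504744, 0.505368};  // Stream 25, pageable
    const CopySpan pinned0{0.981878, 0.982224};    // Stream 15, pinned
    const CopySpan pinned1{0.981892, 0.982237};    // Stream 25, pinned
    return MakespanSeconds(pageable0, pageable1) /
           MakespanSeconds(pinned0, pinned1);
}
```

With these numbers the pageable pair spans about 1.59 ms, including an idle gap of about 127 μs between the two copies, while the pinned pair spans about 0.36 ms, a ratio of roughly 4.4.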