目标:
编写第一段CUDA C代码
了解为主机(Host)编写的Code与为设备(Device)编写的代码之间的区别
如何从Host上运行Device Code
了解如何在支持CUDA的Device上使用设备内存
了解如何查询系统中支持CUDA的设备信息
目录
方法一:在Device Code中使用Device pointer
1.CUDA C 与 标准C 相关概念:
主机(Host):CPU以及系统的内存。
设备(Device):GPU及其内存。
核函数(Kernel):在GPU设备上执行的函数。
主机代码(Host Code):在Linux(GNU gcc)、Windows(Microsoft Visual C)编译器来编译。
设备代码(Device Code):在Nvidia nvcc 编译器来编译
2.如何区分Host Code 和 Device Code
//__global__ 告诉编辑器,这是Device Code,函数应在Device上调用,圆括号里是传递给Device Code参数
__global__ void kernelFun(void){
}
//<<<>>>尖括号里面的参数不是传递Device Code的参数,而是运行时如何启动Device Code,后续详细阐述
int main(){
kernelFun<<<1,1>>>();
printf("Hello World:\n");
return 0;
}
(1)CUDA C通过某种语法将函数标记位“Device Code”,表示将Host Code发送到一个编译器,而将Device Code发送到一个编译器。
(2)CUDA C的优势提供了与C语言级别上的集成,使得Device Code调用看上去非常向Host Code调用。
后面再详细论述这个函数调用背后发送的动作。
总之,CUDA C编译器在运行时,负责实现从Host Code中调用Device Code.
3.如何给Device Code传递参数
概念:
1.可以像调用C函数那样将参数传递给Kernel。
运行时系统负责处理将参数从Host传递给Device的过程中的所有复杂操作。
2.当Device执行任何有用的操作时,都需要分配内存。
4.如何在Device上分配内存和释放内存
cudaMalloc()
相关文档:"CUDA_Runtime_API.pdf"
__host____device__cudaError_t cudaMalloc (void**devPtr, size_t size)
Allocate memory on the device.
Parameters
devPtr
- Pointer to allocated device memory
size
- Requested allocation size in bytes
Returns
cudaSuccess, cudaErrorMemoryAllocation
Description
Allocates size bytes of linear memory on the device and returns in *devPtr a pointer
to the allocated memory. The allocated memory is suitably aligned for any kind of
variable. The memory is not cleared. cudaMalloc() returns cudaErrorMemoryAllocation
in case of failure.
The device version of cudaFree cannot be used with a *devPtr allocated using the host
API, and vice versa.
See also:
cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray, cudaMalloc3D,
cudaMalloc3DArray, cudaMallocHost ( C API), cudaFreeHost, cudaHostAlloc
作用:告诉CUDA运行时在设备上分配内存。
参数:*devPtr 指针,指向保存新分配内存地址的变量;size为分配内存的大小。
返回类型:void
注意:
(1)可以将cudaMalloc()分配的指针传递给在Device上执行的kernel
(2)在Device Code中使用cudaMallc()分配的指针进行内存读/写操作
(3)将cudaMalloc()分配的指针传递给Host上执行的函数
(4)不能在Host Code对cudaMalloc()返回的指针进行解引用(Dereference),以来读取或者写入内存。
cudaFree()
__host____device__cudaError_t cudaFree (void *devPtr)
Frees memory on the device.
Parameters
devPtr
- Device pointer to memory to free
Returns
cudaSuccess, cudaErrorInvalidDevicePointer, cudaErrorInitializationError
Description
Frees the memory space pointed to by devPtr, which must have been returned by a
previous call to cudaMalloc() or cudaMallocPitch(). Otherwise, or if cudaFree(devPtr)
has already been called before, an error is returned. If devPtr is 0, no operation is
performed. cudaFree() returns cudaErrorInvalidDevicePointer in case of failure.
The device version of cudaFree cannot be used with a *devPtr allocated using the host
API, and vice versa.
Note that this function may also return error codes from previous, asynchronous
launches.
See also:
cudaMalloc, cudaMallocPitch, cudaMallocArray, cudaFreeArray, cudaMallocHost ( C
API), cudaFreeHost, cudaMalloc3D, cudaMalloc3DArray, cudaHostAlloc
5.如何访问Device内存
方法一:在Device Code中使用Device pointer
与标准C中指针的释放方式完全一样。
总的来说:主机指针只能访问Host Code中的内存,设备指针只能访问Device Code中的内存
方法二:调用cudaMemcpy()
cudaMemcpy():
通过cudaMemcpuKind参数指定Device内存究竟是源指针还是内存指针。
__host__cudaError_t cudaMemcpy (void *dst, const void
*src, size_t count, cudaMemcpyKind kind)
Copies data between host and device.
Parameters
dst
- Destination memory address
src
- Source memory address
count
- Size in bytes to copy
kind
- Type of transfer
Returns
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer,
cudaErrorInvalidMemcpyDirection
Description
Copies count bytes from the memory area pointed to by src to the memory area
pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost,
cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault
is recommended, in which case the type of transfer is inferred from the pointer values.
However, cudaMemcpyDefault is only allowed on systems that support unified virtual
addressing. Calling cudaMemcpy() with dst and src pointers that do not match the
direction of the copy results in an undefined behavior.
‣ Note that this function may also return error codes from previous, asynchronous
launches.
‣ This function exhibits synchronous behavior for most use cases.
See also:
cudaMemcpy2D, cudaMemcpyToArray, cudaMemcpy2DToArray,
cudaMemcpyFromArray, cudaMemcpy2DFromArray, cudaMemcpyArrayToArray,
cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol,
cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemcpyToArrayAsync,
cudaMemcpy2DToArrayAsync, cudaMemcpyFromArrayAsync,
cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync,
cudaMemcpyFromSymbolAsync
6.如何查询Device各种参数功能
cudaGetDevice
查询系统中有多少个设备是支持CUDA架构的
__host____device__cudaError_t cudaGetDevice (int
*device)
Returns which device is currently being used.
Parameters
device
- Returns the device on which the active host thread executes the device code.
Returns
cudaSuccess
Description
Returns in *device the current device for the calling host thread.
Note that this function may also return error codes from previous, asynchronous
launches.
See also:
cudaGetDeviceCount, cudaSetDevice, cudaGetDeviceProperties, cudaChooseDevice
Data types :cudaDeviceProp
Data types used by CUDA Runtime,包含了设备的相关属性,详细部分参照“CUDA_Runtime_API.pdf”
5.3. cudaDeviceProp Struct Reference
CUDA device properties
int cudaDeviceProp::asyncEngineCount
Number of asynchronous engines
int cudaDeviceProp::canMapHostMemory
Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer
int cudaDeviceProp::clockRate
Clock frequency in kilohertz
int cudaDeviceProp::computeMode
Compute mode (See cudaComputeMode)
int cudaDeviceProp::concurrentKernels
Device can possibly execute multiple kernels concurrently
int cudaDeviceProp::concurrentManagedAccess
Device can coherently access managed memory concurrently with the CPU
int cudaDeviceProp::deviceOverlap
Device can concurrently copy memory and execute a kernel. Deprecated. Use instead
asyncEngineCount.
cudaGetDeviceProperties
设备属性的使用,后面慢慢用到什么了解什么就阔以啦,没必要一次都搞明白
__host__cudaError_t cudaGetDeviceProperties
(cudaDeviceProp *prop, int device)
Returns information about the compute-device.
Parameters
prop
- Properties for the specified device
device
- Device number to get properties for
Returns
cudaSuccess, cudaErrorInvalidDevice
Description
Returns in *prop the properties of device dev. The cudaDeviceProp structure is defined
as:
struct cudaDeviceProp {
char name[256];
size_t totalGlobalMem;
size_t sharedMemPerBlock;
int regsPerBlock;
int warpSize;
size_t memPitch;
int maxThreadsPerBlock;
int maxThreadsDim[3];
int maxGridSize[3];
int clockRate;
size_t totalConstMem;
int major;
int minor;
size_t textureAlignment;
size_t texturePitchAlignment;
int deviceOverlap;
int multiProcessorCount;
int kernelExecTimeoutEnabled;
int integrated;
int canMapHostMemory;
int computeMode;
int maxTexture1D;
int maxTexture1DMipmap;
int maxTexture1DLinear;
int maxTexture2D[2];
int maxTexture2DMipmap[2];
int maxTexture2DLinear[3];
int maxTexture2DGather[2];
int maxTexture3D[3];
int maxTexture3DAlt[3];
int maxTextureCubemap;
int maxTexture1DLayered[2];
int maxTexture2DLayered[3];
int maxTextureCubemapLayered[2];
int maxSurface1D;
int maxSurface2D[2];
int maxSurface3D[3];
int maxSurface1DLayered[2];
int maxSurface2DLayered[3];
int maxSurfaceCubemap;
int maxSurfaceCubemapLayered[2];
size_t surfaceAlignment;
int concurrentKernels;
int ECCEnabled;
int pciBusID;
int pciDeviceID;
int pciDomainID;
int tccDriver;
int asyncEngineCount;
int unifiedAddressing;
int memoryClockRate;
int memoryBusWidth;
int l2CacheSize;
int maxThreadsPerMultiProcessor;
int streamPrioritiesSupported;
int globalL1CacheSupported;
int localL1CacheSupported;
size_t sharedMemPerMultiprocessor;
int regsPerMultiprocessor;
int managedMemSupported;
int isMultiGpuBoard;
int multiGpuBoardGroupID;
int singleToDoublePrecisionPerfRatio;
int pageableMemoryAccess;
int concurrentManagedAccess;
}
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
int main(void) {
cudaDeviceProp prop; //将设备拥有的属性填充到cudaDeviceProp结构中
int count;
cudaGetDeviceCount(&count); //支持CUDA架构的Device数量
for (int i = 0; i< count; i++) {
cudaGetDeviceProperties(&prop, i); //获取设备的属性
printf(" --- General Information for device %d ---\n", i);
printf("Name: %s\n", prop.name);
printf("Compute capability: %d.%d\n", prop.major, prop.minor);
printf("Clock rate: %d\n", prop.clockRate);
printf("Device copy overlap: ");
if (prop.deviceOverlap)
printf("Enabled\n");
else
printf("Disabled\n");
printf("Kernel execution timeout : ");
if (prop.kernelExecTimeoutEnabled)
printf("Enabled\n");
else
printf("Disabled\n");
printf(" --- Memory Information for device %d ---\n", i);
printf("Total global mem: %ld\n", prop.totalGlobalMem);
printf("Total constant Mem: %ld\n", prop.totalConstMem);
printf("Max mem pitch: %ld\n", prop.memPitch);
printf("Texture Alignment: %ld\n", prop.textureAlignment);
printf(" --- MP Information for device %d ---\n", i);
printf("Multiprocessor count: %d\n",
prop.multiProcessorCount);
printf("Shared mem per mp: %ld\n", prop.sharedMemPerBlock);
printf("Registers per mp: %d\n", prop.regsPerBlock);
printf("Threads in warp: %d\n", prop.warpSize);
printf("Max threads per block: %d\n",
prop.maxThreadsPerBlock);
printf("Max thread dimensions: (%d, %d, %d)\n",
prop.maxThreadsDim[0], prop.maxThreadsDim[1],
prop.maxThreadsDim[2]);
printf("Max grid dimensions: (%d, %d, %d)\n",
prop.maxGridSize[0], prop.maxGridSize[1],
prop.maxGridSize[2]);
printf("\n");
}
}
运行结果:
7.设备属性的使用
通过上述我们了解到,可以通过cudaGetDeviceProperties()来查询各种属性。
问题:在多GPU平台环境下,我们对每个Device设备进行迭代,操作有些繁琐,因此CUDA运行时,提供了一种自动方式来执行这个迭代操作。
NVIDIA的SLI(Scalable link Interface)可伸缩链路接口技术使得多个独立的GPU可以并排排列。
如果应用程序依赖于GPU的某些特定属性,或者需要在系统中最快的GPU上运行,那么你就应该熟悉这个API,因为CUDA运行时本身并不能保证为应用程序选择最优或者最合适的GPU。
cudaChooseDevice
__host__cudaError_t cudaChooseDevice (int *device,
const cudaDeviceProp *prop)
Select compute-device which best matches criteria.
Parameters
device
- Device with best match
prop
- Desired device properties
Returns
cudaSuccess, cudaErrorInvalidValue
Description
Returns in *device the device which has properties that best match *prop.
Note that this function may also return error codes from previous, asynchronous
launches.
See also:
cudaGetDeviceCount, cudaGetDevice, cudaSetDevice, cudaGetDeviceProperties
cudaSetDevice
__host__cudaError_t cudaSetDevice (int device)
Set device to be used for GPU executions.
Parameters
device
- Device on which the active host thread should execute the device code.
Returns
cudaSuccess, cudaErrorInvalidDevice, cudaErrorDeviceAlreadyInUse
Description
Sets device as the current device for the calling host thread. Valid device id's are 0 to(cudaGetDeviceCount() - 1).
Any device memory subsequently allocated from this host thread using cudaMalloc(),
cudaMallocPitch() or cudaMallocArray() will be physically resident on device.
Any host memory allocated from this host thread using cudaMallocHost() or
cudaHostAlloc() or cudaHostRegister() will have its lifetime associated with
device. Any streams or events created from this host thread will be associated with
device. Any kernels launched from this host thread using the <<<>>> operator or
cudaLaunchKernel() will be executed on device.
This call may be made from any host thread, to any device, and at any time. This
function will do no synchronization with the previous or new device, and should be
considered a very low overhead call.
Note that this function may also return error codes from previous, asynchronous
launches.
See also:
cudaGetDeviceCount, cudaGetDevice, cudaGetDeviceProperties, cudaChooseDevice
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
int main(void) {
cudaDeviceProp prop;
int dev;
cudaGetDevice(&dev);
printf("ID of current CUDA device: %d\n", dev);
memset(&prop, 0, sizeof(cudaDeviceProp)); //将设备属性填充到一个cudaDeviceProp结构中
prop.major = 1;
prop.minor = 3;
cudaChooseDevice(&dev, &prop); //CUDA运行时查询是否存在某个设备满足这些条件,并返回设备ID
printf("ID of CUDA device closest to revision 1.3: %d\n", dev);
cudaSetDevice(dev); //所有的设备操作都在这个设备上运行
}
8.小结:
- 通过增加修饰符使我们可以指定哪些代码在设备上运行,哪些代码在主机上运行。
- 添加关键字__global__告诉编译器把该函数放在GPU上运行。
- GPU上专门的内存对应的API,包括cudaMalloc()、cudaMemcpy()、cudaFree(),分别实现了分配设备内存,在设备和主机之间复制数据,以及释放设备内存等功能。