
PMPP Chapter 2 – Heterogeneous data parallel computing

Over the past few years I have picked up bits and pieces of CUDA programming from blogs and videos, so my knowledge was scattered rather than systematic or complete. Then I came across Programming Massively Parallel Processors: A Hands-on Approach, 4th Edition. It is an excellent book, and it made me want to study CUDA carefully and systematically. Since I already have some background, this pass is mainly about deepening understanding and filling gaps, so I will not take detailed notes on the basics; the focus is on the end-of-chapter exercises and related programs. I aim to post weekly, finishing one chapter per week. My answers and interpretations may contain mistakes and omissions; corrections in the comments are welcome.

Key points

1 (void**)&d_a

Purpose: keep a pointer on the host (d_a) that points to memory on the device, so it can be used for cudaMemcpy, kernel launches, and so on.

To achieve this, the contents of d_a must be modified after it is declared, and modifying d_a requires its address.

That is why the prototype of cudaMalloc() takes a void** as well:

cudaError_t cudaMalloc(void** devPtr, size_t size);


int *d_a, *d_b, *d_c; 						// d_a is an uninitialized host pointer; here it happens to read as null
printf(" %p  %p\n", d_a, &d_a);

cudaMalloc((void **)&d_a, N * sizeof(int));	// after the call, d_a holds a concrete device address
printf(" %p  %p\n", d_a, &d_a);
...

Output:

 (nil)  0x7ffc611847e8				// nil means null: the pointer has not been initialized
 0x7ffb53200000  0x7ffc611847e8		// &d_a is unchanged, still a host address; d_a itself now holds a device address

2 The compilation process


CUDA C is an extension of C. A traditional compiler cannot handle the extended parts, so a compiler that understands them, such as NVCC, is required. The compiler processes a CUDA C program by using the CUDA keywords to separate host code from device code.

Code that runs on the device is marked with CUDA keywords →→ NVCC compiles it into PTX, a virtual instruction set →→ the runtime component of NVCC further compiles it (on the host) into a real object file (a cubin file) →→ these object files are linked into the final executable →→ at run time they are loaded onto the GPU and executed.

A PTX file is an intermediate form between a high-level language and assembly: a lower-level description, but not a binary file. The benefit of this design is that optimized binary code can be generated for different GPU architectures, improving both the portability and the performance of the program.
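These stages can be inspected directly, since nvcc can emit each intermediate artifact on its own. A sketch of the relevant commands (the file name vecadd.cu and the architecture sm_80 are placeholders for your own source and GPU):

```sh
# emit PTX (the virtual ISA, human-readable text) for the kernel file
nvcc -ptx vecadd.cu -o vecadd.ptx

# emit a real machine binary (cubin) for one specific architecture
nvcc -cubin -arch=sm_80 vecadd.cu -o vecadd.cubin

# normal build: host and device code compiled and linked into one executable
nvcc -arch=sm_80 vecadd.cu -o vecadd
```

Opening the .ptx file in a text editor is a quick way to confirm that it is not binary but an assembly-like listing.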

3 Function declaration keywords

__global__ marks a kernel function: called from the host, executed on the device. __device__ marks a helper function that can only be called from other __global__ or __device__ functions (it cannot run on its own): called from the device, executed on the device.

__host__ marks an ordinary CPU function.

qualifier keyword    | callable from    | executed on | executed by
__host__ (default)   | Host             | Host        | caller host thread
__global__           | Host (or Device) | Device      | new grid of device threads
__device__           | Device           | Device      | caller device thread
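A sketch putting the three qualifiers side by side (function names are made up for illustration):

```cuda
// __device__: called from device code only, runs on the device.
__device__ float scale(float x) { return 2.0f * x; }

// __global__: a kernel, launched from the host, runs as a grid of device threads.
__global__ void scale_kernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = scale(a[i]);
}

// __host__: an ordinary CPU function; the qualifier is implicit and usually omitted.
__host__ void launch(float *a_d, int n) {
    scale_kernel<<<(n + 255) / 256, 256>>>(a_d, n);
}
```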

Exercises

  1. If we want to use each thread in a grid to calculate one output element of a
    vector addition, what would be the expression for mapping the thread/block
    indices to the data index (i)?
    (A) i=threadIdx.x + threadIdx.y;
    (B) i=blockIdx.x + threadIdx.x;
    (C) i=blockIdx.x*blockDim.x + threadIdx.x;
    (D) i=blockIdx.x*threadIdx.x;

Answer: C

  2. Assume that we want to use each thread to calculate two adjacent elements of
    a vector addition. What would be the expression for mapping the thread/block
    indices to the data index (i) of the first element to be processed by a thread?
    (A) i=blockIdx.x*blockDim.x + threadIdx.x + 2;
    (B) i=blockIdx.x*threadIdx.x*2;
    (C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
    (D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;

Answer: C

  3. We want to use each thread to calculate two elements of a vector addition.
    Each thread block processes 2*blockDim.x consecutive elements that form
    two sections. All threads in each block will process a section first, each
    processing one element. They will then all move to the next section, each
    processing one element. Assume that variable i should be the index for the
    first element to be processed by a thread. What would be the expression for
    mapping the thread/block indices to data index of the first element?
    (A) i=blockIdx.x*blockDim.x + threadIdx.x +2;
    (B) i=blockIdx.x*threadIdx.x*2;
    (C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
    (D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;

Answer: D

  4. For a vector addition, assume that the vector length is 8000, each thread
    calculates one output element, and the thread block size is 1024 threads. The
    programmer configures the kernel call to have a minimum number of thread
    blocks to cover all output elements. How many threads will be in the grid?
    (A) 8000
    (B) 8196
    (C) 8192
    (D) 8200

Answer: C

  5. If we want to allocate an array of v integer elements in the CUDA device
    global memory, what would be an appropriate expression for the second
    argument of the cudaMalloc call?
    (A) n
    (B) v
    (C) n * sizeof(int)
    (D) v * sizeof(int)

Answer: D

  6. If we want to allocate an array of n floating-point elements and have a
    floating-point pointer variable A_d to point to the allocated memory, what
    would be an appropriate expression for the first argument of the cudaMalloc
    () call?
    (A) n
    (B) (void *) A_d
    (C) *A_d
    (D) (void **) &A_d

Answer: D

  7. If we want to copy 3000 bytes of data from host array A_h (A_h is a pointer
    to element 0 of the source array) to device array A_d (A_d is a pointer to
    element 0 of the destination array), what would be an appropriate API call
    for this data copy in CUDA?
    (A) cudaMemcpy(3000, A_h, A_d, cudaMemcpyHostToDevice);
    (B) cudaMemcpy(A_h, A_d, 3000, cudaMemcpyDeviceToHost);
    (C) cudaMemcpy(A_d, A_h, 3000, cudaMemcpyHostToDevice);
    (D) cudaMemcpy(3000, A_d, A_h, cudaMemcpyHostToDevice);

Answer: C

  8. How would one declare a variable err that can appropriately receive the
    returned value of a CUDA API call?
    (A) int err;
    (B) cudaError err;
    (C) cudaError_t err;
    (D) cudaSuccess_t err;

Answer: C
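A common way to use such an err variable is to check every API call against cudaSuccess. A fragment as a sketch, assuming A_d and n are declared as in the questions above and stdio.h/stdlib.h are included:

```cuda
cudaError_t err = cudaMalloc((void **)&A_d, n * sizeof(float));
if (err != cudaSuccess) {
    // cudaGetErrorString turns the error code into a readable message
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
```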

  9. Consider the following CUDA kernel and the corresponding host function
    that calls it:
    01 __global__ void foo_kernel(float* a, float* b, unsigned int N){
    02     unsigned int i=blockIdx.x*blockDim.x + threadIdx.x;
    03     if(i < N) {
    04         b[i]=2.7f*a[i] - 4.3f;
    05     }
    06 }
    07 void foo(float* a_d, float* b_d) {
    08     unsigned int N=200000;
    09     foo_kernel<<< (N + 128-1)/128, 128>>> (a_d, b_d, N);
    10 }
    a. What is the number of threads per block?
    b. What is the number of threads in the grid?
    c. What is the number of blocks in the grid?
    d. What is the number of threads that execute the code on line 02?
    e. What is the number of threads that execute the code on line 04?

Answer: a. 128; b. 200064; c. 1563; d. 200064; e. 200000

  10. A new summer intern was frustrated with CUDA. He has been complaining
    that CUDA is very tedious. He had to declare many functions that he plans
    to execute on both the host and the device twice, once as a host function and
    once as a device function. What is your response?

Answer:

Qualifier keywords on function names tell the compiler to separate CPU code from GPU code and to process and optimize each side appropriately: host code goes through an ordinary C compiler, while device code uses CUDA's extensions to C and needs a special compiler. More to the point for the intern, a function can be declared with both __host__ and __device__ at the same time, in which case NVCC generates a host version and a device version from a single definition, so nothing needs to be written twice.
