
PMPP Chapter 2 – Heterogeneous data parallel computing

Over the past few years I have picked up bits and pieces of CUDA programming from blogs and videos, so my knowledge was scattered rather than systematic or complete. Then I came across Programming Massively Parallel Processors: A Hands-on Approach, 4th Edition. It is an excellent book, and it made me want to study CUDA carefully and systematically. Since I already have some background, this pass is mainly about deepening understanding and filling gaps, so I will not take detailed notes on the basics; the focus is on the end-of-chapter exercises and related programs. I aim to post weekly, finishing one chapter per week. My answers and interpretations may contain mistakes and omissions; corrections in the comments are welcome.

Key points

1 (void**)&d_a

Purpose: keep a pointer on the host (d_a) that points to memory on the device, so it can be used for cudaMemcpy, kernel launches, and so on.

To achieve this, the contents of d_a must be modified after it is declared, and modifying d_a requires its address.

That is why the prototype of cudaMalloc() takes a void** as well:

cudaError_t cudaMalloc(void** devPtr, size_t size);


int *d_a, *d_b, *d_c; 						// d_a is an uninitialized host pointer; here it happens to read as null
printf(" %p  %p\n", d_a, &d_a);

cudaMalloc((void **)&d_a, N * sizeof(int));	// after the call, d_a holds a concrete device address
printf(" %p  %p\n", d_a, &d_a);
...

Output:

 (nil)  0x7ffc611847e8				// nil means null: the pointer has not been initialized
 0x7ffb53200000  0x7ffc611847e8		// &d_a is unchanged, still a host address; d_a itself now holds a device address

2 The compilation process


CUDA C is an extension of C. A traditional compiler cannot handle the extended parts, so a compiler that understands them, such as NVCC, is required. The compiler processes a CUDA C program by using the CUDA keywords to separate host code from device code.

Code that runs on the device is marked with CUDA keywords →→ NVCC compiles it into PTX, a virtual instruction set →→ the runtime component of NVCC further compiles it (on the host) into a real object file (a cubin file) →→ these object files are linked into the final executable →→ at run time they are loaded onto the GPU and executed.

A PTX file is an intermediate form between a high-level language and assembly: a lower-level description, but not a binary file. The benefit of this design is that optimized binary code can be generated for different GPU architectures, improving both the portability and the performance of the program.
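These stages can be inspected directly, since nvcc can emit each intermediate artifact on its own. A sketch of the relevant commands (the file name vecadd.cu and the architecture sm_80 are placeholders for your own source and GPU):

```sh
# emit PTX (the virtual ISA, human-readable text) for the kernel file
nvcc -ptx vecadd.cu -o vecadd.ptx

# emit a real machine binary (cubin) for one specific architecture
nvcc -cubin -arch=sm_80 vecadd.cu -o vecadd.cubin

# normal build: host and device code compiled and linked into one executable
nvcc -arch=sm_80 vecadd.cu -o vecadd
```

Opening the .ptx file in a text editor is a quick way to confirm that it is not binary but an assembly-like listing.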

3 Function declaration keywords

__global__ marks a kernel function: called from the host, executed on the device. __device__ marks a helper function that can only be called from other __global__ or __device__ functions (it cannot run on its own): called from the device, executed on the device.

__host__ marks an ordinary CPU function.

qualifier keyword    | callable from    | executed on | executed by
__host__ (default)   | Host             | Host        | caller host thread
__global__           | Host (or Device) | Device      | new grid of device threads
__device__           | Device           | Device      | caller device thread
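A sketch putting the three qualifiers side by side (function names are made up for illustration):

```cuda
// __device__: called from device code only, runs on the device.
__device__ float scale(float x) { return 2.0f * x; }

// __global__: a kernel, launched from the host, runs as a grid of device threads.
__global__ void scale_kernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = scale(a[i]);
}

// __host__: an ordinary CPU function; the qualifier is implicit and usually omitted.
__host__ void launch(float *a_d, int n) {
    scale_kernel<<<(n + 255) / 256, 256>>>(a_d, n);
}
```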

Exercises

  1. If we want to use each thread in a grid to calculate one output element of a
    vector addition, what would be the expression for mapping the thread/block
    indices to the data index (i)?
    (A) i=threadIdx.x + threadIdx.y;
    (B) i=blockIdx.x + threadIdx.x;
    (C) i=blockIdx.x*blockDim.x + threadIdx.x;
    (D) i=blockIdx.x*threadIdx.x;

Answer: C

  2. Assume that we want to use each thread to calculate two adjacent elements of
    a vector addition. What would be the expression for mapping the thread/block
    indices to the data index (i) of the first element to be processed by a thread?
    (A) i=blockIdx.x*blockDim.x + threadIdx.x + 2;
    (B) i=blockIdx.x*threadIdx.x*2;
    (C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
    (D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;

Answer: C

  3. We want to use each thread to calculate two elements of a vector addition.
    Each thread block processes 2*blockDim.x consecutive elements that form
    two sections. All threads in each block will process a section first, each
    processing one element. They will then all move to the next section, each
    processing one element. Assume that variable i should be the index for the
    first element to be processed by a thread. What would be the expression for
    mapping the thread/block indices to data index of the first element?
    (A) i=blockIdx.x*blockDim.x + threadIdx.x +2;
    (B) i=blockIdx.x*threadIdx.x*2;
    (C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
    (D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;

Answer: D

  4. For a vector addition, assume that the vector length is 8000, each thread
    calculates one output element, and the thread block size is 1024 threads. The
    programmer configures the kernel call to have a minimum number of thread
    blocks to cover all output elements. How many threads will be in the grid?
    (A) 8000
    (B) 8196
    (C) 8192
    (D) 8200

Answer: C

  5. If we want to allocate an array of v integer elements in the CUDA device
    global memory, what would be an appropriate expression for the second
    argument of the cudaMalloc call?
    (A) n
    (B) v
    (C) n * sizeof(int)
    (D) v * sizeof(int)

Answer: D

  6. If we want to allocate an array of n floating-point elements and have a
    floating-point pointer variable A_d to point to the allocated memory, what
    would be an appropriate expression for the first argument of the cudaMalloc
    () call?
    (A) n
    (B) (void *) A_d
    (C) *A_d
    (D) (void **) &A_d

Answer: D

  7. If we want to copy 3000 bytes of data from host array A_h (A_h is a pointer
    to element 0 of the source array) to device array A_d (A_d is a pointer to
    element 0 of the destination array), what would be an appropriate API call
    for this data copy in CUDA?
    (A) cudaMemcpy(3000, A_h, A_d, cudaMemcpyHostToDevice);
    (B) cudaMemcpy(A_h, A_d, 3000, cudaMemcpyDeviceToHost);
    (C) cudaMemcpy(A_d, A_h, 3000, cudaMemcpyHostToDevice);
    (D) cudaMemcpy(3000, A_d, A_h, cudaMemcpyHostToDevice);

Answer: C

  8. How would one declare a variable err that can appropriately receive the
    returned value of a CUDA API call?
    (A) int err;
    (B) cudaError err;
    (C) cudaError_t err;
    (D) cudaSuccess_t err;

Answer: C
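A common way to use such an err variable is to check every API call against cudaSuccess. A fragment as a sketch, assuming A_d and n are declared as in the questions above and stdio.h/stdlib.h are included:

```cuda
cudaError_t err = cudaMalloc((void **)&A_d, n * sizeof(float));
if (err != cudaSuccess) {
    // cudaGetErrorString turns the error code into a readable message
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
```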

  9. Consider the following CUDA kernel and the corresponding host function
    that calls it:
    01 __global__ void foo_kernel(float* a, float* b, unsigned int N){
    02     unsigned int i=blockIdx.x*blockDim.x + threadIdx.x;
    03     if(i < N) {
    04         b[i]=2.7f*a[i] - 4.3f;
    05     }
    06 }
    07 void foo(float* a_d, float* b_d) {
    08     unsigned int N=200000;
    09     foo_kernel<<< (N + 128-1)/128, 128>>> (a_d, b_d, N);
    10 }
    a. What is the number of threads per block?
    b. What is the number of threads in the grid?
    c. What is the number of blocks in the grid?
    d. What is the number of threads that execute the code on line 02?
    e. What is the number of threads that execute the code on line 04?

Answer: a. 128; b. 200064; c. 1563; d. 200064; e. 200000

  10. A new summer intern was frustrated with CUDA. He has been complaining
    that CUDA is very tedious. He had to declare many functions that he plans
    to execute on both the host and the device twice, once as a host function and
    once as a device function. What is your response?

Answer:

Qualifier keywords on function names tell the compiler to separate CPU code from GPU code and to process and optimize each side appropriately: host code goes through an ordinary C compiler, while device code uses CUDA's extensions to C and needs a special compiler. More to the point for the intern, a function can be declared with both __host__ and __device__ at the same time, in which case NVCC generates a host version and a device version from a single definition, so nothing needs to be written twice.
