
PMPP Chapter 2: Heterogeneous data parallel computing

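Over the past few years I have taught myself some CUDA programming on and off, mostly from blogs and videos. It was piecemeal learning, neither systematic nor complete. Then I came across "Programming Massively Parallel Processors: A Hands-on Approach, 4th Edition". It is an excellent book, and it made me want to study CUDA carefully and systematically. I already have some background, so this time the goal is deeper understanding and filling in gaps. I will not record much about the individual topics; the focus is on the end-of-chapter exercises and the related programs. I aim to post weekly, one chapter per week. My answers and interpretations may contain mistakes and omissions; corrections in the comments are welcome.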


1 (void**)&d_a



In the prototype of cudaMalloc(), the first parameter has type void**, which is why the call site casts &d_a to (void**).

cudaError_t cudaMalloc(void** devPtr, size_t size);


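int *d_a = NULL, *d_b = NULL, *d_c = NULL;   // d_a is a pointer variable stored in host memory; set to NULL here, because an uninitialized local pointer holds an indeterminate value, not necessarily null
printf(" %p  %p\n", (void *)d_a, (void *)&d_a);

cudaMalloc((void **)&d_a, N * sizeof(int));  // d_a now holds a concrete address in GPU global memory
printf(" %p  %p\n", (void *)d_a, (void *)&d_a);


 (nil)  0x7ffc611847e8				// (nil) is a null pointer: d_a does not point anywhere yet.
 0x7ffb53200000  0x7ffc611847e8		// &d_a is unchanged and is still an address on the CPU side; d_a itself changed and now holds a GPU address.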
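
Why the address of the pointer has to be passed can be seen with two small helpers. This is only a sketch: set_by_value and set_by_address are names I made up for illustration, not CUDA APIs. The second one has the same shape as the first parameter of cudaMalloc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): the pointer arrives BY VALUE, so the
// assignment below changes only the local copy, never the caller's variable.
void set_by_value(int *p, int *target) { p = target; (void)p; }

// Hypothetical helper shaped like cudaMalloc's first parameter: it receives
// the ADDRESS of the caller's pointer and writes the new value through it.
void set_by_address(void **pp, void *target) { *pp = target; }

int main() {
    int dummy = 0;
    int *p = nullptr;

    set_by_value(p, &dummy);
    printf("after set_by_value:   p is %s\n", p ? "set" : "null");

    set_by_address((void **)&p, &dummy);
    printf("after set_by_address: p is %s\n", p == &dummy ? "&dummy" : "unexpected");

    // The real call: cudaMalloc stores a device address into d_a through &d_a.
    int *d_a = nullptr;
    cudaError_t err = cudaMalloc((void **)&d_a, 16 * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("d_a = %p (device address), &d_a = %p (host address)\n",
           (void *)d_a, (void *)&d_a);
    cudaFree(d_a);
    return 0;
}
```

cudaMalloc already uses its return value for the error code, so the only way it can hand back the allocated address is by writing through a pointer to the caller's pointer.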

2 Compilation process


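CUDA C is an extension of C. A traditional C compiler cannot handle the extended parts, so a compiler that understands them, such as NVCC, is needed. When it processes a CUDA C program, the compiler uses the CUDA keywords to separate host code from device code. The host code is handed to an ordinary host C/C++ compiler.

Device code, marked with CUDA keywords →→ compiled by NVCC into PTX, which the book calls a virtual binary file →→ compiled further into a real object file (a cubin) for a specific GPU architecture. This step runs on the host: either ahead of time, by the ptxas assembler that NVCC invokes, or just in time, by the CUDA driver when the program is loaded →→ the device code is embedded in the final executable at link time →→ at run time it is loaded onto the GPU and executed.

A PTX file is an intermediate form between the high-level language and machine code. It is a lower-level, text-based description of the program for a virtual GPU instruction set, not an actual binary. The benefit is that optimized binary code can be generated from the same PTX for different GPU architectures, which improves the portability and performance of the program.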
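
The stages above can be inspected by hand. The sketch below is a tiny program with the relevant nvcc commands in its header comment. The file name scale.cu and the sm_80 target are assumptions of mine; replace sm_80 with the compute capability of your own GPU.

```cuda
// scale.cu (hypothetical file name)
//
// Standard nvcc options that stop the pipeline at different stages:
//   nvcc -ptx scale.cu -o scale.ptx                   stop at PTX (readable text)
//   nvcc -cubin -arch=sm_80 scale.cu -o scale.cubin   real binary for one architecture
//   nvcc -arch=sm_80 scale.cu -o scale                full build, host and device code linked
#include <cstdio>
#include <cuda_runtime.h>

// Device code: nvcc lowers this to PTX and then to a cubin.
__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Host code: nvcc forwards this part to the host C++ compiler.
int main() {
    const int n = 256;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = nullptr;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<1, n>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0] = %g\n", h[0]);
    return 0;
}
```

Opening scale.ptx in a text editor shows that PTX is human-readable assembly, while the cubin is an opaque binary tied to one architecture.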

3 Function declaration qualifiers

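__global__ marks a kernel function: it is called from the host and executed on the device. __device__ marks a helper function that other __global__ or __device__ functions call (it cannot be launched on its own): it is called from the device and executed on the device.

__host__ marks an ordinary CPU function.

| Qualifier keyword  | Callable from    | Executed on | Executed by                |
|--------------------|------------------|-------------|----------------------------|
| __host__ (default) | Host             | Host        | Caller host thread         |
| __global__         | Host (or Device) | Device      | New grid of device threads |
| __device__         | Device           | Device      | Caller device thread       |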
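
A minimal sketch with one function of each kind. The names square, square_all and square_on_gpu are my own, chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__: callable only from device code; runs in the calling device thread.
__device__ float square(float x) { return x * x; }

// __global__: a kernel. Launched from the host, executed by a new grid of threads.
__global__ void square_all(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

// __host__ (also the default when nothing is written): an ordinary CPU function.
__host__ bool square_on_gpu(float *h_data, int n) {
    float *d_data = nullptr;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    square_all<<<blocks, threads>>>(d_data, n);
    cudaError_t launch = cudaGetLastError();  // reports launch failures

    cudaError_t copy = cudaMemcpy(h_data, d_data, n * sizeof(float),
                                  cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return launch == cudaSuccess && copy == cudaSuccess;
}

int main() {
    float h[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    if (!square_on_gpu(h, 4)) {
        fprintf(stderr, "a CUDA call failed\n");
        return 1;
    }
    printf("%g %g %g %g\n", h[0], h[1], h[2], h[3]);
    return 0;
}
```

Calling square() directly from main would be rejected by the compiler, because a __device__ function has no host version.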


  1. If we want to use each thread in a grid to calculate one output element of a
    vector addition, what would be the expression for mapping the thread/block
    indices to the data index (i)?
    (A) i=threadIdx.x + threadIdx.y;
    (B) i=blockIdx.x + threadIdx.x;
    (C) i=blockIdx.x*blockDim.x + threadIdx.x;
    (D) i=blockIdx.x*threadIdx.x;
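
My reading: each block covers blockDim.x consecutive elements, so the mapping is i = blockIdx.x*blockDim.x + threadIdx.x, which is option (C). A minimal sketch that puts the expression in a function usable on both host and device, so it can also be checked on the CPU. The helper name global_index is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: block b owns elements [b*blockSize, (b+1)*blockSize).
__host__ __device__ inline unsigned int global_index(unsigned int block,
                                                     unsigned int blockSize,
                                                     unsigned int thread) {
    return block * blockSize + thread;
}

// One output element per thread.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = global_index(blockIdx.x, blockDim.x, threadIdx.x);
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 1000;
    const size_t bytes = n * sizeof(float);
    float h_A[n], h_B[n], h_C[n];
    for (int i = 0; i < n; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    if (cudaMalloc((void **)&d_A, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_B, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_C, bytes) != cudaSuccess) return 1;
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    if (cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_C[i] != h_A[i] + h_B[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```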


  2. Assume that we want to use each thread to calculate two adjacent elements of
    a vector addition. What would be the expression for mapping the thread/block
    indices to the data index (i) of the first element to be processed by a thread?
    (A) i=blockIdx.x*blockDim.x + threadIdx.x +2;
    (B) i=blockIdx.x*threadIdx.x*2;
    (C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
    (D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;
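
My reading: the thread with global id t handles elements 2t and 2t+1, so the first index is i = (blockIdx.x*blockDim.x + threadIdx.x)*2, option (C). The sketch below makes the mapping visible by having each thread write its global id into the two slots it owns. The names first_adjacent and mark_adjacent and the tiny sizes (16 elements, 4 threads per block) are my own choices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: first of the two adjacent elements owned by a thread.
__host__ __device__ inline unsigned int first_adjacent(unsigned int block,
                                                       unsigned int blockSize,
                                                       unsigned int thread) {
    return (block * blockSize + thread) * 2;
}

// Each thread records its global thread id in the two adjacent slots it owns.
__global__ void mark_adjacent(int *owner, unsigned int n) {
    unsigned int i = first_adjacent(blockIdx.x, blockDim.x, threadIdx.x);
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)     owner[i]     = tid;
    if (i + 1 < n) owner[i + 1] = tid;
}

int main() {
    const unsigned int n = 16, threads = 4;
    // Each block covers 2*threads elements.
    const unsigned int blocks = (n + 2 * threads - 1) / (2 * threads);

    int h_owner[n];
    int *d_owner = nullptr;
    if (cudaMalloc((void **)&d_owner, n * sizeof(int)) != cudaSuccess) return 1;

    mark_adjacent<<<blocks, threads>>>(d_owner, n);
    if (cudaMemcpy(h_owner, d_owner, n * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    for (unsigned int i = 0; i < n; ++i) printf("%d ", h_owner[i]);
    printf("\n");
    cudaFree(d_owner);
    return 0;
}
```

With these sizes the owners should come out as 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7: neighbouring elements belong to the same thread.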


  3. We want to use each thread to calculate two elements of a vector addition.
    Each thread block processes 2*blockDim.x consecutive elements that form
    two sections. All threads in each block will process a section first, each
    processing one element. They will then all move to the next section, each
    processing one element. Assume that variable i should be the index for the
    first element to be processed by a thread. What would be the expression for
    mapping the thread/block indices to data index of the first element?
    (A) i=blockIdx.x*blockDim.x + threadIdx.x +2;
    (B) i=blockIdx.x*threadIdx.x*2;
    (C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
    (D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;
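
My reading: each block owns 2*blockDim.x elements, and within the first section thread k takes element k, so i = blockIdx.x*blockDim.x*2 + threadIdx.x, option (D). The second element of the same thread is then i + blockDim.x. A sketch under the same assumptions as a toy example (16 elements, 4 threads per block); first_sectioned and mark_sections are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: index of a thread's element in the FIRST section of its block.
__host__ __device__ inline unsigned int first_sectioned(unsigned int block,
                                                        unsigned int blockSize,
                                                        unsigned int thread) {
    return block * blockSize * 2 + thread;
}

// Each thread records its global thread id in one slot per section.
__global__ void mark_sections(int *owner, unsigned int n) {
    unsigned int i = first_sectioned(blockIdx.x, blockDim.x, threadIdx.x);
    unsigned int j = i + blockDim.x;  // same thread, second section
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) owner[i] = tid;
    if (j < n) owner[j] = tid;
}

int main() {
    const unsigned int n = 16, threads = 4;
    const unsigned int blocks = (n + 2 * threads - 1) / (2 * threads);

    int h_owner[n];
    int *d_owner = nullptr;
    if (cudaMalloc((void **)&d_owner, n * sizeof(int)) != cudaSuccess) return 1;

    mark_sections<<<blocks, threads>>>(d_owner, n);
    if (cudaMemcpy(h_owner, d_owner, n * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    for (unsigned int i = 0; i < n; ++i) printf("%d ", h_owner[i]);
    printf("\n");
    cudaFree(d_owner);
    return 0;
}
```

Here the owners should come out as 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7: consecutive threads touch consecutive elements within each section.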


  4. For a vector addition, assume that the vector length is 8000, each thread
    calculates one output element, and the thread block size is 1024 threads. The
    programmer configures the kernel call to have a minimum number of thread
    blocks to cover all output elements. How many threads will be in the grid?
    (A) 8000
    (B) 8196
    (C) 8192
    (D) 8200
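
The arithmetic is a ceiling division: ceil(8000/1024) = 8 blocks, and 8 * 1024 = 8192 threads, option (C). A host-only sketch; blocks_needed is a made-up helper name.

```cuda
#include <cstdio>

// Hypothetical helper: the smallest number of blocks whose threads cover n
// elements, i.e. ceil(n / threadsPerBlock) in integer arithmetic.
unsigned int blocks_needed(unsigned int n, unsigned int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const unsigned int n = 8000, threadsPerBlock = 1024;
    unsigned int blocks = blocks_needed(n, threadsPerBlock);
    unsigned int gridThreads = blocks * threadsPerBlock;
    // A kernel launch would use <<<blocks, threadsPerBlock>>>.
    printf("blocks = %u, threads in grid = %u, idle threads = %u\n",
           blocks, gridThreads, gridThreads - n);
    return 0;
}
```

This prints blocks = 8, threads in grid = 8192, idle threads = 192. The 192 surplus threads are the reason kernels guard their work with if (i < n).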


  5. If we want to allocate an array of v integer elements in the CUDA device
    global memory, what would be an appropriate expression for the second
    argument of the cudaMalloc call?
    (A) n
    (B) v
    (C) n * sizeof(int)
    (D) v * sizeof(int)


  6. If we want to allocate an array of n floating-point elements and have a
    floating-point pointer variable A_d to point to the allocated memory, what
    would be an appropriate expression for the first argument of the cudaMalloc()
    call?
    (A) n
    (B) (void *) A_d
    (C) *A_d
    (D) (void **) &A_d


  7. If we want to copy 3000 bytes of data from host array A_h (A_h is a pointer
    to element 0 of the source array) to device array A_d (A_d is a pointer to
    element 0 of the destination array), what would be an appropriate API call
    for this data copy in CUDA?
    (A) cudaMemcpy(3000, A_h, A_d, cudaMemcpyHostToDevice);
    (B) cudaMemcpy(A_h, A_d, 3000, cudaMemcpyDeviceToHost);
    (C) cudaMemcpy(A_d, A_h, 3000, cudaMemcpyHostToDevice);
    (D) cudaMemcpy(3000, A_d, A_h, cudaMemcpyHostToDevice);
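
The three questions on cudaMalloc and cudaMemcpy fit into one small round trip. My reading of the answers: the size argument is a byte count (v * sizeof(int)), the first argument of cudaMalloc is (void **) &A_d, and cudaMemcpy takes destination, source, byte count, direction, so the copy is cudaMemcpy(A_d, A_h, 3000, cudaMemcpyHostToDevice). The helper round_trip and the buffer named back are hypothetical.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: copies `bytes` bytes host -> device -> host and reports
// whether every call succeeded and the data came back unchanged.
bool round_trip(const unsigned char *src, unsigned char *dst, size_t bytes) {
    unsigned char *A_d = nullptr;
    // First argument: address of the pointer. Second argument: size in BYTES.
    if (cudaMalloc((void **)&A_d, bytes) != cudaSuccess) return false;

    // Argument order: destination, source, size in bytes, direction.
    cudaError_t up   = cudaMemcpy(A_d, src, bytes, cudaMemcpyHostToDevice);
    cudaError_t down = cudaMemcpy(dst, A_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(A_d);

    return up == cudaSuccess && down == cudaSuccess &&
           memcmp(src, dst, bytes) == 0;
}

int main() {
    const size_t bytes = 3000;
    unsigned char A_h[bytes], back[bytes];
    for (size_t i = 0; i < bytes; ++i) {
        A_h[i] = (unsigned char)(i % 251);
        back[i] = 0;
    }
    printf("round trip %s\n", round_trip(A_h, back, bytes) ? "ok" : "FAILED");
    return 0;
}
```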


  8. How would one declare a variable err that can appropriately receive the
    returned value of a CUDA API call?
    (A) int err;
    (B) cudaError err;
    (C) cudaError_t err;
    (D) cudaSuccess_t err;
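
The runtime API functions are declared to return cudaError_t, so my answer is option (C), cudaError_t err;. A minimal sketch of how such a variable is typically used; the helper succeeded is a made-up name, not part of CUDA.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true on success, otherwise prints what failed
// together with the runtime's description of the error code.
bool succeeded(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

int main() {
    float *A_d = nullptr;

    cudaError_t err = cudaMalloc((void **)&A_d, 1024 * sizeof(float));
    if (!succeeded(err, "cudaMalloc")) return 1;

    if (!succeeded(cudaFree(A_d), "cudaFree")) return 1;

    printf("all CUDA calls returned cudaSuccess\n");
    return 0;
}
```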


  9. Consider the following CUDA kernel and the corresponding host function
    that calls it:
    01 __global__ void foo_kernel(float* a, float* b, unsigned int N){
    02     unsigned int i=blockIdx.x*blockDim.x + threadIdx.x;
    03     if(i < N) {
    04         b[i]=2.7f*a[i] - 4.3f;
    05     }
    06 }
    07 void foo(float* a_d, float* b_d) {
    08     unsigned int N=200000;
    09     foo_kernel<<<(N + 128-1)/128, 128>>>(a_d, b_d, N);
    10 }
    a. What is the number of threads per block?
    b. What is the number of threads in the grid?
    c. What is the number of blocks in the grid?
    d. What is the number of threads that execute the code on line 02?
    e. What is the number of threads that execute the code on line 04?
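
My working: (a) 128 threads per block; (c) (200000 + 127)/128 = 1563 blocks in integer division; (b) 1563 * 128 = 200064 threads in the grid; (d) all 200064 threads execute line 02; (e) only the 200000 threads with i < N execute line 04. The sketch below checks (d) and (e) empirically by adding two atomic counters to the kernel. The counters and the names foo_kernel_counted and grid_blocks are my additions, not part of the exercise.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: same ceiling division as the launch on line 09.
unsigned int grid_blocks(unsigned int N, unsigned int threads) {
    return (N + threads - 1) / threads;
}

// The exercise kernel plus two counters, bumped with atomicAdd so that every
// thread is counted exactly once.
__global__ void foo_kernel_counted(float *a, float *b, unsigned int N,
                                   unsigned int *reached02,
                                   unsigned int *reached04) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;  // line 02
    atomicAdd(reached02, 1u);
    if (i < N) {
        b[i] = 2.7f * a[i] - 4.3f;                           // line 04
        atomicAdd(reached04, 1u);
    }
}

int main() {
    const unsigned int N = 200000, threads = 128;
    const unsigned int blocks = grid_blocks(N, threads);

    float *a_d = nullptr, *b_d = nullptr;
    unsigned int *counts_d = nullptr;
    if (cudaMalloc((void **)&a_d, N * sizeof(float)) != cudaSuccess ||
        cudaMalloc((void **)&b_d, N * sizeof(float)) != cudaSuccess ||
        cudaMalloc((void **)&counts_d, 2 * sizeof(unsigned int)) != cudaSuccess)
        return 1;
    cudaMemset(a_d, 0, N * sizeof(float));
    cudaMemset(counts_d, 0, 2 * sizeof(unsigned int));

    foo_kernel_counted<<<blocks, threads>>>(a_d, b_d, N, counts_d, counts_d + 1);

    unsigned int counts[2] = {0, 0};
    if (cudaMemcpy(counts, counts_d, sizeof(counts),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    printf("threads per block: %u\n", threads);
    printf("blocks in grid:    %u\n", blocks);
    printf("threads in grid:   %u\n", blocks * threads);
    printf("executed line 02:  %u\n", counts[0]);
    printf("executed line 04:  %u\n", counts[1]);

    cudaFree(a_d); cudaFree(b_d); cudaFree(counts_d);
    return 0;
}
```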


  10. A new summer intern was frustrated with CUDA. He has been complaining
    that CUDA is very tedious. He had to declare many functions that he plans
    to execute on both the host and the device twice, once as a host function and
    once as a device function. What is your response?


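Qualifier keywords on function declarations tell the compiler how to separate the CPU code from the GPU code, so that each part gets the appropriate processing and optimization. The CPU-side code is handled by an ordinary C compiler; the GPU-side code uses the C extensions and needs a special compiler. The intern therefore does not need to write anything twice: a function declared with both __host__ and __device__ is compiled into a host version and a device version from a single definition.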
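
A minimal sketch of what the intern could write instead of two copies. The function affine and the kernel affine_kernel are hypothetical; the formula is borrowed from the foo_kernel exercise.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One definition, two compiled versions: nvcc generates both a host version
// and a device version of this function.
__host__ __device__ float affine(float x) { return 2.7f * x - 4.3f; }

__global__ void affine_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = affine(in[i]);  // device version
}

int main() {
    const int n = 4;
    const size_t bytes = n * sizeof(float);
    float h_in[n] = {0.0f, 1.0f, 2.0f, 3.0f}, h_out[n];

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_out, bytes) != cudaSuccess) return 1;
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    affine_kernel<<<1, n>>>(d_in, d_out, n);
    if (cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) != cudaSuccess)
        return 1;

    for (int i = 0; i < n; ++i)  // affine() here is the host version
        printf("x = %g  host = %g  device = %g\n", h_in[i], affine(h_in[i]), h_out[i]);

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

The host and device columns should agree up to floating-point rounding; the device compiler may fuse the multiply and subtract into one instruction, so the last bits can differ.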

