charpter_01_Introduction to CUDA programming

最新推荐文章于 2023-07-11 02:54:50 发布

The__Aviator

最新推荐文章于 2023-07-11 02:54:50 发布

阅读量339

点赞数

分类专栏：高性能并行计算零基础进阶笔记文章标签： cuda gpu hpc 并行计算

本文链接：https://blog.csdn.net/pilotmickey/article/details/113746796

版权

高性能并行计算零基础进阶笔记专栏收录该内容

2 篇文章 1 订阅

订阅专栏

Charpter 01 Introduction to CUDA Programming

Charpter 01 Introduction to CUDA Programming

0. Contents in Brief:

History of HPC
Getting started with PyCUDA

1.The history of highperformance computing

1.1 Basic

Floating-Point Operations (FLOPs) per second is the fundamental unit for measuring the theoretical peak of any compute processor.
https://en.wikipedia.org/wiki/FLOPS

1.2 History Overview

1.2.1 Three epochs of last century

MegaFLOP ~ PetaFLOP
百万~千兆
1000000~1000000000000000
$10^5$ ~ $10^{15}$

epoch1 epoch2 epoch3
CRAY-1 CRAY-2 Cray T3D
160 MFOP 2 GFLOPs 1 TFLOP
a single vector CPU a 4 Core Vector CPU compute nodes communicate by network

epoch1	epoch2	epoch3
CRAY-1	CRAY-2	Cray T3D
160 MFOP	2 GFLOPs	1 TFLOP
a single vector CPU	a 4 Core Vector CPU	compute nodes communicate by network

1.2.2 Almost 20 YEARS later … to 2020, there were no fundamental innovations.

Technological innovations were primarily focused on three architectural innovations:

1. Moving from an 8-bit to a 16-bit to a 32-bit and now a 64-bit instruction set；
1. Increasing ILP；
Instruction-Level Parallelism (ILP) is a concept wherein code-independent instructions can execute at the same time. For the instructions to execute in parallel, they need to be independent of each other.

ILP：独立于代码的指令级并行

https://en.wikipedia.org/wiki/Instruction-level_parallelism
https://zhuanlan.zhihu.com/p/139241174
1. Increasing the number of cores.

This was supported by increasing the clock rate, which currently stands at 4 GHz

Clock rate: 时钟频率，CPU的主频

The fundamental laws that drove the semiconductor industry make this possible.

Moore’s Law https://en.wikipedia.org/wiki/Moore%27s_law
Dennard scaling https://en.wikipedia.org/wiki/Dennard_scaling

Why GPU?

https://zhuanlan.zhihu.com/p/107394889

CPU 和 GPU 的区别是什么？

https://www.zhihu.com/question/19903344/answer/96081382

1.2.3 2004 A researcher named Mark Harris made use of GPU for non-graphics tasks for the first time.

General Purpose Computation using GPU (GPGPU) was coined.
https://scholar.google.com/citations?user=pxbPE_cAAAAJ&hl=en

1.2.4 2007 CUDA: Compute Unified Device Architecture was first released.

1.3 Heterogeneous computing 异构计算

Heterogeneous：多种多样的；混杂的

GPUs are used to accelerate the parts of the code that are parallel in nature.
A highly efficient CPU coupled with a high throughput GPU results in improved performance for the application.
Diagram: An application running on multiple processor types:

This concept can be very well defined with the help of Amdahl’s law.

Amdahl’s law :
https://en.wikipedia.org/wiki/Amdahl%27s_law

A formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved.

The key point:

CPU is good for a certain fraction of code that is
latency bound
GPU is good at running the Single Instruction
Multiple Data (SIMD) part of the code in parallel.

SIMD:
https://en.wikipedia.org/wiki/SIMD
A class of parallel computers in Flynn’s taxonomy.
It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

It won’t necessarily result in good speedup for the overall application if only one of them. So essentially offloading certain types of operations from the processor onto a GPU is called Heterogeneous Computing.

Diagram: latency bound and throughput bound:

在这里插入图片描述

1.4 Programming paradigm

1.4.1 Flynn’s taxonomy

https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Flynn’s taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966.

The four classifications include:

Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data streams (SIMD)
Multiple instruction streams, single data stream (MISD)
Multiple instruction streams, multiple data streams (MIMD)

Further divisions: 略

SIMDs is used to describe GPU architecture.

SIMD is used to describe an architecture where the same instruction is applied in parallel to multiple data points.

In contrast, in Single Instruction Multiple Threads (SIMTs), rather than a
single thread issuing the instructions, multiple threads issue the same instruction to different data.

SIMTs
[https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads%5C)
An execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading.

在这里插入图片描述

The GPU architecture is more suitable in terms of the SIMT category compared to SIMD.

1.4.2 An example

Adding two arrays and storing data in a third array ： $C = A + B$

Diagram: Vector addition

在这里插入图片描述

It is obvious that each task is independent of each other, but the same
operation is being applied by all of the threads.

1.5 Low latency versus higher throughput 低延迟和高吞吐量

1.5.1 CPU architecture

CPU	GPU
optimized for low latency access	optimized for data parallel throughput computation
larger amount of cache	smaller amount of cache
many types of cache	not many

L3 ~ L1 the lower the amount of cache is present and less latency.
Since CPUs run at a very high clock speed, it becomes necessary to hide the latency of fetching the data by frequently storing used data in caches and predicting the next instruction to execute.

Diagram: How the CPU and GPU architecture dedicate the chip die area for different memory and compute units:

在这里插入图片描述

While GPU uses a lot of transistors for computing ALUs, CPU uses
it to reduce latency.
The CPU architecture is a latency reducing
architecture.

ALUs: 算术逻辑单元
https://en.wikipedia.org/wiki/Arithmetic_logic_unit
https://baike.baidu.com/item/算术逻辑单元/8954657?fr=aladdin

Architecture of the central processing unit (CPU)
https://computersciencewiki.org/index.php/Architecture_of_the_central_processing_unit_(CPU)

1.5.2 GPU architecture

The GPU architecture, on the other hand, is called a latency reducing or high throughput architecture.

The GPU architecture hides latency with computations from other threads.

why we can’t create these threads in the CPU and do the same thing to hide latency?

The context switching time between threads in CPU, compared to GPU, is much higher due to less registers in CPU than in GPU.

1.6 Programming approaches to GPU

Diagram:The various ways we can perform GPU programming:

2. Getting started with PyCUDA

CUDA
host	device
CPU	GPU
Be responsible for calling the device function	/
host code (the serial code)	device code

2.1 Querying your GPU with PyCUDA

import pycuda.driver as drv

drv.init()  # 初始化PyCUDA
# 查GPU参数
device = drv.Device(0)
print(f'Device {0}: {device.name()}')
compute_capability = float('%d.%d' % device.compute_capability())
print(f'Compute Capability:{compute_capability}')
print(f'Total Memory: {device.total_memory()} megabytes')
device_attributes = device.get_attributes()
print(device_attributes)

2.2 Using PyCUDA’s `gpuarray` class

2.2.1 Transferring data to and from the GPU with gpuarray

GPU has its own memory: device memory or global device memory(to differentiate this from the additional cache memory, shared memory, and register memory that is also on the GPU.)
We use malloc and free functions in C or new and delete operators in C++ to allocat heap memory dynamically.
We treat global device memory the same way in CUDA C/C++ to transfer data back and forth between the CPU to the GPU with commands such as cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost
All while keeping track of multiple pointers in both the CPU and GPU space and performing proper memory allocations (cudaMalloc) and deallocations (cudaFree).

Fortunately, PyCUDA covers all of the overhead of memory allocation, deallocation, and data transfers with the `gpuarray` class.

gpuarray objects even perform automatic cleanup based on the lifetime.

"""Using PyCUDA's gpuarray class"""
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
host_data = np.array([1, 2, 3, 4, 5], dtype=np.float32)
device_data = gpuarray.to_gpu(host_data)
print(device_data)
device_data_2 = device_data*2
print(device_data_2)
host_data_2 = device_data_2.get()
print(host_data_2)

check this url if you have problem with your pycuda nvcc enviroment
https://stackoverflow.com/questions/40856535/pycuda-nvcc-fatal-cannot-find-compiler-cl-exe-in-path

2.2.2 Basic pointwise arithmetic operations with gpuarray

A pointwise operation is intrinsically parallelizable.

array(A) $+, -, *, /, * *$ array(B)

These pointwise operations performed on the GPU are in parallel since the computation of one element is not dependent on the computation of any other element.

2.2.3 A speed test

"""A speed test"""
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from time import time

host_data = np.float32(np.random.random(5000000))
t1 = time()
host_data_2x = host_data * np.float32(2)
t2 = time()
print('total time to compute on CPU: %f' % (t2 - t1))
device_data = gpuarray.to_gpu(host_data)
t1 = time()
device_data_2x = device_data * np.float32(2)
t2 = time()
from_device = device_data_2x.get()
print('total time to compute on GPU: %f' % (t2 - t1))
print('Is the host computation the same as the GPU computation?:{}'.format(np.allclose(from_device, host_data_2x)))

以上是 GPU型号：GTX 1050 Ti 的结果，第一次run是3s多，
这和原书 GTX 1050 的0.008s的结果还差很多，书中解释说pycuda会重新编译.cu文件，但是重新运行会不用再编译新的.cu文件，速度就会上来，或者用编译好的GPU code或者用Scikit-CUDA库就不会有这个问题。但是测试重新run很多次还是没有上来太多，稳定在2.5s左右.（2021.01.26）

如果示例代码是第一次被运行，它将会慢一些，而且在程序正常输出之前还会打印下面内容。https://zhuanlan.zhihu.com/p/125598914

UserWarning: The CUDA compiler succeeded, but said the following:
kernel.cu

在这里插入图片描述

用IPython就快了，5000000时GPU时间稳定在0.002s左右;5000000之前GPU和CPU时间都差不多。（2021.01.27）

Tips： Python脚本性能分析:

若py文件名为’time_calc0.py’，在IPython中输入：

with open('time_calc0.py','r') as f:
    time_calc_code = f.read()
%prun -s cumulative exec(time_calc_code)

在这里插入图片描述

ncalls：表示函数调用的次数；
tottime：表示指定函数的总的运行时间，除掉函数中调用子函数的运行时间；
percall：（第一个percall）等于 tottime/ncalls；
cumtime：表示该函数及其所有子函数的调用运行的时间，即函数开始调用到返回的时间；
percall：（第二个percall）即函数运行一次的平均时间，等于 cumtime/ncalls；
filename:lineno(function)：每个函数调用的具体信息；

The Python Profilers:
https://docs.python.org/3/library/profile.html#cProfile.run
How to use profilers?
https://www.machinelearningplus.com/python/cprofile-how-to-profile-your-python-code/

2.3 Using PyCUDA’s `ElementWiseKernel` for performing pointwise computations

2.3.1 Basic

Kernel: A function that is launched directly onto the GPU by CUDA.

We will use several functions from PyCUDA that generate templates and design patterns for different types of kernels, easing our transition into GPU programming.

An example:

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from time import time
from pycuda.elementwise import ElementwiseKernel

host_data = np.float32(np.random.random(50000000))

gpu_2x_ker = ElementwiseKernel(
    "float *in, float *out",
    "out[i] = 2*in[i];",
    "gpu_2x_ker")


def speedcomparison():
    t1 = time()
    host_data_2x = host_data * np.float32(2)
    t2 = time()
    print('total time to compute on CPU: %f' % (t2 - t1))
    device_data = gpuarray.to_gpu(host_data)
    # allocate memory for output
    device_data_2x = gpuarray.empty_like(device_data)
    t1 = time()
    gpu_2x_ker(device_data, device_data_2x)
    t2 = time()
    from_device = device_data_2x.get()
    print('total time to compute on GPU: %f' % (t2 - t1))
    print(
        'Is the host computation the same as the GPU computation? : {}'.format(np.allclose(from_device, host_data_2x)))


if __name__ == '__main__':
    speedcomparison()

运行结果：

在这里插入图片描述

GPU时间和原书差距很大，原书结果是0.0s，呃。

性能分析：

在这里插入图片描述

可以发现{method 'acquire' of '_thread.lock' objects}耗时最多，1.964s，几乎是全部的时间，无论用ipython还是用pycharm结果一样。而前一个例子用ipython时这一项耗时为0，用pycharm时这一项耗时也有1.9s多。
？？？？这是什么意思2021.01.27

2.3.2 Mandelbrot

from time import time
import matplotlib
##This will prevent the figure from popping up matplotlib.use('Agg')
from matplotlib import pyplot as plt
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.elementwise import ElementwiseKernel

mandel_ker = ElementwiseKernel(
"pycuda::complex<float> *lattice, float *mandelbrot_graph, int max_iters, float upper_bound",
"""
mandelbrot_graph[i] = 1;

pycuda::complex<float> c = lattice[i]; 
pycuda::complex<float> z(0,0);

for (int j = 0; j < max_iters; j++)
    {
    
    z = z*z + c;
    
    if(abs(z) > upper_bound)
        {
          mandelbrot_graph[i] = 0;
          break;
        }

    }
        
""",
"mandel_ker")

def gpu_mandelbrot(width, height, real_low, real_high, imag_low, imag_high, max_iters, upper_bound):

    # we set up our complex lattice as such
    real_vals = np.matrix(np.linspace(real_low, real_high, width), dtype=np.complex64)
    imag_vals = np.matrix(np.linspace( imag_high, imag_low, height), dtype=np.complex64) * 1j
    mandelbrot_lattice = np.array(real_vals + imag_vals.transpose(), dtype=np.complex64)    
    
    # copy complex lattice to the GPU
    mandelbrot_lattice_gpu = gpuarray.to_gpu(mandelbrot_lattice)

    # allocate an empty array on the GPU
    mandelbrot_graph_gpu = gpuarray.empty(shape=mandelbrot_lattice.shape, dtype=np.float32)

    mandel_ker( mandelbrot_lattice_gpu, mandelbrot_graph_gpu, np.int32(max_iters), np.float32(upper_bound))
              
    mandelbrot_graph = mandelbrot_graph_gpu.get()
    
    return mandelbrot_graph


    if __name__ == '__main__':

        t1 = time()
        mandel = gpu_mandelbrot(512,512,-2,2,-2,2,256, 2)
        t2 = time()

        mandel_time = t2 - t1

        t1 = time()
        fig = plt.figure(1)
        plt.imshow(mandel, extent=(-2, 2, -2, 2))
        plt.savefig('mandelbrot.png', dpi=fig.dpi)
        t2 = time()

        dump_time = t2 - t1

        print ('It took {} seconds to calculate the Mandelbrot graph.'.format(mandel_time))
        print ('It took {} seconds to dump the image.'.format(dump_time))```

运行结果：

在这里插入图片描述

性能分析：

在这里插入图片描述

仍然是这一项耗时最长，去掉这个时间比原书0.89s快。（测试GPU：GTX 1050 Ti，原书GPU：GTX 1050）

3. Kernels, Threads, Blocks, and Grids

3.1 Kernels

3.1.1 Contents

前面写了PyCUDA内置的特定设计模式的kernel，这章写完全自定义的kernel，更精细地控制GPU，但是编程也更复杂；
理解threads, blocks, and grids和它们在kernel里面的角色；

3.1.2 The PyCUDA `SourceModule` function

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import SourceModule

ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
int i = threadIdx.x;
outvec[i] = scalar*vec[i];
}
""")

# __global__ : Distinguish the function as a kernel to the compiler.
# void: We'll always just declare this as a void function, because we'll always get our output values by passing a pointer to some empty chunk of memory that we pass in as a parameter.

void:
https://docs.microsoft.com/en-us/cpp/cpp/void-cpp?view=msvc-160
https://stackoverflow.com/questions/1043034/what-does-void-mean-in-c-c-and-c

Basically it means “nothing” or “no type”
There are 3 basic ways that void is used:

Function argument: int myFunc(void) – the function takes nothing.
Function return value: void myFunc(int) – the function returns nothing
Generic data pointer: void* data – ‘data’ is a pointer to data of unknown type, and cannot be dereferenced

Note: the void in a function argument is optional in C++, so int myFunc() is exactly the same as int myFunc(void), and it is left out completely in C#. It is always required for a return value.

指针
A pointer is an object in many programming languages that stores a memory address.
https://en.wikipedia.org/wiki/Pointer_(computer_programming)

C语言数组当参数传递
https://blog.csdn.net/Laoynice/article/details/79196993/

# parameters：
# outvec: (floating-point array pointer) 输出的scale后的向量;
# scalar: (float) 要乘以的数，不是一个指针;
# vec: (floating-point array pointer) 输入向量.

Tips: Singleton input parameters to a kernel function can be passed in directly from the host without using pointers or allocated device memory.

回看Charpter 01 2.3 相对应的用ElementwiseKernel的例子:

gpu_2x_ker = ElementwiseKernel(
    "float *in, float *out",
    "out[i] = 2*in[i];",
    "gpu_2x_ker")

它通过i自动地在多个GPU threads 上并行。
在以上自定义kernel里面：
The identification of each individual thread is given by the threadIdx value, which we retrieve as follows: int i = threadIdx.x;.
Here we’ll have to specifically set the number of threads to 512 with the block and grid parameters.

完整代码：

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import SourceModule

ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
     int i = threadIdx.x;
     outvec[i] = scalar*vec[i];
}
""")

scalar_multiply_gpu = ker.get_function("scalar_multiply_kernel")

testvec = np.random.randn(512).astype(np.float32)
testvec_gpu = gpuarray.to_gpu(testvec)
outvec_gpu = gpuarray.empty_like(testvec_gpu)

scalar_multiply_gpu(outvec_gpu, np.float32(2), testvec_gpu, block=(512, 1, 1), grid=(1, 1, 1))

print("Does our kernel work correctly? : {}".format(np.allclose(outvec_gpu.get(), 2 * testvec)))

3.2 Threads, blocks, and grids

3.2.1 GPU architecture

3.2.1.1 The two sides of CUDA:

在这里插入图片描述

Software	—>	Hardware
Kernel	—>	GPU device
Device memory	—>	GPU memory
Block	Executes on/as	Multi-processor
Block Specific memory	—>	Multi-Processor memory
Thread	—>	SIMD cores/CUDA cores

CUDA Threads:
- CUDA threads execute on a CUDA core.
- CUDA threads are different from CPU threads.
- CUDA threads are extremely lightweight and provide fast context switching.
  
  The reason for fast context switching is due to the availability of a large register size in a GPU and hardwarebased scheduler.
  
  Register:
  A processor register is a quickly accessible location available to a computer’s processor.
  https://en.wikipedia.org/wiki/Processor_register
  https://baike.baidu.com/item/寄存器/187682?fr=aladdin
CUDA blocks:
- CUDA threads are grouped together into a logical entity called a CUDA block.
- CUDA blocks execute on a single Streaming Multiprocessor (SM).
- One block runs on a single SM.
- The user needs to divide the parallel computation into blocks and threads.
GRID/kernel:
- CUDA blocks are grouped together into a logical entity called a CUDA GRID.
- A CUDA GRID is then executed on the device.

3.2.1.2 Thread hierarchy

All threads in a grid execute the same kernel code;
They require a set of unique coordinates (thread ID) to identify the appropriate portion of the data they need to process.
The maximum number of threads per block is determined by the compute capability.
- If we consider the compute capability 3.5(NVIDIA 2012a).
- The total size of a block is limited to 1024 threads. valid example to identify a thread: (1024, 1, 1) or (32, 32, 1)…

An example of how a thread is identified:
在这里插入图片描述

The blocks in the grid are organized in a 2-dimensional array.
Inside each block, the threads are arranged in a 3-dimensional array.
In this example, the block is organized into 5x3x2 arrays of threads, giving a total of 30 threads per block.
For the particular case of thread (4, 2, 0), its coordinates are given by threadIdx.x = 4, threadIdx.y = 2, and threadIdx.z = 0.
This example has six blocks within the grid, and 30 threads per block giving a grand total of 180 threads in the grid.
The number of threads per block, the quantity of blocks within a grid and their respective array are defined by parameters given at the kernel launch, by dimBlock and dimGrid.
- 上图所示例子对应的代码为：
- ```
dim3 dimBlock(5, 3, 2);
dim3 dimGrid(3, 2);
Kernel1<<<dimGrid, dimBlock>>>(...);
```
- grid 因为只有2维，省略了第三个参数。

3.2.2 Vector addition using PyCUDA

3.2.2.1 The sequential C code running on CPU

#include<stdio.h>
#include<stdlib.h>
#define N 512

void host_add(int *a, int *b, int *c) {
    for(int idx=0;idx<N;idx++)
      c[idx] = a[idx] + b[idx];
}
//basically just fills the array with index.
void fill_array(int *data) {
    for(int idx=0;idx<N;idx++)
      data[idx] = idx;
}

void print_output(int *a, int *b, int*c) {
    for(int idx=0;idx<N;idx++)
        printf("\n %d + %d = %d", a[idx] , b[idx], c[idx]);
}

int main(void) {
    int *a, *b, *c;
    int size = N * sizeof(int);
    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); fill_array(a);
    b = (int *)malloc(size); fill_array(b);
    c = (int *)malloc(size);
    host_add(a,b,c);
    print_output(a,b,c);
    free(a); free(b); free(c);
    return 0;
}

Steps:

Allocate memory on the CPU, that is, malloc new.
Populate/initialize the CPU data.
Call the CPU function that has the crunching of data. The actual algorithm is vector addition in this case.
Consume the crunched data, which is printed in this case.

3.2.2.2 Convert the sequential C code to CUDA C code

The main function will look like this:

int main(void) {
    int *a, *b, *c;
    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = N * sizeof(int);
    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); fill_array(a);
    b = (int *)malloc(size); fill_array(b);
    c = (int *)malloc(size);
    // Alloc space for device copies of vector (a, b, c)
    cudaMalloc((void *)&d_a, N * sizeof(int));
    cudaMalloc((void *)&d_b, N *sizeof(int));
    cudaMalloc((void *)&d_c, N * sizeof(int));
    // Copy from host to device
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N* sizeof(int), cudaMemcpyHostToDevice);
    device_add<<<1,1>>>(d_a,d_b,d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    print_output(a,b,c);
    free(a); free(b); free(c);
    //free gpu memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Steps:

Allocate memory on the GPU, that is, cudaMalloc.

Unlike the malloc command,cudaMalloc does not return a pointer to allocated memory; instead, it takes a pointer reference as a parameter and updates the same with the allocated memory.

指针和引用的区别:
https://blog.csdn.net/l477918269/article/details/90233908?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.baidujs&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.baidujs
Populate/initialize the CPU data.
Transfer the data from the host to the device with cudaMemcpy.

Like other memcopy commands, this API requires the destination pointer, source pointer, and size.
Call the GPU function with <<<,>>> brackets.
Synchronize the evice and host with cudaDeviceSynchronize.

In order for the host to make sure that kernel execution has finished, the host calls the cudaDeviceSynchronize function.
Transfer data from the device to the host with cudaMemcpy.
Consume the crunched data, which is printed in this case.

除了分配内存、初始化数据、调用方法和使用处理后的数据，多了3，5，6几个步骤。另外，分配内存变成分配给GPU，调用方法变成调用GPU方法。

3.2.2.3 Using PyCUDA to rewrite the CUDA C code

完整代码：

ref: https://medium.com/@CIulius/five-different-ways-to-sum-vectors-in-pycuda-3f2d9409b139

import numpy as np

# --- PyCUDA initialization
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule


###################
# iDivUp FUNCTION #
###################
def iDivUp(a, b):
    return a // b + 1


########
# MAIN #
########

start = cuda.Event()
end = cuda.Event()

N = 100000

BLOCKSIZE = 256

# --- Create random vectorson the CPU
h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)

# --- Set CPU arrays as single precision
h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)

# --- Allocate GPU device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
d_c = cuda.mem_alloc(h_a.nbytes)

# --- Memcopy from host to device
cuda.memcpy_htod(d_a, h_a)
cuda.memcpy_htod(d_b, h_b)

mod = SourceModule("""
  #include <stdio.h>
  __global__ void deviceAdd(float * __restrict__ d_c, const float * __restrict__ d_a, const float * __restrict__ d_b, const int N)
  {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= N) return;
    d_c[tid] = d_a[tid] + d_b[tid];
  } 
  """)

# --- Define a reference to the __global__ function and call it
deviceAdd = mod.get_function("deviceAdd")
blockDim = (BLOCKSIZE, 1, 1)
gridDim = (iDivUp(N, BLOCKSIZE), 1, 1)
start.record()
deviceAdd(d_c, d_a, d_b, np.int32(N), block=blockDim, grid=gridDim)
end.record()
end.synchronize()
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))

# --- Copy results from device to host
h_c = np.empty_like(h_a)
cuda.memcpy_dtoh(h_c, d_c)

if np.array_equal(h_c, h_a + h_b):
    print("Test passed!")
else:
    print("Error!")

# --- Flush context printf buffer
cuda.Context.synchronize()

测试结果：
性能分析：
- 若记录传输数据耗时：
- 运行结果：
- 若记录第二次传输数据耗时：
- 运行结果：

所以Charpter 01 section 2.3中所出现的问题是host和device间的数据传输过于耗时，且主要耗时在第二次传输数据时：两个章节例子在数据传输上的耗时均在在2.0s左右。
如何降低数据传输耗时？为什么Charpter 01 2.2.3 A speed test 用IPython就没有这个问题？2021.02.01

3.2.3 How kernel code is written and manage the thread and block sizes?

3.2.3.1 Creating multiple blocks

完整代码：

import numpy as np

# --- PyCUDA initialization
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule



########
# MAIN #
########

start = cuda.Event()
end = cuda.Event()
# start.record()
N = 100000


# --- Create random vectorson the CPU
h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)

# --- Set CPU arrays as single precision
h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)

# --- Allocate GPU device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
d_c = cuda.mem_alloc(h_a.nbytes)


# --- Memcopy from host to device
# start.record()
cuda.memcpy_htod(d_a, h_a)
cuda.memcpy_htod(d_b, h_b)
# end.record()
mod = SourceModule("""
  #include <stdio.h>
  __global__ void deviceAdd(float *d_c, const float *d_a, const float *d_b)
  {
    const int index = blockIdx.x;
    d_c[index] = d_a[index] + d_b[index];
  } 
  """)

# --- Define a reference to the __global__ function and call it
deviceAdd = mod.get_function("deviceAdd")
blockDim = (1, 1, 1)
gridDim = (N, 1, 1)
start.record()
deviceAdd(d_c, d_a, d_b, block=blockDim, grid=gridDim)
end.record()
end.synchronize()
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))

# --- Copy results from device to host
h_c = np.empty_like(h_a)
cuda.memcpy_dtoh(h_c, d_c)

if np.array_equal(h_c, h_a + h_b):
    print("Test passed!")
else:
    print("Error!")

# --- Flush context printf buffer
cuda.Context.synchronize()

测试结果：
性能分析：
主要修改代码在SourceModule和blockDim、 gridDim。
By using blockIdx.x to index the array, each block handles a different element of the array. On the device, each block can execute in parallel.
The preceding screenshot represents the vector addition GPU code in
which every block shows indexing for multiple single-thread blocks.

3.2.3.2 Creating multiple threads

完整代码：

import numpy as np

# --- PyCUDA initialization
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

########
# MAIN #
########

start = cuda.Event()
end = cuda.Event()
# start.record()
N = 1024

# --- Create random vectorson the CPU
h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)

# --- Set CPU arrays as single precision
h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)

# --- Allocate GPU device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
d_c = cuda.mem_alloc(h_a.nbytes)

# --- Memcopy from host to device
# start.record()
cuda.memcpy_htod(d_a, h_a)
cuda.memcpy_htod(d_b, h_b)
# end.record()
mod = SourceModule("""
  #include <stdio.h>
  __global__ void deviceAdd(float *d_c, const float *d_a, const float *d_b)
  {
    d_c[threadIdx.x] = d_a[threadIdx.x] + d_b[threadIdx.x];
  } 
  """)

# --- Define a reference to the __global__ function and call it
deviceAdd = mod.get_function("deviceAdd")
blockDim = (N, 1, 1)
gridDim = (1, 1, 1)
start.record()
deviceAdd(d_c, d_a, d_b, block=blockDim, grid=gridDim)
end.record()
end.synchronize()
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))

# --- Copy results from device to host
h_c = np.empty_like(h_a)
cuda.memcpy_dtoh(h_c, d_c)

if np.array_equal(h_c, h_a + h_b):
    print("Test passed!")
else:
    print("Error!")

# --- Flush context printf buffer
cuda.Context.synchronize()

对于GTX 1050 Ti？，N的最大取值为1024。

The NVIDIA Pascal card allows a maximum of 1,024 threads per thread block in the x and y dimensions, while in the z dimension, you can only launch 64 threads. Similarly, the maximum blocks in a grid are restricted to 65,535 in the y and z dimensions in the Pascal architecture and 2^31 -1 in the x dimension.

运行结果：
性能分析：
Instead of blockIdx.x, we make use of threadIdx.x, as shown in the following screenshot:
This will execute the device_add function N times in parallel instead of once. Each parallel invocation of the device_add function is referred to as a thread.

3.2.3.3 Combining blocks and threads

Two scenarios that depict different combinations:
Let’s consider that the total number of vector elements is 32.
- Scenario 1:
  Each block contains 8 threads and a total of four blocks.
- Scenario 2:
  Each block contains 4 threads and a total of eight blocks.
完整代码：

import numpy as np

# --- PyCUDA initialization
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

########
# MAIN #
########


start = cuda.Event()
end = cuda.Event()
# start.record()
N = 3200
BLOKESIZE = 4

# --- Create random vectorson the CPU
h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)

# --- Set CPU arrays as single precision
h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)

# --- Allocate GPU device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
d_c = cuda.mem_alloc(h_a.nbytes)

# --- Memcopy from host to device
# start.record()
cuda.memcpy_htod(d_a, h_a)
cuda.memcpy_htod(d_b, h_b)
# end.record()
mod = SourceModule("""
  #include <stdio.h>
  __global__ void deviceAdd(float *d_c, const float *d_a, const float *d_b)
  {  
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    d_c[index] = d_a[index] + d_b[index];
  } 
  """)

# --- Define a reference to the __global__ function and call it
deviceAdd = mod.get_function("deviceAdd")
blockDim = (int(N / BLOKESIZE), 1, 1)
gridDim = (BLOKESIZE, 1, 1)
start.record()
deviceAdd(d_c, d_a, d_b, block=blockDim, grid=gridDim)
end.record()
end.synchronize()
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))

# --- Copy results from device to host
h_c = np.empty_like(h_a)
cuda.memcpy_dtoh(h_c, d_c)

if np.array_equal(h_c, h_a + h_b):
    print("Test passed!")
else:
    print("Error!")

# --- Flush context printf buffer
cuda.Context.synchronize()

代码将32个元素的向量改成了3200个元素的向量。
为什么index = threadIdx.x + blockIdx.x * blockDim.x;就可以索引到对应的threads：
在这里插入图片描述

运行结果：
- 当BLOCSIZE = 4时：
- 当BLOCSIZE = 8时：
- 当BLOCSIZE = 800时：
- 当BLOCSIZE = 3200时：
当BLOCKSIZE很大时(threads很少时)，算得就更慢.
性能分析：

3.2.3.4 Why bother with threads and blocks?

Unlike parallel blocks, threads have mechanisms to communicate and synchronize efficiently.
Threads belonging to different blocks cannot communicate/synchronize with each other during the execution of the kernel.
The result of this is that, if new hardware is released with more SMs and if the code has enough parallelism, the code can be scaled linearly.
The threads communicate with each other using a special memory known as shared memory.
元素的向量改成了3200个元素的向量。
为什么index = threadIdx.x + blockIdx.x * blockDim.x;就可以索引到对应的threads：
运行结果：
- 当BLOCSIZE = 4时：
- 当BLOCSIZE = 8时：
- 当BLOCSIZE = 800时：
- 当BLOCSIZE = 3200时：
当BLOCKSIZE很大时(threads很少时)，算得就更慢.
性能分析：

3.2.3.4 Why bother with threads and blocks?

Unlike parallel blocks, threads have mechanisms to communicate and synchronize efficiently.
Threads belonging to different blocks cannot communicate/synchronize with each other during the execution of the kernel.
The result of this is that, if new hardware is released with more SMs and if the code has enough parallelism, the code can be scaled linearly.
The threads communicate with each other using a special memory known as shared memory.