Charpter 01 Introduction to CUDA Programming
- Charpter 01 Introduction to CUDA Programming
- 0. Contents in Brief:
- 1.The history of highperformance computing
- 1.1 Basic
- 1.2 History Overview
- 1.2.1 Three epochs of last century
- 1.2.2 Almost 20 YEARS later … to 2020, there were no fundamental innovations.
- Why GPU?
- CPU 和 GPU 的区别是什么?
- 1.2.3 2004 A researcher named Mark Harris made use of GPU for non-graphics tasks for the first time.
- 1.2.4 2007 CUDA: Compute Unified Device Architecture was first released.
- 1.3 Heterogeneous computing 异构计算
- 1.4 Programming paradigm
- 1.5 Low latency versus higher throughput 低延迟和高吞吐量
- 1.6 Programming approaches to GPU
- 2. Getting started with PyCUDA
- 3. Kernels, Threads, Blocks, and Grids
0. Contents in Brief:
- History of HPC
- Getting started with PyCUDA
1.The history of highperformance computing
1.1 Basic
Floating-Point Operations (FLOPs) per second is the fundamental unit for measuring the theoretical peak of any compute processor.
https://en.wikipedia.org/wiki/FLOPS
1.2 History Overview
1.2.1 Three epochs of last century
-
MegaFLOP ~ PetaFLOP
-
百万~千兆
-
1000000~1000000000000000
-
1 0 5 10^5 105~ 1 0 15 10^{15} 1015
epoch1 epoch2 epoch3 CRAY-1 CRAY-2 Cray T3D 160 MFOP 2 GFLOPs 1 TFLOP a single vector CPU a 4 Core Vector CPU compute nodes communicate by network
1.2.2 Almost 20 YEARS later … to 2020, there were no fundamental innovations.
Technological innovations were primarily focused on three architectural innovations:
-
- Moving from an 8-bit to a 16-bit to a 32-bit and now a 64-bit instruction set;
-
- Increasing ILP;
Instruction-Level Parallelism (ILP) is a concept wherein code-independent instructions can execute at the same time. For the instructions to execute in parallel, they need to be independent of each other.
ILP:独立于代码的指令级并行
https://en.wikipedia.org/wiki/Instruction-level_parallelism
https://zhuanlan.zhihu.com/p/139241174 -
- Increasing the number of cores.
This was supported by increasing the clock rate, which currently stands at 4 GHz
Clock rate: 时钟频率,CPU的主频
The fundamental laws that drove the semiconductor industry make this possible.
- Moore’s Law https://en.wikipedia.org/wiki/Moore%27s_law
- Dennard scaling https://en.wikipedia.org/wiki/Dennard_scaling
Why GPU?
https://zhuanlan.zhihu.com/p/107394889
CPU 和 GPU 的区别是什么?
https://www.zhihu.com/question/19903344/answer/96081382
1.2.3 2004 A researcher named Mark Harris made use of GPU for non-graphics tasks for the first time.
General Purpose Computation using GPU (GPGPU) was coined.
https://scholar.google.com/citations?user=pxbPE_cAAAAJ&hl=en
1.2.4 2007 CUDA: Compute Unified Device Architecture was first released.
1.3 Heterogeneous computing 异构计算
Heterogeneous:多种多样的;混杂的
-
GPUs are used to accelerate the parts of the code that are parallel in nature.
-
A highly efficient CPU coupled with a high throughput GPU results in improved performance for the application.
-
Diagram: An application running on multiple processor types:
This concept can be very well defined with the help of Amdahl’s law.
Amdahl’s law :
https://en.wikipedia.org/wiki/Amdahl%27s_lawA formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved.
The key point:
-
CPU is good for a certain fraction of code that is
latency bound -
GPU is good at running the Single Instruction
Multiple Data (SIMD) part of the code in parallel.SIMD:
https://en.wikipedia.org/wiki/SIMD
A class of parallel computers in Flynn’s taxonomy.
It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.
It won’t necessarily result in good speedup for the overall application if only one of them. So essentially offloading certain types of operations from the processor onto a GPU is called Heterogeneous Computing.
Diagram: latency bound and throughput bound:
1.4 Programming paradigm
1.4.1 Flynn’s taxonomy
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Flynn’s taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966.
The four classifications include:
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data streams (SIMD)
- Multiple instruction streams, single data stream (MISD)
- Multiple instruction streams, multiple data streams (MIMD)
Further divisions: 略
SIMDs is used to describe GPU architecture.
SIMD is used to describe an architecture where the same instruction is applied in parallel to multiple data points.
In contrast, in Single Instruction Multiple Threads (SIMTs), rather than a
single thread issuing the instructions, multiple threads issue the same instruction to different data.
SIMTs
[https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads%5C)
An execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading.
The GPU architecture is more suitable in terms of the SIMT category compared to SIMD.
1.4.2 An example
Adding two arrays and storing data in a third array : C = A + B C = A + B C=A+B
Diagram: Vector addition
It is obvious that each task is independent of each other, but the same
operation is being applied by all of the threads.
1.5 Low latency versus higher throughput 低延迟和高吞吐量
1.5.1 CPU architecture
CPU | GPU |
---|---|
optimized for low latency access | optimized for data parallel throughput computation |
larger amount of cache | smaller amount of cache |
many types of cache | not many |
L3 ~ L1 the lower the amount of cache is present and less latency.
Since CPUs run at a very high clock speed, it becomes necessary to hide the latency of fetching the data by frequently storing used data in caches and predicting the next instruction to execute.
Diagram: How the CPU and GPU architecture dedicate the chip die area for different memory and compute units:
While GPU uses a lot of transistors for computing ALUs, CPU uses
it to reduce latency.
The CPU architecture is a latency reducing
architecture.
ALUs: 算术逻辑单元
https://en.wikipedia.org/wiki/Arithmetic_logic_unit
https://baike.baidu.com/item/算术逻辑单元/8954657?fr=aladdinArchitecture of the central processing unit (CPU)
https://computersciencewiki.org/index.php/Architecture_of_the_central_processing_unit_(CPU)
1.5.2 GPU architecture
The GPU architecture, on the other hand, is called a latency reducing or high throughput architecture.
The GPU architecture hides latency with computations from other threads.
why we can’t create these threads in the CPU and do the same thing to hide latency?
The context switching time between threads in CPU, compared to GPU, is much higher due to less registers in CPU than in GPU.
1.6 Programming approaches to GPU
Diagram:The various ways we can perform GPU programming:
2. Getting started with PyCUDA
CUDA | |
---|---|
host | device |
CPU | GPU |
Be responsible for calling the device function | / |
host code (the serial code) | device code |
2.1 Querying your GPU with PyCUDA
import pycuda.driver as drv
drv.init() # 初始化PyCUDA
# 查GPU参数
device = drv.Device(0)
print(f'Device {0}: {device.name()}')
compute_capability = float('%d.%d' % device.compute_capability())
print(f'Compute Capability:{compute_capability}')
print(f'Total Memory: {device.total_memory()} megabytes')
device_attributes = device.get_attributes()
print(device_attributes)
2.2 Using PyCUDA’s gpuarray
class
2.2.1 Transferring data to and from the GPU with gpuarray
- GPU has its own memory: device memory or global device memory(to differentiate this from the additional cache memory, shared memory, and register memory that is also on the GPU.)
- We use
malloc
andfree
functions in C ornew
anddelete
operators in C++ to allocat heap memory dynamically. - We treat global device memory the same way in CUDA C/C++ to transfer data back and forth between the CPU to the GPU with commands such as
cudaMemcpyHostToDevice
andcudaMemcpyDeviceToHost
- All while keeping track of multiple pointers in both the CPU and GPU space and performing proper memory allocations (
cudaMalloc
) and deallocations (cudaFree
).
Fortunately, PyCUDA covers all of the overhead of memory allocation, deallocation, and data transfers with the gpuarray
class.
gpuarray
objects even perform automatic cleanup based on the lifetime.
"""Using PyCUDA's gpuarray class"""
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
host_data = np.array([1, 2, 3, 4, 5], dtype=np.float32)
device_data = gpuarray.to_gpu(host_data)
print(device_data)
device_data_2 = device_data*2
print(device_data_2)
host_data_2 = device_data_2.get()
print(host_data_2)
check this url if you have problem with your pycuda nvcc enviroment
https://stackoverflow.com/questions/40856535/pycuda-nvcc-fatal-cannot-find-compiler-cl-exe-in-path
2.2.2 Basic pointwise arithmetic operations with gpuarray
A pointwise operation is intrinsically parallelizable.
array(A) + , − , ∗ , / , ∗ ∗ +, -, *, /, ** +,−,∗,/,∗∗ array(B)
These pointwise operations performed on the GPU are in parallel since the computation of one element is not dependent on the computation of any other element.
2.2.3 A speed test
"""A speed test"""
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from time import time
host_data = np.float32(np.random.random(5000000))
t1 = time()
host_data_2x = host_data * np.float32(2)
t2 = time()
print('total time to compute on CPU: %f' % (t2 - t1))
device_data = gpuarray.to_gpu(host_data)
t1 = time()
device_data_2x = device_data * np.float32(2)
t2 = time()
from_device = device_data_2x.get()
print('total time to compute on GPU: %f' % (t2 - t1))
print('Is the host computation the same as the GPU computation?:{}'.format(np.allclose(from_device, host_data_2x)))
以上是 GPU型号:GTX 1050 Ti 的结果,第一次run是3s多,
这和原书 GTX 1050 的0.008s的结果还差很多,书中解释说pycuda会重新编译.cu文件,但是重新运行会不用再编译新的.cu文件,速度就会上来,或者用编译好的GPU code或者用Scikit-CUDA库就不会有这个问题。但是测试重新run很多次还是没有上来太多,稳定在2.5s左右.(2021.01.26)
如果示例代码是第一次被运行,它将会慢一些,而且在程序正常输出之前还会打印下面内容。https://zhuanlan.zhihu.com/p/125598914
UserWarning: The CUDA compiler succeeded, but said the following:
kernel.cu
用IPython就快了,5000000时GPU时间稳定在0.002s左右;5000000之前GPU和CPU时间都差不多。(2021.01.27)
Tips: Python脚本性能分析:
若py文件名为’time_calc0.py’,在IPython中输入:
with open('time_calc0.py','r') as f:
time_calc_code = f.read()
%prun -s cumulative exec(time_calc_code)
ncalls:表示函数调用的次数;
tottime:表示指定函数的总的运行时间,除掉函数中调用子函数的运行时间;
percall:(第一个percall)等于 tottime/ncalls;
cumtime:表示该函数及其所有子函数的调用运行的时间,即函数开始调用到返回的时间;
percall:(第二个percall)即函数运行一次的平均时间,等于 cumtime/ncalls;
filename:lineno(function):每个函数调用的具体信息;
The Python Profilers:
https://docs.python.org/3/library/profile.html#cProfile.run
How to use profilers?
https://www.machinelearningplus.com/python/cprofile-how-to-profile-your-python-code/
2.3 Using PyCUDA’s ElementWiseKernel
for performing pointwise computations
2.3.1 Basic
- Kernel: A function that is launched directly onto the GPU by CUDA.
We will use several functions from PyCUDA that generate templates and design patterns for different types of kernels, easing our transition into GPU programming.
An example:
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from time import time
from pycuda.elementwise import ElementwiseKernel
host_data = np.float32(np.random.random(50000000))
gpu_2x_ker = ElementwiseKernel(
"float *in, float *out",
"out[i] = 2*in[i];",
"gpu_2x_ker")
def speedcomparison():
t1 = time()
host_data_2x = host_data * np.float32(2)
t2 = time()
print('total time to compute on CPU: %f' % (t2 - t1))
device_data = gpuarray.to_gpu(host_data)
# allocate memory for output
device_data_2x = gpuarray.empty_like(device_data)
t1 = time()
gpu_2x_ker(device_data, device_data_2x)
t2 = time()
from_device = device_data_2x.get()
print('total time to compute on GPU: %f' % (t2 - t1))
print(
'Is the host computation the same as the GPU computation? : {}'.format(np.allclose(from_device, host_data_2x)))
if __name__ == '__main__':
speedcomparison()
运行结果:
GPU时间和原书差距很大,原书结果是0.0s,呃。
性能分析:
可以发现
{method 'acquire' of '_thread.lock' objects}
耗时最多,1.964s,几乎是全部的时间,无论用ipython还是用pycharm结果一样。而前一个例子用ipython时这一项耗时为0,用pycharm时这一项耗时也有1.9s多。
????这是什么意思2021.01.27
2.3.2 Mandelbrot
from time import time
import matplotlib
##This will prevent the figure from popping up matplotlib.use('Agg')
from matplotlib import pyplot as plt
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.elementwise import ElementwiseKernel
mandel_ker = ElementwiseKernel(
"pycuda::complex<float> *lattice, float *mandelbrot_graph, int max_iters, float upper_bound",
"""
mandelbrot_graph[i] = 1;
pycuda::complex<float> c = lattice[i];
pycuda::complex<float> z(0,0);
for (int j = 0; j < max_iters; j++)
{
z = z*z + c;
if(abs(z) > upper_bound)
{
mandelbrot_graph[i] = 0;
break;
}
}
""",
"mandel_ker")
def gpu_mandelbrot(width, height, real_low, real_high, imag_low, imag_high, max_iters, upper_bound):
# we set up our complex lattice as such
real_vals = np.matrix(np.linspace(real_low, real_high, width), dtype=np.complex64)
imag_vals = np.matrix(np.linspace( imag_high, imag_low, height), dtype=np.complex64) * 1j
mandelbrot_lattice = np.array(real_vals + imag_vals.transpose(), dtype=np.complex64)
# copy complex lattice to the GPU
mandelbrot_lattice_gpu = gpuarray.to_gpu(mandelbrot_lattice)
# allocate an empty array on the GPU
mandelbrot_graph_gpu = gpuarray.empty(shape=mandelbrot_lattice.shape, dtype=np.float32)
mandel_ker( mandelbrot_lattice_gpu, mandelbrot_graph_gpu, np.int32(max_iters), np.float32(upper_bound))
mandelbrot_graph = mandelbrot_graph_gpu.get()
return mandelbrot_graph
if __name__ == '__main__':
t1 = time()
mandel = gpu_mandelbrot(512,512,-2,2,-2,2,256, 2)
t2 = time()
mandel_time = t2 - t1
t1 = time()
fig = plt.figure(1)
plt.imshow(mandel, extent=(-2, 2, -2, 2))
plt.savefig('mandelbrot.png', dpi=fig.dpi)
t2 = time()
dump_time = t2 - t1
print ('It took {} seconds to calculate the Mandelbrot graph.'.format(mandel_time))
print ('It took {} seconds to dump the image.'.format(dump_time))```
运行结果:
性能分析:
仍然是这一项耗时最长,去掉这个时间比原书0.89s快。(测试GPU:GTX 1050 Ti,原书GPU:GTX 1050)
3. Kernels, Threads, Blocks, and Grids
3.1 Kernels
3.1.1 Contents
- 前面写了PyCUDA内置的特定设计模式的kernel,这章写完全自定义的kernel,更精细地控制GPU,但是编程也更复杂;
- 理解threads, blocks, and grids和它们在kernel里面的角色;
3.1.2 The PyCUDA SourceModule
function
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import SourceModule
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
int i = threadIdx.x;
outvec[i] = scalar*vec[i];
}
""")
# __global__ : Distinguish the function as a kernel to the compiler.
# void: We'll always just declare this as a void function, because we'll always get our output values by passing a pointer to some empty chunk of memory that we pass in as a parameter.
void:
https://docs.microsoft.com/en-us/cpp/cpp/void-cpp?view=msvc-160
https://stackoverflow.com/questions/1043034/what-does-void-mean-in-c-c-and-cBasically it means “nothing” or “no type”
There are 3 basic ways that void is used:
- Function argument:
int myFunc(void)
– the function takes nothing.- Function return value:
void myFunc(int)
– the function returns nothing- Generic data pointer:
void* data
– ‘data’ is a pointer to data of unknown type, and cannot be dereferencedNote: the void in a function argument is optional in C++, so int myFunc() is exactly the same as int myFunc(void), and it is left out completely in C#. It is always required for a return value.
指针
A pointer is an object in many programming languages that stores a memory address.
https://en.wikipedia.org/wiki/Pointer_(computer_programming)C语言数组当参数传递
https://blog.csdn.net/Laoynice/article/details/79196993/
# parameters:
# outvec: (floating-point array pointer) 输出的scale后的向量;
# scalar: (float) 要乘以的数,不是一个指针;
# vec: (floating-point array pointer) 输入向量.
Tips: Singleton input parameters to a kernel function can be passed in directly from the host without using pointers or allocated device memory.
回看Charpter 01 2.3 相对应的用ElementwiseKernel
的例子:
gpu_2x_ker = ElementwiseKernel(
"float *in, float *out",
"out[i] = 2*in[i];",
"gpu_2x_ker")
它通过i
自动地在多个GPU threads 上并行。
在以上自定义kernel里面:
The identification of each individual thread is given by the threadIdx
value, which we retrieve as follows: int i = threadIdx.x;
.
Here we’ll have to specifically set the number of threads to 512 with the block
and grid
parameters.
完整代码:
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import SourceModule
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
int i = threadIdx.x;
outvec[i] = scalar*vec[i];
}
""")
scalar_multiply_gpu = ker.get_function("scalar_multiply_kernel")
testvec = np.random.randn(512).astype(np.float32)
testvec_gpu = gpuarray.to_gpu(testvec)
outvec_gpu = gpuarray.empty_like(testvec_gpu)
scalar_multiply_gpu(outvec_gpu, np.float32(2), testvec_gpu, block=(512, 1, 1), grid=(1, 1, 1))
print("Does our kernel work correctly? : {}".format(np.allclose(outvec_gpu.get(), 2 * testvec)))
3.2 Threads, blocks, and grids
3.2.1 GPU architecture
3.2.1.1 The two sides of CUDA:
Software | —> | Hardware |
---|---|---|
Kernel | —> | GPU device |
Device memory | —> | GPU memory |
Block | Executes on/as | Multi-processor |
Block Specific memory | —> | Multi-Processor memory |
Thread | —> | SIMD cores/CUDA cores |
-
CUDA Threads:
- CUDA threads execute on a CUDA core.
- CUDA threads are different from CPU threads.
- CUDA threads are extremely lightweight and provide fast context switching.
The reason for fast context switching is due to the availability of a large register size in a GPU and hardwarebased scheduler.
Register:
A processor register is a quickly accessible location available to a computer’s processor.
https://en.wikipedia.org/wiki/Processor_register
https://baike.baidu.com/item/寄存器/187682?fr=aladdin
-
CUDA blocks:
- CUDA threads are grouped together into a logical entity called a CUDA block.
- CUDA blocks execute on a single Streaming Multiprocessor (SM).
- One block runs on a single SM.
- The user needs to divide the parallel computation into blocks and threads.
-
GRID/kernel:
- CUDA blocks are grouped together into a logical entity called a CUDA GRID.
- A CUDA GRID is then executed on the device.
3.2.1.2 Thread hierarchy
- All threads in a grid execute the same kernel code;
- They require a set of unique coordinates (thread ID) to identify the appropriate portion of the data they need to process.
- The maximum number of threads per block is determined by the compute capability.
- If we consider the compute capability 3.5(NVIDIA 2012a).
- The total size of a block is limited to 1024 threads. valid example to identify a thread: (1024, 1, 1) or (32, 32, 1)…
An example of how a thread is identified:
- The blocks in the grid are organized in a 2-dimensional array.
- Inside each block, the threads are arranged in a 3-dimensional array.
- In this example, the block is organized into 5x3x2 arrays of threads, giving a total of 30 threads per block.
- For the particular case of thread (4, 2, 0), its coordinates are given by
threadIdx.x = 4
,threadIdx.y = 2
, andthreadIdx.z = 0
. - This example has six blocks within the grid, and 30 threads per block giving a grand total of 180 threads in the grid.
- The number of threads per block, the quantity of blocks within a grid and their respective array are defined by parameters given at the kernel launch, by
dimBlock
anddimGrid
.- 上图所示例子对应的代码为:
-
dim3 dimBlock(5, 3, 2); dim3 dimGrid(3, 2); Kernel1<<<dimGrid, dimBlock>>>(...);
- grid 因为只有2维,省略了第三个参数。
3.2.2 Vector addition using PyCUDA
3.2.2.1 The sequential C code running on CPU
#include<stdio.h>
#include<stdlib.h>
#define N 512
void host_add(int *a, int *b, int *c) {
for(int idx=0;idx<N;idx++)
c[idx] = a[idx] + b[idx];
}
//basically just fills the array with index.
void fill_array(int *data) {
for(int idx=0;idx<N;idx++)
data[idx] = idx;
}
void print_output(int *a, int *b, int*c) {
for(int idx=0;idx<N;idx++)
printf("\n %d + %d = %d", a[idx] , b[idx], c[idx]);
}
int main(void) {
int *a, *b, *c;
int size = N * sizeof(int);
// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); fill_array(a);
b = (int *)malloc(size); fill_array(b);
c = (int *)malloc(size);
host_add(a,b,c);
print_output(a,b,c);
free(a); free(b); free(c);
return 0;
}
- Steps:
- Allocate memory on the CPU, that is,
malloc new
. - Populate/initialize the CPU data.
- Call the CPU function that has the crunching of data. The actual algorithm is vector addition in this case.
- Consume the crunched data, which is printed in this case.
3.2.2.2 Convert the sequential C code to CUDA C code
- The
main
function will look like this:
int main(void) {
int *a, *b, *c;
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); fill_array(a);
b = (int *)malloc(size); fill_array(b);
c = (int *)malloc(size);
// Alloc space for device copies of vector (a, b, c)
cudaMalloc((void *)&d_a, N * sizeof(int));
cudaMalloc((void *)&d_b, N *sizeof(int));
cudaMalloc((void *)&d_c, N * sizeof(int));
// Copy from host to device
cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, N* sizeof(int), cudaMemcpyHostToDevice);
device_add<<<1,1>>>(d_a,d_b,d_c);
// Copy result back to host
cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
print_output(a,b,c);
free(a); free(b); free(c);
//free gpu memory
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
- Steps:
-
Allocate memory on the GPU, that is,
cudaMalloc
.Unlike the
malloc
command,cudaMalloc
does not return a pointer to allocated memory; instead, it takes a pointer reference as a parameter and updates the same with the allocated memory.指针和引用的区别:
https://blog.csdn.net/l477918269/article/details/90233908?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.baidujs&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.baidujs -
Populate/initialize the CPU data.
-
Transfer the data from the host to the device with
cudaMemcpy
.Like other
memcopy
commands, this API requires the destination pointer, source pointer, and size. -
Call the GPU function with
<<<,>>>
brackets. -
Synchronize the evice and host with
cudaDeviceSynchronize
.In order for the host to make sure that kernel execution has finished, the host calls the
cudaDeviceSynchronize
function. -
Transfer data from the device to the host with
cudaMemcpy
. -
Consume the crunched data, which is printed in this case.
除了分配内存、初始化数据、调用方法和使用处理后的数据,多了3,5,6几个步骤。另外,分配内存变成分配给GPU,调用方法变成调用GPU方法。
3.2.2.3 Using PyCUDA to rewrite the CUDA C code
- 完整代码:
ref: https://medium.com/@CIulius/five-different-ways-to-sum-vectors-in-pycuda-3f2d9409b139
import numpy as np
# --- PyCUDA initialization
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
###################
# iDivUp FUNCTION #
###################
def iDivUp(a, b):
return a // b + 1
########
# MAIN #
########
start = cuda.Event()
end = cuda.Event()
N = 100000
BLOCKSIZE = 256
# --- Create random vectorson the CPU
h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)
# --- Set CPU arrays as single precision
h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)
# --- Allocate GPU device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
d_c = cuda.mem_alloc(h_a.nbytes)
# --- Memcopy from host to device
cuda.memcpy_htod(d_a, h_a)
cuda.memcpy_htod(d_b, h_b)
mod = SourceModule("""
#include <stdio.h>
__global__ void deviceAdd(float * __restrict__ d_c, const float * __restrict__ d_a, const float * __restrict__ d_b, const int N)
{
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid >= N) return;
d_c[tid] = d_a[tid] + d_b[tid];
}
""")
# --- Define a reference to the __global__ function and call it
deviceAdd = mod.get_function("deviceAdd")
blockDim = (BLOCKSIZE, 1, 1)
gridDim = (iDivUp(N, BLOCKSIZE), 1, 1)
start.record()
deviceAdd(d_c, d_a, d_b, np.int32(N), block=blockDim, grid=gridDim)
end.record()
end.synchronize()
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))
# --- Copy results from device to host
h_c = np.empty_like(h_a)
cuda.memcpy_dtoh(h_c, d_c)
if np.array_equal(h_c, h_a + h_b):
print("Test passed!")
else:
print("Error!")
# --- Flush context printf buffer
cuda.Context.synchronize()
-
测试结果:
-
性能分析:
-
若记录传输数据耗时:
-
运行结果:
-
若记录第二次传输数据耗时:
-
运行结果:
-
所以Charpter 01 section 2.3中所出现的问题是host和device间的数据传输过于耗时,且主要耗时在第二次传输数据时:两个章节例子在数据传输上的耗时均在在2.0s左右。
如何降低数据传输耗时?为什么Charpter 01 2.2.3 A speed test 用IPython就没有这个问题?2021.02.01
3.2.3 How kernel code is written and manage the thread and block sizes?
3.2.3.1 Creating multiple blocks
- 完整代码:
import numpy as np
# --- PyCUDA initialization
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
########
# MAIN #
########
start = cuda.Event()
end = cuda.Event()
# start.record()
N = 100000
# --- Create random vectorson the CPU
h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)
# --- Set CPU arrays as single precision
h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)
# --- Allocate GPU device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
d_c = cuda.mem_alloc(h_a.nbytes)
# --- Memcopy from host to device
# start.record()
cuda.memcpy_htod(d_a, h_a)
cuda.memcpy_htod(d_b, h_b)
# end.record()
mod = SourceModule("""
#include <stdio.h>
__global__ void deviceAdd(float *d_c, const float *d_a, const float *d_b)
{
const int index = blockIdx.x;
d_c[index] = d_a[index] + d_b[index];
}
""")
# --- Define a reference to the __global__ function and call it
deviceAdd = mod.get_function("deviceAdd")
blockDim = (1, 1, 1)
gridDim = (N, 1, 1)
start.record()
deviceAdd(d_c, d_a, d_b, block=blockDim, grid=gridDim)
end.record()
end.synchronize()
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))
# --- Copy results from device to host
h_c = np.empty_like(h_a)
cuda.memcpy_dtoh(h_c, d_c)
if np.array_equal(h_c, h_a + h_b):
print("Test passed!")
else:
print("Error!")
# --- Flush context printf buffer
cuda.Context.synchronize()
-
测试结果:
-
性能分析:
-
主要修改代码在
SourceModule
和blockDim
、gridDim
。
By usingblockIdx.x
to index the array, each block handles a different element of the array. On the device, each block can execute in parallel.
-
The preceding screenshot represents the vector addition GPU code in
which every block shows indexing for multiple single-thread blocks.
3.2.3.2 Creating multiple threads
- 完整代码:
import numpy as np
# --- PyCUDA initialization
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
########
# MAIN #
########
start = cuda.Event()
end = cuda.Event()
# start.record()
N = 1024
# --- Create random vectorson the CPU
h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)
# --- Set CPU arrays as single precision
h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)
# --- Allocate GPU device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
d_c = cuda.mem_alloc(h_a.nbytes)
# --- Memcopy from host to device
# start.record()
cuda.memcpy_htod(d_a, h_a)
cuda.memcpy_htod(d_b, h_b)
# end.record()
mod = SourceModule("""
#include <stdio.h>
__global__ void deviceAdd(float *d_c, const float *d_a, const float *d_b)
{
d_c[threadIdx.x] = d_a[threadIdx.x] + d_b[threadIdx.x];
}
""")
# --- Define a reference to the __global__ function and call it
deviceAdd = mod.get_function("deviceAdd")
blockDim = (N, 1, 1)
gridDim = (1, 1, 1)
start.record()
deviceAdd(d_c, d_a, d_b, block=blockDim, grid=gridDim)
end.record()
end.synchronize()
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))
# --- Copy results from device to host
h_c = np.empty_like(h_a)
cuda.memcpy_dtoh(h_c, d_c)
if np.array_equal(h_c, h_a + h_b):
print("Test passed!")
else:
print("Error!")
# --- Flush context printf buffer
cuda.Context.synchronize()
对于GTX 1050 Ti?,N
的最大取值为1024。
The NVIDIA Pascal card allows a maximum of 1,024 threads per thread block in the x and y dimensions, while in the z dimension, you can only launch 64 threads. Similarly, the maximum blocks in a grid are restricted to 65,535 in the y and z dimensions in the Pascal architecture and 2^31 -1 in the x dimension.
-
运行结果:
-
性能分析:
-
Instead of blockIdx.x, we make use of threadIdx.x, as shown in the following screenshot:
-
This will execute the
device_add
functionN
times in parallel instead of once. Each parallel invocation of thedevice_add
function is referred to as a thread.
3.2.3.3 Combining blocks and threads
-
Two scenarios that depict different combinations:
Let’s consider that the total number of vector elements is 32.-
Scenario 1:
Each block contains 8 threads and a total of four blocks.
-
Scenario 2:
Each block contains 4 threads and a total of eight blocks.
-
-
完整代码:
import numpy as np
# --- PyCUDA initialization
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
########
# MAIN #
########
start = cuda.Event()
end = cuda.Event()
# start.record()
N = 3200
BLOKESIZE = 4
# --- Create random vectorson the CPU
h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)
# --- Set CPU arrays as single precision
h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)
# --- Allocate GPU device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
d_c = cuda.mem_alloc(h_a.nbytes)
# --- Memcopy from host to device
# start.record()
cuda.memcpy_htod(d_a, h_a)
cuda.memcpy_htod(d_b, h_b)
# end.record()
mod = SourceModule("""
#include <stdio.h>
__global__ void deviceAdd(float *d_c, const float *d_a, const float *d_b)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
d_c[index] = d_a[index] + d_b[index];
}
""")
# --- Define a reference to the __global__ function and call it
deviceAdd = mod.get_function("deviceAdd")
blockDim = (int(N / BLOKESIZE), 1, 1)
gridDim = (BLOKESIZE, 1, 1)
start.record()
deviceAdd(d_c, d_a, d_b, block=blockDim, grid=gridDim)
end.record()
end.synchronize()
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))
# --- Copy results from device to host
h_c = np.empty_like(h_a)
cuda.memcpy_dtoh(h_c, d_c)
if np.array_equal(h_c, h_a + h_b):
print("Test passed!")
else:
print("Error!")
# --- Flush context printf buffer
cuda.Context.synchronize()
代码将32个元素的向量改成了3200个元素的向量。
为什么index = threadIdx.x + blockIdx.x * blockDim.x;
就可以索引到对应的threads:
-
运行结果:
-
当
BLOCSIZE = 4
时:
-
当
BLOCSIZE = 8
时:
-
当
BLOCSIZE = 800
时:
-
当
BLOCSIZE = 3200
时:
当
BLOCKSIZE
很大时(threads很少时),算得就更慢. -
-
性能分析:
3.2.3.4 Why bother with threads and blocks?
-
Unlike parallel blocks, threads have mechanisms to communicate and synchronize efficiently.
-
Threads belonging to different blocks cannot communicate/synchronize with each other during the execution of the kernel.
-
The result of this is that, if new hardware is released with more SMs and if the code has enough parallelism, the code can be scaled linearly.
-
The threads communicate with each other using a special memory known as shared memory.
元素的向量改成了3200个元素的向量。
为什么index = threadIdx.x + blockIdx.x * blockDim.x;
就可以索引到对应的threads:
-
运行结果:
-
当
BLOCSIZE = 4
时:
-
当
BLOCSIZE = 8
时:
-
当
BLOCSIZE = 800
时:
-
当
BLOCSIZE = 3200
时:
当
BLOCKSIZE
很大时(threads很少时),算得就更慢. -
-
性能分析:
3.2.3.4 Why bother with threads and blocks?
-
Unlike parallel blocks, threads have mechanisms to communicate and synchronize efficiently.
-
Threads belonging to different blocks cannot communicate/synchronize with each other during the execution of the kernel.
-
The result of this is that, if new hardware is released with more SMs and if the code has enough parallelism, the code can be scaled linearly.
-
The threads communicate with each other using a special memory known as shared memory.