Chapter 1: Introducing CUDA and Getting Started with CUDA
Parallel processing
The main reason clock speeds have stopped rising is the high power dissipation that comes with a high clock rate: small transistors packed into a small area and switching at high speed dissipate a large amount of power, which makes it very difficult to keep the processor cool.
As clock speeds have saturated, we need a new computing paradigm to increase processor performance.
A GPU has many small, simple processors that can get work done in parallel.
Introducing GPU architecture and CUDA
GPUs are used in many applications other than rendering graphics; GPUs used this way are called general-purpose GPUs (GPGPUs).
A GPU has simple control hardware and more hardware for data computation. This structure makes it more power-efficient; the disadvantage is that it has a more restrictive programming model.
CUDA provides an easy and efficient way of interacting with GPUs.
The performance of any hardware architecture is measured in terms of latency and throughput.
1. Latency is the time taken to complete a given task.
2. Throughput is the amount of work completed in a given time.
CPUs are designed to execute each instruction in the minimum time (low latency);
GPUs are designed to execute more instructions in a given time (high throughput).
In graphics, we don't mind a delay in the processing of a single pixel; what we want is for more pixels to be processed in the same amount of time.
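A hypothetical numeric illustration: if a CPU core finishes one task in 1 ms, its latency is 1 ms and its throughput is 1,000 tasks per second; if a GPU takes 10 ms for any single task but runs 1,000 of them at once, its latency is worse (10 ms) but its throughput is far higher, at 100,000 tasks per second.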
CUDA architecture
The CUDA architecture includes the unified shader pipeline, which allows all arithmetic logic units (ALUs) on the GPU chip to be marshaled by a single CUDA program. The instruction set is tailored to general-purpose computation rather than to pixel computations specifically, and it allows arbitrary read and write access to memory.
All GPUs have many parallel processing units called cores.
On the hardware side, these cores are grouped into streaming processors and streaming multiprocessors (SMs).
On the software side, a CUDA program is executed as many threads running in parallel. Each thread runs on a different core. The threads are grouped into blocks, and each block is bound to a different streaming multiprocessor on the GPU.
Threads within the same block can communicate with one another. The GPU has a hierarchical memory structure: some memory is local to a single block (shared memory), while other memory is visible across all blocks (global memory).
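A minimal sketch of intra-block communication through shared memory, assuming a single block of at most 256 threads (the kernel name and array size are illustrative, not from the book):

__global__ void reverseInBlock(int *d_data, int n)
{
    __shared__ int tile[256];             // memory shared by all threads in this block
    int tid = threadIdx.x;                // this thread's index within the block
    if (tid < n)
        tile[tid] = d_data[tid];          // each thread writes one element
    __syncthreads();                      // wait until every thread in the block has written
    if (tid < n)
        d_data[tid] = tile[n - 1 - tid];  // each thread reads an element another thread wrote
}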
We will call the CPU and its memory the host, and the GPU and its memory the device.
The host code is compiled by a normal C or C++ compiler and runs on the CPU;
the device code is compiled by the NVIDIA compiler (nvcc) and runs on the GPU.
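With nvcc, a single source file containing both host and device code can be built in one step; for example (the file name is illustrative):

nvcc -o hello hello.cu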
Before launching the threads, the host copies data from host memory to device memory. The threads work on data in device memory and store their results in device memory. Finally, the results are copied back to host memory for further processing.
The steps to develop a CUDA C program are as follows (a minimal sketch implementing them appears after the list):
1. Allocate memory for data in the host and device memory;
2. Copy data from the host memory to the device;
3. Launch the kernel, specifying the degree of parallelism (the number of parallel threads the kernel call should run);
4. After all the threads are finished, copy the data back from the device memory to the host memory;
5. Free up all memory used on the host and the device.
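A minimal sketch of these five steps, assuming a trivial kernel that adds two numbers (the names add, d_c, and h_c are illustrative, not from the book):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(int a, int b, int *c)
{
    *c = a + b;                                      // runs on the device
}

int main()
{
    int h_c;                                         // host copy of the result
    int *d_c;                                        // device pointer

    cudaMalloc(&d_c, sizeof(int));                   // 1. allocate device memory
    // step 2 is trivial here: a and b are passed to the kernel by value
    add<<<1, 1>>>(2, 7, d_c);                        // 3. launch one block with one thread
    cudaMemcpy(&h_c, d_c, sizeof(int),
               cudaMemcpyDeviceToHost);              // 4. copy the result back to the host
    cudaFree(d_c);                                   // 5. free device memory

    printf("2 + 7 = %d\n", h_c);
    return 0;
}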
CUDA applications
Computer vision applications
With CUDA acceleration of these algorithms, applications such as image segmentation, object detection, and classification can achieve real-time performance of more than 30 frames per second. The medical imaging field is also seeing widespread use of GPUs and CUDA in the reconstruction and processing of MRI and computed tomography (CT) images.
DeviceQuery program
The deviceQuery program reports the properties of each CUDA-capable device in the system, such as its name, compute capability, memory sizes, and number of multiprocessors.
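A minimal sketch of such a query using the CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties (the printed fields are a small subset chosen for illustration):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);                      // number of CUDA-capable devices

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);           // fill in this device's properties
        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Global memory: %zu MB\n", prop.totalGlobalMem >> 20);
        printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
    }
    return 0;
}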
A basic program in CUDA C
'__global__' is a qualifier added by CUDA C to standard C. It tells the compiler that the function definition that follows should be compiled to run on the device rather than on the host.
'kernel<<<1, 1>>>' is a kernel call. The values inside the angle brackets are not arguments to the device code; they are launch parameters passed to the runtime, indicating the number of blocks and the number of threads per block that will run in parallel on the device.
'kernel<<<1, 1>>>' means the kernel function will run as one block containing one thread on the device.
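Putting these pieces together, a minimal sketch of a basic CUDA C program (the printed message is illustrative; device-side printf is assumed to be available):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel()
{
    printf("Hello from the GPU!\n");   // runs on the device
}

int main()
{
    kernel<<<1, 1>>>();                // launch one block with one thread
    cudaDeviceSynchronize();           // wait for the kernel (and its printf) to finish
    return 0;
}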