CUDA Programming: An Introduction to GPU Architecture

A deep dive into the backbone of the AI revolution: GPUs.

Photo by Nana Dua on Unsplash

Once upon a time, there was a machine learning engineer called Alice. One day, Alice’s boss assigned her a task: write code that adds two vectors. Being a proficient engineer, she knew it was essential to think the problem through before writing any code. Alice grabbed her paper and pen, sketched the steps the program should take to add two vectors, and then implemented it as follows:

// Adds two vectors element by element, one addition at a time (sequential).
void addVectors(int *vec_out, int *vec_a, int *vec_b, int size) {
    for (int i = 0; i < size; i++) {
        vec_out[i] = vec_a[i] + vec_b[i];
    }
}

The above function will work correctly. However, if the vectors contain 100,000 elements and each addition takes 2 milliseconds, the whole loop takes 100,000 × 2 ms = 200 seconds, a little over 3 minutes. This means that Alice might be fired by the end of the day.

Optimizing Performance with Parallelization

Performing the same operation on multiple data points at once is the key to one style of parallel programming: SIMD.

SIMD is a computer processing technique where one operation is applied to many data points at once.

That’s precisely what we have here. We have a single operation “addition” that we want to perform across multiple data points. Since each operation is independent of the others, there’s no need to execute them one by one.

So, what if we divide this vector into smaller vectors? For instance, we could create 10 smaller vectors, each containing 10,000 elements. We could then distribute these sub-vectors to 10 different computers, allowing them to perform the operations simultaneously. Afterward, we could concatenate the results together. This approach would potentially reduce the operation time 10x, resulting in a total time equivalent to performing the operation on a 10,000-element vector.
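To make the idea concrete, here is a minimal sketch of that chunking strategy, using POSIX threads as a stand-in for Alice’s “separate computers”. The thread API, the ChunkArgs struct, and the worker count are illustrative assumptions, not part of Alice’s original code.

#include <pthread.h>

typedef struct {
    int *vec_out, *vec_a, *vec_b;  // the shared vectors
    int start, end;                // half-open range [start, end) handled by this worker
} ChunkArgs;

static void *add_chunk(void *arg) {
    ChunkArgs *c = (ChunkArgs *)arg;
    for (int i = c->start; i < c->end; i++) {
        c->vec_out[i] = c->vec_a[i] + c->vec_b[i];
    }
    return NULL;
}

// Splits the vectors into `workers` chunks and adds each chunk in its own thread.
void addVectorsParallel(int *vec_out, int *vec_a, int *vec_b, int size, int workers) {
    pthread_t threads[workers];
    ChunkArgs args[workers];
    int chunk = (size + workers - 1) / workers;  // ceiling division

    for (int w = 0; w < workers; w++) {
        int start = w * chunk;
        int end = (start + chunk < size) ? start + chunk : size;
        args[w] = (ChunkArgs){vec_out, vec_a, vec_b, start, end};
        pthread_create(&threads[w], NULL, add_chunk, &args[w]);
    }
    for (int w = 0; w < workers; w++) {
        pthread_join(threads[w], NULL);  // wait for every chunk to finish
    }
}

With 10 workers, each thread handles a 10,000-element slice, which is exactly the decomposition described above.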

Now, consider an even more distributed scenario: what if we had 100,000 computers? We could divide the 100,000-element vector into 100,000 sub-vectors, each containing just one element. By performing the operations on all these sub-vectors simultaneously, as shown in Figure 1, the total time would shrink to that of a single element-wise addition, which we previously assumed to be 2 milliseconds.

Figure 1. Vector Addition Parallelization

The CPU: Components and Functions

The question at hand is: Do we truly require such a large number of computers? More precisely, what components are essential for our needs, and which ones are unnecessary? We can do without accessories like keyboards and monitors; what we truly require are the components responsible for carrying out calculations — the CPUs.

Multiprocessor Systems

Someone might suggest placing multiple CPUs within a single computer and utilizing them all to parallelize our task, a concept already implemented in servers and known as multiprocessor systems, as depicted in Figure 2.

Figure 2. Multiprocessor System

While this approach can be effective, it tends to be costly.

Uni-Core Processors

Let’s take a closer look inside the CPU to identify the specific components essential for maximizing our parallelization process. By understanding the structure of the CPU’s architecture, we can pinpoint the key elements necessary to optimize parallel processing efficiently.

Figure 3. Simplified CPU Architecture.

There are two main components in every CPU that we are interested in today:

  • ALU (Arithmetic Logic Unit): Performs arithmetic (addition, multiplication, etc.) and logical (AND, OR, etc.) operations on data, based on instructions from the control unit, enabling mathematical computations and logical decisions.
  • Control Unit: Directs the flow of data and instructions within the CPU, fetching instructions from memory, decoding them, and coordinating ALU operations, ensuring the proper execution of instructions in sequence.

Multi-Core Processors

It’s important to know that not everything inside a CPU works as a brain that performs calculations. The main part that does the mathematical operations is called the ‘Processing Unit’, or CPU core. So, when we talk about CPUs here, what really matters are their cores.

Now, you might wonder: Can we have one CPU with many cores? Yes, we can, and we already do. It’s called a multicore processor, like the one shown in Figure 4.

Figure 4. Multi-Core CPU Structure.

Multicore CPUs feature multiple cores, ranging from 4 to 32, with each core paired with its own L1 cache. There is also a shared L2 cache. The L1 cache is smaller but faster than the L2 cache, so cores check L1 before L2. If the required data isn’t found in either, some CPU generations then check an L3 cache, while others go directly to main memory.

Multicore computers offer a cost-effective solution for parallel computing compared to systems with multiple physical CPUs. They leverage shared resources such as memory and I/O interfaces, reducing the overall hardware costs while still providing significant computational power.

The GPU Architecture

So, what Alice requires is a component with a huge number of cores, even if these cores are less functional than those found in the CPU. This component is commonly known as the GPU, or Graphics Processing Unit. GPUs are specialized processors designed to handle parallel tasks efficiently, making them an ideal choice for accelerating computations.

Let’s take a closer look inside the Nvidia Maxwell GM107 GPU and discuss what we find in Figure 5.

Figure 5. NVIDIA Maxwell GM107 Architecture

If this is your first encounter with Nvidia GPU architecture, you’ll notice a large rectangle called a GPC, which contains smaller rectangles labeled SMM, which in turn contain a huge number of green squares. However, you might not know what each of these elements refers to. Let’s simplify the structure of the architecture with the following diagram:

Figure 6. Diagram illustrating the structure of the GPU architecture.

As shown in Figure 6, inside the GPU there are several GPCs (Graphics Processing Clusters), which are like big boxes that hold everything together. Each GPC has a raster engine and several TPCs (Texture Processing Clusters). Inside each TPC, there are one or more SMs (Streaming Multiprocessors), which are the heart of the GPU. The SMs do all the actual computing work and contain CUDA cores, Tensor cores, and other important parts, as we will see later.

Table 1 below shows that the number of GPCs, TPCs, and SMs varies from one architecture to another. For example, the NVIDIA Maxwell GM200 has 6 GPCs, 4 TPCs per GPC, and 1 SM per TPC, resulting in 4 SMs per GPC and 24 SMs in total for a full GPU: 6 (GPCs) x 4 (TPCs/GPC) x 1 (SM/TPC) = 24 SMs.

Table 1. Number of SMs and TPCs in different Nvidia GPUs.
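If you want to check these counts yourself, the arithmetic is just a chain of multiplications. The helper below is an illustrative sketch; the GM200 numbers come from the paragraph above, and the GV100 numbers come from the Volta section later in this article.

#include <stdio.h>

// Total SMs = GPCs x TPCs per GPC x SMs per TPC.
int total_sms(int gpcs, int tpcs_per_gpc, int sms_per_tpc) {
    return gpcs * tpcs_per_gpc * sms_per_tpc;
}

int main(void) {
    printf("GM200 (Maxwell): %d SMs\n", total_sms(6, 4, 1));  // 24 SMs
    printf("GV100 (Volta):   %d SMs\n", total_sms(6, 7, 2));  // 84 SMs
    return 0;
}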

Let’s take a single SM (referred to as SMM in the Maxwell architecture and SMX in the Kepler architecture), shown in Figure 7, and discuss what we find.

Figure 7. Maxwell Single SM Architecture.

Streaming Multiprocessors (SMs) serve as the fundamental building blocks of a GPU. While SMs differ across architectures, the differences are generally not extensive, and they share the same basic layout. We are going to discuss the most important and common components now.

CUDA Cores

In Arabic, there’s a common saying: ‘يتضح المعنى بالتضاد’, which suggests that the meaning of something becomes clear through contrast. So, we’ll compare the CPU core with the GPU core to understand precisely what these GPU cores do.

The number of cores in a CPU typically ranges from 4 to 8, and in certain devices it can extend to 32 cores. GPUs, on the other hand, include thousands of cores. For instance, the A100 GPU, which is widely used in the AI industry today, boasts an impressive 6,912 cores. This shows the huge gap between the number of cores in CPUs and GPUs.

A CPU core fetches instructions, decodes them, executes them, reads data from a register file, performs calculations, writes the results back to registers, and then repeats this cycle. In contrast, GPU cores are specialized in executing arithmetic operations in parallel, making them efficient for tasks that involve heavy computation, such as rendering graphics or processing large datasets.

Figure 8: CUDA Core Components

From this perspective, it’s more accurate to describe the GPU core as a floating-point unit (FPU): a specialized circuit designed to perform operations on floating-point numbers. In older GPUs, every CUDA core contained two units, as shown in Figure 8 above: one for floating-point operations and another for integer operations.

In modern GPUs, instead of being referred to as a singular “core”, there are distinct components: FP32 units for handling single-precision floating-point numbers, FP64 units for double-precision, and INT units for integer operations. Additionally, there are Tensor Cores, which we will cover later.

Special Function Units

SFUs are specialized units within an SM that efficiently handle specific mathematical operations, such as trigonometric functions (sin, cos, etc.), exponentiation, and reciprocal square roots.
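As a rough illustration, CUDA exposes fast, hardware-accelerated versions of these functions as intrinsics, and operations like these are what the SFUs serve. The kernel below is only a hedged sketch with made-up names; writing CUDA kernels properly is the topic of the next article.

__global__ void sfuDemo(const float *x, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __sinf, __expf, and rsqrtf are fast approximations handled by special hardware
        out[i] = __sinf(x[i]) + __expf(x[i]) + rsqrtf(x[i] + 1.0f);
    }
}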

Load/Store Units (LD/ST)

Load/Store units in NVIDIA’s Maxwell GPU architecture are responsible for memory operations, including loading data from memory and storing results back to it. The Maxwell architecture introduced several improvements to the efficiency and performance of these units.

Modern GPU Architecture (Volta)

The GPU that Alice is going to use in 2024 will, in most cases, not be from the Maxwell generation. We now have the V100, A100, and H100 GPUs, which are far more powerful than Maxwell, yet their architectures are not far from what we explained above. So, let’s take a look at the V100 GPU that Alice uses in our case and discuss what we find.

Figure 9. Volta 100 GPU Architecture.

At first glance (Figure 9), you notice that there are 6 GPCs, each containing 7 TPCs, which in turn contain 2 SMs each, along with a shared L2 cache and 8 memory controllers. Most of these components are familiar from the previous Maxwell architecture.

For each SM (Figure 10), Volta has:

  • 64 FP32 CUDA Cores/SM and 5,376 FP32 CUDA Cores per full GPU.
  • 64 INT CUDA Cores/SM, 32 FP64 CUDA Cores/SM.
  • 128 KB of combined shared memory and L1 data cache
  • 8 Tensor Cores/SM and 640 Tensor Cores per full GPU.

Figure 10. Volta 100 SM Architecture.

As mentioned earlier, starting from the Volta generation, the naming convention for cores has changed. In Maxwell and older generations, the CUDA core handles both floating-point (FP) and integer (INT) operations. However, in the Volta generation and beyond, these operations are separated into INT, FP32, and FP64 units.

Tensor Cores

Matrix multiplication is a fundamental operation used extensively in deep learning. It involves multiplying the elements of each row of one matrix by the corresponding elements of each column of the other matrix and summing these products.

Figure 11. Matrix Multiplication.

Parallelizing matrix multiplication involves distributing each individual operation to one of the CUDA cores, which can occupy a significant number of cores for just one matrix-matrix multiplication. For instance, multiplying two 4x4 matrices requires 64 multiplications and 48 additions. Each partial product needs to be stored temporarily before it can be accumulated with the others, resulting in a considerable amount of reading from and writing to cache memory. This process is computationally intensive due to the high demand for memory access.
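To see where those numbers come from, here is a plain 4x4 matrix multiply written so the operation count is easy to read off. This is just an illustrative CPU-side sketch, not how a GPU actually schedules the work.

// Each of the 16 output elements needs 4 multiplications and 3 additions,
// giving 16 x 4 = 64 multiplications and 16 x 3 = 48 additions in total.
void matmul4x4(const float A[4][4], const float B[4][4], float D[4][4]) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float acc = A[i][0] * B[0][j];   // first of the 4 products
            for (int k = 1; k < 4; k++) {
                acc += A[i][k] * B[k][j];    // 3 more products, 3 additions
            }
            D[i][j] = acc;
        }
    }
}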

In 2017, Nvidia introduced Tensor Cores, a groundbreaking addition to the Streaming Multiprocessors (SMs), as part of the Volta GPU architecture. We know what the second part, “Core”, means, but what is a tensor?

Figure 12. Scalar vs Vector vs Matrix vs Tensor

A tensor is a mathematical object representing a multi-dimensional array of data, commonly used in various fields such as mathematics, physics, and computer science.

Most data elements in deep learning can be effectively represented using tensors. For example, images are commonly represented as tensors with dimensions (3, 224, 224), and most of the operations in neural networks are performed using matrices. The game changer lies in adding specialized cores for matrix operations, which is precisely what Nvidia has done with their GPUs, known as Tensor Cores.

We can define a Tensor Core as a specialized functional unit designed to execute matrix operations, such as matrix-matrix multiplication and matrix addition. These operations are essential for deep learning tasks.

Matrix Multiply Accumulate

Each Tensor Core within the Streaming Multiprocessor (SM) processes matrices A, B, and C to perform the Matrix Multiply Accumulate (MMA) operation, resulting in D = AB + C. Tensor Cores are specifically designed to handle matrices of size 4x4. However, if the matrix size exceeds 4x4, it will be partitioned into sub-matrices with dimensions of 4x4.

Figure 13. Matrix Multiply Accumulate (MMA) Operation.

Matrix Multiply Accumulate refers to a computational operation that multiplies two matrices together and then adds the resulting matrix to a third matrix.
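Here is a hedged sketch of that partitioning: a larger D = AB + C is processed one 4x4 tile at a time, where each call to mma4x4 plays the role of a single Tensor Core operation. The function names and the restriction to sizes divisible by 4 are simplifying assumptions for illustration.

#define TILE 4

// One 4x4 tile of D += A * B, the role a single Tensor Core MMA plays.
static void mma4x4(const float *A, const float *B, float *D, int n) {
    for (int i = 0; i < TILE; i++) {
        for (int j = 0; j < TILE; j++) {
            float acc = D[i * n + j];            // start from the value already in D
            for (int k = 0; k < TILE; k++) {
                acc += A[i * n + k] * B[k * n + j];
            }
            D[i * n + j] = acc;
        }
    }
}

// D = A * B + C for n x n matrices (n divisible by 4), processed tile by tile.
void mmaTiled(const float *A, const float *B, const float *C, float *D, int n) {
    for (int i = 0; i < n * n; i++) D[i] = C[i];   // D starts as a copy of C
    for (int ti = 0; ti < n; ti += TILE)
        for (int tj = 0; tj < n; tj += TILE)
            for (int tk = 0; tk < n; tk += TILE)
                mma4x4(&A[ti * n + tk], &B[tk * n + tj], &D[ti * n + tj], n);
}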

Mixed Precision

Tensor Cores support mixed-precision matrix multiplication, allowing computations to be performed using a combination of lower and higher precision formats.

Mixed precision is a computational method that uses both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory.

Figure 14. Mixed Precision Multiply and Accumulate in Tensor Core.

That means that instead of passing values in single-precision (FP32) format, you can pass them in half-precision (FP16), which is more efficient in both computation and memory. The products are then accumulated in single-precision (FP32) to maintain accuracy, as illustrated above.

Using two precisions together is a clever idea from Nvidia: you get the memory efficiency and speed of FP16 while maintaining accuracy through the FP32 accumulation.
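For the curious, CUDA exposes Tensor Cores through the warp-level WMMA API, which works on 16x16 tiles built on top of the 4x4 hardware operations described earlier. The sketch below follows the standard pattern, with FP16 inputs for A and B and an FP32 accumulator; treat it as a preview only, since writing kernels is left to the next article.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A*B + C on the Tensor Cores.
__global__ void wmmaTile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                       // FP16 inputs
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);  // FP32 accumulator
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);              // D = A*B + C
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}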

Fused Multiply-Add

As shown in Figure 13, Tensor Cores perform the multiplication of A and B and the addition of C together in a single operation, called Fused Multiply-Add (FMA).

An FMA operation is a combination of two fundamental arithmetic operations: multiplication and addition. In a single FMA instruction, a Tensor Core multiplies two numbers and adds the result to an accumulator. This is expressed as AB + C, where an element of A is multiplied by an element of B and the product is added to an element of C in one operation.
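At the scalar level, this is the same fused multiply-add that standard C already offers through fmaf(); a Tensor Core simply performs it for many tile elements at once. A minimal sketch:

#include <math.h>

// Computes a*b + c as one fused operation with a single rounding step.
float fmaScalar(float a, float b, float c) {
    return fmaf(a, b, c);
}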

Figure 15. Diagram illustrates FMA Unit.

Each Tensor Core comprises 64 FMA units arranged in a grid of 16 rows and 4 columns. The rows correspond to the 16 elements of the resulting matrix D, while the columns correspond to the four fused multiply-add operations required to compute each element D(i,j), as shown above.

Now, here’s where it gets impressive: in a GPU like the GV100, the 8 Tensor Cores in each SM together perform 8 x 64 = 512 fused multiply-add operations in a single clock cycle, equating to 1,024 individual floating-point operations per SM per clock.

Volta GPUs Performance

The Volta architecture represented a significant advancement in computing power at its time of release. It was designed with a comprehensive suite of optimizations, including fused operations, mixed precision, and Tensor Cores, among other enhancements.

Figure 16. Volta with Tensor Cores vs. Pascal.

Comparing the Volta-based V100 accelerator, equipped with Tensor Cores, against the previous Pascal architecture, the V100 can perform such matrix calculations at up to a 12x faster rate than the Pascal-based Tesla P100.

Back To Alice

But how can we utilize the power of the GPU in our tasks? What programming approach can effectively utilize this hardware? How can we rewrite Alice’s function to be executed in parallel? We’ll dive into all of that in the next article, see you!
