CUDA Series: Driver Architecture (0)

Why go to all this trouble to write this material up?

CUDA Driver SW Architecture

In a nutshell, the CUDA Driver manages GPU resources and schedules computing tasks on the GPU.

The driver provides user APIs to allocate memory on the GPU, to copy data into, out of, and between the allocated memory regions, and to launch GPU kernels that operate on the data.

The driver also provides interoperability between CUDA and other user APIs. This interoperability allows users to efficiently share GPU memory between CUDA and other APIs. It also allows efficient synchronization between GPU operations happening in different APIs, without requiring the CPU to wait for GPU operations to complete.

The driver manages all GPU resources that are required for the GPU to be able to run GPU kernels. This includes resources explicitly allocated by user API calls and the auxiliary resources needed internally. It also coordinates between the GPU and the CPU so that they work together efficiently.

Glossary
Unit: A unit is the smallest piece of software that has high coherence within the functionality it implements, has low coupling with other entities that it talks to, and can be subjected to stand-alone testing.

Component: A collection of logically connected units to achieve desired functionality. CUDA Driver is a component.

Driver: This document uses the word “driver” by itself to mean the “CUDA Driver”. Other drivers are referred to by their full names (e.g. OpenGL driver) or by qualifying them to distinguish from the CUDA driver. E.g. “other drivers”, “kernel-mode driver”.

API: Application Programming Interface.

OS: Operating System (e.g. Linux, Windows, QNX)

CPU: The processor that is running the OS and other applications in the system.

GPU: Graphics Processing Unit.

Kernel, GPU Kernel, or Compute Kernel: A piece of executable code that is meant to be executed on the GPU. The word “kernel” used alone has this meaning. This can be confused with the OS kernel so to avoid the confusion we refer to the OS kernel with a qualifier, e.g. “OS kernel”, or “Linux kernel”, or such.


Architectural Characteristics

Deferred Completion

The CUDA driver follows a deferred task completion model. It is frequently referred to as asynchronous or async because the completion of a task may not happen before the return of the function call that triggered the task.

In other words, user API functions and driver-internal functions may return before the intended operation of such functions is complete and before the effects of such operations are visible to the caller of such functions.

This asynchronous model of execution is powerful because it decouples CPU execution from GPU execution until such coupling is actually necessary (e.g., to read results back from the GPU, the GPU must first have finished its work). This allows the CPU and the GPU to make independent forward progress without having to execute in lockstep with each other.

This allows the CPU to submit work to the GPU and return immediately without waiting for the GPU operation to finish. Typically, the time taken for submission is very small compared to the time taken to finish computations on the GPU. Because the call returns immediately, the CPU is free to do other tasks, such as preparing the next submission.

Function calls that return only after their intended operation is complete are called synchronous or sync functions. When such a function causes the CPU to wait for the completion of GPU operations, this is called CPU synchronization, frequently abbreviated simply to synchronization.
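The deferred-completion model above can be pictured with a toy producer/worker pair: `submit` enqueues work and returns immediately, while `synchronize` blocks the CPU until all queued work has finished. All names here (`ToyGpu`, `submit`, `synchronize`) are illustrative stand-ins, not CUDA APIs.

```python
import queue
import threading

class ToyGpu:
    """A toy model of a GPU that consumes tasks asynchronously."""
    def __init__(self):
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            task()                      # "execute on the GPU"
            self._tasks.task_done()

    def submit(self, task):
        self._tasks.put(task)           # returns before the task runs

    def synchronize(self):
        self._tasks.join()              # CPU waits for all queued work

gpu = ToyGpu()
results = []
gpu.submit(lambda: results.append(sum(range(1000))))
gpu.synchronize()                       # only after this is the result guaranteed
print(results[0])                       # 499500
```

Reading `results` before `synchronize()` returns would be a race; the sync call is exactly the "CPU synchronization" point the text describes.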

Error Reporting

Due to the deferred completion model, asynchronous operations triggered by a function call can still be running even after the call returns. Errors generated during the asynchronous operations may not be reported when the call returns. More specifically, the return value from the function will signal success when the function is able to successfully enqueue the asynchronous operation on the GPU. Any error encountered by the asynchronous operation will be reported at the next available opportunity.

The driver tries to detect and return these errors as early as possible. Most of the APIs and driver functions can return errors produced by earlier asynchronous operations even though the called API or function itself did not generate any error.

In the worst case, the next synchronous function call that has to wait for the GPU work to complete will return the pending error.
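This "sticky error reported at the next opportunity" behavior can be sketched as follows. The class and method names are hypothetical; the point is only that enqueueing reports success, and the failure surfaces from a later call.

```python
class ToyDriver:
    """Toy model of deferred error reporting."""
    def __init__(self):
        self._pending_error = None

    def _take_pending_error(self):
        err, self._pending_error = self._pending_error, None
        return err

    def launch_async(self, task):
        # Enqueueing succeeds, so this call itself reports success.
        try:
            task()                       # pretend this runs later on the GPU
        except Exception as exc:
            self._pending_error = exc    # surfaced by a later call
        return "SUCCESS"

    def synchronize(self):
        err = self._take_pending_error()
        return f"ERROR: {err}" if err else "SUCCESS"

drv = ToyDriver()
print(drv.launch_async(lambda: 1 / 0))   # SUCCESS: the enqueue itself worked
print(drv.synchronize())                 # ERROR: division by zero
```

Note that once reported, the pending error is cleared, mirroring how a sticky error is consumed by the call that returns it.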

Batched Execution

The asynchronous function calls described in Deferred Completion also allow the driver to buffer requested operations and submit work to the GPU in batches.

On some platforms this improves performance characteristics. It can also be used to satisfy resource constraints.
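A minimal sketch of such batching: commands are buffered on the CPU side and pushed to the "GPU" only when the batch fills up or the caller flushes explicitly. The batch size and class names are made up for illustration.

```python
class BatchingQueue:
    """Buffers commands and submits them to the GPU in batches."""
    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self._buffer = []
        self.submissions = []            # each entry is one batch sent to the GPU

    def enqueue(self, cmd):
        self._buffer.append(cmd)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Explicit flush: makes sure buffered work actually reaches the GPU.
        if self._buffer:
            self.submissions.append(list(self._buffer))
            self._buffer.clear()

q = BatchingQueue(batch_size=4)
for i in range(10):
    q.enqueue(f"cmd{i}")
q.flush()                                # submit the partial tail batch
print([len(b) for b in q.submissions])   # [4, 4, 2]
```

The explicit `flush` corresponds to the question raised in the TODO below: without it, buffered work may sit on the CPU indefinitely.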

TODO: List all platforms and behavior for each? Also explain how to ensure that work has been submitted to the GPU.

Producer and Consumer

The GPU works as a consumer: it reads commands from the CPU and performs the requested tasks. The driver acts as the command producer for the GPU.

In this producer-consumer model the driver and the GPU communicate by means of channels, which are in-order command queues: the GPU executes commands in the order in which the driver submitted them.
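The channel can be modeled as a simple FIFO: the driver side appends commands, the GPU side pops and executes them in submission order. This is a conceptual sketch, not the actual channel implementation.

```python
from collections import deque

channel = deque()                        # in-order command queue

def produce(cmd):
    """Driver side: push a command into the channel."""
    channel.append(cmd)

def consume():
    """GPU side: drain and 'execute' commands in FIFO order."""
    executed = []
    while channel:
        executed.append(channel.popleft())
    return executed

# Hypothetical command names, mimicking a typical alloc/copy/launch sequence.
for c in ["alloc", "copy_h2d", "launch", "copy_d2h"]:
    produce(c)
print(consume())                         # commands run in submission order
```

The in-order property is what lets later commands (e.g., a device-to-host copy) safely assume earlier ones (the kernel launch) have completed.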

Static Architecture

Layered View

[Figure: layered view of the CUDA driver software stack]

Exported Interfaces

Applications and CUDA libraries use the CUDA Runtime and CUDA Driver APIs.

The CUDA Tools interface is used by tools like debuggers and profilers. Functions from this interface are not exported in the CUDA Runtime or CUDA Driver for linking. Thus it is hidden from user applications.

The CUDA Driver API layer is implemented by auto-generating boilerplate function code from header files. Each driver API function acts as a wrapper for calling into CUAPI functions and optionally calls tools callback functions for API tracing. Auto-generating code for this layer avoids tedious handwriting of repetitive code, thereby preventing errors.

Required Interfaces

The CUDA driver allows applications to provide GPU binaries in the PTX intermediate language. Such binaries need to be compiled into GPU-executable code before they can be executed on the GPU. This is known as run-time compilation or just-in-time (JIT) compilation, for which the driver relies on a device compiler.

On some platforms the device compiler is built as a separate shared library and the driver links to it. On other platforms the device compiler is compiled as part of the driver itself.

The CUI Layer

Core Concepts

The CUI layer is the heart of the CUDA driver. To describe its structure, we first need to understand some core concepts.

CUDA Context

A CUDA context, usually referred to simply as a context, is a software container that encapsulates a set of resources managed together on a single GPU. A single GPU can have multiple such sets of resources, hence multiple contexts, associated with it.

Resource objects and handles in the CUDA driver are usually meaningful and usable only within the context in which they were created. However, in some cases, resources such as memory objects can be shared between contexts by explicitly mapping them to contexts other than the one they were created in.
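Per-context resource ownership can be sketched as follows: a handle created in one context is rejected in another unless it is explicitly shared. The class, method names, and error string are illustrative, not CUDA API.

```python
class Context:
    """Toy model of a CUDA context as a container of resources."""
    def __init__(self, name):
        self.name = name
        self._resources = {}
        self._next = 1

    def alloc(self):
        handle = (self.name, self._next)  # handle tagged with owning context
        self._next += 1
        self._resources[handle] = object()
        return handle

    def use(self, handle):
        if handle not in self._resources:
            return "ERROR_INVALID_HANDLE"
        return "OK"

    def import_from(self, other, handle):
        # Explicit sharing, akin to mapping memory into another context.
        self._resources[handle] = other._resources[handle]

ctx_a, ctx_b = Context("A"), Context("B")
h = ctx_a.alloc()
print(ctx_a.use(h))          # OK: used in the creating context
print(ctx_b.use(h))          # ERROR_INVALID_HANDLE: not mapped here
ctx_b.import_from(ctx_a, h)  # explicit mapping makes it usable
print(ctx_b.use(h))          # OK
```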

Hardware Abstraction Layer (HAL)

The HAL abstracts GPU functionality and allows the rest of the code to ignore differences in HW functionality across GPU architectures.
The HAL allows a single CUDA driver binary to work with any installed GPU that the driver currently supports.
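One common way to picture a HAL is a per-architecture table of function pointers, selected once at initialization, so that generic code calls `hal.launch()` without knowing the GPU generation. The architecture names and fields here are invented for illustration.

```python
class Hal:
    """Table of HW-specific entry points for one GPU architecture."""
    def __init__(self, launch):
        self.launch = launch

# Hypothetical architectures; each gets its own implementation.
HAL_TABLE = {
    "arch_x": Hal(launch=lambda: "launch via arch-X method"),
    "arch_y": Hal(launch=lambda: "launch via arch-Y method"),
}

def detect_arch():
    # Pretend we probed the installed GPU at driver initialization.
    return "arch_y"

hal = HAL_TABLE[detect_arch()]           # chosen once, used everywhere
print(hal.launch())                      # generic call, HW-specific behavior
```

This is also why one driver binary can serve multiple GPU generations: only the table lookup differs per device.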

Device Model Abstraction Layer (DMAL)

The CUDA driver depends on the OS kernel and OS device drivers for access to the GPU. The DMAL abstracts differences between OS device models, allowing the rest of the code to be device-model independent.

This allows a single driver code base to work across different operating systems by including the right DMAL code for the target OS.

Functional Organization

[Figure: functional organization of the CUI layer]

The core concepts are not necessarily present as units in the CUI layer. Several units together may use or contribute to a core concept.

For instance, the context is an aggregate structure containing parts of data for several units. Each such unit populates and uses its part of the context.

Similarly, the DMAL and HAL comprise several groups of functionality. A unit that implements a particular piece of functionality also implements the corresponding part of the DMAL, the HAL, or both.

For instance, CUSW_UNIT_LAUNCH implements kernel launch; therefore it also contains the implementation of launch-related functions in the HAL.

Functional Groups

The functional groups, though implementing different parts of driver functionality, are closely related and dependent on each other for correct operation. They are described in sections below.

Context Management

The context acts as the central coordinator for operations on the GPU. It does most of its work by calling into other functional groups.

Memory Management

Memory management includes the allocation, mapping, and freeing of memory for use by the GPU, the CPU, or both.

Memory management depends on Execution. For some operations, such as freeing memory, the memory must not be in active use by the GPU. In such cases the free operation must wait until the tasks working on the memory object have completed.
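The deferred-free dependency can be sketched with a simple pending-work count: freeing a buffer that GPU tasks still reference must first wait for those tasks to drain. The class and its fields are hypothetical.

```python
class TrackedBuffer:
    """Toy model of memory whose free must wait for pending GPU work."""
    def __init__(self):
        self.pending_ops = 0             # GPU tasks currently using the buffer
        self.freed = False

    def begin_gpu_op(self):
        self.pending_ops += 1

    def end_gpu_op(self):
        self.pending_ops -= 1

    def free(self):
        # Real driver: synchronize with the GPU, then release the memory.
        while self.pending_ops:
            self.end_gpu_op()            # pretend the GPU finishes its work
        self.freed = True

buf = TrackedBuffer()
buf.begin_gpu_op()
buf.begin_gpu_op()
buf.free()                               # waits for both ops, then frees
print(buf.freed, buf.pending_ops)        # True 0
```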

GPU Code Management

GPU executable code is stored in the GPU memory.

GPU code may need to be compiled before the GPU can execute it. Such compilation is time-consuming, so instead of recompiling the code every time the application runs, the GPU binaries produced by compilation are stored in a disk-based cache.
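The compile cache amounts to keying compiled binaries by a hash of the source so the expensive step runs once per unique input. This sketch keeps the cache in memory (the real driver persists it on disk), and the PTX string and function names are illustrative.

```python
import hashlib

cache = {}
compile_count = 0

def jit_compile(ptx_source):
    """Compile PTX once per unique source; reuse the cached binary after."""
    global compile_count
    key = hashlib.sha256(ptx_source.encode()).hexdigest()
    if key not in cache:
        compile_count += 1               # the expensive step, done once
        cache[key] = f"binary({key[:8]})"
    return cache[key]

ptx = ".visible .entry add_one( ... ) { ... }"   # placeholder PTX text
first = jit_compile(ptx)
second = jit_compile(ptx)                # cache hit: no recompilation
print(first == second, compile_count)    # True 1
```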

Execution

Execution involves scheduling tasks on the CPU and GPU, synchronizing between the tasks and managing GPU resources for the tasks.

Interoperability

CUDA driver can share memory and synchronization objects with APIs other than CUDA. This is done by mapping such objects from other APIs into CUDA using the OS kernel mode driver interface.

Driver Configuration

Driver configuration can be divided into two separate phases:

Driver build time

The driver is compiled into a platform-specific shared library by the driver build process.

Driver run time

This is when the application is running on the target platform using the driver.

Build time configuration for the driver is managed using the NVIDIA-wide NVCONFIG infrastructure. It is described in detail at [TODO: Link].

Run time configuration of the driver can be done using API calls, environment variables, and application profiles (via configuration files or registry settings).
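Of these mechanisms, environment variables are the simplest to demonstrate: the driver can read them once at startup. `CUDA_CACHE_DISABLE` is a real, documented CUDA environment variable, but the lookup helper below is just a sketch.

```python
import os

def read_flag(name, default=False):
    """Interpret an environment variable as a boolean configuration flag."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip() in ("1", "true", "yes")

os.environ["CUDA_CACHE_DISABLE"] = "1"   # e.g. exported from the user's shell
print(read_flag("CUDA_CACHE_DISABLE"))   # True
```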


