OpenCL™ Specification: 3.2 Execution Model

꧁白杨树下꧂

Last modified 2023-12-14 10:52:16

First published 2023-12-14 10:46:34

3.2. Execution Model

The OpenCL execution model is defined in terms of two distinct units of execution: kernels that execute on one or more OpenCL devices and a host program that executes on the host. With regard to OpenCL, the kernels are where the "work" associated with a computation occurs. This work occurs through work-items that execute in groups (work-groups).

A kernel executes within a well-defined context managed by the host. The context defines the environment within which kernels execute. It includes the following resources:

Devices: One or more devices exposed by the OpenCL platform.
Kernel Objects: The OpenCL functions with their associated argument values that run on OpenCL devices.
Program Objects: The program source and executable that implement the kernels.
Memory Objects: Variables visible to the host and the OpenCL devices. Instances of kernels operate on these objects as they execute.

The host program uses the OpenCL API to create and manage the context. Functions from the OpenCL API enable the host to interact with a device through a command-queue. Each command-queue is associated with a single device. The commands placed into the command-queue fall into one of three types:

Kernel-enqueue commands: Enqueue a kernel for execution on a device.
Memory commands: Transfer data between the host and device memory, between memory objects, or map and unmap memory objects from the host address space.
Synchronization commands: Explicit synchronization points that define order constraints between commands.
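
The three command types can be illustrated by sorting a few real host API enqueue functions into them. In this C sketch the enum, the lookup table and `classify_enqueue` are hypothetical names. The table is deliberately incomplete. Filing `clEnqueueBarrierWithWaitList` under synchronization commands is this sketch's reading of the list above, not a normative mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The three command types a command-queue accepts (plus a fallback). */
typedef enum {
    CMD_KERNEL_ENQUEUE,   /* run a kernel on the device */
    CMD_MEMORY,           /* transfer, copy, map or unmap memory objects */
    CMD_SYNCHRONIZATION,  /* explicit ordering points such as barriers */
    CMD_UNKNOWN           /* not in this sketch's table */
} command_type;

/* A few real enqueue functions and the command type each one creates.
 * Illustrative only; many other enqueue functions exist. */
static const struct {
    const char *api;
    command_type type;
} command_table[] = {
    {"clEnqueueNDRangeKernel",       CMD_KERNEL_ENQUEUE},
    {"clEnqueueReadBuffer",          CMD_MEMORY},
    {"clEnqueueWriteBuffer",         CMD_MEMORY},
    {"clEnqueueCopyBuffer",          CMD_MEMORY},
    {"clEnqueueMapBuffer",           CMD_MEMORY},
    {"clEnqueueUnmapMemObject",      CMD_MEMORY},
    {"clEnqueueBarrierWithWaitList", CMD_SYNCHRONIZATION},
};

/* Look up the command type for an enqueue function name. */
command_type classify_enqueue(const char *api)
{
    assert(api != NULL);
    for (size_t i = 0; i < sizeof command_table / sizeof command_table[0]; ++i) {
        if (strcmp(command_table[i].api, api) == 0)
            return command_table[i].type;
    }
    return CMD_UNKNOWN;
}
```

For example, `classify_enqueue("clEnqueueReadBuffer")` returns `CMD_MEMORY`. `clFinish` blocks the host rather than enqueuing a command, so it falls through to `CMD_UNKNOWN`.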

In addition to commands submitted from the host command-queue, a kernel running on a device can enqueue commands to a device-side command queue. This results in child kernels enqueued by a kernel executing on a device (the parent kernel). Regardless of whether the command-queue resides on the host or a device, each command passes through six states.

1. Queued: The command is enqueued to a command-queue. A command may reside in the queue until it is flushed either explicitly (a call to clFlush) or implicitly by some other command.

2. Submitted: The command is flushed from the command-queue and submitted for execution on the device. Once flushed from the command-queue, a command will execute after any prerequisites for execution are met.

3. Ready: All prerequisites constraining execution of a command have been met. The command, or for a kernel-enqueue command the collection of work-groups associated with the command, is placed in a device work-pool from which it is scheduled for execution.

4. Running: Execution of the command starts. For the case of a kernel-enqueue command, one or more work-groups associated with the command start to execute.

5. Ended: Execution of a command ends. When a kernel-enqueue command ends, all of the work-groups associated with that command have finished their execution. Immediate side effects, i.e., those associated with the kernel but not necessarily with its child kernels, are visible to other units of execution. These side effects include updates to values in global memory.

6. Complete: The command and its child commands have finished execution, and the status of the event object associated with the command, if any, is set to CL_COMPLETE.
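
The six states form a mostly linear progression, and a small state machine can capture it. This is a minimal C sketch under stated assumptions. `command`, `command_state` and `can_transition` are hypothetical names. Each command tracks whether it has visible side effects and how many of its child commands are still outstanding. The sketch encodes the rule that Complete waits for child commands. It also allows a direct Ready to Ended transition, which this section permits for commands without visible side effects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The six conceptual command states, in the order a command passes through them. */
typedef enum {
    STATE_QUEUED,
    STATE_SUBMITTED,
    STATE_READY,
    STATE_RUNNING,
    STATE_ENDED,
    STATE_COMPLETE
} command_state;

/* Hypothetical bookkeeping for one command. */
typedef struct {
    command_state state;
    bool has_side_effects;  /* false for e.g. markers or zero-size copies */
    int pending_children;   /* child commands that are not yet Complete */
} command;

/* Return true if moving cmd from its current state to next is legal in this model. */
bool can_transition(const command *cmd, command_state next)
{
    assert(cmd != NULL);
    switch (cmd->state) {
    case STATE_READY:
        /* Normally Ready -> Running; a command with no visible side effects
         * may instead go straight to Ended. */
        return next == STATE_RUNNING ||
               (next == STATE_ENDED && !cmd->has_side_effects);
    case STATE_ENDED:
        /* Ended -> Complete only once every child command is Complete. */
        return next == STATE_COMPLETE && cmd->pending_children == 0;
    case STATE_COMPLETE:
        return false; /* terminal state */
    default:
        /* Queued -> Submitted, Submitted -> Ready, Running -> Ended:
         * exactly one step forward. */
        return (int)next == (int)cmd->state + 1;
    }
}
```

A parent kernel that has ended while one child kernel is still outstanding stays in Ended. It can move to Complete only once `pending_children` drops to zero.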

The execution states and the transitions between them are summarized below. These states and the concept of a device work-pool are conceptual elements of the execution model. An implementation of OpenCL has considerable freedom in how these are exposed to a program. Five of the transitions, however, are directly observable through a profiling interface. These profiled states are shown below.

Figure 2. The states and transitions between states defined in the OpenCL execution model. A subset of these transitions is exposed through the profiling interface.
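
The five profiled transitions correspond to the timestamps returned by `clGetEventProfilingInfo` for these queries:

- `CL_PROFILING_COMMAND_QUEUED`
- `CL_PROFILING_COMMAND_SUBMIT`
- `CL_PROFILING_COMMAND_START`
- `CL_PROFILING_COMMAND_END`
- `CL_PROFILING_COMMAND_COMPLETE` (added in OpenCL 2.0)

Each is a 64-bit device time in nanoseconds, available when the command-queue was created with profiling enabled. The C sketch below does not call OpenCL. It assumes the values have already been read into a hypothetical `profile_times` struct and derives a few intervals from them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the five profiling timestamps of one command,
 * in device nanoseconds, as reported for CL_PROFILING_COMMAND_QUEUED,
 * _SUBMIT, _START, _END and _COMPLETE. */
typedef struct {
    uint64_t queued, submit, start, end, complete;
} profile_times;

/* Time spent executing: from Running until Ended. */
uint64_t run_time_ns(const profile_times *t)
{
    assert(t->start <= t->end);
    return t->end - t->start;
}

/* Time from enqueue until execution started (queueing plus scheduling). */
uint64_t launch_latency_ns(const profile_times *t)
{
    assert(t->queued <= t->start);
    return t->start - t->queued;
}

/* Time between Ended and Complete, e.g. while child kernels finish. */
uint64_t child_wait_ns(const profile_times *t)
{
    assert(t->end <= t->complete);
    return t->complete - t->end;
}
```

Take timestamps of 100, 150, 400, 900 and 1000 ns. The command ran for 500 ns, waited 300 ns between enqueue and start, and spent 100 ns between Ended and Complete.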

Commands communicate their status through event objects. Successful completion is indicated by setting the event status associated with a command to CL_COMPLETE. Unsuccessful completion results in abnormal termination of the command, which is indicated by setting the event status to a negative value. In this case, the command-queue associated with the abnormally terminated command and all other command-queues in the same context may no longer be available, and their behavior is implementation-defined.
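
Host code typically reads an event's status with `clGetEventInfo` and `CL_EVENT_COMMAND_EXECUTION_STATUS`. The C sketch below only interprets such a value. The constants carry the same values as in `CL/cl.h`; they are redefined solely so the example compiles without OpenCL headers. `event_outcome` and `event_outcome_of` are hypothetical helpers.

```c
#include <assert.h>

/* Execution-status values with the same numbers as in CL/cl.h, redefined
 * so this sketch compiles without the OpenCL headers. */
#define CL_COMPLETE  0x0
#define CL_RUNNING   0x1
#define CL_SUBMITTED 0x2
#define CL_QUEUED    0x3

/* Hypothetical summary of what an event status means for the host. */
typedef enum { EVENT_PENDING, EVENT_SUCCEEDED, EVENT_FAILED } event_outcome;

event_outcome event_outcome_of(int status)
{
    if (status < 0)
        return EVENT_FAILED;      /* negative: abnormal termination (an error code) */
    assert(status <= CL_QUEUED);  /* only 0..3 are defined execution states */
    return status == CL_COMPLETE ? EVENT_SUCCEEDED : EVENT_PENDING;
}
```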

A command submitted to a device will not launch until prerequisites that constrain the order of commands have been resolved. These prerequisites have three sources:

The first source of prerequisites is commands submitted to a command-queue that constrain the order in which commands are launched. For example, commands that follow a command-queue barrier will not launch until all commands prior to the barrier are complete.
The second source of prerequisites is dependencies between commands expressed through events. A command may include an optional list of events. The command will wait and not launch until all the events in the list are in the state CL_COMPLETE. By this mechanism, event objects define order constraints between commands and coordinate execution between the host and one or more devices.
The third source of prerequisites can be the presence of non-trivial C initializers or C++ constructors for program scope global variables. In this case, the OpenCL C/C++ compiler shall generate program initialization kernels that perform C initialization or C++ construction. These kernels must be executed by the OpenCL runtime on a device before any kernel from the same program can be executed on the same device. The ND-range for any program initialization kernel is (1,1,1). When multiple programs are linked together, the order of execution of program initialization kernels that belong to different programs is undefined.
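
The three sources of prerequisites can be folded into a single launch check. In this hypothetical C sketch, `prerequisites` and `may_launch` are illustrative names, and a preceding barrier is reduced to one flag. The wait list holds execution-status values, with 0 standing for `CL_COMPLETE`. The program-initialization flag is assumed to be set for commands it does not apply to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of one command's launch prerequisites. */
typedef struct {
    bool barrier_cleared;     /* every command before a preceding barrier is complete */
    const int *wait_list;     /* execution status of each event in the wait list */
    size_t num_events;
    bool program_init_done;   /* initialization kernels ran on this device,
                                 or the condition does not apply */
} prerequisites;

/* A command may launch only when all three sources of prerequisites are met. */
bool may_launch(const prerequisites *p)
{
    assert(p != NULL && (p->num_events == 0 || p->wait_list != NULL));
    if (!p->barrier_cleared || !p->program_init_done)
        return false;
    for (size_t i = 0; i < p->num_events; ++i) {
        /* 0 is CL_COMPLETE; anything else (pending or failed) blocks the launch. */
        if (p->wait_list[i] != 0)
            return false;
    }
    return true;
}
```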

Program clean-up may result in the execution of one or more program clean-up kernels by the OpenCL runtime. This is due to the presence of non-trivial C++ destructors for program scope variables. The ND-range for executing any program clean-up kernel is (1,1,1). The order of execution of clean-up kernels from different programs (that are linked together) is undefined.

Program initialization and clean-up kernels are not available before version 2.2.

Note that C initializers, C++ constructors, or C++ destructors for program scope variables cannot use pointers to coarse-grained or fine-grained SVM allocations.

A command may be submitted to a device and yet have no visible side effects outside of waiting on and satisfying event dependences. Examples include markers, kernels executed over ranges of no work-items or copy operations with zero sizes. Such commands may pass directly from the ready state to the ended state.
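
Take a range of no work-items as a worked example. The number of work-items in an ND-range is the product of its global sizes, so a zero in any dimension yields an empty launch. The helpers below are hypothetical and only perform that arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of work-items in an ND-range: the product of the global sizes. */
size_t total_work_items(size_t work_dim, const size_t *global_size)
{
    assert(work_dim >= 1 && work_dim <= 3);
    size_t n = 1;
    for (size_t d = 0; d < work_dim; ++d)
        n *= global_size[d];
    return n;
}

/* True when the launch covers no work-items at all, so it has no visible
 * side effects beyond waiting on and satisfying event dependences. */
bool is_empty_launch(size_t work_dim, const size_t *global_size)
{
    return total_work_items(work_dim, global_size) == 0;
}
```

A 64 x 0 x 4 range is empty, whereas 64 x 2 x 4 contains 512 work-items.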

Command execution can be blocking or non-blocking. Consider a sequence of OpenCL commands. For blocking commands, the OpenCL API functions that enqueue commands don’t return until the command has completed. Alternatively, OpenCL functions that enqueue non-blocking commands return immediately and require that the programmer define dependencies between enqueued commands to ensure that enqueued commands are not launched before needed resources are available. In both cases, the actual execution of the command may occur asynchronously with execution of the host program.

Commands within a single command-queue execute relative to each other in one of two modes:

In-order Execution: Commands and any side effects associated with commands appear to the OpenCL application as if they execute in the same order they are enqueued to a command-queue.
Out-of-order Execution: Commands execute in any order constrained only by explicit synchronization points (e.g. through command queue barriers) or explicit dependencies on events.
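
The two modes can be phrased as a check on an observed execution order. This C sketch assumes commands are numbered 0 to n-1 in enqueue order. Event dependencies are supplied as hypothetical `dependency` pairs. A command-queue barrier could be encoded the same way, as edges from every earlier command to every later one. The sketch models ordering only, not the visibility of side effects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_COMMANDS 64  /* arbitrary limit for this sketch */

/* Hypothetical event dependency: command `after` waits on command `before`. */
typedef struct {
    size_t before, after;
} dependency;

/* `order` lists command indices 0..n-1 (numbered in enqueue order) in the
 * order they executed. Returns true if that order is allowed. */
bool order_is_allowed(const size_t *order, size_t n,
                      const dependency *deps, size_t num_deps, bool in_order)
{
    assert(n <= MAX_COMMANDS);
    if (in_order) {
        /* In-order: results must look as if commands ran in enqueue order. */
        for (size_t i = 0; i < n; ++i)
            if (order[i] != i)
                return false;
        return true;
    }
    /* Out-of-order: only explicit dependencies constrain the order. */
    size_t position[MAX_COMMANDS] = {0};
    for (size_t i = 0; i < n; ++i) {
        assert(order[i] < n);
        position[order[i]] = i;
    }
    for (size_t k = 0; k < num_deps; ++k) {
        assert(deps[k].before < n && deps[k].after < n);
        if (position[deps[k].before] > position[deps[k].after])
            return false;
    }
    return true;
}
```

Suppose command 2 depends only on command 0. The order 1, 0, 2 is then acceptable for an out-of-order queue but not for an in-order one, and 2, 0, 1 is rejected by both.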

Multiple command-queues can be present within a single context. Multiple command-queues execute commands independently. Event objects visible to the host program can be used to define synchronization points between commands in multiple command-queues. If such synchronization points are established between commands in multiple command-queues, an implementation must ensure that the command-queues progress concurrently and correctly account for the dependencies established by the synchronization points. For a detailed explanation of synchronization points, see the execution model Synchronization section.

The core of the OpenCL execution model is defined by how the kernels execute. When a kernel-enqueue command submits a kernel for execution, an index space is defined. The kernel, the argument values associated with the arguments to the kernel, and the parameters that define the index space define a kernel-instance. When a kernel-instance executes on a device, the kernel function executes for each point in the defined index space. Each of these executing kernel functions is called a work-item. The work-items associated with a given kernel-instance are managed by the device in groups called work-groups. These work-groups define a coarse-grained decomposition of the index space. Work-groups are further divided into sub-groups, which provide an additional level of control over execution.
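
The sketch below makes the coarse-grained decomposition concrete by counting work-groups per dimension for a hypothetical `nd_range` description. It rounds up, so a trailing partial work-group is counted. That matches non-uniform work-groups on devices that support them; otherwise each global size must be a multiple of the corresponding work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of an index space (ND-range) of 1 to 3 dimensions. */
typedef struct {
    size_t work_dim;
    size_t global_size[3];
    size_t local_size[3];  /* work-group size in each dimension */
} nd_range;

/* Work-groups along dimension d, rounding up so that a trailing partial
 * work-group is counted. */
size_t num_groups(const nd_range *r, size_t d)
{
    assert(d < r->work_dim && r->local_size[d] > 0);
    return (r->global_size[d] + r->local_size[d] - 1) / r->local_size[d];
}

/* Total work-groups in the kernel-instance: the product over all dimensions. */
size_t total_groups(const nd_range *r)
{
    size_t n = 1;
    for (size_t d = 0; d < r->work_dim; ++d)
        n *= num_groups(r, d);
    return n;
}
```

A 1024 x 768 range with 16 x 16 work-groups yields 64 x 48 = 3072 work-groups. A 1D range of 100 work-items with work-groups of 32 yields 4, the last one partial.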

Sub-groups are not available before version 2.1.

Work-items have a global ID based on their coordinates within the index space. They can also be identified by their work-group and their local ID within that work-group. The details of this mapping are described in the following section.
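
The C sketch below assumes the usual per-dimension relation, global ID = work-group ID × work-group size + local ID + global offset. It converts in both directions. The function names are hypothetical, and the work-group size is taken to be uniform in the mapped dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Per-dimension mapping, assuming work-group size S and global offset F:
 *   global = group * S + local + F */
size_t to_global_id(size_t group_id, size_t local_id, size_t local_size, size_t offset)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id + offset;
}

/* Inverse mapping: split a global ID into (work-group ID, local ID). */
void from_global_id(size_t global_id, size_t local_size, size_t offset,
                    size_t *group_id, size_t *local_id)
{
    assert(local_size > 0 && global_id >= offset);
    *group_id = (global_id - offset) / local_size;
    *local_id = (global_id - offset) % local_size;
}
```

With a work-group size of 4 and no offset, global ID 9 splits into work-group 2 and local ID 1. With an offset of 4, global ID 13 maps to the same pair.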
