OpenCL™ Specification: 3.2 Execution Model

3.2. Execution Model


The OpenCL execution model is defined in terms of two distinct units of execution: kernels that execute on one or more OpenCL devices and a host program that executes on the host. With regard to OpenCL, the kernels are where the "work" associated with a computation occurs. This work occurs through work-items that execute in groups (work-groups).


A kernel executes within a well-defined context managed by the host. The context defines the environment within which kernels execute. It includes the following resources:


  • Devices: One or more devices exposed by the OpenCL platform.

  • Kernel Objects: The OpenCL functions with their associated argument values that run on OpenCL devices.

  • Program Objects: The program source and executable that implement the kernels.

  • Memory Objects: Variables visible to the host and the OpenCL devices. Instances of kernels operate on these objects as they execute.

The host program uses the OpenCL API to create and manage the context. Functions from the OpenCL API enable the host to interact with a device through a command-queue. Each command-queue is associated with a single device. The commands placed into the command-queue fall into one of three types:

  • Kernel-enqueue commands: Enqueue a kernel for execution on a device.

  • Memory commands: Transfer data between the host and device memory, between memory objects, or map and unmap memory objects from the host address space.

  • Synchronization commands: Explicit synchronization points that define order constraints between commands.

In addition to commands submitted from the host command-queue, a kernel running on a device can enqueue commands to a device-side command queue. This results in child kernels enqueued by a kernel executing on a device (the parent kernel). Regardless of whether the command-queue resides on the host or a device, each command passes through six states.


1. Queued: The command is enqueued to a command-queue. A command may reside in the queue until it is flushed either explicitly (a call to clFlush) or implicitly by some other command.


2. Submitted: The command is flushed from the command-queue and submitted for execution on the device. Once flushed from the command-queue, a command will execute after any prerequisites for execution are met.


3. Ready: All prerequisites constraining execution of a command have been met. The command, or for a kernel-enqueue command the collection of work-groups associated with a command, is placed in a device work-pool from which it is scheduled for execution.


4. Running: Execution of the command starts. For the case of a kernel-enqueue command, one or more work-groups associated with the command start to execute.


5. Ended: Execution of a command ends. When a kernel-enqueue command ends, all of the work-groups associated with that command have finished their execution. Immediate side effects, i.e. those associated with the kernel but not necessarily with its child kernels, are visible to other units of execution. These side effects include updates to values in global memory.


6. Complete: The command and its child commands have finished execution and the status of the event object, if any, associated with the command is set to CL_COMPLETE.
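The six states can be sketched as a small state machine. This is a plain-Python model for illustration only: `CommandState`, `advance` and `lifecycle` are hypothetical names, not part of the OpenCL API, and the model covers only the ordinary path in which a command visits every state in order.

```python
from enum import Enum


class CommandState(Enum):
    """The six conceptual states, listed in the order a command passes through them."""
    QUEUED = 1
    SUBMITTED = 2
    READY = 3
    RUNNING = 4
    ENDED = 5
    COMPLETE = 6


def advance(state):
    """Return the state that follows `state` on the ordinary path."""
    order = list(CommandState)  # Enum members iterate in definition order
    index = order.index(state)
    if index == len(order) - 1:
        raise ValueError("a COMPLETE command has no further transitions")
    return order[index + 1]


def lifecycle():
    """Walk one command from QUEUED to COMPLETE and return every state it visited."""
    state = CommandState.QUEUED
    visited = [state]
    while state is not CommandState.COMPLETE:
        state = advance(state)
        visited.append(state)
    return visited


if __name__ == "__main__":
    print(" -> ".join(s.name for s in lifecycle()))
```

A real implementation is free to realize these states however it likes; the model only fixes their order.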


The execution states and the transitions between them are summarized in Figure 2. These states and the concept of a device work-pool are conceptual elements of the execution model. An implementation of OpenCL has considerable freedom in how these are exposed to a program. Five of the transitions, however, are directly observable through a profiling interface. These profiled transitions are also shown in Figure 2.


Figure 2. The states and transitions between states defined in the OpenCL execution model. A subset of these transitions is exposed through the profiling interface.


Commands communicate their status through event objects. Successful completion is indicated by setting the event status associated with a command to CL_COMPLETE. Unsuccessful completion results in abnormal termination of the command, which is indicated by setting the event status to a negative value. In this case, the command-queue associated with the abnormally terminated command and all other command-queues in the same context may no longer be available, and their behavior is implementation-defined.
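The status convention above can be illustrated with a short sketch. The four constants carry the numeric values the OpenCL headers give them; the function `classify_event_status` and its return strings are hypothetical and exist only for this example.

```python
# Execution-status values as defined in the OpenCL headers.
CL_COMPLETE = 0
CL_RUNNING = 1
CL_SUBMITTED = 2
CL_QUEUED = 3


def classify_event_status(status):
    """Interpret an event's execution status (hypothetical helper, not an OpenCL call)."""
    if status < 0:
        # Any negative value signals abnormal termination of the command.
        return "error"
    if status == CL_COMPLETE:
        return "complete"
    # CL_QUEUED, CL_SUBMITTED and CL_RUNNING all mean the command is still in flight.
    return "pending"


if __name__ == "__main__":
    for status in (CL_QUEUED, CL_RUNNING, CL_COMPLETE, -1):
        print(status, classify_event_status(status))
```

Because an error is reported as a negative status, host code that waits on an event has to check the sign of the status rather than compare against CL_COMPLETE alone.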


A command submitted to a device will not launch until prerequisites that constrain the order of commands have been resolved. These prerequisites have three sources:


  • They may arise from commands submitted to a command-queue that constrain the order in which commands are launched. For example, commands that follow a command-queue barrier will not launch until all commands prior to the barrier are complete.

  • The second source of prerequisites is dependencies between commands expressed through events. A command may include an optional list of events. The command will wait and not launch until all the events in the list are in the state CL_COMPLETE. By this mechanism, event objects define order constraints between commands and coordinate execution between the host and one or more devices.

  • The third source of prerequisites can be the presence of non-trivial C initializers or C++ constructors for program scope global variables. In this case, the OpenCL C/C++ compiler shall generate program initialization kernels that perform C initialization or C++ construction. These kernels must be executed by the OpenCL runtime on a device before any kernel from the same program can be executed on the same device. The ND-range for any program initialization kernel is (1,1,1). When multiple programs are linked together, the order of execution of program initialization kernels that belong to different programs is undefined.

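The event-based prerequisite (the second source above) can be sketched as a toy scheduler. This is a simplified model, not OpenCL host code: `launch_order` and the command names are hypothetical, each command is identified with its own event, and the model assumes a command runs to completion before the next one is picked.

```python
def launch_order(wait_lists):
    """Return one order in which commands can launch.

    `wait_lists` maps a command name to the list of commands whose events it
    waits on. A command launches only once every event in its list is complete.
    """
    complete = set()
    order = []
    pending = list(wait_lists)  # dict keys keep their insertion (enqueue) order
    while pending:
        ready = [cmd for cmd in pending
                 if all(event in complete for event in wait_lists[cmd])]
        if not ready:
            raise RuntimeError("unsatisfiable wait list (cycle or unknown event)")
        cmd = ready[0]        # pick the earliest-enqueued command that is ready
        order.append(cmd)
        complete.add(cmd)     # model: its event reaches the complete state
        pending.remove(cmd)
    return order


if __name__ == "__main__":
    commands = {
        "write_a": [],
        "kernel": ["write_a", "write_b"],  # must wait for both writes
        "write_b": [],
        "read": ["kernel"],
    }
    print(launch_order(commands))
```

In the example, `kernel` is enqueued before `write_b` but cannot launch until both writes are complete, so it is deferred.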
Program clean-up may result in the execution of one or more program clean-up kernels by the OpenCL runtime. This is due to the presence of non-trivial C++ destructors for program scope variables. The ND-range for executing any program clean-up kernel is (1,1,1). The order of execution of clean-up kernels from different programs (that are linked together) is undefined.


Program initialization and clean-up kernels are missing before version 2.2.


Note that C initializers, C++ constructors, or C++ destructors for program scope variables cannot use pointers to coarse grain and fine grain SVM allocations.


A command may be submitted to a device and yet have no visible side effects outside of waiting on and satisfying event dependences. Examples include markers, kernels executed over ranges of no work-items or copy operations with zero sizes. Such commands may pass directly from the ready state to the ended state.


Command execution can be blocking or non-blocking. Consider a sequence of OpenCL commands. For blocking commands, the OpenCL API functions that enqueue commands don’t return until the command has completed. Alternatively, OpenCL functions that enqueue non-blocking commands return immediately and require that a programmer defines dependencies between enqueued commands to ensure that enqueued commands are not launched before needed resources are available. In both cases, the actual execution of the command may occur asynchronously with execution of the host program.

Commands within a single command-queue execute relative to each other in one of two modes:


  • In-order Execution: Commands and any side effects associated with commands appear to the OpenCL application as if they execute in the same order they are enqueued to a command-queue.

  • Out-of-order Execution: Commands execute in any order constrained only by explicit synchronization points (e.g. through command queue barriers) or explicit dependencies on events.

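The difference between the two modes can be made concrete with a small checker that decides whether an observed execution order is permitted. It is a sketch under simplifying assumptions: `is_valid_execution` is a hypothetical helper, event dependencies are the only explicit constraint modeled (command-queue barriers are left out), and commands are treated as executing one at a time.

```python
def is_valid_execution(enqueued, executed, in_order, depends_on=None):
    """Check whether `executed` is a permitted order for the commands in `enqueued`.

    `depends_on` maps a command to the commands whose events it waits on.
    """
    depends_on = depends_on or {}
    if sorted(enqueued) != sorted(executed):
        return False  # every enqueued command must execute exactly once
    if in_order:
        # In-order queue: the apparent order equals the enqueue order.
        return list(executed) == list(enqueued)
    # Out-of-order queue: any order that respects the explicit dependencies.
    position = {cmd: i for i, cmd in enumerate(executed)}
    return all(position[dep] < position[cmd]
               for cmd, deps in depends_on.items()
               for dep in deps)


if __name__ == "__main__":
    queue = ["A", "B", "C"]
    deps = {"C": ["A"]}  # C waits on A's event
    print(is_valid_execution(queue, ["B", "A", "C"], in_order=True))
    print(is_valid_execution(queue, ["B", "A", "C"], in_order=False, depends_on=deps))
    print(is_valid_execution(queue, ["C", "A", "B"], in_order=False, depends_on=deps))
```

The same observed order can be illegal for an in-order queue yet legal for an out-of-order queue, as long as every explicit dependency is respected.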
Multiple command-queues can be present within a single context. Multiple command-queues execute commands independently. Event objects visible to the host program can be used to define synchronization points between commands in multiple command-queues. If such synchronization points are established between commands in multiple command-queues, an implementation must ensure that the command-queues progress concurrently and correctly account for the dependencies established by the synchronization points. For a detailed explanation of synchronization points, see the execution model Synchronization section.


The core of the OpenCL execution model is defined by how the kernels execute. When a kernel-enqueue command submits a kernel for execution, an index space is defined. The kernel, the argument values associated with the arguments to the kernel, and the parameters that define the index space define a kernel-instance. When a kernel-instance executes on a device, the kernel function executes for each point in the defined index space. Each of these executing kernel functions is called a work-item. The work-items associated with a given kernel-instance are managed by the device in groups called work-groups. These work-groups define a coarse-grained decomposition of the index space. Work-groups are further divided into sub-groups, which provide an additional level of control over execution.


Sub-groups are missing before version 2.1.


Work-items have a global ID based on their coordinates within the index space. They can also be defined in terms of their work-group and the local ID within a work-group. The details of this mapping are described in the following section.
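As a preview of that mapping, the sketch below relates the two ways of identifying a work-item. It assumes a one-dimensional index space whose global size is a multiple of the work-group size, and uses the relation global ID = work-group ID × work-group size + local ID + global offset. The function names are hypothetical and chosen for this example only.

```python
def global_id(group_id, local_id, local_size, offset=0):
    """Combine a work-group ID and a local ID into a global ID (1-D case)."""
    return group_id * local_size + local_id + offset


def split_global_id(gid, local_size, offset=0):
    """Recover (work-group ID, local ID) from a global ID (1-D case)."""
    return divmod(gid - offset, local_size)


def work_items(global_size, local_size):
    """List (global ID, work-group ID, local ID) for every point in the index space."""
    assert global_size % local_size == 0, "model assumes uniform work-groups"
    return [(gid, *split_global_id(gid, local_size)) for gid in range(global_size)]


if __name__ == "__main__":
    # 8 work-items in work-groups of 4: two work-groups, local IDs 0..3 in each.
    for gid, group, local in work_items(global_size=8, local_size=4):
        print(f"global {gid} = group {group}, local {local}")
```

With a global size of 8 and a work-group size of 4, the work-item with global ID 6 is local ID 2 in work-group 1.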






