OpenCL™ Specification: Host-side and Device-side Commands

This section describes how the OpenCL API functions associated with command-queues contribute to happens-before relations. There are two types of command-queues and associated API functions in OpenCL 2.x: host command-queues and device command-queues. The interaction of these command-queues with the memory model is for the most part equivalent. In a few cases, the rules apply only to the host command-queue. We indicate these special cases by explicitly naming the host command-queue in the memory ordering rule. SVM memory consistency in such instances is implied only with respect to synchronizing host commands.

Memory ordering rules in this section apply to all memory objects (buffers, images and pipes) as well as to SVM allocations where no earlier, and more fine-grained, rules apply.


In the remainder of this section, we assume that each command C enqueued onto a command-queue has an associated event object E that signals its execution status, regardless of whether E was returned to the unit of execution that enqueued C. We also distinguish between the API function call that enqueues a command C and creates an event E, the execution of C, and the completion of C (which marks the event E as complete).


The ordering and synchronization rules for API commands are defined as follows:


1. If an API function call X enqueues a command C, then X global-synchronizes-with C. For example, a host API function to enqueue a kernel global-synchronizes-with the start of that kernel-instance's execution, so that memory updates sequenced-before the enqueue kernel function call will global-happen-before any kernel reads or writes to those same memory locations. For a device-side enqueue, global memory updates sequenced-before X happen-before C reads or writes to those memory locations only in the case of fine-grained SVM.


2. If E is an event upon which a command C waits, then E global-synchronizes-with C. In particular, if C waits on an event E that is tracking the execution status of the command C1, then memory operations done by C1 will global-happen-before memory operations done by C. As an example, assume we have an OpenCL program using coarse-grained SVM sharing that enqueues a kernel to a host command-queue to manipulate the contents of a region of a buffer that the host thread then accesses after the kernel completes. To do this, the host thread can call clEnqueueMapBuffer to enqueue a blocking-mode map command to map that buffer region, specifying that the map command must wait on an event signaling the kernel's completion. When clEnqueueMapBuffer returns, any memory operations performed by the kernel to that buffer region will global-happen-before subsequent memory operations made by the host thread.


3. If a command C has an event E that signals its completion, then C global-synchronizes-with E.


4. For a command C enqueued to a host-side command-queue, if C has an event E that signals its completion, then E global-synchronizes-with an API call X that waits on E. For example, if a host thread or kernel-instance calls the wait-for-events function on E (e.g. the clWaitForEvents function called from a host thread), then E global-synchronizes-with that wait-for-events function call.


5. If commands C and C1 are enqueued in that sequence onto an in-order command-queue, then the event (including the event implied between C and C1 due to the in-order queue) signaling C's completion global-synchronizes-with C1. Note that in OpenCL 2.x, only a host command-queue can be configured as an in-order queue.


6. If an API call enqueues a marker command C with an empty list of events upon which C should wait, then the events of all commands enqueued prior to C in the command-queue global-synchronize-with C.


7. If a host API call enqueues a command-queue barrier command C with an empty list of events on which C should wait, then the events of all commands enqueued prior to C in the command-queue global-synchronize-with C. In addition, the event signaling the completion of C global-synchronizes-with all commands enqueued after C in the command-queue.


8. If a host thread executes a clFinish call X, then the events of all commands enqueued prior to X in the command-queue global-synchronize-with X.


9. The start of a kernel-instance K global-synchronizes-with all operations in the work-items of K. Note that this includes the execution of any atomic operations by the work-items in a program using fine-grained SVM.


10. All operations of all work-items of a kernel-instance K global-synchronize-with the event signaling the completion of K. Note that this also includes the execution of any atomic operations by the work-items in a program using fine-grained SVM.


11. If a callback procedure P is registered on an event E, then E global-synchronizes-with all operations of P. Note that callback procedures are only defined for commands within host command-queues.


12. If C is a command that waits for the completion of a user event E, and API function call X sets the status of E to CL_COMPLETE (for example, from a host thread using the clSetUserEventStatus function), then X global-synchronizes-with C.


13. If a device enqueues a command C with the CLK_ENQUEUE_FLAGS_WAIT_KERNEL flag, then the end state of the parent kernel instance global-synchronizes-with C.


14. If a work-group enqueues a command C with the CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP flag, then the end state of the work-group global-synchronizes-with C.
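
Rules 2, 3, 5, 6, and 7 can be read as edges in an ordering graph between the commands of one command-queue. The following is a minimal sketch in plain C, not OpenCL API code: `queue_model`, `enqueue`, and `ordered_before` are hypothetical names introduced only for illustration. It assumes a single command-queue, wait lists that name only earlier commands in that queue, and markers and barriers with empty wait lists. Rules 2 and 3 are collapsed into one edge from a command to each command that waits on its event.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CMDS 16

typedef enum { CMD_PLAIN, CMD_MARKER, CMD_BARRIER } cmd_kind;

typedef struct {
    cmd_kind kind;
    int num_waits;        /* length of the explicit event wait list */
    int waits[MAX_CMDS];  /* indices of earlier commands whose events are waited on */
} command;

typedef struct {
    bool in_order;        /* true models an in-order host command-queue */
    int count;
    command cmds[MAX_CMDS];
} queue_model;

/* Append a command and return its index in enqueue order. */
static int enqueue(queue_model *q, cmd_kind kind, const int *waits, int num_waits)
{
    assert(q->count < MAX_CMDS && num_waits >= 0 && num_waits <= MAX_CMDS);
    /* Assumption of this sketch: markers and barriers always use an empty wait list. */
    assert(kind == CMD_PLAIN || num_waits == 0);
    command *c = &q->cmds[q->count];
    c->kind = kind;
    c->num_waits = num_waits;
    for (int i = 0; i < num_waits; ++i) {
        assert(waits[i] >= 0 && waits[i] < q->count); /* only earlier commands */
        c->waits[i] = waits[i];
    }
    return q->count++;
}

/* True if the completion of command a is ordered before the start of command b
   according to the modeled rules. */
static bool ordered_before(const queue_model *q, int a, int b)
{
    bool reach[MAX_CMDS][MAX_CMDS];
    int n = q->count;
    assert(a >= 0 && a < n && b >= 0 && b < n);
    memset(reach, 0, sizeof reach);

    for (int j = 0; j < n; ++j) {
        const command *c = &q->cmds[j];
        for (int w = 0; w < c->num_waits; ++w)   /* rules 2 and 3: event waits */
            reach[c->waits[w]][j] = true;
        if (q->in_order && j > 0)                /* rule 5: in-order queue */
            reach[j - 1][j] = true;
        if (c->kind != CMD_PLAIN)                /* rules 6 and 7: all earlier commands */
            for (int i = 0; i < j; ++i)
                reach[i][j] = true;
        if (c->kind == CMD_BARRIER)              /* rule 7: all later commands */
            for (int k = j + 1; k < n; ++k)
                reach[j][k] = true;
    }

    /* Transitive closure (Warshall's algorithm). */
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = true;

    return reach[a][b];
}
```

In this model, two plain commands enqueued with empty wait lists on an out-of-order queue are unordered, so `ordered_before` returns `false` for them. Naming the first command in the second command's wait list, or enqueuing a barrier between them, makes it return `true`. A marker orders itself after earlier commands but, unlike a barrier, does not order later commands after itself.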


When using an out-of-order command-queue, a wait on an event, a marker command, or a command-queue barrier command can be used to ensure the correct ordering of dependent commands. In those cases, the event wait, marker command, or barrier command provides the necessary global-synchronizes-with relation.


In this situation:


  • access to shared locations or disjoint locations in a single cl_mem object when using atomic operations from different kernel instances enqueued from the host, such that one or more of the atomic operations is a write, is implementation-defined, and correct behavior is not guaranteed except at synchronization points.

  • access to shared locations or disjoint locations in a single cl_mem object when using atomic operations from different kernel instances consisting of a parent kernel and any number of child kernels enqueued by that kernel is guaranteed under the memory ordering rules described earlier in this section.

  • access to shared locations or disjoint locations in a single program scope global variable, coarse-grained SVM allocation or fine-grained SVM allocation when using atomic operations from different kernel instances enqueued from the host to a single device is guaranteed under the memory ordering rules described earlier in this section.

If fine-grained SVM is used without support for the OpenCL 2.x atomic operations, then the host and devices can concurrently read the same memory locations and can concurrently update non-overlapping memory regions, but the result of attempting to update the same memory locations is undefined. Memory consistency is guaranteed at the OpenCL synchronization points without the need for calls to clEnqueueMapBuffer and clEnqueueUnmapMemObject. For fine-grained SVM buffers it is guaranteed that at synchronization points only values written by the kernel will be updated. No writes to fine-grained SVM buffers can be introduced that were not in the original program.

In the remainder of this section, we discuss a few points regarding the ordering rules for commands with a host command-queue.


In an OpenCL 1.x implementation, a synchronization point is a kernel-instance or host program location where the contents of memory visible to different work-items or command-queue commands are the same. The OpenCL 1.x specification also states that waiting on an event and a command-queue barrier are synchronization points between commands in command-queues. Four of the rules listed above (2, 4, 7, and 8) cover these OpenCL synchronization points.


A map operation (clEnqueueMapBuffer or clEnqueueMapImage) performed on a non-SVM buffer or a coarse-grained SVM buffer is allowed to overwrite the entire target region with the latest runtime view of the data as seen by the command with which the map operation synchronizes, whether the values were written by the executing kernels or not. Any values that were changed within this region by another kernel or host thread while the kernel synchronizing with the map operation was executing may be overwritten by the map operation.
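
The consequence of this rule can be illustrated with a toy model in plain C. This is a sketch of one behavior the rule permits, not of how any particular runtime implements mapping, and it does not call the OpenCL API. The names `shadowed_buffer`, `run_kernel`, and `map_region` are hypothetical, and the "concurrent" host write is simulated sequentially.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REGION_LEN 8

typedef struct {
    int device_view[REGION_LEN]; /* data as seen by the kernel the map synchronizes with */
    int host_view[REGION_LEN];   /* data the host sees through the mapped pointer */
} shadowed_buffer;

/* Hypothetical kernel: writes only the even-indexed elements of the device view. */
static void run_kernel(shadowed_buffer *buf)
{
    for (size_t i = 0; i < REGION_LEN; i += 2)
        buf->device_view[i] = 100 + (int)i;
}

/* Model of a map over [offset, offset + len): the whole target region is
   overwritten with the device view, including elements the kernel never wrote. */
static void map_region(shadowed_buffer *buf, size_t offset, size_t len)
{
    assert(offset <= REGION_LEN && len <= REGION_LEN - offset);
    memcpy(&buf->host_view[offset], &buf->device_view[offset], len * sizeof(int));
}
```

If both views start zeroed, `run_kernel` executes, and another host thread stores 7 into `host_view[1]` in the meantime, then `map_region(&buf, 0, REGION_LEN)` leaves `host_view[0] == 100` but resets `host_view[1]` to 0. The update to element 1 is lost even though the kernel never touched it, which is why such accesses need to be ordered at synchronization points.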


Access to non-SVM cl_mem buffers and coarse-grained SVM allocations is ordered at synchronization points between host commands. In the presence of an out-of-order command-queue or a set of command-queues mapped to the same device, multiple kernel instances may execute concurrently on the same device.

