PTX ISA synchronization instructions: bar & membar


Article link: https://blog.csdn.net/dark5669/article/details/60791828

PTX ISA: synchronization instructions

bar

Barrier synchronization.

Syntax

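bar.sync     a{, b};   // the CUDA C counterpart is __syncthreads()
bar.arrive   a, b;
// a: the barrier number, 0 to 15
// b: the number of threads taking part in the barrier; if b is omitted, all threads in the CTA take part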


bar.red.popc.u32   d, a{, b}, {!}c;

bar.red.op.pred    p, a{, b}, {!}c;
//Once the barrier count is reached, the final value is written to the destination register in all threads waiting at the barrier.
.op = { .and, .or };

Description

Performs barrier synchronization and communication within a CTA. Each CTA instance has sixteen barriers numbered 0..15.

In short, every CTA has 16 barriers, numbered 0 through 15.

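The bar.sync and bar.red instructions cause the executing thread to wait until all or a specified number of threads in the CTA arrive at the barrier before resuming execution.

That is, both instructions block the executing thread until every thread in the CTA (or the specified number of threads, when b is given) has reached the barrier; only then does it continue.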
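To make the wait concrete, here is a minimal CUDA sketch that issues bar.sync through inline PTX. The kernel name reverse_tile, the host helper reverse_on_gpu, the 256-element tile, and the choice of barrier 1 are all assumptions made for illustration; none of them come from the PTX manual.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: reverses one block-sized tile through shared memory.
// Assumes a single block with blockDim.x <= 256.
__global__ void reverse_tile(int *data)
{
    __shared__ int tile[256];
    unsigned t = threadIdx.x;
    unsigned n = blockDim.x;

    tile[t] = data[t];                        // write my element to shared memory
    asm volatile("bar.sync 1;" ::: "memory"); // barrier 1, no count: wait for the whole CTA
    data[t] = tile[n - 1 - t];                // read an element written by another thread
}

// Hypothetical host helper: reverses h[0..n) on the GPU, returns true on success.
bool reverse_on_gpu(int *h, int n)
{
    if (n <= 0 || n > 256) return false;
    int *d = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    reverse_tile<<<1, n>>>(d);
    cudaError_t err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

Without the barrier, a thread could read tile[n - 1 - t] before its partner has written it. Replacing the asm line with __syncthreads() should behave the same way here, since both make every thread of the block wait.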

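bar.red performs a predicate reduction across the threads participating in the barrier.

That is, bar.red combines the predicate c of every thread participating in the barrier into a single value (a population count with .popc, or a logical AND/OR with .and/.or) and, once the barrier completes, writes that value to the destination register of all waiting threads.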
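In CUDA C the same kind of barrier-plus-reduction is exposed through the intrinsics __syncthreads_count, __syncthreads_and and __syncthreads_or. My understanding is that they correspond to the .popc, .and and .or forms of bar.red, but treat that mapping as an assumption rather than a guarantee about what nvcc emits. The kernel name vote and the helper vote_on_gpu below are made up for this sketch.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: every thread compares one pair of elements, then the
// block reduces the per-thread predicate at a barrier.
__global__ void vote(const int *a, const int *b, int *out)
{
    unsigned t = threadIdx.x;
    int p = (a[t] == b[t]);                // per-thread predicate

    int count = __syncthreads_count(p);    // number of threads with p != 0
    int all   = __syncthreads_and(p);      // non-zero iff p != 0 in every thread
    int any   = __syncthreads_or(p);       // non-zero iff p != 0 in some thread

    if (t == 0) {                          // every thread holds the same results
        out[0] = count;
        out[1] = (all != 0);
        out[2] = (any != 0);
    }
}

// Hypothetical host helper: result = {count, all, any} for n <= 1024 elements.
bool vote_on_gpu(const int *ha, const int *hb, int n, int result[3])
{
    if (n <= 0 || n > 1024) return false;
    int *d = nullptr;                      // device layout: a[n], b[n], out[3]
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&d, 2 * bytes + 3 * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d + n, hb, bytes, cudaMemcpyHostToDevice);
    vote<<<1, n>>>(d, d + n, d + 2 * n);
    cudaError_t err = cudaMemcpy(result, d + 2 * n, 3 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

Note that all threads receive the reduced value, not just one of them, which matches the remark in the syntax section that the final value is written to the destination register in all threads waiting at the barrier.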

bar.arrive does not cause any waiting by the executing threads; it simply marks a thread’s arrival at the barrier.

That is, bar.arrive never blocks the executing thread; it only records that the thread has arrived at the barrier.

Examples

// Use bar.sync to arrive at a pre-computed barrier number and
// wait for all threads in CTA to also arrive:
st.shared [r0],r1; // write my result to shared memory
bar.sync 1;// arrive, wait for others to arrive
ld.shared r2,[r3]; // use shared results from other threads

// Use bar.sync to arrive at a pre-computed barrier number and
// wait for fixed number of cooperating threads to arrive:
#define CNT1 (8*12) // Number of cooperating threads
st.shared [r0],r1;// write my result to shared memory
bar.sync 1, CNT1;// arrive, wait for others to arrive
ld.shared r2,[r3];// use shared results from other threads

// Use bar.red.and to compare results across the entire CTA:
setp.eq.u32 p,r1,r2;// p is True if r1==r2
bar.red.and.pred r3,1,p; // r3=AND(p) forall threads in CTA

// Use bar.red.popc to compute the size of a group of threads
// that have a specific condition True:
setp.eq.u32 p,r1,r2;// p is True if r1==r2
bar.red.popc.u32 r3,1,p; // r3=SUM(p) forall threads in CTA

/* Producer/consumer model. The producer deposits a value in
* shared memory, signals that it is complete but does not wait
* using bar.arrive, and begins fetching more data from memory.
* Once the data returns from memory, the producer must wait
* until the consumer signals that it has read the value from
* the shared memory location. In the meantime, a consumer
* thread waits until the data is stored by the producer, reads
* it, and then signals that it is done (without waiting).
*/
// Producer code places produced value in shared memory.
st.shared   [r0],r1;
bar.arrive  0,64;
ld.global   r1,[r2];
bar.sync    1,64;
...
// Consumer code, reads value from shared memory
bar.sync    0,64;
ld.shared r1,[r0];
bar.arrive  1,64;
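The producer/consumer scheme described in the comment above can be sketched in CUDA with inline PTX roughly as follows. This is only a sketch under my own assumptions: the block has exactly 64 threads, warp 0 produces and warp 1 consumes, and the names pipeline and run_pipeline are invented. The __threadfence_block() calls before each bar.arrive are a cautious addition of mine, because I am not sure bar.arrive alone orders the preceding shared-memory accesses.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: must be launched with exactly 64 threads (2 warps).
// Barrier 0 signals "buffer full", barrier 1 signals "buffer consumed".
__global__ void pipeline(const int *in, int *out, int iters)
{
    __shared__ int buf[32];
    unsigned lane = threadIdx.x % 32;

    if (threadIdx.x < 32) {                               // producer warp
        int v = in[lane];
        for (int i = 0; i < iters; ++i) {
            buf[lane] = v;                                // deposit the value
            __threadfence_block();                        // precaution, see lead-in
            asm volatile("bar.arrive 0, 64;" ::: "memory"); // signal, do not wait
            if (i + 1 < iters)
                v = in[(i + 1) * 32 + lane];              // fetch more data meanwhile
            asm volatile("bar.sync 1, 64;" ::: "memory");   // wait until buf was read
        }
    } else {                                              // consumer warp
        for (int i = 0; i < iters; ++i) {
            asm volatile("bar.sync 0, 64;" ::: "memory");   // wait for the producer
            int v = buf[lane];                            // read the shared value
            out[i * 32 + lane] = v + 1;                   // stand-in for real work
            __threadfence_block();                        // precaution, see lead-in
            asm volatile("bar.arrive 1, 64;" ::: "memory"); // signal done, do not wait
        }
    }
}

// Hypothetical host helper: processes iters * 32 elements, iters >= 1.
bool run_pipeline(const int *h_in, int *h_out, int iters)
{
    if (iters < 1) return false;
    size_t bytes = (size_t)iters * 32 * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    pipeline<<<1, 64>>>(d_in, d_out, iters);
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

The thread count 64 on each barrier is the sum of the 32 arriving threads and the 32 waiting threads, so each barrier completes exactly once per iteration. The kernel avoids __syncthreads() because that call would also use barrier 0.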

membar

Memory barrier.

Syntax

membar.level;

.level = { .cta, .gl, .sys };

Description

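Waits for all prior memory accesses requested by this thread to be performed at the CTA, global, or system memory level.

That is, the thread waits until all of the memory accesses it issued earlier have been performed at the CTA, global, or system level.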

Thread execution resumes after a membar when the thread’s prior memory writes are visible to other threads at the specified level, and memory reads by this thread can no longer be affected by other thread writes.

That is, the thread continues past a membar only once its earlier writes are visible to the other threads at the specified level and its earlier reads can no longer be affected by writes from other threads.

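A memory read (e.g., by ld or atom) has been performed when the value read has been transmitted from memory and cannot be modified by another thread at the indicated level.

That is, a read counts as performed once the value has been returned from memory and can no longer be changed by another thread at the indicated level.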

A memory write (e.g., by st, red or atom) has been performed when the value written has become visible to other clients at the specified level, that is, when the previous value can no longer be read.

Put differently, a write counts as performed once the new value is visible to the other clients at the specified level, so that none of them can still read the old value.

membar.cta

Waits until all prior memory writes are visible to other threads in the same CTA.

That is, it waits until all earlier writes are visible to every other thread in the same CTA.
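As a rough illustration of why the wider levels matter, the sketch below uses membar.gl through inline PTX so that the last block to finish can safely read what the other blocks wrote to global memory. It follows the well-known "last block done" pattern; the names sum_blocks, blocks_done and sum_on_gpu are assumptions for this example. In CUDA C, __threadfence_block(), __threadfence() and __threadfence_system() are usually described as the counterparts of membar.cta, membar.gl and membar.sys, though I would not rely on nvcc emitting exactly these instructions.

```cuda
#include <cuda_runtime.h>

__device__ unsigned int blocks_done = 0;   // counts blocks that published a value

// Illustrative kernel: one thread per block publishes a partial value; the
// block that arrives last adds all partial values together.
__global__ void sum_blocks(const int *in, volatile int *partial, int *total)
{
    if (threadIdx.x != 0) return;

    partial[blockIdx.x] = in[blockIdx.x];            // publish my value
    asm volatile("membar.gl;" ::: "memory");         // make the write visible device-wide
    unsigned ticket = atomicAdd(&blocks_done, 1u);   // then announce completion

    if (ticket == gridDim.x - 1) {                   // I am the last block
        int s = 0;
        for (unsigned b = 0; b < gridDim.x; ++b)
            s += partial[b];                         // volatile: read from memory
        *total = s;
        blocks_done = 0;                             // reset for the next launch
    }
}

// Hypothetical host helper: sums h_in[0..n) using n single-thread blocks.
bool sum_on_gpu(const int *h_in, int n, int *h_total)
{
    if (n <= 0) return false;
    int *d = nullptr;                                // device layout: in[n], partial[n], total
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&d, 2 * bytes + sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, h_in, bytes, cudaMemcpyHostToDevice);
    sum_blocks<<<n, 1>>>(d, d + n, d + 2 * n);
    cudaError_t err = cudaMemcpy(h_total, d + 2 * n, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

membar.cta would not be enough here, because the reader sits in a different CTA than the writers. Without any barrier, the counter increment could become visible before the partial value, and the last block might sum a stale entry.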