barrier (wmb, mb, rmb) and cache coherence

 

http://www.linuxforum.net/forum/gshowflat.php?Cat=&Board=linuxK&Number=428239&page=5&view=collapsed&sb=5&o=all&fpart=

Note: "barrier" here refers to wmb, rmb, and mb.

I have never found good material explaining the relationship between barriers and cache coherence. Books such as LDD and ULK cover the basic usage of barriers; LDD focuses on the role barriers play when talking to peripheral devices. Another example is using barriers to implement lock-free linked-list operations.

I am puzzled by these barrier-based lock-free operations. Another example is the big read lock (brlock.c, brlock.h). I cannot find a rule or set of conditions stating exactly when a barrier must be used. Also, void init_bh(int nr, void (*routine)(void)) in softirq.c uses a barrier.

My understanding is this: a barrier forces the CPU to enforce strong ordering, while cache coherence is concerned with keeping the copies of a memory location in multiple CPUs' caches consistent. The questions are:

Does cache coherence require barriers to work?
Which techniques cause a CPU to reorder memory reads and writes? The write buffer is one, but while data sits in the write buffer, is the cache still coherent? (It should be. If so, barriers and cache coherence are two entirely different things: cache coherence is completely transparent to software, whereas barriers require the software's participation.)

I think barriers need to be considered in the following scenario: two CPUs both operate on the same datum (writes require mutual exclusion), or they read and write two data items that are related in some way, for example when the value of one determines the operation on the other (see the sketch after the next paragraph).
This description is unsatisfying, and it is pure inference (though it has some merit: per-CPU data never needs barriers, since reordering does not affect a single CPU (fix me)).

Also, when it comes to using barriers, some books say they "make the change immediately visible to all CPUs." That phrasing may be inaccurate; how should we understand it?
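To make the scenario above concrete, here is a minimal sketch (the names writer, reader, flag, and data are illustrative, not taken from any kernel source) of the "value of one datum decides what happens to another" case, using the wmb()/rmb() primitives discussed in this thread:

int data;               /* initially 0, shared between two CPUs */
volatile int flag;      /* initially 0; volatile so the spin loop re-reads it */

/* CPU 0: publish data, then raise the flag. */
void writer(void)
{
    data = 42;
    wmb();              /* order the store to data before the store to flag */
    flag = 1;
}

/* CPU 1: consume data only after seeing the flag. */
void reader(void)
{
    int v;

    while (flag == 0)
        ;               /* spin until the writer raises the flag */
    rmb();              /* order the read of flag before the read of data */
    v = data;           /* guaranteed to see 42, not a stale 0 */
}

Without the two barriers, the writer's stores may become visible to other CPUs in either order, and the reader's loads may be satisfied out of order, so the reader could observe flag == 1 and still read data == 0.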

 


Why is there an mb() here?

My understanding is about the same as yours:
1. Cache coherence does not require barriers; it is guaranteed entirely by the hardware coherence protocol.
2. Other techniques that cause CPU reordering include load forwarding and load scheduling.
3. A memory-ordering problem only arises when at least two memory locations and at least two memory-accessing agents (CPUs or peripherals) are involved.
4. "Making the change immediately visible to all CPUs" is inaccurate. Rather, a barrier makes the order in which two memory operations become visible to other CPUs satisfy some requirement (release, acquire, or fence semantics). The sketch below illustrates points 3 and 4.
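A hypothetical fragment (not from any kernel source) illustrating points 3 and 4: two memory locations, two CPUs, and the only question is the order in which the stores become visible:

int x, y;               /* both initially 0 */

void cpu0(void)
{
    x = 1;
    y = 1;              /* no wmb() in between: other CPUs may observe
                           y == 1 while still seeing x == 0 */
}

void cpu1(void)
{
    int r1 = y;
    int r2 = x;         /* no rmb() in between: this load may be
                           satisfied before the load of y */

    /* The outcome r1 == 1 && r2 == 0 is permitted here. A wmb() in
       cpu0() plus an rmb() in cpu1() forbids it. Note that the
       barriers only constrain the visibility order; they do not make
       anything "immediately" visible. */
}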

Reordering can affect a single CPU too.
A course I took last semester, "Advanced Computer Systems Architecture," gave a classic example: on a PowerPC machine, a store instruction immediately followed by a load may execute in reversed order. For the following program, an error occurs if no memory barrier is used:

1 while ( TDRE == 0 );
2 TDR = char1;
3 asm("eieio"); // memory barrier
4 while ( TDRE == 0 );
5 TDR = char2;

Explanation: assume TDRE is the status register of some peripheral; 0 means the device is busy, 1 means it can accept a character to process. The user may then write the character into the device's TDR register; after the write, TDRE drops to 0, the device takes some time to process the character, and when that time has passed TDRE goes back from 0 to 1.

Suppose line 3 is removed. At line 1 the program waits until TDRE becomes 1 and then proceeds. But the store (TDR = char1;) is immediately followed by a load (while (TDRE == 0);), so the load at line 4 can execute before the store at line 2. In that case the value loaded at line 4 is certainly 1, line 5 executes immediately, and the device receives the second character while it is still busy, which is clearly an error.
Line 3 must therefore be inserted to ensure the store executes before the load.

This reordering problem involving a peripheral (not an SMP CPU) is easy to understand; the hard part is that the impact of reordering on SMP programming is far less intuitive.

If our understanding is correct, the next step would be to look at init_bh and ksoftirqd. But they do not seem very intuitive either.

I am also eager to learn about this area.
I cannot follow the interrupts chapter of LDD either.


This code, though simple, represents the typical job of an interrupt handler. It, in turn, calls short_incr_bp, which is defined as follows:

static inline void short_incr_bp(volatile unsigned long *index,
                                 int delta)
{
    unsigned long new = *index + delta;
    barrier(); /* Don't optimize these two together */
    *index = (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
}

(Why is barrier() used here?)

This function has been carefully written to wrap a pointer into the circular buffer without ever exposing an incorrect value. By assigning only the final value and placing a barrier to keep the compiler from optimizing things, it is possible to manipulate the circular buffer pointers safely without locks.
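One way to make the quoted question concrete: without barrier(), gcc is allowed to collapse the computation and the assignment, briefly storing an out-of-range value into *index. A hypothetical transformation of short_incr_bp's body (hand-written to show the hazard, not actual compiler output):

/* Legal rewrite if barrier() is removed: update *index in place,
   then fix it up. An interrupt handler or concurrent reader running
   in between sees a pointer past the end of the buffer. */
*index += delta;
if (*index >= short_buffer + PAGE_SIZE)
    *index = short_buffer;

The barrier() forces the new value to be fully computed in the local variable before the single assignment to *index, so readers only ever observe valid pointer values.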

Experts, please advise.

In Linux, all atomic operations come with an mb, and with barrier() (to keep gcc from reordering instructions). On x86, operations like test_and_set_bit are implemented with volatile / lock / "memory" clobbers.

The reason is obvious:

spin_lock(lock);
some read/write
...

Without the mb, some of those reads/writes could be executed before the spin_lock, which of course must not be allowed.
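For reference, a minimal sketch of how such an operation can be implemented on x86, loosely modeled on the 2.4-era test_and_set_bit (the function name here is made up to avoid clashing with the real one):

static inline int my_test_and_set_bit(int nr, volatile unsigned long *addr)
{
    int oldbit;

    __asm__ __volatile__(
        "lock; btsl %2,%1\n\t"  /* atomic read-modify-write: old bit -> CF */
        "sbbl %0,%0"            /* oldbit = CF ? -1 : 0 */
        : "=r" (oldbit), "+m" (*addr)
        : "Ir" (nr)
        : "memory", "cc");      /* "memory" is the compiler-barrier part */
    return oldbit != 0;
}

On x86 the lock prefix supplies the hardware mb, while the "memory" clobber is what keeps gcc itself from moving reads and writes across the operation.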

What follows is an article on implementing lock-free searching. It is very helpful for understanding barriers, and studying it is worth far more than translating it.




Data Dependencies and wmb()

Version 1.0







Goal



Support lock-free algorithms without inefficient and ugly read-side code.

Obstacle

Some CPUs do not support synchronous invalidation in hardware.





Example Code

Insertion into an unordered lock-free circular singly linked list, while allowing concurrent searches.





Data Structures

The data structures used in all these examples are a list element, a header element, and a lock.

struct el {
    struct el *next;
    long key;
    long data;
};
struct el head;
spinlock_t mutex;





Search and Insert Using Global Locking

The familiar globally locked implementations of search() and insert() are as follows:

void insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    head.next = p;
    spin_unlock(&mutex);
}

struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}

/* Example use. */

spin_lock(&mutex);
p = search(key);
if (p != NULL) {
    /* do stuff with p */
}
spin_unlock(&mutex);

These implementations are quite straightforward, but are subject to locking bottlenecks.



Search and Insert Using wmb() and rmb()

The existing wmb() and rmb() primitives can be used to do lock-free insertion. The searching task will either see the new element or not, depending on the exact timing, just like the locked case. In any case, we are guaranteed not to see an invalid pointer, regardless of timing, again just like the locked case. The problem is that wmb() is guaranteed to enforce ordering only on the writing CPU; the reading CPU must use rmb() to keep the ordering.





void insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    wmb();       /* order the stores to p's fields before the store to head.next */
    head.next = p;
    spin_unlock(&mutex);
}

struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        rmb();   /* order the read of the pointer before the reads of the fields */
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}



(Note: see read-copy update for information on how to delete elements from this list while still permitting lock-free searches.)





The rmb()s in search() cause unnecessary performance degradation on CPUs (such as the i386, IA64, PPC, and SPARC) where data dependencies result in an implied rmb(). In addition, code that traverses a chain of pointers would have to be broken up in order to insert the needed rmb()s. For example:

d = p->next->data;

would have to be rewritten as:

q = p->next;
rmb();
d = q->data;

One could imagine much uglier code expansion where there are more dereferences in a single expression. The inefficiencies and code bloat could be avoided if there were a primitive like wmb() that allowed read-side data dependencies to act as implicit rmb() invocations.





Why do You Need the rmb()?

Many CPUs have single instructions that cause other CPUs to see preceding stores before subsequent stores, without the reading CPUs needing an explicit rmb() if a data dependency forces the ordering.

However, some CPUs have no such instruction, the Alpha being a case in point. On these CPUs, a wmb() only guarantees that the invalidate requests corresponding to the writes will be emitted in order. The wmb() does not guarantee that the reading CPU will process these invalidates in order.



For example, consider a CPU with a partitioned cache, as shown in the following diagram:

[Diagram: a CPU whose cache is split into two banks; even-numbered cache lines go to cache bank 0, odd-numbered cache lines to cache bank 1.]

Here, even-numbered cache lines are maintained in cache bank 0, and odd-numbered cache lines are maintained in cache bank 1. Suppose that head was maintained in cache bank 0, and that a newly allocated element was maintained in cache bank 1. The insert() code's wmb() would guarantee that the invalidates corresponding to the writes to the next, key, and data fields would appear on the bus before the write to head->next.

But suppose that the reading CPU's cache bank 1 was extremely busy, with lots of pending invalidates and outstanding accesses, and that the reading CPU's cache bank 0 was idle. The invalidation corresponding to head->next could then be processed before that of the three fields. If search() were to be executing just at that time, it would pick up the new value of head->next, but, since the invalidates corresponding to the three fields had not yet been processed, it could pick up the old (garbage!) values of these fields, possibly resulting in an oops or worse.

Placing an rmb() between the access to head->next and the three fields fixes this problem. The rmb() forces all outstanding invalidates to be processed before any subsequent reads are allowed to proceed. Since the invalidates corresponding to the three fields arrived before that of head->next, this guarantees that if the new value of head->next was read, then the new values of the three fields will also be read. No oopses (or worse).
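Restating the race in code form (an illustrative sketch; the bank comments refer to the diagram above):

p = head.next;  /* reader's bank 0 is idle: sees the NEW pointer */
k = p->key;     /* reader's bank 1 is backlogged: its invalidate for
                   this line is still queued, so the OLD (garbage)
                   contents may be returned */

and with the fix:

p = head.next;
rmb();          /* process all outstanding invalidates first */
k = p->key;     /* now guaranteed to be the new value */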



However, all the rmb()s add complexity, are easy to leave out, and hurt performance on all architectures. And if you forget a needed rmb(), you end up with very intermittent and difficult-to-diagnose memory-corruption errors. Just what we don't need in Linux!

So, there is strong motivation for a way of eliminating the need for these rmb()s.

Solutions for Lock-Free Search and Insertion



Search and Insert Using wmbdd()

It would be much nicer (and faster, on many architectures) to have a primitive similar to wmb(), but one that allowed read-side data dependencies to substitute for an explicit rmb().

It is possible to do this (see patch). With such a primitive, the code looks as follows:



void insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    wmbdd();     /* write barrier honored by read-side data dependencies */
    head.next = p;
    spin_unlock(&mutex);
}

struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}





This code is much nicer: no rmb()s are required, searches proceed fully in parallel with no locks or writes, and there is no intermittent data corruption.



Search and Insert Using read_barrier_depends()

Introduce a new primitive read_barrier_depends() that is defined to be an rmb() on Alpha, and a nop on other architectures. This removes the read-side performance problem for non-Alpha architectures, but still leaves the read-side read_barrier_depends(). It is almost possible for the compiler to do a good job of generating these (assuming that a "lockfree" gcc struct-field attribute is created and used), but, unfortunately, the compiler cannot reliably tell when the relevant lock is held. (If the lock is held, the read_barrier_depends() calls should not be generated.)

After discussion on lkml, it was decided that putting in an explicit read_barrier_depends() is the appropriate thing to do in the Linux kernel. Linus also suggested that the barrier names be made more explicit. With such a primitive, the code looks as follows:



void insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    write_barrier();
    head.next = p;
    spin_unlock(&mutex);
}

struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        read_barrier_depends();  /* rmb() on Alpha, nop elsewhere */
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}





A preliminary patch for this is barriers-2.5.7-1.patch. Future releases of this patch can be found along with the RCU package here.





Other Approaches Considered

Just make wmb() work like wmbdd(), so that data dependencies act as implied rmb()s. Although wmbdd()'s semantics are much more intuitive, there are a number of uses of wmb() in Linux that do not require the stronger semantics of wmbdd(), and strengthening the semantics would incur unnecessary overhead on many CPUs, or require many changes to the code, and thus a much larger patch.

Just drop support for Alpha. After all, Compaq seems to be phasing it out, right? There are nonetheless a number of Alphas out there running Linux, and Compaq (or perhaps HP) will be manufacturing new Alphas for quite a few years to come. Microsoft would likely focus quite a bit of negative publicity on Linux's dropping support for anything (never mind that they currently support only two CPU architectures). And the code to make Alpha work correctly is not all that complex, and it does not impact performance on other CPUs.

Besides, we cannot be 100% certain that there won't be some other CPU lacking a synchronous invalidation instruction...