barrier (wmb, mb, rmb) and cache coherence

 

http://www.linuxforum.net/forum/gshowflat.php?Cat=&Board=linuxK&Number=428239&page=5&view=collapsed&sb=5&o=all&fpart=

Note: "barrier" here refers to wmb, rmb, and mb.

I have never found good material explaining the relationship between barriers and cache coherence. Books such as LDD and ULK cover the basic usage of barriers; LDD focuses on the role barriers play when talking to peripheral devices. Another example is using barriers to implement lock-free linked-list operations.

I am puzzled by these barrier-based lock-free operations. Another example is the big read lock (brlock.c, brlock.h). I cannot find a rule or set of conditions stating exactly when a barrier must be used. Also, void init_bh(int nr, void (*routine)(void)) in softirq.c uses a barrier.

My understanding is this: a barrier forces the CPU to enforce strong ordering, while cache coherence is concerned with keeping the copies of a memory location in multiple CPUs' caches consistent. The questions are:

Does cache coherence require barriers to work?
Which techniques cause a CPU to reorder memory reads and writes? The write buffer is one, but while data sits in the write buffer, is the cache still coherent? (It should be. If so, barriers and cache coherence are two entirely different things: cache coherence is completely transparent to software, whereas barriers require the software's participation.)

I think barriers need to be considered in the following scenario: two CPUs both operate on the same datum (writes require mutual exclusion), or they read and write two data items that are related in some way, for example when the value of one determines the operation on the other (see the sketch after the next paragraph).
This description is unsatisfying, and it is pure inference (though it has some merit: per-CPU data never needs barriers, since reordering does not affect a single CPU (fix me)).

Also, when it comes to using barriers, some books say they "make the change immediately visible to all CPUs." That phrasing may be inaccurate; how should we understand it?
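To make the scenario above concrete, here is a minimal sketch (the names writer, reader, flag, and data are illustrative, not taken from any kernel source) of the "value of one datum decides what happens to another" case, using the wmb()/rmb() primitives discussed in this thread:

int data;               /* initially 0, shared between two CPUs */
volatile int flag;      /* initially 0; volatile so the spin loop re-reads it */

/* CPU 0: publish data, then raise the flag. */
void writer(void)
{
    data = 42;
    wmb();              /* order the store to data before the store to flag */
    flag = 1;
}

/* CPU 1: consume data only after seeing the flag. */
void reader(void)
{
    int v;

    while (flag == 0)
        ;               /* spin until the writer raises the flag */
    rmb();              /* order the read of flag before the read of data */
    v = data;           /* guaranteed to see 42, not a stale 0 */
}

Without the two barriers, the writer's stores may become visible to other CPUs in either order, and the reader's loads may be satisfied out of order, so the reader could observe flag == 1 and still read data == 0.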

 


Why is there an mb() here?

My understanding is about the same as yours:
1. Cache coherence does not require barriers; it is guaranteed entirely by the hardware coherence protocol.
2. Other techniques that cause CPU reordering include load forwarding and load scheduling.
3. A memory-ordering problem only arises when at least two memory locations and at least two memory-accessing agents (CPUs or peripherals) are involved.
4. "Making the change immediately visible to all CPUs" is inaccurate. Rather, a barrier makes the order in which two memory operations become visible to other CPUs satisfy some requirement (release, acquire, or fence semantics). The sketch below illustrates points 3 and 4.
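A hypothetical fragment (not from any kernel source) illustrating points 3 and 4: two memory locations, two CPUs, and the only question is the order in which the stores become visible:

int x, y;               /* both initially 0 */

void cpu0(void)
{
    x = 1;
    y = 1;              /* no wmb() in between: other CPUs may observe
                           y == 1 while still seeing x == 0 */
}

void cpu1(void)
{
    int r1 = y;
    int r2 = x;         /* no rmb() in between: this load may be
                           satisfied before the load of y */

    /* The outcome r1 == 1 && r2 == 0 is permitted here. A wmb() in
       cpu0() plus an rmb() in cpu1() forbids it. Note that the
       barriers only constrain the visibility order; they do not make
       anything "immediately" visible. */
}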

Reordering can affect a single CPU too.
A course I took last semester, "Advanced Computer Systems Architecture," gave a classic example: on a PowerPC machine, a store instruction immediately followed by a load may execute in reversed order. For the following program, an error occurs if no memory barrier is used:

1 while ( TDRE == 0 );
2 TDR = char1;
3 asm("eieio"); // memory barrier
4 while ( TDRE == 0 );
5 TDR = char2;

Explanation: assume TDRE is the status register of some peripheral; 0 means the device is busy, 1 means it can accept a character to process. The user may then write the character into the device's TDR register; after the write, TDRE drops to 0, the device takes some time to process the character, and when that time has passed TDRE goes back from 0 to 1.

Suppose line 3 is removed. At line 1 the program waits until TDRE becomes 1 and then proceeds. But the store (TDR = char1;) is immediately followed by a load (while (TDRE == 0);), so the load at line 4 can execute before the store at line 2. In that case the value loaded at line 4 is certainly 1, line 5 executes immediately, and the device receives the second character while it is still busy, which is clearly an error.
Line 3 must therefore be inserted to ensure the store executes before the load.

This reordering problem involving a peripheral (not an SMP CPU) is easy to understand; the hard part is that the impact of reordering on SMP programming is far less intuitive.

If our understanding is correct, the next step would be to look at init_bh and ksoftirqd. But they do not seem very intuitive either.

I am also eager to learn about this area.
I cannot follow the interrupts chapter of LDD either.


This code, though simple, represents the typical job of an interrupt handler. It, in turn, calls short_incr_bp, which is defined as follows:

static inline void short_incr_bp(volatile unsigned long *index,
                                 int delta)
{
    unsigned long new = *index + delta;
    barrier(); /* Don't optimize these two together */
    *index = (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
}

(Why is barrier() used here?)

This function has been carefully written to wrap a pointer into the circular buffer without ever exposing an incorrect value. By assigning only the final value and placing a barrier to keep the compiler from optimizing things, it is possible to manipulate the circular buffer pointers safely without locks.
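One way to make the quoted question concrete: without barrier(), gcc is allowed to collapse the computation and the assignment, briefly storing an out-of-range value into *index. A hypothetical transformation of short_incr_bp's body (hand-written to show the hazard, not actual compiler output):

/* Legal rewrite if barrier() is removed: update *index in place,
   then fix it up. An interrupt handler or concurrent reader running
   in between sees a pointer past the end of the buffer. */
*index += delta;
if (*index >= short_buffer + PAGE_SIZE)
    *index = short_buffer;

The barrier() forces the new value to be fully computed in the local variable before the single assignment to *index, so readers only ever observe valid pointer values.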

Experts, please advise.

In Linux, all atomic operations come with an mb, and with barrier() (to keep gcc from reordering instructions). On x86, operations like test_and_set_bit are implemented with volatile / lock / "memory" clobbers.

The reason is obvious:

spin_lock(lock);
some read/write
...

Without the mb, some of those reads/writes could be executed before the spin_lock, which of course must not be allowed.
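For reference, a minimal sketch of how such an operation can be implemented on x86, loosely modeled on the 2.4-era test_and_set_bit (the function name here is made up to avoid clashing with the real one):

static inline int my_test_and_set_bit(int nr, volatile unsigned long *addr)
{
    int oldbit;

    __asm__ __volatile__(
        "lock; btsl %2,%1\n\t"  /* atomic read-modify-write: old bit -> CF */
        "sbbl %0,%0"            /* oldbit = CF ? -1 : 0 */
        : "=r" (oldbit), "+m" (*addr)
        : "Ir" (nr)
        : "memory", "cc");      /* "memory" is the compiler-barrier part */
    return oldbit != 0;
}

On x86 the lock prefix supplies the hardware mb, while the "memory" clobber is what keeps gcc itself from moving reads and writes across the operation.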

What follows is an article on implementing lock-free searching. It is very helpful for understanding barriers, and studying it is worth far more than translating it.




Data Dependencies and wmb()

Version 1.0







Goal



Support lock-free algorithms without inefficient and ugly read-side code.

Obstacle

Some CPUs do not support synchronous invalidation in hardware.





Example Code

Insertion into an unordered lock-free circular singly linked list, while allowing concurrent searches.





Data Structures

The data structures used in all these examples are a list element, a header element, and a lock.

struct el {
    struct el *next;
    long key;
    long data;
};
struct el head;
spinlock_t mutex;





Search and Insert Using Global Locking

The familiar globally locked implementations of search() and insert() are as follows:

void insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    head.next = p;
    spin_unlock(&mutex);
}

struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}

/* Example use. */

spin_lock(&mutex);
p = search(key);
if (p != NULL) {
    /* do stuff with p */
}
spin_unlock(&mutex);

These implementations are quite straightforward, but are subject to locking bottlenecks.



Search and Insert Using wmb() and rmb()

The existing wmb() and rmb() primitives can be used to do lock-free insertion. The searching task will either see the new element or not, depending on the exact timing, just like the locked case. In any case, we are guaranteed not to see an invalid pointer, regardless of timing, again just like the locked case. The problem is that wmb() is guaranteed to enforce ordering only on the writing CPU; the reading CPU must use rmb() to keep the ordering.





void insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    wmb();       /* order the stores to p's fields before the store to head.next */
    head.next = p;
    spin_unlock(&mutex);
}

struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        rmb();   /* order the read of the pointer before the reads of the fields */
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}



(Note: see read-copy update for information on how to delete elements from this list while still permitting lock-free searches.)





The rmb()s in search() cause unnecessary performance degradation on CPUs (such as the i386, IA64, PPC, and SPARC) where data dependencies result in an implied rmb(). In addition, code that traverses a chain of pointers would have to be broken up in order to insert the needed rmb()s. For example:

d = p->next->data;

would have to be rewritten as:

q = p->next;
rmb();
d = q->data;

One could imagine much uglier code expansion where there are more dereferences in a single expression. The inefficiencies and code bloat could be avoided if there were a primitive like wmb() that allowed read-side data dependencies to act as implicit rmb() invocations.





Why do You Need the rmb()?

Many CPUs have single instructions that cause other CPUs to see preceding stores before subsequent stores, without the reading CPUs needing an explicit rmb() if a data dependency forces the ordering.

However, some CPUs have no such instruction, the Alpha being a case in point. On these CPUs, a wmb() only guarantees that the invalidate requests corresponding to the writes will be emitted in order. The wmb() does not guarantee that the reading CPU will process these invalidates in order.



For example, consider a CPU with a partitioned cache, as shown in the following diagram:

[Diagram: a CPU whose cache is split into two banks; even-numbered cache lines go to cache bank 0, odd-numbered cache lines to cache bank 1.]

Here, even-numbered cache lines are maintained in cache bank 0, and odd-numbered cache lines are maintained in cache bank 1. Suppose that head was maintained in cache bank 0, and that a newly allocated element was maintained in cache bank 1. The insert() code's wmb() would guarantee that the invalidates corresponding to the writes to the next, key, and data fields would appear on the bus before the write to head->next.

But suppose that the reading CPU's cache bank 1 was extremely busy, with lots of pending invalidates and outstanding accesses, and that the reading CPU's cache bank 0 was idle. The invalidation corresponding to head->next could then be processed before that of the three fields. If search() were to be executing just at that time, it would pick up the new value of head->next, but, since the invalidates corresponding to the three fields had not yet been processed, it could pick up the old (garbage!) values of these fields, possibly resulting in an oops or worse.

Placing an rmb() between the access to head->next and the three fields fixes this problem. The rmb() forces all outstanding invalidates to be processed before any subsequent reads are allowed to proceed. Since the invalidates corresponding to the three fields arrived before that of head->next, this guarantees that if the new value of head->next was read, then the new values of the three fields will also be read. No oopses (or worse).
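Restating the race in code form (an illustrative sketch; the bank comments refer to the diagram above):

p = head.next;  /* reader's bank 0 is idle: sees the NEW pointer */
k = p->key;     /* reader's bank 1 is backlogged: its invalidate for
                   this line is still queued, so the OLD (garbage)
                   contents may be returned */

and with the fix:

p = head.next;
rmb();          /* process all outstanding invalidates first */
k = p->key;     /* now guaranteed to be the new value */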



However, all the rmb()s add complexity, are easy to leave out, and hurt performance on all architectures. And if you forget a needed rmb(), you end up with very intermittent and difficult-to-diagnose memory-corruption errors. Just what we don't need in Linux!

So, there is strong motivation for a way of eliminating the need for these rmb()s.

Solutions for Lock-Free Search and Insertion



Search and Insert Using wmbdd()

It would be much nicer (and faster, on many architectures) to have a primitive similar to wmb(), but one that allowed read-side data dependencies to substitute for an explicit rmb().

It is possible to do this (see patch). With such a primitive, the code looks as follows:



void insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    wmbdd();     /* write barrier honored by read-side data dependencies */
    head.next = p;
    spin_unlock(&mutex);
}

struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}





This code is much nicer: no rmb()s are required, searches proceed fully in parallel with no locks or writes, and there is no intermittent data corruption.



Search and Insert Using read_barrier_depends()

Introduce a new primitive read_barrier_depends() that is defined to be an rmb() on Alpha, and a nop on other architectures. This removes the read-side performance problem for non-Alpha architectures, but still leaves the read-side read_barrier_depends(). It is almost possible for the compiler to do a good job of generating these (assuming that a "lockfree" gcc struct-field attribute is created and used), but, unfortunately, the compiler cannot reliably tell when the relevant lock is held. (If the lock is held, the read_barrier_depends() calls should not be generated.)

After discussion on lkml, it was decided that putting in an explicit read_barrier_depends() is the appropriate thing to do in the Linux kernel. Linus also suggested that the barrier names be made more explicit. With such a primitive, the code looks as follows:



void insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    write_barrier();
    head.next = p;
    spin_unlock(&mutex);
}

struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        read_barrier_depends();  /* rmb() on Alpha, nop elsewhere */
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}





A preliminary patch for this is barriers-2.5.7-1.patch. Future releases of this patch can be found along with the RCU package here.





Other Approaches Considered

Just make wmb() work like wmbdd(), so that data dependencies act as implied rmb()s. Although wmbdd()'s semantics are much more intuitive, there are a number of uses of wmb() in Linux that do not require the stronger semantics of wmbdd(), and strengthening the semantics would incur unnecessary overhead on many CPUs, or require many changes to the code, and thus a much larger patch.

Just drop support for Alpha. After all, Compaq seems to be phasing it out, right? There are nonetheless a number of Alphas out there running Linux, and Compaq (or perhaps HP) will be manufacturing new Alphas for quite a few years to come. Microsoft would likely focus quite a bit of negative publicity on Linux's dropping support for anything (never mind that they currently support only two CPU architectures). And the code to make Alpha work correctly is not all that complex, and it does not impact performance on other CPUs.

Besides, we cannot be 100% certain that there won't be some other CPU lacking a synchronous invalidation instruction...