x86 CAS Implementation

denglin12315

Last modified: 2022-01-21 16:33:24

Tags: CAS x86

First published: 2022-01-21 15:21:57

Article link: https://blog.csdn.net/denglin12315/article/details/122621081

1. Solving the ABA problem with a double-word CAS

A simpler way to solve the ABA problem, used for example in Michael and Scott's paper "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms", is to replace the pointer with a pair made of a pointer and a counter.

The strategy is simple: each time the pointer is changed, the counter is incremented, so even if the address is the same, the counter value will differ.
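The pointer-plus-counter idea can be sketched without any atomics. The names TaggedPtr, same, updated and aba_detected below are hypothetical and only illustrate the principle; the sketch is single-threaded and shows why comparing the whole pair catches a change that comparing the address alone would miss.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical pair type: a pointer plus a modification counter.
struct TaggedPtr {
  int*        ptr;
  std::size_t count;
};

// Two pairs match only if both the address and the counter match.
inline bool same(TaggedPtr const& a, TaggedPtr const& b) {
  return a.ptr == b.ptr && a.count == b.count;
}

// Every update bumps the counter, even when an old address is reused.
inline TaggedPtr updated(TaggedPtr const& cur, int* new_ptr) {
  TaggedPtr r = { new_ptr, cur.count + 1 };
  return r;
}

// Replays an A -> B -> A sequence and reports whether the pair notices it.
inline bool aba_detected() {
  int a = 0, b = 0;
  TaggedPtr seen = { &a, 0 };            // snapshot taken by a slow thread
  TaggedPtr now  = updated(seen, &b);    // A -> B
  now            = updated(now, &a);     // B -> A again
  assert(now.ptr == seen.ptr);           // the address alone looks unchanged
  return !same(now, seen);               // the counter differs (2 vs 0)
}
```

A CAS that compares the whole pair would therefore fail for the slow thread, which is the desired outcome.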

The only remaining issue is how to perform a double-word CAS.

Atomic CAS requires processor-specific instructions. On x86/x86-64 processors, the CAS instruction is named cmpxchg.

As usual, this instruction comes in several versions depending on the size of the value to swap. For a double-word CAS, you'll use cmpxchg8b with 32-bit pointers and cmpxchg16b with 64-bit pointers.

Note: GCC provides atomic builtins, but they are primarily aimed at integral types of at most 8 bytes, whereas 64-bit pointers need a 16-byte CAS.
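For comparison, here is a minimal sketch of an 8-byte CAS written with a compiler builtin instead of inline assembly. It assumes GCC or Clang, where __atomic_compare_exchange_n is available; the wrapper names cas64_builtin and cas64_builtin_demo are hypothetical.

```cpp
#include <cassert>
#include <stdint.h>

// Assumes GCC or Clang: __atomic_compare_exchange_n is a compiler builtin.
// Returns true when *adr held cmp and has been replaced by nval.
inline bool cas64_builtin(uint64_t* adr, uint64_t nval, uint64_t cmp) {
  return __atomic_compare_exchange_n(adr, &cmp, nval,
                                     false,              // strong variant
                                     __ATOMIC_SEQ_CST,   // order on success
                                     __ATOMIC_SEQ_CST);  // order on failure
}

inline void cas64_builtin_demo() {
  uint64_t v = 1;
  assert(cas64_builtin(&v, 2, 1));    // v was 1, so it becomes 2
  assert(!cas64_builtin(&v, 3, 1));   // v is 2, not 1: nothing is written
  assert(v == 2);
}
```

This covers a 32-bit pointer plus a 32-bit counter; the 16-byte pair is what motivates the hand-written versions that follow.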

2. Using cmpxchg8b for 32-bit pointers

Let's start with the easy case: 32-bit pointers. It is easy because a 64-bit integral type (here an unsigned one, uint64_t) is directly available.

The double-word CAS looks like this:

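#include <stdint.h>

// 32-bit x86 only (compile with -m32).
// cmp is loaded into EDX:EAX and nval, the value to write, into ECX:EBX.
// Returns the previous content of *adr; it equals cmp when the swap happened.
uint64_t cas(uint64_t* adr, uint64_t nval, uint64_t cmp)
{
  uint64_t old = cmp;			//local variable bound to EDX:EAX
  __asm__ __volatile__(			//__volatile__ keeps the compiler from removing or moving the asm
    "lock cmpxchg8b %1\n\t"		//instruction: compares *adr with EDX:EAX
	 //After the first colon: operands written by the asm.
	 //"=" marks an operand that is only written, "+" one that is read and written.
	 //"A" is the EDX:EAX pair: cmp goes in, the previous *adr comes out.
     : "+A" (old)				//output 1, also carries cmp in
      ,"+m" (*adr)				//output 2, the memory (m) operand at adr

	 //After the second colon: operands that are only read; "=" and "+" are not allowed here.
								//inputs: ECX:EBX
     : "c" ((uint32_t)(nval >> 32)), "b" ((uint32_t)(nval & 0xffffffff))

	 //"memory" tells GCC that the asm changes memory: values cached in registers are
	 //written back before the asm and reloaded after it, which prevents compiler reordering.
	 //"cc" tells GCC that the asm changes the flags register.
     : "memory", "cc"			//clobbers
  );
  return old;
}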


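The cmpxchg8b mem64 instruction works as follows:
Compare mem64 with EDX:EAX.
If they are equal, store ECX:EBX into mem64.
If they are not equal, load mem64 into EDX:EAX.

To change 8 bytes of memory in one step we rely on the equal case: put the value to write into ECX:EBX, then execute cmpxchg8b.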
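The three rules above can be written down as a plain C++ function. This is a non-atomic reference model meant for reading only, not a replacement for the instruction; the names cmpxchg8b_model, edx_eax and ecx_ebx are hypothetical, and the return value stands for the zero flag.

```cpp
#include <cassert>
#include <stdint.h>

// Non-atomic reference model of "lock cmpxchg8b mem64".
// edx_eax plays the role of EDX:EAX, ecx_ebx the role of ECX:EBX.
// The return value models the zero flag (ZF).
inline bool cmpxchg8b_model(uint64_t& mem64, uint64_t& edx_eax,
                            uint64_t ecx_ebx) {
  if (mem64 == edx_eax) {
    mem64 = ecx_ebx;      // equal: store the new value, ZF = 1
    return true;
  }
  edx_eax = mem64;        // different: load the current value, ZF = 0
  return false;
}

inline void cmpxchg8b_model_demo() {
  uint64_t mem = 5, expected = 5;
  assert(cmpxchg8b_model(mem, expected, 9));    // 5 == 5: mem becomes 9
  assert(mem == 9 && expected == 5);
  assert(!cmpxchg8b_model(mem, expected, 7));   // 9 != 5: no store
  assert(mem == 9 && expected == 9);            // expected now holds mem
}
```

In both outcomes EDX:EAX ends up holding the value that was in memory before the instruction, which is why a CAS can return it as the old value.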

3. And a 64-bit version?

At first, it seems logical to simply replace cmpxchg8b with cmpxchg16b in the previous code to obtain a double-word CAS for 64-bit pointers.

It is not that simple: there is no standard 128-bit integer type (some compilers provide one), so we have to embed our pair in a struct (the code follows in the template version). Beware that cmpxchg16b requires a memory operand aligned on a 16-byte boundary.
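The alignment requirement can be stated in code. The sketch below assumes a C++11 compiler (for alignas and static_assert); Pair128, aligned16 and pair128_demo are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical pair for 64-bit targets, forced onto a 16-byte boundary.
struct alignas(16) Pair128 {
  void*    ptr;
  uint64_t count;
};

// Checked at compile time: every Pair128 object is 16-byte aligned.
static_assert(alignof(Pair128) == 16, "cmpxchg16b needs 16-byte alignment");

// Run-time check, handy for memory obtained from a custom allocator.
inline bool aligned16(void const* p) {
  return reinterpret_cast<uintptr_t>(p) % 16 == 0;
}

inline void pair128_demo() {
  Pair128 p = { 0, 0 };
  assert(aligned16(&p));   // the compiler honours alignas for locals
}
```

The template version in this article gets the same effect with GCC's __attribute__ (( __aligned__( 16 ) )).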

That's not all. In the previous version, the CAS operation returns the old value, which is then often used to test whether the operation succeeded. But the compiler won't let us compare structures as simply as integers.

Fortunately, cmpxchg16b (like cmpxchg8b and cmpxchg) sets the zero flag (ZF) to indicate whether the operation succeeded. Thus, we only have to use setz to store that flag into a Boolean-like value.

Putting everything together:

we need a struct holding our pair (in fact, we probably have it already)
our compare-and-swap will return a Boolean value
depending on the pointer size, we should use cmpxchg8b or cmpxchg16b

Now, how can we write unified code that depends on the pointer size?

With C++ templates:

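//------------ 32bit wide pointers ----------
#include <stdint.h>   // uint32_t, uint64_t
#include <cstddef>    // size_t, NULL

template<typename T,unsigned N=sizeof (uint32_t)>
struct DPointer {
public:
  union {
    uint64_t ui;
    struct {
      T* ptr;
      size_t count;
    };
  };

  DPointer() : ptr(NULL), count(0) {}
  DPointer(T* p) : ptr(p), count(0) {}
  DPointer(T* p, size_t c) : ptr(p), count(c) {}

  // Atomically replaces *this with nval if *this equals cmp.
  bool cas(DPointer<T,N> const& nval, DPointer<T,N> const & cmp)
  {
    bool result;
    // cmpxchg8b overwrites EDX:EAX on failure, so both registers are
    // read/write operands bound to local copies, not plain inputs.
    T*     cmp_ptr   = cmp.ptr;
    size_t cmp_count = cmp.count;
    __asm__ __volatile__(
        "lock cmpxchg8b %1\n\t"
        "setz %0\n"
        // All four byte-addressable registers are taken, so result
        // may also live in memory ("qm").
        : "=qm" (result)
         ,"+m" (ui)
         ,"+a" (cmp_ptr), "+d" (cmp_count)
        : "b" (nval.ptr), "c" (nval.count)
        : "memory", "cc"
    );
    return result;
  }

  // We need == to work properly
  bool operator==(DPointer<T,N> const&x) const { return x.ui == ui; }

};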


//---------- 64bit wide pointers ----------
template<typename T>
struct DPointer <T,sizeof (uint64_t)> {
public:
  union {
    uint64_t ui[2];
    struct {
      T* ptr;
      size_t count;
    } __attribute__ (( __aligned__( 16 ) ));
  };

  DPointer() : ptr(NULL), count(0) {}
  DPointer(T* p) : ptr(p), count(0) {}
  DPointer(T* p, size_t c) : ptr(p), count(c) {}

  // Atomically replaces *this with nval if *this equals cmp.
  bool cas(DPointer<T,8> const& nval, DPointer<T,8> const& cmp)
  {
    bool result;
    // cmpxchg16b overwrites RDX:RAX on failure, so both registers are
    // read/write operands bound to local copies, not plain inputs.
    T*     cmp_ptr   = cmp.ptr;
    size_t cmp_count = cmp.count;
    __asm__ __volatile__ (
        "lock cmpxchg16b %1\n\t"
        "setz %0\n"
        : "=q" ( result )
         ,"+m" ( ui )
         ,"+a" ( cmp_ptr ), "+d" ( cmp_count )
        : "b" ( nval.ptr ), "c" ( nval.count )
        : "memory", "cc"
    );
    return result;
  }

  // We need == to work properly
  bool operator==(DPointer<T,8> const&x) const
  {
    return x.ptr == ptr && x.count == count;
  }
};

The first trick is to use an anonymous union (and an anonymous struct) to access the pointer and the counter directly, while also exposing the value as a whole to the assembly code. We probably could have done without it, but the code is simpler to read this way.

As you can see, the template has an integer parameter (otherwise unused) and is specialized on it (for N=8). To use our pointer, all you have to do is instantiate the template with N=sizeof (void*).
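The selection mechanism can be seen in isolation with a stripped-down stand-in. Width, cas_bytes and width_matches are hypothetical names; the struct only records which variant the compiler picked and contains no CAS code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for DPointer: it only records the chosen variant.
template<typename T, unsigned N = 4>
struct Width {                            // primary template: 32-bit pointers
  static const unsigned cas_bytes = 8;    // would use cmpxchg8b
};

template<typename T>
struct Width<T, 8> {                      // specialization: 64-bit pointers
  static const unsigned cas_bytes = 16;   // would use cmpxchg16b
};

// The CAS always has to cover two pointer-sized words.
inline bool width_matches() {
  return Width<int, sizeof(void*)>::cas_bytes == 2 * sizeof(void*);
}

inline void width_demo() {
  assert((Width<int, 4>::cas_bytes == 8));
  assert((Width<int, 8>::cas_bytes == 16));
  assert(width_matches());
}
```

Because sizeof (void*) is a compile-time constant, the choice between the two variants costs nothing at run time.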

4. An Example

Here is a quick-and-dirty implementation of the non-blocking concurrent queue described in the paper by Michael and Scott.

template<typename T>
class Queue
{
public:
  struct Node;
  typedef DPointer<Node,sizeof (size_t)> Pointer;

  struct Node
  {
    T           value;
    Pointer     next;

    Node() : next(NULL) {}
    Node(T x, Node* nxt) : value(x), next(nxt) {}
  };

  Pointer       Head, Tail;

  Queue() {
    Node       *node = new Node();
    Head.ptr = Tail.ptr = node;
  }

  void push(T x);
  bool take(T& pvalue);
};

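template<typename T>
void Queue<T>::push(T x) {
  Node         *node = new Node(x,NULL);
  Pointer       tail, next;
  do {
    tail = Tail;                        // snapshot Tail and its successor
    next = tail.ptr->next;
    if (tail == Tail) {                 // is the snapshot still consistent?
      if (next.ptr == NULL) {           // Tail really is the last node
        // Link the new node after it; leave the loop on success.
        if (tail.ptr->next.cas(Pointer(node,next.count+1),next))
          break;
      } else {
        // Tail is lagging behind: help move it forward, then retry.
        Tail.cas(Pointer(next.ptr,tail.count+1), tail);
      }
    }
  } while (true);
  // Swing Tail to the new node; if this fails, another thread already did it.
  Tail.cas(Pointer(node,tail.count+1), tail);
}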

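template<typename T>
bool Queue<T>::take(T& pvalue) {
  Pointer       head, tail, next;
  do {
    head = Head;
    tail = Tail;
    next = head.ptr->next;
    if (head == Head) {                 // is the snapshot still consistent?
      if (head.ptr == tail.ptr) {
        if (next.ptr == NULL) return false;   // only the dummy node: empty
        // Tail is lagging behind: help move it forward, then retry.
        Tail.cas(Pointer(next.ptr,tail.count+1), tail);
      } else {
        // Read the value before the CAS; afterwards another thread
        // may already have released the node.
        pvalue = next.ptr->value;
        if (Head.cas(Pointer(next.ptr,head.count+1), head))
          break;
      }
    }
  } while (true);
  // Caution: another thread may still read head.ptr from its own snapshot,
  // so deleting here is only safe if that memory stays readable.
  delete(head.ptr);
  return true;
}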
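The dummy-node layout used by this queue is easier to follow without the CAS retry loops. The following is a single-threaded model of the same structure, not a concurrent queue; SeqQueue and seq_queue_demo are hypothetical names, and each step is annotated with the CAS it corresponds to in the code above.

```cpp
#include <cassert>
#include <cstddef>

// Single-threaded model: head always points at a dummy node, and the
// first real value lives in head->next.
template<typename T>
class SeqQueue {
  struct Node {
    T     value;
    Node* next;
    Node() : value(), next(NULL) {}
    Node(T x) : value(x), next(NULL) {}
  };
  Node *head, *tail;
  SeqQueue(SeqQueue const&);              // copying is not supported
  SeqQueue& operator=(SeqQueue const&);
public:
  SeqQueue() { head = tail = new Node(); }
  ~SeqQueue() { T tmp; while (take(tmp)) {} delete head; }

  void push(T x) {
    Node* node = new Node(x);
    tail->next = node;    // done by the CAS on tail.ptr->next in push
    tail = node;          // done by the final Tail.cas in push
  }

  bool take(T& pvalue) {
    Node* next = head->next;
    if (next == NULL) return false;   // only the dummy is left: empty
    pvalue = next->value;
    delete head;                      // the old dummy is released
    head = next;                      // done by Head.cas in take
    return true;
  }
};

inline void seq_queue_demo() {
  SeqQueue<int> q;
  int v = 0;
  assert(!q.take(v));                 // a fresh queue is empty
  q.push(1);
  q.push(2);
  assert(q.take(v) && v == 1);        // values come out in FIFO order
  assert(q.take(v) && v == 2);
  assert(!q.take(v));
}
```

The node that held the value just taken becomes the new dummy, which is why take reads next.ptr->value rather than head.ptr->value.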

5. Going further

There are some possible enhancements to our pointer template:

Our assembly code doesn't support -fPIC: in 32-bit position-independent code, ebx is reserved by convention (it holds the GOT address), so its value has to be saved before the inline asm block uses it and restored afterwards.
Not all operations are atomic (copying a DPointer, for example, is a plain non-atomic read). For a more complete implementation, we should override some operators.
