Multithreaded simple data type access and atomic variables

Table of contents

 

Introduction
How atomic variables work
Atomic variables size limitations
Use cases
The real thing…
Time to see some action
Precautions
Conclusion

 

Introduction

In this article I would like to continue the subject I started in my previous two posts (post 1 and post 2). The question I am trying to answer is: what is the most efficient, yet safe, way of accessing simple data type variables from two or more threads? In other words, how do you change a variable from two threads at the same time without corrupting its value?

In my first post I showed how easy it is to turn a variable's value into garbage by modifying it from two or more threads. In my second post I talked about spinlocks, a recent addition to the pthread library. Spinlocks can indeed help to solve the problem. Yet spinlocks are more suitable for protecting small data structures than simple data types such as int and long. Atomic variables, on the other hand, are perfect for the latter task.

The key thing about an atomic variable is that once someone starts reading or writing it, nothing else can interrupt the process and come in the middle. That is, nothing can split the process of accessing an atomic variable in two. This is why they are called atomic.

On the practical side, atomic variables are the best solution for the problem of simultaneous access to a simple variable from two or more threads.

How atomic variables work

This is actually quite simple. The Intel x86 and x86_64 processor architectures (as well as the vast majority of other modern CPU architectures) have instructions that allow one to lock the FSB while doing some memory access. FSB stands for Front Side Bus. This is the bus that the processor uses to communicate with RAM. Locking the FSB prevents any other processor (core), and any process running on that processor, from accessing RAM. (On newer processors the lock is usually confined to the affected cache line rather than the whole bus, but the effect on the variable is the same.) And this is exactly what we need to implement atomic variables.

Atomic variables are widely used in the kernel, but for some reason no one bothered to implement them for user-mode folks. Until gcc 4.1.2.

Atomic variables size limitations

For practical reasons, the gurus at Intel did not implement FSB locking for every possible memory access. For instance, for quite some time Intel processors have allowed memcpy() and memcmp() to be implemented with a single processor instruction. But locking the FSB while copying a large memory buffer can be too expensive.

In practice you can lock the FSB while accessing 1-, 2-, 4- and 8-byte integers. Almost transparently, gcc allows you to do atomic operations on ints, longs and long longs (and their unsigned counterparts).
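As a rough illustration of the size rule, the sketch below applies one of the gcc built-ins introduced later in this article to 1-, 2-, 4- and 8-byte integers. The fixed-width types from <stdint.h>, the counter names and the function sizes_demo() are my own choices for the example and do not appear in the original program.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters, one for each size that can be handled atomically. */
static uint8_t  c8;
static uint16_t c16;
static uint32_t c32;
static uint64_t c64;

/* Atomically adds 1 to every counter and returns the sum of the new values. */
uint64_t sizes_demo(void)
{
	uint64_t sum = 0;

	assert(sizeof(c8) == 1 && sizeof(c64) == 8);

	sum += __sync_add_and_fetch(&c8, 1);   /* 1 byte  */
	sum += __sync_add_and_fetch(&c16, 1);  /* 2 bytes */
	sum += __sync_add_and_fetch(&c32, 1);  /* 4 bytes */
	sum += __sync_add_and_fetch(&c64, 1);  /* 8 bytes */

	return sum;
}
```

On x86_64 all four calls should compile to a single locked instruction each. On 32-bit targets the 8-byte case may need a suitable -march option, so treat that line as an assumption about your platform.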

Use cases

Incrementing a variable and knowing that no one else corrupts its value is nice, but not enough. Consider the following piece of pseudo-code.

decrement_atomic_value();
if (atomic_value() == 0)
    fire_a_gun();

Let us imagine that the value of an atomic variable is 1. What happens if two threads of execution try to execute this piece of pseudo-C simultaneously?

Back to our simulation. It is possible that thread 1 will execute line 1 (the value becomes 0) and stop, while thread 2 will execute line 1 (the value becomes -1) and continue executing line 2. Later thread 1 will wake up and execute line 2.

 

When this happens, neither of the threads will run the fire_a_gun() routine (line 3), because by the time each of them reaches line 2 the value is no longer 0. This is obviously wrong behavior, and if we were protecting this piece of code with a mutex or a spinlock it would not have happened.

In case you're wondering how likely something like this is to happen, rest assured that it is very likely. When I first started working with multithreaded programming I was amazed to find out that, although intuition tells us that the scenario I described is unlikely, it happens overwhelmingly often.

As I mentioned, we could solve this problem by giving up on atomic variables and using a spinlock or a mutex instead. Luckily, we can still use atomic variables. The gcc developers have thought about our needs and this particular problem and offered a solution. Let's see the actual routines that operate on atomic variables.
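Here is a minimal sketch of how the race above can be closed with one of the gcc built-ins described in the next section: the decrement and the read of the new value happen in one atomic step, so exactly one thread sees zero. The names shots_fired, worker() and run_countdown() are made up for this illustration; the starting value of 1 and the two threads mirror the scenario above.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 2            /* two threads, as in the scenario above */

static int atomic_value;         /* the shared counter from the pseudo-code */
static int shots_fired;          /* how many times fire_a_gun() ran */

static void fire_a_gun(void)
{
	__sync_fetch_and_add(&shots_fired, 1);
}

static void *worker(void *arg)
{
	(void)arg;
	/* Decrement and obtain the new value in one indivisible step. */
	if (__sync_sub_and_fetch(&atomic_value, 1) == 0)
		fire_a_gun();
	return NULL;
}

/* Runs the workers and returns how many of them fired the gun. */
int run_countdown(void)
{
	pthread_t thrs[NUM_THREADS];
	int started = 0;
	int i;

	atomic_value = 1;
	shots_fired = 0;

	for (i = 0; i < NUM_THREADS; i++) {
		if (pthread_create(&thrs[i], NULL, worker, NULL) != 0)
			break;
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(thrs[i], NULL);

	assert(started == NUM_THREADS);
	return shots_fired;
}
```

Whichever thread gets there first receives 0 from the built-in and fires; the other receives -1 and does not. Compile with gcc -pthread, as with the larger program later in the article.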

The real thing…

There are several simple functions that do the job. First of all, there are twelve (yes, twelve) functions that do atomic add, subtract, and bitwise or, and, xor and nand. There are two functions for each operation: one that returns the value of the variable before changing it and another that returns the value of the variable after changing it.

Here are the actual functions:

type __sync_fetch_and_add (type *ptr, type value);
type __sync_fetch_and_sub (type *ptr, type value);
type __sync_fetch_and_or (type *ptr, type value);
type __sync_fetch_and_and (type *ptr, type value);
type __sync_fetch_and_xor (type *ptr, type value);
type __sync_fetch_and_nand (type *ptr, type value);

These are the functions that return the value of the variable before changing it. The following functions, on the other hand, return the value of the variable after changing it.

type __sync_add_and_fetch (type *ptr, type value);
type __sync_sub_and_fetch (type *ptr, type value);
type __sync_or_and_fetch (type *ptr, type value);
type __sync_and_and_fetch (type *ptr, type value);
type __sync_xor_and_fetch (type *ptr, type value);
type __sync_nand_and_fetch (type *ptr, type value);

type in each of the expressions can be one of the following:

  • int
  • unsigned int
  • long
  • unsigned long
  • long long
  • unsigned long long

These are so-called built-in functions, meaning that you don't have to include anything to use them.
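To make the difference between the two families concrete, here is a small sketch that calls one function from each on the same variable. The function name fetch_flavours_demo() and the numbers are only for illustration.

```c
#include <assert.h>

/* Hypothetical demo: returns 1 if both flavours behave as described above. */
int fetch_flavours_demo(void)
{
	int v = 10;

	/* fetch_and_add: returns the value BEFORE the addition. */
	int before = __sync_fetch_and_add(&v, 5);  /* returns 10, v is now 15 */

	/* add_and_fetch: returns the value AFTER the addition. */
	int after = __sync_add_and_fetch(&v, 5);   /* v is now 20, returns 20 */

	assert(before == 10);
	assert(after == 20);
	return v == 20;
}
```

Note that no header is needed for the built-ins themselves; <assert.h> is there only for the checks.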

Time to see some action

Back to the example I started in the first post I mentioned earlier.

To remind you, this small program starts several threads, as many as there are CPUs in the computer. Then it binds each thread to one of the CPUs. Finally, each thread runs a loop and increments a global integer 1 million times.

#define _GNU_SOURCE // needed for sched_setaffinity() and the CPU_* macros
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/syscall.h>
#include <errno.h>

#define INC_TO 1000000 // one million...

int global_int = 0;

pid_t gettid( void )
{
	return syscall( __NR_gettid );
}

void *thread_routine( void *arg )
{
	int i;
	int proc_num = (int)(long)arg;
	cpu_set_t set;

	CPU_ZERO( &set );
	CPU_SET( proc_num, &set );

	if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
	{
		perror( "sched_setaffinity" );
		return NULL;
	}

	for (i = 0; i < INC_TO; i++)
	{
//		global_int++;
		__sync_fetch_and_add( &global_int, 1 );
	}

	return NULL;
}

int main()
{
	int procs = 0;
	int i;
	pthread_t *thrs;

	// Getting number of CPUs
	procs = (int)sysconf( _SC_NPROCESSORS_ONLN );
	if (procs < 0)
	{
		perror( "sysconf" );
		return -1;
	}

	thrs = malloc( sizeof( pthread_t ) * procs );
	if (thrs == NULL)
	{
		perror( "malloc" );
		return -1;
	}

	printf( "Starting %d threads...\n", procs );

	for (i = 0; i < procs; i++)
	{
		if (pthread_create( &thrs[i], NULL, thread_routine,
			(void *)(long)i ))
		{
			perror( "pthread_create" );
			procs = i;
			break;
		}
	}

	for (i = 0; i < procs; i++)
		pthread_join( thrs[i], NULL );

	free( thrs );

	printf( "After doing all the math, global_int value is: %d\n",
		global_int );
	printf( "Expected value is: %d\n", INC_TO * procs );

	return 0;
}

To compile, save this snippet to a file and run:

gcc -pthread "file name"

Then run ./a.out to execute the program.

Note lines 36 and 37. Instead of simply incrementing the variable, I use the built-in function __sync_fetch_and_add(). Running this code produces the expected result: the value of global_int is 4,000,000 (the number of CPUs in the machine multiplied by 1 million; in my case this is a 4-core machine). Remember that when I ran this snippet with the plain global_int++ from line 36 instead, the result was 1,908,090 and not the 4,000,000 we'd expect.

Precautions

When using atomic variables, some extra precautions have to be taken. One serious problem with the atomic variable implementation in gcc is that it allows you to do atomic operations on regular variables. That is, there is no clear distinction between atomic variables and regular variables. Nothing prevents you from incrementing the value of an atomic variable with __sync_fetch_and_add(), as I just demonstrated, and later in the code doing the same thing with the regular ++ operator.

Obviously this might be a serious problem. Things tend to be forgotten, and it is only a matter of time until someone on your project, or even you yourself, modifies the value of the variable using regular operators instead of the atomic functions that gcc provides.

To address this problem, I strongly suggest wrapping the atomic functions and variables in either an ADT in C or a class in C++.
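One possible shape for such a wrapper in C is sketched below. All the names (atomic_counter_t and the atomic_counter_*() functions) are hypothetical; the only real API used is the gcc built-ins shown earlier. Because the integer lives inside a struct, writing counter++ on the wrapper does not compile, which is the whole point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ADT: the raw int is reachable only through the functions. */
typedef struct {
	volatile int value;
} atomic_counter_t;

/* Plain store; call it before the counter is shared between threads. */
static inline void atomic_counter_init(atomic_counter_t *c, int v)
{
	assert(c != NULL);
	c->value = v;
}

/* Atomically adds 1 and returns the new value. */
static inline int atomic_counter_inc(atomic_counter_t *c)
{
	return __sync_add_and_fetch(&c->value, 1);
}

/* Atomically subtracts 1 and returns the new value. */
static inline int atomic_counter_dec(atomic_counter_t *c)
{
	return __sync_sub_and_fetch(&c->value, 1);
}

/* Atomic read, implemented as "add zero and return the old value". */
static inline int atomic_counter_get(atomic_counter_t *c)
{
	return __sync_fetch_and_add(&c->value, 0);
}
```

A determined programmer can still reach into c->value directly, so in a real project the struct definition might be hidden behind an opaque pointer; that is a design choice this sketch leaves open.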

Conclusion

This article concludes a series of articles and posts in which we investigate and study the newest techniques in the world of multithreaded programming for Linux. I hope you'll find these posts and this article useful. As usual, in case you have further questions, please don't hesitate to email me at alexander.sandler@gmail.com.

 

When we do multithreaded programming, we run into an issue known as memory order.

For example:

int x(0),y(0);
x=4;
y=3;

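When this actually runs, which of the two assignments executes first and which second? Could there be a moment at which, from the point of view of some CPU, y is greater than x?

The answer is complicated. The purpose of this article is to look at the question from a very practical angle.

First, there are two levels to it. As far as the compiler is concerned, x and y are two unrelated variables, so the compiler is entitled to reorder these two lines whenever it likes.

Second, the CPU is entitled to do the same.

If I insist on a strict order, I have to insert a memory barrier: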

int x(0),y(0);
x=4;
(insert a memory barrier instruction here)
y=3;

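What follows discusses how to write that middle line. Please bear with me, because most people get this wrong.

The gcc manual has a section called "Built-in functions for atomic memory access", which lists this function: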

__sync_synchronize (…)

This builtin issues a full memory barrier.
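Applied to the x/y example above, the middle line could be written as in the following minimal sketch. It is plain C rather than the C++-style initialization used above, and the function name publish() is hypothetical. Whether the built-in actually emits a hardware fence depends on the compiler version, which is exactly what the rest of this article examines.

```c
#include <assert.h>

static int x, y;

/* Hypothetical writer: the store to x is requested to happen before
 * the store to y, as seen by the compiler and by other CPUs. */
void publish(void)
{
	x = 4;
	__sync_synchronize();  /* full memory barrier, per the gcc manual */
	y = 3;
}
```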

Let's write some code to try it:

 
int main(){
	__sync_synchronize();
	return 0;
}

Then compile it with gcc 4.2,

# gcc -S -c test.c

and look at the resulting assembly:

 
main:
	pushq %rbp
	movq %rsp, %rbp
	movl $0, %eax
	leave
	ret

Hm? Nothing at all! Try it yourself if you don't believe me. My build environment is FreeBSD 9.0 release, gcc (GCC) 4.2.1 20070831 patched [FreeBSD]. OK, let me try a newer gcc: gcc46 (FreeBSD Ports Collection) 4.6.3 20120113 (prerelease).

 
main:
	pushq %rbp
	movq %rsp, %rbp
	mfence
	movl $0, %eax
	popq %rbp
	ret

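Look, there is one extra line: mfence. What happened? This was a gcc bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793 . It was found in 2008 and then fixed. The reason it was a bug at all is that the gcc developers were confused about whether mfence is of any use on x86 CPUs, and if so, what for. This leads to a conclusion: avoid gcc's __sync_synchronize() where you can, because your code will be buggy under older versions of gcc. Most people use a gcc older than 4.4; only since CentOS 6 has the default compiler been gcc 4.4.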

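So can mfence actually give us the result we want? Intel's manuals used to be vague about it and never spelled it out.

The latest manual describes mfence as follows:

"Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream"

It also stresses that this instruction affects the data memory subsystem, not the instruction execution stream.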

For a single CPU:

"Reads cannot pass earlier MFENCE instructions"

"Writes cannot pass earlier MFENCE instructions."

"MFENCE instructions cannot pass earlier reads or writes"

And for multiple CPUs:

• Individual processors use the same ordering principles as in a single-processor system.

• Writes by a single processor are observed in the same order by all processors.

• Writes from an individual processor are NOT ordered with respect to the writes from other processors.

• Memory ordering obeys causality (memory ordering respects transitive visibility).

• Any two stores are seen in a consistent order by processors other than those performing the stores

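Put simply, for a single CPU the order of its writes is guaranteed at the hardware level even if you do not use mfence.

Suppose that in C++ you write std::string* str=new std::string();

Then, as far as the CPU is concerned, it will not happen that the pointer str has already been assigned while the object it points to is not yet initialized.

Another interesting point is that gcc has an inline assembly construct for controlling memory ordering. See this piece of documentation: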

Accesses to non-volatile objects are not ordered with respect to volatile accesses. You cannot use a volatile object as a memory barrier to order a sequence of writes to non-volatile memory. For instance:

     int *ptr = something;
     volatile int vobj;
     *ptr = something;
     vobj = 1;

Unless *ptr and vobj can be aliased, it is not guaranteed that the write to *ptr occurs by the time the update of vobj happens. If you need this guarantee, you must use a stronger memory barrier such as:

     int *ptr = something;
     volatile int vobj;
     *ptr = something;
     asm volatile ("" : : : "memory");
     vobj = 1;

In my tests, asm volatile ("" : : : "memory"); does not generate any assembly code. In other words, it is only there for the compiler's benefit.
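A minimal sketch of how such a compiler-only barrier is typically used is shown below. The macro name compiler_barrier() and the payload/ready pair are hypothetical names for the illustration; the asm statement itself is the one quoted from the gcc documentation.

```c
#include <assert.h>

/* Compiler-only fence: emits no instruction, but stops gcc from moving
 * memory accesses across this point. */
#define compiler_barrier() __asm__ __volatile__("" : : : "memory")

static int payload;          /* ordinary, non-volatile data */
static volatile int ready;   /* flag that announces the data */

/* Hypothetical producer: stores the data, then raises the flag. */
void produce(int value)
{
	payload = value;
	compiler_barrier();  /* keep the payload store ahead of the flag store */
	ready = 1;
}
```

This only constrains the compiler. Whether the CPU keeps the two stores in order is a separate question, which on x86 is answered by the store-ordering rules quoted earlier.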

To further support my point, here is some code taken from Intel's Threading Building Blocks library:

#define __TBB_compiler_fence() __asm__ __volatile__("": : :"memory")
#define __TBB_control_consistency_helper() __TBB_compiler_fence()
#define __TBB_acquire_consistency_helper() __TBB_compiler_fence()
#define __TBB_release_consistency_helper() __TBB_compiler_fence()

#ifndef __TBB_full_memory_fence
#define __TBB_full_memory_fence() __asm__ __volatile__("mfence": : :"memory")
#endif

What acts as both a compiler barrier and a hardware memory barrier is __asm__ __volatile__("mfence": : :"memory"). Note the mfence!

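Also, the CAS instructions we use on Intel CPUs all carry the lock prefix, and on x86 a lock-prefixed instruction also acts as a full memory barrier. So when using CAS there is no need to worry about memory order at all.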
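As a small illustration of CAS in practice, the sketch below uses gcc's __sync_val_compare_and_swap() built-in, which on x86 compiles to a lock-prefixed cmpxchg. The helper name atomic_max() and its retry loop are my own example, not something taken from the text above.

```c
#include <assert.h>

/* Hypothetical helper: atomically raises *target to candidate if candidate
 * is larger. Returns the value *target holds once the call is done. */
int atomic_max(int *target, int candidate)
{
	int seen = *target;

	while (seen < candidate) {
		/* Returns what was found in *target; the swap took place
		 * only if that equals `seen`. */
		int prev = __sync_val_compare_and_swap(target, seen, candidate);
		if (prev == seen)
			return candidate;
		seen = prev;  /* another thread changed it; retry with the new value */
	}
	return seen;
}
```

No explicit fence appears anywhere in the loop: the ordering comes from the CAS itself.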

Finally, a recommended paper: "Mathematizing C++ Concurrency", http://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf (the first author is a PhD student at Cambridge).
