Memory Barriers and JVM Concurrency



Memory barriers, or fences, are a set of processor instructions used to apply ordering limitations on memory operations. This article explains the impact memory barriers have on the determinism of multi-threaded programs. We'll look at how memory barriers relate to JVM concurrency constructs such as volatile, synchronized and atomic conditionals. It is assumed the reader has a solid understanding of these concepts and the Java memory model. This is not an article about mutual exclusion, parallelism or atomicity per se. Memory barriers are used to achieve an equally important element of concurrent programming called visibility.

Thanks to Brian Goetz and Eric Yew for reviewing this article. I'd also like to thank Christian Thalinger for access to SPARC hardware.

Why Are Memory Barriers Important?

A trip to main memory costs hundreds of clock cycles on commodity hardware. Processors use caching to decrease the costs of memory latency by orders of magnitude. These caches re-order pending memory operations for the sake of performance. In other words, the reads and writes of a program are not necessarily performed in the order in which they are given to the processor. When data is immutable and/or confined to the scope of one thread these optimizations are harmless. Combining these optimizations with symmetric multi-processing and shared mutable state on the other hand can be a nightmare. A program can behave non-deterministically when memory operations on shared mutable state are re-ordered. It is possible for a thread to write values that become visible to another thread in ways that are inconsistent with the order in which they were written. A properly placed memory barrier prevents this problem by forcing the processor to serialize pending memory operations.

Memory Barriers As Protocols

Memory barriers are not directly exposed by the JVM; instead they are inserted into the instruction sequence by the JVM in order to uphold the semantics of language level concurrency primitives. We'll look at the source code and assembly instructions of some simple Java programs to see how. Let's begin a crash course in memory barriers with Dekker's algorithm. This algorithm uses three volatile variables to coordinate access to a shared resource between two threads.

Try not to focus on the finer details of this algorithm. Which parts are relevant? Each thread attempts to enter the critical section on the first line of code by signaling intent to do so. If a thread observes a conflict on line three (both threads have signaled intent) the conflict is resolved by turn taking. Only one thread can access the critical section at a given point in time.

      // code run by first thread        // code run by second thread

 1    intentFirst = true;                intentSecond = true;
 2
 3    while (intentSecond)               while (intentFirst)        // volatile read
 4      if (turn != 0) {                   if (turn != 1) {         // volatile read
 5        intentFirst = false;               intentSecond = false;
 6        while (turn != 0) {}               while (turn != 1) {}
 7        intentFirst = true;                intentSecond = true;
 8      }                                  }
 9
10    criticalSection();                 criticalSection();
11
12    turn = 1;                          turn = 0;                  // volatile write
13    intentFirst = false;               intentSecond = false;      // volatile write

Without memory barriers, hardware optimizations can break this code even if the compiler emits all memory operations in the order they appear in from the programmer's point of view. Consider the two consecutive volatile read operations on lines three and four. Each thread checks to see if the other has signaled an intent to enter the critical section and then checks to see whose turn it is. Consider the two consecutive volatile write operations on lines 12 and 13. Each thread gives the other its "turn" and then withdraws its intent to enter the critical section. A reading thread should never expect to observe the other thread's write to the turn variable after the other thread's withdrawal of intent. This would be a disaster. But without the volatile modifier on these variables this indeed can happen! For example, without the volatile modifier the second thread could observe the first thread's write to intentFirst (last line) before the first thread's write to turn (second to last line). The keyword volatile prevents this problem because it establishes a happens-before relationship between the write to the turn variable and the write to the intentFirst variable. The compiler cannot re-order these write operations and if necessary it must forbid the processor from doing so with a memory barrier. A peek under the hood shows how.
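For readers who want to run the protocol rather than just read it, here is a minimal sketch of the two-thread listing as a Java class. The names DekkerDemo, firstThreadStep, secondThreadStep, run and shared are illustrative and do not come from the article; the critical section is assumed to be a simple increment of a plain (non-volatile) field, so only the three volatile variables stand between the two threads and a lost update.

```java
class DekkerDemo {

    // The three volatile variables used by the algorithm.
    static volatile boolean intentFirst = false;
    static volatile boolean intentSecond = false;
    static volatile int turn = 0;

    // Deliberately not volatile: it is protected only by the protocol.
    static int shared = 0;

    static void firstThreadStep() {
        intentFirst = true;                       // signal intent
        while (intentSecond) {                    // volatile read
            if (turn != 0) {                      // volatile read
                intentFirst = false;              // back off
                while (turn != 0) { /* spin until it is our turn */ }
                intentFirst = true;
            }
        }
        shared++;                                 // critical section
        turn = 1;                                 // volatile write
        intentFirst = false;                      // volatile write
    }

    static void secondThreadStep() {
        intentSecond = true;
        while (intentFirst) {
            if (turn != 1) {
                intentSecond = false;
                while (turn != 1) { /* spin until it is our turn */ }
                intentSecond = true;
            }
        }
        shared++;                                 // critical section
        turn = 0;
        intentSecond = false;
    }

    // Hypothetical driver: each thread enters the critical section
    // `rounds` times; the final value of `shared` is returned.
    static int run(int rounds) {
        shared = 0;
        Thread first = new Thread(() -> {
            for (int i = 0; i < rounds; i++) firstThreadStep();
        });
        Thread second = new Thread(() -> {
            for (int i = 0; i < rounds; i++) secondThreadStep();
        });
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(run(50_000));
    }
}
```

With the volatile modifiers in place, run(n) should return 2 * n, because every increment happens inside the critical section. Removing the modifiers makes the outcome undefined under the Java memory model: the program may lose updates or spin forever.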

The PrintAssembly HotSpot option is a diagnostic flag for the JVM that allows us to capture the generated assembly instructions of the JIT compiler. This requires the latest OpenJDK release or a new version of HotSpot, update 14 or above. A disassembler plugin is also required. The Kenai project has plugin binaries for Solaris, Linux and BSD. The hsdis plugin is an alternative that can be built from source for Windows.

The first of the two consecutive read operations on line three is captured in the assembly instructions below. This stream was captured on multi-processing Itanium 2 hardware running JDK 1.6 with update 17. All of the instruction streams in this article are sequenced by line number on the left hand side. Relevant read operations, write operations and memory barrier instructions are in bold. The reader is advised to avoid getting caught up in the semantics of each and every instruction.

1  0x2000000001de819c:      adds r37=597,r36;;  ;...84112554
2  0x2000000001de81a0:      ld1.acq r38=[r37];;  ;...0b30014a a010
3  0x2000000001de81a6:      nop.m 0x0     ;...00000002 00c0
4  0x2000000001de81ac:      sxt1 r38=r38;;  ;...00513004
5  0x2000000001de81b0:      cmp4.eq p0,p6=0,r38  ;...1100004c 8639
6  0x2000000001de81b6:      nop.i 0x0     ;...00000002 0003
7  0x2000000001de81bc:      br.cond.dpnt.many 0x2000000001de8220;;

This short stream of instructions tells a long story. The first volatile read is on line two. The Java memory model guarantees the JVM will deliver this read to the processor before the second read, in "program order" - but this alone would not be enough because the processor is still free to perform these operations out of order. To uphold the consistency guarantees of the Java memory model the JVM annotates the first read operation with a variant of ld.acq, or "load acquire". By using ld.acq the compiler ensures the read operation on line two will complete before the subsequent read operation. Problem solved.

Notice this affects reads, not writes. A memory barrier that enforces ordering limitations on reads or writes is said to be unidirectional. A memory barrier that enforces ordering limitations on reads and writes is said to be bidirectional, or, a full fence. Using ld.acq is an example of a unidirectional memory barrier.

Consistency is a two way street. How useful is it for a reading thread to insert a memory barrier between both reads if the other thread does not separate both writes with one as well? In order for threads to communicate they must all obey the protocol; just like nodes on a network, or people on a team. If one thread breaks formation then the efforts of all other threads are rendered useless. We should expect to see a memory barrier in the assembly instructions for the last two lines of Dekker's algorithm, a volatile write followed by a volatile write.

$ java -XX:+UnlockDiagnosticVMOptions -XX:PrintAssemblyOptions=hsdis-print-bytes -XX:CompileCommand=print,WriterReader.write WriterReader
 1  0x2000000001de81c0:      adds r37=592,r36;;  ;...0b284149 0421
 2  0x2000000001de81c6:      st4.rel [r37]=r39  ;...00389560 2380
 3  0x2000000001de81cc:      adds r36=596,r36;;  ;...84112544
 4  0x2000000001de81d0:      st1.rel [r36]=r0  ;...09000048 a011
 5  0x2000000001de81d6:      mf            ;...00000044 0000
 6  0x2000000001de81dc:      nop.i 0x0;;   ;...00040000
 7  0x2000000001de81e0:      mov r12=r33   ;...00600042 0021
 8  0x2000000001de81e6:      mov.ret b0=r35,0x2000000001de81e0
 9  0x2000000001de81ec:      mov.i ar.pfs=r34  ;...00aa0220
10  0x2000000001de81f0:      mov r6=r32    ;...09300040 0021

Here we can see the second write operation annotated with an explicit memory barrier on line four. By using a variant of st.rel, or "store release", the compiler ensures the first write operation will be visible before the second write operation. This completes both sides of the protocol because the first write operation happens before the second write operation.

The st.rel barrier is unidirectional - just like ld.acq. On line five however the compiler emits a bidirectional memory barrier. The mf instruction, or "memory fence", is a full fence for the Itanium 2 instruction set. This seems redundant to the author.
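The acquire/release pairing can also be sketched at the Java level. The example below is not from the article and assumes Java 9 or later, where java.lang.invoke.VarHandle offers getAcquire and setRelease access modes whose ordering guarantees loosely resemble ld.acq and st.rel. The class name AcquireReleaseDemo, its fields and the publishAndRead method are hypothetical, and nothing here claims which instructions a given JVM will actually emit.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class AcquireReleaseDemo {

    static int payload;      // plain field, written before the flag is released
    static boolean ready;    // shared flag, accessed through READY by both threads

    static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findStaticVarHandle(AcquireReleaseDemo.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Hypothetical helper: the calling thread publishes `value`,
    // a reader thread picks it up, and the value it saw is returned.
    static int publishAndRead(int value) {
        payload = 0;
        ready = false;       // plain writes are fine here: Thread.start() orders them
        int[] seen = new int[1];

        Thread reader = new Thread(() -> {
            // Acquire side: accesses after this read cannot move before it.
            while (!(boolean) READY.getAcquire()) { /* spin */ }
            seen[0] = payload;
        });
        reader.start();

        payload = value;
        // Release side: accesses before this write cannot move after it.
        READY.setRelease(true);

        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead(42));
    }
}
```

Both sides have to take part: the release write orders the payload store before the flag, and the acquire read orders the payload load after it. VarHandle.fullFence() is the bidirectional counterpart in the same API, which is stronger than this handoff needs.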

Memory Barriers Are Hardware Specific

This article does not aim to be a comprehensive overview of all memory barriers. This would be a monumental task. It is important though to appreciate the fact that these instructions vary considerably across different hardware architectures. Below is what the consecutive volatile writes translate to on multi-processing Intel Xeon hardware. All remaining assembly instruction sequences in this article were captured on an Intel Xeon unless specified otherwise.

 1  0x03f8340c: push   %ebp               ;...55
 2  0x03f8340d: sub    $0x8,%esp          ;...81ec0800 0000
 3  0x03f83413: mov    $0x14c,%edi        ;...bf4c0100 00
 4  0x03f83418: movb   $0x1,-0x505a72f0(%edi)  ;...c687108d a5af01
 5  0x03f8341f: mfence                    ;...0faef0
 6  0x03f83422: mov    $0x148,%ebp        ;...bd480100 00
 7  0x03f83427: mov    $0x14d,%edx        ;...ba4d0100 00
 8  0x03f8342c: movsbl -0x505a72f0(%edx),%ebx  ;...0fbe9a10 8da5af
 9  0x03f83433: test   %ebx,%ebx          ;...85db
10  0x03f83435: jne    0x03f83460         ;...7529
11  0x03f83437: movl   $0x1,-0x505a72f0(%ebp)  ;...c785108d a5af01
12  0x03f83441: movb   $0x0,-0x505a72f0(%edi)  ;...c687108d a5af00
13  0x03f83448: mfence                    ;...0faef0
14  0x03f8344b: add    $0x8,%esp          ;...83c408
15  0x03f8344e: pop    %ebp               ;...5d

Here we see both volatile writes on lines 11 and 12 on the x86 Xeon. The second write is chased with an mfence instruction, an explicit bidirectional memory barrier.

And now the consecutive volatile writes on SPARC.

 1 0xfb8ecc84: ldub  [ %l1 + 0x155 ], %l3  ;...e60c6155
 2 0xfb8ecc88: cmp  %l3, 0               ;...80a4e000
 3 0xfb8ecc8c: bne,pn   %icc, 0xfb8eccb0  ;...12400009
 4 0xfb8ecc90: nop                       ;...01000000
 5 0xfb8ecc94: st  %l0, [ %l1 + 0x150 ]  ;...e0246150
 6 0xfb8ecc98: clrb  [ %l1 + 0x154 ]     ;...c02c6154
 7 0xfb8ecc9c: membar  #StoreLoad        ;...8143e002
 8 0xfb8ecca0: sethi  %hi(0xff3fc000), %l0  ;...213fcff0
 9 0xfb8ecca4: ld  [ %l0 ], %g0          ;...c0042000
10 0xfb8ecca8: ret                       ;...81c7e008
11 0xfb8eccac: restore                   ;...81e80000

Here we see both volatile writes on lines five and six. The second write is chased with a membar instruction, an explicit bidirectional memory barrier.

There is one important difference between the instruction streams for x86 and SPARC and the instruction stream for Itanium. The JVM chased the consecutive write operations with a memory barrier on x86 and SPARC, but it did not place a memory barrier between the two write operations. On the other hand the instruction stream for Itanium has a memory barrier between both writes. Why does the JVM behave differently across hardware architectures? Because a hardware architecture has a memory model and each memory model has a set of consistency guarantees. Some memory models, like that of x86 or SPARC, have a very strong set of consistency guarantees. Other memory models, like that of Itanium, PowerPC or Alpha, have a much more relaxed set of guarantees. For example x86 and SPARC do not re-order consecutive write operations - so no memory barrier is needed. Itanium, PowerPC and Alpha will re-order consecutive write operations - so the JVM has to place a memory barrier between them. The JVM uses memory barriers to bridge the gaps between the Java memory model and the memory model of the hardware it runs on.

Implicit Memory Barriers

Explicit fence instructions are not the only way to serialize memory operations. Let's switch gears to the Counter class to see an example.

    class Counter{

        static int counter = 0;

        public static void main(String[] args){
            for(int i = 0; i < 100000; i++)
                inc();
        }

        static synchronized void inc(){ counter += 1; }

    }
		

The Counter class performs a classic read-modify-write operation. The static counter field is not volatile because all three operations must be observed atomically. For this reason the inc method of the Counter class is synchronized. We can compile the Counter class and observe the generated assembly instructions for the synchronized inc method with the following command. The Java memory model guarantees the same visibility semantics for exiting of synchronized regions as it does for volatile memory operations, so we should expect to see another memory barrier.

$ java -XX:+UnlockDiagnosticVMOptions -XX:PrintAssemblyOptions=hsdis-print-bytes -XX:-UseBiasedLocking -XX:CompileCommand=print,Counter.inc Counter
 1  0x04d5eda7: push   %ebp               ;...55
 2  0x04d5eda8: mov    %esp,%ebp          ;...8bec
 3  0x04d5edaa: sub    $0x28,%esp         ;...83ec28
 4  0x04d5edad: mov    $0x95ba5408,%esi   ;...be0854ba 95
 5  0x04d5edb2: lea    0x10(%esp),%edi    ;...8d7c2410
 6  0x04d5edb6: mov    %esi,0x4(%edi)     ;...897704
 7  0x04d5edb9: mov    (%esi),%eax        ;...8b06
 8  0x04d5edbb: or     $0x1,%eax          ;...83c801
 9  0x04d5edbe: mov    %eax,(%edi)        ;...8907
10  0x04d5edc0: lock cmpxchg %edi,(%esi)  ;...f00fb13e
11  0x04d5edc4: je     0x04d5edda         ;...0f841000 0000
12  0x04d5edca: sub    %esp,%eax          ;...2bc4
13  0x04d5edcc: and    $0xfffff003,%eax   ;...81e003f0 ffff
14  0x04d5edd2: mov    %eax,(%edi)        ;...8907
15  0x04d5edd4: jne    0x04d5ee11         ;...0f853700 0000
16  0x04d5edda: mov    $0x95ba52b8,%eax   ;...b8b852ba 95
17  0x04d5eddf: mov    0x148(%eax),%esi   ;...8bb04801 0000
18  0x04d5ede5: inc    %esi               ;...46
19  0x04d5ede6: mov    %esi,0x148(%eax)   ;...89b04801 0000
20  0x04d5edec: lea    0x10(%esp),%eax    ;...8d442410
21  0x04d5edf0: mov    (%eax),%esi        ;...8b30
22  0x04d5edf2: test   %esi,%esi          ;...85f6
23  0x04d5edf4: je     0x04d5ee07         ;...0f840d00 0000
24  0x04d5edfa: mov    0x4(%eax),%edi     ;...8b7804
25  0x04d5edfd: lock cmpxchg %esi,(%edi)  ;...f00fb137
26  0x04d5ee01: jne    0x04d5ee1f         ;...0f851800 0000
27  0x04d5ee07: mov    %ebp,%esp          ;...8be5
28  0x04d5ee09: pop    %ebp               ;...5d
		

Unsurprisingly, synchronized generates more instructions than volatile. The increment is found on line 18, but at no point does the JVM insert an explicit memory barrier. Instead, the JVM has killed two birds with one stone by using a lock-prefixed cmpxchg instruction on lines 10 and 25. The semantics of cmpxchg are beyond the scope of this article. What's relevant is that 'lock cmpxchg' not only performs the write operation atomically - it also flushes pending read and write operations. The write operation will now become visible before all subsequent memory operations. If we refactor the Counter class to use java.util.concurrent.atomic.AtomicInteger and run it, we can observe this same trick.

    import java.util.concurrent.atomic.AtomicInteger;

    class Counter{

        static AtomicInteger counter = new AtomicInteger(0);

        public static void main(String[] args){
            for(int i = 0; i < 1000000; i++)
                counter.incrementAndGet();
        }

    }

$ java -XX:+UnlockDiagnosticVMOptions -XX:PrintAssemblyOptions=hsdis-print-bytes -XX:CompileCommand=print,*AtomicInteger.incrementAndGet Counter
 1  0x024451f7: push   %ebp               ;...55
 2  0x024451f8: mov    %esp,%ebp          ;...8bec
 3  0x024451fa: sub    $0x38,%esp         ;...83ec38
 4  0x024451fd: jmp    0x0244520a         ;...e9080000 00
 5  0x02445202: xchg   %ax,%ax            ;...6690
 6  0x02445204: test   %eax,0xb771e100    ;...850500e1 71b7
 7  0x0244520a: mov    0x8(%ecx),%eax     ;...8b4108
 8  0x0244520d: mov    %eax,%esi          ;...8bf0
 9  0x0244520f: inc    %esi               ;...46
10  0x02445210: mov    $0x9a3f03d0,%edi   ;...bfd0033f 9a
11  0x02445215: mov    0x160(%edi),%edi   ;...8bbf6001 0000
12  0x0244521b: mov    %ecx,%edi          ;...8bf9
13  0x0244521d: add    $0x8,%edi          ;...83c708
14  0x02445220: lock cmpxchg %esi,(%edi)  ;...f00fb137
15  0x02445224: mov    $0x1,%eax          ;...b8010000 00
16  0x02445229: je     0x02445234         ;...0f840500 0000
17  0x0244522f: mov    $0x0,%eax          ;...b8000000 00
18  0x02445234: cmp    $0x0,%eax          ;...83f800
19  0x02445237: je     0x02445204         ;...74cb
20  0x02445239: mov    %esi,%eax          ;...8bc6
21  0x0244523b: mov    %ebp,%esp          ;...8be5
22  0x0244523d: pop    %ebp               ;...5d

Again we see the write operation being combined with a lock prefix on line 14. This ensures the new value of the variable will become visible to other threads before all subsequent memory operations.
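The backward jump on line 19 of this stream is consistent with a retry loop wrapped around the compare-and-exchange. As a minimal sketch of that pattern, the class below rebuilds an increment from the public AtomicInteger.compareAndSet method. CasLoopCounter, increment and run are illustrative names, and the code is meant to show the shape of the loop, not to reproduce the library source.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasLoopCounter {

    static final AtomicInteger counter = new AtomicInteger(0);

    // Read the current value, compute the next one, and try to swap it in.
    // If another thread got there first, compareAndSet fails and we retry.
    static int increment() {
        for (;;) {
            int current = counter.get();
            int next = current + 1;
            if (counter.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    // Hypothetical driver: `threads` workers each call increment()
    // `perThread` times; the final counter value is returned.
    static int run(int threads, int perThread) {
        counter.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) increment();
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100_000));
    }
}
```

No increment is lost under contention, since a failed compareAndSet simply forces another pass through the loop with a fresh read.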

Memory Barriers Can Be Avoided

The JVM is very good at eliminating unnecessary memory barriers. Often it gets lucky and the consistency guarantees of the hardware memory model are greater than or equal to those of the Java memory model. When this happens the JVM simply inserts a no op instead of an actual memory barrier. For example, the consistency guarantees of the x86 and SPARC memory models are strong enough to eliminate the need for a memory barrier when reading a volatile variable. Remember the explicit unidirectional memory barrier used to separate both read operations on Itanium? Well, the generated assembly instructions for the consecutive volatile reads in Dekker's algorithm on an x86 have no memory barrier.

A read followed by a read of shared memory on x86

 1  0x03f83422: mov    $0x148,%ebp        ;...bd480100 00
 2  0x03f83427: mov    $0x14d,%edx        ;...ba4d0100 00
 3  0x03f8342c: movsbl -0x505a72f0(%edx),%ebx  ;...0fbe9a10 8da5af
 4  0x03f83433: test   %ebx,%ebx          ;...85db
 5  0x03f83435: jne    0x03f83460         ;...7529
 6  0x03f83437: movl   $0x1,-0x505a72f0(%ebp)  ;...c785108d a5af01
 7  0x03f83441: movb   $0x0,-0x505a72f0(%edi)  ;...c687108d a5af00
 8  0x03f83448: mfence                    ;...0faef0
 9  0x03f8344b: add    $0x8,%esp          ;...83c408
10  0x03f8344e: pop    %ebp               ;...5d
11  0x03f8344f: test   %eax,0xb78ec000    ;...850500c0 8eb7
12  0x03f83455: ret                       ;...c3
13  0x03f83456: nopw   0x0(%eax,%eax,1)   ;...66660f1f 840000
14  0x03f83460: mov    -0x505a72f0(%ebp),%ebx  ;...8b9d108d a5af
15  0x03f83466: test   %edi,0xb78ec000    ;...853d00c0 8eb7

The volatile read operations are found on lines three and fourteen. Neither is paired with a memory barrier. In other words the only performance penalty for a volatile read on an x86 (or on SPARC for that matter) is a minor loss of code motion optimization opportunities - the instruction itself is no different than an ordinary read.

Unidirectional memory barriers are naturally less expensive than bidirectional ones. The JVM will avoid a bidirectional memory barrier when it knows a unidirectional one is sufficient. The first example in this article demonstrated this. We saw that the first of two consecutive volatile read operations on Itanium was annotated with a unidirectional memory barrier. If the read operations had been annotated with an explicit bidirectional memory barrier the program would still be correct, but at a greater latency cost.

Dynamic Compilation

Everything a static compiler knows at build time is known by a dynamic compiler at runtime, and more. More information means more opportunities to optimize. For example let's look at how the JVM treats memory barriers when running on a single processor. The following instruction stream was captured from a runtime compilation of two consecutive volatile writes in Dekker's algorithm. The program was running in a VMWare workstation image in uni-processor mode on x86 hardware.

 1  0x017b474c: push   %ebp               ;...55
 2  0x017b474d: sub    $0x8,%esp          ;...81ec0800 0000
 3  0x017b4753: mov    $0x14c,%edi        ;...bf4c0100 00
 4  0x017b4758: movb   $0x1,-0x507572f0(%edi)  ;...c687108d 8aaf01
 5  0x017b475f: mov    $0x148,%ebp        ;...bd480100 00
 6  0x017b4764: mov    $0x14d,%edx        ;...ba4d0100 00
 7  0x017b4769: movsbl -0x507572f0(%edx),%ebx  ;...0fbe9a10 8d8aaf
 8  0x017b4770: test   %ebx,%ebx          ;...85db
 9  0x017b4772: jne    0x017b4790         ;...751c
10  0x017b4774: movl   $0x1,-0x507572f0(%ebp)  ;...c785108d 8aaf01
11  0x017b477e: movb   $0x0,-0x507572f0(%edi)  ;...c687108d 8aaf00
12  0x017b4785: add    $0x8,%esp          ;...83c408
13  0x017b4788: pop    %ebp               ;...5d

On a uni-processor system the JVM inserts a no op for all memory barriers because memory operations are already serialized. Neither write operation (lines 10 and 11) is chased with a barrier. The JVM makes similar optimizations for atomic conditionals. Here is an instruction stream captured from the runtime compilation of AtomicInteger.incrementAndGet on the same VMWare image.

 1  0x036880f7: push   %ebp               ;...55
 2  0x036880f8: mov    %esp,%ebp          ;...8bec
 3  0x036880fa: sub    $0x38,%esp         ;...83ec38
 4  0x036880fd: jmp    0x0368810a         ;...e9080000 00
 5  0x03688102: xchg   %ax,%ax            ;...6690
 6  0x03688104: test   %eax,0xb78b8100    ;...85050081 8bb7
 7  0x0368810a: mov    0x8(%ecx),%eax     ;...8b4108
 8  0x0368810d: mov    %eax,%esi          ;...8bf0
 9  0x0368810f: inc    %esi               ;...46
10  0x03688110: mov    $0x9a3f03d0,%edi   ;...bfd0033f 9a
11  0x03688115: mov    0x160(%edi),%edi   ;...8bbf6001 0000
12  0x0368811b: mov    %ecx,%edi          ;...8bf9
13  0x0368811d: add    $0x8,%edi          ;...83c708
14  0x03688120: cmpxchg %esi,(%edi)       ;...0fb137
15  0x03688123: mov    $0x1,%eax          ;...b8010000 00
16  0x03688128: je     0x03688133         ;...0f840500 0000
17  0x0368812e: mov    $0x0,%eax          ;...b8000000 00
18  0x03688133: cmp    $0x0,%eax          ;...83f800
19  0x03688136: je     0x03688104         ;...74cc
20  0x03688138: mov    %esi,%eax          ;...8bc6
21  0x0368813a: mov    %ebp,%esp          ;...8be5
22  0x0368813c: pop    %ebp               ;...5d

Notice the cmpxchg instruction on line 14. Previously we saw the compiler give this instruction to the processor with a lock prefix. In the absence of SMP the JVM has chosen to avoid this cost - something it could not have done with static compilation.

Closing

Memory barriers are a necessity for multi-threaded programming. They come in many flavors. Some are explicit, others are implicit. Some are bidirectional, others are unidirectional. The JVM uses this array of choices to efficiently honor the Java memory model across all platforms. I hope this article helps experienced JVM developers become a little more knowledgeable about how their code behaves under the hood.

About the Author

Dennis Byrne is a senior software engineer for DRW Trading, a proprietary trading firm and liquidity provider. He is a writer, presenter and active member of the open source community.

