Inter Thread Latency

Original article: http://mechanical-sympathy.blogspot.com/2011/08/inter-thread-latency.html (mirrored here, as the original is blocked in mainland China)

Overview:

The rate at which messages can pass between threads is fundamentally determined by the speed of memory exchange between CPU cores.

The minimum unit of memory exchange between cores is a cache line (typically 64 bytes).

This article uses a small test program to compare the inter-thread communication latency of Java and C++.

The test program runs two threads, each incrementing its own shared counter; each thread busy-waits for the other's counter to advance before incrementing its own (the "communication" here is simply that both counters live in memory visible to both threads). If the program ends when both counters reach N, roughly 2N exchanges have taken place, from which the average time of a single inter-thread exchange can be computed.

From the final results, the author concludes that the Disruptor's design pays off clearly: it passes messages even faster than this simple test program does.

Message rates between threads are fundamentally determined by the latency of memory exchange between CPU cores.   The minimum unit of transfer will be a cache line exchanged via shared caches or socket interconnects.  In a previous article I explained Memory Barriers and why they are important to concurrent programming between threads.  These are the instructions that cause a CPU to make memory visible to other cores in an ordered and timely manner.

Lately I’ve been asked a lot about how much faster the Disruptor would be if C++ was used instead of Java.  For sure C++ would give more control for memory alignment and potential access to underlying CPU instructions such as memory barriers and lock instructions.  In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.

For the test we’ll use two counters each updated by their own thread.  A simple ping-pong algorithm will be used to signal from one to the other and back again.  The exchange will be repeated millions of times to measure the average latency between cores.  This measurement will give us the latency of exchanging a cache line between cores in a serial manner.

For Java we’ll use volatile counters, for which the JVM will kindly insert a lock instruction on each update, giving us an effective memory barrier.

public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin
            }
            value = ++s2;
        }
    }
}

For C++ we’ll use the GNU Atomic Builtins (http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html), which give us a similar lock instruction insertion to that which the JVM uses.

#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %lld\n", duration);
    printf("ns per op = %lld\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %lld\n",
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %lld, s2 = %lld\n", s1, s2);

    return 0;
}

Results

$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lrt -o itl itl.cpp -lpthread
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000

The C++ version is slightly faster on my Intel Sandybridge laptop.  So what does this tell us?  Well, that the latency between 2 cores on a 2.2 GHz machine is ~45ns and that you can exchange 22m messages per second in a serial fashion.  On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and write combining buffers to drain, followed by the resulting cache coherency traffic between the cores.   Note that each core has a 96GB/s port onto the L3 cache ring bus, yet 22m * 64-bytes is only 1.4 GB/s.  This is because we have measured latency and not throughput.  We could easily fit some nice fat messages between those memory barriers as part of the exchange if the data has been written before the lock instruction was executed.
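
To make that last point concrete, here is a minimal sketch of the pattern (my own illustration, not from the original article; all names are invented): payload fields are written with plain stores, and a single volatile store then publishes the whole message, so a fat payload rides through on the same memory barrier.

public final class PiggybackExample
{
    // Plain fields: the "fat message" payload.
    private static long payloadA;
    private static long payloadB;

    // A single volatile sequence acts as the publish flag; its store is the
    // only memory barrier paid per message (a single writer is assumed).
    private static volatile long sequence;

    public static void publish(final long a, final long b)
    {
        payloadA = a;   // plain stores, no barrier yet
        payloadB = b;
        sequence++;     // volatile store: makes the payload visible to readers
    }

    public static void consume(final long expectedSequence)
    {
        while (sequence < expectedSequence)
        {
            // busy spin
        }
        // The volatile load above orders the plain loads below,
        // so the payload is safe to read here.
        System.out.println("a = " + payloadA + ", b = " + payloadB);
    }
}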

So what does this all mean for the Disruptor?  Basically, the latency of the Disruptor is about as low as we can get from Java.  It would be possible to get a ~10% latency improvement by moving to C++.  I’d expect a similar improvement in throughput for C++.  The main win with C++ would be the control, and therefore, the predictability that comes with it if used correctly.  The JVM gives us nice safety features like garbage collection in complex applications but we pay a little for that with the extra instructions it inserts that can be seen if you get Hotspot to dump the assembler instructions it is generating.

How does the Disruptor achieve more than 25m messages per second I hear you say???   Well that is one of the neat parts of its design.  The “waitFor” semantics on the SequenceBarrier enable a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with the RingBuffer, all without incurring a memory barrier.  For real world applications this batching effect is really significant.  For micro benchmarks it only makes the results more random, especially when there is little work done other than accepting the message.
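
The effect can be sketched as follows. This is my own simplification with invented names (cursor, events, process), not the Disruptor’s actual API: the consumer pays one volatile read to learn the latest published sequence, then drains every event up to it with plain reads and no further barriers.

public final class BatchingConsumer implements Runnable
{
    private static final int SIZE = 1024; // power of two for cheap index masking
    private static final long[] events = new long[SIZE];
    private static volatile long cursor = -1; // last sequence published by the producer

    public void run()
    {
        long nextSequence = 0;
        while (true)
        {
            long available = cursor; // one volatile read...
            while (available < nextSequence)
            {
                available = cursor; // busy spin until new events are published
            }
            // ...then process the whole batch with plain reads and no further barriers
            while (nextSequence <= available)
            {
                process(events[(int)(nextSequence & (SIZE - 1))]);
                nextSequence++;
            }
        }
    }

    private void process(final long event)
    {
        // application work goes here
    }
}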

Conclusion

So when processing events in series, the measurements tell us that the current generation of processors can do 20-30 million exchanges per second at a latency of less than 50ns.  The Disruptor design allows us to get greater throughput without explicit batching on the publisher side.  In addition the Disruptor has an explicit batching API on the publisher side that can give over 100 million messages per second.
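
For reference, publisher-side batching looks roughly like this with the Disruptor 3.x RingBuffer API (next(n), get, and publish(lo, hi)); that API post-dates this article, so treat this as a sketch rather than the code the author measured.

import com.lmax.disruptor.RingBuffer;

public final class BatchPublisher
{
    // Claim n slots in one call, fill them with plain writes,
    // then publish the whole range with a single call.
    public static void publishBatch(final RingBuffer<long[]> ringBuffer, final long[] values)
    {
        final int n = values.length;
        final long hi = ringBuffer.next(n); // highest claimed sequence
        final long lo = hi - (n - 1);
        try
        {
            for (long seq = lo; seq <= hi; seq++)
            {
                ringBuffer.get(seq)[0] = values[(int)(seq - lo)];
            }
        }
        finally
        {
            ringBuffer.publish(lo, hi); // one publish covers the entire batch
        }
    }
}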
