Inter Thread Latency

Original article: http://mechanical-sympathy.blogspot.com/2011/08/inter-thread-latency.html (mirrored here, as the original is blocked in mainland China)

Overview:

The rate at which messages can pass between threads is fundamentally determined by the speed of memory exchange between CPU cores.

The minimum unit of memory exchange between cores is a cache line (typically 64 bytes).

This article uses a small test program to compare the inter-thread communication latency of Java and C++.

The test program runs two threads, each incrementing its own shared counter; each thread busy-waits for the other's counter to advance before incrementing its own (the "communication" here is simply that both counters live in memory visible to both threads). If the program ends when both counters reach N, roughly 2N exchanges have taken place, from which the average time of a single inter-thread exchange can be computed.

From the final results, the author concludes that the Disruptor's design pays off clearly: it passes messages even faster than this simple test program does.

Message rates between threads are fundamentally determined by the latency of memory exchange between CPU cores.   The minimum unit of transfer will be a cache line exchanged via shared caches or socket interconnects.  In a previous article I explained Memory Barriers and why they are important to concurrent programming between threads.  These are the instructions that cause a CPU to make memory visible to other cores in an ordered and timely manner.

Lately I’ve been asked a lot about how much faster the Disruptor would be if C++ was used instead of Java.  For sure C++ would give more control for memory alignment and potential access to underlying CPU instructions such as memory barriers and lock instructions.  In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.

For the test we’ll use two counters each updated by their own thread.  A simple ping-pong algorithm will be used to signal from one to the other and back again.  The exchange will be repeated millions of times to measure the average latency between cores.  This measurement will give us the latency of exchanging a cache line between cores in a serial manner.

For Java we’ll use volatile counters, for which the JVM will kindly insert a lock instruction on each update, giving us an effective memory barrier.

public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin
            }
            value = ++s2;
        }
    }
}

For C++ we’ll use the GNU Atomic Builtins (http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html), which give us a similar lock instruction insertion to that which the JVM uses.

#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %lld\n", duration);
    printf("ns per op = %lld\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %lld\n",
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %lld, s2 = %lld\n", s1, s2);

    return 0;
}

Results

$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lrt -o itl itl.cpp -lpthread
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000

The C++ version is slightly faster on my Intel Sandybridge laptop.  So what does this tell us?  Well, that the latency between 2 cores on a 2.2 GHz machine is ~45ns and that you can exchange 22m messages per second in a serial fashion.  On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and write combining buffers to drain, followed by the resulting cache coherency traffic between the cores.   Note that each core has a 96GB/s port onto the L3 cache ring bus, yet 22m * 64-bytes is only 1.4 GB/s.  This is because we have measured latency and not throughput.  We could easily fit some nice fat messages between those memory barriers as part of the exchange if the data has been written before the lock instruction was executed.
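
To make that last point concrete, here is a minimal sketch of the pattern (my own illustration, not from the original article; all names are invented): payload fields are written with plain stores, and a single volatile store then publishes the whole message, so a fat payload rides through on the same memory barrier.

public final class PiggybackExample
{
    // Plain fields: the "fat message" payload.
    private static long payloadA;
    private static long payloadB;

    // A single volatile sequence acts as the publish flag; its store is the
    // only memory barrier paid per message (a single writer is assumed).
    private static volatile long sequence;

    public static void publish(final long a, final long b)
    {
        payloadA = a;   // plain stores, no barrier yet
        payloadB = b;
        sequence++;     // volatile store: makes the payload visible to readers
    }

    public static void consume(final long expectedSequence)
    {
        while (sequence < expectedSequence)
        {
            // busy spin
        }
        // The volatile load above orders the plain loads below,
        // so the payload is safe to read here.
        System.out.println("a = " + payloadA + ", b = " + payloadB);
    }
}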

So what does this all mean for the Disruptor?  Basically, the latency of the Disruptor is about as low as we can get from Java.  It would be possible to get a ~10% latency improvement by moving to C++.  I’d expect a similar improvement in throughput for C++.  The main win with C++ would be the control, and therefore, the predictability that comes with it if used correctly.  The JVM gives us nice safety features like garbage collection in complex applications but we pay a little for that with the extra instructions it inserts that can be seen if you get Hotspot to dump the assembler instructions it is generating.

How does the Disruptor achieve more than 25m messages per second I hear you say???   Well that is one of the neat parts of its design.  The “waitFor” semantics on the SequenceBarrier enable a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with the RingBuffer, all without incurring a memory barrier.  For real world applications this batching effect is really significant.  For micro benchmarks it only makes the results more random, especially when there is little work done other than accepting the message.
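
The effect can be sketched as follows. This is my own simplification with invented names (cursor, events, process), not the Disruptor’s actual API: the consumer pays one volatile read to learn the latest published sequence, then drains every event up to it with plain reads and no further barriers.

public final class BatchingConsumer implements Runnable
{
    private static final int SIZE = 1024; // power of two for cheap index masking
    private static final long[] events = new long[SIZE];
    private static volatile long cursor = -1; // last sequence published by the producer

    public void run()
    {
        long nextSequence = 0;
        while (true)
        {
            long available = cursor; // one volatile read...
            while (available < nextSequence)
            {
                available = cursor; // busy spin until new events are published
            }
            // ...then process the whole batch with plain reads and no further barriers
            while (nextSequence <= available)
            {
                process(events[(int)(nextSequence & (SIZE - 1))]);
                nextSequence++;
            }
        }
    }

    private void process(final long event)
    {
        // application work goes here
    }
}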

Conclusion

So when processing events in series, the measurements tell us that the current generation of processors can do 20-30 million exchanges per second at a latency of less than 50ns.  The Disruptor design allows us to get greater throughput without explicit batching on the publisher side.  In addition the Disruptor has an explicit batching API on the publisher side that can give over 100 million messages per second.
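
For reference, publisher-side batching looks roughly like this with the Disruptor 3.x RingBuffer API (next(n), get, and publish(lo, hi)); that API post-dates this article, so treat this as a sketch rather than the code the author measured.

import com.lmax.disruptor.RingBuffer;

public final class BatchPublisher
{
    // Claim n slots in one call, fill them with plain writes,
    // then publish the whole range with a single call.
    public static void publishBatch(final RingBuffer<long[]> ringBuffer, final long[] values)
    {
        final int n = values.length;
        final long hi = ringBuffer.next(n); // highest claimed sequence
        final long lo = hi - (n - 1);
        try
        {
            for (long seq = lo; seq <= hi; seq++)
            {
                ringBuffer.get(seq)[0] = values[(int)(seq - lo)];
            }
        }
        finally
        {
            ringBuffer.publish(lo, hi); // one publish covers the entire batch
        }
    }
}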
