Disruptor:

High performance alternative to bounded queues for exchanging data between concurrent threads

Martin Thompson
Dave Farley
Michael Barker
Patricia Gee
Andrew Stewart

1     Abstract

LMAX was established to create a very high performance financial exchange.  As part of our work to accomplish this goal we have evaluated several approaches to the design of such a system, but as we began to measure these we ran into some fundamental limits with conventional approaches.

Many applications depend on queues to exchange data between processing stages.  Our performance testing showed that the latency costs, when using queues in this way, were in the same order of magnitude as the cost of IO operations to disk (RAID or SSD based disk system) – dramatically slow.  If there are multiple queues in an end-to-end operation, this will add hundreds of microseconds to the overall latency.  There is clearly room for optimisation.

Further investigation and a focus on the computer science made us realise that the conflation of concerns inherent in conventional approaches (e.g. queues and processing nodes) leads to contention in multi-threaded implementations, suggesting that there may be a better approach.

By thinking about how modern CPUs work, something we like to call “mechanical sympathy”, and by using good design practices with a strong focus on teasing apart the concerns, we came up with a data structure and a pattern of use that we have called the Disruptor.

Testing has shown that the mean latency using the Disruptor for a three-stage pipeline is 3 orders of magnitude lower than an equivalent queue-based approach.  In addition, the Disruptor handles approximately 8 times more throughput for the same configuration.

These performance improvements represent a step change in the thinking around concurrent programming.  This new pattern is an ideal foundation for any asynchronous event processing architecture where high throughput and low latency are required.

At LMAX we have built an order matching engine, real-time risk management, and a highly available in-memory transaction processing system all on this pattern to great success.  Each of these systems has set new performance standards that, as far as we can tell, are unsurpassed.

However this is not a specialist solution that is only of relevance in the finance industry.  The Disruptor is a general-purpose mechanism that solves a complex problem in concurrent programming in a way that maximizes performance, and that is simple to implement.  Although some of the concepts may seem unusual, it has been our experience that systems built to this pattern are significantly simpler to implement than comparable mechanisms.

The Disruptor has significantly less write contention, a lower concurrency overhead and is more cache friendly than comparable approaches, all of which results in greater throughput with less jitter at lower latency.  On processors at moderate clock rates we have seen over 25 million messages per second and latencies lower than 50 nanoseconds.  This performance is a significant improvement compared to any other implementation that we have seen.  It is very close to the theoretical limit of a modern processor to exchange data between cores.

2     Overview

The Disruptor is the result of our efforts to build the world’s highest performance financial exchange at LMAX.  Early designs focused on architectures derived from SEDA[1] and Actors[2] using pipelines for throughput.  After profiling various implementations it became evident that the queuing of events between stages in the pipeline was dominating the costs.  We found that queues also introduced latency and high levels of jitter.  We expended significant effort on developing new queue implementations with better performance.  However it became evident that queues as a fundamental data structure are limited due to the conflation of design concerns for the producers, consumers, and their data storage.  The Disruptor is the result of our work to build a concurrent structure that cleanly separates these concerns.

3     The Complexities of Concurrency

In the context of this document, and computer science in general, concurrency means not only that two or more tasks happen in parallel, but also that they contend on access to resources.  The contended resource may be a database, file, socket or even a location in memory.

Concurrent execution of code is about two things, mutual exclusion and visibility of change.  Mutual exclusion is about managing contended updates to some resource.  Visibility of change is about controlling when such changes are made visible to other threads.  It is possible to avoid the need for mutual exclusion if you can eliminate the need for contended updates.  If your algorithm can guarantee that any given resource is modified by only one thread, then mutual exclusion is unnecessary.  Read and write operations require that all changes are made visible to other threads.  However only contended write operations require the mutual exclusion of the changes.

The most costly operation in any concurrent environment is a contended write access.  To have multiple threads write to the same resource requires complex and expensive coordination.  Typically this is achieved by employing a locking strategy of some kind.

3.1     The Cost of Locks

Locks provide mutual exclusion and ensure that the visibility of change occurs in an ordered manner.  Locks are incredibly expensive because they require arbitration when contended.  This arbitration is achieved by a context switch to the operating system kernel which will suspend threads waiting on a lock until it is released.  During such a context switch, as well as releasing control to the operating system which may decide to do other house-keeping tasks while it has control, the execution context can lose previously cached data and instructions.  This can have a serious performance impact on modern processors.  Fast user mode locks can be employed but these are only of any real benefit when not contended.

We will illustrate the cost of locks with a simple demonstration.  The focus of this experiment is to call a function which increments a 64-bit counter in a loop 500 million times.  This can be executed by a single thread on a 2.4GHz Intel Westmere EP in just 300ms if written in Java.  The language is unimportant to this experiment and results will be similar across all languages with the same basic primitives.

Once a lock is introduced to provide mutual exclusion, even when the lock is as yet un-contended, the cost goes up significantly.  The cost increases again, by orders of magnitude, when two or more threads begin to contend.  The results of this simple experiment are shown in the table below:

Method                               Time (ms)
Single thread                        300
Single thread with lock              10,000
Two threads with lock                224,000
Single thread with CAS               5,700
Two threads with CAS                 30,000
Single thread with volatile write    4,700

Table 1 - Comparative costs of contention
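As an illustration of how such an experiment can be coded, the following sketch runs the single-threaded variants of the test in Java.  The class name and output format are ours, not the original test harness; the two-thread variants simply run the same loop bodies from two threads and are omitted for brevity.

import java.util.concurrent.atomic.AtomicLong;

public class CounterCostDemo
{
    private static final long ITERATIONS = 500L * 1000L * 1000L;

    private static long plainCounter = 0L;
    private static volatile long volatileCounter = 0L;
    private static final AtomicLong atomicCounter = new AtomicLong(0L);
    private static final Object lock = new Object();

    public static void main(final String[] args)
    {
        // Single thread, no synchronisation: the baseline case.
        long start = System.nanoTime();
        for (long i = 0; i < ITERATIONS; i++)
        {
            plainCounter++;
        }
        System.out.println("plain:    " + (System.nanoTime() - start) / 1000000 + "ms");

        // Single thread taking an (uncontended) lock on every increment.
        start = System.nanoTime();
        for (long i = 0; i < ITERATIONS; i++)
        {
            synchronized (lock)
            {
                plainCounter++;
            }
        }
        System.out.println("locked:   " + (System.nanoTime() - start) / 1000000 + "ms");

        // Single thread using a CAS-based atomic increment.
        start = System.nanoTime();
        for (long i = 0; i < ITERATIONS; i++)
        {
            atomicCounter.incrementAndGet();
        }
        System.out.println("CAS:      " + (System.nanoTime() - start) / 1000000 + "ms");

        // Single thread writing a volatile field (a write barrier on each store).
        start = System.nanoTime();
        for (long i = 0; i < ITERATIONS; i++)
        {
            volatileCounter++;
        }
        System.out.println("volatile: " + (System.nanoTime() - start) / 1000000 + "ms");
    }
}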

3.2     The Costs of “CAS”

A more efficient alternative to the use of locks can be employed for updating memory when the target of the update is a single word.  These alternatives are based upon the atomic, or interlocked, instructions implemented in modern processors.  These are commonly known as CAS (Compare And Swap) operations, e.g. “lock cmpxchg” on x86.  A CAS operation is a special machine-code instruction that allows a word in memory to be conditionally set as an atomic operation.  For the “increment a counter” experiment each thread can spin in a loop, reading the counter then trying to atomically set it to its new incremented value.  The old and new values are provided as parameters to this instruction.  If, when the operation is executed, the value of the counter matches the supplied expected value, the counter is updated with the new value.  If, on the other hand, the value is not as expected, the CAS operation will fail.  It is then up to the thread attempting to perform the change to retry, re-reading the counter, incrementing from that value, and so on until the change succeeds.  This CAS approach is significantly more efficient than locks because it does not require a context switch to the kernel for arbitration.  However CAS operations are not free of cost.  The processor must lock its instruction pipeline to ensure atomicity and employ a memory barrier to make the changes visible to other threads.  CAS operations are available in Java by using the java.util.concurrent.Atomic* classes.
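The spin-and-retry loop described above can be written against java.util.concurrent.atomic.AtomicLong roughly as follows; this is a sketch of the general CAS idiom rather than the exact code used in the experiment.

import java.util.concurrent.atomic.AtomicLong;

public class CasIncrement
{
    private final AtomicLong counter = new AtomicLong(0L);

    public long increment()
    {
        long current;
        long next;
        do
        {
            current = counter.get();   // read the current value
            next = current + 1;        // compute the desired new value
        }
        while (!counter.compareAndSet(current, next)); // retry if another thread won the race

        return next;
    }
}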

If the critical section of the program is more complex than a simple increment of a counter it may take a complex state machine using multiple CAS operations to orchestrate the contention.  Developing concurrent programs using locks is difficult; developing lock-free algorithms using CAS operations and memory barriers is many times more complex and it is very difficult to prove that they are correct.

The ideal algorithm would be one with only a single thread owning all writes to a single resource with other threads reading the results.  To read the results in a multi-processor environment requires memory barriers to make the changes visible to threads running on other processors.

3.3     Memory Barriers

Modern processors perform out-of-order execution of instructions and out-of-order loads and stores of data between memory and execution units for performance reasons.  The processors need only guarantee that program logic produces the same results regardless of execution order.  This is not an issue for single-threaded programs.  However, when threads share state it is important that all memory changes appear in order, at the point required, for the data exchange to be successful.  Memory barriers are used by processors to indicate sections of code where the ordering of memory updates is important.  They are the means by which hardware ordering and visibility of change is achieved between threads.  Compilers can put in place complementary software barriers to ensure the ordering of compiled code; such software memory barriers are in addition to the hardware barriers used by the processors themselves.

Modern CPUs are now much faster than the current generation of memory systems.  To bridge this divide CPUs use complex cache systems which are effectively fast hardware hash tables without chaining.  These caches are kept coherent with other processor cache systems via message passing protocols.  In addition, processors have “store buffers” to offload writes to these caches, and “invalidate queues” so that the cache coherency protocols can acknowledge invalidation messages quickly for efficiency when a write is about to happen.

What this means for data is that the latest version of any value could, at any stage after being written, be in a register, a store buffer, one of many layers of cache, or in main memory.  If threads are to share this value, it needs to be made visible in an ordered fashion and this is achieved through the coordinated exchange of cache coherency messages.  The timely generation of these messages can be controlled by memory barriers.

A read memory barrier orders load instructions on the CPU that executes it by marking a point in the invalidate queue for changes coming into its cache.  This gives it a consistent view of the world for write operations ordered before the read barrier.

A write barrier orders store instructions on the CPU that executes it by marking a point in the store buffer, thus flushing writes out via its cache.  This barrier gives an ordered view to the world of what store operations happen before the write barrier.

A full memory barrier orders both loads and stores but only on the CPU that executes it.

Some CPUs have more variants in addition to these three primitives but these three are sufficient to understand the complexities of what is involved.  In the Java memory model the read and write of a volatile field implement the read and write barriers respectively.  This was made explicit in the Java Memory Model[3] as defined with the release of Java 5.
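The following minimal sketch shows the resulting Java idiom: a writer stores its data and then performs a volatile write (write barrier), and a reader performs a volatile read (read barrier) before reading the data.  The class and field names are illustrative.

public class VolatilePublication
{
    private long payload;             // plain field, written before the flag
    private volatile boolean ready;   // the volatile write/read acts as the barrier

    public void publish(final long value)
    {
        payload = value;   // ordinary store
        ready = true;      // volatile store: the payload write is ordered before it
    }

    public long tryRead()
    {
        if (ready)         // volatile load: a reader that sees true also sees the payload
        {
            return payload;
        }
        return -1L;        // sentinel meaning "not yet published" (illustrative only)
    }
}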

3.4     Cache Lines

The way in which caching is used in modern processors is of immense importance to successful high performance operation.  Such processors are enormously efficient at churning through data and instructions held in cache and yet, comparatively, are massively inefficient when a cache miss occurs.

Our hardware does not move memory around in bytes or words.  For efficiency, caches are organised into cache-lines that are typically 32-256 bytes in size, the most common cache-line being 64 bytes.  This is the level of granularity at which cache coherency protocols operate.  This means that if two variables are in the same cache line, and they are written to by different threads, then they present the same problems of write contention as if they were a single variable.  This is a concept known as “false sharing”.  For high performance then, it is important to ensure that independent, but concurrently written, variables do not share the same cache-line if contention is to be minimised.
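A common, if crude, way of avoiding false sharing in Java of this era is to pad around a hot field so that two independently written fields are unlikely to land on the same cache-line.  The sketch below assumes a 64-byte cache-line and is illustrative only; the JVM is free to lay out fields as it sees fit, so padding of this kind is a best-effort measure.

public class PaddedCounter
{
    // Padding before and after the hot field so that two PaddedCounter
    // instances updated by different threads are unlikely to share a
    // cache-line (64 bytes assumed).
    public long p1, p2, p3, p4, p5, p6, p7;
    public volatile long value = 0L;
    public long p8, p9, p10, p11, p12, p13, p14;
}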

When accessing memory in a predictable manner CPUs are able to hide the latency cost of accessing main memory by predicting which memory is likely to be accessed next and pre-fetching it into the cache in the background.  This only works if the processors can detect a pattern of access such as walking memory with a predictable “stride”.  When iterating over the contents of an array the stride is predictable and so memory will be pre-fetched in cache lines, maximizing the efficiency of the access.  Strides typically have to be less than 2048 bytes in either direction to be noticed by the processor.  However, data structures like linked lists and trees tend to have nodes that are more widely distributed in memory with no predictable stride of access.  The lack of a consistent pattern in memory constrains the ability of the system to pre-fetch cache-lines, resulting in main memory accesses which can be more than 2 orders of magnitude less efficient.
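This effect can be observed with a simple comparison of a sequential walk over a long[] against a walk over the same values held in a LinkedList; the sketch below is illustrative only and the chosen size is arbitrary.

import java.util.LinkedList;

public class StrideDemo
{
    private static final int SIZE = 1000000;

    public static void main(final String[] args)
    {
        final long[] array = new long[SIZE];
        final LinkedList<Long> list = new LinkedList<Long>();
        for (int i = 0; i < SIZE; i++)
        {
            array[i] = i;
            list.add((long) i);
        }

        // Predictable stride: the prefetcher can stream cache lines ahead of the loop.
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < SIZE; i++)
        {
            sum += array[i];
        }
        System.out.println("array sum " + sum + " in " + (System.nanoTime() - start) / 1000 + "us");

        // Pointer chasing: node addresses have no predictable stride, so many reads miss cache.
        start = System.nanoTime();
        sum = 0;
        for (final Long value : list)
        {
            sum += value;
        }
        System.out.println("list  sum " + sum + " in " + (System.nanoTime() - start) / 1000 + "us");
    }
}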

3.5     The Problems of Queues

Queues typically use either linked-lists or arrays for the underlying storage of elements.  If an in-memory queue is allowed to be unbounded then for many classes of problem it can grow unchecked until it reaches the point of catastrophic failure by exhausting memory.  This happens when producers outpace the consumers.  Unbounded queues can be useful in systems where the producers are guaranteed not to outpace the consumers and memory is a precious resource, but there is always a risk if this assumption doesn’t hold and the queue grows without limit.  To avoid this catastrophic outcome, queues are commonly constrained in size (bounded).  Keeping a queue bounded requires that it is either array-backed or that the size is actively tracked.

Queue implementations tend to have write contention on the head, tail, and size variables.  When in use, queues are typically always close to full or close to empty due to the differences in pace between consumers and producers.  They very rarely operate in a balanced middle ground where the rate of production and consumption is evenly matched.  This propensity to be always full or always empty results in high levels of contention and/or expensive cache coherence.  The problem is that even when the head and tail mechanisms are separated using different concurrent objects such as locks or CAS variables, they generally occupy the same cache-line.

The concerns of managing producers claiming the head of a queue, consumers claiming the tail, and the storage of nodes in between make the designs of concurrent implementations very complex to manage beyond using a single large-grain lock on the queue.  Large-grain locks on the whole queue for put and take operations are simple to implement but represent a significant bottleneck to throughput.  If the concurrent concerns are teased apart within the semantics of a queue then the implementations become very complex for anything other than a single producer – single consumer implementation.

In Java there is a further problem with the use of queues, as they are significant sources of garbage.  Firstly, objects have to be allocated and placed in the queue.  Secondly, if linked-list backed, objects have to be allocated representing the nodes of the list.  When no longer referenced, all these objects allocated to support the queue implementation need to be re-claimed.

3.6     Pipelines and Graphs

For many classes of problem it makes sense to wire together several processing stages into pipelines.  Such pipelines often have parallel paths, being organised into graph-like topologies.  The links between each stage are often implemented by queues, with each stage having its own thread.

This approach is not cheap - at each stage we have to incur the cost of en-queuing and de-queuing units of work.  The number of targets multiplies this cost when the path must fork, and incurs an inevitable cost of contention when it must re-join after such a fork.

It would be ideal if the graph of dependencies could be expressed without incurring the cost of putting the queues between stages.

4     Design of the LMAX Disruptor

While trying to address the problems described above, a design emerged through a rigorous separation of the concerns that we saw as being conflated in queues.  This approach was combined with a focus on ensuring that any data should be owned by only one thread for write access, therefore eliminating write contention.  That design became known as the “Disruptor”.  It was so named because it had elements of similarity, in how it deals with graphs of dependencies, to the concept of “Phasers”[4] in Java 7, introduced to support Fork-Join.

The LMAX disruptor is designed to address all of the issues outlined above in an attempt to maximize the efficiency of memory allocation, and to operate in a cache-friendly manner so that it will perform optimally on modern hardware.

At the heart of the disruptor mechanism sits a pre-allocated bounded data structure in the form of a ring-buffer.  Data is added to the ring buffer through one or more producers and processed by one or more consumers.

4.1     Memory Allocation

All memory for the ring buffer is pre-allocated on start up.  A ring-buffer can store either an array of pointers to entries or an array of structures representing the entries.  The limitations of the Java language mean that entries are associated with the ring-buffer as pointers to objects.  Each of these entries is typically not the data being passed itself, but a container for it.  This pre-allocation of entries eliminates issues in languages that support garbage collection, since the entries will be re-used and live for the duration of the Disruptor instance.  The memory for these entries is allocated at the same time and it is highly likely that it will be laid out contiguously in main memory and so support cache striding.  There is a proposal by John Rose to introduce “value types”[5] to the Java language which would allow arrays of tuples, as in other languages such as C, and so ensure that memory would be allocated contiguously and avoid the pointer indirection.

Garbage collection can be problematic when developing low-latency systems in a managed runtime environment like Java.  The more memory that is allocated the greater the burden this puts on the garbage collector.  Garbage collectors work at their best when objects are either very short-lived or effectively immortal.  The pre-allocation of entries in the ring buffer means that it is immortal as far as the garbage collector is concerned and so represents little burden.

Under heavy load queue-based systems can back up, which can lead to a reduction in the rate of processing, and results in the allocated objects surviving longer than they should, thus being promoted beyond the young generation with generational garbage collectors.  This has two implications: first, the objects have to be copied between generations, which causes latency jitter; second, these objects have to be collected from the old generation which is typically a much more expensive operation and increases the likelihood of “stop the world” pauses that result when the fragmented memory space requires compaction.  In large memory heaps this can cause pauses of seconds per GB in duration.

4.2     Teasing Apart the Concerns

We saw the following concerns as being conflated in all queue implementations, to the extent that this collection of distinct behaviours tends to define the interfaces that queues implement:

1.    Storage of items being exchanged

2.    Coordination of producers claiming the next sequence for exchange

3.    Coordination of consumers being notified that a new item is available

When designing a financial exchange in a language that uses garbage collection, too much memory allocation can be problematic.  So, as we have described, linked-list backed queues are not a good approach.  Garbage collection is minimized if the entire storage for the exchange of data between processing stages can be pre-allocated.  Further, if this allocation can be performed in a uniform chunk, then traversal of that data will be done in a manner that is very friendly to the caching strategies employed by modern processors.  A data-structure that meets this requirement is an array with all the slots pre-filled.  On creation of the ring buffer the Disruptor utilises the abstract factory pattern to pre-allocate the entries.  When an entry is claimed, a producer can copy its data into the pre-allocated structure.
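The pre-allocation step can be sketched as follows.  The PreAllocatedRing and EntryFactory names here are hypothetical illustrations of the abstract factory approach; the actual framework types appear in the code example of section 4.7.

// Illustrative sketch (hypothetical names): every slot is filled at construction
// time via a factory, so producers only ever copy data into existing objects.
public class PreAllocatedRing<T>
{
    public interface EntryFactory<E>
    {
        E create();
    }

    private final Object[] entries;

    public PreAllocatedRing(final int size, final EntryFactory<T> factory)
    {
        entries = new Object[size];
        for (int i = 0; i < size; i++)
        {
            entries[i] = factory.create();
        }
    }

    @SuppressWarnings("unchecked")
    public T get(final long sequence)
    {
        // Naive modulo indexing; the next paragraph describes the power-of-two mask optimisation.
        return (T) entries[(int) (sequence % entries.length)];
    }
}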

On most processors there is a very high cost for the remainder calculation on the sequence number, which determines the slot in the ring.  This cost can be greatly reduced by making the ring size a power of 2.  A bit mask of size minus one can then be used to perform the remainder operation efficiently.
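For example, with a ring of 1024 slots the slot index can be derived from the sequence with a bit mask instead of a remainder instruction (a minimal sketch):

public final class RingIndex
{
    private static final int RING_SIZE = 1024;      // must be a power of 2
    private static final int MASK = RING_SIZE - 1;  // 0x3FF

    // Equivalent to sequence % RING_SIZE, without the costly remainder calculation.
    public static int indexOf(final long sequence)
    {
        return (int) (sequence & MASK);
    }
}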

As we described earlier, bounded queues suffer from contention at the head and tail of the queue.  The ring buffer data structure is free from this contention and these concurrency primitives because the concerns have been teased out into producer and consumer barriers through which the ring buffer must be accessed.  The logic for these barriers is described below.

In most common usages of the Disruptor there is usually only one producer.  Typical producers are file readers or network listeners.  In cases where there is a single producer there is no contention on sequence/entry allocation.

In more unusual usages where there are multiple producers, producers will race one another to claim the next entry in the ring-buffer.  Contention on claiming the next available entry can be managed with a simple CAS operation on the sequence number for that slot.

Once a producer has copied the relevant data to the claimed entry it can make it public to consumers by committing the sequence.  This can be done without CAS by a simple busy spin until the other producers have reached this sequence in their own commit.  Then this producer can advance the cursor signifying the next available entry for consumption.  Producers can avoid wrapping the ring by tracking the sequence of consumers as a simple read operation before they write to the ring buffer.

Consumers wait for a sequence to become available in the ring buffer before they read the entry.  Various strategies can be employed while waiting.  If CPU resource is precious they can wait on a condition variable within a lock that gets signalled by the producers.  This obviously is a point of contention and only to be used when CPU resource is more important than latency or throughput.  The consumers can also loop checking the cursor which represents the currently available sequence in the ring buffer.  This could be done with or without a thread yield by trading CPU resource against latency.  This scales very well because we have broken the contended dependency between the producers and consumers if we do not use a lock and condition variable.  Lock-free multi-producer – multi-consumer queues do exist but they require multiple CAS operations on the head, tail, and size counters.  The Disruptor does not suffer this CAS contention.
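The two lock-free waiting strategies mentioned above amount to a loop over a volatile cursor, with Thread.yield() as the knob that trades CPU usage against latency.  The sketch below is illustrative and is not the framework’s WaitStrategy implementation:

// Illustrative consumer wait loop: spin (optionally yielding) until the
// producers' cursor has advanced at least to the sequence we need.
public final class SpinWait
{
    private volatile long cursor = -1L;   // written by the producer side

    public long waitFor(final long requiredSequence, final boolean yieldCpu)
    {
        long available;
        while ((available = cursor) < requiredSequence)
        {
            if (yieldCpu)
            {
                Thread.yield();   // give up the CPU: lower CPU burn, slightly higher latency
            }
            // else: pure busy spin, lowest latency at the cost of a hot core
        }
        return available;         // may be ahead of requiredSequence, enabling batching
    }
}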

4.3     Sequencing

Sequencing is the core concept of how concurrency is managed in the Disruptor.  Each producer and consumer works off a strict sequencing concept for how it interacts with the ring buffer.  Producers claim the next slot in sequence when claiming an entry in the ring.  This sequence of the next available slot can be a simple counter in the case of only one producer or an atomic counter updated using CAS operations in the case of multiple producers.  Once a sequence value is claimed, this entry in the ring buffer is now available to be written to by the claiming producer.  When the producer has finished updating the entry it can commit the changes by updating a separate counter which represents the cursor on the ring buffer for the latest entry available to consumers.  The ring buffer cursor can be read and written in a busy spin by the producers using a memory barrier, without requiring a CAS operation, as shown below.

        long expectedSequence = claimedSequence - 1;

        while (cursor != expectedSequence)
        {
            // busy spin
        }

        cursor = claimedSequence;

 

Consumers wait for a given sequence to become available by using a memory barrier to read the cursor.  Once the cursor has been updated the memory barriers ensure the changes to the entries in the ring buffer are visible to the consumers who have waited on the cursor advancing.

Consumers each contain their own sequence which they update as they process entries from the ring buffer.  These consumer sequences allow the producers to track consumers to prevent the ring from wrapping.  Consumer sequences also allow consumers to coordinate work on the same entry in an ordered manner.

In the case of having only one producer, and regardless of the complexity of the consumer graph, no locks or CAS operations are required.  The whole concurrency coordination can be achieved with just memory barriers on the discussed sequences.

4.4     Batching Effect

When consumers are waiting on an advancing cursor sequence in the ring buffer an interesting opportunity arises that is not possible with queues.  If the consumer finds the ring buffer cursor has advanced a number of steps since it last checked it can process up to that sequence without getting involved in the concurrency mechanisms.  This results in the lagging consumer quickly regaining pace with the producers when the producers burst ahead, thus balancing the system.  This type of batching increases throughput while reducing and smoothing latency at the same time.  Based on our observations, this effect results in a close to constant time for latency regardless of load, up until the memory sub-system is saturated, and then the profile is linear following Little’s Law[6].  This is very different to the “J” curve effect on latency we have observed with queues as load increases.
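The batching effect can be made concrete with a sketch of a consuming loop that, on each pass, drains every entry published since its last check before touching any concurrency machinery again.  This is an illustrative stand-alone class, not the framework’s BatchConsumer:

// Illustrative sketch of the batching effect: on each pass the consumer
// drains every entry published since its last check in one batch.
public final class BatchingConsumer implements Runnable
{
    private final long[] ring;           // pre-allocated storage (stand-in for real entries)
    private final int mask;
    private volatile boolean running = true;

    volatile long cursor = -1L;          // latest published sequence, written by the producer
    private long nextSequence = 0L;      // this consumer's own progress

    public BatchingConsumer(final int ringSizePowerOfTwo)
    {
        ring = new long[ringSizePowerOfTwo];
        mask = ringSizePowerOfTwo - 1;
    }

    public void halt()
    {
        running = false;
    }

    @Override
    public void run()
    {
        while (running)
        {
            final long available = cursor;     // single volatile read
            while (nextSequence <= available)  // catch up in one batch, no further coordination
            {
                process(ring[(int) (nextSequence & mask)]);
                nextSequence++;
            }
            // end-of-batch work (e.g. flushing to an IO device) would go here
        }
    }

    private void process(final long value)
    {
        // handle one entry
    }
}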

4.5     Dependency Graphs

A queue represents the simple one-step pipeline dependency between producers and consumers.  If the consumers form a chain or graph-like structure of dependencies then queues are required between each stage of the graph.  This incurs the fixed costs of queues many times within the graph of dependent stages.  When designing the LMAX financial exchange our profiling showed that taking a queue based approach resulted in queuing costs dominating the total execution costs for processing a transaction.

Because the producer and consumer concerns are separated with the Disruptor pattern, it is possible to represent a complex graph of dependencies between consumers while only using a single ring buffer at the core.  This results in greatly reduced fixed costs of execution, thus increasing throughput while reducing latency.

A single ring buffer can be used to store entries with a complex structure representing the whole workflow in a cohesive place.  Care must be taken in the design of such a structure so that the state written by independent consumers does not result in false sharing of cache lines.

4.6     Disruptor Class Diagram

The core relationships in the Disruptor framework are depicted in the class diagram below.  This diagram leaves out the convenience classes which can be used to simplify the programming model.  After the dependency graph is constructed the programming model is simple.  Producers claim entries in sequence via a ProducerBarrier, write their changes into the claimed entry, then commit that entry back via the ProducerBarrier making it available for consumption.  As a consumer, all one needs to do is provide a BatchHandler implementation that receives callbacks when a new entry is available.  The resulting programming model is event based, having a lot of similarities to the Actor Model.

Separating the concerns normally conflated in queue implementations allows for a more flexible design.  A RingBuffer exists at the core of the Disruptor pattern providing storage for data exchange without contention.  The concurrency concerns are separated out for the producers and consumers interacting with the RingBuffer.  The ProducerBarrier manages any concurrency concerns associated with claiming slots in the ring buffer, while tracking dependent consumers to prevent the ring from wrapping.  The ConsumerBarrier notifies consumers when new entries are available, and Consumers can be constructed into a graph of dependencies representing multiple stages in a processing pipeline.

 

4.7     Code Example

The code below is an example of a single producer and single consumer using the convenience interface BatchHandler for implementing a consumer.  The consumer runs on a separate thread receiving entries as they become available.

// Callback handler which can be implemented by consumers
final BatchHandler<ValueEntry> batchHandler = new BatchHandler<ValueEntry>()
{
    public void onAvailable(final ValueEntry entry) throws Exception
    {
        // process a new entry as it becomes available.
    }

    public void onEndOfBatch() throws Exception
    {
        // useful for flushing results to an IO device if necessary.
    }

    public void onCompletion()
    {
        // do any necessary clean up before shutdown
    }
};

RingBuffer<ValueEntry> ringBuffer =
    new RingBuffer<ValueEntry>(ValueEntry.ENTRY_FACTORY, SIZE,
                               ClaimStrategy.Option.SINGLE_THREADED,
                               WaitStrategy.Option.YIELDING);
ConsumerBarrier<ValueEntry> consumerBarrier = ringBuffer.createConsumerBarrier();
BatchConsumer<ValueEntry> batchConsumer =
    new BatchConsumer<ValueEntry>(consumerBarrier, batchHandler);
ProducerBarrier<ValueEntry> producerBarrier = ringBuffer.createProducerBarrier(batchConsumer);

// Each consumer can run on a separate thread
EXECUTOR.submit(batchConsumer);

// Producers claim entries in sequence
ValueEntry entry = producerBarrier.nextEntry();

// copy data into the entry container

// make the entry available to consumers
producerBarrier.commit(entry);

5     Throughput Performance Testing

As a reference we chose Doug Lea’s excellent java.util.concurrent.ArrayBlockingQueue[7] which has the highest performance of any bounded queue based on our testing.  The tests are conducted in a blocking programming style to match that of the Disruptor.  The test cases detailed below are available in the Disruptor open source project.  Note: running the tests requires a system capable of executing at least 4 threads in parallel.

[Figure: test configurations – Unicast: 1P – 1C; Three Step Pipeline: 1P – 3C; Sequencer: 3P – 1C; Multicast: 1P – 3C; Diamond: 1P – 3C]

For the above configurations an ArrayBlockingQueue was applied for each arc of data flow, compared to a barrier configuration with the Disruptor.  The following table shows the performance results in operations per second using a Java 1.6.0_25 64-bit Sun JVM, Windows 7, Intel Core i7 860 @ 2.8 GHz without HT and Intel Core i7-2720QM, Ubuntu 11.04, taking the best of 3 runs when processing 500 million messages.  Results can vary substantially across different JVM executions and the figures below are not the highest we have observed.

 

                     Nehalem 2.8GHz – Windows 7 SP1 64-bit    Sandy Bridge 2.2GHz – Linux 2.6.38 64-bit
                     ABQ            Disruptor                 ABQ            Disruptor
Unicast: 1P – 1C     5,339,256      25,998,336                4,057,453      22,381,378
Pipeline: 1P – 3C    2,128,918      16,806,157                2,006,903      15,857,913
Sequencer: 3P – 1C   5,539,531      13,403,268                2,056,118      14,540,519
Multicast: 1P – 3C   1,077,384      9,377,871                 260,733        10,860,121
Diamond: 1P – 3C     2,113,941      16,143,613                2,082,725      15,295,197

Table 2 - Comparative throughput (in ops per sec)

6     Latency Performance Testing

To measure latency we take the three stage pipeline and generate events at less than saturation.  This is achieved by waiting 1 microsecond after injecting an event before injecting the next, and repeating 50 million times.  To time at this level of precision it is necessary to use time stamp counters from the CPU.  We chose CPUs with an invariant TSC because older processors suffer from changing frequency due to power saving and sleep states.  Intel Nehalem and later processors use an invariant TSC which can be accessed by the latest Oracle JVMs running on Ubuntu 11.04.  No CPU binding has been employed for this test.

For comparison we use the ArrayBlockingQueue once again.  We could have used ConcurrentLinkedQueue[8] which is likely to give better results, but we want to use a bounded queue implementation to ensure producers do not outpace consumers by creating back pressure.  The results below are for a 2.2GHz Core i7-2720QM running Java 1.6.0_25 64-bit on Ubuntu 11.04.

Mean latency per hop for the Disruptor comes out at 52 nanoseconds compared to 32,757 nanoseconds for ArrayBlockingQueue.  Profiling shows the use of locks and signalling via a condition variable are the main cause of latency for the ArrayBlockingQueue.

 

                                 Array Blocking Queue (ns)    Disruptor (ns)
Min Latency                      145                          29
Mean Latency                     32,757                       52
99% observations less than       2,097,152                    128
99.99% observations less than    4,194,304                    8,192
Max Latency                      5,069,086                    175,567

Table 3 - Comparative Latency in three stage pipeline

 

7     Conclusion

The Disruptor is a major step forward for increasing throughput, reducing latency between concurrent execution contexts and ensuring predictable latency, an important consideration in many applications.  Our testing shows that it out-performs comparable approaches for exchanging data between threads.  We believe that this is the highest performance mechanism for such data exchange.  By concentrating on a clean separation of the concerns involved in cross-thread data exchange, by eliminating write contention, minimizing read contention and ensuring that the code works well with the caching employed by modern processors, we have created a highly efficient mechanism for exchanging data between threads in any application.

The batching effect that allows consumers to process entries up to a given threshold, without any contention, introduces a new characteristic in high performance systems.  For most systems, as load and contention increase there is an exponential increase in latency, the characteristic “J” curve.  As load increases on the Disruptor, latency remains almost flat until saturation of the memory sub-system occurs.

We believe that the Disruptor establishes a new benchmark for high-performance computing and is very well placed to continue to take advantage of current trends in processor and computer design.



[1] Staged Event Driven Architecture – http://www.eecs.harvard.edu/~mdw/proj/seda/

[2] Actor model – http://dspace.mit.edu/handle/1721.1/6952

[3] Java Memory Model – http://www.ibm.com/developerworks/library/j-jtp02244/index.html

[4] Phasers – http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166ydocs/jsr166y/Phaser.html

[5] Value Types – http://blogs.oracle.com/jrose/entry/tuples_in_the_vm

[6] Little’s Law – http://en.wikipedia.org/wiki/Little%27s_law

[7] ArrayBlockingQueue – http://download.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/ArrayBlockingQueue.html

[8] ConcurrentLinkedQueue – http://download.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html

 
