Low-latency answers from discussion forums


http://programmers.stackexchange.com/questions/183723/low-latency-unix-linux

jimwise:

I've done a fair amount of work supporting HFT groups in IB and Hedge Fund settings. I'm going to answer from the sysadmin view, but some of this is applicable to programming in such environments as well.

There are a couple of things an employer is usually looking for when they refer to "Low Latency" support. Some of these are "raw speed" questions (do you know what type of 10g card to buy, and what slot to put it in?), but more of them are about the ways in which a High Frequency Trading environment differs from a traditional Unix environment. Some examples:

  • Unix is traditionally tuned to support running a large number of processes without starving any of them for resources, but in an HFT environment, you are likely to want to run one application with an absolute minimum of overhead for context switching, and so on. As a classic small example, turning on hyperthreading on an Intel CPU allows more processes to run at once -- but has a significant performance impact on the speed at which each individual process is executed. As a programmer, you're likewise going to have to look at the cost of abstractions like threading and RPC, and figure out where a more monolithic solution -- while less clean -- will avoid overhead.

  • TCP/IP is typically tuned to prevent connection drops and make efficient use of the bandwidth available. If your goal is to get the lowest latency possible out of a very fast link -- instead of to get the highest bandwidth possible out of a more constrained link -- you're going to want to adjust the tuning of the network stack. From the programming side, you're likewise going to want to look at the available socket options, and figure out which ones have defaults tuned more for bandwidth and reliability than for reducing latency (the first sketch after this list shows one such option, together with the CPU-pinning point above).

  • As with networking, so with storage -- you're going to want to know how to tell a storage performance problem from an application problem, and learn what patterns of I/O usage are least likely to interfere with your program's performance (as an example, learn where the complexity of using asynchronous I/O can pay off for you, and what the downsides are; the second sketch after this list gives a minimal illustration).

  • Finally, and more painfully: we Unix admins want as much information on the state of the environments we monitor as possible, so we like to run tools like SNMP agents, active monitoring tools like Nagios, and data gathering tools like sar(1). In an environment where context switches need to be absolutely minimized and use of disk and network IO tightly controlled, though, we have to find the right tradeoff between the expense of monitoring and the bare-metal performance of the boxes monitored. Similarly, what techniques are you using that make coding easier but are costing you performance?
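
A minimal sketch of the first two bullets in code (Linux-specific; the core
number and the lack of error handling are simplifications of mine, not part of
the original answer): pin the hot thread to a single core so the scheduler
never migrates it, and set TCP_NODELAY so small writes go out immediately
instead of being batched, Nagle-style, for bandwidth efficiency.

#define _GNU_SOURCE
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <pthread.h>
#include <sched.h>
#include <sys/socket.h>

int make_low_latency_socket() {
    // Pin the calling thread to core 2 (assumed to be isolated from the
    // general scheduler, e.g. via isolcpus) so it is never migrated.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // TCP_NODELAY disables Nagle batching: small writes (orders) are sent
    // immediately instead of waiting to fill a full segment.
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    return fd;  // error handling omitted for brevity
}

And a second sketch for the storage bullet, using POSIX AIO (the file path is
made up; link with -lrt on Linux): submit a read, then keep working instead of
blocking on the disk.

#include <aio.h>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

ssize_t read_ticks_async(char* buf, size_t len) {
    int fd = open("/var/data/ticks.bin", O_RDONLY);  // illustrative path
    aiocb cb{};
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    aio_read(&cb);                          // returns without blocking
    while (aio_error(&cb) == EINPROGRESS) {
        // do useful work here instead of stalling the hot path
    }
    ssize_t got = aio_return(&cb);          // bytes actually read
    close(fd);
    return got;
}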

Finally, there are other things that just come with time; tricks and details that you learn with experience. But these are more specialized (when do I use epoll? why do two models of HP server with theoretically identical PCIe controllers perform so differently?), more tied to whatever your specific shop is using, and more likely to change from one year to another.

JBRWilkinson:

In addition to the excellent hardware/setup tuning answer from @jimwise, "low latency linux" implies:

  • C++ for reasons of determinism (no surprise delay while GC kicks in), access to low-level facilities (I/O, signals), language power (full use of TMP and STL, type safety).
  • prefer speed over memory: >512 GB of RAM is common; databases are kept in memory, cached up front, or are exotic NoSQL products.
  • algorithm choice: as-fast-as-possible versus sane/understandable/extensible, e.g. lock-free, multiple bit arrays instead of array-of-objects-with-bool-properties.
  • full use of OS facilities such as Shared Memory between processes on different cores.
  • security: HFT software is usually co-located at a stock exchange, so the possibility of malware is unacceptable.

Many of these techniques have overlap with games development which is one reason why the financial software industry absorbs any recently-redundant games programmers (at least until they pay their rent arrears).

The underlying need is to be able to listen to a very high bandwidth stream of market data such as security (stocks, commodities, fx) prices and then make a very fast buy/sell/do-nothing decision based on the security, the price and current holdings.

Of course, this can all go spectacularly wrong, too.


So I'll elaborate on the bit arrays point. Let's say we have a High Frequency Trading system that operates on a long list of Orders (Buy 5k IBM, Sell 10k DELL, etc). Let's say we need to quickly determine if all of the orders are filled, so that we can move on to the next task. In traditional OO programming, this is going to look like:

#include <algorithm>
#include <vector>

class Order {
  bool _isFilled;
  ...
public:
  inline bool isFilled() const { return _isFilled; }
};

std::vector<Order> orders;
bool needToFillMore = std::any_of(orders.begin(), orders.end(),
  [](const Order & o) { return !o.isFilled(); });

The algorithmic complexity of this code is going to be O(N), as it is a linear scan. Let's take a look at the performance profile in terms of memory accesses: each iteration of the loop inside std::any_of() is going to call o.isFilled(), which is inlined, so it becomes a memory access of _isFilled, 1 byte (or 4, depending on your architecture, compiler and compiler settings) in an object of, let's say, 128 bytes total. So we're accessing 1 byte in every 128 bytes. When we read that 1 byte, presuming the worst case, we'll get a CPU data cache miss. This causes a read request to RAM that fetches an entire cache line (typically 64 bytes) just to read out 8 bits. So the memory access profile is proportional to N.

Compare this with:

#include <algorithm>
#include <climits>
#include <cstddef>

// One bit per order; assumes MAX_ORDERS is a multiple of the number
// of bits in an unsigned int (32 on typical platforms).
const size_t ELEMS = MAX_ORDERS / (sizeof(unsigned int) * CHAR_BIT);
unsigned int ordersFilled[ELEMS];

bool needToFillMore = std::any_of(ordersFilled, ordersFilled + ELEMS,
   [](unsigned int packedFilledOrders) { return packedFilledOrders != 0xFFFFFFFFu; });

The memory access profile of this, assuming the worst case again, is ELEMS divided by the width of a memory fetch (this varies -- dual-channel or triple-channel RAM differs): each cache line read now covers hundreds of orders instead of a fraction of one.

So, in effect, we're optimising algorithms for memory access patterns. No amount of RAM will help - it's the CPU data cache size that causes this need.

http://www.velocityreviews.com/forums/t706409-how-to-achieve-low-latency-in-c.html

Maxim Yegorushkin:

There are a few fundamental principles you should follow. Your server will 
exhibit low latency to the extent you adhere to these principles:

1) Avoid data copying. Prefer zero-copy algorithms. For example, when you 
send/receive data through a socket use a ring-buffer with wrapping iterators.
2) Avoid dynamic memory allocation on fast code paths, such as receiving market 
data and sending orders. Preallocate as much as possible. Use mapped memory.
3) Avoid lock contention. Use wait-free and lock-free algorithms if possible, 
share as little data as possible between threads.
4) Avoid context switching. Don't have more threads ready to run than hardware
CPUs. Use FIFO realtime process priority, so that your threads do not get
context-switched off while they still have data to process. (Sketches of these
points follow this list.)
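
A sketch of points 1-3 (the class name and memory-order choices are mine, not
Maxim's): a fixed-capacity single-producer/single-consumer ring buffer. The
storage is preallocated (2), the two indices wrap around the buffer (1), and
the only sharing between the two threads is a pair of atomics, so neither side
ever blocks the other (3).

#include <atomic>
#include <cstddef>

template <typename T, std::size_t Capacity>  // Capacity: power of two
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0, "power-of-two capacity");
public:
    bool push(const T& v) {                  // producer thread only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == Capacity)
            return false;                    // full
        buf_[h & (Capacity - 1)] = v;        // index wraps via mask
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                       // consumer thread only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t)
            return false;                    // empty
        out = buf_[t & (Capacity - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
private:
    T buf_[Capacity];                        // preallocated, never resized
    std::atomic<std::size_t> head_{0};       // written by producer
    std::atomic<std::size_t> tail_{0};       // written by consumer
};

And point 4 on Linux is a single call (needs root or CAP_SYS_NICE; the
priority value 80 is arbitrary):

#include <sched.h>

void make_realtime() {
    sched_param sp{};
    sp.sched_priority = 80;                  // 1..99 for SCHED_FIFO
    sched_setscheduler(0, SCHED_FIFO, &sp);  // 0 = calling process
}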

One of the top latency killers is formatting strings using things like
std::stringstream, or snprintf into a std::string. This is because they do a
fancy data copy (1), dynamic memory allocation (2), and lock a mutex (3) when
doing the memory allocation (unless some fancy multi-threaded allocator that
avoids locking a mutex is used).
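
For contrast, a sketch of a formatting path that triggers none of the three
(the function name and the idea of integer price ticks are illustrative;
std::to_chars is C++17):

#include <charconv>
#include <cstddef>
#include <system_error>

// Format an integer price into a caller-supplied stack buffer.
// std::to_chars allocates nothing and takes no locks. Returns the
// number of bytes written, or 0 on error.
std::size_t format_price(char* buf, std::size_t len, long price_ticks) {
    auto [ptr, ec] = std::to_chars(buf, buf + len, price_ticks);
    return ec == std::errc() ? static_cast<std::size_t>(ptr - buf) : 0;
}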


 Jorgen Grahn:

Two very late observations:

- "Low latency" is a useless requirement. Maybe it means "less than
ten seconds" in the trading world? Who knows. Don't optimize where
it's not needed. Adding threading, for example, adds a whole new
dimension of complexity to the code.

- We don't know what I/O mechanisms and which protocols are used. I
would concentrate on them first. For example, there are many ways
you can mess up TCP performance with a badly designed application
protocol or badly tuned software.

/Jorgen

https://www.quantnet.com/threads/how-do-we-minimize-latency-in-trasporting-data-from-exchanges-to-our-network-then-from-our-process.8038/

Siddharth Singh:

It is more about the machine than just the programming language. You colocate the servers. Additionally, you can use hardware-based services for market data. So it depends on how low a latency you want to achieve.

KaiRu:

On the language side, it's more about having the right number of threads. Avoid any locks or blocking calls.

There is no single thing like overall latency; there are tons of latencies to consider:
- number of hops on the route to the exchange - can you decrease it with an optimized routing table and specialized hardware?
- transfer protocol overhead - are you sure you need the extra packets for ACK/SYN messages?
- register latency, cache latency, memory access latency - can you process everything in fast registers?
- memory allocation latency - what if the heap is too fragmented to allocate an appropriate block? (a small pool sketch follows)
- etc.
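
On that last concrete point, a sketch of taking allocation off the hot path
(the class is illustrative, and single-threaded for brevity): a fixed pool
that hands out preallocated slots from a free list, so alloc/release are O(1),
never touch the general-purpose heap, and cannot suffer from fragmentation.

#include <cstddef>
#include <new>

template <typename T, std::size_t N>
class FixedPool {
    union Node {
        Node* next;                                   // while on the free list
        alignas(T) unsigned char storage[sizeof(T)];  // while in use
    };
public:
    FixedPool() {                      // thread the free list through all slots
        for (std::size_t i = 0; i + 1 < N; ++i)
            nodes_[i].next = &nodes_[i + 1];
        nodes_[N - 1].next = nullptr;
        free_ = &nodes_[0];
    }
    T* alloc() {
        if (!free_) return nullptr;    // pool exhausted
        Node* n = free_;
        free_ = n->next;
        return new (n) T();            // construct in place, no heap call
    }
    void release(T* p) {
        p->~T();
        Node* n = reinterpret_cast<Node*>(p);
        n->next = free_;               // push the slot back on the free list
        free_ = n;
    }
private:
    Node nodes_[N];
    Node* free_ = nullptr;
};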

