High-Performance Server Architecture
The purpose of this document is to share some ideas that I've developed over the years about how to develop a certain kind of application for which the term "server" is only a weak approximation. More accurately, I'll be writing about a broad class of programs that are designed to handle very large numbers of discrete messages or requests per second. Network servers most commonly fit this definition, but not all programs that do are really servers in any sense of the word. For the sake of simplicity, though, and because "High-Performance Request-Handling Programs" is a really lousy title, we'll just say "server" and be done with it.
I will not be writing about "mildly parallel" applications, even though multitasking within a single program is now commonplace. The browser you're using to read this probably does some things in parallel, but such low levels of parallelism really don't introduce many interesting challenges. The interesting challenges occur when the request-handling infrastructure itself is the limiting factor on overall performance, so that improving the infrastructure actually improves performance. That's not often the case for a browser running on a gigahertz processor with a gigabyte of memory doing six simultaneous downloads over a DSL line. The focus here is not on applications that sip through a straw but on those that drink from a firehose, on the very edge of hardware capabilities where how you do it really does matter.
Some people will inevitably take issue with some of my comments and suggestions, or think they have an even better way. Fine. I'm not trying to be the Voice of God here; these are just methods that I've found to work for me, not only in terms of their effects on performance but also in terms of their effects on the difficulty of debugging or extending code later. Your mileage may vary. If something else works better for you that's great, but be warned that almost everything I suggest here exists as an alternative to something else that I tried once only to be disgusted or horrified by the results. Your pet idea might very well feature prominently in one of these stories, and innocent readers might be bored to death if you encourage me to start telling them. You wouldn't want to hurt them, would you?
The rest of this article is going to be centered around what I'll call the Four Horsemen of Poor Performance:
- Data copies
- Context switches
- Memory allocation
- Lock contention
There will also be a catch-all section at the end, but these are the biggest performance-killers. If you can handle most requests without copying data, without a context switch, without going through the memory allocator and without contending for locks, you'll have a server that performs well even if it gets some of the minor parts wrong.
This could be a very short section, for one very simple reason: most people have learned this lesson already. Everybody knows data copies are bad; it's obvious, right? Well, actually, it probably only seems obvious because you learned it very early in your computing career, and that only happened because somebody started putting out the word decades ago. I know that's true for me, but I digress. Nowadays it's covered in every school curriculum and in every informal how-to. Even the marketing types have figured out that "zero copy" is a good buzzword.
Despite the after-the-fact obviousness of copies being bad, though, there still seem to be nuances that people miss. The most important of these is that data copies are often hidden and disguised. Do you really know whether any code you call in drivers or libraries does data copies? It's probably more than you think. Guess what "Programmed I/O" on a PC refers to. An example of a copy that's disguised rather than hidden is a hash function, which has all the memory-access cost of a copy and also involves more computation. Once it's pointed out that hashing is effectively "copying plus" it seems obvious that it should be avoided, but I know at least one group of brilliant people who had to figure it out the hard way. If you really want to get rid of data copies, either because they really are hurting performance or because you want to put "zero-copy operation" on your hacker-conference slides, you'll need to track down a lot of things that really are data copies but don't advertise themselves as such.
The tried and true method for avoiding data copies is to use indirection, and pass buffer descriptors (or chains of buffer descriptors) around instead of mere buffer pointers. Each descriptor typically consists of the following:
- A pointer and length for the whole buffer.
- A pointer and length, or offset and length, for the part of the buffer that's actually filled.
- Forward and back pointers to other buffer descriptors in a list.
- A reference count.
Now, instead of copying a piece of data to make sure it stays in memory, code can simply increment a reference count on the appropriate buffer descriptor. This can work extremely well under some conditions, including the way that a typical network protocol stack operates, but it can also become a really big headache. Generally speaking, it's easy to add buffers at the beginning or end of a chain, to add references to whole buffers, and to deallocate a whole chain at once. Adding in the middle, deallocating piece by piece, or referring to partial buffers will each make life increasingly difficult. Trying to split or combine buffers will simply drive you insane.
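A minimal sketch of such a descriptor, with reference counting in place of copying, might look like this (the `BufDesc`/`buf_ref`/`buf_unref` names are mine, not from any particular stack):

```python
class BufDesc:
    """A buffer descriptor: the whole buffer, the filled part, chain
    links to neighboring descriptors, and a reference count."""
    def __init__(self, data, start=0, length=0):
        self.data = data        # the whole underlying buffer
        self.start = start      # offset where the filled part begins
        self.length = length    # how much of the buffer is filled
        self.prev = None        # chain links for a descriptor list
        self.next = None
        self.refcnt = 1

def buf_ref(desc):
    """Instead of copying the data to keep it alive, take a reference."""
    desc.refcnt += 1
    return desc

def buf_unref(desc):
    """Drop a reference; the last one out releases the storage."""
    desc.refcnt -= 1
    if desc.refcnt == 0:
        desc.data = None        # stands in for freeing the storage
```

Code that wants to hold onto the data calls `buf_ref` and later `buf_unref`; the bytes themselves never move.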
I don't actually recommend using this approach for everything, though. Why not? Because it gets to be a huge pain when you have to walk through descriptor chains every time you want to look at a header field. There really are worse things than data copies. I find that the best thing to do is to identify the large objects in a program, such as data blocks, make sure those get allocated separately as described above so that they don't need to be copied, and not sweat too much about the other stuff.
This brings me to my last point about data copies: don't go overboard avoiding them. I've seen way too much code that avoids data copies by doing something even worse, like forcing a context switch or breaking up a large I/O request. Data copies are expensive, and when you're looking for places to avoid redundant operations they're one of the first things you should look at, but there is a point of diminishing returns. Combing through code and then making it twice as complicated just to get rid of that last few data copies is usually a waste of time that could be better spent in other ways.
Whereas everyone thinks it's obvious that data copies are bad, I'm often surprised by how many people totally ignore the effect of context switches on performance. In my experience, context switches are actually behind more total "meltdowns" at high load than data copies; the system starts spending more time going from one thread to another than it actually spends within any thread doing useful work. The amazing thing is that, at one level, it's totally obvious what causes excessive context switching. The #1 cause of context switches is having more active threads than you have processors. As the ratio of active threads to processors increases, the number of context switches also increases - linearly if you're lucky, but often exponentially. This very simple fact explains why multi-threaded designs that have one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so it's (usually) less than or equal to the number of processors. One popular variant of this approach is to use only one thread, ever; while such an approach does avoid context thrashing, and avoids the need for locking as well, it is also incapable of achieving more than one processor's worth of total throughput and thus remains beneath contempt unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.
The first thing that a "thread-frugal" program has to do is figure out how it's going to make one thread handle multiple connections at once. This usually implies a front end that uses select/poll, asynchronous I/O, signals or completion ports, with an event-driven structure behind that. Many "religious wars" have been fought, and continue to be fought, over which of the various front-end APIs is best. Dan Kegel's C10K paper is a good resource in this area. Personally, I think all flavors of select/poll and signals are ugly hacks, and therefore favor either AIO or completion ports, but it actually doesn't matter that much. They all - except maybe select() - work reasonably well, and don't really do much to address the matter of what happens past the very outermost layer of your program's front end.
The simplest conceptual model of a multi-threaded event-driven server has a queue at its center; requests are read by one or more "listener" threads and put on queues, from which one or more "worker" threads will remove and process them. Conceptually, this is a good model, but all too often people actually implement their code this way. Why is this wrong? Because the #2 cause of context switches is transferring work from one thread to another. Some people even compound the error by requiring that the response to a request be sent by the original thread - guaranteeing not one but two context switches per request. It's very important to use a "symmetric" approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.
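The symmetric idea can be sketched in a few lines of Python using the standard `selectors` module. The `SymmetricServer` name and the one-event-at-a-time structure are simplifications of mine; the point is that whichever thread picks up an event also processes it, with no listener-to-worker handoff:

```python
import selectors
import socket
import threading

class SymmetricServer:
    """Any thread may act as the listener; whoever holds listener_lock
    polls, claims one event, releases the lock, and then processes that
    event itself - no work is transferred to another thread."""
    def __init__(self):
        self.sel = selectors.DefaultSelector()
        self.listener_lock = threading.Lock()
        self.results = []
        self.results_lock = threading.Lock()

    def register(self, sock):
        sock.setblocking(False)
        self.sel.register(sock, selectors.EVENT_READ)

    def run_one(self, timeout=1.0):
        # Phase 1: be the listener just long enough to claim one event.
        with self.listener_lock:
            events = self.sel.select(timeout)
            if not events:
                return False
            key, _ = events[0]
            self.sel.unregister(key.fileobj)  # claim it for this thread
        # Phase 2: the same thread becomes the worker - no context switch
        # is forced by a handoff between a listener and a worker thread.
        data = key.fileobj.recv(4096)
        with self.results_lock:
            self.results.append(data)
        return True
```

Each thread would simply call `run_one` in a loop, going from listener to worker and back without ever changing context.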
Usually, it's not possible to know how many threads will be active even one instant into the future. After all, requests can come in on any connection at any moment, or "background" threads dedicated to various maintenance tasks could pick that moment to wake up. If you don't know how many threads are active, how can you limit how many are active? In my experience, one of the most effective approaches is also one of the simplest: use an old-fashioned counting semaphore which each thread must hold whenever it's doing "real work". If the thread limit has already been reached then each listen-mode thread might incur one extra context switch as it wakes up and then blocks on the semaphore, but once all listen-mode threads have blocked in this way they'll stop contending for resources until one of the existing threads "retires", so the system impact is negligible. More importantly, this method handles maintenance threads - which sleep most of the time and therefore don't count against the active thread count - more gracefully than most alternatives.
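The semaphore trick is almost trivially small in code. Here is a sketch (the `ThreadGovernor` name is mine; in Python, `threading.BoundedSemaphore` plays the role of the counting semaphore):

```python
import threading

class ThreadGovernor:
    """An old-fashioned counting semaphore that every thread must hold
    while doing 'real work', limiting the number of simultaneously
    active threads to (typically) the processor count."""
    def __init__(self, limit):
        self.sem = threading.BoundedSemaphore(limit)

    def run(self, fn, *args):
        # A newly woken listener may block here once - one extra context
        # switch - but after that it stops contending for resources.
        with self.sem:
            return fn(*args)
```

Maintenance threads that sleep most of the time simply don't hold the semaphore while sleeping, so they never count against the limit.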
Once the processing of requests has been broken up into two stages (listener and worker) with multiple threads to service the stages, it's natural to break up the processing even further into more than two stages. In its simplest form, processing a request thus becomes a matter of invoking stages successively in one direction, and then in the other (for replies). However, things can get more complicated; a stage might represent a "fork" between two processing paths which involve different stages, or it might generate a reply (e.g. a cached value) itself without invoking further stages. Therefore, each stage needs to be able to specify "what should happen next" for a request. There are three possibilities, represented by return values from the stage's dispatch function:
- The request needs to be passed on to another stage (an ID or pointer in the return value).
- The request has been completed (a special "request done" return value).
- The request was blocked (a special "request blocked" return value). This is equivalent to the previous case, except that the request is not freed and will be continued later from another thread.
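The three-way dispatch contract can be written down very compactly. In this sketch (names `Disp` and `run_request` are mine), any return value that isn't one of the two sentinels names the next stage:

```python
from enum import Enum, auto

class Disp(Enum):
    DONE = auto()     # request completed; it can be freed
    BLOCKED = auto()  # parked, not freed; another thread resumes it later

def run_request(stages, req, first):
    """Drive one request through successive stages on the current
    thread.  Each stage's dispatch function returns either the ID of
    the next stage, or one of the DONE/BLOCKED sentinels."""
    nxt = first
    while True:
        nxt = stages[nxt](req)
        if nxt is Disp.DONE or nxt is Disp.BLOCKED:
            return nxt
```

Because the loop stays on one thread, chaining stages this way never forces a context switch between them.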
Note that, in this model, queuing of requests is done within stages, not between stages. This avoids the common silliness of constantly putting a request on a successor stage's queue, then immediately invoking that successor stage and dequeuing the request again; I call that lots of queue activity - and locking - for nothing.
If this idea of separating a complex task into multiple smaller communicating parts seems familiar, that's because it's actually very old. My approach has its roots in the Communicating Sequential Processes concept elucidated by C.A.R. Hoare in 1978, based in turn on ideas from Per Brinch Hansen and Matthew Conway going back to 1963 - before I was born! However, when Hoare coined the term CSP he meant "process" in the abstract mathematical sense, and a CSP process need bear no relation to the operating-system entities of the same name. In my opinion, the common approach of implementing CSP via thread-like coroutines within a single OS thread gives the user all of the headaches of concurrency with none of the scalability.
A contemporary example of the staged-execution idea evolved in a saner direction is Matt Welsh's SEDA. In fact, SEDA is such a good example of "server architecture done right" that it's worth commenting on some of its specific characteristics (especially where those differ from what I've outlined above).
- SEDA's "batching" tends to emphasize processing multiple requests through a stage at once, while my approach tends to emphasize processing a single request through multiple stages at once.
- SEDA's one significant flaw, in my opinion, is that it allocates a separate thread pool to each stage with only "background" reallocation of threads between stages in response to load. As a result, the #1 and #2 causes of context switches noted above are still very much present.
- In the context of an academic research project, implementing SEDA in Java might make sense. In the real world, though, I think the choice can be characterized as unfortunate.
Allocating and freeing memory is one of the most common operations in many applications. Accordingly, many clever tricks have been developed to make general-purpose memory allocators more efficient. However, no amount of cleverness can make up for the fact that the very generality of such allocators inevitably makes them far less efficient than the alternatives in many cases. I therefore have three suggestions for how to avoid the system memory allocator altogether.
Suggestion #1 is simple preallocation. We all know that static allocation is bad when it imposes artificial limits on program functionality, but there are many other forms of preallocation that can be quite beneficial. Usually the reason comes down to the fact that one trip through the system memory allocator is better than several, even when some memory is "wasted" in the process. Thus, if it's possible to assert that no more than N items could ever be in use at once, preallocation at program startup might be a valid choice. Even when that's not the case, preallocating everything that a request handler might need right at the beginning might be preferable to allocating each piece as it's needed; aside from the possibility of allocating multiple items contiguously in one trip through the system allocator, this often greatly simplifies error-recovery code. If memory is very tight then preallocation might not be an option, but in all but the most extreme circumstances it generally turns out to be a net win.
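As a concrete illustration of the suggestion above, consider grabbing everything a request handler might need in one step (the `RequestCtx` type, the pool size of 64, and the field sizes are all arbitrary choices of mine):

```python
class RequestCtx:
    """Everything a request handler might need, preallocated together.
    One trip through the allocator instead of several, and error
    recovery has exactly one thing to release."""
    __slots__ = ("header", "body", "scratch")
    def __init__(self, body_size=4096):
        self.header = bytearray(128)
        self.body = bytearray(body_size)
        self.scratch = bytearray(256)

# If no more than N requests can ever be in flight at once,
# preallocate the whole pool at program startup.
POOL = [RequestCtx() for _ in range(64)]
```

Some memory is "wasted" when a request doesn't use every field, but the handler never has to back out of a half-finished series of allocations.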
Suggestion #2 is to use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently-freed objects onto a list instead of actually freeing them, in the hope that if they're needed again soon they need merely be taken off the list instead of being allocated from system memory. As an additional benefit, transitions to/from a lookaside list can often be implemented to skip complex object initialization/finalization.
It's generally undesirable to have lookaside lists grow without bound, never actually freeing anything even when your program is idle. Therefore, it's usually necessary to have some sort of periodic "sweeper" task to free inactive objects, but it would also be undesirable if the sweeper introduced undue locking complexity or contention. A good compromise is therefore a system in which a lookaside list actually consists of separately locked "old" and "new" lists. Allocation is done preferentially from the new list, then from the old list, and from the system only as a last resort; objects are always freed onto the new list. The sweeper thread operates as follows:
- Lock both lists.
- Save the head for the old list.
- Make the (previously) new list into the old list by assigning list heads.
- Free everything on the saved old list at leisure.
Objects in this sort of system are only actually freed when they have not been needed for at least one full sweeper interval, but always less than two. Most importantly, the sweeper does most of its work without holding any locks to contend with regular threads. In theory, the same approach can be generalized to more than two stages, but I have yet to find that useful.
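The two-list scheme described above fits in a few dozen lines. This is a sketch under the stated rules (allocate from "new", then "old", then the system; free onto "new"; the sweeper swaps lists while holding both locks and frees the doomed objects afterward, lock-free):

```python
import threading

class LookasideList:
    """A lookaside list split into separately locked 'new' and 'old'
    lists.  Objects are freed only after sitting unused for at least
    one full sweep interval, and the sweeper does the actual freeing
    without holding any lock that regular threads contend for."""
    def __init__(self, make):
        self.make = make
        self.new, self.old = [], []
        self.new_lock = threading.Lock()
        self.old_lock = threading.Lock()

    def get(self):
        with self.new_lock:
            if self.new:
                return self.new.pop()
        with self.old_lock:
            if self.old:
                return self.old.pop()
        return self.make()          # last resort: the system allocator

    def put(self, obj):
        with self.new_lock:
            self.new.append(obj)    # objects are always freed onto 'new'

    def sweep(self):
        with self.new_lock, self.old_lock:
            # Save the old list's contents, then make 'new' the new 'old'.
            doomed, self.old, self.new = self.old, self.new, []
        doomed.clear()              # free everything at leisure, unlocked
```

An object put back and swept once merely migrates to the old list and can still be reused; only a second sweep actually releases it.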
One concern with using lookaside lists is that the list pointers might increase object size. In my experience, most of the objects that I'd use lookaside lists for already contain list pointers anyway, so it's kind of a moot point. Even if the pointers were only needed for the lookaside lists, though, the savings in terms of avoided trips through the system memory allocator (and object initialization) would more than make up for the extra memory.
Suggestion #3 actually has to do with locking, which we haven't discussed yet, but I'll toss it in anyway. Lock contention is often the biggest cost in allocating memory, even when lookaside lists are in use. One solution is to maintain multiple private lookaside lists, such that there's absolutely no possibility of contention for any one list. For example, you could have a separate lookaside list for each thread. One list per processor can be even better, due to cache-warmth considerations, but only works if threads cannot be preempted. The private lookaside lists can even be combined with a shared list if necessary, to create a system with extremely low allocation overhead.
Efficient locking schemes are notoriously hard to design, because of what I call Scylla and Charybdis after the monsters in the Odyssey. Scylla is locking that's too simplistic and/or coarse-grained, serializing activities that can or should proceed in parallel and thus sacrificing performance and scalability; Charybdis is overly complex or fine-grained locking, with space for locks and time for lock operations again sapping performance. Near Scylla are shoals representing deadlock and livelock conditions; near Charybdis are shoals representing race conditions. In between, there's a narrow channel that represents locking which is both efficient and correct...or is there? Since locking tends to be deeply tied to program logic, it's often impossible to design a good locking scheme without fundamentally changing how the program works. This is why people hate locking, and try to rationalize their use of non-scalable single-threaded approaches.
Almost every locking scheme starts off as "one big lock around everything" and a vague hope that performance won't suck. When that hope is dashed, and it almost always is, the big lock is broken up into smaller ones and the prayer is repeated, and then the whole process is repeated, presumably until performance is adequate. Often, though, each iteration increases complexity and locking overhead by 20-50% in return for a 5-10% decrease in lock contention. With luck, the net result is still a modest increase in performance, but actual decreases are not uncommon. The designer is left scratching his head (I use "his" because I'm a guy myself; get over it). "I made the locks finer grained like all the textbooks said I should," he thinks, "so why did performance get worse?"
In my opinion, things got worse because the aforementioned approach is fundamentally misguided. Imagine the "solution space" as a mountain range, with high points representing good solutions and low points representing bad ones. The problem is that the "one big lock" starting point is almost always separated from the higher peaks by all manner of valleys, saddles, lesser peaks and dead ends. It's a classic hill-climbing problem; trying to get from such a starting point to the higher peaks only by taking small steps and never going downhill almost never works. What's needed is a fundamentally different way of approaching the peaks.
The first thing you have to do is form a mental map of your program's locking. This map has two axes:
- The vertical axis represents code. If you're using a staged architecture with non-branching stages, you probably already have a diagram showing these divisions, like the ones everybody uses for OSI-model network protocol stacks.
- The horizontal axis represents data. In every stage, each request should be assigned to a data set with its own resources separate from any other set.
You now have a grid, where each cell represents a particular data set in a particular processing stage. What's most important is the following rule: two requests should not be in contention unless they are in the same data set and the same processing stage. If you can manage that, you've already won half the battle.
Once you've defined the grid, every type of locking your program does can be plotted, and your next goal is to ensure that the resulting dots are as evenly distributed along both axes as possible. Unfortunately, this part is very application-specific. You have to think like a diamond-cutter, using your knowledge of what the program does to find the natural "cleavage lines" between stages and data sets. Sometimes they're obvious to start with. Sometimes they're harder to find, but seem more obvious in retrospect. Dividing code into stages is a complicated matter of program design, so there's not much I can offer there, but here are some suggestions for how to define data sets:
- If you have some sort of a block number or hash or transaction ID associated with requests, you can rarely do better than to divide that value by the number of data sets.
- Sometimes, it's better to assign requests to data sets dynamically, based on which data set has the most resources available rather than some intrinsic property of the request. Think of it like multiple integer units in a modern CPU; those guys know a thing or two about making discrete requests flow through a system.
- It's often helpful to make sure that the data-set assignment is different for each stage, so that requests which would contend at one stage are guaranteed not to do so at another stage.
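The three suggestions above can be reduced to a couple of one-liners. In this sketch, mixing the stage number into the hash (my choice of perturbation) makes the per-stage assignments differ, and the dynamic variant just picks the least-loaded set:

```python
NSETS = 8  # number of data sets; an arbitrary choice for illustration

def dataset_static(txn_id, stage):
    """Static assignment: derive the data set from the transaction ID,
    perturbed per stage so requests that contend at one stage are not
    forced to contend at every stage."""
    return hash((txn_id, stage)) % NSETS

def dataset_dynamic(loads):
    """Dynamic assignment: route the request to whichever data set
    currently has the most resources available (lowest load)."""
    return min(range(len(loads)), key=loads.__getitem__)
```

The dynamic form is the software analogue of dispatching work to multiple integer units in a CPU: the assignment reflects availability, not any intrinsic property of the request.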
If you've divided your "locking space" both vertically and horizontally, and made sure that lock activity is spread evenly across the resulting cells, you can be reasonably sure that your locking is in good shape. There's one more step, though. Do you remember the "small steps" approach I derided a few paragraphs ago? It still has its place, because now you're at a good starting point instead of a terrible one. In metaphorical terms you're probably well up the slope on one of the mountain range's highest peaks, but not yet at the top of one. Now is the time to collect contention statistics and see what you need to do to improve, splitting stages and data sets in different ways and then collecting more statistics until you're satisfied. If you do all that, you're sure to have a fine view from the mountaintop.
As promised, I've covered the four biggest performance problems in server design. There are still some important issues that any particular server will need to address, though. Mostly, these come down to knowing your platform/environment:
- How does your storage subsystem perform with larger vs. smaller requests? With sequential vs. random? How well do read-ahead and write-behind work?
- How efficient is the network protocol you're using? Are there parameters or flags you can set to make it perform better? Are there facilities like TCP_CORK, MSG_PUSH, or the Nagle-toggling trick that you can use to avoid tiny messages?
- Does your system support scatter/gather I/O (e.g. readv/writev)? Using these can improve performance and also take much of the pain out of using buffer chains.
- What's your page size? What's your cache-line size? Is it worth it to align stuff on these boundaries? How expensive are system calls or context switches, relative to other things?
- Are your reader/writer lock primitives subject to starvation? Of whom? Do your events have "thundering herd" problems? Does your sleep/wakeup have the nasty (but very common) behavior that when X wakes Y a context switch to Y happens immediately even if X still has things to do?
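To give one concrete instance of the scatter/gather point above: a whole chain of buffers can go out in a single gather-write, which also removes the pain of flattening a buffer chain before sending it. A minimal POSIX-only sketch (`send_chain` is my name for it):

```python
import os

def send_chain(fd, chain):
    """Gather-write a whole buffer chain in one writev() system call.
    The kernel collects the pieces from their separate locations, so
    user space never copies them into a contiguous staging buffer."""
    return os.writev(fd, chain)   # POSIX only; returns bytes written
```

The equivalent on the read side is `os.readv`, which scatters incoming data directly into a list of preallocated buffers.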
I'm sure I could think of many more questions in this vein. I'm sure you could too. In any particular situation it might not be worthwhile to do anything about any one of these issues, but it's usually worth at least thinking about them. If you don't know the answers - many of which you will not find in the system documentation - find out. Write a test program or micro-benchmark to find the answers empirically; writing such code is a useful skill in and of itself anyway. If you're writing code to run on multiple platforms, many of these questions correlate with points where you should probably be abstracting functionality into per-platform libraries so you can realize a performance gain on that one platform that supports a particular feature.
The "know the answers" theory applies to your own code, too. Figure out what the important high-level operations in your code are, and time them under different conditions. This is not quite the same as traditional profiling; it's about measuring design elements, not actual implementations. Low-level optimization is generally the last resort of someone who screwed up the design.