Real-world Concurrency, Part 3: Translation & Notes

This article discusses the core principles of writing concurrent programs. A close reading repaid the effort, so I am recording the lessons here as notes.

I. Translation

Continued from Part 2.

Illuminating the Black Art

What if you are the one developing the operating system or database or some other body of code that must be explicitly parallelized? If you count yourself among the relative few who need to write such code, you presumably do not need to be warned that writing multithreaded code is hard. In fact, this domain’s reputation for difficulty has led some to conclude (mistakenly) that writing multithreaded code is simply impossible: “No one knows how to organize and maintain large systems that rely on locking,” reads one recent (and typical) assertion.5 Part of the difficulty of writing scalable and correct multithreaded code is the scarcity of written wisdom from experienced practitioners: oral tradition in lieu of formal writing has left the domain shrouded in mystery. So in the spirit of making this domain less mysterious for our fellow practitioners (if not also to demonstrate that some of us actually do know how to organize and maintain large lock-based systems), we present our collective bag of tricks for writing multithreaded code.

Know your cold paths from your hot paths. If there is one piece of advice to dispense to those who must develop parallel systems, it is to know which paths through your code you want to be able to execute in parallel (the hot paths) versus which paths can execute sequentially without affecting performance (the cold paths). In our experience, much of the software we write is bone-cold in terms of concurrent execution: it is executed only when initializing, in administrative paths, when unloading, etc. Not only is it a waste of time to make such cold paths execute with a high degree of parallelism, but it is also dangerous: these paths are often among the most difficult and error-prone to parallelize.

In cold paths, keep the locking as coarse-grained as possible. Don’t hesitate to have one lock that covers a wide range of rare activity in your subsystem. Conversely, in hot paths—those that must execute concurrently to deliver highest throughput—you must be much more careful: locking strategies must be simple and fine-grained, and you must be careful to avoid activity that can become a bottleneck. And what if you don’t know if a given body of code will be the hot path in the system? In the absence of data, err on the side of assuming that your code is in a cold path and adopt a correspondingly coarse-grained locking strategy—but be prepared to be proven wrong by the data.
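
To make the coarse-grained advice concrete, here is a minimal C sketch under assumed names (the subsystem, its lock, and subsys_reconfigure() are hypothetical): a single mutex covers every rare administrative operation, leaving fine-grained locking for hot paths alone.

```c
#include <pthread.h>

/*
 * Hypothetical subsystem: one coarse lock guards ALL of its rare
 * activity (initialization, reconfiguration, unload).  Only hot
 * paths would merit their own fine-grained locks.
 */
static pthread_mutex_t admin_lock = PTHREAD_MUTEX_INITIALIZER;

void subsys_reconfigure(int new_limit)
{
        pthread_mutex_lock(&admin_lock);   /* coarse, simple, safe */
        /* ... rare, possibly slow administrative work ... */
        (void)new_limit;
        pthread_mutex_unlock(&admin_lock);
}
```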

Intuition is frequently wrong—be data intensive. In our experience, many scalability problems can be attributed to a hot path that the developing engineer originally believed (or hoped) to be a cold path. When cutting new software from whole cloth, you will need some intuition to reason about hot and cold paths—but once your software is functional, even in prototype form, the time for intuition has ended: your gut must defer to the data. Gathering data on a concurrent system is a tough problem in its own right. It requires you first to have a machine that is sufficiently concurrent in its execution to be able to highlight scalability problems. Once you have the physical resources, it requires you to put load on the system that resembles the load you expect to see when your system is deployed into production. Once the machine is loaded, you must have the infrastructure to be able to dynamically instrument the system to get to the root of any scalability problems.

The first of these problems has historically been acute: there was a time when multiprocessors were so rare that many software development shops simply didn’t have access to one. Fortunately, with the rise of multicore CPUs, this is no longer a problem: there is no longer any excuse for not being able to find at least a two-processor (dual-core) machine, and with only a little effort, most will be able (as of this writing) to run their code on an eight-processor (two-socket, quad-core) machine.

Even as the physical situation has improved, however, the second of these problems—knowing how to put load on the system—has worsened: production deployments have become increasingly complicated, with loads that are difficult and expensive to simulate in development. As much as possible, you must treat load generation and simulation as a first-class problem; the earlier you tackle this problem in your development, the earlier you will be able to get critical data that may have tremendous implications for your software. Although a test load should mimic its production equivalent as closely as possible, timeliness is more important than absolute accuracy: the absence of a perfect load simulation should not prevent you from simulating load altogether, as it is much better to put a multithreaded system under the wrong kind of load than under no load whatsoever.

Once a system is loaded—be it in development or in production—it is useless to software development if the impediments to its scalability can’t be understood. Understanding scalability inhibitors on a production system requires the ability to safely dynamically instrument its synchronization primitives. In developing Solaris, our need for this was so historically acute that it led one of us (Bonwick) to develop a technology (lockstat) to do this in 1997. This tool became instantly essential—we quickly came to wonder how we ever resolved scalability problems without it—and it led the other of us (Cantrill) to further generalize dynamic instrumentation into DTrace, a system for nearly arbitrary dynamic instrumentation of production systems that first shipped in Solaris in 2004, and has since been ported to many other systems including FreeBSD and Mac OS.6 (The instrumentation methodology in lockstat has been reimplemented to be a DTrace provider, and the tool itself has been reimplemented to be a DTrace consumer.)

Today, dynamic instrumentation continues to provide us with the data we need not only to find those parts of the system that are inhibiting scalability, but also to gather sufficient data to understand which techniques will be best suited for reducing that contention. Prototyping new locking strategies is expensive, and one’s intuition is frequently wrong; before breaking up a lock or rearchitecting a subsystem to make it more parallel, we always strive to have the data in hand indicating that the subsystem’s lack of parallelism is a clear inhibitor to system scalability!

Know when—and when not—to break up a lock. Global locks can naturally become scalability inhibitors, and when gathered data indicates a single hot lock, it is reasonable to want to break up the lock into per-CPU locks, a hash table of locks, per-structure locks, etc. This might ultimately be the right course of action, but before blindly proceeding down that (complicated) path, carefully examine the work done under the lock: breaking up a lock is not the only way to reduce contention, and contention can be (and often is) more easily reduced by decreasing the hold time of the lock. This can be done by algorithmic improvements (many scalability improvements have been achieved by reducing execution under the lock from quadratic time to linear time!) or by finding activity that is needlessly protected by the lock. Here’s a classic example of this latter case: if data indicates that you are spending time (say) deallocating elements from a shared data structure, you could dequeue and gather the data that needs to be freed with the lock held and defer the actual deallocation of the data until after the lock is dropped. Because the data has been removed from the shared data structure under the lock, there is no data race (other threads see the removal of the data as atomic), and lock hold time has been decreased with only a modest increase in implementation complexity.
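
A minimal sketch of that classic example in C (the list layout and names are my own, not the authors'): elements are unlinked while the lock is held, and the expensive free() calls run only after the lock is dropped.

```c
#include <pthread.h>
#include <stdlib.h>

struct elem {
        struct elem *next;
        /* ... payload ... */
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct elem *shared_list;

void purge_all(void)
{
        pthread_mutex_lock(&list_lock);
        struct elem *doomed = shared_list;   /* others see an atomic removal */
        shared_list = NULL;
        pthread_mutex_unlock(&list_lock);

        while (doomed != NULL) {   /* lock dropped: free() is off the hold time */
                struct elem *next = doomed->next;
                free(doomed);
                doomed = next;
        }
}
```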

Be wary of readers/writer locks. If there is a novice error when trying to break up a lock, it is this: seeing that a data structure is frequently accessed for reads and infrequently accessed for writes, one may be tempted to replace a mutex guarding the structure with a readers/writer lock to allow for concurrent readers. This seems reasonable, but unless the hold time for the lock is long, this solution will scale no better (and indeed, may scale worse) than having a single lock. Why? Because the state associated with the readers/writer lock must itself be updated atomically, and in the absence of a more sophisticated (and less space-efficient) synchronization primitive, a readers/writer lock will use a single word of memory to store the number of readers. Because the number of readers must be updated atomically, acquiring the lock as a reader requires the same bus transaction—a read-to-own—as acquiring a mutex, and contention on that line can hurt every bit as much.

There are still many situations where long hold times (e.g., performing I/O under a lock as reader) more than pay for any memory contention, but one should be sure to gather data to make sure that it is having the desired effect on scalability. Even in those situations where a readers/writer lock is appropriate, an additional note of caution is warranted around blocking semantics. If, for example, the lock implementation blocks new readers when a writer is blocked (a common paradigm to avoid writer starvation), one cannot recursively acquire a lock as reader: if a writer blocks between the initial acquisition as reader and the recursive acquisition as reader, deadlock will result when the recursive acquisition is blocked. All of this is not to say that readers/writer locks shouldn’t be used—just that they shouldn’t be romanticized.
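
A hedged sketch of the case where an rwlock does pay (names hypothetical): the hold time, dominated by I/O, dwarfs the cost of the atomic reader count. The comment flags the recursive-acquisition hazard described above.

```c
#include <pthread.h>

static pthread_rwlock_t cfg_lock = PTHREAD_RWLOCK_INITIALIZER;

void snapshot_config(void)
{
        pthread_rwlock_rdlock(&cfg_lock);
        /*
         * Long hold time: writing a consistent snapshot out to disk
         * costs far more than the read-to-own on the reader count.
         *
         * Do NOT call anything here that takes cfg_lock as reader
         * again: if a writer queues between the two acquisitions and
         * the implementation blocks new readers, we deadlock.
         */
        pthread_rwlock_unlock(&cfg_lock);
}
```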

Consider per-CPU locking. Per-CPU locking (that is, acquiring a lock based on the current CPU identifier) can be a convenient technique for diffracting contention, as a per-CPU lock is not likely to be contended (a CPU can run only one thread at a time). If one has short hold times and operating modes that have different coherence requirements, one can have threads acquire a per-CPU lock in the common (noncoherent) case, and then force the uncommon case to grab all the per-CPU locks to construct coherent state. Consider this concrete (if trivial) example: if one were implementing a global counter that is frequently updated but infrequently read, one could implement a per-CPU counter protected by its own lock. Updates to the counter would update only the per-CPU copy, and in the uncommon case in which one wanted to read the counter, all per-CPU locks could be acquired and their corresponding values summed.

Two notes on this technique: first, it should be employed only when the data indicates that it’s necessary, as it clearly introduces substantial complexity into the implementation; second, be sure to have a single order for acquiring all locks in the cold path: if one case acquires the per-CPU locks from lowest to highest and another acquires them from highest to lowest, deadlock will (naturally) result.
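
Here is that counter sketched in C for Linux (sched_getcpu() is a GNU extension; NCPU and the modulo are simplifying assumptions). The cold read path honors the single lock order the text demands, and a real implementation would also pad each slot to a cache line, as discussed under false sharing below.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NCPU 8                    /* assumption: fixed CPU count */

static struct {
        pthread_mutex_t lock;
        long            count;    /* real code: pad slot to a cache line */
} cnt[NCPU];

void counter_init(void)
{
        for (int c = 0; c < NCPU; c++)
                pthread_mutex_init(&cnt[c].lock, NULL);
}

void counter_add(long n)          /* hot path: one per-CPU lock */
{
        int c = (unsigned)sched_getcpu() % NCPU;  /* migration after this is
                                                     harmless: any slot works */
        pthread_mutex_lock(&cnt[c].lock);
        cnt[c].count += n;
        pthread_mutex_unlock(&cnt[c].lock);
}

long counter_read(void)           /* cold path: single, fixed lock order */
{
        long sum = 0;
        for (int c = 0; c < NCPU; c++)    /* always lowest to highest */
                pthread_mutex_lock(&cnt[c].lock);
        for (int c = 0; c < NCPU; c++)
                sum += cnt[c].count;
        for (int c = NCPU - 1; c >= 0; c--)
                pthread_mutex_unlock(&cnt[c].lock);
        return sum;
}
```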

Know when to broadcast—and when to signal. Virtually all condition variable implementations allow threads waiting on the variable to be awakened either via a signal (in which case one thread sleeping on the variable is awakened) or via a broadcast (in which case all threads sleeping on the variable are awakened). These constructs have subtly different semantics: because a broadcast will awaken all waiting threads, it should generally be used to indicate state change rather than resource availability. If a condition broadcast is used when a condition signal would have been more appropriate, the result will be a thundering herd: all waiting threads will wake up, fight over the lock protecting the condition variable, and (assuming that the first thread to acquire the lock also consumes the available resource) sleep once again when they discover that the resource has been consumed. This needless scheduling and locking activity can have a serious effect on performance, especially in Java-based systems, where notifyAll() (i.e., broadcast) seems to have entrenched itself as a preferred paradigm; changing these calls to notify() (i.e., signal) has been known to result in substantial performance gains.7
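
The distinction in C, with hypothetical queue state: one new item is resource availability, so exactly one consumer is signaled; shutdown is a state change that every waiter must observe, so it is broadcast.

```c
#include <pthread.h>

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcv   = PTHREAD_COND_INITIALIZER;
static int nitems;          /* resource availability */
static int shutting_down;   /* state change */

void produce_one(void)
{
        pthread_mutex_lock(&qlock);
        nitems++;
        pthread_cond_signal(&qcv);      /* one item: wake ONE consumer */
        pthread_mutex_unlock(&qlock);
}

void begin_shutdown(void)
{
        pthread_mutex_lock(&qlock);
        shutting_down = 1;
        pthread_cond_broadcast(&qcv);   /* state change: ALL must see it */
        pthread_mutex_unlock(&qlock);
}

int consume_one(void)                   /* returns 0 on shutdown */
{
        pthread_mutex_lock(&qlock);
        while (nitems == 0 && !shutting_down)
                pthread_cond_wait(&qcv, &qlock);
        int got = 0;
        if (nitems > 0) { nitems--; got = 1; }
        pthread_mutex_unlock(&qlock);
        return got;
}
```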

Learn to debug postmortem. Among some Cassandras of concurrency, a deadlock seems to be a particular bogeyman of sorts, having become the embodiment of all that is difficult in lock-based multithreaded programming. This fear is somewhat peculiar, because deadlocks are actually among the simplest pathologies in software: because (by definition) the threads involved in a deadlock cease to make forward progress, they do the implementer the service of effectively freezing the system with all state intact. To debug a deadlock, one need have only a list of threads, their corresponding stack backtraces, and some knowledge of the system. This information is contained in a snapshot of state so essential to software development that its very name reflects its origins at the dawn of computing: it is a core dump.

Debugging from a core dump—postmortem debugging—is an essential skill for those who implement parallel systems: problems in highly parallel systems are not necessarily reproducible, and a single core dump is often one’s only chance to debug them. Most debuggers support postmortem debugging, and many allow user-defined extensions.8 We encourage practitioners to understand their debugger’s support for postmortem debugging (especially of parallel programs) and to develop extensions specific to debugging their systems.

Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable: “Locks and condition variables do not support modular programming,” reads one typically brazen claim, “building large programs by gluing together smaller programs[:] locks make this impossible.”9 The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating systems into larger systems that remain entirely unaware of lower-level locking.

There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.

Second (and perhaps counterintuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be no global subsystem state—subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems—the only constraint is that manipulation of a single AVL tree instance must be serialized.
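
A sketch of what the second style looks like at the interface level (the API is hypothetical, loosely modeled on the AVL example): no global state, no internal locks, and the serialization contract stated at the boundary.

```c
/* tree.h: a hypothetical composable component, after the AVL example. */

struct tree;    /* all state is per instance; the module has no globals */

struct tree *tree_create(int (*cmp)(const void *a, const void *b));
void         tree_insert(struct tree *t, void *item);
void        *tree_find(const struct tree *t, const void *key);
void         tree_destroy(struct tree *t);

/*
 * Contract: callers serialize access to any single instance with
 * whatever lock suits THEIR subsystem.  Disjoint instances may be
 * used concurrently from disjoint subsystems with no interference.
 */
```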

Don’t use a semaphore where a mutex would suffice. A semaphore is a generic synchronization primitive originally described by Dijkstra that can be used to effect a wide range of behavior. It may be tempting to use semaphores in lieu of mutexes to protect critical sections, but there is an important difference between the two constructs: unlike a semaphore, a mutex has a notion of ownership—the lock is either owned or not, and if it is owned, it has a known owner. By contrast, a semaphore (and its kin, the condition variable) has no notion of ownership: when sleeping on a semaphore, one has no way of knowing which thread one is blocking upon.

The lack of ownership presents several problems when used to protect critical sections. First, there is no way of propagating the blocking thread’s scheduling priority to the thread that is in the critical section. This ability to propagate scheduling priority—priority inheritance—is critical in a realtime system, and in the absence of other protocols, semaphore-based systems will always be vulnerable to priority inversions. A second problem with the lack of ownership is that it deprives the system of the ability to make assertions about itself. For example, when ownership is tracked, the machinery that implements thread blocking can detect pathologies such as deadlocks and recursive lock acquisitions, inducing fatal failure (and that all-important core dump) upon detection. Finally, the lack of ownership makes debugging much more onerous. A common pathology in a multithreaded system is a lock not being dropped in some errant return path. When ownership is tracked, one at least has the smoking gun of the past (faulty) owner—and, thus, clues as to the code path by which the lock was not correctly dropped. Without ownership, one is left clueless and reduced to debugging by staring at code/the ceiling/into space.

All of this is not to say that semaphores shouldn’t be used (indeed, some problems are uniquely suited to a semaphore’s semantics), just that they shouldn’t be used when mutexes would suffice.
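
POSIX exposes a slice of the ownership-assertion ability described above: an error-checking mutex knows its owner, so a recursive acquisition is reported as EDEADLK rather than hanging silently. A minimal sketch:

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

int main(void)
{
        pthread_mutex_t m;
        pthread_mutexattr_t a;

        pthread_mutexattr_init(&a);
        pthread_mutexattr_settype(&a, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&m, &a);

        pthread_mutex_lock(&m);
        /* Ownership is tracked, so this is detected, not a silent hang: */
        int r = pthread_mutex_lock(&m);
        printf("recursive lock -> %s\n", r == EDEADLK ? "EDEADLK" : "?");

        pthread_mutex_unlock(&m);
        pthread_mutex_destroy(&m);
        pthread_mutexattr_destroy(&a);
        return 0;
}
```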

Consider memory retiring to implement per-chain hash-table locks. Hash tables are common data structures in performance-critical systems software, and sometimes they must be accessed in parallel. In this case, adding a lock to each hash chain, with the per-chain lock held while readers or writers iterate over the chain, seems straightforward. The problem, however, is resizing the table: dynamically resizing a hash table is central to its efficient operation, and the resize means changing the memory that contains the table. That is, in a resize the pointer to the hash table must change—but we do not wish to require hash lookups to acquire a global lock to determine the current hash table!

This problem has several solutions, but a (relatively) straightforward one is to retire memory associated with old hash tables instead of freeing it. On a resize, all per-chain locks are acquired (using a well-defined order to prevent deadlock), and a new table is then allocated, with the contents of the old hash table being rehashed into the new table. After this operation, the old table is not deallocated but rather placed in a queue of old hash tables. Hash lookups then require a slight modification to operate correctly: after acquiring the per-chain lock, the lookup must check the hash-table pointer and compare it with the hash-table pointer that was used to determine the hash chain. If the hash table has changed (that is, if a hash resize has occurred), it must drop the lock and repeat the lookup (which will acquire the correct chain lock in the new table).

There are some delicate issues in implementing this—the hash-table pointer must be declared volatile, and the size of the hash table must be contained in the table itself—but the implementation complexity is modest given the alternatives, and (assuming hash tables are doubled when they are resized) the cost in terms of memory is only a factor of two. For an example of this in production code, the reader is directed to the file descriptor locking in Solaris, the source code for which can be found by searching the Internet for “flist_grow.”
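
The retry logic described above, sketched in C (layout hypothetical; per the text, the table pointer is volatile and the size lives in the table itself). Because retired tables are queued rather than freed, a stale bucket lock is still valid memory to take and release.

```c
#include <pthread.h>
#include <stddef.h>

struct entry;                         /* caller-defined */

struct bucket {
        pthread_mutex_t lock;
        struct entry   *head;
};

struct htable {
        int            nbuckets;      /* size lives in the table itself */
        struct bucket *buckets;
};

static struct htable *volatile live;  /* swapped on resize; old tables are
                                         retired, never freed immediately.
                                         Assumes init published a table. */

struct entry *ht_lookup(unsigned hash)
{
        for (;;) {
                struct htable *t = live;
                struct bucket *b = &t->buckets[hash % (unsigned)t->nbuckets];

                pthread_mutex_lock(&b->lock);
                if (t != live) {                     /* resize raced us */
                        pthread_mutex_unlock(&b->lock);
                        continue;                    /* retry on new table */
                }
                /* ... walk b->head for the match; omitted ... */
                pthread_mutex_unlock(&b->lock);
                return NULL;                         /* or the entry found */
        }
}
```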

Be aware of false sharing. There are a variety of different protocols for keeping memory coherent in caching multiprocessor systems. Typically, these protocols dictate that only a single cache may have a given line of memory in a dirty state. If a different cache wishes to write to the dirty line, the new cache must first read-to-own the dirty line from the owning cache. The size of the line used for coherence (the coherence granularity) has an important ramification for parallel software: because only one cache may own a line at a given time, one wishes to avoid a situation where two (or more) small, disjoint data structures are both contained within a single line and accessed in parallel by disjoint caches. This situation—called false sharing—can induce suboptimal scalability in otherwise scalable software. This most frequently arises in practice when one attempts to defract contention with an array of locks: the size of a lock structure is typically no more than the size of a pointer or two and is usually quite a bit less than the coherence granularity (which is typically on the order of 64 bytes). Disjoint CPUs acquiring different locks can therefore potentially contend for the same cache line.

False sharing is excruciating to detect dynamically: it requires not only a bus analyzer, but also a way of translating from the physical addresses of the bus to the virtual addresses that make sense to software, and then from there to the actual structures that are inducing the false sharing. (This process is so arduous and error-prone that we have experimented—with some success—with static mechanisms to detect false sharing.10) Fortunately, false sharing is rarely the single greatest scalability inhibitor in a system, and it can be expected to be even less of an issue on a multicore system (where caches are more likely to be shared among CPUs). Nonetheless, this remains an issue that the practitioner should be aware of, especially when creating arrays that are designed to be accessed in parallel. (In this situation, array elements should be padded out to be a multiple of the coherence granularity.)
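
A sketch of the padding advice using C11 alignment (64 bytes is the assumed coherence granularity): each lock in the array now owns its own cache line.

```c
#include <pthread.h>

#define COHERENCE_GRANULARITY 64   /* assumed line size */

struct padded_lock {
        _Alignas(COHERENCE_GRANULARITY) pthread_mutex_t lock;
};  /* sizeof(struct padded_lock) rounds up to a multiple of 64 */

static struct padded_lock lock_array[16];   /* one line per lock: disjoint
                                               CPUs no longer false-share */
```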

Consider using nonblocking synchronization routines to monitor contention. Many synchronization primitives have different entry points to specify different behavior if the primitive is unavailable: the default entry point will typically block, whereas an alternative entry point will return an error code instead of blocking. This second variant has a number of uses, but a particularly interesting one is the monitoring of one’s own contention: when an attempt to acquire a synchronization primitive fails, the subsystem can know that there is contention. This can be especially useful if a subsystem has a way of dynamically reducing its contention. For example, the Solaris kernel memory allocator has per-CPU caches of memory buffers. When a CPU exhausts its per-CPU caches, it must obtain a new series of buffers from a global pool. Instead of simply acquiring a lock in this case, the code attempts to acquire the lock, incrementing a counter when this fails (and then acquiring the lock through the blocking entry point). If the counter reaches a predefined threshold, the size of the per-CPU caches is increased, thereby dynamically reducing contention.
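
The allocator trick in sketch form (the threshold, the counter, and grow_percpu_cache() are hypothetical stand-ins for the Solaris machinery):

```c
#include <pthread.h>

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int misses;                      /* approximate; races on this
                                           heuristic counter are benign */
#define MISS_THRESHOLD 100              /* hypothetical tuning knob */

extern void grow_percpu_cache(void);    /* hypothetical reaction */

void pool_refill(void)
{
        if (pthread_mutex_trylock(&pool_lock) != 0) {
                misses++;                        /* we KNOW we contended */
                pthread_mutex_lock(&pool_lock);  /* now block as usual */
        }
        /* ... take a new series of buffers from the global pool ... */
        if (misses > MISS_THRESHOLD) {
                misses = 0;
                grow_percpu_cache();    /* dynamically reduce contention */
        }
        pthread_mutex_unlock(&pool_lock);
}
```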

When reacquiring locks, consider using generation counts to detect state change. When lock ordering becomes complicated, at times one will need to drop one lock, acquire another, and then reacquire the first. This can be tricky, as state protected by the first lock may have changed during the time that the lock was dropped—and reverifying this state may be exhausting, inefficient, or even impossible. In these cases, consider associating a generation count with the data structure; when a change is made to the data structure, a generation count is bumped. The logic that drops and reacquires the lock must cache the generation before dropping the lock, and then check the generation upon reacquisition: if the counts are the same, the data structure is as it was when the lock was dropped and the logic may proceed; if the count is different, the state has changed and the logic may react accordingly (for example, by reattempting the larger operation).
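
In sketch form (fields and helpers hypothetical): the generation is cached under the lock, and compared on reacquisition.

```c
#include <pthread.h>
#include <stdint.h>

struct obj {
        pthread_mutex_t lock;
        uint64_t        gen;      /* bumped on every modification */
        /* ... protected state ... */
};

void obj_modify(struct obj *o)
{
        pthread_mutex_lock(&o->lock);
        /* ... mutate the protected state ... */
        o->gen++;
        pthread_mutex_unlock(&o->lock);
}

/* Called with o->lock held; returns 1 if state is unchanged after
 * dropping and reacquiring it, 0 if the caller must revalidate/retry. */
int obj_drop_work_reacquire(struct obj *o, void (*work)(void))
{
        uint64_t seen = o->gen;   /* cache generation before dropping */
        pthread_mutex_unlock(&o->lock);

        work();                   /* e.g., take another lock in order */

        pthread_mutex_lock(&o->lock);
        return o->gen == seen;
}
```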

Use wait- and lock-free structures only if you absolutely must. Over our careers, we have each implemented wait- and lock-free data structures in production code, but we did this only in contexts in which locks could not be acquired for reasons of correctness. Examples include the implementation of the locking system itself,11 the subsystems that span interrupt levels, and dynamic instrumentation facilities.12 These constrained contexts are the exception, not the rule; in normal contexts, wait- and lock-free data structures are to be avoided as their failure modes are brutal (livelock is much nastier to debug than deadlock), their effect on complexity and the maintenance burden is significant, and their benefit in terms of performance is usually nil.

Prepare for the thrill of victory—and the agony of defeat. Making a system scale can be a frustrating pursuit: the system will not scale until all impediments to scalability have been removed, but it is often impossible to know if the current impediment to scalability is the last one. Removing that last impediment is incredibly gratifying: with that change, throughput finally gushes through the system as if through an open sluice. Conversely, it can be disheartening to work on a complicated lock breakup only to discover that while it was the impediment to scalability, it was merely hiding another impediment, and removing it improves performance very little—or perhaps not at all. As discouraging as it may be, you must return to the system to gather data: does the system not scale because the impediment was misunderstood, or does it not scale because a new impediment has been encountered? If the latter is the case, you can take solace in knowing that your work is necessary—though not sufficient—to achieve scalability, and that the glory of one day flooding the system with throughput still awaits you.

The Concurrency Buffet

There is universal agreement that writing multithreaded code is difficult: although we have attempted to elucidate some of the lessons learned over the years, it nonetheless remains, in a word, hard. Some have become fixated on this difficulty, viewing the coming of multicore computing as cataclysmic for software. This fear is unfounded, for it ignores the fact that relatively few software engineers actually need to write multithreaded code: for most, concurrency can be achieved by standing on the shoulders of those subsystems that already are highly parallel in implementation. Those practitioners who are implementing a database or an operating system or a virtual machine will continue to need to sweat the details of writing multithreaded code, but for everyone else, the challenge is not how to implement those components but rather how best to use them to deliver a scalable system. While lunch might not be exactly free, it is practically all-you-can-eat—and the buffet is open!

More related articles:

Adam Morrison - Scaling Synchronization in Multicore Programs
Designing software for modern multicore processors poses a dilemma. Traditional software designs, in which threads manipulate shared data, have limited scalability because synchronization of updates to shared data serializes threads and limits parallelism. Alternative distributed software designs, in which threads do not share mutable data, eliminate synchronization and offer better scalability. But distributed designs make it challenging to implement features that shared data structures naturally provide, such as dynamic load balancing and strong consistency guarantees, and are simply not a good fit for every program. Often, however, the performance of shared mutable data structures is limited by the synchronization methods in use today, whether lock-based or lock-free.

Fabien Gaud, Baptiste Lepers, Justin Funston, Mohammad Dashti, Alexandra Fedorova, Vivien Quéma, Renaud Lachaize, Mark Roth - Challenges of Memory Management on Modern NUMA System
Modern server-class systems are typically built as several multicore chips put together in a single system. Each chip has a local DRAM (dynamic random-access memory) module; together they are referred to as a node. Nodes are connected via a high-speed interconnect, and the system is fully coherent. This means that, transparently to the programmer, a core can issue requests to its node’s local memory as well as to the memories of other nodes. The key distinction is that remote requests will take longer, because they are subject to longer wire delays and may have to jump several hops as they traverse the interconnect.

Spencer Rathbun - Parallel Processing with Promises
In today’s world, there are many reasons to write concurrent software. The desire to improve performance and increase throughput has led to many different asynchronous techniques. The techniques involved, however, are generally complex and the source of many subtle bugs, especially if they require shared mutable state. If shared state is not required, then these problems can be solved with a better abstraction called promises. These allow programmers to hook asynchronous function calls together, waiting for each to return success or failure before running the next appropriate function in the chain.

Davidlohr Bueso - Scalability Techniques for Practical Synchronization Primitives
In an ideal world, applications are expected to scale automatically when executed on increasingly larger systems. In practice, however, not only does this scaling not occur, but it is common to see performance actually worsen on those larger systems.

II. Notes

1. Why Concurrency

Concurrent execution improves system performance in three ways:

  1. Reduce latency (make a unit of work execute faster)

    Original: make a unit of work execute faster

    Notes: For tasks such as scientific computing or bulk data processing that decompose readily into smaller units (no shared resources or state dependencies between subtasks), the advantage of concurrent execution is enormous; MapReduce is the classic framework for this kind of concurrency.

  2. Hide latency (allow the system to continue doing work during a long-latency operation)

    Original: allow the system to continue doing work during a long-latency operation

    Notes: Most visible in I/O-bound systems. Also, executing slow operations concurrently is not the only way to hide latency: judicious use of asynchronous I/O interfaces or event-multiplexing facilities (select/poll/epoll) achieves the same effect; see the sketch after this list.

  3. Increase throughput (make the system able to perform more work)

    Original: make the system able to perform more work

    Notes: A natural consequence of the first two, again most visible in I/O-bound systems.
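
To make the event-multiplexing alternative concrete, here is a minimal Linux epoll sketch (error handling omitted; the descriptors are assumed to be read-ready sockets): one thread hides the latency of many slow descriptors at once.

```c
#include <sys/epoll.h>

/* One thread services many slow fds: latency is hidden without
 * dedicating a blocked thread to each descriptor. */
void event_loop(const int *fds, int nfds)
{
        int ep = epoll_create1(0);
        for (int i = 0; i < nfds; i++) {
                struct epoll_event ev = { .events = EPOLLIN };
                ev.data.fd = fds[i];
                epoll_ctl(ep, EPOLL_CTL_ADD, fds[i], &ev);
        }
        for (;;) {
                struct epoll_event ready;
                if (epoll_wait(ep, &ready, 1, -1) == 1) {
                        /* ... ready.data.fd is readable; handle it
                         * without blocking the other descriptors ... */
                }
        }
        /* not reached in this sketch; close(ep) on a real exit path */
}
```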

Note that using concurrency to improve system throughput does not require a multithreaded implementation.

Concretely: the parts of the system that share no state can be factored into independent modules that still run sequentially, with concurrency coming simply from running several instances at once; state that genuinely must be shared (data, say) can be pushed into a component designed expressly for concurrent execution over shared state, such as a database. This is concurrency achieved through architecture rather than through business code, and in my view it is the elegant approach to aim for when designing a system.

2. Tricks for Writing Multithreaded Code

Writing correct, efficient concurrent programs is a recognized hard problem. For developers who implement low-level systems, the article offers the following fifteen pieces of advice on writing multithreaded programs.

  1. Know your cold paths from your hot paths

    "Hot paths" are code segments executed so frequently that they may become the system's bottleneck (a module's core logic or loop bodies); likewise, "cold paths" are segments executed only a bounded number of times (initialization code that reads configuration at startup, for example).

    Hot paths may be parallelized as far as the workload demands; pursuing concurrency on cold paths not only wastes time on complicated implementation work but also makes those segments a magnet for bugs. The authors' advice is "In cold paths, keep the locking as coarse-grained as possible": on cold paths, take coarse-grained locks freely without fretting over performance; on hot paths, pay close attention to lock granularity.

    How do you judge how hot a code segment is? The authors suggest treating it as a cold path at first and optimizing it specifically once a performance bottleneck appears later. This agrees with the principle of not optimizing prematurely (Donald Knuth: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil").

  2. Intuition is frequently wrong - be data intensive
    Programmers' intuition about whether a code segment is hot or cold is often off. Judging hotness by intuition is workable during development, but once the software runs, even as a prototype, the data must do the talking. Concretely: run concurrency tests (load tests) against the runnable modules to collect runtime statistics that guide the optimization work that follows.

    Note: the DTrace mentioned in the article is a dynamic tracing framework that collects runtime data to help analyze program behavior.

  3. Know when - and when not - to break up a lock
    The ideal for keeping a concurrent program fast is to have no locks at all, or to break a lock into smaller ones (splitting the lock over an entire hash table into per-chain locks over each collision chain in the table, for example). Given the implementation cost, however, shrinking lock granularity is not always the best route to less contention between threads. In fact, contention can be cut dramatically by optimizing the critical section the lock protects, either by improving the algorithm (reducing its time complexity, say) or by moving code that needs no protection out of the lock's scope.

    The original text describes a classic example of moving work that has no business running under the lock (typically something slow) out of the protected region; for brevity it is not repeated here.

  4. Be wary of readers/writer locks
    A typical pitfall: when a read-mostly, write-rarely data structure is guarded by a mutex, many people will replace the mutex with an rwlock to allow concurrent readers. That is reasonable when each read/write holds the lock for a long time, but when reads/writes are brief, replacing the mutex with an rwlock does nothing for performance and can even make it worse.

    Why: per the analysis in the article "Spinlocks and Read-Write Locks" (which may be inaccessible in some regions), an rwlock internally maintains a reader reference count occupying one word of memory, and every rdlock acquisition must update that refcount atomically. When many readers request the rdlock concurrently, the bus traffic updating that single memory word grows so heavy that it ultimately becomes the performance bottleneck.

  5. Consider per-CPU locking
    The basic idea is to split a global lock into one lock per CPU, with two caveats:

    a. Consider the split only when performance data shows it is necessary (the implementation cost is high).

    b. Ensure the per-CPU locks are acquired/released in a single consistent order on the cold path; otherwise deadlock is inevitable.

  6. Know when to broadcast - and when to signal
    When several threads block on some variable waiting to be woken, weigh carefully whether to wake them by broadcast or by signal: broadcast wakes every waiting thread, so it is suitable only for notifying threads of a state change ("be used to indicate state change"), while signal is for notifying threads that a resource they are waiting for has become available ("be used to indicate resource availability"). Misusing broadcast where signal is the better fit produces the thundering herd problem, with a clearly negative effect on system performance.

  7. Learn to debug postmortem
    Concurrent programs deadlock with some frequency. Debugging a deadlock requires the list of threads at the time it occurred, each thread's stack backtrace, and some knowledge of operating-system internals; a process core dump supplies all of this. Mastering core-dump debugging is therefore a required skill.

  8. Design your systems to be composable
    (Background on the term "composability" can be found in standard references.) This rule asks that the subsystems making up the whole system be composable.

    The article gives two design principles for keeping a lock-based system composable:

    a. Locking must be fully encapsulated inside the subsystem and never exposed to other subsystems;

    b. Eliminate global subsystem state, and have the subsystem's consumers guarantee that they never access a given instance in parallel; composability is then achieved in a lock-free way. The article explains the second principle using the AVL tree implementation in the Solaris kernel.

  9. Don't use a semaphore where a mutex would suffice
    Semaphores and mutexes are both common thread-synchronization mechanisms, but from a debugging standpoint they differ in one important way: a mutex carries ownership information. A mutex instance is either held or not held, and when held, its owner is known. A semaphore (or a condition variable), by contrast, has no notion of ownership: a thread blocked waiting for a semaphore's state to change has no way to learn which thread will "consume" or "hold" the state change about to occur.

    For protecting a critical section, the missing ownership information causes several problems:

    a. The thread inside the critical section cannot learn the scheduling priority of the threads blocked by the synchronization. For a realtime system, this means threads with low-latency requirements cannot be scheduled preferentially.

    b. The system loses the ability to assess its own running state. For example, with ownership tracked, the machinery implementing thread blocking can detect abnormal operations (such as deadlock or recursive locking); semaphores make that tracking impossible.

    c. Debugging becomes harder. Suppose some error branch in the code returns without releasing the lock, and the program misbehaves as a result: if the system cannot track ownership, there is no on-scene backtrace with which to locate the bug, and one can only debug from scratch.

  10. Consider memory retiring to implement per-chain hash-table locks
    When a hash table must be accessed concurrently, an efficient approach is to guard each chain of the table with its own lock.

    Suppose the table must rehash dynamically, and we do not want to introduce a global table-wide lock just to guard the rehash; the per-chain locks then need special care when the table resizes. Concretely: to rehash, first acquire each chain's lock in a fixed order, then allocate the memory the new table requires, then fill the new table by rehashing the contents of the old one into it.

    Once the operation above completes, the old table's pointer is placed on a queue dedicated to retired table pointers rather than deallocated outright (this is what preserves performance). Implementing memory retiring involves a few more points of care (lookups need auxiliary logic beyond a traditional lookup, the hash-table pointer must be declared volatile, and so on), which are not repeated here; interested readers should consult the original article.

    This trick is applied in the file descriptor locking code in Solaris.

  11. Be aware of false sharing
    Most current CPU cache-coherence protocols are implemented in the write-invalidate style, so under certain access patterns false sharing occurs and degrades system performance (the impact can reach the 100x order of magnitude). When writing concurrent or multithreaded programs, take special care to avoid it.

    For the mechanics of false sharing and how to avoid it, two articles are strongly recommended: MSDN - .NET Matters: False Sharing, and Lockfree Algorithms: False-sharing.

  12. Consider using nonblocking synchronization routines to monitor contention
    Prefer synchronization calls that can fail instead of blocking, e.g., pthread_mutex_trylock() in place of pthread_mutex_lock(), so that the call does not block outright (blocking outright costs the programmer the chance to handle the contended case).

  13. When reacquiring locks, consider using generation counts to detect state change
    Consider a flow like: release lock1 => acquire lock2 => release lock2 => reacquire lock1. Clearly, by the time lock1 is reacquired, the data it protects may already have changed, and verifying that change directly is often mission impossible. Hence the generation count: add a gen_count field to the protected data structure and update it alongside every modification; record the latest gen_count when releasing the lock; on reacquiring, compare the locally recorded gen_count against the structure's current one, and it becomes trivial to tell whether the data was modified in between.

    In fact, the way much software decides whether to update by comparing a local version number against the server's latest version is a direct application of the same idea.

  14. Use wait- and lock-free structures only if you absolutely must
    In ordinary applications, the cost of implementing lock-free machinery correctly far outweighs the performance gain it delivers, and the resulting programs are hard to debug. Unless you are writing low-level operating-system code, there is generally no reason to chase lock-freedom for its own sake.

  15. Prepare for the thrill of victory - and the agony of defeat
    In short: writing correct, efficient, scalable concurrent programs is a serious challenge, so keep a level head and prepare for a long campaign. When the system keeps falling short of the expected concurrency or scalability, remember one principle: return to the system to gather data. Let the data speak, and avoid mistakes of pure empiricism.

Conclusion:
Despite the fifteen rules summarized above, writing concurrent programs remains a recognized hard problem, so much so that some regard the arrival of multicore machines as a looming disaster. The worry is unnecessary: writing high-quality concurrent programs is the job of a minority of engineers (operating-system or database developers, for example). Most of us can stand on the shoulders of giants and compose our systems from widely validated, high-quality concurrent components.

III. References

Original article: https://queue.acm.org/detail.cfm?id=1454462

A related CSDN article: http://t.csdnimg.cn/rCe5v
