
Reliable, Scalable, and Maintainable Applications

Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications; bigger problems are usually the amount of data, the complexity of the data, and the speed at which it is changing.


A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

  • Store data so that they, or another application, can find it again later (databases)
  • Remember the result of an expensive operation, to speed up reads (caches)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)



If that sounds painfully obvious, that’s just because these data systems are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.


But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.


This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.


In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We’ll clarify what those things mean, outline some ways of thinking about them, and go over the basics we will need for later chapters.


Thinking about data systems

We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity (both store data for some time), they have very different access patterns, which means different performance characteristics, and thus very different implementations.


So why should we lump them all together under an umbrella term like data systems?


Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories. For example, there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.


Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of their data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently by a single tool, and those different tools are stitched together using application code.


For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database.


When you combine several tools in order to provide a service, the service’s interface or application programming interface (API) usually hides those implementation details from clients. Now you have essentially created a new, special-purpose data system from smaller, general-purpose components. Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients see consistent results. You are now not only an application developer, but also a data system designer.
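As a rough illustration of application code keeping a cache in sync with the main database, here is a minimal sketch. The class name `UserStore` and its two in-memory dictionaries are hypothetical stand-ins for a real database and a Memcached-style cache; the invalidate-on-write strategy shown is only one of several possible approaches.

```python
from typing import Dict, Optional


class UserStore:
    """Hypothetical composite data system: a main database plus a cache."""

    def __init__(self) -> None:
        self.database: Dict[str, str] = {}  # stands in for the main database
        self.cache: Dict[str, str] = {}     # stands in for Memcached or similar
        self.database_reads = 0             # counts reads that missed the cache

    def write(self, key: str, value: str) -> None:
        # Write to the database first, then invalidate the cached copy so
        # that later reads cannot return the stale value.
        self.database[key] = value
        self.cache.pop(key, None)

    def read(self, key: str) -> Optional[str]:
        # Serve from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: fall back to the database and populate the cache.
        self.database_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value
```

Clients only call `read` and `write`; the cache stays an implementation detail hidden behind that interface. In a real deployment the database write and the cache invalidation are separate network operations that can fail independently, which this sketch does not model.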


If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?


There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization’s tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.


In this book, we focus on three concerns that are important in most software systems:


Reliability

   The system should continue to work correctly even in the face of adversity (hardware or software faults, and even human error).


Scalability

   As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.


Maintainability

   Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.


These words are often cast around without a clear understanding of what they mean. In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability, and maintainability. Then, in the following chapters, we will look at various techniques, architectures, and algorithms that are used in order to achieve those goals.


Reliability

Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:

  • The application performs the function that the user expected.
  • It can tolerate the user making mistakes or using the software in unexpected ways.
  • Its performance is good enough for the required use case, under the expected load and data volume.
  • The system prevents unauthorized access and abuse.



If all those things together mean “working correctly,” then we can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.”


The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth were swallowed by a black hole, tolerance of that fault would require web hosting in space. So it only makes sense to talk about tolerating certain types of faults.


Note that a fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.


Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately, for example by randomly killing individual processes without warning. Many critical bugs are actually due to poor error handling; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally.
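The idea of deliberately inducing faults can be sketched in a few lines. Everything below is hypothetical: `flaky_worker` simulates a process that is killed at random with probability `kill_probability`, and `run_with_retries` is a stand-in for whatever fault-tolerance machinery a real system would use. It is a toy simulation, not a description of any particular tool.

```python
import random


class WorkerCrashed(Exception):
    """Raised when the simulated worker is killed mid-task."""


def flaky_worker(task: int, rng: random.Random, kill_probability: float) -> int:
    # Fault injection: with the given probability, "kill" the worker
    # before it can finish the task.
    if rng.random() < kill_probability:
        raise WorkerCrashed(f"worker killed while handling task {task}")
    return task * task  # the actual work: square the input


def run_with_retries(task: int, rng: random.Random,
                     kill_probability: float, max_attempts: int = 50) -> int:
    # Fault tolerance: a crashed attempt is retried on a "fresh" worker,
    # so an individual fault does not become a failure of the whole job.
    for _ in range(max_attempts):
        try:
            return flaky_worker(task, rng, kill_probability)
        except WorkerCrashed:
            continue
    raise RuntimeError(f"task {task} failed after {max_attempts} attempts")


rng = random.Random(0)  # fixed seed so the run is repeatable
results = [run_with_retries(task, rng, kill_probability=0.3) for task in range(10)]
```

Because faults are injected on every run, the retry path is exercised constantly instead of only during rare real incidents.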


Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.


Hardware faults

When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large data centers can tell you that these things happen all the time when you have a lot of machines.


Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
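The "one disk per day" figure can be checked with simple arithmetic. The sketch below assumes, as a simplification not stated in the text, that disks fail independently and at a constant rate, so a disk with an MTTF of N years fails with probability of roughly 1/(N × 365) on any given day. The function name is hypothetical.

```python
def expected_failures_per_day(disk_count: int, mttf_years: float) -> float:
    """Expected number of disk failures per day across the whole cluster."""
    days_to_failure = mttf_years * 365
    return disk_count / days_to_failure


# The two ends of the quoted MTTF range, for a 10,000-disk cluster.
low = expected_failures_per_day(10_000, mttf_years=50)   # roughly 0.55 per day
high = expected_failures_per_day(10_000, mttf_years=10)  # roughly 2.74 per day
```

Depending on where in the 10 to 50 year range the disks fall, the cluster loses between about one disk every two days and nearly three disks per day, so "one per day" is a reasonable order-of-magnitude estimate.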


Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and data centers may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.


Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability was absolutely essential.


However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services (AWS) it is fairly common for virtual machine instances to become unavailable without warning, as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.


Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade).
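A rolling upgrade can be sketched as a loop that takes one node out of service at a time. The `Node` class and `rolling_upgrade` function below are hypothetical and model only the bookkeeping; a real rollout would also drain traffic, run health checks, and stop if a patched node misbehaves.

```python
from typing import List


class Node:
    """Hypothetical cluster member with a software version and a status flag."""

    def __init__(self, name: str, version: str) -> None:
        self.name = name
        self.version = version
        self.in_service = True


def rolling_upgrade(nodes: List[Node], new_version: str) -> int:
    """Patch nodes one at a time; return the lowest in-service count observed."""
    lowest_in_service = len(nodes)
    for node in nodes:
        node.in_service = False  # take only this node out of rotation
        serving = sum(1 for n in nodes if n.in_service)
        lowest_in_service = min(lowest_in_service, serving)
        node.version = new_version  # apply the patch / reboot
        node.in_service = True      # return the node to rotation
    return lowest_in_service


cluster = [Node(f"node-{i}", "v1") for i in range(3)]
lowest = rolling_upgrade(cluster, "v2")
```

With three nodes, at least two keep serving at every step, which is why the system as a whole sees no downtime. A single-node "cluster" would drop to zero, matching the planned downtime of a single-server system.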


Software Errors

We usually think of hardware faults as being random and independent from each other: one machine’s disk failing does not imply that another machine’s disk is going to fail. There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.


Another class of fault is a systematic error within the system. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults. Examples include:


  • A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, which caused many applications to hang simultaneously due to a bug in the Linux kernel.
  • A runaway process that uses up some shared resource: CPU time, memory, disk space, or network bandwidth.
  • A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.
  • Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults.

The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment. While that assumption is usually true, it eventually stops being true for some reason.


There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found.
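The self-checking message queue mentioned above might look roughly like this. `CheckedQueue` is a hypothetical in-memory queue, and the invariant is stated in the form that holds while messages are still waiting: incoming equals outgoing plus the number currently queued.

```python
from collections import deque
from typing import Any, Deque


class CheckedQueue:
    """Hypothetical queue that counts messages so it can audit itself."""

    def __init__(self) -> None:
        self._items: Deque[Any] = deque()
        self.incoming = 0  # messages ever accepted
        self.outgoing = 0  # messages ever delivered

    def put(self, message: Any) -> None:
        self._items.append(message)
        self.incoming += 1

    def get(self) -> Any:
        message = self._items.popleft()
        self.outgoing += 1
        return message

    def check_invariant(self) -> bool:
        # Every accepted message is either delivered or still queued.
        # A False result means a message was lost or duplicated somewhere.
        return self.incoming == self.outgoing + len(self._items)


queue = CheckedQueue()
for message in ("a", "b", "c"):
    queue.put(message)
first = queue.get()
healthy = queue.check_invariant()
```

A production system would run such a check periodically and raise an alert on failure rather than return a boolean, but the principle is the same: the system verifies its own guarantee while it is running.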


Human Errors

Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10-25% of outages.


How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:

  • Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.
  • Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
  • Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.
  • Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).
  • Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening and for understanding failures.) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.
  • Implement good management practices and training. This is a complex and important aspect, and beyond the scope of this book.
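To make one of the recovery ideas above concrete, here is a minimal sketch of configuration that is fast to roll back. `VersionedConfig` and the setting names are hypothetical; the point is only that keeping every previous version around turns a rollback into a single cheap operation.

```python
from typing import Any, Dict, List


class VersionedConfig:
    """Hypothetical config store that keeps a history of every change."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self._history: List[Dict[str, Any]] = [dict(initial)]

    @property
    def current(self) -> Dict[str, Any]:
        # Return a copy so callers cannot mutate the stored history.
        return dict(self._history[-1])

    def apply(self, changes: Dict[str, Any]) -> None:
        # Each change creates a new version; nothing is overwritten in place.
        self._history.append({**self._history[-1], **changes})

    def roll_back(self) -> None:
        # Undo the most recent change, but never discard the initial version.
        if len(self._history) > 1:
            self._history.pop()


config = VersionedConfig({"max_connections": 100, "timeout_seconds": 30})
config.apply({"timeout_seconds": 0})  # an operator mistake
config.roll_back()                    # recovery is one call
```

The same pattern of never overwriting the last known-good state also underlies gradual code rollouts and recomputing derived data from retained inputs.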

How important is reliability?

Reliability is not just for nuclear power stations and air traffic control software; more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in terms of lost revenue and damage to reputation.


Even in “noncritical” applications we have a responsibility to our users. Consider a parent who stores all their pictures and videos of their children in your photo application. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?


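There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin), but we should be very conscious of when we are cutting corners.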







