Systems Performance (性能之巅) - Chapter 1 - Introduction

This chapter covers the definition and characteristics of systems performance, the methodologies and techniques available, and gives two case studies: 1) why the disks are slow, and 2) how to run comparison tests after a software upgrade. It is a lively read that quickly gives you the full picture of performance work.

Chapter 1 Introduction
Computer performance is an exciting, varied, and challenging discipline. This chapter introduces you to the field of systems performance. The learning objectives of this chapter are:
--Understand systems performance, roles, activities, and challenges.
--Understand the difference between observability and experimental tools.
--Develop a basic understanding of performance observability: statistics, profiling, flame graphs, tracing, static instrumentation, and dynamic instrumentation.
--Learn the role of methodologies and the Linux 60-second checklist.

References to later chapters are included so that this works as an introduction both to systems performance and to this book. This chapter finishes with case studies to show how systems performance works in practice.

1.1 Systems performance
Systems performance studies the performance of an entire computer system, including all major software and hardware components. Anything in the data path, from storage devices to application software, is included, because it can affect performance. For distributed systems this means multiple servers and applications. If you don't have a diagram of your environment showing the data path, find one or draw it yourself; this will help you understand the relationships between components and ensure that you don't overlook entire areas.

The typical goals of systems performance are to improve the end-user experience by reducing latency and to reduce computing cost. Reducing cost can be achieved by eliminating inefficiencies, improving system throughput, and general tuning.

Figure 1.1 shows a generic system software stack on a single server, including the operating system (OS) kernel, with example database and application tiers. The term full stack is sometimes used to describe only the application environment, including databases, applications, and web servers. When speaking of systems performance, however, we use full stack to mean the entire software stack from the application down to metal (the hardware), including system libraries, the kernel, and the hardware itself. Systems performance studies the full stack.

Compilers are included in Figure 1.1 because they play a role in systems performance. This stack is discussed in Chapter 3, Operating Systems, and investigated in more detail in later chapters. The following sections describe systems performance in more detail.

1.2 Roles
Systems performance is done by a variety of job roles, including system administrators, site reliability engineers, application developers, network engineers, database administrators, web administrators, and other support staff. For many of these roles, performance is only part of the job, and performance analysis focuses on that role's area of responsibility: the network team checks the network, the database team checks the database, and so on. For some performance issues, finding the root cause or contributing factors requires a cooperative effort from more than one team.

Some companies employ performance engineers, whose primary activity is performance. They can work with multiple teams to perform a holistic study of the environment, an approach that may be vital in resolving complex performance issues. They can also act as a central resource to find and develop better tooling for performance analysis and capacity planning across the whole environment.

For example, Netflix has a cloud performance team, of which I am a member. We assist the microservice and SRE teams with performance analysis and build performance tools for everyone to use.

Companies that hire multiple performance engineers can allow individuals to specialize in one or more areas, providing deeper levels of support. For example, a large performance engineering team may include specialists in kernel performance, client performance, language performance (e.g., Java), runtime performance (e.g., the JVM), performance tooling, and more.

1.3 Activities
Systems performance involves a variety of activities. The following is a list of activities that are also ideal steps in the life cycle of a software project, from conception through development to production deployment. Methodologies and tools to help perform these activities are covered in this book.

1.Setting performance objectives and performance modeling for a future product.
2.Performance characterization of prototype software and hardware.
3.Performance analysis of in-development products in a test environment.
4.Non-regression testing for new product versions.
5.Benchmarking product releases.
6.Proof-of-concept testing in the target production environment.
7.Performance tuning in production.
8.Monitoring of running production software.
9.Performance analysis of production issues.
10.Incident review for production issues.
11.Performance tool development to enhance production analysis.

Steps 1 to 5 comprise traditional product development, whether for a product sold to customers or a company-internal service. The product is then launched, perhaps first with proof-of-concept testing in the target environment (customer or local), or it may go straight to deployment and configuration. If an issue is encountered in the target environment (steps 6 to 9), it means that the issue was not detected or fixed during the development stages.

Performance engineering should ideally begin before any hardware is chosen or software is written: the first step should be to set objectives and create a performance model. However, products are often developed without this step, deferring performance engineering work to a later time, after a problem arises. With each step of the development process, it can become progressively harder to fix performance issues that arise due to architectural decisions made earlier.

Cloud computing provides new techniques for proof-of-concept testing (step 6) that encourage skipping the earlier steps (steps 1 to 5). One such technique is testing new software on a single instance with a fraction of the production workload: this is known as canary testing. Another technique makes this a normal step in software deployment: traffic is gradually moved to a new pool of instances while leaving the old pool online as a backup; this is known as blue-green deployment (Netflix uses the term red-black deployments).
With such safe-to-fail options available, new software is often tested without any prior performance analysis, and quickly reverted if need be. I recommend that, when practical, you also perform the earlier activities so that the best performance can be achieved (although there may be time-to-market reasons for moving to production sooner).

The term capacity planning can refer to a number of the preceding activities. During design, it includes studying the resource footprint of development software to see how well the design can meet the target needs. After deployment, it includes monitoring resource usage to predict problems before they occur.

The performance analysis of production issues (step 9) may also involve site reliability engineers (SREs); this step is followed by incident review meetings (step 10) to analyze what happened, share debugging techniques, and look for ways to avoid the same incident in the future. Such meetings are similar to developer retrospectives.

Environments and activities vary from company to company and product to product, and in many cases not all ten steps are performed. Your job may also focus on only some or just one of these activities.

1.4 Perspectives
Apart from a focus on different activities, performance roles can be viewed from different perspectives. Two perspectives for performance analysis are labeled in Figure 1.2: workload analysis and resource analysis, which approach the software stack from different directions.

The resource analysis perspective is commonly employed by system administrators, who are responsible for the system resources. Application developers, who are responsible for the delivered performance of the workload, commonly focus on the workload analysis perspective.
Each perspective has its own strengths, discussed in detail in Chapter 2, Methodologies. For challenging issues, it helps to try analyzing from both perspectives.

1.5 Performance is Challenging
Systems performance engineering is a challenging field for a number of reasons, including that it is subjective, it is complex, there may not be a single root cause, and it often involves multiple issues.

1.5.1 Subjectivity
Technology disciplines tend to be objective, so much so that people in the industry are known for seeing things in black and white.
This can be true of software troubleshooting, where a bug is either present or absent and is either fixed or not fixed. Such bugs often manifest as error messages that can be easily interpreted and understood to mean the presence of an error.

Performance, on the other hand, is often subjective. With performance issues, it can be unclear whether there is an issue to begin with, and if so, when it has been fixed. What may be considered "bad" performance for one user, and therefore an issue, may be considered "good" performance for another.

Consider the following information:
The average disk I/O response time is 1 ms.
Is this "good" or "bad"? While response time, or latency, is one of the best metrics available, interpreting latency information is difficult. To some degree, whether a given metric is "good" or "bad" may depend on the performance expectations of the application developers and end users.

Subjective performance can be made objective by defining clear goals, such as having a target average response time, or requiring a percentage of requests to fall within a certain latency range. Other ways to deal with this subjectivity are introduced in Chapter 2, Methodologies, including latency analysis.

1.5.2 Complexity
In addition to subjectivity, performance can be a challenging discipline due to the complexity of systems and the lack of an obvious starting point for analysis. In cloud computing environments you may not even know which server instance to look at first. Sometimes we begin with a hypothesis, such as blaming the network or a database, and the performance analyst must figure out if this is the right direction.

Performance issues may also originate from complex interactions between subsystems that perform well when analyzed in isolation. This can occur due to a cascading failure, when one failed component causes performance issues in others. To understand the resulting issue, you must untangle the relationships between components and understand how they contribute.

Bottlenecks can also be complex and related in unexpected ways; fixing one may simply move the bottleneck elsewhere in the system, with overall performance not improving as much as hoped.
Apart from the complexity of the system, performance issues may also be caused by a complex characteristic of the production workload. These cases may never be reproducible in a lab environment, or only intermittently so.

Solving complex performance issues often requires a holistic approach. The whole system -- both its internals and its external interactions -- may need to be investigated. This requires a wide range of skills, and can make performance engineering a varied and intellectually challenging line of work.

Different methodologies can be used to guide us through these complexities, as introduced in Chapter 2; Chapters 6 to 10 include specific methodologies for specific system resources: CPUs, memory, file systems, disks, and network. (The analysis of complex systems in general, including oil spills and the collapse of financial systems, has been studied by Dekker.)

In some cases, a performance issue can be caused by the interaction of these resources.

1.5.3 Multiple Causes
Some performance issues do not have a single root cause, but instead have multiple contributing factors. Imagine a scenario where three normal events occur simultaneously and combine to cause a performance issue: each is a normal event that in isolation is not the root cause.
Apart from multiple causes, there can also be multiple performance issues.

1.5.4 Multiple Performance Issues
Finding a performance issue is usually not a problem; in complex software there are often many. To illustrate this, try finding the bug database for your operating system or applications and search for the word performance. You might be surprised! Typically, there will be a number of performance issues that are known but not yet fixed, even in mature software that is considered to have high performance. This poses yet another difficulty when analyzing performance: the real task isn't finding an issue; it's identifying which issue or issues matter the most.

To do this, the performance analyst must quantify the magnitude of issues. Some performance issues may not apply to your workload, or may apply only to a very small degree. Ideally, you will not just quantify the issues but also estimate the potential speedup to be gained for each one. This information can be valuable when management looks for justification for spending engineering or operations resources.
A metric well suited to performance quantification, when available, is latency.

1.6 Latency
Latency is a measure of time spent waiting, and is an essential performance metric. Used broadly, it can mean the time for any operation to complete, such as an application request, a database query, a file system operation, and so forth. For example, latency can express the time for a website to load completely, from link click to screen paint. This is an important metric for both the customer and the website provider: high latency can cause frustration, and customers may take their business elsewhere.

As a metric, latency can allow maximum speedup to be estimated.
For example, Figure 1.3 depicts a database query that takes 100 ms (which is the latency), during which it spends 80 ms blocked waiting for disk reads. The maximum performance improvement from eliminating disk reads (e.g., by caching) can be calculated: from 100 ms to 20 ms (100 - 80) is five times (5x) faster. This is the estimated speedup, and the calculation has also quantified the performance issue: disk reads are causing the query to run up to 5x more slowly.
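
As a minimal shell sketch of this speedup arithmetic (not from the book; the values are the illustrative ones from the text):

# Estimate the maximum speedup from removing a blocking component.
query_ms=100      # total query latency
disk_ms=80        # time blocked on disk reads
remaining_ms=$((query_ms - disk_ms))                        # 20 ms of non-disk work
echo "estimated max speedup: $((query_ms / remaining_ms))x" # prints 5x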

Such a calculation is not possible when using other metrics. I/O operations per second (IOPS), for example, depend on the type of I/O and are often not directly comparable. If a change were to reduce the IOPS rate by 80%, it is difficult to know what the performance impact would be. There might be 5x fewer IOPS, but what if each of these I/O increased in size (bytes) by 10x?

Latency can also be ambiguous without qualifying terms. For example, in networking, latency can mean the time for a connection to be established but not the data transfer time; or it can mean the total duration of a connection, including the data transfer (e.g., DNS latency is commonly measured this way). Throughout this book I will use clarifying terms where possible: those examples would be better described as connection latency and request latency. Latency terminology is also summarized at the beginning of each chapter.

While latency is a useful metric, it hasn't always been available when and where needed. Some system areas provide average latency only; some provide no latency measurements at all. With the availability of new BPF-based observability tools, latency can now be measured from custom arbitrary points of interest and can provide data showing the full distribution of latency.

BPF is now a name and no longer an acronym (originally Berkeley Packet Filter).

1.7 Observability
Observability refers to understanding a system through observation, and classifies the tools that accomplish this. This includes tools that use counters, profiling, and tracing. It does not include benchmark tools, which modify the state of the system by performing a workload experiment. For production environments, observability tools should be tried first wherever possible, as experimental tools may perturb production workloads through resource contention. For test environments that are idle, you may wish to begin with benchmarking tools to determine hardware performance.

In this section I'll introduce counters, metrics, profiling, and tracing. I'll explain observability in more detail in Chapter 4, covering system-wide versus per-process observability, Linux observability tools, and their internals. Chapters 5 to 11 include chapter-specific sections on observability; for example, Section 6.6 for CPU observability tools.

1.7.1 Counters, Statistics, and Metrics
Applications and the kernel typically provide data on their state and activity: 
operation counts, byte counts, latency measurements, resource utilization, and error rates. They are typically implemented as integer variables called counters that are hard-coded in the software, some of which are cumulative and always increment. These cumulative counters can be read at different times by performance tools for calculating statistics: the rate of change over time, the average, percentiles, etc.
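
As a rough sketch (not from the book) of how a statistic is derived from a cumulative counter, the counter can be read twice and the difference taken over the interval:

# Rough sketch: derive a per-second rate from a cumulative kernel counter.
# Field 4 of /proc/diskstats is "reads completed"; summing every line is
# crude (it also counts partitions), but it illustrates the technique.
reads1=$(awk '{sum += $4} END {print sum}' /proc/diskstats)
sleep 1
reads2=$(awk '{sum += $4} END {print sum}' /proc/diskstats)
echo "disk reads completed per second: $((reads2 - reads1))"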

For example, the vmstat(8) utility prints a system-wide summary of virtual memory statistics and more, based on kernel counters in the /proc file system. This example vmstat(8) output is from a 48-CPU production API server:
$ vmstat 1 5
This shows a system-wide CPU utilization of around 57% (CPU us + sy columns).
The columns are explained in detail in Chapters 6 and 7.

A metric is a statistic that has been selected to evaluate or monitor a target. Most companies use monitoring agents to record selected statistics (metrics) at regular intervals, and chart them in a graphical interface to see changes over time. Monitoring software can also support creating custom alerts from these metrics, such as sending emails to notify staff when problems are detected.

This hierarchy from counters to alerts is depicted in Figure 1.4. Figure 1.4 is provided as a guide to help you understand these terms, but their use in the industry is not rigid. The terms counters, statistics, and metrics are often used interchangeably. Also, alerts may be generated by any layer, and not just a dedicated alerting system.

As an example of graphing metrics, Figure 1.5 is a screenshot of a Grafana-based tool observing the same server as the earlier vmstat(8) output.

These line graphs are useful for capacity planning, helping you predict when resources will become exhausted.

Your interpretation of performance statistics will improve with an understanding of how they are calculated. Statistics, including averages, distributions, modes, and outliers, are summarized in Chapter 2, Methodologies, Section 2.8, Statistics.

Sometimes, time-series metrics are all that is needed to resolve a performance issue. Knowing the exact time a problem began may correlate with a known software or configuration change, which can be reverted. Other times, metrics only point in a direction, suggesting that there is a CPU or disk issue, but without explaining why. Profiling or tracing tools are necessary to dig deeper and find the cause.

1.7.2 Profiling
In systems performance, the term profiling usually refers to the use of tools that perform sampling: taking a subset (a sample) of measurements to paint a coarse picture of the target. CPUs are a common profiling target. The commonly used method to profile CPUs involves taking timed-interval samples of the on-CPU code paths.
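
As a hedged example (profilers are covered in detail in later chapters), Linux perf(1) can take timed-interval samples of on-CPU code paths across all CPUs:

# Sample stack traces across all CPUs at 99 Hertz for 10 seconds,
# then summarize the sampled code paths.
perf record -F 99 -a -g -- sleep 10
perf report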

An effective visualization of CPU profiles is the flame graph. CPU flame graphs can help you find more performance wins than any other tool, after metrics. They reveal not only CPU issues, but other types of issues as well, found by the CPU footprints they leave behind. Issues of lock contention can be found by looking for CPU time in spin paths; memory issues can be analyzed by finding excessive CPU time in memory allocation functions (malloc()), along with the code paths that led to them; performance issues involving misconfigured networking may be discovered by seeing CPU time in slow or legacy code paths; and so on.
Figure 1.6 is an example CPU flame graph showing the CPU cycles spent by the iperf(1) network micro-benchmark tool.

This flame graph shows how much CPU time is spent copying bytes (the path that ends in copy_user_enhanced_fast_string()) versus TCP transmission (the tower on the left that includes tcp_write_xmit()). The widths are proportional to the CPU time spent, and the vertical axis shows the code path.
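
As a rough sketch of how such a graph can be produced (assuming the open source FlameGraph scripts from https://github.com/brendangregg/FlameGraph are available locally), a perf(1) profile can be rendered as an interactive SVG:

# Fold the sampled stacks and render them as a flame graph SVG.
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu_flamegraph.svg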

Profilers are explained in Chapters 4, 5, and 6, and the flame graph visualization is explained in Chapter 6, CPUs, Section 6.7.3, Flame Graphs.

1.7.3 Tracing
Tracing is event-based recording, where event data is captured and saved for later analysis or consumed on-the-fly for custom summaries and other actions. There are special-purpose tracing tools for system calls (e.g., Linux strace(1)) and network packets (e.g., Linux tcpdump(8)); and general-purpose tracing tools that can analyze the execution of all software and hardware events (e.g., Linux Ftrace, BCC, and bpftrace). These all-seeing tracers use a variety of event sources, in particular, static and dynamic instrumentation, and BPF for programmability.

Static Instrumentation
Static instrumentation describes hard-coded software instrumentation points added to the source code. There are hundreds of these points in the Linux kernel that instrument disk I/O, scheduler events, system calls, and more. The Linux technology for kernel static instrumentation is called tracepoints. There is also a static instrumentation technology for user-space software called user statically defined tracing (USDT). USDT is used by libraries (e.g., libc) for instrumenting library calls and by many applications for instrumenting service requests.
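
As a hedged illustration of using kernel tracepoints directly, a bpftrace one-liner (bpftrace is introduced in Chapter 15) can count system calls by process name via the raw_syscalls:sys_enter tracepoint:

# Count system calls by process name until Ctrl-C is pressed.
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'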

As an example tool that uses static instrumentation, execsnoop(8) prints new processes created while it is tracing (running) by instrumenting a tracepoint for the execve(2) system call. The following shows execsnoop(8) tracing an SSH login:
# execsnoop
PCOMM PID PPID RET ARGS
ssh  30656 20063 0 /usr/bin/ssh 0
sshd 30657 1401  0 /usr/bin/sshd -D -R
sh   30660 20657 0
......
This is especially useful for revealing short-lived processes that may be missed by other observability tools such as top(1). These short-lived processes can be a source of performance issues.

See Chapter 4 for more information about tracepoints and USDT probes.

Dynamic Instrumentation
Dynamic instrumentation creates instrumentation points after the software is running, by modifying in-memory instructions to insert instrumentation routines. This is similar to how debuggers can insert a breakpoint on any function in running software. Debuggers pass execution flow to an interactive debugger when the breakpoint is hit, whereas dynamic instrumentation runs a routine and then continues the target software. This capability allows custom performance statistics to be created from any running software. Issues that were previously impossible or prohibitively difficult to solve due to a lack of observability can now be fixed.
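
As a hedged example of dynamic instrumentation in practice, bpftrace can attach kprobes to the kernel's vfs_read() function and summarize its latency as a histogram, without any pre-existing instrumentation point:

# Histogram of vfs_read() latency in nanoseconds, using dynamic kprobes.
bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
    kretprobe:vfs_read /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'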

Dynamic instrumentation is so different from traditional observation that it can be difficult, at first, to grasp its role. Consider an operating system kernel: analyzing kernel internals can be like venturing into a dark room, with candles (system counters) placed where the kernel engineers thought they were needed. Dynamic instrumentation is like having a flashlight that you can point anywhere.

Dynamic instrumentation was first created in the 1990s, along with tools that use it called dynamic tracers. For Linux, dynamic instrumentation was first developed in 2000 and began merging into the kernel in 2004 (kprobes). However, these technologies were not well known and were difficult to use. This changed when Sun Microsystems launched their own version in 2005, DTrace, which was easy to use and production-safe. I developed many DTrace-based tools that showed how important it was for systems performance, tools that saw widespread use and helped make DTrace and dynamic instrumentation well known.

BPF
BPF, which originally stood for Berkeley Packet Filter, is powering the latest dynamic tracing tools for Linux. BPF originated as a mini in-kernel virtual machine for speeding up the execution of tcpdump(8) expressions. Since 2013 it has been extended (hence is sometimes called eBPF) to become a generic in-kernel execution environment, one that provides safety and fast access to resources. Among its many new uses are tracing tools, where it provides programmability for the BPF Compiler Collection (BCC) and bpftrace front ends. execsnoop(8), shown earlier, is a BCC tool.

Chapter 3 explains BPF, and Chapter 15 introduces the BPF tracing front ends: BCC and bpftrace. Other chapters introduce many BPF-based tracing tools in their observability sections; for example, CPU tracing tools are included in Chapter 6, CPUs, Section 6.6, Observability Tools. I have also published prior books on tracing tools (for DTrace and BPF).

1.8 Experimentation
Apart from observability tools, there are also experimentation tools. These perform an experiment by applying a synthetic workload to the system and measuring its performance. This must be done carefully, because experimental tools can perturb the performance of systems under test.

There are macro-benchmark tools that simulate a real-world workload such as clients making application requests; and there are micro-benchmark tools that test a specific component, such as CPUs, disks, or networks. As an analogy: a car's lap time at Laguna Seca Raceway could be considered a macro-benchmark, whereas its top speed and 0 to 60 mph time could be considered micro-benchmarks. Both benchmark types are important, although micro-benchmarks are typically easier to debug, repeat, and understand, and are more stable.

The following example uses iperf(1) on an idle server to perform a TCP network throughput micro-benchmark with a remote idle server. This benchmark runs for ten seconds (-t 10) and produces per-second averages (-i 1):
# iperf -c 100.65.33.90 -i 1 -t 10
......
The output shows a throughput of around 4.8 Gbits/sec for the first five seconds, which drops to around 3.2 Gbits/sec. This is an interesting result that shows bimodal throughput. To improve performance, one might focus on the 3.2 Gbits/sec mode, and search for other metrics that can explain it.

Consider the drawbacks of debugging this performance issue on a production server using observability tools alone. Network throughput can vary from second to second because of natural variance in the client workload, and the underlying bimodal behavior of the network might not be apparent. By using iperf(1) with a fixed workload, you eliminate client variance, revealing the variance due to other factors (e.g., external network throttling, buffer utilization, and so on).

As I recommended earlier, on production systems you should first try observability tools. However, there are so many observability tools that you might spend hours working through them when an experimental tool would lead to quicker results. An analogy taught to me by a senior performance engineer (Roch Bourbonnais) many years ago was this: you have two hands, observability and experimentation. Only using one type of tool is like trying to solve a problem one-handed.

Chapters 6 to 10 include sections on experimental tools; for example, CPU experimental tools are covered in Chapter 6, CPUs, Section 6.8, Experimentation.

1.9 CLOUD COMPUTING

Cloud computing, a way to deploy computing resources on demand, has enabled rapid scaling of applications by supporting their deployment across an increasing number of small virtual systems called instances. This has decreased the need for rigorous capacity planning, as more capacity can be added from the cloud at short notice. In some cases it has also increased the desire for performance analysis, because using fewer resources can mean fewer systems. Since cloud usage is typically charged by the minute or hour, a performance win resulting in fewer systems can mean immediate cost savings. Compare this scenario to an enterprise data center, where you may be locked into a fixed support contract for years, unable to realize cost savings until the contract has ended.

New difficulties caused by cloud computing and virtualization include the management of performance effects from other tenants (sometimes called performance isolation) and physical system observability from each tenant. For example, unless managed properly by the system, disk I/O performance may be poor due to contention with a neighbor. In some environments, the true usage of the physical disks may not be observable by each tenant, making identification of this issue difficult.

These topics are covered in Chapter 11, Cloud Computing.

1.10 METHODOLOGIES

Methodologies are a way to document the recommended steps for performing various tasks in systems performance. Without a methodology, a performance investigation can turn into a fishing expedition: trying random things in the hope of catching a win. This can be time-consuming and ineffective, while allowing important areas to be overlooked.
Chapter 2, Methodologies, includes a library of methodologies for systems performance. The following is the first I use for any performance issue: a tool-based checklist.

1.10.1 Linux Perf Analysis in 60 Seconds

This is a Linux tool-based checklist that can be executed in the first 60 seconds of a performance issue investigation, using traditional tools that should be available for most Linux distributions. Table 1.1 shows the commands, what to check for, and the section in this book that covers the command in more detail; the commands and checks are listed below, and a shell sketch that runs them follows the list.

1. uptime -- Load averages to identify if load is increasing or decreasing.
2. dmesg -T | tail -- Kernel errors, including OOM events.
3. vmstat -SM 1 -- System-wide statistics: run queue length, swapping, overall CPU usage.
4. mpstat -P ALL 1 -- Per-CPU balance: a single busy CPU can indicate poor thread scaling.
5. pidstat 1 -- Per-process CPU usage: identify unexpected CPU consumers and user/system CPU time for each process.
6. iostat -sxz 1 -- Disk I/O statistics: IOPS and throughput, average wait time, percent busy.
7. free -m -- Memory usage, including the file system cache.
8. sar -n DEV 1 -- Network device I/O: packets and throughput.
9. sar -n TCP,ETCP 1 -- TCP statistics: connection rates, retransmits.
10. top -- Check overview.
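
A minimal shell sketch of this checklist (assuming the sysstat package and the other utilities above are installed; the intervals and counts are illustrative):

#!/bin/bash
# Linux 60-second analysis: run each traditional tool briefly in turn.
uptime
dmesg -T | tail
vmstat -SM 1 5
mpstat -P ALL 1 5
pidstat 1 5
iostat -sxz 1 5
free -m
sar -n DEV 1 5
sar -n TCP,ETCP 1 5
top -b -n 1 | head -20     # batch mode for a one-shot overview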

This checklist can also be followed using a monitoring GUI, provided the same metrics are available.

Chapter 2, Methodologies, as well as later chapters, contain many more methodologies for performance analysis, including the USE method, workload characterization, latency analysis, and more.

1.11 CASE STUDIES

If you are new to systems performance, case studies showing when and why various activities are performed can help you relate them to your current environment. Two hypothetical examples are summarized here; one is a performance issue involving disk I/O, and one is performance testing of a software change.

These case studies describe activities that are explained in other chapters of this book. The approaches described here are also intended to show not the right way or the only way, but rather a way that these performance activities can be conducted, for your critical consideration.

1.11.1 Slow Disks

Sumit is a system administrator at a medium-size company. The database team has filed a support ticket complaining of "slow disks" on one of their database servers.

Sumit's first task is to learn more about the issue, gathering details to form a problem statement. The ticket claims that the disks are slow, but it doesn't explain whether this is causing a database issue or not. Sumit responds by asking these questions:
-Is there currently a database performance issue? How is it measured?
-How long has this issue been present?
-Has anything changed with the database recently?
-Why were the disks suspected?

The database team replies: "We have a log for queries slower than 1,000 milliseconds. These usually don't happen, but during the past week they have been growing to dozens per hour. AcmeMon showed that the disks were busy."

This confirms that there is a real database issue, but it also shows that the disk hypothesis is likely a guess. Sumit wants to check the disks, but he also wants to check other resources quickly in case that guess was wrong.

AcmeMon is the company's basic server monitoring system, providing historical performance graphs based on standard operating system metrics, the same metrics printed by mpstat(1), iostat(1), and system utilities. Sumit logs in to AcmeMon to see for himself.

Sumit begins with a methodology called the USE method (defined in Chapter 2, Methodologies, Section 2.5.9) to quickly check for resource bottlenecks. As the database team reported, utilization for the disks is high, around 80%, while for the other resources (CPU, network) utilization is much lower. The historical data shows the disk utilization has been steadily increasing during the past week, while CPU utilization has been steady. AcmeMon doesn't provide saturation or error statistics for the disks, so to complete the USE method Sumit must log in to the server and run some commands.

He checks disk error counters from /sys; they are zero. He runs iostat(1) with an interval of one second and watches utilization and saturation metrics over time. AcmeMon reported 80% utilization but uses a one-minute interval. At one-second granularity, Sumit can see that disk utilization fluctuates, often hitting 100% and causing levels of saturation and increased disk I/O latency.

To further confirm that this is blocking the database - and isn't asynchronous with respect to the database queries - he uses a BCC/BPF tracing tool called offcputime(8) to capture stack traces whenever the database was descheduled by the kernel, along with the time spent off-CPU. The stack traces show that the database is often blocking during a file system read, during a query. This is enough evidence for Sumit.

The next question is why. The disk performance statistics appear to be consistent with high load. Sumit performs workload characterization to understand this further, using iostat(1) to measure IOPS, throughput, average disk I/O latency, and the read/write ratio. For more details, Sumit can use disk I/O tracing; however, he is satisfied that this already points to a case of high disk load, and not a problem with the disks.

Sumit adds more details to the ticket, stating what he checked and including screenshots of the commands used to study the disks. His summary so far is that the disks are under high load, which increases I/O latency and is slowing the queries. However, the disks appear to be acting normally for the load. He asks if there is a simple explanation: did the database load increase?

The database team responds that it did not, and that the rate of queries (which isn't reported by AcmeMon) has been steady. This sounds consistent with an earlier finding, that CPU utilization was also steady.

Sumit thinks about what else could cause higher disk I/O load without a noticeable increase in CPU and has a quick talk with his colleagues about it. One of them suggests file system fragmentation, which is expected when the file system approaches 100% capacity. Sumit finds that it is only at 30%.

Sumit knows he can perform drill-down analysis to understand the exact causes of disk I/O, but this can be time-consuming. He tries to think of other easy explanations that he can check quickly first, based on his knowledge of the kernel I/O stack. He remembers that this disk I/O is largely caused by file system cache (page cache) misses.

Sumit checks the file system cache hit ratio using cachestat(8) and finds it is currently at 91%. This sounds high (good), but he has no historical data to compare it to. He logs in to other database servers that serve similar workloads and finds their cache hit ratio to be over 98%. He also finds that the file system cache size is much larger on the other servers.

Turning his attention to the file system cache size and server memory usage, he finds something that had been overlooked: a development project has a prototype application that is consuming a growing amount of memory, even though it isn't under production load yet. This memory is taken from what is available for the file system cache, reducing its hit rate and causing more file system reads to become disk reads.

Sumit contacts the application development team and asks them to shut down the application and move it to a different server, referring to the database issue. After they do this, Sumit watches disk utilization creep downward in AcmeMon as the file system cache recovers to its original size. The slow queries return to zero, and he closes the ticket as resolved.

1.11.2 Software Change

Pamela is a performance and scalability engineer at a small company where she works on all performance-related activities. The application developers have developed a new core feature and are unsure whether its introduction could hurt performance. Pamela decides to perform non-regression testing of the new application version, before it is deployed in production.

Pamela acquires an idle server for the purpose of testing and searches for a client workload simulator. The application team had written one a while ago, although it has various limitations and known bugs. She decides to try it but wants to confirm that it adequately resembles the current production workload.

She configures the server to match the current deployment configuration and runs the client workload simulator from a different system to the server. The client workload can be characterized by studying an access log, and there is already a company tool to do this, which she uses. She also runs the tool on a production server log for different times of day and compares workloads. It appears that the client simulator applies an average production workload but doesn't account for variance. She notes this and continues her analysis.

Pamela knows a number of approaches to use at this point. She picks the easiest: increasing load from the client simulator until a limit is reached (this is sometimes called stress testing). The client simulator can be configured to execute a target number of client requests per second, with a default of 1,000 that she had used earlier. She decides to increase load starting at 100 and adding increments of 100 until a limit is reached, each level being tested for one minute. She writes a shell script to perform the test, which collects results in a file for plotting by other tools; a rough sketch of such a script follows.
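
This is a sketch only: client_sim, its --rate and --duration options, and the output format are hypothetical stand-ins for whatever simulator is actually used.

#!/bin/bash
# Stress test sketch: step the offered load in increments of 100 requests/sec,
# run each level for one minute, and append the results for later plotting.
# client_sim is a hypothetical workload simulator, not a real tool.
out=results.txt
: > "$out"
for rate in $(seq 100 100 1000); do
    printf '%s ' "$rate" >> "$out"
    client_sim --rate "$rate" --duration 60 >> "$out"
done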

With the load running, she performs active benchmarking to determine what the limiting factors are. The server resources and server threads seem largely idle. The client simulator shows that the request throughput levels off at around 700 per second.

She switches to the new software version and repeats the test. This also reaches the 700 mark and levels off. She also analyzes the server to look for limiting factors but again cannot see any.

She plots the results, showing completed request rate versus load, to visually identify the scalability profile. Both appear to reach an abrupt ceiling.

While it appears that the software versions have similar performance characteristics, Pamela is disappointed that she wasn't able to identify the limiting factor causing the scalability ceiling. She knows she checked only server resources, and the limiter could instead be an application logic issue. It could also be elsewhere: the network or the client simulator.

Pamela wonders if a different approach may be needed, such as running a fixed rate of operations and then characterizing resource usage (CPU, disk I/O, network I/O), so that it can be expressed in terms of a single client request. She runs the simulator at a rate of 700 per second for the current and new software and measures resource consumption. The current software drove the 32 CPUs to an average of 20% utilization for the given load. The new software drove the same CPUs to 30% utilization, for the same load. It would appear that this is indeed a regression, one that consumes more CPU resources.

Curious to understand the 700 limit, Pamela launches a higher load and then investigates all components in the data path, including the network, the client system, and the client workload generator. She also performs drill-down analysis of the server and client software. She documents what she has checked, including screenshots, for reference.

To investigate the client software she performs thread state analysis and finds that it is single-threaded! That one thread is spending 100% of its time executing on-CPU. This convinces her that this is the limiter of the test.

As an experiment, she launches the client software in parallel on different client systems. In this way, she drives the server to 100% CPU utilization for both the current and new software. The current version reaches 3,500 requests/sec, and the new version 2,300 requests/sec, consistent with earlier findings of resource consumption.

Pamela informs the application developers that there is a regression with the new software version, and she begins to profile its CPU usage using a CPU flame graph to understand why: what code paths are contributing. She notes that an average production workload was tested and that varied workloads were not. She also files a bug to note that the client workload generator is single-threaded, which can become a bottleneck.

1.11.3 More Reading
A more detailed case study is provided as Chapter 16, Case Study, which documents how I resolved a particular cloud performance issue. The next chapter introduces the methodologies used for performance analysis, and the remaining chapters cover the necessary background and specifics.
