Improving Linux kernel performance and scalability

Making way for Linux in the enterprise

Sandra Johnson, IBM Linux Technology Center
William Hartner (bhartner@us.ibm.com), IBM Linux Technology Center
William Brantley (Bill.Brantley@amd.com), Advanced Micro Devices

Summary:  The first step in improving Linux performance is quantifying it. But how exactly do you quantify performance for Linux or for comparable systems? In this article, members of the IBM Linux Technology Center share their expertise as they describe how they ran several benchmark tests on the Linux 2.4 and 2.5 kernels late last year.

Date:  01 Jan 2003
Level:  Introductory
Also available in:  Japanese

The Linux operating system is one of the most successful open source projects to date. Linux exhibits high reliability as a Web server operating system and holds significant market share in that segment. Web servers are typically low-end to midrange systems with up to 4-way symmetric multiprocessing (SMP); enterprise-level systems have more complex requirements, such as larger processor counts and I/O configurations and significant memory and bandwidth requirements. In order for Linux to be enterprise-ready and commercially viable in the SMP market, its SMP scalability, disk and network I/O performance, scheduler, and virtual memory manager must be improved relative to commercial UNIX systems.

The Linux Scalability Effort (LSE) (see Resources for a link) is an open source project that addresses these Linux kernel issues for enterprise-class machines, with 8-way scalability and beyond.

The IBM Linux Technology Center's (LTC) Linux Performance Team (see Resources for a link) actively participates in the LSE effort. Its objective is to make Linux better by improving Linux kernel performance, with special emphasis on SMP scalability.

This article describes the strategy and methodology used by the team for measuring, analyzing, and improving the performance and scalability of the Linux kernel, focusing on platform-independent issues. A suite of benchmarks is used to accomplish this task. The benchmarks provide coverage for a diverse set of workloads, including Web serving, database, and file serving. In addition, we show the various components of the kernel (the disk I/O subsystem, for example) that are stressed by each benchmark.

Analysis methodology

Here we discuss the analysis methodology we used to quantify Linux performance for SMP scalability. If you prefer, you can skip ahead to the Benchmarks section.

Our strategy for improving Linux performance and scalability includes running several industry-accepted and component-level benchmarks, selecting the appropriate hardware and software, developing benchmark run rules, setting performance and scalability targets, and measuring, analyzing, and improving performance and scalability. These processes are detailed in this section.

Performance is defined as raw throughput on a uniprocessor (UP) or SMP. We distinguish between SMP scalability (CPUs) and resource scalability (number of network connections, for example).

Hardware and software

The architecture used for the majority of this work is IA-32 (in other words, x86), from one to eight processors. We also study the issues associated with future use of non-uniform memory access (NUMA) IA-32 and NUMA IA-64 architectures. The selection of hardware typically aligns with the selection of the benchmark and the associated workload. The selection of software aligns with IBM's Linux middleware strategy and/or open source middleware. For example:

  • Database
    We use a query database benchmark, and the hardware is an 8-way SMP system with a large disk configuration. IBM DB2 for Linux is the database software used, and the SCSI controllers are IBM ServeRAID 4H. The database is targeted for 8-way SMP.
  • SMB file serving
    The benchmark is NetBench and the hardware is a 4-way SMP system with as many as 48 clients driving the SMP server. The middleware is Samba (open source). SMB file serving is targeted for 4-way SMP.
  • Web serving
    The benchmark is SPECweb99, and the hardware is an 8-way with a large memory configuration and as many as 32 clients. The benchmarking was conducted for research purposes only and was non-compliant (more on this in the Benchmarks section). The Web server is Apache, which is the basis for the IBM HTTP Server. We chose an 8-way in order to investigate scalability, and we chose Apache because it enables the measurement and analysis of Next Generation POSIX Threads (NGPT) (see Resources). In addition, it is open source and the most popular Web server.
  • Linux kernel version
    The level of the Linux kernel.org kernel (2.2.x, 2.4.x, or 2.5.x) used is benchmark dependent; this is discussed further in the Benchmarks section. The Linux distribution selected is Red Hat 7.1 or 7.2 in order to simplify our administration. Our focus is kernel performance, not the performance of the distribution: we replaced the Red Hat kernel with one from kernel.org along with the patches we evaluated.

Run rules

During benchmark setup, we developed run rules to detail how the benchmark is installed, configured, and run, and how results are to be interpreted. The run rules serve several purposes (an illustrative sketch follows the list below):

  • Define the metric that will be used to measure benchmark performance and scalability (for example, messages/sec).
  • Ensure that the benchmark results are suitable for measuring the performance and scalability of the workload and kernel components.
  • Provide a documented set of instructions that will allow others to repeat the performance tests.
  • Define the set of data that is collected so that performance and scalability of the System Under Test (SUT) can be analyzed to determine where bottlenecks exist.
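
As a purely illustrative aid (not part of the original methodology), the sketch below shows how such a run-rules record might be captured in code; every field name and value here is hypothetical.

```python
# Hypothetical run-rules record. The fields mirror the purposes listed
# above: the metric, the SUT description, repeatable setup steps, and
# the data to collect for bottleneck analysis.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RunRules:
    benchmark: str
    metric: str                  # e.g. "messages/sec"
    sut: str                     # hardware/software configuration
    setup_steps: List[str] = field(default_factory=list)
    data_to_collect: List[str] = field(default_factory=list)

volano_rules = RunRules(
    benchmark="VolanoMark (loopback)",
    metric="messages/sec",
    sut="8-way IA-32 SMP, kernel.org 2.5.x plus patches under test",
    setup_steps=["install JVM", "install VolanoChat server",
                 "start 10 rooms of 20 clients"],
    data_to_collect=["/proc/meminfo", "/proc/interrupts",
                     "kernprof profile", "lockmeter statistics"],
)
print(volano_rules.metric)
```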

Setting targets

Performance and scalability targets for a benchmark are associated with a specific SUT (hardware and software configuration). Setting performance and scalability targets requires the following:

  • Baseline measurements to determine the performance of the benchmark on the baseline kernel version. Baseline scalability is then calculated.
  • Initial performance analysis to determine a promising direction for performance gains (for example, a profile indicating the scheduler is very busy might suggest trying an O(1) scheduler).
  • Comparison of baseline results with similar published results (for example, find SPECweb99 publications on the same Web server on a similar 8-way from spec.org).

If external published results are not available, we attempt to use internal results. We also attempt to compare to other operating systems. Given the competitive data and our baseline, we select a performance target for UP and SMP machines.

Finally, a target may be predicated on getting a change in the application. For example, if we know that the way the application does asynchronous I/O is inefficient, then we may publish the performance target assuming the I/O method will be changed.

Tuning, measurement, and analysis

Before any measurements are made, both the hardware and software configurations are tuned. Tuning is an iterative cycle of measurement and adjustment: it involves measuring components of the system, such as CPU utilization and memory usage, and possibly adjusting system hardware parameters, system resource parameters, and middleware parameters. Tuning is one of the first steps of performance analysis. Without tuning, scaling results may be misleading; that is, they may not indicate kernel limitations but rather some other issue.

The benchmark runs are made according to the run rules so that both performance and scalability can be measured in terms of the defined performance metric. When calculating SMP scalability for a given machine, we had to choose between basing the metric on the performance of a UP kernel or on the performance of an SMP kernel with the number of processors set to 1 (1P). We decided to compute SMP scalability using UP measurements to more accurately reflect the SMP kernel performance improvements.
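
As a concrete illustration of that choice (our own example, with made-up throughput numbers), SMP scalability is then the N-way throughput divided by the UP-kernel throughput, and scaling efficiency divides that ratio by the processor count:

```python
def smp_scalability(smp_throughput: float, up_throughput: float) -> float:
    """Scalability relative to the uniprocessor (UP) kernel baseline."""
    return smp_throughput / up_throughput

def scaling_efficiency(smp_throughput: float, up_throughput: float, cpus: int) -> float:
    """Fraction of ideal linear scaling achieved on `cpus` processors."""
    return smp_scalability(smp_throughput, up_throughput) / cpus

# Made-up numbers for illustration: a UP kernel at 1,000 messages/sec
# and an 8-way SMP kernel at 5,600 messages/sec.
print(smp_scalability(5600.0, 1000.0))        # 5.6x
print(scaling_efficiency(5600.0, 1000.0, 8))  # 0.7, i.e. 70% of linear
```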

A baseline measurement is made using the previously determined version of the Linux kernel. For most benchmarks, both UP and SMP baseline measurements are made. For a few benchmarks, only the 8-way performance is measured, since collecting UP performance information would be prohibitively time-consuming. Most other benchmarks measure the amount of work completed in a specific time period, which takes no longer to measure on a UP than on an 8-way.

The first step required to analyze the performance and scalability of the SUT (System Under Test) is to understand the benchmark and the workload tested. Initial performance analysis is made against a tuned system. Sometimes the analysis uncovers the need for additional adjustments to the tuning parameters.

Analysis of the performance and scalability of the SUT requires a set of performance tools. Our strategy is to use Open Source community (OSC) tools whenever possible. This allows us to post analysis data to the OSC in order to illustrate performance and scalability bottlenecks. It also allows those in the OSC to replicate our results with the tool or to understand the results after experimenting with the tool on another application. If ad hoc performance tools are developed to gain a better understanding of a specific performance bottleneck, then the ad hoc performance tool is generally shared with the OSC. Ad hoc performance tools are usually simple tools that instrument a specific component of the Linux kernel. The performance tools we used include:

  • /proc file system
    meminfo, slabinfo, interrupts, network stats, I/O stats, etc. (a small sampling sketch appears after this list)
  • SGI's lockmeter
    For SMP lock analysis
  • SGI's kernel profiler (kernprof)
    Time-based profiling, performance counter-based profiling, annotated call graph (ACG) of kernel space only
  • IBM Trace Facility
    Single step (mtrace) and both time-based and performance counter-based profiling for both user and system space
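
The following is a minimal sketch, of our own construction, of how such /proc counters can be snapshotted around a benchmark run; the choice of files and the log format are illustrative only.

```python
# Minimal /proc snapshot helper: dump a few kernel counters before and
# after a benchmark run so the deltas can be compared.
import time

PROC_FILES = ["/proc/meminfo", "/proc/slabinfo", "/proc/interrupts", "/proc/net/dev"]

def snapshot(tag: str) -> None:
    """Write the current contents of selected /proc files to a log."""
    with open(f"procsnap-{tag}.log", "w") as log:
        log.write(f"# snapshot '{tag}' at {time.ctime()}\n")
        for path in PROC_FILES:
            try:
                with open(path) as f:
                    log.write(f"\n==== {path} ====\n{f.read()}")
            except OSError as err:   # some files may need extra privileges
                log.write(f"\n==== {path} ==== unreadable: {err}\n")

snapshot("before")
# ... run the benchmark here ...
snapshot("after")
```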

Ad hoc performance tools are developed to further understand a specific aspect of the system.

Examples are:

  • sstat
    Collects scheduler statistics
  • schedret
    Determines which kernel functions are blocking for investigation of idle time
  • acgparse
    Post-processes kernprof ACG
  • copy in/out instrumentation
    Determines alignment of buffers, size of copy, and CPU utilization of the copy in/out algorithm (a rough user-space sketch follows this list)
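
As a rough user-space stand-in for the kind of question the copy in/out instrumentation answers (the real tool instruments the kernel; the helper below, with names of our own invention, merely reports a buffer's size and cache-line alignment before it is used for a read):

```python
# Report how a user buffer is aligned before it is used for I/O.
# Poorly aligned buffers and odd copy sizes can make the kernel's
# copy in/out path more expensive.
import ctypes

CACHE_LINE = 64  # assumed cache-line size for this illustration

def describe_buffer(size):
    buf = ctypes.create_string_buffer(size)
    addr = ctypes.addressof(buf)
    print(f"buffer at 0x{addr:x}, size {size}, "
          f"offset into cache line: {addr % CACHE_LINE}")
    return buf

buf = describe_buffer(8192)
with open("/dev/zero", "rb", buffering=0) as f:
    f.readinto(buf)   # read directly into the inspected buffer
```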

Performance analysis data is then used to identify performance and scalability bottlenecks. A broad understanding of the SUT and a more specific understanding of certain Linux kernel components that are being stressed by the benchmark are required in order to understand where the performance bottlenecks exist. There must also be an understanding of the Linux kernel source code that is the cause of the bottleneck. In addition, we work very closely with the LTC Linux kernel development teams and the OSC (Open Source community) so that a patch can be developed to fix the bottleneck.

Exit strategy

An evaluation of Linux kernel performance may require several cycles of running the benchmarks, conducting an analysis of the results to identify performance and scalability bottlenecks, addressing any bottlenecks by integrating patches into the Linux kernel, and running the benchmark again. The patches can be obtained by finding existing patches in the OSC or by developing new patches as a performance team member, in close collaboration with the members of the Linux kernel development team or OSC. There is a set of criteria for determining when Linux is "good enough" and we end this process.

First, if we have met our targets and we do not have any outstanding Linux kernel issues to address for the specific benchmark that would significantly improve its performance, we assert that Linux is "good enough" and move on to other issues. Second, if we go through several cycles of performance analysis and still have outstanding bottlenecks, then we consider the tradeoffs between the development costs of continuing the process and the benefits of any additional performance gains. If the development costs are too high relative to any potential performance improvements, we discontinue our analysis and articulate the rationale appropriately.

In both cases, we then review all of the additional outstanding Linux kernel-related issues we want to address, make an assessment of appropriate benchmarks that may be used to address these kernel component issues, examine any data we may have on the issue, and make a decision to conduct an analysis of the kernel component (or collection of components) based upon this collective information.

Benchmarks

This section includes a description of the benchmarks used in our suite and the associated kernel components each one stresses. In addition, performance results and analysis are included for some of the benchmarks used by the Linux performance team.


Table 1. Linux kernel performance benchmarks
Linux kernel component          | Database query | VolanoMark | SPECweb99 Apache2 | NetBench | Netperf | LMBench | TioBench IOZone
--------------------------------|----------------|------------|-------------------|----------|---------|---------|----------------
Scheduler                       |                |     X      |         X         |    X     |         |         |
Disk I/O                        |       X        |            |                   |          |         |         |       X
Block I/O                       |       X        |            |                   |          |         |         |
Raw, Direct & Async I/O         |       X        |            |                   |          |         |         |
Filesystem (ext2 & journaling)  |                |            |         X         |    X     |         |    X    |       X
TCP/IP                          |                |     X      |         X         |    X     |    X    |    X    |
Ethernet driver                 |                |     X      |         X         |    X     |    X    |         |
Signals                         |                |     X      |                   |          |         |    X    |
Pipes                           |                |            |                   |          |         |    X    |
Sendfile                        |                |            |         X         |    X     |         |         |
pThreads                        |                |     X      |         X         |          |    X    |         |
Virtual memory                  |                |            |         X         |    X     |         |    X    |
SMP scalability                 |       X        |     X      |         X         |    X     |    X    |         |       X

Benchmark descriptions

The benchmarks used are selected based on a number of criteria: industry benchmarks that are reliable indicators of a complex workload, and component-level benchmarks that indicate specific kernel performance problems. Industry benchmarks are generally accepted by the industry to measure performance and scalability of a specific workload. These benchmarks often require a complex or expensive setup that is not available to most of the OSC (Open Source community). These complex setups are one of our contributions to the OSC. Examples include:

  • SPECweb99
    Representative of Web-serving performance
  • SPECsfs
    Representative of NFS performance
  • Database query
    Representative of database-query performance
  • NetBench
    Representative of SMB file-serving performance

Component-level benchmarks measure performance and scalability ofspecific Linux kernel components that are deemed critical to a widespectrum of workloads. Examples include:

  • Netperf3
    Measures performance of the network stack, including TCP, IP, and network device drivers
  • VolanoMark
    Measures performance of scheduler, signals, TCP send/receive, loopback
  • Block I/O Test
    Measures performance of VFS, raw and direct I/O, block device layer, SCSI layer, low-level SCSI/fibre device driver

Some benchmarks are commonly used by the OSC. They are preferred because the OSC already accepts the importance of the benchmark. Thus, it is easier to convince the OSC of performance and scalability bottlenecks illuminated by the benchmark. In addition, there are generally no licensing issues that prevent us from publishing raw data. The OSC can run these benchmarks because they are often simple to set up, and the hardware required is minimal. On the other hand, they often do not meet our requirements for enterprise systems. Examples include:

  • LMBench
    Used to measure performance of the Linux APIs
  • IOZone
    Used to measure native file system throughput
  • DBench
    Used to measure the file system component of NetBench
  • SMB Torture
    Used to measure SMB file-serving performance

There are many benchmark options available for our targeted workloads. We chose the ones listed above because they are best suited for our mission, given our resources. There are some important benchmarks we chose not to utilize. In addition, we have chosen not to run some benchmarks that are already under study by other performance teams within IBM (for example, the IBM Solution Technologies System Performance Team has found that SPECjbb on Linux is "good enough"). Presented in Table 1 are the benchmarks currently used by the Linux performance team and the targeted kernel components.

Benchmark results

Presented here are descriptions of three selected benchmarks used in our suite to quantify Linux kernel performance: database query, VolanoMark, and SPECweb99. For all three benchmarks, we used 8-way machines, as detailed in the figures presenting the benchmark results.


Figure 1. Database query benchmark results

Figure 1 shows the database query benchmark results. Also included is a description of the hardware and software configurations used. The figure graphically illustrates the progress we have made in achieving our target. Some of the issues we have addressed have resulted in improvements that include adding bounce buffer avoidance, ips, io_request_lock, readv, kiobuf, and O(1) scheduler kernel patches, as well as several DB2 optimizations.

The VolanoMark benchmark (see Resources) creates 10 chat rooms of 20 clients. Each room echoes the messages from one client to the other 19 clients in the room. This benchmark, not yet an open source benchmark, consists of the VolanoChat server and a second program that simulates the clients in the chat room. It is used to measure the raw server performance and network scalability performance. VolanoMark can be run in two modes: loopback and network. The loopback mode tests the raw server performance, and the network mode tests the network scalability performance. VolanoMark uses two parameters to control the size and number of chat rooms.

The VolanoMark benchmark creates client connections in groups of 20 and measures how long it takes for the server to take turns broadcasting all of the clients' messages to the group. At the end of the loopback test, it reports a score as the average number of messages transferred per second. In the network mode, the metric is the number of connections between the clients and the server. The Linux kernel components stressed by this benchmark include the scheduler, signals, and TCP/IP.
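
To make the loopback metric concrete, here is a small back-of-the-envelope model of our own (with made-up message counts, and the simplifying assumption that the score counts each delivered copy of a message):

```python
# Rough model of the VolanoMark loopback score: every message posted in a
# room is echoed to the other (clients_per_room - 1) members.
rooms = 10
clients_per_room = 20
messages_per_client = 100          # made-up value for illustration
elapsed_seconds = 25.0             # made-up run time

delivered = rooms * clients_per_room * messages_per_client * (clients_per_room - 1)
score = delivered / elapsed_seconds
print(f"{delivered} messages delivered -> {score:.0f} messages/sec")
```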


Figure 2. VolanoMark benchmark results; loopback mode

Presented in Figure 2 are the VolanoMark benchmark results for loopback mode. Also included is a description of the hardware and software configurations used and our target for this benchmark. We have established close collaboration with the members of the Linux kernel development team on moving forward to achieve this target. Some of the issues we have addressed that have resulted in improvements include adding the O(1) scheduler, SMP scalable timer, tunable priority preemption, and soft affinity kernel patches. As illustrated, we have exceeded our target for this benchmark; however, there are some outstanding Linux kernel component-related and Java-related issues we are addressing that we believe will further improve the performance of this benchmark.

Please note that the SPECweb99 benchmark work was conducted for research purposes only and was non-compliant, with the following deviations from the rules:

  1. It was run on hardware that does not meet the SPEC availability-to-the-public criteria. The machine was an engineering sample.
  2. access_log wasn't kept for full accounting. It was written, but deleted every 200 seconds.

This benchmark presents a demanding workload to a Web server. This workload requests 70% static pages and 30% simple dynamic pages. Sizes of the Web pages range from 102 to 921,000 bytes. The dynamic content models GIF advertisement rotation. There is no SSL content. SPECweb99 is relevant because Web serving, especially with Apache, is one of the most common uses of Linux servers. Apache is rich in functionality and is not designed for high performance. However, we chose Apache as the Web server for this benchmark because it currently hosts more Web sites than any other Web server on the Internet. SPECweb99 is the accepted standard benchmark for Web serving. SPECweb99 stresses the following kernel components: scheduler, TCP/IP, various threading models, sendfile, zero copy, and network drivers.
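
The snippet below is only a toy illustration of a 70/30 static/dynamic request mix over that size range; it is not the SPECweb99 file set or load generator, whose exact distributions are defined by SPEC's run rules.

```python
# Toy request-mix generator: 70% static pages and 30% simple dynamic
# pages, with static page sizes drawn from the 102..921,000 byte range
# mentioned in the benchmark description.
import random

def next_request():
    if random.random() < 0.70:
        return ("static", random.randint(102, 921_000))
    return ("dynamic", 0)   # dynamic content modeled separately (ad rotation)

random.seed(1)
sample = [next_request() for _ in range(1000)]
static_share = sum(kind == "static" for kind, _ in sample) / len(sample)
print(f"static share in sample: {static_share:.0%}")
```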


Figure 3. SPECweb99 benchmark results using the Apache Web server

Presented in Figure 3 are our results for SPECweb99. Also included is a description of the hardware and software configurations used and our benchmark target. We have a close collaboration with the Linux kernel development team and the IBM Apache team as we make progress on the performance of this benchmark. Some of the issues we have addressed that have resulted in the improvements shown include adding O(1) and read copy update (RCU) dcache kernel patches and adding a new dynamic API mod_specweb module to Apache. As shown in Figure 3, we have exceeded our target on this benchmark; however, there are several outstanding Linux kernel component-related issues we are addressing that we believe will significantly improve the performance of this benchmark.

Summary

Linux has enjoyed great popularity, specifically with low-end and midrange systems. In fact, Linux is well regarded as a stable, highly reliable operating system to use for Web servers on these machines. However, high-end, enterprise-level systems have access to gigabytes, petabytes, and exabytes of data. These systems require a different set of applications and solutions with high memory and bandwidth requirements, in addition to larger numbers of processors (see Resources for the developerWorks article "Open source in the biosciences," which discusses this type of application).

This type of system application introduces a unique set of issues that may be orders of magnitude more complex than those present in smaller installations. In order for Linux to be competitive for the enterprise market, its performance and scalability must improve.

Our experience thus far indicates that the performance of the Linux kernel can be improved significantly. We are proud to contribute to this goal by working within the open source community to quantify Linux kernel performance and to develop patches that address degradation issues, making Linux better and enterprise ready.

Acknowledgments

We would like to thank Kaivalya Dixit, Dustin Fredrickson, Partha Narayanan, Troy Wilson, Peter Wong, and the LTC Linux kernel development team for their input in preparing this article.


Resources
