Improving Linux kernel performance and scalability

Making way for Linux in the enterprise

Sandra Johnson, IBM Linux Technology Center
William Hartner (bhartner@us.ibm.com), IBM Linux Technology Center
William Brantley (Bill.Brantley@amd.com), Advanced Micro Devices

Summary:  The first step in improving Linux performance is quantifying it. But how exactly do you quantify performance for Linux or for comparable systems? In this article, members of the IBM Linux Technology Center share their expertise as they describe how they ran several benchmark tests on the Linux 2.4 and 2.5 kernels late last year.

Date:  01 Jan 2003
Level:  Introductory
Also available in:  Japanese

The Linux operating system is one of the most successful open source projects to date. Linux exhibits high reliability as a Web server operating system and holds significant market share in that segment. Web servers are typically low-end to midrange systems with up to 4-way symmetric multiprocessing (SMP); enterprise-level systems have more complex requirements, such as larger processor counts and I/O configurations and significant memory and bandwidth requirements. In order for Linux to be enterprise-ready and commercially viable in the SMP market, its SMP scalability, disk and network I/O performance, scheduler, and virtual memory manager must be improved relative to commercial UNIX systems.

The Linux Scalability Effort (LSE) (see Resources for a link) is an open source project that addresses these Linux kernel issues for enterprise-class machines, with 8-way scalability and beyond.

The IBM Linux Technology Center's (LTC) Linux Performance Team (see Resources for a link) actively participates in the LSE effort. Its objective is to make Linux better by improving Linux kernel performance, with special emphasis on SMP scalability.

This article describes the strategy and methodology used by the team for measuring, analyzing, and improving the performance and scalability of the Linux kernel, focusing on platform-independent issues. A suite of benchmarks is used to accomplish this task. The benchmarks provide coverage for a diverse set of workloads, including Web serving, database, and file serving. In addition, we show the various components of the kernel (the disk I/O subsystem, for example) that are stressed by each benchmark.

Analysis methodology

Here we discuss the analysis methodology we used to quantify Linux performance for SMP scalability. If you prefer, you can skip ahead to the Benchmarks section.

Our strategy for improving Linux performance and scalability includes running several industry-accepted and component-level benchmarks, selecting the appropriate hardware and software, developing benchmark run rules, setting performance and scalability targets, and measuring, analyzing, and improving performance and scalability. These processes are detailed in this section.

Performance is defined as raw throughput on a uniprocessor (UP) or SMP. We distinguish between SMP scalability (CPUs) and resource scalability (number of network connections, for example).

Hardware and software

The architecture used for the majority of this work is IA-32 (in other words, x86), from one to eight processors. We also study the issues associated with future use of non-uniform memory access (NUMA) IA-32 and NUMA IA-64 architectures. The selection of hardware typically aligns with the selection of the benchmark and the associated workload. The selection of software aligns with IBM's Linux middleware strategy and/or open source middleware. For example:

  • Database
    We use a query database benchmark, and the hardware is an 8-way SMP system with a large disk configuration. IBM DB2 for Linux is the database software used, and the SCSI controllers are IBM ServeRAID 4H. The database is targeted for 8-way SMP.
  • SMB file serving
    The benchmark is NetBench and the hardware is a 4-way SMP system with as many as 48 clients driving the SMP server. The middleware is Samba (open source). SMB file serving is targeted for 4-way SMP.
  • Web serving
    The benchmark is SPECweb99, and the hardware is an 8-way with a large memory configuration and as many as 32 clients. The benchmarking was conducted for research purposes only and was non-compliant (more on this in the Benchmarks section). The Web server is Apache, which is the basis for the IBM HTTP Server. We chose an 8-way in order to investigate scalability, and we chose Apache because it enables the measurement and analysis of Next Generation POSIX Threads (NGPT) (see Resources). In addition, it is open source and the most popular Web server.
  • Linux kernel version
    The level of the Linux kernel.org kernel (2.2.x, 2.4.x, or 2.5.x) used is benchmark dependent; this is discussed further in the Benchmarks section. The Linux distribution selected is Red Hat 7.1 or 7.2 in order to simplify our administration. Our focus is kernel performance, not the performance of the distribution: we replaced the Red Hat kernel with one from kernel.org along with the patches we evaluated.

Run rules

During benchmark setup, we developed run rules to detail how the benchmark is installed, configured, and run, and how results are to be interpreted. The run rules serve several purposes (an illustrative sketch follows the list below):

  • Define the metric that will be used to measure benchmark performance and scalability (for example, messages/sec).
  • Ensure that the benchmark results are suitable for measuring the performance and scalability of the workload and kernel components.
  • Provide a documented set of instructions that will allow others to repeat the performance tests.
  • Define the set of data that is collected so that performance and scalability of the System Under Test (SUT) can be analyzed to determine where bottlenecks exist.
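
As a purely illustrative aid (not part of the original methodology), the sketch below shows how such a run-rules record might be captured in code; every field name and value here is hypothetical.

```python
# Hypothetical run-rules record. The fields mirror the purposes listed
# above: the metric, the SUT description, repeatable setup steps, and
# the data to collect for bottleneck analysis.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RunRules:
    benchmark: str
    metric: str                  # e.g. "messages/sec"
    sut: str                     # hardware/software configuration
    setup_steps: List[str] = field(default_factory=list)
    data_to_collect: List[str] = field(default_factory=list)

volano_rules = RunRules(
    benchmark="VolanoMark (loopback)",
    metric="messages/sec",
    sut="8-way IA-32 SMP, kernel.org 2.5.x plus patches under test",
    setup_steps=["install JVM", "install VolanoChat server",
                 "start 10 rooms of 20 clients"],
    data_to_collect=["/proc/meminfo", "/proc/interrupts",
                     "kernprof profile", "lockmeter statistics"],
)
print(volano_rules.metric)
```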

Setting targets

Performance and scalability targets for a benchmark are associated with a specific SUT (hardware and software configuration). Setting performance and scalability targets requires the following:

  • Baseline measurements to determine the performance of the benchmark on the baseline kernel version. Baseline scalability is then calculated.
  • Initial performance analysis to determine a promising direction for performance gains (for example, a profile indicating the scheduler is very busy might suggest trying an O(1) scheduler).
  • Comparison of baseline results with similar published results (for example, find SPECweb99 publications on the same Web server on a similar 8-way from spec.org).

If external published results are not available, we attempt to use internal results. We also attempt to compare to other operating systems. Given the competitive data and our baseline, we select a performance target for UP and SMP machines.

Finally, a target may be predicated on getting a change in the application. For example, if we know that the way the application does asynchronous I/O is inefficient, then we may publish the performance target assuming the I/O method will be changed.

Tuning, measurement, and analysis

Before any measurements are made, both the hardware and software configurations are tuned. Tuning is an iterative cycle of measurement and adjustment: it involves measuring components of the system, such as CPU utilization and memory usage, and possibly adjusting system hardware parameters, system resource parameters, and middleware parameters. Tuning is one of the first steps of performance analysis. Without tuning, scaling results may be misleading; that is, they may not indicate kernel limitations but rather some other issue.

The benchmark runs are made according to the run rules so that both performance and scalability can be measured in terms of the defined performance metric. When calculating SMP scalability for a given machine, we had to choose between basing the metric on the performance of a UP kernel or on the performance of an SMP kernel with the number of processors set to 1 (1P). We decided to compute SMP scalability using UP measurements to more accurately reflect the SMP kernel performance improvements.
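
As a concrete illustration of that choice (our own example, with made-up throughput numbers), SMP scalability is then the N-way throughput divided by the UP-kernel throughput, and scaling efficiency divides that ratio by the processor count:

```python
def smp_scalability(smp_throughput: float, up_throughput: float) -> float:
    """Scalability relative to the uniprocessor (UP) kernel baseline."""
    return smp_throughput / up_throughput

def scaling_efficiency(smp_throughput: float, up_throughput: float, cpus: int) -> float:
    """Fraction of ideal linear scaling achieved on `cpus` processors."""
    return smp_scalability(smp_throughput, up_throughput) / cpus

# Made-up numbers for illustration: a UP kernel at 1,000 messages/sec
# and an 8-way SMP kernel at 5,600 messages/sec.
print(smp_scalability(5600.0, 1000.0))        # 5.6x
print(scaling_efficiency(5600.0, 1000.0, 8))  # 0.7, i.e. 70% of linear
```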

A baseline measurement is made using the previously determined version of the Linux kernel. For most benchmarks, both UP and SMP baseline measurements are made. For a few benchmarks, only the 8-way performance is measured, since collecting UP performance information would be prohibitively time-consuming. Most other benchmarks measure the amount of work completed in a specific time period, which takes no longer to measure on a UP than on an 8-way.

The first step required to analyze the performance and scalability of the SUT (System Under Test) is to understand the benchmark and the workload tested. Initial performance analysis is made against a tuned system. Sometimes the analysis uncovers the need for additional adjustments to the tuning parameters.

Analysis of the performance and scalability of the SUT requires a set of performance tools. Our strategy is to use Open Source community (OSC) tools whenever possible. This allows us to post analysis data to the OSC in order to illustrate performance and scalability bottlenecks. It also allows those in the OSC to replicate our results with the tool or to understand the results after experimenting with the tool on another application. If ad hoc performance tools are developed to gain a better understanding of a specific performance bottleneck, then the ad hoc performance tool is generally shared with the OSC. Ad hoc performance tools are usually simple tools that instrument a specific component of the Linux kernel. The performance tools we used include:

  • /proc file system
    meminfo, slabinfo, interrupts, network stats, I/O stats, etc. (a small sampling sketch appears after this list)
  • SGI's lockmeter
    For SMP lock analysis
  • SGI's kernel profiler (kernprof)
    Time-based profiling, performance counter-based profiling, annotated call graph (ACG) of kernel space only
  • IBM Trace Facility
    Single step (mtrace) and both time-based and performance counter-based profiling for both user and system space
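
The following is a minimal sketch, of our own construction, of how such /proc counters can be snapshotted around a benchmark run; the choice of files and the log format are illustrative only.

```python
# Minimal /proc snapshot helper: dump a few kernel counters before and
# after a benchmark run so the deltas can be compared.
import time

PROC_FILES = ["/proc/meminfo", "/proc/slabinfo", "/proc/interrupts", "/proc/net/dev"]

def snapshot(tag: str) -> None:
    """Write the current contents of selected /proc files to a log."""
    with open(f"procsnap-{tag}.log", "w") as log:
        log.write(f"# snapshot '{tag}' at {time.ctime()}\n")
        for path in PROC_FILES:
            try:
                with open(path) as f:
                    log.write(f"\n==== {path} ====\n{f.read()}")
            except OSError as err:   # some files may need extra privileges
                log.write(f"\n==== {path} ==== unreadable: {err}\n")

snapshot("before")
# ... run the benchmark here ...
snapshot("after")
```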

Ad hoc performance tools are developed to further understand a specific aspect of the system.

Examples are:

  • sstat
    Collects scheduler statistics
  • schedret
    Determines which kernel functions are blocking for investigation of idle time
  • acgparse
    Post-processes kernprof ACG
  • copy in/out instrumentation
    Determines alignment of buffers, size of copy, and CPU utilization of the copy in/out algorithm (a rough user-space sketch follows this list)
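
As a rough user-space stand-in for the kind of question the copy in/out instrumentation answers (the real tool instruments the kernel; the helper below, with names of our own invention, merely reports a buffer's size and cache-line alignment before it is used for a read):

```python
# Report how a user buffer is aligned before it is used for I/O.
# Poorly aligned buffers and odd copy sizes can make the kernel's
# copy in/out path more expensive.
import ctypes

CACHE_LINE = 64  # assumed cache-line size for this illustration

def describe_buffer(size):
    buf = ctypes.create_string_buffer(size)
    addr = ctypes.addressof(buf)
    print(f"buffer at 0x{addr:x}, size {size}, "
          f"offset into cache line: {addr % CACHE_LINE}")
    return buf

buf = describe_buffer(8192)
with open("/dev/zero", "rb", buffering=0) as f:
    f.readinto(buf)   # read directly into the inspected buffer
```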

Performance analysis data is then used to identify performance and scalability bottlenecks. A broad understanding of the SUT and a more specific understanding of certain Linux kernel components that are being stressed by the benchmark are required in order to understand where the performance bottlenecks exist. There must also be an understanding of the Linux kernel source code that is the cause of the bottleneck. In addition, we work very closely with the LTC Linux kernel development teams and the OSC (Open Source community) so that a patch can be developed to fix the bottleneck.

Exit strategy

An evaluation of Linux kernel performance may require several cycles of running the benchmarks, conducting an analysis of the results to identify performance and scalability bottlenecks, addressing any bottlenecks by integrating patches into the Linux kernel, and running the benchmark again. The patches can be obtained by finding existing patches in the OSC or by developing new patches as a performance team member, in close collaboration with the members of the Linux kernel development team or OSC. There is a set of criteria for determining when Linux is "good enough" and we end this process.

First, if we have met our targets and we do not have any outstanding Linux kernel issues to address for the specific benchmark that would significantly improve its performance, we assert that Linux is "good enough" and move on to other issues. Second, if we go through several cycles of performance analysis and still have outstanding bottlenecks, then we consider the tradeoffs between the development costs of continuing the process and the benefits of any additional performance gains. If the development costs are too high relative to any potential performance improvements, we discontinue our analysis and articulate the rationale appropriately.

In both cases, we then review all of the additional outstanding Linux kernel-related issues we want to address, make an assessment of appropriate benchmarks that may be used to address these kernel component issues, examine any data we may have on the issue, and make a decision to conduct an analysis of the kernel component (or collection of components) based upon this collective information.

Benchmarks

This section includes a description of the benchmarks used in our suite and the associated kernel components each one stresses. In addition, performance results and analysis are included for some of the benchmarks used by the Linux performance team.


Table 1. Linux kernel performance benchmarks
Linux kernel component          | Database query | VolanoMark | SPECweb99 Apache2 | NetBench | Netperf | LMBench | TioBench IOZone
--------------------------------|----------------|------------|-------------------|----------|---------|---------|----------------
Scheduler                       |                |     X      |         X         |    X     |         |         |
Disk I/O                        |       X        |            |                   |          |         |         |       X
Block I/O                       |       X        |            |                   |          |         |         |
Raw, Direct & Async I/O         |       X        |            |                   |          |         |         |
Filesystem (ext2 & journaling)  |                |            |         X         |    X     |         |    X    |       X
TCP/IP                          |                |     X      |         X         |    X     |    X    |    X    |
Ethernet driver                 |                |     X      |         X         |    X     |    X    |         |
Signals                         |                |     X      |                   |          |         |    X    |
Pipes                           |                |            |                   |          |         |    X    |
Sendfile                        |                |            |         X         |    X     |         |         |
pThreads                        |                |     X      |         X         |          |    X    |         |
Virtual memory                  |                |            |         X         |    X     |         |    X    |
SMP scalability                 |       X        |     X      |         X         |    X     |    X    |         |       X

Benchmark descriptions

The benchmarks used are selected based on a number of criteria: industry benchmarks that are reliable indicators of a complex workload, and component-level benchmarks that indicate specific kernel performance problems. Industry benchmarks are generally accepted by the industry to measure performance and scalability of a specific workload. These benchmarks often require a complex or expensive setup that is not available to most of the OSC (Open Source community). These complex setups are one of our contributions to the OSC. Examples include:

  • SPECweb99
    Representative of Web-serving performance
  • SPECsfs
    Representative of NFS performance
  • Database query
    Representative of database-query performance
  • NetBench
    Representative of SMB file-serving performance

Component-level benchmarks measure performance and scalability ofspecific Linux kernel components that are deemed critical to a widespectrum of workloads. Examples include:

  • Netperf3
    Measures performance of the network stack, including TCP, IP, and network device drivers
  • VolanoMark
    Measures performance of scheduler, signals, TCP send/receive, loopback
  • Block I/O Test
    Measures performance of VFS, raw and direct I/O, block device layer, SCSI layer, low-level SCSI/fibre device driver

Some benchmarks are commonly used by the OSC. They are preferred because the OSC already accepts the importance of the benchmark. Thus, it is easier to convince the OSC of performance and scalability bottlenecks illuminated by the benchmark. In addition, there are generally no licensing issues that prevent us from publishing raw data. The OSC can run these benchmarks because they are often simple to set up, and the hardware required is minimal. On the other hand, they often do not meet our requirements for enterprise systems. Examples include:

  • LMBench
    Used to measure performance of the Linux APIs
  • IOZone
    Used to measure native file system throughput
  • DBench
    Used to measure the file system component of NetBench
  • SMB Torture
    Used to measure SMB file-serving performance

There are many benchmark options available for our targeted workloads. We chose the ones listed above because they are best suited for our mission, given our resources. There are some important benchmarks we chose not to utilize. In addition, we have chosen not to run some benchmarks that are already under study by other performance teams within IBM (for example, the IBM Solution Technologies System Performance Team has found that SPECjbb on Linux is "good enough"). Presented in Table 1 are the benchmarks currently used by the Linux performance team and the targeted kernel components.

Benchmark results

Presented here are descriptions of three selected benchmarks used in our suite to quantify Linux kernel performance: database query, VolanoMark, and SPECweb99. For all three benchmarks, we used 8-way machines, as detailed in the figures presenting the benchmark results.


Figure 1. Database query benchmark results

Figure 1 shows the database query benchmark results. Also included is a description of the hardware and software configurations used. The figure graphically illustrates the progress we have made in achieving our target. Some of the issues we have addressed have resulted in improvements that include adding bounce buffer avoidance, ips, io_request_lock, readv, kiobuf, and O(1) scheduler kernel patches, as well as several DB2 optimizations.

The VolanoMark benchmark (see Resources) creates 10 chat rooms of 20 clients. Each room echoes the messages from one client to the other 19 clients in the room. This benchmark, not yet an open source benchmark, consists of the VolanoChat server and a second program that simulates the clients in the chat room. It is used to measure the raw server performance and network scalability performance. VolanoMark can be run in two modes: loopback and network. The loopback mode tests the raw server performance, and the network mode tests the network scalability performance. VolanoMark uses two parameters to control the size and number of chat rooms.

The VolanoMark benchmark creates client connections in groups of 20 and measures how long it takes for the server to take turns broadcasting all of the clients' messages to the group. At the end of the loopback test, it reports a score as the average number of messages transferred per second. In the network mode, the metric is the number of connections between the clients and the server. The Linux kernel components stressed by this benchmark include the scheduler, signals, and TCP/IP.
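
To make the loopback metric concrete, here is a small back-of-the-envelope model of our own (with made-up message counts, and the simplifying assumption that the score counts each delivered copy of a message):

```python
# Rough model of the VolanoMark loopback score: every message posted in a
# room is echoed to the other (clients_per_room - 1) members.
rooms = 10
clients_per_room = 20
messages_per_client = 100          # made-up value for illustration
elapsed_seconds = 25.0             # made-up run time

delivered = rooms * clients_per_room * messages_per_client * (clients_per_room - 1)
score = delivered / elapsed_seconds
print(f"{delivered} messages delivered -> {score:.0f} messages/sec")
```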


Figure 2. VolanoMark benchmark results; loopback mode

Presented in Figure 2 are the VolanoMark benchmark results for loopback mode. Also included is a description of the hardware and software configurations used and our target for this benchmark. We have established close collaboration with the members of the Linux kernel development team on moving forward to achieve this target. Some of the issues we have addressed that have resulted in improvements include adding the O(1) scheduler, SMP scalable timer, tunable priority preemption, and soft affinity kernel patches. As illustrated, we have exceeded our target for this benchmark; however, there are some outstanding Linux kernel component-related and Java-related issues we are addressing that we believe will further improve the performance of this benchmark.

Please note that the SPECweb99 benchmark work was conducted for research purposes only and was non-compliant, with the following deviations from the rules:

  1. It was run on hardware that does not meet the SPEC availability-to-the-public criteria. The machine was an engineering sample.
  2. access_log wasn't kept for full accounting. It was written, but deleted every 200 seconds.

This benchmark presents a demanding workload to a Web server. This workload requests 70% static pages and 30% simple dynamic pages. Sizes of the Web pages range from 102 to 921,000 bytes. The dynamic content models GIF advertisement rotation. There is no SSL content. SPECweb99 is relevant because Web serving, especially with Apache, is one of the most common uses of Linux servers. Apache is rich in functionality and is not designed for high performance. However, we chose Apache as the Web server for this benchmark because it currently hosts more Web sites than any other Web server on the Internet. SPECweb99 is the accepted standard benchmark for Web serving. SPECweb99 stresses the following kernel components: scheduler, TCP/IP, various threading models, sendfile, zero copy, and network drivers.
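
The snippet below is only a toy illustration of a 70/30 static/dynamic request mix over that size range; it is not the SPECweb99 file set or load generator, whose exact distributions are defined by SPEC's run rules.

```python
# Toy request-mix generator: 70% static pages and 30% simple dynamic
# pages, with static page sizes drawn from the 102..921,000 byte range
# mentioned in the benchmark description.
import random

def next_request():
    if random.random() < 0.70:
        return ("static", random.randint(102, 921_000))
    return ("dynamic", 0)   # dynamic content modeled separately (ad rotation)

random.seed(1)
sample = [next_request() for _ in range(1000)]
static_share = sum(kind == "static" for kind, _ in sample) / len(sample)
print(f"static share in sample: {static_share:.0%}")
```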


Figure 3. SPECweb99 benchmark results using the Apache Web server

Presented in Figure 3 are our results for SPECweb99. Also included is a description of the hardware and software configurations used and our benchmark target. We have a close collaboration with the Linux kernel development team and the IBM Apache team as we make progress on the performance of this benchmark. Some of the issues we have addressed that have resulted in the improvements shown include adding O(1) and read copy update (RCU) dcache kernel patches and adding a new dynamic API mod_specweb module to Apache. As shown in Figure 3, we have exceeded our target on this benchmark; however, there are several outstanding Linux kernel component-related issues we are addressing that we believe will significantly improve the performance of this benchmark.

Summary

Linux has enjoyed great popularity, specifically with low-end and midrange systems. In fact, Linux is well regarded as a stable, highly reliable operating system to use for Web servers on these machines. However, high-end, enterprise-level systems have access to gigabytes, petabytes, and exabytes of data. These systems require a different set of applications and solutions with high memory and bandwidth requirements, in addition to larger numbers of processors (see Resources for the developerWorks article "Open source in the biosciences," which discusses this type of application).

This type of system application introduces a unique set of issues that may be orders of magnitude more complex than those present in smaller installations. In order for Linux to be competitive for the enterprise market, its performance and scalability must improve.

Our experience thus far indicates that the performance of the Linux kernel can be improved significantly. We are proud to contribute to this goal by working within the open source community to quantify Linux kernel performance and to develop patches that address degradation issues, making Linux better and enterprise ready.

Acknowledgments

We would like to thank Kaivalya Dixit, Dustin Fredrickson, Partha Narayanan, Troy Wilson, Peter Wong, and the LTC Linux kernel development team for their input in preparing this article.


Resources
