Over the past few years, many IT organizations have begun to adopt internal service level agreements (SLAs) designed to ensure the performance and accountability of IT systems that support critical business functions. If you don't already have an internal SLA for your Linux/Unix server performance and availability, chances are you will very soon. The following Best Practices for managing your Linux/Unix system performance provide useful guidelines that help you set expectations in your organization and establish the metrics on which your performance will be judged.
The reasons behind the trend toward SLAs are not difficult to understand. In a highly competitive global environment, IT operations must support critical applications that automate key business processes. That translates into service level agreement metrics based on the availability of applications and their response times for completing crucial tasks. End users, whether they're external or internal customers, judge system performance in terms of application response time.
It's IT's responsibility to optimize the operation of servers and applications to meet acceptable response times, avoid costly downtime due to system failures or bottlenecks, and contain costs through better utilization or performance tuning of existing hardware resources.
Beyond General Health Indicators: A Best Practices Approach
Traditional measures of system performance and availability have usually focused on monitoring the general health indicators of an individual server. Typical server health parameters include processor, memory, and disk space usage. Yet even when conventional health parameters appear to be within normal ranges, your business applications can experience problems.
For example, if a developer decides to test a new piece of code on a production server, that code could lock out a portion of a mission-critical database. When an authorized user logs on to an application and that part of the database cannot be accessed because it's locked, the application fails. General health indicators do not offer any way to diagnose or properly assess this kind of situation.
While common tools for measuring general server health, such as top, are readily available, they do not provide sufficient insight into server performance and availability to satisfy the metrics of most service level agreements. The following Best Practices describe the key performance and availability indicators you will need to properly monitor and deliver the service levels expected from today's mission-critical Linux/Unix systems. Keep in mind, there are automated, commercially available tools that can help you implement these Best Practices.
Best Practice #1: Workload Monitoring
It's essential to monitor whether specific application processes are getting access to important system resources such as CPU, memory, and disk I/O. Beyond general server health indicators, workload monitoring provides specific analysis of individually named processes. It's particularly important to monitor disk I/O for applications, such as databases, that generate significant activity, since heavy I/O can create "hot spots" that impair application performance. You should look for a tool with granular capabilities, including monitoring the number of logical disk transfers, logical disk reads, and writes per second.
In addition, you should be able to monitor total CPU usage for each named process, the total for all processes, as well as queue length (load average) to determine the risk of CPU overloads. The same goes for individual memory use for each specified process.
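As a minimal sketch of the queue-length check described above, the 1-minute load average can be compared against the number of CPUs. The `headroom` factor and the threshold logic are illustrative assumptions, not any particular product's behavior:

```python
import os

def cpu_overload_risk(load_avg: float, ncpus: int, headroom: float = 1.0) -> bool:
    """Flag a risk of CPU overload when the run-queue length (load average)
    exceeds the number of CPUs times an allowed headroom factor."""
    return load_avg > ncpus * headroom

# On a live Unix system the inputs come from the OS:
one_min_load = os.getloadavg()[0]   # 1-minute load average
ncpus = os.cpu_count() or 1
at_risk = cpu_overload_risk(one_min_load, ncpus)
```

A real monitoring tool would sample this continuously and track per-process figures as well; this only shows the threshold comparison itself.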
Best Practice #2: Performance Management
Performance management, along with the ability to identify bottlenecks and their causes so that corrective action can be taken before system performance is affected, is critical to understanding what your end user is experiencing. Bottlenecks can occur when portions of your system do not run fast enough to keep up with demands from application processes. Your automated tools must be able to calculate the response times of your back-end database, for example, and provide a measurement against which you can benchmark optimal performance (see Figure 1). Identifying and correcting root causes can in some cases avoid the need for hardware upgrades. In addition, you should be able to monitor your network's ability to support critical application traffic and determine if the network is causing a performance problem.
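The response-time benchmarking described above can be sketched as a simple timing wrapper. The `operation` callable and the baseline value are placeholders; a real tool would feed in an actual back-end database call and a baseline derived from measurement history:

```python
import time

def measure_response(operation, baseline_seconds: float):
    """Time a single operation and report whether it met a baseline,
    mimicking the response-time benchmarking described above."""
    start = time.perf_counter()
    result = operation()            # e.g. a back-end database query
    elapsed = time.perf_counter() - start
    within_sla = elapsed <= baseline_seconds
    return result, elapsed, within_sla

# Hypothetical usage, where `run_query` stands in for a real database call:
# result, elapsed, ok = measure_response(run_query, baseline_seconds=0.5)
```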
Best Practice #3: Availability Management
Critical applications like Oracle do not work if their associated processes or daemons are unavailable. You need to know immediately if a process or daemon fails. In the case of Network File System, this means monitoring key processes such as mountd and nfsd. Manually checking these daemons is not practical. You should look for automated tools that can periodically "ping" servers to ensure proper availability, as well as monitor the overall availability of your Linux/Unix servers.
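The daemon check above can be sketched by listing running process names and diffing against a required set. This assumes a POSIX-style `ps` that supports `-eo comm=`; the NFS example mirrors the mountd/nfsd case from the text:

```python
import subprocess

def missing_daemons(required, running):
    """Return the required process names that are absent from the running set."""
    return sorted(set(required) - set(running))

def running_process_names():
    """List running process names via `ps` (POSIX-style output assumed)."""
    out = subprocess.run(["ps", "-eo", "comm="], capture_output=True, text=True)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

# e.g. for NFS: missing_daemons(["mountd", "nfsd"], running_process_names())
```

A monitoring tool would run this on a schedule and raise an alert whenever the returned list is non-empty.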
Best Practice #4: Directory and File Management
Directory and file management is crucial for Linux/Unix servers, since failure to recognize that a disk or directory is approaching its capacity can be catastrophic. This is especially true for environments with a large number of servers. Since most application log files are written to the /var directory, if this directory becomes full, a panic condition can result, causing a core dump. If you have several users working on a highly integrated CAD application, for example, and they all try to save or store their work to the same directory and it's full, some of those users could lose some or all of their work that day.
Thus you must be able to track the amount of disk space utilized by files and directories and be alerted when they approach their capacity. That means being able to automatically check the disk space used by specified directories and the files under those directories, generate an alert if the situation exceeds a threshold, and have the option to manually or automatically take corrective action to move files to a less-full disk.
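The threshold check described above can be sketched with the standard library's disk-usage call. The percentage threshold is an illustrative choice; moving files automatically, as the text suggests, would be layered on top of a check like this:

```python
import shutil

def over_threshold(used: int, total: int, threshold_pct: float) -> bool:
    """Pure threshold check: is used space at or above the given percentage?"""
    return 100.0 * used / total >= threshold_pct

def directory_usage_alert(path: str, threshold_pct: float) -> bool:
    """Return True when the filesystem holding `path` exceeds the threshold."""
    usage = shutil.disk_usage(path)
    return over_threshold(usage.used, usage.total, threshold_pct)

# e.g. alert when the filesystem holding /var is 90% full:
# if directory_usage_alert("/var", 90.0): send_alert(...)   # send_alert is hypothetical
```

Note that `shutil.disk_usage` reports the whole filesystem, not a single directory's footprint; per-directory totals would require walking the tree (e.g. with `os.walk`) and summing file sizes.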
Best Practice #5: User Activity Monitoring
For security as well as performance and availability reasons, it's essential to know which users have logged on to your system and when. It's especially important to ensure that no user is allowed to log in directly as root, since on Linux/Unix systems this gives the user nearly unlimited power to perform any action. You need to be sure that all users log in under their own user names and then run a further command (such as su) to gain root privileges. Automated performance tools can record and alert on all requests for root access and provide for escalation in the event of multiple failed attempts to switch to root.
Without these added controls, users may not always realize they are logged in as root and could accidentally delete files.
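The escalation rule described above can be sketched by counting failed root-switch attempts per user. The log line format here is a simplified assumption for illustration; real su/auth log formats vary by distribution:

```python
from collections import Counter

def failed_root_attempts(log_lines, limit: int = 3):
    """Count failed attempts to switch to root per user and return the
    users who have reached the escalation limit. Assumes a simplified
    log format like '... FAILED su for root by alice'."""
    counts = Counter()
    for line in log_lines:
        if "FAILED su for root" in line:
            user = line.rsplit(" by ", 1)[-1].strip()
            counts[user] += 1
    return sorted(u for u, n in counts.items() if n >= limit)
```

In practice a tool would tail the authentication log and trigger an alert or account lockout once a user appears in the returned list.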
Best Practice #6: Automated Log File Monitoring
Because Linux/Unix servers generate enormous amounts of log data, it's impractical to monitor log files manually. You need an automated tool that can track certain messages or errors, including those in the syslog file. You should be able to search for specific messages or strings in order to identify and flag them. For example, if a user tries to access a server remotely, your tool should generate an alert and send it to you as the administrator. The same applies to application log files, where the tool can alert you to specific application errors.
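The string-matching approach above amounts to scanning a log stream against a watch list of patterns. The example patterns and the syslog path in the comment are assumptions; any regular expressions and log source could be substituted:

```python
import re

def scan_log(lines, patterns):
    """Return (pattern, line) pairs for every line matching any watched pattern."""
    compiled = [re.compile(p) for p in patterns]
    hits = []
    for line in lines:
        for pat in compiled:
            if pat.search(line):
                hits.append((pat.pattern, line))
    return hits

# Hypothetical usage, watching for remote logins and generic errors:
# with open("/var/log/syslog") as f:
#     for pattern, line in scan_log(f, [r"sshd.*Accepted", r"error", r"denied"]):
#         print("ALERT:", pattern, "->", line)
```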
Best Practice #7: Process Monitoring
It's important to recognize which processes are currently running on your Linux/Unix servers at any time, including those that normally operate on your systems. You should know the total number of processes running at any given moment, which processes are running or have terminated, and whether the total exceeds a preset threshold representing "normal" server activity. If a threshold is exceeded (say, a set number of processes at once), your monitoring tools should generate an alert to make you aware of the server's status.
This information can be critical to properly diagnosing system problems. Systems running much more slowly at certain periods during the day may indicate a rogue application process on a production server or a problem with managing the activity of cron jobs, for example. In addition, proper process monitoring can identify "zombie" processes, child processes that have terminated but have not yet been reaped by their parent, which needlessly occupy entries in the process table.
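Both checks described in this section, the process-count threshold and zombie detection, can be sketched as one pure function over (pid, state) pairs. The single-letter states follow the codes `ps` reports on Linux, where "Z" marks a zombie:

```python
def process_alerts(processes, max_processes: int):
    """Given (pid, state) pairs, return the zombie PIDs (state 'Z') and
    whether the total process count exceeds the configured threshold."""
    zombies = [pid for pid, state in processes if state == "Z"]
    over_limit = len(processes) > max_processes
    return zombies, over_limit

# In practice the pairs would come from parsing `ps -eo pid=,state=`
# or reading the `state` field of /proc/<pid>/stat on Linux.
```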
Best Practice #8: Application Resource Management
Because Linux/Unix servers are typically dedicated to a specific application, such as an Oracle database or a Network File System, you as the administrator must be sure the intended application is getting the resources necessary to run. The only way to do this properly is to track the specific demands that critical applications impose on their servers in terms of memory and processing power over time. Knowing the consumption history of server CPU and memory can assist you in assessing trends and anticipating capacity changes in your systems.
For example, if you notice that the CPU consumption for an Oracle database application has grown from 55% in the past month to 65% in the last week to 75% within the last 24 hours, you can justifiably predict that CPU utilization will soon reach its maximum. In recognizing this situation, you may see that the business has been adding users of the application at a brisk pace and that the application may need to be moved to a larger server with more CPU capacity to accommodate the increased demand.
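The kind of projection made in the example above can be sketched as a deliberately naive linear extrapolation over (day, CPU%) samples. Real capacity planning would use more history and a better trend model; this only illustrates the arithmetic:

```python
def days_until_saturation(samples, capacity_pct: float = 100.0):
    """Linearly extrapolate (day, cpu_pct) samples to estimate how many days
    remain before CPU usage reaches capacity. Returns None if usage is flat
    or falling. Uses only the first and last samples, a simplifying assumption."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    slope = (u1 - u0) / (t1 - t0)   # percentage points per day
    if slope <= 0:
        return None
    return (capacity_pct - u1) / slope

# e.g. the article's trend, with assumed sample days (30, 7, and 1 day ago):
# days_until_saturation([(-30, 55.0), (-7, 65.0), (-1, 75.0)])
```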
Conclusion
Each of the Best Practices described here represents a more sophisticated level of system performance and availability accountability than the conventional general health measures most administrators are familiar with. However, as user demands on IT departments continue to increase and accelerate, these Best Practices are becoming the rule rather than the exception. Fortunately, there are commercially available performance and availability tools that can help you automate these Best Practices and achieve the service levels required to maintain and enhance your Linux/Unix systems.