Official Documentation: OSW --- OS Watcher User's Guide


OSW

OS Watcher User's Guide

Carl Davis
Center of Expertise
December 16, 2008



NOTICE:  The Diagnostic Data Output section and oswiostat, oswmpstat, oswnetstat, oswprvtnet, oswps, oswtop, oswvmstat subsections have been added to this user guide.  These new additions provide samples of output collected by OSW and a guide on what to look for in each type of file produced.

Also OSW now provides a graphing utility to graph the data collected. This greatly reduces the need to manually inspect all the output files. See the "Graphing the Output" section below.

To collect database metrics in addition to OS metrics, consider running LTOM.
 


 

Introduction

OS Watcher (OSW) is a collection of UNIX shell scripts intended to collect and archive operating system and network metrics to aid support in diagnosing performance issues. OSW operates as a set of background processes on the server and gathers OS data on a regular basis, invoking such Unix utilities as vmstat, netstat and iostat. OSW can be downloaded from this note. OSW is also included in the RAC-DDT script file, but is not installed by RAC-DDT. For more information on RAC-DDT see <>. OSW is installed on each node where data is to be collected. Installation instructions for OSW are provided in this user guide.


Overview

OSW consists of a series of shell scripts. OSWatcher.sh is the main controlling executive, which spawns individual shell processes to collect specific kinds of data, using Unix operating system diagnostic utilities. Control is passed to individually spawned operating system data collector processes, which in turn collect specific data, timestamp the data output, and append the data to pre-generated and named files. Each data collector will have its own file, created and named by the File Manager process.

Data collection intervals are configurable by the user, but will be uniform for all data collector processes for a single instance of the OSW tool. For example, if OSW is configured to collect data once per minute, each spawned data collector process will generate output for its respective metric, write data to its corresponding data file, then sleep for one minute (or other configured interval) and repeat. Because data is collected every minute, the file generated by each spawned process will contain 60 entries, one for each minute during the previous hour. Each file will contain, at most, one hour of data. At the end of each hour, File Manager will wake up and copy the existing current hour file to an archive location, then create a new current hour file.

The File Manager ensures only the last N hours of information are retained, where N is a configurable integer defaulting to 48. File Manager will wake up once per hour to delete files older than N hours. At any time, the entire output file set will consist of one current hour file, plus N archive files for each data collector process.
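The arithmetic in the two paragraphs above can be sketched in shell. The function name osw_estimate is hypothetical and is not part of OSW; it only restates the rules given above (one entry per interval, one current hour file plus N archive files per data collector).

```shell
#!/bin/sh
# Hypothetical helper, not part of OSW: estimate what a configuration produces.
# $1 = snapshot interval in seconds, $2 = hours of archive data to keep (N).
osw_estimate() {
    interval=$1
    hours=$2
    # Number of entries one collector writes to each hourly file.
    entries=$((3600 / interval))
    # One current hour file plus N archive files for each collector.
    files=$((hours + 1))
    echo "entries_per_file=$entries files_per_collector=$files"
}

# Once per minute, keeping the default 48 hours
osw_estimate 60 48
```

With a 60 second interval this reports 60 entries per file, matching the example above.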

stopOSW.sh will terminate all processes associated with OSW, and is the normal, graceful mechanism for stopping the tool's operation.

OSW invokes the following operating system utilities as data collectors, each as a distinct background process. These utilities, or their equivalents, are supported as available on each supported target platform.

  • ps
  • top
  • mpstat
  • iostat
  • netstat
  • traceroute
  • vmstat


Supported Platforms

OSW is certified to run on the following platforms:

  • AIX
  • Tru64
  • Solaris
  • HP-UX
  • Linux


Gathering Diagnostic Data


Installing OSW

OSW needs to be installed on each node, one installation per node. OSW should be installed manually by using the following procedure:

NOTE: OSW is available through MetaLink and can be downloaded as a tar file. The user then copies the file osw.tar to the directory where OSW is to be installed and issues the following command.

tar xvf osw.tar

A directory named osw is created which houses all the files associated with OSW. OSW is now installed.


 

Uninstalling OSW

To uninstall OSW, issue the following command from the directory that contains the osw directory.

rm -rf osw


 

Setting up OSW

Once OSW is installed, scripts have been provided to start and stop the OSW utility. When OSW is started for the first time it creates the archive subdirectory. The archive directory contains 7 subdirectories, one for each data collector. Data collectors exist for top, vmstat, iostat, mpstat, netstat, ps and an optional collector for tracing private networks. To turn on data collection for private networks the user must create an executable file in the osw directory named private.net. An example of what this file should look like is named Example private.net in the osw directory. This file can be edited and renamed private.net or a new file named private.net can be created. This file contains entries for running the traceroute command to verify RAC private networks.

Example private.net entry on Solaris:

traceroute -r -F node1 
traceroute -r -F node2

Here node1 and node2 are the 2 nodes, in addition to the host node, of a 3-node RAC cluster. If the file private.net does not exist or is not executable then no data will be collected and stored under the oswprvtnet directory.
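As a rough sketch of the step described above, the following shell function writes a private.net file and makes it executable. The function name make_private_net is hypothetical, and the traceroute flags are the Solaris ones from the example; other platforms need the syntax shown in the example file shipped in the osw directory. That shipped file may contain additional lines that this sketch does not reproduce, so editing the shipped example is the safer route.

```shell
#!/bin/sh
# Hypothetical helper, not part of OSW: write a private.net file.
# $1 = osw directory; remaining arguments = private network host names.
make_private_net() {
    osw_dir=$1
    shift
    # Start from an empty file, then add one traceroute entry per node.
    : > "$osw_dir/private.net"
    for node in "$@"; do
        echo "traceroute -r -F $node" >> "$osw_dir/private.net"
    done
    # No private network data is collected unless the file is executable.
    chmod +x "$osw_dir/private.net"
}

# Usage (node1 and node2 are placeholder host names):
#   make_private_net ./osw node1 node2
```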

OSW will need access to the OS utilities: top, vmstat, iostat, mpstat, netstat, and traceroute. These OS utilities need to be installed on the system prior to running OSW. Execute permission on these utilities needs to be granted to the user of OSW.
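One way to check this prerequisite before starting OSW is sketched below. The function name missing_utils is hypothetical. It relies only on the standard shell built-in command -v, which reports whether a command can be found on the PATH of the current user.

```shell
#!/bin/sh
# Hypothetical helper, not part of OSW: print each listed utility that the
# current user cannot find on PATH.
missing_utils() {
    for util in "$@"; do
        command -v "$util" >/dev/null 2>&1 || echo "$util"
    done
}

# The utilities named above, plus ps, which OSW also runs as a collector.
missing_utils top vmstat iostat mpstat netstat traceroute ps
```

Run this as the user who will start OSW. No output means every utility was found.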


 

Starting OSW

To start the OSW utility execute the startOSW.sh shell script from the directory where OSW was installed. This script has 2 arguments which control the frequency that data is collected and the number of hours' worth of data to archive.

ARG1 = snapshot interval in seconds.
ARG2 = the number of hours of archive data to store.

If you do not enter any arguments the script runs with default values of 30 and 48, meaning collect data every 30 seconds and store the last 48 hours of data in archive files.

Example 1:

./startOSW.sh 60 10

 

This would start the tool and collect data at 60 second intervals and log the last 10 hours of data to archive files.

Example 2:

./startOSW.sh

NOTE: This would use the default values of 30, 48 and collect data at 30 second intervals and log the last 48 hours of data to archive files.

Example 3:

nohup ./startOSW.sh 60 10 &

This would start the tool, put the process in the background, enable the tool to continue running after the session has been terminated, collect data at 60 second intervals, and log the last 10 hours of data to archive files.


 

Stopping OSW

To stop the OSW utility execute the stopOSW.sh command from the directory where OSW was installed. This terminates all the processes associated with the tool.

Example:

./stopOSW.sh


 

Diagnostic Data Output

As stated above, when OSW is started for the first time it creates the archive subdirectory under the OSW installation directory. The archive directory contains 7 subdirectories, one for each data collector. These directories are named oswiostat, oswmpstat, oswnetstat, oswprvtnet, oswps, oswtop, and oswvmstat. One file per hour will be generated in each of the 7 OS utility subdirectories with the exception of oswprvtnet which is dependent on having private networks tracing configured. A new file is created at the top of each hour during the time that OSW is running. The file will be in the following format:

__YY.MM.DD.HH24.dat

Details about each type of data file are given in the following sections:

oswiostat
oswmpstat
oswnetstat
oswprvtnet
oswps
oswtop
oswvmstat


 

oswiostat

_iostat_YY.MM.DD:HH24.dat

These files will contain output from the 'iostat' command that is obtained and archived by OSWatcher at specified intervals. These files will only exist if 'iostat' is installed on the OS and if the OSW user has privileges to run the utility.

The iostat command is used for monitoring system input/output device loading by observing the time the physical disks are active in relation to their average transfer rates. This information can be used to change system configuration to better balance the input/output load between physical disks and adapters.

The iostat utility is fairly standard across UNIX platforms, but is really only useful for those platforms that support extended disk statistics: AIX, Solaris and Linux. Also, each platform will have a slightly different version of the iostat utility. You should consult your operating system man pages for specifics. The sample provided below is for Solaris.

OSW runs the iostat utility at the specified interval and stores the data in the oswiostat subdirectory under the archive directory. The data is stored in hourly archive files. Each entry in the file contains a timestamp prefixed by *** embedded in the iostat output. Notice there are 3 entries for each timestamp. You should always ignore the first entry as this entry is always invalid. The second and third entry will be valid but the second entry will be 1 sec later than the timestamp and the third entry will be 2 seconds later than the timestamp.

Sample iostat file produced by OSW

extended device statistics
r/s  w/s   kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
0.0  0.3    0.0   2.1   0.0   0.0     3.4     0.8   0   0  c0t0d0
0.0  2.1    0.1  12.9   0.0   0.0     0.6     0.4   0   0  c0t2d0
0.0  0.0    0.0   0.0   0.0   0.0     0.0     0.0   0   0  fd0
2.9  1.2  240.8   1.5   0.0   0.1     0.0    13.3   0   5  c1t0d0
1.1  0.8   18.0   8.8   0.0   0.0     0.1     5.9   0   1  c1t1d0
0.0  0.0    0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t1d0

Field Descriptions

The iostat output contains summary information for all devices.

Field    Description
r/s      Shows the number of reads/second
w/s      Shows the number of writes/second
kr/s     Shows the number of kilobytes read/second
kw/s     Shows the number of kilobytes written/second
wait     Average number of transactions waiting for service (queue length)
actv     Average number of transactions actively being serviced
wsvc_t   Average service time in wait queue, in milliseconds
asvc_t   Average service time of active transactions, in milliseconds
%w       Percent of time there are transactions waiting for service
%b       Percent of time the disk is busy
device   Device name

What to look for

  • Average service times greater than 20 ms for a long duration.

  • High average wait times.
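The first check can be approximated with a small awk filter. This is only a sketch: the function name slow_devices is hypothetical, and it assumes the Solaris extended statistics layout described in the field table (11 columns, with asvc_t in column 8 and the device name in column 11). Other platforms order their columns differently.

```shell
#!/bin/sh
# Hypothetical filter, not part of OSW: print devices whose asvc_t exceeds
# a limit given in milliseconds.
# Assumed column order (Solaris extended device statistics):
# r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
slow_devices() {
    awk -v limit="$1" '
        # Data rows have 11 fields and start with a number; headers do not.
        NF == 11 && $1 ~ /^[0-9.]+$/ && $8 + 0 > limit + 0 {
            print $11, $8
        }'
}

# Two rows in the layout above, checked against a 10 ms limit
printf '%s\n' \
    '0.0 0.3 0.0 2.1 0.0 0.0 3.4 0.8 0 0 c0t0d0' \
    '2.9 1.2 240.8 1.5 0.0 0.1 0.0 13.3 0 5 c1t0d0' | slow_devices 10
```

A single row over the limit is not a problem by itself. The guidance above concerns service times that stay high for a long duration, so look for the same device appearing across many consecutive snapshots.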


 

oswmpstat

_mpstat_YY.MM.DD:HH24.dat

These files will contain output from the 'mpstat' command that is obtained and archived by OSWatcher at specified intervals. These files will only exist if 'mpstat' is installed on the OS and if the OSW user has privileges to run the utility.

The mpstat command collects and displays performance statistics for all logical CPUs in the system.

The mpstat utility is fairly standard across UNIX platforms. Each platform will have a slightly different version of the mpstat utility. You should consult your operating system man pages for specifics. The sample provided below is for Solaris.

OSW runs the mpstat utility at the specified interval and stores the data in the oswmpstat subdirectory under the archive directory. The data is stored in hourly archive files. Each entry in the file contains a timestamp prefixed by *** embedded in the mpstat output. Notice there are 3 entries for each timestamp. You should always ignore the first entry as this entry is always invalid. The second and third entry will be valid but the second entry will be 1 sec later than the timestamp and the third entry will be 2 seconds later than the timestamp.

Sample mpstat file produced by OSW

***Fri Jan 28 12:50:36 EST 2005
CPU  minf  mjf  xcal  intr  ithr  csw  icsw  migr  smtx  srw  syscl  usr  sys  wt  idl
  0     0    0     0   483   383  118     1     0     0    0     64    0    0   0  100
  0  1268    0     0   486   382  414    42     0     0    0   2902    8   24   0   68
  0     4    0     0   479   379  144     3     0     0    0     96    0    0   0  100

Field Descriptions

Field   Description
CPU     Processor ID
minf    Minor faults
mjf     Major faults
xcal    Processor cross-calls (when one CPU wakes up another by interrupting it).
intr    Interrupts
ithr    Interrupts as threads (except clock)
csw     Context switches
icsw    Involuntary context switches
migr    Thread migrations to another processor
smtx    Number of times a CPU failed to obtain a mutex
srw     Number of times a CPU failed to obtain a read/write lock on the first try
syscl   Number of system calls
usr     Percentage of CPU cycles spent on user processes
sys     Percentage of CPU cycles spent on system processes
wt      Percentage of CPU cycles spent waiting on event
idl     Percentage of unused CPU cycles or idle time when the CPU is basically doing nothing

 

What to look for

  • Involuntary context switches (this is probably the more relevant statistic when examining performance issues).

  • Number of times a CPU failed to obtain a mutex. Values consistently greater than 200 per CPU cause system time to increase.

  • xcal is very important; it shows processor migration.


 

oswnetstat

_netstat_YY.MM.DD:HH24.dat

These files will contain output from the 'netstat' command that is obtained and archived by OSWatcher at specified intervals. These files will only exist if 'netstat' is installed on the OS and if the OSW user has privileges to run the utility.

The netstat command displays current TCP/IP network connections and protocol statistics.

The netstat utility is standard across UNIX platforms. Each platform will have a slightly different version of the netstat utility. You should consult your operating system man pages for specifics. The sample provided below is for Solaris.

OSW runs the netstat utility at the specified interval and stores the data in the oswnetstat subdirectory under the archive directory. The data is stored in hourly archive files. Each entry in the file contains a timestamp prefixed by *** embedded in the netstat output. Notice there are 3 entries for each timestamp. You should always ignore the first entry as this entry is always invalid. The second and third entry will be valid but the second entry will be 1 sec later than the timestamp and the third entry will be 2 seconds later than the timestamp.

The netstat utility has many command line flags, and the most commonly used to troubleshoot RAC is "ia(n)" for the interface level output and "s" for the protocol level statistics. The following are examples for the two different command parameters.

The command line options "-ain" have these effects:

Option   Description
-a       The command output will use the logical names of the interface. It will also report the name of the IP address found through normal IP address resolution methods.
-i       This triggers the interface-specific statistics, the columns of which are described under "Section 1: Netstat -ain" in the field descriptions below.
-n       This causes the output to use IP addresses instead of the resolved names.

Sample netstat file produced by OSW

***Fri Jan 28 12:50:36 EST 2005
Name  Mtu   Net/Dest     Address       Ipkts   Ierrs  Opkts   Oerrs  Collis  Queue
lo0   8232  127.0.0.0    127.0.0.1     296065  0      296065  0      0       0
eri0  1500  138.1.140.0  138.1.140.96  017624421919510
RAWIP       
 rawipInDatagrams=0 rawipInErrors=0
 rawipInCksumErrs=0 rawipOutDatagrams=0
 rawipOutErrors=0    
UDP       
 udpInDatagrams=295719 udpInErrors=0
 udpOutDatagrams=295671 udpOutErrors=0
TCP       
 tcpRtoAlgorithm=4 tcpRtoMin=400
 tcpRtoMax=60000 tcpMaxConn=-1
 tcpActiveOpens=27 tcpPassiveOpens=21
 tcpAttemptFails=6 tcpEstabResets=0
 tcpCurrEstab=15 tcpOutSegs=691
 tcpOutDataSegs=479 tcpOutDataBytes=43028
 tcpRetransSegs=0 tcpRetransBytes=0
 tcpOutAck=212 tcpOutAckDelayed=83
 tcpOutUrg=0 tcpOutWinUpdate=0
 tcpOutWinProbe=0 tcpOutControl=85
 tcpOutRsts=10 tcpOutFastRetrans=0
 tcpInSegs=915
 tcpInAckSegs=489 tcpInAckBytes=43023
 tcpInDupAck=42 tcpInAckUnsent=0
 tcpInInorderSegs=477 tcpInInorderBytes=40640
 tcpInUnorderSegs=0 tcpInUnorderBytes=0
 tcpInDupSegs=0 tcpInDupBytes=0
 tcpInPartDupSegs=0 tcpInPartDupBytes=0
 tcpInPastWinSegs=0 tcpInPastWinBytes=0
 tcpInWinProbe=0 tcpInWinUpdate=0
 tcpInClosed=0 tcpRttNoUpdate=0
 tcpRttUpdate=462 tcpTimRetrans=0
 tcpTimRetransDrop=0 tcpTimKeepalive=80
 tcpTimKeepaliveProbe=0 tcpTimKeepaliveDrop=0
 tcpListenDrop=0 tcpListenDropQ0=0
 tcpHalfOpenDrop=0 tcpOutSackRetrans=0
IPv4       
 ipForwarding=2 ipDefaultTTL=255
 ipInReceives=17858585 ipInHdrErrors=0
 ipInAddrErrors=0 ipInCksumErrs=0
 ipForwDatagrams=0 ipForwProhibits=0
 ipInUnknownProtos=0 ipInDiscards=0
 ipInDelivers=296623 ipOutRequests=17624403
 ipOutDiscards=0 ipOutNoRoutes=827
 ipReasmTimeout=60 ipReasmReqds=0
 ipReasmOKs=0 ipReasmFails=0
 ipReasmDuplicates=0 ipReasmPartDups=0
 ipFragOKs=0 ipFragFails=0
 ipFragCreates=0 ipRoutingDiscards=0
 tcpInErrs=0 udpNoPorts=225722
 udpInCksumErrs=0 udpInOverflows=0
 rawipInOverflows=0 ipsecInSucceeded=0
 ipsecInFailed=0 ipInIPv6=0
 ipOutIPv6=0 ipOutSwitchIPv6=5
IPv6       
 ipv6Forwarding=2 ipv6DefaultHopLimit=255
 ipv6InReceives=0 ipv6InHdrErrors=0
 ipv6InTooBigErrors=0 ipv6InNoRoutes=0
 ipv6InAddrErrors=0 ipv6InUnknownProtos=0
 ipv6InTruncatedPkts=0 ipv6InDiscards=0
 ipv6InDelivers=0 ipv6OutForwDatagrams=0
 ipv6OutRequests=0 ipv6OutDiscards=0
 ipv6OutNoRoutes=0 ipv6OutFragOKs=0
 ipv6OutFragFails=0 ipv6OutFragCreates=0
 ipv6ReasmReqds=0 ipv6ReasmOKs=0
 ipv6ReasmFails=0 ipv6InMcastPkts=0
 ipv6OutMcastPkts=0 ipv6ReasmDuplicates=0
 ipv6ReasmPartDups=0 ipv6ForwProhibits=0
 udpInCksumErrs=0 udpInOverflows=0
 rawipInOverflows=0 ipv6InIPv4=0
 ipv6OutIPv4=0 ipv6OutSwitchIPv4=0
ICMPv4       
 icmpInMsgs=17624914 icmpInErrors=0
 icmpInCksumErrs=0 icmpInUnknowns=0
 icmpInDestUnreachs=72 icmpInTimeExcds=0
 icmpInParmProbs=0 icmpInSrcQuenchs=0
 icmpInRedirects=0 icmpInBadRedirects=0
 icmpInEchos=17624842 icmpInEchoReps=0
 icmpInTimestamps=0 icmpInTimestampReps=0
 icmpInAddrMasks=0 icmpInAddrMaskReps=0
 icmpInFragNeeded=0 icmpOutMsgs=17624920
 icmpOutDrops=225716 icmpOutErrors=0
 icmpOutDestUnreachs=78 icmpOutTimeExcds=0
 icmpOutParmProbs=0 icmpOutSrcQuenchs=0
 icmpOutRedirects=0 icmpOutEchos=0
 icmpOutEchoReps=17624842 icmpOutTimestamps=0
 icmpOutTimestampReps=0 icmpOutAddrMasks=0
 icmpOutAddrMaskReps=0 icmpOutFragNeeded=0
 icmpInOverflows=0    
ICMPv6       
 icmp6InMsgs=0 icmp6InErrors=0
 icmp6InDestUnreachs=0 icmp6InAdminProhibs=0
 icmp6InTimeExcds=0 icmp6InParmProblems=0
 icmp6InPktTooBigs=0 icmp6InEchos=0
 icmp6InEchoReplies=0 icmp6InRouterSols=0
 icmp6InRouterAds=0 icmp6InNeighborSols=0
 icmp6InNeighborAds=0 icmp6InRedirects=0
 icmp6InBadRedirects=0 icmp6InGroupQueries=0
 icmp6InGroupResps=0 icmp6InGroupReds=0
 icmp6InOverflows=0    
 icmp6OutMsgs=0 icmp6OutErrors=0
 icmp6OutDestUnreachs=0 icmp6OutAdminProhibs=0
 icmp6OutTimeExcds=0 icmp6OutParmProblems=0
 icmp6OutPktTooBigs=0 icmp6OutEchos=0
 icmp6OutEchoReplies=0 icmp6OutRouterSols=0
 icmp6OutRouterAds=0 icmp6OutNeighborSols=0
 icmp6OutNeighborAds=0 icmp6OutRedirects=0
 icmp6OutGroupQueries=0 icmp6OutGroupResps=0
 icmp6OutGroupReds=0    
IGMP:   
 2490 messages received
 0 messages received with too few bytes
 0 messages received with bad checksum
 2490 membership queries received
 0 membership queries received with invalid field(s)
 0 membership reports received
 0 membership reports received with invalid field(s)
 0 membership reports received for groups to which we belong
    
 0 membership reports sent

Field Descriptions:

The netstat output produced by OSW contains 2 sections. The first section contains information about all the network interfaces. The second section contains information about per-protocol statistics.

Section 1: Netstat -ain

Field      Description
Name       Device name of interface
Mtu        Maximum transmission unit
Net/Dest   Network segment address
Address    Network address of the device
Ipkts      Input packets
Ierrs      Input errors
Opkts      Output packets
Oerrs      Output errors
Collis     Collisions
Queue      Number in the queue

Section 2: Protocol Statistics

The per-protocol statistics can be divided into several categories:

  • RAWIP (raw IP) packets
  • TCP packets
  • IPv4 packets
  • ICMPv4 packets
  • IPv6 packets
  • ICMPv6 packets
  • UDP packets
  • IGMP packets

Each protocol type has a specific set of measures associated with it. Network analysis requires evaluation of these measurements on an individual level and all together to examine the overall health of the network communications.

The TCP protocol is used the most in Oracle database and applications. Some implementations for RAC use UDP for the interconnect protocol instead of TCP. The statistics cannot be divided up on a per-interface basis, so these should be compared to the "-i" statistics above.

What to look for:

Section 1

The information in Section 1 will help diagnose network problems when there is connectivity but response is slow.

Values to look at:

  • Collisions (Collis)
  • Output packets (Opkts)
  • Input errors (Ierrs)
  • Input packets (Ipkts)

The above values provide the information needed to work out the network collision rate as follows:

Network collision rate = Collisions (Collis) / Output packets (Opkts)

For a switched network, the collisions should be 0.1 percent or less of the output packets (see the Cisco web site as a reference). Excessive collisions could cause the switch port that the interface is plugged into to segment, or pull itself off-line, amongst other switch-related issues.

For the input error statistics:

Input Error Rate = Ierrs / Ipkts.

If the input error rate is high (over 0.25 percent), the host is excessively dropping packets. This could mean there is a mismatch of the duplex or speed  settings of the interface card and switch.  It could also imply a failed patch cable.
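Both rates can be computed from an interface row with awk, as sketched below. The function name interface_rates is hypothetical, the column positions assume the Solaris "netstat -ain" layout shown in the sample, and the interface name and counters in the usage line are made up for illustration.

```shell
#!/bin/sh
# Hypothetical filter, not part of OSW: per-interface collision and input
# error rates, expressed as percentages.
# Assumed column order (Solaris netstat -ain):
# Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
interface_rates() {
    awk '
        # Data rows have 10 fields with a numeric Ipkts column.
        NF == 10 && $5 ~ /^[0-9]+$/ {
            coll = ($7 > 0) ? 100 * $9 / $7 : 0    # Collis / Opkts
            ierr = ($5 > 0) ? 100 * $6 / $5 : 0    # Ierrs / Ipkts
            printf "%s collision=%.2f%% input_error=%.2f%%\n", $1, coll, ierr
        }'
}

# Made-up counters for a placeholder interface named net0
echo 'net0 1500 10.0.0.0 10.0.0.1 200000 1000 100000 0 50 0' | interface_rates
```

For these made-up counters the collision rate is 0.05 percent, which is within the 0.1 percent guideline, while the input error rate is 0.50 percent, which is above the 0.25 percent guideline.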

If ierrs or oerrs show an excessive amount of errors, more information can be found by examination of the netstat -s output.

For Sun systems, further information about a specific interface can be found by using the "-k" option for netstat. The output will give fuller statistics for the device, but this option is not mentioned in the netstat man page. More information can be found at http://sunsolve.sun.com/.

Section 2

The information in Section 2 contains the protocol statistics.

Many performance problems associated with the network involve the retransmission of TCP packets. The retransmission rates are calculated as shown below.

To find the segment retransmission rate:

%segment-retrans = ( tcpRetransSegs / tcpOutDataSegs ) * 100

To find the byte retransmission rate:

%byte-retrans = ( tcpRetransBytes / tcpOutDataBytes ) * 100

Most network analyzers report TCP retransmissions as segments (frames) and not in bytes.
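The two formulas above can be applied to the name=value pairs of the protocol statistics with a short filter. This is a sketch only: the function name retrans_rates is hypothetical, and it assumes the Solaris output style shown in the sample, with pairs separated by spaces. If the input holds several snapshots, the last values read are the ones used.

```shell
#!/bin/sh
# Hypothetical filter, not part of OSW: TCP retransmission percentages from
# Solaris-style "netstat -s" name=value pairs.
retrans_rates() {
    # Put every name=value pair on its own line, then pick out the four
    # counters used by the formulas.
    tr -s ' \t' '\n\n' | awk -F= '
        $1 == "tcpRetransSegs"  { rs = $2 }
        $1 == "tcpOutDataSegs"  { os = $2 }
        $1 == "tcpRetransBytes" { rb = $2 }
        $1 == "tcpOutDataBytes" { ob = $2 }
        END {
            if (os > 0) printf "segment-retrans=%.2f%%\n", 100 * rs / os
            if (ob > 0) printf "byte-retrans=%.2f%%\n", 100 * rb / ob
        }'
}

# The two relevant lines from the sample above
printf '%s\n' \
    ' tcpOutDataSegs=479 tcpOutDataBytes=43028' \
    ' tcpRetransSegs=0 tcpRetransBytes=0' | retrans_rates
```

For the sample values both rates are zero, since no segments or bytes were retransmitted.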


 

oswprvtnet

_prvtnet_YY.MM.DD:HH24.dat

These files will contain output from the traceroute commands listed in the private.net file, obtained and archived by OSWatcher at specified intervals. These files will only exist if the private.net file exists and is executable, if traceroute is installed on the OS, and if the OSW user has privileges to run the utility.

Information about the status of RAC private networks should be collected. This requires the user to manually add entries for these private networks into the private.net file located in the base osw directory. Instructions on how to do this are contained in the README file.

OSW uses the traceroute command to obtain the status of these private networks. Each operating system uses slightly different arguments to the traceroute command. Examples of the syntax to use for each operating system are contained in the sample Example private.net file located in the base osw directory. This will result in the output appearing differently across UNIX platforms. OSW runs the private.net file at the specified interval and stores the data in the oswprvtnet subdirectory under the archive directory. The data is stored in hourly archive files. Each entry in the file contains a timestamp prefixed by *** embedded in the traceroute output.

Sample file produced by OSW

***Fri Jan 28 12:50:36 EST 2005

traceroute to celdecclu2.us.oracle.com (138.2.71.112): 1-30 hops
(initial packetsize = 1500)
  1  celdecclu2.us.oracle.com (138.2.71.112) 1.95ms  2.92 ms 1.95 ms

What to Look For

  • Example 1:  Interface is up and responding:

traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 1492 byte packets
1 X.X.X.X 1.015 ms 0.766 ms 0.755 ms

 

  • Example 2:  Target interface is not on a directly connected network, so validate that the address is correct and that the switch it is plugged into is on the same VLAN (or look for another issue):

traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 40 byte packets
traceroute: host X.X.X.X is not on a directly-attached network

 

  • Example 3:  Network is unreachable:

traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 40 byte packets
Network is unreachable
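The three cases above can be told apart mechanically by matching the message text, as in the sketch below. The function name prvtnet_status and the status words it prints are hypothetical. The patterns are taken from the example messages above, and the exact wording may differ on other platforms.

```shell
#!/bin/sh
# Hypothetical filter, not part of OSW: classify one traceroute result using
# the message patterns from the three examples above.
prvtnet_status() {
    awk '
        /not on a directly-attached network/ { print "not-directly-attached"; found = 1; exit }
        /Network is unreachable/             { print "unreachable"; found = 1; exit }
        / ms/                                { up = 1 }   # a hop line with timings
        END { if (!found) print (up ? "responding" : "unknown") }'
}

printf '%s\n' \
    'traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 40 byte packets' \
    'Network is unreachable' | prvtnet_status
```

This only inspects one traceroute result at a time. To scan a whole hourly file, the input would first need to be split on the *** timestamp lines.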


 

oswps

_ps_YY.MM.DD:HH24.dat

These files will contain output from the 'ps' command that is obtained and archived by OSWatcher at specified intervals. These files will only exist if 'ps' is installed on the OS and if the OSW user has privileges to run the utility.

The ps (process state) command lists all the processes currently running on the system and provides information about CPU consumption, process state, priority of the process, etc. The ps command has a number of options to control which processes are displayed, and how the output is formatted. OSW runs the ps command with the -elf option.

The ps command is fairly standard across UNIX platforms. Each platform will have a slightly different version of the ps utility. You should consult your operating system man pages for specifics. The sample provided below is for Solaris.

OSW runs the ps command at the specified interval and stores the data in the oswps subdirectory under the archive directory. The data is stored in hourly archive files. Each entry in the file contains a timestamp prefixed by *** embedded in the ps output.

Sample ps file produced by OSW

***Wed Feb 2 09:26:54 EST 2005
 F  S  UID      PID  PPID  C  PRI  NI  ADDR   SZ  WCHAN  STIME   TTY  TIME  CMD
19  T  root       0     0  0    0  SY  ?       0         Jan 31  ?    0:13  sched
 8  S  root       1     0  0   41  20  ?     107  ?      Jan 31  ?    0:00  /etc
19  S  root       2     0  0    0  SY  ?       0  ?      Jan 31  ?    0:00  page
19  S  root       3     0  0    0  SY  ?       0  ?      Jan 31  ?    0:50  fsflu
 8  S  root     355     1  0   41  20  ?     232  ?      Jan 31  ?    0:00  /usr/
 8  S  root     297   296  0   41  20  ?     379  ?      Jan 31  ?    0:00  htt_s
 8  S  cedavis  391   381  0   89  20  ?     301  ?      Jan 31  ?    0:00  /usr/

Field Descriptions

Field   Description
f       Flags
s       State of the process
uid     The effective user ID number of the process
pid     The process ID of the process
ppid    The process ID of the parent process.
c       Processor utilization for scheduling (obsolete).
pri     The priority of the process.
ni      Nice value, used in priority computation.
addr    The memory address of the process.
sz      The total size of the process in virtual memory, including all mapped files and devices, in pages.
wchan   The address of an event for which the process is sleeping (if blank, the process is running).
stime   The starting time of the process, given in hours, minutes, and seconds.
tty     The controlling terminal for the process (the message ?, is printed when there is no controlling terminal).
time    The cumulative execution time for the process.
cmd     The command name the process is executing.

What to look for

  • The information in the ps command will primarily be used as supporting information for RAC diagnostics. For example, the status of a process prior to a system crash may be important for root cause analysis. The amount of memory a process is consuming is another example of how this data can be used.


 

oswtop

_top_YY.MM.DD:HH24.dat

These files will contain output from the 'top' command that is obtained and archived by OSWatcher at specified intervals. These files will only exist if 'top' is installed on the OS and if the OSW user has privileges to run the utility.

Top is a program that will give continual reports about the state of the system, including a list of the top CPU using processes. Top has three primary design goals:

  • provide an accurate snapshot of the system and process state,
  • not be one of the top processes itself,
  • be as portable as possible.

Each operating system uses a different version of the UNIX utility top. This will result in the top output appearing differently across UNIX platforms. You should consult your operating system man pages for specifics. The sample provided below is for Solaris.

OSW runs the top utility at the specified interval and stores the data in the oswtop subdirectory under the archive directory. The data is stored in hourly archive files. Each entry in the file contains a timestamp prefixed by *** embedded in the top output.

Sample top file produced by OSW

***Fri Jan 28 12:50:36 EST 2005
load averages: 0.11, 0.07, 0.06 12:50:36
136 processes: 133 sleeping, 2 running, 1 on cpu

Memory: 2048M real, 1061M free, 542M swap in use, 1605M swap free
  PID  USERNAME  THR  PRI  NICE   SIZE    RES  STATE    TIME    CPU  COMMAND
  704  cedavis    16   49     0   346M   276M  sleep  222:33  3.51%  java
  362  root        1   59     0    34M    75M  sleep   11:49  0.21%  Xsun
20675  cedavis     1    0     0  1584K  1064K  cpu      0:00    19%  top
20640  cedavis     1    0     0  1904K  1240K  sleep    0:00  0.14%  OSWatcher.sh
20657  cedavis     1   20     0  1904K  1240K  sleep    0:00  0.14%  oswsub.sh
16881  cedavis     1   59     0   199M   159K  sleep   23:04  0.10%  oracle
20671  cedavis     1    0     0  1904K  1240K  run      0:00  0.09%  oswsub.sh
20653  cedavis     1    0     0  1904K  1240K  sleep    0:00  0.09%  OSWatcherFM.sh
20665  cedavis     1    0     0  1904K  1240K  sleep    0:00  0.09%  oswsub.sh
20672  cedavis     1    0     0  1264K  1031K  sleep    0:00  0.09%  iostat
20659  cedavis     1   10     0  1904K  1240K  sleep    0:00  0.09%  oswsub.sh
20661  cedavis     1   30     0  1096K   880K  sleep    0:00  0.09%  vmstat
20668  cedavis     1    0     0  1904K  1240K  run      0:00  0.05%  oswsub.sh
20674  cedavis     1    0     0   968K   624K  sleep    0:00  0.05%  sleep
20663  cedavis     1   20     0  1080K   864K  sleep    0:00  0.05%  mpstat

Field Descriptions

load averages: 0.11, 0.07, 0.06 12:50:36

This line displays the load averages over the last 1, 5 and 15 minutes as well as the system time. This is quite handy as top basically includes a timestamp along with the data capture.

Load average is defined as the average number of processes in the run queue. A runnable Unix process is one that is available right now to consume CPU resources and is not blocked on I/O or on a system call. The higher the load average, the more work your machine is doing.

The three numbers are the average of the depth of the run queue over the last 1, 5, and 15 minutes. In this example we can see that .11 processes were on the run queue on average over the last minute, .07 processes on average on the run queue over the last 5 minutes, etc. It is important to determine what the average load of the system is through benchmarking and then look for deviations. A dramatic rise in the load average can indicate a serious performance problem.

136 processes: 133 sleeping, 2 running, 1 on cpu

This line displays the total number of processes running at the time of the last update. It also indicates how many Unix processes exist, how many are sleeping (blocked on I/O or a system call), how many are stopped (someone in a shell has suspended it), and how many are actually assigned to a CPU. This last number will not be greater than the number of processors on the machine, and the value should also correlate to the machine's load average provided the load average is less than the number of CPUs. Like load average, the total number of processes on a healthy machine usually varies just a small amount over time. Suddenly having a significantly larger or smaller number of processes could be a warning sign.

Memory: 2048M real, 1061M free, 542M swap in use, 1605M swap free

The "Memory:" line is very important. It reflects how much real and swap memory a computer has, and how much is free. "Real" memory is the amount of RAM installed in the system, a.k.a. the "physical" memory. "Swap" is virtual memory stored on the machine's disk.

Once a computer runs out of physical memory, and starts using swap space, its performance deteriorates dramatically. If you run out of swap, you'll likely crash your programs or the OS.

Individual process fields

Field      Description
PID        Process ID of process
USERNAME   Username of process
THR        Process thread
PRI        Priority of process
NICE       Nice value of process
SIZE       Total size of a process, including code and data, plus the stack space in kilobytes
RES        Amount of physical memory used by the process
STATE      Current CPU state of process. The states can be S for sleeping, D for uninterrupted, R for running, T for stopped/traced, and Z for zombied
TIME       The CPU time that a process has used since it started
%CPU       The CPU time that a process has used since the last update
COMMAND    The task's command name

What to Look For

  • Large run queue. A large number of processes waiting in the run queue may be an indication that your system does not have sufficient CPU capacity.
  • Process consuming lots of CPU. A process which is "hogging" CPU is always suspect. If this process is an Oracle foreground process, it is most likely running an expensive query that should be tuned. Oracle background processes should not hog CPU for long periods of time.
  • High load averages. Processes should not be backed up on the run queue for extended periods of time.
  • Low swap space. This is an indication you are running low on memory.

Back to Contents

 

oswvmstat

_vmstat_YY.MM.DD:HH24.dat

These files will contain output from the 'vmstat' command that is obtained and archived by OSWatcher at specified intervals. These files will only exist if 'vmstat' is installed on the OS and if the OSW user has privileges to run the utility.

The name vmstat comes from "report virtual memory statistics".  The vmstat utility does a bit more than this, though. In addition to reporting virtual memory, vmstat reports certain kernel statistics about processes, disk, trap, and CPU activity.

The vmstat utility is fairly standard across UNIX platforms, but each platform will have a slightly different version of it. You should consult your operating system man pages for specifics. The sample provided below is for Solaris.

OSW runs the vmstat utility at the specified interval and stores the data in the oswvmstat subdirectory under the archive directory. The data is stored in hourly archive files. Each entry in the file contains a timestamp prefixed by *** embedded in the vmstat output. Notice there are 3 entries for each timestamp. You should always ignore the first entry, as it is always invalid: the first line vmstat prints summarizes activity since boot rather than the sampling interval. The second and third entries will be valid, but the second entry will be 1 second later than the timestamp and the third entry will be 2 seconds later than the timestamp.

Sample vmstat file produced by OSW

***Fri Jan 28 12:50:36 EST 2005
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr dd f0 s0 in sy cs us sy id
000176134412465201600000200038013649004195
0001643920 1086776 331148581616003100044749661315153154
0001643872 1086728 60000000000389147293200100
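The file layout described above (a *** timestamp line, header lines, then three samples of which the first must be ignored) can be handled with a small parser. The sketch below is only an illustration: the function name is hypothetical, the input uses a shortened header with made-up values, and it assumes the columns in the real archive file are separated by whitespace.

```python
def parse_osw_vmstat(text):
    """Return [(timestamp, [row, ...])], keeping only the valid samples."""
    snapshots = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("***"):
            # A new timestamped entry begins here.
            snapshots.append((line[3:].strip(), []))
        elif snapshots and line and line[0].isdigit():
            # Data rows start with a digit; header rows start with a letter.
            snapshots[-1][1].append([int(v) for v in line.split()])
    # The first data row after each timestamp is always invalid, so drop it.
    return [(ts, rows[1:]) for ts, rows in snapshots]

# Made-up sample with a reduced set of columns, for illustration only.
sample = """***Fri Jan 28 12:50:36 EST 2005
 r b w swap free sr us sy id
 0 0 0 1700000 1200000 0 0 4 96
 0 0 0 1600000 1000000 0 15 31 54
 0 0 0 1600000 1000000 0 0 0 100
"""

parsed = parse_osw_vmstat(sample)
for timestamp, rows in parsed:
    print(timestamp, rows)
```

Each remaining row can then be paired with the header names for the platform in question before any analysis is done.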

Field Descriptions

The vmstat output is actually broken up into six sections: procs, memory, page, disk, faults and CPU. Each section is outlined in the following table.

 

Field   Description
PROCS
r       Number of processes that are in a wait state and basically not doing anything but waiting to run
b       Number of processes that are blocked waiting for resources such as I/O or paging
w       Number of processes that have been swapped out by mm and vm subsystems and have yet to run
MEMORY
swap    The amount of swap space currently available
free    The size of the free list
PAGE
re      page reclaims
mf      minor faults
pi      kilobytes paged in
po      kilobytes paged out
fr      kilobytes freed
de      anticipated short-term memory shortfall (Kbytes)
sr      pages scanned by clock algorithm
DISK
bi      Disk blocks sent to disk devices in blocks per second
FAULTS
in      Interrupts per second, including the CPU clocks
sy      System calls
cs      Context switches per second within the kernel
CPU
us      Percentage of CPU cycles spent on user processes
sy      Percentage of CPU cycles spent on system processes
id      Percentage of unused CPU cycles or idle time when the CPU is basically doing nothing

What to Look For

The following information should be used as guidelines and not considered hard and fast rules. The information documented below comes from Adrian Cockcroft's book, Sun Performance Tuning. Other operating systems like HP-UX and Linux may have different thresholds.

  • Large run queue. Adrian Cockcroft defines anything over 4 processes per CPU on the run queue as the threshold for CPU saturation. This is certainly a problem if it lasts for any long period of time.

  • CPU utilization. The amount of time spent running system code should not exceed 30% especially if idle time is close to 0%.

  • A combination of large run queue with no idle CPU is an indication the system has insufficient CPU capacity.

  • Memory bottlenecks are determined by the scan rate (sr). The scan rate is the number of pages scanned by the clock algorithm per second. If the scan rate (sr) is continuously over 200 pages per second, then there is a memory shortage.

  • Disk problems may be identified if the number of processes blocked exceeds the number of processes on run queue.
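The guidelines above can be turned into a simple automated check over a series of vmstat samples. The sketch below is one possible reading of them, not an OSW feature: the function and the dictionary keys (cpu_sy is used for system CPU time to distinguish it from the system-call column) are hypothetical, the 5% cutoff for "idle time close to 0%" is an assumption, and the sample values are made up. The thresholds are the Solaris figures quoted above and may not suit other platforms.

```python
def vmstat_warnings(samples, ncpus, idle_floor=5):
    """Return guideline warnings for a list of vmstat samples (dicts of fields)."""
    if not samples:
        return []
    warnings = []
    # Run queue over 4 processes per CPU in every sample: sustained saturation.
    if all(s["r"] > 4 * ncpus for s in samples):
        warnings.append("sustained run queue above 4 processes per CPU")
    # System time over 30% while idle time is close to 0%.
    if any(s["cpu_sy"] > 30 and s["id"] <= idle_floor for s in samples):
        warnings.append("system CPU time above 30% with little idle time")
    # Scan rate continuously over 200 pages per second.
    if all(s["sr"] > 200 for s in samples):
        warnings.append("scan rate continuously above 200 pages per second")
    # More blocked processes than processes on the run queue.
    if any(s["b"] > s["r"] for s in samples):
        warnings.append("blocked processes exceed run queue length")
    return warnings

# Made-up samples for illustration only.
healthy = [{"r": 0, "b": 0, "sr": 0, "cpu_sy": 4, "id": 95}]
stressed = [
    {"r": 10, "b": 12, "sr": 350, "cpu_sy": 40, "id": 0},
    {"r": 9, "b": 11, "sr": 420, "cpu_sy": 35, "id": 2},
]

print(vmstat_warnings(healthy, ncpus=2))
for warning in vmstat_warnings(stressed, ncpus=2):
    print(warning)
```

Using all() for the run queue and scan rate checks reflects the wording "for any long period of time" and "continuously"; a single spike does not trigger those warnings.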

Back to Contents

 

Graphing the Output

A new utility, OSWg, has been added to OSW. This utility provides the ability to graph vmstat and iostat data. See the OSWg User Guide for more information. To see a sample of the OSWg output, click here. To add database metrics, use the LTOM profiler. Click here to see a sample LTOM profile.

Sample Graph

OSWG_MemoryFree_24hrs.gif

Back to Contents

 

Known Issues

No issues to report.

Back to Contents

 

Download

Current Unix Version 2.1.2  December 2008

Click here to download the file (change the filename to OSW.tar when saving)

If a file download dialog box does not appear when clicking on the above link, you may need to clear your web browser's cache and/or restart your web browser. If you are still unable to download the file, you may request that we email you a copy: mailto:carl.davis@oracle.com?subject=Unix OSW Request:

Back to Contents

 

Reporting Feedback

If you encounter problems running OSW that are not listed under the Known Issues section, or would like to provide comments/feedback about OSW (including enhancement requests), please send email to mailto:carl.davis@oracle.com?subject=Unix OSW Feedback:

Back to Contents

 

Sending Files To Support

For those users running RAC-DDT, the OSW archive directory will be automatically included in the RAC-DDT.tar.Z compressed archive file. For more information on RAC-DDT see <>. For users not running RAC-DDT, create a tarball of the archive directory to send in to support by executing the tarupfiles.sh file in the osw directory.

Back to Contents

Legal Notices and Terms of Use

From the ITPUB blog, link: http://blog.itpub.net/16426127/viewspace-592626/. If you repost this article, please credit the source; otherwise legal action will be pursued.
