PERF--Advanced Tuning with Statspack(ZT)

cuifan5176

于 2019-07-18 11:57:02 发布

阅读量171

点赞数

Advanced Tuning with Statspack

By Rich Niemiec

But wait, there's more! What waits mean in Statspack reports and how to tune them out.

If you could choose just two Oracle utilities to find and monitor performance problems in your Oracle9i Database system, those two utilities would be Oracle Enterprise Manager (now available in Release 4.0) and Statspack. As of Oracle8i Release 8.1.6, the Statspack utility replaced the UTLBSTAT/UTLESTAT scripts available for performance monitoring with earlier versions of Oracle Database; Statspack offers several significant enhancements to those scripts. This column focuses on solutions to advanced issues regarding wait events—those events for which your system had to wait during processing for a resource to become available, an action to complete, and so on.

Top 5 Wait Events

When you are trying to eliminate bottlenecks on your system, your Statspack report's Top 5 Wait Events section is the first place to look. This section of the report shows the top 5 wait events, the full list of wait events, and the background wait events. If your system's TIMED_STATISTICS initialization parameter is set to true, the events are ordered in time waited, which is preferable, since all events don't show the waits. If TIMED_STATISTICS is false, the events are ordered by the number of waits.

Listing 1 shows a large number of waits related to reading a single block (db file sequential read) as well as waits for latches (latch free). You can see in this listing high waits for some of the writing to datafiles and log files. To identify which of these are major issues, you must narrow down the list by investigating the granular reports within other sections of Statspack.

Resolving Your Wait Events

The following are 10 of the most common causes for wait events, along with explanations and potential solutions:

1. DB File Scattered Read. This generally indicates waits related to full table scans. As full table scans are pulled into memory, they rarely fall into contiguous buffers but instead are scattered throughout the buffer cache. A large number here indicates that your table may have missing or suppressed indexes. Although it may be more efficient in your situation to perform a full table scan than an index scan, check to ensure that full table scans are necessary when you see these waits. Try to cache small tables to avoid reading them in over and over again, since a full table scan is put at the cold end of the LRU (Least Recently Used) list.

2. DB File Sequential Read. This event generally indicates a single block read (an index read, for example). A large number of waits here could indicate poor joining orders of tables, or unselective indexing. It is normal for this number to be large for a high-transaction, well-tuned system, but it can indicate problems in some circumstances. You should correlate this wait statistic with other known issues within the Statspack report, such as inefficient SQL. Check to ensure that index scans are necessary, and check join orders for multiple table joins. The DB_CACHE_SIZE will also be a determining factor in how often these waits show up. Problematic hash-area joins should show up in the PGA memory, but they're also memory hogs that could cause high wait numbers for sequential reads. They can also show up as direct path read/write waits.

3. Free Buffer. This indicates your system is waiting for a buffer in memory, because none is currently available. Waits in this category may indicate that you need to increase the DB_BUFFER_CACHE, if all your SQL is tuned. Free buffer waits could also indicate that unselective SQL is causing data to flood the buffer cache with index blocks, leaving none for this particular statement that is waiting for the system to process. This normally indicates that there is a substantial amount of DML (insert/update/delete) being done and that the Database Writer (DBWR) is not writing quickly enough; the buffer cache could be full of multiple versions of the same buffer, causing great inefficiency. To address this, you may want to consider accelerating incremental checkpointing, using more DBWR processes, or increasing the number of physical disks.

4. Buffer Busy. This is a wait for a buffer that is being used in an unshareable way or is being read into the buffer cache. Buffer busy waits should not be greater than 1 percent. Check the Buffer Wait Statistics section (or V$WAITSTAT) to find out if the wait is on a segment header. If this is the case, increase the freelist groups or increase the pctused to pctfree gap. If the wait is on an undo header, you can address this by adding rollback segments; if it's on an undo block, you need to reduce the data density on the table driving this consistent read or increase the DB_CACHE_SIZE. If the wait is on a data block, you can move data to another block to avoid this hot block, increase the freelists on the table, or use Locally Managed Tablespaces (LMTs). If it's on an index block, you should rebuild the index, partition the index, or use a reverse key index. To prevent buffer busy waits related to data blocks, you can also use a smaller block size: fewer records fall within a single block in this case, so it's not as "hot." When a DML (insert/update/ delete) occurs, Oracle Database writes information into the block, including all users who are "interested" in the state of the block (Interested Transaction List, ITL). To decrease waits in this area, you can increase the initrans, which will create the space in the block to allow multiple ITL slots. You can also increase the pctfree on the table where this block exists (this writes the ITL information up to the number specified by maxtrans, when there are not enough slots built with the initrans that is specified).

5. Latch Free. Latches are low-level queuing mechanisms (they're accurately referred to as mutual exclusion mechanisms) used to protect shared memory structures in the system global area (SGA). Latches are like locks on memory that are very quickly obtained and released. Latches are used to prevent concurrent access to a shared memory structure. If the latch is not available, a latch free miss is recorded. Most latch problems are related to the failure to use bind variables (library cache latch), redo generation issues (redo allocation latch), buffer cache contention issues (cache buffers LRU chain), and hot blocks in the buffer cache (cache buffers chain). There are also latch waits related to bugs; check MetaLink for bug reports if you suspect this is the case (oracle.com/support). When latch miss ratios are greater than 0.5 percent, you should investigate the issue. I will cover latch waits in detail in my next Oracle Magazine column; the topic requires an article in itself.

6. Enqueue. An enqueue is a lock that protects a shared resource. Locks protect shared resources, such as data in a record, to prevent two people from updating the same data at the same time. An enqueue includes a queuing mechanism, which is FIFO (first in, first out). Note that Oracle's latching mechanism is not FIFO. Enqueue waits usually point to the ST enqueue, the HW enqueue, the TX4 enqueue, and the TM enqueue. The ST enqueue is used for space management and allocation for dictionary-managed tablespaces. Use LMTs, or try to preallocate extents or at least make the next extent larger for problematic dictionary-managed tablespaces. HW enqueues are used with the high-water mark of a segment; manually allocating the extents can circumvent this wait. TX4s are the most common enqueue waits. TX4 enqueue waits are usually the result of one of three issues. The first issue is duplicates in a unique index; you need to commit/rollback to free the enqueue. The second is multiple updates to the same bitmap index fragment. Since a single bitmap fragment may contain multiple rowids, you need to issue a commit or rollback to free the enqueue when multiple users are trying to update the same fragment. The third and most likely issue is when multiple users are updating the same block. If there are no free ITL slots, a block-level lock could occur. You can easily avoid this scenario by increasing the initrans and/or maxtrans to allow multiple ITL slots and/or by increasing the pctfree on the table. Finally, TM enqueues occur during DML to prevent DDL to the affected object. If you have foreign keys, be sure to index them to avoid this general locking issue.

7. Log Buffer Space. This wait occurs because you are writing the log buffer faster than LGWR can write it to the redo logs, or because log switches are too slow. To address this problem, increase the size of the log files, or increase the size of the log buffer, or get faster disks to write to. You might even consider using solid-state disks, for their high speed.

8. Log File Switch. All commit requests are waiting for "logfile switch (archiving needed)" or "logfile switch (chkpt. Incomplete)." Ensure that the archive disk is not full or slow. DBWR may be too slow because of I/O. You may need to add more or larger redo logs, and you may potentially need to add database writers if the DBWR is the problem.

9. Log File Sync. When a user commits or rolls back data, the LGWR flushes the session's redo from the log buffer to the redo logs. The log file sync process must wait for this to successfully complete. To reduce wait events here, try to commit more records (try to commit a batch of 50 instead of one at a time, for example). Put redo logs on a faster disk, or alternate redo logs on different physical disks, to reduce the archiving effect on LGWR. Don't use RAID 5, since it is very slow for applications that write a lot; potentially consider using file system direct I/O or raw devices, which are very fast at writing information.

10. Idle Event. There are several idle wait events listed after the output; you can ignore them. Idle events are generally listed at the bottom of each section and include such things as SQL*Net message to/from client and other background-related timings. Idle events are listed in the stats$idle_event table.

Stay Tuned

In the next issue of Oracle Magazine, I'll investigate latches—another of the top waits you may encounter—looking at the usual latch waits you'll see and how to tune them for maximum performance.

Rich Niemiec is the CEO of TUSC (www.tusc.com) and president of the International Oracle Users Group (www.ioug.org). Thanks to Steve Adams for editing help.

Wait Events Quick Reference Guide
Wait Problem	Potential Fix
DB File Scattered Read	Indicates many full table scans: tune the code; cache small tables.
DB File Sequential Read	Indicates many index reads: tune the code (especially joins).
Free Buffer	Increase the DB_CACHE_SIZE; shorten the checkpoint; tune the code.
Buffer Busy	Segment header: add freelists or freelist groups.
Buffer Busy	Data block: separate "hot" data; use reverse key indexes and/or smaller blocks.
Buffer Busy	Data block: increase initrans and/or maxtrans.
Buffer Busy	Undo header: add rollback segments or areas.
Buffer Busy	Undo block: commit more often; use larger rollback segments or areas.
Latch Free	Investigate the latch detail.
Enqueue—ST	Use LMTs or preallocate large extents.
Enqueue—HW	Preallocate extents above high-water mark.
Enqueue—TX4	Increase initrans and/or maxtrans on the table or index.
Enqueue—TM	Index foreign keys; check application locking of tables.
Log Buffer Space	Increase the log buffer; use faster disks for the redo logs.
Log File Switch	Archive destination slow or full; add more or larger redo logs.
Log File Sync	Commit more records at a time; use faster redo log disks or raw devices.
Idle Event	Ignore it.

Common Idle Events
Event	Idle Event Type
Dispatcher timer	Shared server
Lock manager wait for remote message	Oracle9i Real Application Clusters
Pipe get	User process
pmon timer	Background process
PX Idle wait	Parallel query
PX Deq Credit: need buffer	Parallel query
PX Deq Credit: send blkd	Parallel query
rdbms ipc message	Background process
smon timer	Background process
SQL*Net message from client	User process
virtual Circuit status	Shared server