Performance Improvement Methods

最新推荐文章于 2019-10-31 18:50:16 发布

JasonLiu_2012

最新推荐文章于 2019-10-31 18:50:16 发布

阅读量881

点赞数

分类专栏：翻译文章

翻译文章专栏收录该内容

1 篇文章 0 订阅

订阅专栏

Performance Improvement Methods

This chapter discusses Oracle Database improvement methods and contains the

following sections:

■ The Oracle Performance Improvement Method

■ Emergency Performance Methods

The Oracle Performance Improvement Method

Oracle performance methodology helps you to identify performance problems in an

Oracle database. This involves identifying bottlenecks and fixing them. It is recommended that

changes be made to a system only after you have confirmed thatthere is a bottleneck.

oracle性能优化方法帮助你找出oracle数据库的性能问题。这包括找到性能瓶颈和解决这些问题。

建议一定要等你确认某项是瓶颈后，才能对其采取相应调整。

Performance improvement, by its nature, is iterative. For this reason, removing the

first bottleneck might not lead to performance improvement immediately, because

another bottleneck might be revealed. Also, in some cases, if serialization points move

to a more inefficient sharing mechanism, then performance could degrade. Withexperience,

and by following a rigorous method of bottleneck elimination, applications can be debugged and made scalable.

性能的提升，本来就是一个反复的过程。由于这个原因，解决掉一个瓶颈可能不会马上提升性能，因为另一个

瓶颈可能会暴露出来。除此之外，如果串行化程序搬到一个低效率的共享方式的平台上，性能可能会下降。

根据自己的经验，并依照一套严格的消灭性能瓶颈的方法，应用可以变得稳固和健康。

Performance problems generally result from either a lack of throughput, unacceptable

user/job response time, or both. The problem might be localized between application

modules, or it might be for the entire system.

Before looking at any database or operating system statistics, it is crucial to get

feedback from the most important components of the system: the users of the system and the

peopleultimately paying for the application. Typical user feedback includes statements

like the following:

■ "The online performance is so bad that it prevents my staff from doing their jobs."

■ "The billing run takes too long."

■ "When I experience high amounts of Web traffic, the response time becomes

unacceptable, and I am losing customers."

■ "I am currently performing 5000 trades a day, and the system is maxed out. Next

month, we roll out to all our users, and the number of trades is expected to

quadruple."

性能问题通常是由吞吐量不足或者无法接受的响应时间导致的，或者两者都存在。问题可能存在于

应用的模块，也有可能存在于整个系统。在检查数据库或者操作系统的统计数据之前，有一件很重要

的事情需要做：从系统用户和最终为应用付费的使用者那里获得反馈。

典型的用户反馈包括如下内容：

■“这个在线系统的性能是如此的糟糕，导致我的员工无法完成他们的工作。”

■“这个账单结算的太久了。”

■“当我遇到高流量运行的时候，反应时间变的让人无法接受，然后我的客户流失了。”

■“我现在每天有5000单交易，系统已经爆掉了。下个月的时候，我们会向我们所有的用户推广业务，

每天的交易额预计会翻两番。”

From candid feedback, it is easy to set critical success factors for any performance

work. Determining the performance targets and the performance engineer's exit

criteria make managing the performance process much simpler and more successful at

all levels. These critical success factors are better defined in terms of real business goals

rather than system statistics.

从用户坦白的回馈中，是很容易为性能优化工作定义关键的成功因素。确定性能优化目标和工程师的工作

完成标准，使得性能优化工作的各个阶段更容易完成和更加完美。比起根据统计数据来，根据实际商业目标

可以更好的定义这些关键的成功因素。

Some real business goals for these typical user statements might be:

■ "The billing run must process 1,000,000 accounts in a three-hour window."

■ "At a peak period on a Web site, the response time must not exceed five seconds for a page refresh."

■ "The system must be able to process 25,000 trades in an eight-hour window."

一些对应这些典型的用户情况的实际商业目标可能如下：

■“这个计费功能必须在三小时内完成1000000个账号的运算。”

■“在网络的高峰期，一个页面的刷新时间不能超过5秒钟。”

■“这个系统必须能够在8小时内完成25000笔交易。”

The ultimate measure of success is the user's perception of system performance. The performance

engineer's role is to eliminate any bottlenecks that degrade performance.These bottlenecks could

be caused by inefficient use of limited shared resources or byabuse of shared resources,

causing serialization. Because all shared resources arelimited, the goal of a performance

engineer is to maximize the number of businessoperations with efficient use of shared resources.

At a very high level, the entiredatabase server can be seen as a shared resource. Conversely,

at a low level, a singleCPU or disk can be seen as shared resources.

性能改善成功的最终标准是用户对该系统的感知度。性能工程师的角色是消灭掉一切影响系统性能的瓶颈。

这些性能瓶颈可能是因为低效率使用有限的共享资源或者滥用共享资源引起，造成了排队等待。因为所有的

共享资源是有限的，性能工程师的目标是最大限度提升有效利用共享资源的商业操作的数量。在一个高层面

上来说，整个数据库服务器可以看做是一个共享资源。反过来，在一个低层面上来说，一个CPU或者一个硬盘

也可以看成一个共享资源。

You can apply the Oracle performance improvement method until performance goals

are met or deemed impossible. This process is highly iterative. Inevitably, some investigations

may have little or no impact on database performance. Time andexperience are necessary to

develop the skills to accurately and quickly pinpointcritical bottlenecks. However, prior

experience can sometimes work against theexperienced engineer who neglects to use the data

and statistics available. This type ofbehavior encourages database tuning by myth and folklore.

This is a very risky, expensive, and unlikely to succeed method of database tuning.

你可以实施oracle性能改进方法一直到性能目标已经达到或者根本就没法完成目标。这些过程是高度重复的。

不可避免的是，有些调整可能对性能提升是微乎其微甚至压根起不了作用。时间和经验对准确和快速找出

性能瓶颈是必须的。但是，经验有时候会对那些忽视有用的统计数据的有经验丰富的工程师产生副作用，

这种行为鼓励用经验和传说来对数据库进行调优。这是一种很冒险、代价很高而且不会成功的数据库调优方法。

The Automatic Database Diagnostic Monitor (ADDM) implements parts of the

performance improvement method and analyzes statistics to provide automatic

diagnosis of major performance issues. Using ADDM can significantly shorten the

time required to improve the performance of a system. See Chapter6, "Automatic

Performance Diagnostics" for a description of ADDM.

ADDM实现了部分根据性能改进方法、分析统计数据并提供主要性能问题自动诊断的功能。使用ADDM能显著的

缩短改进系统性能需要的时间。见第六章“Automatic Performance Diagnostics”关于ADDM的介绍。

Systems are so different and complex that hard and fast rules for performance analysis

are impossible. In essence, the Oracle performance improvement method defines a

way of working, but not a definitive set of rules. With bottleneck detection, the only rule is

that there are no rules! The best performance engineers use the data provided

and think laterally to determine performance problems.

各个系统是如此的不同和复杂，所以是不可能有性能分析的硬性规定的。在本质上，oracle性能优化方法

只是定义一种工作方式，但没有一个明确的标准。在性能瓶颈的定位上，唯一的标准是没有标准。最好的

性能工程师都是通过提供的数据来进行横向思维，最终定位性能问题。

Steps in The Oracle Performance Improvement Method

1. Perform the following initial standard checks:

a. Get candid feedback from users. Determine the performance project's scope

and subsequent performance goals, and performance goals for the future. This process

is key in future capacity planning.

b. Get a full set of operating system, database, and application statistics from the

system when the performance is both good and bad. If these are not available,

then get whatever is available. Missing statistics are analogous to missing

evidence at a crime scene: They make detectives work harder and it is more

time-consuming.

c. Sanity-check the operating systems of all computers involved with user performance.

By sanity-checking the operating system, you look for hardware or operating system resources

that are fully utilized. List any over-usedresources as symptoms for analysis later.

In addition, check that all hardware shows no errors or diagnostics.

1、执行如下初始化标准检查：

a. 从用户那里得到坦白的反馈。确定性能优化项目的范围和随后的性能目标，以及将来的性能目标。

这些过程对于未来的容量规划是非常重要的。

b. 从系统拿到全面的操作系统、数据库和应用程序的统计数据，包括性能好的时候和性能坏的时候

的数据。如果这些是不可用的，那就尽可能拿到可用的任何数据。缺少统计数据就像在犯罪现场

丢失证据：这将会造成定位问题变得艰难和更耗时间。

c. 完整检查涉及用户性能的计算机的操作系统。通过完整检查操作系统，你查看硬件或者操作系统资源

是否过度使用了。列出所有过度使用的资源作为之后需要分析的症状。此外，检查所有的硬件，

保证显示无错误和诊断信息。

2. Check for the top ten most common mistakes with Oracle Database, and

determine if any of these are likely to be the problem. List these as symptoms for

later analysis. These are included because they represent the most likely problems.

ADDM automatically detects and reports nine of these top ten issues. See

Chapter6, "Automatic Performance Diagnostics" and "Top Ten Mistakes Found in

Oracle Systems" on page3-4.

2. 检查是否存在oracle数据库的top 10个普遍错误，并检测是否是其中的错误导致了性能问题。

列出这些症状方便之后分析。这些之所以包含在内是因为这些错误代表了最有可能发生的错误。

ADDM自动检测并报告其中九个错误。见第六章，"Automatic Performance Diagnostics" and

"Top Ten Mistakes Found inOracle Systems" on page3-4。

3. Build a conceptual model of what is happening on the system using the symptoms

as clues to understand what caused the performance problems. See "A Sample Decision

Process for Performance Conceptual Modeling" on page3-3.

3. 建立一个在系统上正在发生的事情的概念模型，使用症状做为线索来知道是什么导致了性能问题。

见"A Sample Decision Process for Performance Conceptual Modeling" on page3-3。

4. Propose a series of remedy actions and the anticipated behavior to the system,

then apply them in the order that can benefit the application the most. ADDM produces

recommendations each with an expected benefit. A golden rule inperformance work is that

you only change one thing at a time and then measurethe differences. Unfortunately, system

downtime requirements might prohibitsuch a rigorous investigation method. If multiple changes

are applied at the sametime, then try to ensure that they are isolated so that the effects

of each change canbe independently validated.

4. 计划一系列的针对系统的补救措施和预期的行为，然后根据最能改善系统性能的顺序来实施这些行为。

ADDM产生包含预期改善效果的调整建议。一个性能调整的黄金准则是你在同一时间只能调整改变一样东西，

并验证其效果。不幸的是，系统的宕机时间需要不允许这样严格的研究方法。如果多个调整在同一时间实施，

请努力保证他们是个隔离开的，这样使得每一个调整的效果可以独立验证。

5. Validate that the changes made have had the desired effect, and see if the user's perception of

performance has improved. Otherwise, look for more bottlenecks,and continue refining the conceptual

model until your understanding of theapplication becomes more accurate.

5. 验证所做的调整是否达到了预期的效果，并检查用户的感知度是否得到了改善。如果没有，检查更多的

性能瓶颈，并继续完善你的概念模型，直到你更准确的了解了你的应用。

6. Repeat the last three steps until performance goals are met or become impossible

due to other constraints.

6. 重复最后三个步骤，一直到性能目标达到或者因为其他的约束而没法达成目标。

This method identifies the biggest bottleneck and uses an objective approach to

performance improvement. The focus is on making large performance improvements by increasing

application efficiency and eliminating resource shortages andbottlenecks. In this process,

it is anticipated that minimal (less than 10%) performancegains are made from instance tuning,

and large gains (100% +) are made from isolatingapplication inefficiencies.

这个方法找到最大的性能瓶颈，并采用客观的方法来改善性能。重点是通过提升应用系统的效率和消灭

资源瓶颈和短板来达到性能提升的目的。在这个过程中，可以预料小于10%的性能提升来自instance tuning，

大部分的性能提升来自于隔离低效率的应用程序。

A Sample Decision Process for Performance Conceptual Modeling

Conceptual modeling is almost deterministic. However, as you gain experience in

performance tuning, you begin to appreciate that no real rules exist. A flexible

heads-up approach is required to interpret statistics and make good decisions.

概念建模几乎是确定的。但是，当你在性能调优中得到经验时，你开始明白其实没有真正的规则存在。

一个灵活的、聪明的方法是必须解释统计数据并作出正确的决定。

For a quick and easy approach to performance tuning, use ADDM. ADDMautomatically monitors

your Oracle system and provides recommendations forsolving performance problems should

problems occur. For example, suppose a DBAreceives a call from a user complaining that

the system is slow. The DBA simply examines the latest ADDM report to see which of the

recommendations should be implemented to solve the problem. See Chapter 6, "Automatic Performance

Diagnostics" for information about the features that help monitor and diagnose Oracle databases.

如果想找一个快速进行性能调优的方法，请使用ADDM。ADDM自动监控你的Oracle系统并提供解决性能问题

的建议。比如说，假设一个DBA接到一个用户投诉系统很慢的电话。DBA只需要查看最近的ADDM报告，

并查看解决这些性能问题需要实施的建议。见第六章，"Automatic Performance Diagnostics"关于

帮助监控和诊断oracle数据库的功能的相关信息，

The following steps illustrate how a performance engineer might look for bottlenecks

without using automatic diagnostic features. These steps are only intended as a

guideline for the manual process. With experience, performance engineers add to the steps involved.

This analysis assumes that statistics for both the operating system and

the database have been gathered.

如下步骤说明了一个性能工程师在没有自动诊断功能的情况下查找性能瓶颈。这些步骤只是作为手动诊断

的一个指引。随着经验的积累，工程师会把经验增加到如下步骤里。这些分析假设操作系统和数据库的

统计数据都是可用的。

1. Is the response time/batch run time acceptable for a single user on an empty or

lightly loaded computer?

If it is not acceptable, then the application is probably not coded or designed optimally,

and it will never be acceptable in a multiple user situation when system resources are shared.

In this case, get application internal statistics, and get SQL Trace and SQL plan information.

Work with developers to investigate problems in data, index, transaction SQL design, and potential

deferral of work to batch andbackground processing.

1. 如果在单用户、轻负载的系统上，响应时间是不是同样不可接受的？

如果是不可接受的，那么这个应用系统根本就没有好好编码和设计，这个系统如果在多用户、共享资源的

情况下是决不能让人忍受的。这种情况下，应设法取得应用系统内部统计数据、SQL trace和SQL plan的信息。

和开发工程师一起查看数据、索引、事务SQL设计、做后台处理的批处理等是否存在问题。

2. Is all the CPU being utilized?

If the kernel utilization is over 40%, then investigate the operating system for network

transfers, paging, swapping, or process thrashing. Continue to check CPU utilization in user

space to verify if there are any non-database jobs consuming CPU on the system limiting the

amount of shared CPU resources, such as backups, file transforms, print queues, and so on.

After determining that the database is using most of the CPU, investigate the top SQL by CPU utilization.

These statements form the basis of all future analysis. Check the SQL and thetransactions submitting

the SQL for optimal execution. Oracle Database provides CPU statistics in V$SQL and V$SQLSTATS.

If the application is optimal and no inefficiencies exist in the SQL execution, then

consider rescheduling some work to off-peak hours or using a bigger computer.

2. 是否所有的CPU都被充分使用了？

如果内核的使用率超过了40%，那么检查操作系统的网络传输、paging、swapping或者process thrashing。

检查操作系统用户的CPU使用率，看是否存在非数据库任务在消耗CPU进而限制共享资源，比如说备份、文件传输、

打印队列等等。确定数据库在使用大部分的CPU后，检查根据CPU排名的top SQL。这些观点构成了将来所有分析

的基础。检查这些SQL和交易提交这些SQL的最佳执行状况。Oracle数据库在V$SQL和V$SQLSTATS里提供CPU的统计

数据。如果应用程序是最佳的，不存在低效率的SQL执行，那么考虑安排一些工作，避开高峰期，或者使用更好

主机。

3. At this point, the system performance is unsatisfactory, yet the CPU resources are

not fully utilized.

In this case, you have serialization and unscalable behavior within the server. Get

the WAIT_EVENTS statistics from the server, and determine the biggest serialization

point. If there are no serialization points, then the problem is most likely outside

the database, and this should be the focus of investigation. Elimination of WAIT_EVENTS

involves modifying application SQL and tuning database parameters. This process is very iterative

and requires the ability to drill down on the WAIT_EVENTS systematically to eliminate serialization points.

3. 这种情况是系统性能不能让人满意，但是CPU资源并没有全部使用。

在这种情况下，说明服务器存在序列化和不可扩展的行为。取到服务器的WAIT_EVENTS的统计数据，并确定

服务器最大的序列化点(瓶颈)。如果不存在序列化点，那么问题很可能不在数据库，那么去重点调查它。消灭

掉等待事件包括修改应用的SQL和调整数据库的参数。这些过程是高度重复的，还需要有通过不断研究

WAIT_EVENTS来消灭掉序列化点的能力。

Top Ten Mistakes Found in Oracle Systems

This section lists the most common mistakes found in Oracle databases. By following

the Oracle performance improvement methodology, you should be able to avoid these

mistakes altogether. If you find these mistakes in your system, then re-engineer the

application where the performance effort is worthwhile. See "Automatic Performance

Tuning Features" on page1-5 for information about the features that help diagnose

and tune Oracle databases. See Chapter 10, "Instance Tuning Using Performance

Views" for a discussion on how wait event data reveals symptoms of problems that

can be impacting performance.

1. Bad connection management

The application connects and disconnects for each database interaction. This problem is

common with stateless middleware in application servers. It has over

two orders of magnitude impact on performance, and is totally unscalable.

2. Bad use of cursors and the shared pool

Not using cursors results in repeated parses. If bind variables are not used, then

there is hard parsing of all SQL statements. This has an order of magnitude impact

in performance, and it is totally unscalable. Use cursors with bind variables that

open the cursor and execute it many times. Be suspicious of applications generating dynamic SQL.

3. Bad SQL

Bad SQL is SQL that uses more resources than appropriate for the application

requirement. This can be a decision support systems (DSS) query that runs for

more than 24 hours, or a query from an online application that takes more than a

minute. You should investigate SQL that consumes significant system resources

for potential improvement. ADDM identifies high load SQL. SQL Tuning Advisor

can provide recommendations for improvement. See Chapter6, "Automatic

Performance Diagnostics" and Chapter17, "Automatic SQL Tuning".

4. Use of nonstandard initialization parameters

These might have been implemented based on poor advice or incorrect

assumptions. Most databases provide acceptable performance using only the set of

basic parameters. In particular, parameters associated with SPIN_COUNT on latches

and undocumented optimizer features can cause a great deal of problems that can

require considerable investigation.

Likewise, optimizer parameters set in the initialization parameter file can override

proven optimal execution plans. For these reasons, schemas, schema statistics, and

optimizer settings should be managed as a group to ensure consistency ofperformance.

5. Getting database I/O wrong

Many sites lay out their databases poorly over the available disks. Other sites

specify the number of disks incorrectly, because they configure disks by disk space

and not I/O bandwidth. See Chapter 8, "I/O Configuration and Design".

6. Online redo log setup problems

Many sites run with too few online redo log files and files that are too small. Small

redo log files cause system checkpoints to continuously put a high load on the buffer cache

and I/O system. If too few redo log files exist, then the archive cannotkeep up, and the database

must wait for the archiver to catch up. See Chapter 4,"Configuring a Database for Performance" for

information about sizing redo log files for performance.

7. Serialization of data blocks in the buffer cache due to lack of free lists, free list

groups, transaction slots (INITRANS), or shortage of rollback segments.

This is particularly common on INSERT-heavy applications, in applications that have raised

the block size above 8K, or in applications with large numbers ofactive users and few

rollback segments. Use automatic segment-spacemanagement (ASSM) and automatic undo management

to solve this problem.

8. Long full table scans

Long full table scans for high-volume or interactive online operations could

indicate poor transaction design, missing indexes, or poor SQL optimization. Long

table scans, by nature, are I/O intensive and unscalable.

9. High amounts of recursive (SYS) SQL

Large amounts of recursive SQL executed by SYS could indicate spacemanagement activities,

such as extent allocations, taking place. This is unscalableand impacts user response time.

Use locally managed tablespaces to reducerecursive SQL due to extent allocation.

Recursive SQL executed under anotheruser ID is probably SQL and PL/SQL, and this is not a problem.

10. Deployment and migration errors

In many cases, an application uses too many resources because the schema owning

the tables has not been successfully migrated from the development environment

or from an older implementation. Examples of this are missing indexes or incorrect

statistics. These errors can lead to sub-optimal execution plans and poor

interactive user performance. When migrating applications of known

performance, export the schema statistics to maintain plan stability using the

DBMS_STATS package.

Although these errors are not directly detected by ADDM, ADDM highlights the resulting high load SQL.

Emergency Performance Methods

This section provides techniques for dealing with performance emergencies. You

presumably have a methodology for establishing and improving application

performance. However, in an emergency situation, a component of the system has changed to

transform it from a reliable, predictable system to one that is unpredictable and not satisfying user requests.

In this case, the performance engineer must rapidly determine what has changed andtake appropriate

actions to resume normal service as quickly as possible. In many cases, it is necessary to

take immediate action, and a rigorous performanceimprovement project is unrealistic.

After addressing the immediate performance problem, the performance engineer must

collect sufficient debugging information either to get better clarity on the performance

problem or to at least ensure that it does not happen again.

The method for debugging emergency performance problems is the same as the

method described in the performance improvement method earlier in this book.

However, shortcuts are taken in various stages because of the timely nature of the

problem. Keeping detailed notes and records of facts found as the debugging process progresses is

essential for later analysis and justification of any remedial actions.

This is analogous to a doctor keeping good patient notes for future reference.

Steps in the Emergency Performance Method

The Emergency Performance Method is as follows:

1. Survey the performance problem and collect the symptoms of the performance

problem. This process should include the following:

■ User feedback on how the system is underperforming. Is the problem

throughput or response time?

■ Ask the question, "What has changed since we last had good performance?"

This answer can give clues to the problem. However, getting unbiased

answers in an escalated situation can be difficult. Try to locate some reference

points, such as collected statistics or log files, that were taken before and after

the problem.

■ Use automatic tuning features to diagnose and monitor the problem. See

"Automatic Performance Tuning Features" on page1-5 for information about

the features that help diagnose and tune Oracle systems. In addition, you can

use Oracle Enterprise Manager performance features to identify top SQL and

sessions.

2. Sanity-check the hardware utilization of all components of the application system.

Check where the highest CPU utilization is, and check the disk, memory usage,

and network performance on all the system components. This quick process

identifies which tier is causing the problem. If the problem is in the application,

then shift analysis to application debugging. Otherwise, move on to database

server analysis.

3. Determine if the database server is constrained on CPU or if it is spending time

waiting on wait events. If the database server is CPU-constrained, then investigate

the following:

■ Sessions that are consuming large amounts of CPU at the operating system

level and database; check V$SESS_TIME_MODEL for database CPU usage

■ Sessions or statements that perform many buffer gets at the database level;

check V$SESSTAT and V$SQLSTATS

■ Execution plan changes causing sub-optimal SQL execution; these can be

difficult to locate

■ Incorrect setting of initialization parameters

■ Algorithmic issues caused by code changes or upgrades of all components

If the database sessions are waiting on events, then follow the wait events listed in

V$SESSION_WAIT to determine what is causing serialization. The V$ACTIVE_SESSION_HISTORY

view contains a sampled history of session activity which you

can use to perform diagnosis even after an incident has ended and the system has

returned to normal operation. In cases of massive contention for the library cache,

it might not be possible to logon or submit SQL to the database. In this case, use

historical data to determine why there is suddenly contention on this latch. If most

waits are for I/O, then examine V$ACTIVE_SESSION_HISTORY to determine the SQL

being run by the sessions that are performing all of the inputs and outputs. See

Chapter10, "Instance Tuning Using Performance Views" for a discussion on wait

events.

4. Apply emergency action to stabilize the system. This could involve actions that

take parts of the application off-line or restrict the workload that can be applied to

the system. It could also involve a system restart or the termination of job in

process. These naturally have service level implications.

5. Validate that the system is stable. Having made changes and restrictions to the

system, validate that the system is now stable, and collect a reference set of

statistics for the database. Now follow the rigorous performance method described

earlier in this book to bring back all functionality and users to the system. This

process may require significant application re-engineering before it is complete.

JasonLiu_2012

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Performance Improvement Methods

Performance Improvement MethodsThis chapter discusses Oracle Database improvement methods and contains thefollowing sections:■ The Oracle Performance Improvement Method■ Emergency Perfor
复制链接

扫一扫

专栏目录