Oracle Performance Tuning 11g2 (3)

此篇文章作为性能调优概要介绍的最后一篇,是一定要读的。oracle写的非常的棒!

可能有人会说,在网上完全可能搜索到相关的翻译文章的,为什么还要自己再去翻译一下呢?我的理解是这样子的:别人已经学习过oracle了,为什么自己还要学?别人已经教了如何吃螃蟹的步骤,为什么自己还要去钻研?因此通过自己的深入学习,在这个过程中,充满着学习的兴趣、无法了解的痛苦、无法翻译的急躁等等,因为这个时间拉的比较长,因此掌握的程度也会比只看别人写的东西要稳固的从。就像以前的老师在黑板上教学效果要比现在的多媒体好的多的多!(如果是复读生,可能多媒体更好些)。在翻译的过程中,有些英文真是感觉无法翻译,因此参考着网上一些朋友的写法大概的再按自己的思路写,所以如果发现有不对的地方欢迎指正!

 

3 Performance Improvement Methods   性能提升的方法

This chapter discusses Oracle Database improvement methods and contains the following sections:

  • The Oracle Performance Improvement Method                       oracle性能提升的方法

  • Emergency Performance Methods                                         紧急处理时如何提升性能

3.1 The Oracle Performance Improvement Method

Oracle performance methodology helps you to identify performance problems in an Oracle database. This involves identifying bottlenecks and fixing them. It is recommended that changes be made to a system only after you have confirmed that there is a bottleneck.

Performance improvement, by its nature, is iterative. For this reason, removing the first bottleneck might not lead to performance improvement immediately, because another bottleneck might be revealed. Also, in some cases, if serialization points move to a more inefficient sharing mechanism, then performance could degrade. With experience, and by following a rigorous method of bottleneck elimination, applications can be debugged and made scalable.

Performance problems generally result from either a lack of throughput, unacceptable user/job response time, or both. The problem might be localized between application modules, or it might be for the entire system.

oracle性能方法论帮助你确定问题所在。包含查找瓶颈以及解决瓶颈。强烈建议你只有你真的确认那是一个瓶颈的时候再对系统做修改(最好是对你所做的修改都做一个文档记录,需要修改的时候先写到文档中,这样在写的时候就给自己脑子就留出一点时间去思考了,很多时候当写完语句后就会发现原来可以不需要这样做的)。

性能提升的过程是一个迭代的过程。基于这个原因,解决一个瓶颈可能并不能立刻提升性能,因为还有其他一些相关的性能瓶颈存在。另外,在有些时候,修改串行化方式到更新低效的共享机制后,性能还要降低。根据经验,以及遵循一系列的性能瓶颈消除策略,应用程序就可以被调试和伸缩自如。

性能问题通常导致低吞吐量,无法接受的响应时间。问题可能存在于应用程序模块中,或者存在于整个系统中。

 

Before looking at any database or operating system statistics, it is crucial to get feedback from the most important components of the system: the users of the system and the people ultimately paying for the application. Typical user feedback includes statements like the following:

  • "The online performance is so bad that it prevents my staff from doing their jobs."

  • "The billing run takes too long."

  • "When I experience high amounts of Web traffic, the response time becomes unacceptable, and I am losing customers."

  • "I am currently performing 5000 trades a day, and the system is maxed out. Next month, we roll out to all our users, and the number of trades is expected to quadruple."

  • 在你查看数据库或者OS统计数据之前,一个非常重要的事情就是从系统的用户和最后跟踪此应用程序人那里了解一些相关模块的最重要的信息。最典型的用户反馈信息如下:“这个系统联机性能太差了,我的员工都没法工作了”,“这个账单程序运行矿长时间了”,“当网络拥堵时,响应时间太长了,客户都跑了”,“我们系统目前一天可以处理5000笔交易,已经是极限了。下个月,当我们升级到所有用户后,交易量将翻2翻,也就是要增加4倍”

From candid feedback, it is easy to set critical success factors for any performance work. Determining the performance targets and the performance engineer's exit criteria make managing the performance process much simpler and more successful at all levels. These critical success factors are better defined in terms of real business goals rather than system statistics.

Some real business goals for these typical user statements might be:

  • "The billing run must process 1,000,000 accounts in a three-hour window."

  • "At a peak period on a Web site, the response time must not exceed five seconds for a page refresh."

  • "The system must be able to process 25,000 trades in an eight-hour window."

  • 从这些直白的反馈中,将可以看到一些关键的性能提升因素。判断了这些性能的最低要求和性能工程师的最低标准,将使得性能优化变得简单一些。这些关键的性能提升因素最好是根据商业的目标而定,而不是系统的统计数据。(也就是说,通过了解客户的需求,就知道他们要什么,是要每秒钟1000笔交易还是批量程序快点? 任何时间先听听客户想干什么?要求是什么?多听少动! ) 下面是一些真实的商业目标: “我们必须要在3小时内处理完100万的账单”,“在web处理的高峰期,页面的刷新也不能超过5秒钟”,“系统必须要能在8小时内处理完25000个订单”。

The ultimate measure of success is the user's perception of system performance. The performance engineer's role is to eliminate any bottlenecks that degrade performance. These bottlenecks could be caused by inefficient use of limited shared resources or by abuse of shared resources, causing serialization. Because all shared resources are limited, the goal of a performance engineer is to maximize the number of business operations with efficient use of shared resources. At a very high level, the entire database server can be seen as a shared resource. Conversely, at a low level, a single CPU or disk can be seen as shared resources.

You can apply the Oracle performance improvement method until performance goals are met or deemed impossible. This process is highly iterative. Inevitably, some investigations may have little or no impact on database performance. Time and experience are necessary to develop the skills to accurately and quickly pinpoint critical bottlenecks. However, prior experience can sometimes work against the experienced engineer who neglects to use the data and statistics available. This type of behavior encourages database tuning by myth and folklore. This is a very risky, expensive, and unlikely to succeed method of database tuning.

The Automatic Database Diagnostic Monitor (ADDM) implements parts of the performance improvement method and analyzes statistics to provide automatic diagnosis of major performance issues. Using ADDM can significantly shorten the time required to improve the performance of a system. See Chapter 6, "Automatic Performance Diagnostics" for a description of ADDM.

Systems are so different and complex that hard and fast rules for performance analysis are impossible. In essence, the Oracle performance improvement method defines a way of working, but not a definitive set of rules. With bottleneck detection, the only rule is that there are no rules! The best performance engineers use the data provided and think laterally to determine performance problems.

性能最终性能的成功与否的标准在于客户的感觉。 性能工程师的角色是消耗降低性能的瓶颈所在。 这些瓶颈或许是因为共享资源使用不足,或者是是资源的乱用导致的串行化处理。因为资源是有限的,性能优化师的目标就是最大化的利用共享资源。 站在高处看,整个数据库服务器就是一个共享资源。 站在低层看,一个单一的CPU或者硬盘可以看作一个共享资源。

你可以使用oracle性能调优的方式,直到性能达到要求或者的确无法实现。 这是个高度迭代的过程。很多时候,一些研究发现可能对于性能几乎没什么影响。如果想要达到快速的定位问题所在的技能是需要大量的时间经验的(经验一定是通过大量的时间磨出来的)然而有时候经验也可能导致自己疏忽统计数据的研究。这种行为鼓励使用神话与民间传说式的调优(个人英雄主义),将会是非常危险的,昂贵的,不容易成功的调优方法。

ADDM实现了一部分性能调优的工作,它通过分析统计数据去提供一些诊断和解决问题的方法。使用ADDM可以缩短调优的时间。

因为系统千变万化,以及相对复杂,所以要提供一个统一快速的方法去调优是不太可能的。本质上说,oracle的性能调优是定义了一种工作的方法(方法论),而不是定义了调优的步骤规则。 在侦测性能瓶颈的过程中,唯一的规则就是没有规则!  最好的性能工程师使用数据库提供的如AWR数据,端着咖啡站在一边,然后仔细想这个系统的问题所在!(在系统出现问题的时候,别立即就毛手毛脚的,一会动动这,一会动动那;一定要镇定,要自信;不能因为性能差或者出了事故就乱了手脚;做到---急事,慢慢说;大事,清楚说;没有把握的事,谨慎说; 没发生的事,莫胡说; 做不到的事,别乱说;总之在动手之前先动脑!)

举个以前忘记从哪里看的一个故事: 话说很久很久以前,机房还要使用磁带机的时候,客户说IBM的设备多数时间正常,偶尔就出问题。于是IBM派人查看问题,找不出来;接下来检查程序代码,逻辑也没问题;但问题是他真的就是有问题啊。最后IBM派了一个经验丰富的专家去现场查看,这个哥们一去就端着咖啡坐到一旁,什么也不干,就是跟其他人聊聊天,观察一下系统的运行。最终他发现一旦有人走过来,系统就不稳定了;没人走动的时候就没事了。 最后发现是机房的一个地面磁砖松动了,导致人一走过去设备就振动的非常厉害。结局不讲了大家懂的。从这个例子上我们看到,很多看起来很不思议的问题的解决都是让你想撞墙那种方法,但是能找到问题所在的人,往往都是那些专家级人员。在银行业,我想其他行业都差不多,当出现问题的时候,一定是小弟先去解决,解决不了时再让中级工程师解决,还不行就让高级工程师解决,如果还高级工程师解决不了就说明问题不是出现在一般的技术上,专家出面时一般的技术就不用再去考虑了,多想想可能闹鬼的地方!

 


3.1.1 Steps in The Oracle Performance Improvement Method        性能设计的方法
  1. Perform the following initial standard checks:                        先执行下面标准的初始检查:

    • Get candid feedback from users. Determine the performance project's scope and subsequent performance goals, and performance goals for the future. This process is key in future capacity planning.   从用户那里得到直白的系统问题反馈。了解系统范围和优化目标。这步很关键。

    • Get a full set of operating system, database, and application statistics from the system when the performance is both good and bad. If these are not available, then get whatever is available. Missing statistics are analogous to missing evidence at a crime scene: They make detectives work harder and it is more time-consuming.                         获取一份客户系统在好的时候或者差的时候的操作系统,数据库,和应用程序统计数据。假如无法拿到的话,任何时候的都先拿一份再说。 如果没有统计数据,就类似于丢了犯罪现场的证据一样,将导致侦破工作非常艰难和旷日持久

    • Sanity-check the operating systems of all computers involved with user performance. By sanity-checking the operating system, you look for hardware or operating system resources that are fully utilized. List any over-used resources as symptoms for analysis later. In addition, check that all hardware shows no errors or diagnostics.          小心仔细的检查与性能有关的所有机器的操作系统。通过检查操作系统,你可以发现硬件或者操作系统的资源占用情况。 将所有可能过多使用的地方都列出来供以后分析。 另外要检查一下硬件是否发生了故障!

  2. Check for the top ten most common mistakes with Oracle Database, and determine if any of these are likely to be the problem. List these as symptoms for later analysis. These are included because they represent the most likely problems. ADDM automatically detects and reports nine of these top ten issues. See Chapter 6, "Automatic Performance Diagnostics" and "Top Ten Mistakes Found in Oracle Systems".        检查一下数据库的top10问题,看看是否问题就在其中。 列出所有的问题供以后分析。 列出这top10的事项的主要是他们代表了最有可能出现的问题。 ADDM自动的侦测和报告10个中的9个问题

  3. Build a conceptual model of what is happening on the system using the symptoms as clues to understand what caused the performance problems. See "A Sample Decision Process for Performance Conceptual Modeling".    建立一个性能优化的概念模型,通过一个症状做为线索,找出系统目前正在发生什么,以及什么导致了问题所在。见下面的3.1.2中。

  4. Propose a series of remedy actions and the anticipated behavior to the system, then apply them in the order that can benefit the application the most. ADDM produces recommendations each with an expected benefit. A golden rule in performance work is that you only change one thing at a time and then measure the differences. Unfortunately, system downtime requirements might prohibit such a rigorous investigation method. If multiple changes are applied at the same time, then try to ensure that they are isolated so that the effects of each change can be independently validated.    对系统进行修正工作,按照自己的修正顺序,一个一个的来,看看常用的效果如何。 ADDM会产生一系列的建议以及能提升好大的性能。 一个黄金法则是: 在做性能优化时同一时间只修改一个地方,然后比较到底性能差异。  但是呢,如果系统一旦宕机的话,根本没有时间能让我们这么做。 假如一口气改了N多东西的话,那么确保每个效果都是不一样的,他们可以被单独的验证。(我一般是按我自己的方式去调试,如果发现和我的不一样先全部应用上去,很少这样一步步来的)

  5. Validate that the changes made have had the desired effect, and see if the user's perception of performance has improved. Otherwise, look for more bottlenecks, and continue refining the conceptual model until your understanding of the application becomes more accurate.   验证一下修正之后,是否达到自己想像的那种立竿见影的效果,以及客户的直观感觉是否好了。 否则的话,就再检查这些瓶颈,重新定义模型,直接到系统有了更准确的理解。

  6. Repeat the last three steps until performance goals are met or become impossible due to other constraints.  重复3,4,5步骤,直到优化成功或者因为一些其他条件根本无法实现。

This method identifies the biggest bottleneck and uses an objective approach to performance improvement. The focus is on making large performance improvements by increasing application efficiency and eliminating resource shortages and bottlenecks. In this process, it is anticipated that minimal (less than 10%) performance gains are made from instance tuning, and large gains (100% +) are made from isolating application inefficiencies.

这些方式确定了最大的瓶颈,以及使用了一个直观的途径去提升性能。 主要集中在通过增加系统的有效系统和消耗资源短缺和瓶颈来最大的提升性能。 在这个过程中,少则提升低于10%的性能,高的话有可能会超过100%。

 


3.1.2 A Sample Decision Process for Performance Conceptual Modeling        在性能概念模型中的一个简单的决策过程

Conceptual modeling is almost deterministic. However, as you gain experience in performance tuning, you begin to appreciate that no real rules exist. A flexible heads-up approach is required to interpret statistics and make good decisions.

For a quick and easy approach to performance tuning, use ADDM. ADDM automatically monitors your Oracle system and provides recommendations for solving performance problems should problems occur. For example, suppose a DBA receives a call from a user complaining that the system is slow. The DBA simply examines the latest ADDM report to see which of the recommendations should be implemented to solve the problem. See Chapter 6, "Automatic Performance Diagnostics" for information about the features that help monitor and diagnose Oracle databases.

The following steps illustrate how a performance engineer might look for bottlenecks without using automatic diagnostic features. These steps are only intended as a guideline for the manual process. With experience, performance engineers add to the steps involved. This analysis assumes that statistics for both the operating system and the database have been gathered.

概念模型通常是起决定性意义的。 然而当你的性能调优经验越来越多的时候,你就能领悟到根本没有真正的规则存在。在解释统计数据和做出决策的过程中,需要的是足智多谋,思维敏捷的途径。

如果要使用快速和相对容易的方式去调优,那就使用ADDM吧! ADDM自动的监控你的系统,并且对于性能问题提供了一些帮助建议。 例如,假设一个DBA接收客户说系统比较慢的抱怨电话。DBA简单的检查最近的ADDM报告去看看有哪个建议可以使用。

下面的步骤阐述了在没有自动诊断特性时性能工程师如何确定问题。 这些步骤仅仅作为一个手动操作的参考。 性能工程师将连同自己的经验一起参与到自己的解决步骤中(随着经验的积累,性能工程师会增加相应的步骤)。这些分析的前提是操作系统和数据库的统计数据都已经被收集过了。

  1. Is the response time/batch run time acceptable for a single user on an empty or lightly loaded computer?

    If it is not acceptable, then the application is probably not coded or designed optimally, and it will never be acceptable in a multiple user situation when system resources are shared. In this case, get application internal statistics, and get SQL Trace and SQL plan information. Work with developers to investigate problems in data, index, transaction SQL design, and potential deferral of work to batch and background processing. 

    当只有一个用户时,或者是系统负载比较轻的时候这个响应时间和批量处理时间都是正常的? 假如这个时候都是不可接受的,那么说明应用程序可能真的设计的有问题,同时也说明在多用户条件中性能更不可接受了。 在这种情况下,获取应用内部统计数据,获取SQL TRACE和SQL PLAN信息。 与开发人员一起调查数据,索引,事务设计,以及可能延迟的批量任务问题,还有他们的后台进程。(多数时候,不需要了解太多他们的业务需求,只要了解到哪一支程序慢,慢在哪里?这只程序处理了什么逻辑等等就可以了。我有个坏毛病,总是先听他们的处理逻辑,听完后通常都和自己的设计理念不符,通常下他们的设计也绝对不怎么好(好的话怎么可能会出问题,是吧), 在解决问题之前就先鄙视一下,很多人脸上就挂不住了。个人情商太低!)

  2. Is all the CPU being utilized?

    If the kernel utilization is over 40%, then investigate the operating system for network transfers, paging, swapping, or process thrashing. Continue to check CPU utilization in user space to verify if there are any non-database jobs consuming CPU on the system limiting the amount of shared CPU resources, such as backups, file transforms, print queues, and so on. After determining that the database is using most of the CPU, investigate the top SQL by CPU utilization. These statements form the basis of all future analysis. Check the SQL and the transactions submitting the SQL for optimal execution. Oracle Database provides CPU statistics in V$SQL and V$SQLSTATS.

    See Also:

    Oracle Database Reference for more information on V$SQL and V$SQLSTATS

    If the application is optimal and no inefficiencies exist in the SQL execution, then consider rescheduling some work to off-peak hours or using a bigger computer.

    CPU的使用率如何?

    假如操作系统占了40%多的CPU时,查看一下网络传输,分页,交换或者进程颠簸情况。 仍然在用户空间下查看CPU使用率问题,检查下是否有什么非数据库的工作在消耗CPU的工作,例如备份,文件传输,打印队列等等问题(通常夜间都会进行日志等备份工作,比如一个批量的文件在处理前要解压,处理完后要压缩,gzip是非常消耗CPU资源的,明显会看到CPU迅速冲到80~90%的)。如果确认是数据库占据了多数的CPU,检查一下在消耗CPU中,使用率最高的SQL语句。 这些语句就是将来需要我们去分析的。 检查SQL和事务是否正常执行了。 查询这些CPU的统计可以通过V$SQL, V$SQLSTATS去查询! 假如说应用程序是最优的,同时SQL语句也没有明显的低效之处,那么就工作放到非高峰期处理看或者直接换个性能更好的机器试试。(银行业数据库服务器好的一塌糊涂,这个不需要再看了,绝对资源过剩的。出现问题一定是SQL语句或者事务有问题,比如乱加锁的原因)

  3. At this point, the system performance is unsatisfactory, yet the CPU resources are not fully utilized.

    In this case, you have serialization and unscalable behavior within the server. Get the WAIT_EVENTS statistics from the server, and determine the biggest serialization point. If there are no serialization points, then the problem is most likely outside the database, and this should be the focus of investigation. Elimination of WAIT_EVENTS involves modifying application SQL and tuning database parameters. This process is very iterative and requires the ability to drill down on the WAIT_EVENTS systematically to eliminate serialization points.

    单用户也调试了,系统资源占用也考虑了,系统性能还不行,同时CPU也没有充分的利用到。

    在这种情况下,通常情况下是串行化了。 获取一下wait_events统计数据,确定最大的串行化点。 假如没有发现什么串行化情况,那么问题就极有可能出在数据库之外。 消除wait_events的事件,涉及到修改应用SQL和调优系统参数。 这个过程是一个迭代化的过程,需要深入钻研的精神,有步骤地,系统地查看wait_events,最终消除问题所在。

 


3.1.3 Top Ten Mistakes Found in Oracle Systems                        在oracle系统中最高的10种错误

This section lists the most common mistakes found in Oracle databases. By following the Oracle performance improvement methodology, you should be able to avoid these mistakes altogether. If you find these mistakes in your system, then re-engineer the application where the performance effort is worthwhile. See "Automatic Performance Tuning Features" for information about the features that help diagnose and tune Oracle databases. See Chapter 10, "Instance Tuning Using Performance Views" for a discussion on how wait event data reveals symptoms of problems that can be impacting performance.

本节列出了在Oracle系统中最常见的错误。使用Oracle的性能方法论,你应该能总体上避免这些错误。如果你的系统中存在这些错误,重新设计你应用程序的中关于性能部分努力将是值得的。

  1. Bad connection management                             糟糕的连接管理

    The application connects and disconnects for each database interaction. This problem is common with stateless middleware in application servers. It has over two orders of magnitude impact on performance, and is totally unscalable. 

    应用程序不断的的连接和断开与数据库。此问题常见于无状态的中间件应用中。它对性能的影响超过两个数量级,并且完全无法扩展及不稳定。

  2. Bad use of cursors and the shared pool             游标的错误使用和共享池

    Not using cursors results in repeated parses. If bind variables are not used, then there is hard parsing of all SQL statements. This has an order of magnitude impact in performance, and it is totally unscalable. Use cursors with bind variables that open the cursor and execute it many times. Be suspicious of applications generating dynamic SQL.  

    未使用游标导致重复解析。假如不使用绑定变量,将导致所有SQL语句出现硬解析。这将导致一个数据量级的影响,完全不能扩展。使用绑定变量的游标,打开游标,然后多次去执行。一定要注意那些动态产生的SQL。(现在有很多工具可以自动生成SQL语句,特别是java开发人员有时会使用它们,而根本不关心生成的语句效果如何,最后全部推给优化师,日他奶奶的这帮人!关键是因为他们一直在动态的生成,干的活看起来非常多,最后奖金还很多,所以挣的钱多少跟水平不是成正比的!)

  3. Bad SQL                                                         糟糕的SQL

    Bad SQL is SQL that uses more resources than appropriate for the application requirement. This can be a decision support systems (DSS) query that runs for more than 24 hours, or a query from an online application that takes more than a minute. You should investigate SQL that consumes significant system resources for potential improvement. ADDM identifies high load SQL. SQL Tuning Advisor can provide recommendations for improvement. See Chapter 6, "Automatic Performance Diagnostics" and Chapter 17, "Automatic SQL Tuning". 

    糟糕的SQL是指较应用需求使用更多资源的SQL。这可能是一个运行超过24小时的DSS查询或需要一分多钟的联机应用程序查询。要研究消耗大量的系统资源的sql。ADDM确定高负荷SQL. STA可以提供改进的建议。

  4. Use of nonstandard initialization parameters      使用非标准初始化参数

    These might have been implemented based on poor advice or incorrect assumptions. Most databases provide acceptable performance using only the set of basic parameters. In particular, parameters associated with SPIN_COUNT on latches and undocumented optimizer features can cause a great deal of problems that can require considerable investigation.

    Likewise, optimizer parameters set in the initialization parameter file can override proven optimal execution plans. For these reasons, schemas, schema statistics, and optimizer settings should be managed as a group to ensure consistency of performance.

    这些可能基于糟糕的同行建议或不正确的假设。大多数系统仅使用基本参数设定就会得到不错的性能。特别是与latch相关的 SPIN_COUNT 参数和未书面记录的优化特性可能会导致很大的问题,修改前需要严格的考虑。 同样,在初始化参数文件中设置的优化参数可以覆盖已经证明是最佳的执行计划。由于这些原因,相关用户,及用户统计数据,优化设置应作为一组共同管理,以确保性能的一致性。

    See Also:

    • Oracle Database Administrator's Guide for information about initialization parameters and database creation

    • Oracle Database Reference for details on initialization parameters

    • "Performance Considerations for Initial Instance Configuration" for information about parameters and settings in an initial instance configuration

  5. Getting database I/O wrong                              获取数据库I/O错误

    Many sites lay out their databases poorly over the available disks. Other sites specify the number of disks incorrectly, because they configure disks by disk space and not I/O bandwidth. See Chapter 8, "I/O Configuration and Design".  

    许多站点性能不佳是因为数据库放在比较差的磁盘上。还有一些站点磁盘数量是不对的,因为他们依据磁盘空间而不是I/O带宽去置磁盘。(比较调整硬盘一定要作为redo使用等等)

  6. Online redo log setup problems                        重做日志设置问题

    Many sites run with too few online redo log files and files that are too small. Small redo log files cause system checkpoints to continuously put a high load on the buffer cache and I/O system. If too few redo log files exist, then the archive cannot keep up, and the database must wait for the archiver to catch up. See Chapter 4, "Configuring a Database for Performance" for information about sizing redo log files for performance.     

    许多站点redo太小,太少。小的redo会导致系统检查点给高缓冲区和I/O系统持续的压力。如果redo太少,然后归档不能跟上,则数据库将等待归档进程。

  7. Serialization of data blocks in the buffer cache due to lack of free lists, free list groups, transaction slots (INITRANS), or shortage of rollback segments.

    This is particularly common on INSERT-heavy applications, in applications that have raised the block size above 8K, or in applications with large numbers of active users and few rollback segments. Use automatic segment-space management (ASSM) and automatic undo management to solve this problem. 

    高速缓存中的数据块由于缺乏free lists、free list groups(这些都常出现在9i或以前的系统中,新系统中很少有DBA这样使用表空间了,听听就算了),事务插槽,或回滚段导致串行化操作。这在INSERT比较多的应用中,同时块大小设置大于8K,或在应用程序中有大量的活跃用户数及少量回滚段的情况下常见些。使用ASSM,自动UNDO管理已经不存在这样的问题。

  8. Long full table scans                                      长时间全表扫描

    Long full table scans for high-volume or interactive online operations could indicate poor transaction design, missing indexes, or poor SQL optimization. Long table scans, by nature, are I/O intensive and unscalable. 

    在大表或联机交互操作中的进行长时间全表扫描,可能显示事务设计不好、缺少索引或糟糕的SQL优化计划。长时间表扫描,本质上是I/O密集??型的,不可扩展。

  9. High amounts of recursive (SYS) SQL              大量的递归(SYSSQL

    Large amounts of recursive SQL executed by SYS could indicate space management activities, such as extent allocations, taking place. This is unscalable and impacts user response time. Use locally managed tablespaces to reduce recursive SQL due to extent allocation. Recursive SQL executed under another user ID is probably SQL and PL/SQL, and this is not a problem. 

    大量的SYS用户调用的递归SQL的执行,可能表示分配空间的管理行为,如需要更多的extent空间。这是非扩展的并影响用户的响应时间。使用本地管理表空间,以减少由于分配空间造成的的递归SQL。其他用户ID下执行的递归SQL问题可能是SQL语句或PL/SQL问题,这是没有问题的。

  10. Deployment and migration errors                   部署和迁移错误

    In many cases, an application uses too many resources because the schema owning the tables has not been successfully migrated from the development environment or from an older implementation. Examples of this are missing indexes or incorrect statistics. These errors can lead to sub-optimal execution plans and poor interactive user performance. When migrating applications of known performance, export the schema statistics to maintain plan stability using the DBMS_STATS package.

    Although these errors are not directly detected by ADDM, ADDM highlights the resulting high load SQL. 

    在许多情况下,应用程序使用了太多的资源,因为用户的表没有成功地从开发环境迁移过来,或新设计的表没有应用上来。比如缺少索引或不正确的统计数据。这些错误可能会导致不佳的执行计划和糟糕的用户体验。当原有的旧系统是相对稳定的,那么使用DBMS_STATS导出旧系统的统计数据。

    虽然这些错误不能直接被ADDM检测到,但ADDM会突出显示出高负荷的SQL。

 

 


3.2 Emergency Performance Methods            紧急性能处理

This section provides techniques for dealing with performance emergencies. You presumably have a methodology for establishing and improving application performance. However, in an emergency situation, a component of the system has changed to transform it from a reliable, predictable system to one that is unpredictable and not satisfying user requests.

In this case, the performance engineer must rapidly determine what has changed and take appropriate actions to resume normal service as quickly as possible. In many cases, it is necessary to take immediate action, and a rigorous performance improvement project is unrealistic.

After addressing the immediate performance problem, the performance engineer must collect sufficient debugging information either to get better clarity on the performance problem or to at least ensure that it does not happen again.

The method for debugging emergency performance problems is the same as the method described in the performance improvement method earlier in this book. However, shortcuts are taken in various stages because of the timely nature of the problem. Keeping detailed notes and records of facts found as the debugging process progresses is essential for later analysis and justification of any remedial actions. This is analogous to a doctor keeping good patient notes for future reference.

本节提供了紧急情况的性能调优技术。你可能已经建立了自己的应用程序调优方面的技巧。然而,在紧急情况下,系统的一部分从一个稳定的状态改变成一个不可预知时,不能满足用户请求的状态。

在这种情况下,性能工程师的必须迅速确定什么变化了,并采取适当的行动,以尽可能快地恢复正常的服务。在许多情况下,必须立即采取行动,严格的性能提升是不现实的(前段时间我看了一些关于oracle12c的东西,好像里面有了处理这方面的东西,也就是在出现问题时先别急着去重启数据库,先看看到底发生了什么,可以把东西备份出来供以后分析)。

在定位了的当前的紧急性能问题后,性能工程师必须收集足够的调试信息,或者是为了对性能问题有更加清晰的了解,或者至少确保它真的不会再次发生了。

用于调试紧急性能问题的方法和在本书前面章节介绍的性能改进方法是相似的。然而,因为问题的紧迫,可以采取快速方便的捷径处理方式。保持详细的笔记,记录调试处理过程中的发现的一切现象是至关重要的,方便以后的分析及做出补救措施。这类似于医生详细记录病人病历,以备将来参考。


3.2.1 Steps in the Emergency Performance Method   紧急的性能方法的步骤如下:

The Emergency Performance Method is as follows:

  1. Survey the performance problem and collect the symptoms of the performance problem. This process should include the following:

    • User feedback on how the system is underperforming. Is the problem throughput or response time?

    • Ask the question, "What has changed since we last had good performance?" This answer can give clues to the problem. However, getting unbiased answers in an escalated situation can be difficult. Try to locate some reference points, such as collected statistics or log files, that were taken before and after the problem.

    • Use automatic tuning features to diagnose and monitor the problem. See "Automatic Performance Tuning Features" for information about the features that help diagnose and tune Oracle systems. In addition, you can use Oracle Enterprise Manager performance features to identify top SQL and sessions.   

      调查性能问题并收集性能相关的问题症状。这个过程应该包括以下内容:

      从用户反馈中了解系统状况。是吞吐量还是响应时间问题?

      问问他们,“以前正常的时候大概在几号,之后改了什么东西?”这个答案可以提供一些线索。然而,在紧急情况下可以得到准确的答案是相对困难的。尝试找到一些参考点,如收集的统计数据或查看日志文件,比较问题发生之前和之后的记录。(注:通过和客户聊天,可以得到一些相关的修改记录,有的时候,有的“犯罪份子”可能会故意的不说实话,比如他不小心把配置文件删除了,然后随便找了一个替过来等等,这时要从日志中查看信息)
      使用自动调优功能,诊断和监测的问题。此外,您可以使用OEM,识别资源占用最高的SQL和会话信息。(一般情况下,生产上不太喜欢安装OEM,所以还是准备好SQL语句去查看吧!)

  2. Sanity-check the hardware utilization of all components of the application system. Check where the highest CPU utilization is, and check the disk, memory usage, and network performance on all the system components. This quick process identifies which tier is causing the problem. If the problem is in the application, then shift analysis to application debugging. Otherwise, move on to database server analysis.  

    全面检查服务器各个硬件部件使用率。检查CPU使用率最高的什么程序,并检查磁盘,内存使用情况和与数据库与关联的所有系统的网络性能。快速的确定造成问题的是哪一层。如果问题是在应用程序中,就转向分析应用程序的调试。否则,将转向数据库服务器的分析。(像在银行业的话,一般说是服务器出现问题那基本上就是数据库的问题了,因为在找DBA之前他们已经做了大量的研究,基本上不会出现在应用程序中了,许多时候都是由于没有打PATCH的原因)

  3. Determine if the database server is constrained on CPU or if it is spending time waiting on wait events. If the database server is CPU-constrained, then investigate the following:

    • Sessions that are consuming large amounts of CPU at the operating system level and database; check V$SESS_TIME_MODEL for database CPU usage

    • Sessions or statements that perform many buffer gets at the database level; check V$SESSTAT and V$SQLSTATS

    • Execution plan changes causing sub-optimal SQL execution; these can be difficult to locate

    • Incorrect setting of initialization parameters

    • Algorithmic issues caused by code changes or upgrades of all components

    If the database sessions are waiting on events, then follow the wait events listed in V$SESSION_WAIT to determine what is causing serialization. The V$ACTIVE_SESSION_HISTORY view contains a sampled history of session activity which you can use to perform diagnosis even after an incident has ended and the system has returned to normal operation. In cases of massive contention for the library cache, it might not be possible to logon or submit SQL to the database. In this case, use historical data to determine why there is suddenly contention on this latch. If most waits are for I/O, then examine V$ACTIVE_SESSION_HISTORY to determine the SQL being run by the sessions that are performing all of the inputs and outputs. See Chapter 10, "Instance Tuning Using Performance Views" for a discussion on wait events.

    确定是数据库服务器的CPU不够还是时间花在等待事件上。如果数据库服务器是CPU不足的,然后考察以下内容:

    在操作系统和数据库级别消耗大量的CPU的会话,检查 V$SESS_TIME_MODEL、V$SYS_TIME_MODEL 确定数据库CPU的使用情况(V$SESS_TIME_MODEL displays the session-accumulated time for various operations. The time reported is the total elapsed or CPU time (in microseconds). Any timed operation will buffer at most 5 seconds of time data.)

    在数据库级别上执行很多缓冲区获读取的会话或语句(或者说正在大量消耗资源的语句),检查 V$SESSTATV$SQLSTATS

    执行计划的改变导致SQL效率不佳,这个比较难难定位了

    初始化参数设置不正确

    整个系统升级或部分代码升级时其中的算法问题

    如果数据库会话发生了等待事件,查看 V$SESSION_WAIT 上的等待事件,以确定是什么原因造成的序列化。V$ACTIVE_SESSION_HISTORY 记录了采样的活动会话历史记录,即使这些会话发生已结束,当系统恢复正常后,仍可用于诊断分析。在library cache中发生大量争用的情况下,可能无法登录或commit。在这种情况下,使用历史数据,以确定为什么突然间会发生此闩锁争用。如果大多数在等待I/O,通过V$ACTIVE_SESSION_HISTORY以确定正在活动的会话运行的中包含大量IO的SQL语句。

  4. Apply emergency action to stabilize the system. This could involve actions that take parts of the application off-line or restrict the workload that can be applied to the system. It could also involve a system restart or the termination of job in process. These naturally have service level implications.  

    采取紧急行动来稳定系统。可能涉及停掉部分应用程序或限制系统的压力。包括系统重启或终止运行中的JOB。这会对服务产生一定的影响(都重启了当然有影响了,不过现在RAC应用的挺多,重启一台对另一台没影响)。

  5. Validate that the system is stable. Having made changes and restrictions to the system, validate that the system is now stable, and collect a reference set of statistics for the database. Now follow the rigorous performance method described earlier in this book to bring back all functionality and users to the system. This process may require significant application re-engineering before it is complete.

        验证该系统是稳定的。对系统进行了修改和限制后,验证系统现在是稳定的,并为数据库收集参考的统计数据。现在,按照本书前面介绍的严格的性能分析方法重新启用系统的所有功能和用户连接。这个过程完成之前,可能需要大量的重新设计应用程序的工作。

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/1696240/viewspace-1169621/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/1696240/viewspace-1169621/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值