一、概述
要想优化log file sync等待事件,那么懂得其触发机制和原理是很优必要的。这里先引用官方文档来对log file sync等待进行解释(任何信息的描述应以官方文档为准,百度到的结果通常会有一些错误,错误的信息会导致错误的思路以及错误的结果):
What is a ‘log file sync’ wait?
When a user session commits, all redo records generated by that session’s transaction need to be flushed from memory to the redo logfile to insure changes to the database made by that transaction become permanent.
【翻译:当用户会话提交时,该会话的事务生成的所有redo记录都需要从内存中刷新到重做日志文件,以确保该事务永久性的记录到数据库变更中。】
At the time of commit, the user session will post LGWR to write the log buffer (containing the current unwritten redo, including this session’s redo records) to the redo log file. Once LGWR knows that its write requests have completed, it will post the user session to notify it that this has completed. The user session waits on ‘log file sync’ while waiting for LGWR to post it back to confirm all redo it generated have made it safely onto disk.
【翻译:在提交时,用户会话将发布LGWR以将日志缓冲区(包含当前未写入的redo,包括该会话的redo记录)写入redo log。 一旦LGWR知道其写请求已完成,它将发布通知给用户会话告知其已完成。 用户会话等待“log file sync”,同时等待LGWR将其回发,以确认其生成的所有redo操作已将其安全地保存到磁盘上。】
The time between the user session posting the LGWR and the LGWR posting the user after the write has completed is the wait time for ‘log file sync’ that the user session will show.
【翻译:用户会话通知LGWR和LGWR在写入完成后通知用户之间的时间就是“log file sync”显示在用户会话的等待时间。】
Note that in 11.2 and higher LGWR may dynamically switch from the default post/wait mode to a polling mode where it will maintain it’s writing progress in an in-memory structure and sessions waiting on ‘log file sync’ can periodically check that structure (i.e. poll) to see if LGWR has progressed far enough such that the redo records representing their transactions have made it to disk. In that case the wait time will span from posting LGWR until the session sees sufficient progress has been made.
【翻译:请注意,在11.2(确切版本是11.2.0.3.0)及更高版本中,LGWR可以动态地从默认的post/wait模式切换到polling模式(原模式为:post/wait),在这种模式下,它将保持其在内存结构中的写入进度,并且等待“log file sync”的会话可以定期检查该结构(即polling)以查看LGWR是否已取得足够的进展,以使代表其事务的redo记录已进入磁盘。 在这种情况下,等待时间将从发布LGWR到会话看到足够的进展为止。】
Compare the average wait time for ‘log file sync’ to the average wait time for ‘log file parallel write’.
【翻译:将“log file sync”的平均等待时间与“log file parallel write”的平均等待时间进行比较】
Wait event ‘log file parallel’ write is waited for by LGWR while the actual write operation to the redo is occurring. The duration of the event shows the time waited for the IO portion of the operation to occur. For more information on “log file parallel write” see: Document:34583.1 WAITEVENT: “log file parallel write” Reference Note
【翻译:“log file paralle write”是等待lgwr对redo记录进行实际的写入操作,事件的持续表示为等待操作的IO部分发生的时间。 有关“日志文件并行写入”的更多信息,请参见:Document:34583.1 WAITEVENT: “log file parallel write” Reference Note】
To identify a potential high commit rate, if the average wait time for ‘log file sync’ is much higher than the average wait time for ‘log file parallel write’, then this means that most of the time waiting is not due to waiting for the redo to be written and thus slow IO is not the cause of the problem. The surplus time is CPU activity and is most commonly contention caused by over committing.
【为了确定潜在的高提交率,如果“log file sync”的平均等待时间比“log file parallel write”的平均等待时间高得多,那么这意味着大多数等待时间不是由于等待redo记录的写入,从而导致的问题的原因并不是IO慢。 (排除IO消耗的时间后,log file sync等待的)剩余时间是CPU的活动 ,最常见的争用是由于过量提交引起的。】
【上述内容出自:Troubleshooting: ‘Log file sync’ Waits (Doc ID 1376916.1) 】
【PS:MOS文档翻查小技巧,如果要查某知识点的原理和解决方案,可以看看“Troubleshotting:关键字”这个系列,非常强大。】
从官方文档的解释来看,commit后大体分为以下几个阶段:
1)用户进程发起commit;【CPU资源消耗】
2)用户进程通知lgwr写日志;【CPU资源消耗】
3)lgwr接收到请求,并开始写日志(从内存写入物理文件);【I/O资源消耗】
4)lgwr完成日志写入;【I/O资源消耗】
5)lgwr通知用户进程已写入完成;【CPU资源消耗】
6)用户进程接收到写入完成的通知。【CPU资源消耗】
而log file sync和log file parallel write就是在1-6的阶段中,通过官方文档的理解,可以将判断出这两个等待事件各自对应哪几个阶段。为了看起来更直观,这里引用一下tanel poder大师的图:
从图中以及对应阶段来看能清晰的看到:
log file sync等待包含阶段为:2,3,4,5
log file parallel write等待包含阶段为:3,4
我们可以看出,正常情况下最慢的环节应该是在3,4阶段,即物理I/O操作,而其他阶段涉及到的动作都是如CPU调度,进程间通信等,而这些动作一般都是非常快的。
说到最慢的物理I/O环节,再对应2个等待事件以及6个阶段,可以这么说:log file parallel write等待是log file sync等待全过程的一部分。这也是为什么碰到log file sync等待就需要同步分析log file parallel write等待的原因了,那就是评估存储的IO性能问题。因为log file parallel write的3,4步就是一个物理写的耗时,且是log file sync等待的一部分。如果log file parallel write等待耗时较高,且和log file sync差不了太多,那就说明log file sync和log file parallel wirte最大的瓶颈在于存储IO性能。而如果log file parallel write耗时不高,但是log file sync耗时较高,那就说明存储性能是没有问题的,问题就出在log file sync的其他阶段了。我们可以根据得到的结论对log file sync进行以下几个方面的分析。
二、分析与方法
1、COMMIT次数过多(即图中的1-2,5-6阶段)
既然log file sync是由于commit导致的,那么commit次数过多自然会导致log file sync等待次数的增加,导致比较严重的等待。反之,如果log file sync的次数很高,但是等待的平均时间不高,则很可能是由于commit过于频繁导致的。
我们可以到数据库分析commit次数的变化量是否有突增:
select b.end_interval_time ,
value-lag(value,1) over(order by a.snap_id) "user commits"
from DBA_HIST_SYSSTAT a,dba_hist_snapshot b
where a.snap_id=b.snap_id and a.snap_id >=&begin_snapid and a.stat_name in('user commits')
--当前内存中commits较高的语句
SELECT SYSDATE,
SE.USERNAME,
SE.SID,
SE.SERIAL#,
SE.SQL_ID,
SE.STATUS,
SE.MACHINE,
SE.OSUSER,
ROUND(ST.VALUE) "COMMIT TIMES",
SA.SQL_TEXT
FROM V$SESSION SE, V$SESSTAT ST, V$SQLAREA SA
WHERE SE.SID = ST.SID
AND ST.STATISTIC# =
(SELECT STATISTIC# FROM V$STATNAME WHERE NAME = 'user commits')
--AND SE.USERNAME IS NOT NULL