OGG中 Lag at Chkpt Time Since Chkpt

最新推荐文章于 2024-08-19 17:22:48 发布

jnrjian

最新推荐文章于 2024-08-19 17:22:48 发布

阅读量3.1k

点赞数

文章标签： dba

原文链接：https://blog.csdn.net/Smasegain/article/details/46872305

版权

本文深入解析了Oracle GoldenGate中LagatChkpt和Lag在Extract和Replicat过程中的含义，探讨了checkpoint的机制，以及Lag实际是源端处理时间和checkpoint时间差。通过测试实例揭示了Timestamp的来源，并总结了Checkpoint Lag的实质。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

1. Lag at Chkpt extract中为什么会出现lag，正常情况，系统繁忙基本不会出现，但是如果就一个session在运行，很长时间没有commit，那么上一次commit的时间和现在系统时间的差距就是 Lag at Chkpt 的值。或者extract hang住了，停在了上一个checkpoint点，也就有了差值。

2. Lag at Chkpt replicat中为什么会出现lag，如果处理某一个trail文件，源trail文件中两次checkpoint时间很长，那么处理完上一个commit的事务，在源端记录了产生这个事务的时间，新事物又没来，就一直在等，源端上一个事务的时间点和现在的系统时间差值就是Lag at Chkpt。可能几个小时，但只要新来一个事务，Lag at Chkpt就会为0.

3. Time Since Chkpt 就是处理这个事务应用的时间，如果replicate正在apply log.

一.概要
前段时间,一个朋友问我关于goldengate中的Lag的问题。因涉及到Lag的详细行程过程，而我一直觉得Lag就是：本进程处理一条记录的时间点和前一个进程处理该记录时的时间点的差距，因为一般情况下都是目标端的REP进程出现LAG的机会比较大，平时也更加侧重于如何去诊断LAG和优化，所以当纠结一些详细原因的时候就卡住了(到底是不是当前处理时间和源库产生的时间)。那既然要更加明确Lag是怎么回事，就需要了解ogg的checkpoint和Lag的原理。以下是相关官方文档的说法:

Checkpoints store the current read and write positions of a process to disk for recovery purposes. Checkpoints ensure that data changes that are marked for synchronization actually are captured by Extract and applied to the target by Replicat, and they prevent redundant processing. They provide fault tolerance by preventing the loss of data should the system, the network, or an Oracle GoldenGate process need to be restarted. For complex synchronization configurations, checkpoints enable multiple Extract or Replicat processes to read from the same set of trails. Checkpoints work with inter-process acknowledgments to prevent messages from being lost in the network. Oracle GoldenGate has a proprietary guaranteed-message delivery technology.(通过这一段，说明checkpoint是用于记录进程当前到底处理到哪个位置的，这个checkpoint是用于防止数据的丢失和重复应用的。)

Extract creates checkpoints for its positions in the data source and in the trail. Because Extract only captures committed transactions, it must keep track of operations in all open transactions, in the event that any of them are committed. This requires Extract to record
a checkpoint where it is currently reading in a transaction log, plus the position of the start of the oldest open transaction, which can be in the current or any preceding log. To control the amount of transaction log that must be re-processed after an outage, Extract persists the current state and data of processing to disk at specific intervals, including the state and data (if any) of long-running transactions. If Extract stops after one of these intervals, it can recover from a position within the previous interval or at the last checkpoint, instead of having to return to the log position where the oldest open longrunning transaction first appeared. For more information, see the BR parameter in the Oracle GoldenGate Windows and UNIX Reference Guide.(本段也很明确的说了，extract进程仅抓取committed的事务。当有大事务的时候它周期性的将cache中未提交的达到一定大小的事务保存到本地磁盘方便进行bound recovery)

Replicat creates checkpoints for its position in the trail. Replicat stores its checkpoints in a checkpoint table in the target database to couple the commit of its transaction with its position in the trail file. The checkpoint table guarantees consistency after a database recovery by ensuring that a transaction will only be applied once, even if there is a failure of the Replicat process or the database process. For reporting purposes, Replicat also has a checkpoint file on disk in the dirchk sub-directory of the Oracle GoldenGate directory. Checkpoints are not required for non-continuous types of configurations that can be re-run from a start point if needed, such as initial loads.

Checkpoint Lag:
Checkpoint Lag is the lag, in seconds, at the time the last checkpoint was written to the trail(Checkpoint Lag就是最近一个检查点发生的时间与源端产生此数据时的时间差，至于为什么是源端，后面会有相应的测试。).
For example, if the following is true...
● Current time = 15:00:00(当前时间)
● Last checkpoint = 14:59:00(上一个检查点时间)
● Timestamp of the last record processed =14:58:00(上一条记录在Trail中的Timestamp,此时间戳为源端数据库处理的时间戳)

二.相关资料：
在OGG的Reference中的LAG REPLICAT和LAG EXTRACT命令中有相关说明帮我们更好了解LAG到底是什么:
About Replicat lag
For Replicat, lag is the difference, in seconds, between the time that the last record was
processed by Replicat (based on the system clock) and the timestamp of the record in the
trail.(REP的LAG是REP进程处理记录的时间和TRAIL文件中的时间差异)

About Extract lag
For Extract, lag is the difference, in seconds, between the time that a record was processed
by Extract (based on the system clock) and the timestamp of that record in the data source
(EXT的LAG是ext处理的时间点和源库之间的时间差异)

三.测试记录:
为了确认LAG中提到的Timestamp是上个进程的处理时间(也就是说每个进程都会更新此值)还是源端写入的值，进行如下测试：
1).停掉源端的ext和dp进程后进行数据操作(操作后通过logminer抓取执行时间)
DATASRC TEST 2014-10-06 16:59:33 drop table datasrc.test purge;

2).间隔半个小时后启动ext进程并通过logdump获取trail文件中的timestamp：
Logdump 18 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x00)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 1191 (x04a7) IO Time : 2014/10/06 16:59:33.000.000
IOType : 160 (xa0) OrigNode : 0 (x00)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 0 AuditPos : 0
Continued : N (x00) RecCount : 1 (x01)

2014/10/06 16:59:33.000.000 DDLOP Len 1191 RBA 1118 ( 由此可以看出ext写入到trail的也是源端的时间)
Name:
After Image: Partition 0 G s
2c43 353d 2735 3133 272c 2c42 373d 2735 3133 272c | ,C5='513',,B7='513',
2c42 323d 2738 3837 3730 272c 2c42 333d 2744 4154 | ,B2='88770',,B3='DAT
4153 5243 272c 2c42 343d 2754 4553 5427 2c2c 4331 | ASRC',,B4='TEST',,C1
323d 2727 2c2c 4331 333d 2727 2c2c 4235 3d27 5441 | 2='',,C13='',,B5='TA
424c 4527 2c2c 4236 3d27 4452 4f50 272c 2c42 383d | BLE',,B6='DROP',,B8=
2747 4753 5243 2e47 4753 5f44 444c 5f48 4953 5427 | 'GGSRC.GGS_DDL_HIST'
2c2c 4239 3d27 5359 5327 2c2c 4337 3d27 3131 2e32 | ,,B9='SYS',,C7='11.2

3).间隔15分钟后启动datapump进程后查看目标端的trail文件中的内容:
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x00)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 1191 (x04a7) IO Time : 2014/10/06 16:59:33.000.203
IOType : 160 (xa0) OrigNode : 0 (x00)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 0 AuditPos : 0
Continued : N (x00) RecCount : 1 (x01)

2014/10/06 16:59:33.000.203 DDLOP Len 1191 RBA 1231 (发现此timestamp也是接近于源端时间，但是为何会有差异需要研究。)
Name:
After Image: Partition 0 G s
2c43 353d 2735 3133 272c 2c42 373d 2735 3133 272c | ,C5='513',,B7='513',
2c42 323d 2738 3837 3730 272c 2c42 333d 2744 4154 | ,B2='88770',,B3='DAT
4153 5243 272c 2c42 343d 2754 4553 5427 2c2c 4331 | ASRC',,B4='TEST',,C1
323d 2727 2c2c 4331 333d 2727 2c2c 4235 3d27 5441 | 2='',,C13='',,B5='TA
424c 4527 2c2c 4236 3d27 4452 4f50 272c 2c42 383d | BLE',,B6='DROP',,B8=
2747 4753 5243 2e47 4753 5f44 444c 5f48 4953 5427 | 'GGSRC.GGS_DDL_HIST'
2c2c 4239 3d27 5359 5327 2c2c 4337 3d27 3131 2e32 | ,,B9='SYS',,C7='11.2

4).一定时间后启动replicat通过checkpointtable发现audit_time就是源端写入时间:
SQL> select * from ggtgt.ckpt;
GROUP_NAME GROUP_KEY SEQNO RBA AUDIT_TS CREATE_TS LAST_UPDATE_TS
---------- -------------------- ----------- -------------------- ----------------------------- -------------------- --------------------
REPSRC 2430376187 1 2564 2014-10-06 16:59:33.000203 2014/10/6 16:53:52 2014/10/6 19:41:47

GGSCI (node110) 17> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
REPLICAT RUNNING REPSRC 02:19:11 00:00:00

五.结论
通过相关理论资料和测试后，可以确定Checkpoint Lag实际上就是当前进程上个Checkpoint发生的时间点和检查点前一条记录的timestamp(记录在trail文件中的源端数据库处理时间)的差异
————————————————
版权声明：本文为CSDN博主「司马松儆」的原创文章，遵循CC 4.0 BY-SA版权协议，转载请附上原文出处链接及本声明。
原文链接：https://blog.csdn.net/Smasegain/article/details/46872305