SQL Server CDC Admin & Monitor

zhengyuepo

Last modified: 2022-03-18 11:53:04

Tags: database, sql, database

First published: 2022-03-18 11:46:00

Original link: https://blog.csdn.net/wujiandao/article/details/51253438

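1. The two built-in SQL Server Agent jobs. The two jobs share much of their processing logic. For example, when many tables are CDC-enabled, each table sits at a different LSN, so batch control is hard when extracting change data and cleaning up leftover data.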

1.1 Capture job

1.1.1 continuous and polling interval: continuous takes one of two values, 1 or 0. A value of 1 means the capture job runs continuously, and polling interval sets the WAITFOR delay between scans. After each polling interval elapses, the job extracts more CDC data into the change table, then pauses again for the time given by polling interval; during the pause the job still shows as active in SSMS. When continuous is 0, the capture Agent job stops once it has finished extracting data.

The whole capture job starts from [sys].[sp_MScdc_capture_job], which calls [sys].[sp_cdc_scan] along the way.

[sys].[sp_MScdc_capture_job] reads the parameters the capture job needs, including continuous and polling interval, from msdb.dbo.cdc_jobs. [sys].[sp_cdc_scan] then uses the value of continuous to decide whether it runs its own WAITFOR loop.
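
As a rough sketch of how these settings can be inspected and changed, the documented procedures sys.sp_cdc_help_jobs and sys.sp_cdc_change_job can be used rather than editing msdb.dbo.cdc_jobs by hand. It assumes CDC is already enabled in the current database; the 10-second interval is only an illustrative value.

```sql
-- Show the current capture and cleanup job settings for this database.
EXEC sys.sp_cdc_help_jobs;

-- The same values as stored in msdb (one row per job per database).
SELECT job_type, continuous, pollinginterval, maxtrans, maxscans, retention, threshold
FROM msdb.dbo.cdc_jobs
WHERE database_id = DB_ID();

-- Illustrative change: stay in continuous mode, wait 10 seconds between scans.
EXEC sys.sp_cdc_change_job
    @job_type = N'capture',
    @continuous = 1,
    @pollinginterval = 10;

-- A running job only picks up new settings after it is restarted.
EXEC sys.sp_cdc_stop_job  @job_type = N'capture';
EXEC sys.sp_cdc_start_job @job_type = N'capture';
```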

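1.1.2 How are all transactions from a given period loaded accurately into the change table?

[sys].[sp_cdc_scan] works by calling the Log Reader to read the transactions that need to be transferred, so the change table is loaded as the LSNs are read in sequence. The stored procedure calls sp_replcmds to load the captured changes into the change table, but calling it repeatedly without running sp_repldone in between fails: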

Msg 22863, Level 16, State 1, Procedure sp_replcmds, Line 259
Failed to insert rows into Change Data Capture change tables. Refer to previous errors in the current session to identify the cause and correct any associated problems.

Msg 3621, Level 16, State 6, Procedure sp_replcmds, Line 259

The statement has been terminated.

The statement has been terminated.

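So what does sp_repldone do? sp_replcmds loads the captured changes into the change table; do those changes disappear automatically once they have been loaded? They do not. The error above appeared after running sp_replcmds twice in a row, so until it is cleaned up this data still sits in some cache. Where exactly is not yet known and needs more research. What is certain is that sp_repldone marks this data as replicated, so the next sp_replcmds call can pick up the latest changed data.

sp_repldone belongs to the Log Reader; in other words, this command is actually run by the Log Reader process. According to the official documentation, the Log Reader uses sp_repldone to mark a range of LSNs as replicated. The captured change data is therefore a set of DML commands stored in the transaction log, and the Log Reader tracks whether each LSN range needs to be replicated and whether it already has been. To mark the entire transaction log as replicated: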

EXEC sp_repldone @xactid = NULL, @xact_seqno = NULL, @numtrans = 0, @time = 0, @reset = 1

1.2 Clean Up Job

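1.2.1 My guess at how this job cleans up CDC data that has already been replicated: the Log Reader removes from each change table, by LSN, the transaction log entries that were marked as replicated more than a certain time ago (the retention set in msdb.dbo.cdc_jobs). But how does the system record which table stores which LSNs? Surely it does not loop through the change tables one by one?

1.2.2 cdc.lsn_time_mapping is used to compute the new low water mark. Based on it, the job deletes the outdated part of each change table. How does cdc.lsn_time_mapping get populated? My first guess was checkpoint, but that is wrong, and it is easy to show. Just run: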

select top 10 * from cdc.lsn_time_mapping order by start_lsn desc

checkpoint

select top 10 * from cdc.lsn_time_mapping order by start_lsn desc

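The next guess was the pollinginterval configured in msdb.dbo.cdc_jobs. Every so often (5 seconds by default), the capture job pulls the changes marked for replication in the transaction log into the change tables, so changing pollinginterval in msdb.dbo.cdc_jobs and restarting the capture job should prove it. The result was a surprise: the rows in cdc.lsn_time_mapping do not depend on pollinginterval. As long as the capture job is running, a row is added to cdc.lsn_time_mapping every 5 minutes, and once the capture job stops, the rows stop arriving. A forum exchange about this 5-minute design is quoted further down.

MSDN describes cleaning up a change table as follows: find the largest tran_end_time in cdc.lsn_time_mapping; that tran_end_time corresponds to an LSN. Passing that LSN to sp_cdc_cleanup_change_tables deletes data from the change table. One part is hard to grasp: when several entries share the same commit time, the smallest LSN must be used. If the same commit time can map to different LSNs, the LSN becomes hard to understand: if the whole database can have different LSNs at the same moment, how are they written to the transaction log? After a few more readings it makes sense: within one large transaction, the individual DML commands execute at different times, yet they all share the same final commit time.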
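
A manual cleanup along these lines could look like the sketch below. It uses the documented single-table procedure sys.sp_cdc_cleanup_change_table rather than the internal procedure named above, and it assumes a capture instance called dbo_region (the name that appears later in this article); the 3-day cutoff and the threshold of 5000 are illustrative values.

```sql
-- Illustrative cutoff: drop change rows committed more than 3 days ago.
DECLARE @cutoff datetime = DATEADD(day, -3, GETDATE());

-- Map the cutoff time to an LSN that exists in cdc.lsn_time_mapping.
DECLARE @low_water_mark binary(10) =
    sys.fn_cdc_map_time_to_lsn('largest less than or equal', @cutoff);

-- NULL means no mapping entry is that old, so there is nothing to clean up.
IF @low_water_mark IS NOT NULL
    EXEC sys.sp_cdc_cleanup_change_table
        @capture_instance = N'dbo_region',   -- assumed capture instance name
        @low_water_mark   = @low_water_mark,
        @threshold        = 5000;            -- max rows deleted per statement
```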

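How does sp_cdc_cleanup_change_tables work internally? sp_helptext does not show it, so it is a black box. The same holds for many other CDC stored procedures whose code cannot be found. One site does have some notes: G-Productions

The following forum post explains how change table cleanup works: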

Change Data Capture (CDC) cleanup job only removes a few records at a time

Summary of the problem: I’m a beginner with SQL Server. For a project I need CDC to be turned on. I copy the cdc data to another (archive) database and after that the CDC tables can be cleaned immediately. So the retention time doesn’t need to be high, I just put it on 1 minute and when the cleanup job runs (after the retention time is already fulfilled) it appears that it only deleted a few records (the oldest ones). Why didn’t it delete everything? Sometimes it doesn’t delete anything at all. After running the job a few times, the other records get deleted. I find this strange because the retention time has long passed.

Entire explanation of what I did: I set the retention time at 1 minute (I actually wanted 0 but it was not possible) and didn’t change the threshold (= 5000). I disabled the schedule since I want the cleanup job to run immediately after the CDC records are copied to my archive database and not particularly on a certain time.
My logic for this idea was that for example there will be updates in the afternoon. The task to copy CDC records to archive database should run at 2:00 AM, after this task the cleanup job gets called. So because of the minimum retention time, all the CDC records should be removed by the cleanup job. The retention time has passed after all?
After a suggestion I just tried to see what happened when I set up a schedule again in the job (with a retention time of an hour this time), like how CDC is meant to be used in general. I had 3 records in that table. I changed 2 records at the same time and a few minutes later another record. After the time has passed I checked the CDC table and turns out it also only deletes the oldest record. I set another schedule again and then it deleted another record (the 2nd record of the 2 that I updated at first). Why aren’t they purged all at once? What am I doing wrong?

I made a workaround where I made a new job with the task to delete all records in the CDC tables (and disabled the entire default CDC cleanup job). This works better as it removes everything but it’s bothering me because I want to work with the original cleanup job and I think it should be able to work in the way that I want it to.

The answer from Microsoft:

I’ve tested and refered to the definition of the clean job stored procedure [sys].[sp_MScdc_cleanup_job].

You are wondering why it only delete a few records even retention is set to 1 minute. Here is a simple logic for how clean job determine which record should be deleted:

[sys].[sp_MScdc_cleanup_job] will use a low_water_mark_time as flag, then all commit_time less than low_water_mark_time will be deleted. How to get a low_water_mark_time:

1.The clean job will find the latest commit_time from cdc.lsn_time_mapping, then minus retention time (here is 1 minute as you set) as a temp low_water_mark_time.

2.Find the largest commit_time less than or equal the temp low_water_mark_time as the final low_water_mark_time. Below query can help you to find a low_water_mark_time in your environment, you can test (-1 is the retention):

select sys.fn_cdc_map_lsn_to_time
(sys.fn_cdc_map_time_to_lsn('largest less than or equal',(dateadd(minute, -1, (select max(tran_end_time) from cdc.lsn_time_mapping)))))
For the deleting records, clean job use sys.sp_cdc_cleanup_change_tables to do it. Here is the part of the code, @p2 is the low_water_mark_time we get above:

set @stmt = N'delete top( @p1 ) ' +

N' from ' + @change_table +

N' where __$start_lsn < @p2 '

Now let’s have a simulation:

Suppose you have three records in cdc.dbo_table_CT which commit_time are 10:00, 10:03, 10:06. You execute the clean job at 10:08. Then low_water_mark_time should be 10:03 (10:06 minus 1min, then choose the largest less than or equal one), from above code for deleting, only 10:00 will be deleted.

I admit the retention is somehow confuse here. Hope above explanation can help you understand how this parameter is used.

You may wonder what if I only have 10:00, 10:03, 10:06 three records and execute clean job at 11:00? From above logic, the low_water_mark_time should be 10:03 as well. Will it still only delete the 10:00? Actually you monitor cdc.lsn_time_mapping for some time and you can find there will be a record with tran_id = 0 inserted every 5 minutes. This design is to prevent above condition happening. So if you set retention to 1 minute, a record will exist here for at most 10 minutes. :)

Two points deserve attention here:

1 Final low water mark time: sys.sp_MScdc_cleanup_job deletes change table data using the low water mark time as its marker. It first takes the largest transaction end time (max tran_end_time) in cdc.lsn_time_mapping and subtracts the retention from msdb.dbo.cdc_jobs to get a preliminary value, then takes the largest commit time less than or equal to that preliminary value as the final low water mark and deletes everything below it. In the example from the post, with commits at 10:00, 10:03 and 10:06 and cleanup running at 10:08, the first step takes the maximum, 10:06, and subtracts 1 minute to get 10:05; the final low water mark is then 10:03, so only transactions earlier than 10:03 are deleted, which is the 10:00 one.

2 The 5-minute automatic insertion of a new LSN into cdc.lsn_time_mapping: without it, a cleanup job run at 11:00 would still delete only the 10:00 row and nothing else. With a row added every 5 minutes, the 10:06 row survives until about 10:16 at most.
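
The arithmetic of the first point can be replayed without CDC at all. The sketch below is a simulation only: a table variable stands in for cdc.lsn_time_mapping, the date is arbitrary, and the two-step calculation follows the description in the quoted answer rather than the actual source of the cleanup procedure.

```sql
-- Stand-in for cdc.lsn_time_mapping with the three commit times from the example.
DECLARE @mapping TABLE (tran_end_time datetime NOT NULL);
INSERT INTO @mapping (tran_end_time)
VALUES ('2022-03-18T10:00:00'), ('2022-03-18T10:03:00'), ('2022-03-18T10:06:00');

DECLARE @retention_minutes int = 1;
DECLARE @temp_lwm datetime, @final_lwm datetime;

-- Step 1: latest commit time minus retention (10:06 - 1 min = 10:05).
SELECT @temp_lwm = DATEADD(minute, -@retention_minutes, MAX(tran_end_time))
FROM @mapping;

-- Step 2: largest commit time less than or equal to the temporary value (10:03).
SELECT @final_lwm = MAX(tran_end_time)
FROM @mapping
WHERE tran_end_time <= @temp_lwm;

SELECT @temp_lwm AS temp_low_water_mark, @final_lwm AS final_low_water_mark;

-- Rows strictly below the final low water mark are deleted: only 10:00.
SELECT tran_end_time AS deleted_by_cleanup
FROM @mapping
WHERE tran_end_time < @final_lwm;
```

Adding a fourth row at 10:11 to the table variable shows the effect of the 5-minute entries: the final low water mark moves to 10:06 and the 10:03 row becomes deletable as well.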

2. Monitoring CDC

2.1 Empty result sets:

select db_name(s.database_id) as dbname, ls.empty_scan_count
from sys.dm_cdc_log_scan_sessions ls
inner join sys.dm_exec_sessions s on ls.session_id = s.session_id
where ls.empty_scan_count <> 0 and db_name(s.database_id) = 'lenistest4'

2.2 Latency and throughput

2.2.1 Latency: the time between a change being committed on the source table and being loaded into the change table

select db_name(s.database_id) as dbname, ls.session_id, ls.latency
from sys.dm_cdc_log_scan_sessions ls
inner join sys.dm_exec_sessions s on ls.session_id = s.session_id
order by ls.latency desc

2.2.2 Throughput: the number of changes written to the change tables per second.

-- multiply by 1.0 so the division is not truncated to an integer
select ls.session_id, ls.command_count * 1.0 / ls.duration as Throughput, db_name(s.database_id) as dbname
from sys.dm_cdc_log_scan_sessions ls
inner join sys.dm_exec_sessions s on ls.session_id = s.session_id
where ls.duration <> 0
order by ls.command_count desc

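3. Processing change data

3.1 Validate LSN boundaries: check that the lower and upper LSN values are valid. If a boundary is invalid or falls outside the valid range, a query such as the following fails with the error shown after it: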

select * from cdc.fn_cdc_get_all_changes_dbo_region(0x01,0x10,'all')

Msg 313, Level 16, State 3, Line 6

An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_ ... .
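
One way to avoid this error is to compare the requested boundaries with the valid range before querying. The sketch below assumes the capture instance dbo_region exists; the two requested values are hypothetical and simply reuse the ones from the failing call.

```sql
-- Hypothetical boundaries, for example saved by an earlier sync run.
DECLARE @from_lsn binary(10) = 0x01, @to_lsn binary(10) = 0x10;

-- Valid range for the capture instance at this moment.
DECLARE @min_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_region');
DECLARE @max_lsn binary(10) = sys.fn_cdc_get_max_lsn();

IF @min_lsn IS NULL OR @max_lsn IS NULL
   OR @from_lsn < @min_lsn     -- older than what the change table still holds
   OR @to_lsn > @max_lsn       -- beyond what the capture job has processed
   OR @from_lsn > @to_lsn      -- empty or reversed range
    PRINT 'Requested LSN range is not valid for dbo_region; query skipped.';
ELSE
    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_region(@from_lsn, @to_lsn, N'all');
```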

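3.2 Query functions: two functions retrieve change data, cdc.fn_cdc_get_all_changes_<capture_instance> and cdc.fn_cdc_get_net_changes_<capture_instance>

3.2.1 cdc.fn_cdc_get_all_changes_<capture_instance>: this is the general-purpose function and has no extra prerequisites. The only thing to watch is the row filter option, which can be 'all' or 'all update old'. The difference is whether the pre-update values are returned for updated rows.

cdc.fn_cdc_get_all_changes_<capture_instance>(@fromLsn, @toLsn, 'all'): for operations 1 and 2, that is delete and insert, the corresponding rows are returned. For operations 3 and 4 (update), only the operation 4 row is returned; operation 3, the data before the update, is skipped.

cdc.fn_cdc_get_all_changes_<capture_instance>(@fromLsn, @toLsn, 'all update old'): for operations 1 and 2, that is delete and insert, the corresponding rows are returned. For operations 3 and 4, both rows are returned, so both the old values and the updated values come back. This is mainly useful for tables without any unique constraint, where the apply logic can only match rows on all columns.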
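
A small sketch of the two options side by side, assuming a capture instance named dbo_region and querying its full valid range:

```sql
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_region');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

-- 'all': each update appears once, as operation 4 (values after the update).
SELECT __$start_lsn, __$seqval,
       CASE __$operation
            WHEN 1 THEN 'delete'
            WHEN 2 THEN 'insert'
            WHEN 4 THEN 'update (after)'
       END AS operation
FROM cdc.fn_cdc_get_all_changes_dbo_region(@from_lsn, @to_lsn, N'all')
ORDER BY __$start_lsn, __$seqval;

-- 'all update old': each update appears twice, as operation 3 then operation 4.
SELECT __$start_lsn, __$seqval,
       CASE __$operation
            WHEN 1 THEN 'delete'
            WHEN 2 THEN 'insert'
            WHEN 3 THEN 'update (before)'
            WHEN 4 THEN 'update (after)'
       END AS operation
FROM cdc.fn_cdc_get_all_changes_dbo_region(@from_lsn, @to_lsn, N'all update old')
ORDER BY __$start_lsn, __$seqval;
```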

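3.2.2 cdc.fn_cdc_get_net_changes_<capture_instance>: this function has prerequisites:

1) The CDC-enabled table must have a primary key or a unique index

2) CDC must be enabled on the table with the parameter @supports_net_changes set to 1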
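
As a sketch of the second prerequisite, this is roughly how a table would be enabled with net changes support. It assumes CDC is already enabled for the database, that dbo.region has a primary key and is not yet CDC-enabled, and that no gating role is wanted (@role_name = NULL is an illustrative choice).

```sql
-- Enable CDC on dbo.region with net changes support.
-- The default capture instance name becomes dbo_region.
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'region',
    @role_name            = NULL,   -- no gating role; illustration only
    @supports_net_changes = 1;

-- Check which capture instances support net changes.
SELECT capture_instance, supports_net_changes
FROM cdc.change_tables;
```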

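3.3 Use cases: with the principles covered, a few use cases help to deepen the understanding and put it to work quickly.

3.3.1 The most direct use is to fetch data by LSN boundaries. This is the simplest case: get the minimum LSN and the maximum LSN, then call the query functions mentioned above. The templates are worth introducing here. A template can be seen as sample source code, or as the usage pattern Microsoft provides for a given SQL Server feature. In SSMS, open Template Explorer from the View menu and go to Change Data Capture to see how to use the query functions above. It contains the code patterns for fetching CDC data in each use case.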

-- ==================================================
-- Enumerate All Changes for the Valid Range Template
-- ==================================================
USE <Database_Name,sysname,Database_Name>
DECLARE @from_lsn binary(10), @to_lsn binary(10)
SET @from_lsn =
   sys.fn_cdc_get_min_lsn('<capture_instance,sysname,capture_instance>')
SET @to_lsn = sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_all_changes_<capture_instance,sysname,capture_instance>
  (@from_lsn, @to_lsn, N'all')
GO

-- ==================================================
-- Enumerate Net Changes for the Valid Range Template
-- ==================================================
USE <Database_Name,sysname,Database_Name>
DECLARE @from_lsn binary(10), @to_lsn binary(10)
SET @from_lsn =
   sys.fn_cdc_get_min_lsn('<capture_instance,sysname,capture_instance>')
SET @to_lsn = sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_net_changes_<capture_instance,sysname,capture_instance>
  (@from_lsn, @to_lsn, N'all')
GO

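3.3.2 Because CDC is a continuously running synchronization process, the configuration takes some effort. As a quick recap of a small example I built earlier: I set up a driven table that records the sync status of each CDC table, such as which tables have been synchronized and up to which LSN: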

-- 1. driven table : log every transaction that transfers the changed data
-- 2. transfer data application
-- 3. Audit testing
-- 4. clean up the transferred changed data

-- 1. driven table : log every transaction that transfers the changed data
-- 1.1 configuration table : to log the tables that are enabled for cdc
-- 1.2 driven table : log transaction that transfers the data change captured
create table dbo.cdc_tables ( cdcId int not null, cdcTable varchar(50), cdcProcessSP varchar(200), cdcEnabled bit )

create table dbo.cdc_driven ( transactionId bigint not null, cdcId int not null,
    cdcStartDT datetime, cdcEndDT datetime, cdcCompleted bit, cdcMinLsn binary(10), cdcMaxLsn binary(10) )

insert into dbo.cdc_tables (cdcId, cdcTable, cdcProcessSP, cdcEnabled) values (1, 'dbo.region', 'dbo.cdcprocessRegion', 1)

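On each new run, fetching the current minimum LSN together with the maximum LSN is enough to get the CDC data to pull. Note, however, that the change table must be emptied after every run to avoid reading the same rows twice.

The second use case solves this problem. Instead of emptying the change table immediately, keep the data for a while and leave the cleanup to the cleanup job. Record the LSN reached by each sync, then take the next LSN after it together with the maximum LSN to fetch the data that this sync batch needs. To get the starting LSN for the next sync, use sys.fn_cdc_increment_lsn(<last time sync LSN>). The template shows the same usage.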

-- =========================================================
-- Enumerate All Changes Since the Previous Request Template
-- =========================================================
USE <Database_Name,sysname,Database_Name>
DECLARE @from_lsn binary(10), @to_lsn binary(10), @previous_to_lsn binary(10)
SET @previous_to_lsn = <previous_to_lsn,binary(10),previous_to_lsn>
SET @from_lsn = sys.fn_cdc_increment_lsn(@previous_to_lsn)
SET @to_lsn = sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_all_changes_<capture_instance,sysname,capture_instance>
  (@from_lsn, @to_lsn, N'all')
GO
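
Combining this template with the driven table from 3.3.2, one sync run might look like the sketch below. This is an illustration under assumptions, not the original application: it assumes the capture instance is dbo_region, that cdcId 1 is the row inserted into dbo.cdc_tables above, that cdcMaxLsn of the last completed row is the previous upper boundary, and that transactionId is simply the next integer.

```sql
DECLARE @cdcId int = 1;   -- dbo.region in dbo.cdc_tables
DECLARE @previous_to_lsn binary(10), @from_lsn binary(10), @to_lsn binary(10);
DECLARE @started datetime = GETDATE();

-- Upper boundary reached by the last completed sync, if there was one.
SELECT TOP (1) @previous_to_lsn = cdcMaxLsn
FROM dbo.cdc_driven
WHERE cdcId = @cdcId AND cdcCompleted = 1
ORDER BY cdcMaxLsn DESC;

-- First run: start at the minimum LSN; later runs: one past the previous boundary.
SET @from_lsn = CASE WHEN @previous_to_lsn IS NULL
                     THEN sys.fn_cdc_get_min_lsn('dbo_region')
                     ELSE sys.fn_cdc_increment_lsn(@previous_to_lsn)
                END;
SET @to_lsn = sys.fn_cdc_get_max_lsn();

-- Nothing new was captured since the last run if the range is empty.
IF @from_lsn <= @to_lsn
BEGIN
    -- In a real job these rows would be applied to the target, not just selected.
    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_region(@from_lsn, @to_lsn, N'all');

    -- Log the batch so that the next run continues after @to_lsn.
    INSERT INTO dbo.cdc_driven
        (transactionId, cdcId, cdcStartDT, cdcEndDT, cdcCompleted, cdcMinLsn, cdcMaxLsn)
    SELECT ISNULL(MAX(transactionId), 0) + 1, @cdcId, @started, GETDATE(), 1,
           @from_lsn, @to_lsn
    FROM dbo.cdc_driven;
END
```

If the cleanup job has already removed rows past the saved boundary, @from_lsn falls below the current minimum LSN and the query function raises the error shown in 3.1, so a boundary check before the SELECT is worth adding in practice.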
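3.3.3 There are many other applications, such as synchronizing CDC data by time, but that is more involved and is not covered here.
————————————————
Copyright notice: this is an original article by CSDN blogger "dbLenis", released under the CC 4.0 BY-SA license. When reposting, include a link to the original source and this notice.
Original link: https://blog.csdn.net/wujiandao/article/details/51253438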
