MySQL数据库一致性验证工具pt-table-checksum（Percona Toolkit ）-CSDN博客

https://www.percona.com/doc/percona-toolkit/2.2/pt-table-checksum.html

Percona Toolkit 时 Percona公司提供的一个数据库管理工具包，包括常用的数据一致性验证、差异补齐、查找重复索引、分析二进制日志信息等，大大方便DBA查找和解决问题、优化数据库性能。但是作为一款工具，其使用仍然可能会对数据库产生影响，所以在使用该工具包执行任何操作执行，都需要完整的数据库备份。

以下内容主要从官档摘录。

Percona Toolkit is mature, proven in the real world, and well tested, but all database tools can pose a risk to the system and the database server. Before using this tool, please:

Read the tool’s documentation
Review the tool’s known “BUGS”
Test the tool on a non-production server
Backup your production server and verify the backups

功能点

自助查找（满足过滤条件的）数据库、表
每次操作一张表，所以不会占用大量的内存，也不会额外增加系统负载，即使是大表
- 以块为单位，批量处理大表；每次以 single REPLACE..SELECT query 批量检查
- 在期望的时间内，动态的预估可以检查的块大小
- 确保检查时对系统无影响并且不会增加同步压力、提升系统负载
- 默认的期望检查时间在0.5秒
持续跟踪数据库查询性能，根据性能变化，调整批量处理块大小
- 利用 an exponentially decaying weighted average 来决定块的大小
- 当检测到系统负载较高时，会立刻改变(减少)块的大小
- 当前检测的数据表必须有主键或者唯一约束，至少要有索引；如果没有索引，将会配置一个合适的块大小做检查
自动发现并连接主备库，并不断检测复制关系；如果没有，指定 --recursion-method
- 如果备库延迟主库太多，则暂停运行直至主备无延迟
- 如果备库报错或者停止了，则暂停运行并等待重新正常运行
- 如果备库上设置了过滤条件 slave_skip_errors ，那么除非强制忽略它，否则将不会运行。 because the queries that pt-table-checksum executes could potentially conflict with them and cause replication to fail.
限制数据块的大小，以保证能够安全的进行验证
- 使用EXPALIN检查块的大小，如果比期望的数据库块大，则跳过
- 使用 --chunk-size-limit 限定块的大小
- 如果主库上一个表的数据量很小，但是在备库上却又很大，是不会使用单一的块去做验证
在session级别设置 innodb_lock_wait_timeout = 1 ，这样如果有锁需要等待，那么会首先把自己的session杀掉
监控数据库的负载，如果数据库负载过高或者并发超过25，那么将会进入等待状态，可以设置 --max-load 来确认是否进入等待状态
以低优先级运行，以不影响正常数据库服务为前提；如果需要杀掉验证线程，那么直接杀掉好了；如果在正常的验证过程中，连接中断，那么会重新连接
配置--resume 可以在遇到错误导致验证中断的情况下，自动重新运行；从最后一次验证的表的最后一个块开始；可以使用Control-C终止连接
当验证完一张表之后，pt-table-checksum 会等待所有的备库验证完成；当都结束后，比对备库的数据是否和主库一致，输出比对结果
在验证过程中，会一直输出验证进度
- 每张表的进度
- 因等待备库追赶上主库、或验证备库是否存在差异的过程中，同样数据等待信息
- 可以选择 --quiet 参数，静默执行
如果要查看已经验证的信息以及主备库表的差异，可以执行以下查询，每一行代表一个数据块的信息

 SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
FROM percona.checksums
WHERE (
 master_cnt <> this_cnt
 OR master_crc <> this_crc
 OR ISNULL(master_crc) <> ISNULL(this_crc))
GROUP BY db, tbl;

为了使用pt-table-sync version 1.0，可以做一下修改，让其与Version 2.0 of pt-table-checksum 兼容

ALTER TABLE checksums ADD boundaries VARCHAR(500);
UPDATE checksums
 SET boundaries = COALESCE(CONCAT('id BETWEEN ', lower_boundary,
    ' AND ', upper_boundary), '1=1');

限制

要求statement-based 复制模式，并且执行中在主库设置binlog_format=STATEMENT ，但是因为mysql 的限制，复制关系并不会受影响；会自动检查binlog_format，可以查看--[no]check-binlog-format
如果存在主备库schema 、表结构不同，主备关系将会被破坏

参数

-[no]check-binlog-format：默认 yes，检查 binlog_format
--check-interval：默认 1，Sleep time between checks for --max-lag.
--[no]check-plan：默认 yes，首先 Expalin table，根据执行计划将表分为多个小块
--[no]check-replication-filters：默认 yes，检查复制关系中是否有过滤条件
--max-lag：当备库延迟多久之后，检测进程自动暂停
--check-slave-lag：当指定备库复制延时时，才会暂停
--[no]check-slave-tables：默认 yes，当主备库表不一致或者表上字段不一致，会导致复制异常
--chunk-index：使用指定的索引进行分块
--chunk-index-columns：使用复合索引的前几个字段做分块
--chunk-size：默认 1000，指定多少行作为一个块，或者以 k M G 为单位，最好不配置
--chunk-time：每个块的检查时间，超时则舍弃这个块，与 --chunk-size互斥
--chunk-size-limit：--chunk-size的倍数
--chunk-time：默认 0.5S，默认一个块的检测时间
--columns：逗号分隔，全局设置，只检查指定的字段；建议检查单个表的时候使用
--[no]create-replicate-table：创建记录检测信息的表
--databases：只检查指定的数据库
--[no]empty-replicate-table：删除当前检查表的之前的验证信息
--max-load：以当前 SHOW GLOBAL STATUS参数 value为基准，默认值超过+20%时，进程pause，或者直接指定参数=xxx
--progress：默认30s，每隔多长时间输出日志
--quiet：只输出不一致的信息
--recurse：默认无限大，检测的备库的层级，M-S-S……
--recursion-method：发现复制关系的方法
    METHOD       USES
    ===========  =============================================
    processlist  SHOW PROCESSLIST
    hosts        SHOW SLAVE HOSTS
    cluster      SHOW STATUS LIKE 'wsrep\_incoming\_addresses'
    dsn=DSN      DSNs from a table
    none         Do not find slaves
    以 dsn方法为例：--recursion-method dsn=h=host,D=percona,t=dsns
    必须要有一张表来显示的配置备库的信息：
    CREATE TABLE `dsns` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `parent_id` int(11) DEFAULT NULL,
      `dsn` varchar(255) NOT NULL, # h=replica_host,u=repl_user,p=repl_pass
      PRIMARY KEY (`id`)
    );
--replicate：检测结果存储的表，默认--[no]create-replicate-table 会创建
    CREATE TABLE checksums (
       db             CHAR(64)     NOT NULL,
       tbl            CHAR(64)     NOT NULL,
       chunk          INT          NOT NULL,
       chunk_time     FLOAT            NULL,
       chunk_index    VARCHAR(200)     NULL,
       lower_boundary TEXT             NULL,
       upper_boundary TEXT             NULL,
       this_crc       CHAR(40)     NOT NULL,
       this_cnt       INT          NOT NULL,
       master_crc     CHAR(40)         NULL,
       master_cnt     INT              NULL,
       ts             TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
       PRIMARY KEY (db, tbl, chunk),
       INDEX ts_db_tbl (ts, db, tbl)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--[no]replicate-check：检测数据内容（字段值）是否一致
--replicate-check-only：不检查表，只检查之前的检测结果然后退出
--replicate-check-retries：默认是1，当遇到差异时，重复检测的次数
--replicate-database：只检测当前指定的数据库
--resume：接着上次退出时检测到的位置运行
--retries：当遇到锁或者线程被杀时，重复尝试运行的次数
--run-time：进程执行的时间，超过则停止
--slave-skip-tolerance：主库表只有一个块，当预估的备库表数据量在主的多大范围内时，继续执行
--where：pt-table-checksum --where "ts > CURRENT_DATE - INTERVAL 1 DAY" 全局的配置

脚本

pt-table-checksum --nocheck-binlog-format *--nocheck-plan* --nocheck-replication-filters --replicate=percona.checksums --set-vars innodb_lock_wait_timeout=120 --host=192.168.3.25 --port=3306 --user=root --password=zhjie198 --databases=working --tables=x

输出

# A software update is available:
#   * The current version for Percona::Toolkit is 2.2.20.

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
01-16T16:57:55      0      0        2       1       0   0.418 working.x

报错

# 无索引
01-17T15:46:58 Cannot checksum table p95169.expert_hospdept_relation: There is no good index and the table is oversized. at /usr/bin/pt-table-checksum line 6559.
    CREATE TABLE `expert_hospdept_relation` (
      `EXPERTHOSPDEPTUUID` varchar(40) DEFAULT NULL,
      `EXPERTUUID` varchar(40) DEFAULT NULL,
      `HOSPITALUUID` varchar(40) DEFAULT NULL,
      `HOSPDEPTUUID` varchar(40) DEFAULT NULL,
      `EXPERTSORTCODE` int(11) DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8

# 显示 progress当前处理进度
Checksumming sonar.project_measures:  36% 00:51 remain
01-17T15:49:33      0      0  3720214      74       0  48.901 sonar.project_measures

# 数据块太大
01-17T15:50:15 Skipping chunk 1 of tigasedb.tig_ma_msgs because it is oversized.  The current chunk size limit is 225674 rows (chunk size=112837 * chunk size limit=2.0), but MySQL estimates that there are 282244 rows in the chunk.
Checksumming tigasedb.tig_ma_msgs:  36% 00:52 remain
01-17T15:51:11      0      0  2315665      98      25  55.961 tigasedb.tig_ma_msgs
…………
01-17T15:51:23 Skipping chunk 5 of tigasedb2.tig_ma_msgs because it is oversized.  The current chunk size limit is 115534 rows (chunk size=57767 * chunk size limit=2.0), but MySQL estimates that there are 121246 rows in the chunk.
01-17T15:51:29      0      0   617401      17       5   9.202 tigasedb2.tig_ma_msgs

脚本

pt-table-checksum --nocheck-binlog-format --nocheck-replication-filters --replicate=percona.checksums --set-vars innodb_lock_wait_timeout=50 --host=192.168.3.25 --port=3306 --user=root --password=zhujie1986 --databases working --quiet --tables department
# 检测到备库不存在数据表
01-17T17:43:00 Skipping table working.department because it has problems on these replicas:
Table working.department does not exist on replica lx3-26
This can break replication.  If you understand the risks, specify --no-check-slave-tables to disable this check.
01-17T17:43:00 Error checksumming table working.department: Error getting row count estimate of table working.department on replica lx3-26: DBD::mysql::db selectrow_hashref failed: Table 'working.department' doesn't exist [for Statement "EXPLAIN SELECT * FROM `working`.`department` WHERE 1=1"] at /usr/bin/pt-table-checksum line 6679.

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
01-17T17:43:00      1      0        0       0       0   0.002 working.department

脚本

**pt-table-checksum --nocheck-binlog-format --nocheck-replication-filters --replicate=percona.checksums --set-vars innodb_lock_wait_timeout=50 --host=192.168.3.25 --port=3306 --user=root --password=zhujie1986 --databases working --tables department --replicate-check**
# 检测到数据不一致时

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
01-17T17:46:42      0      1        7       1       0   0.010 working.department