Advanced Topics -- Backups, Checkpoint Durability, Commit-Level Durability, and Cache and Eviction Tuning

Latest recommended article published on 2024-02-05 21:58:31

caixinGO

Category: wiredtiger

Article link: https://blog.csdn.net/gongcaixin/article/details/104869864

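This article continues the introduction to several critical database features that are closely tied to implementation details. Since I have not yet read the source code, I can only go by the documentation, which gives a limited view.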

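Backups: This section describes how WiredTiger backs up a database (through cursors). It supports full backups (which, subject to the logging configuration, can be limited to a selected subset of files) and incremental backups.
The key to a full backup is copying all of the database files. Files may be appended to while they are being copied (what about space reclamation and compaction?); whatever a file grows by during the copy is ignored. While a full backup is in progress, checkpoints can be created but not deleted. Additionally, if a crash occurs during the period the backup cursor is open and logging is disabled (in other words, when depending on checkpoints for durability), then the system will be restored to the most recent checkpoint prior to the opening of the backup cursor, even if later database checkpoints were completed. Note this exception to WiredTiger’s checkpoint durability guarantees.
An incremental backup builds on an earlier full backup: it simply copies all log files into the backup directory. It has several restrictions: logging must be enabled; log files must not be removed; and because bulk loads are not logged, a new full backup is required after a bulk load. Any number of incremental backups can be layered on top of one full backup (each one just copies the log files again, overwriting files of the same name), provided the backup database has not been opened and run through recovery.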
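
The copy steps described above can be sketched with plain file operations. This is a toy model using only the Python standard library, not the WiredTiger API: the list of files to copy is passed in by hand (in a real application it would come from iterating a backup cursor), and the helper names `full_backup`, `incremental_backup`, and `demo` are hypothetical. The only WiredTiger-specific detail assumed is that log files are named with the `WiredTigerLog.` prefix.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path

LOG_PREFIX = "WiredTigerLog."  # assumed log file name prefix


def full_backup(src: Path, dst: Path, names: list[str]) -> None:
    """Copy every file in the (hand-written, hypothetical) backup listing."""
    dst.mkdir(parents=True, exist_ok=True)
    for name in names:
        shutil.copy2(src / name, dst / name)


def incremental_backup(src: Path, dst: Path) -> list[str]:
    """Copy all log files again; same-named files in the backup are overwritten."""
    copied = []
    for log in sorted(src.glob(LOG_PREFIX + "*")):
        shutil.copy2(log, dst / log.name)
        copied.append(log.name)
    return copied


def demo() -> dict[str, str]:
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp, "db"), Path(tmp, "backup")
        src.mkdir()
        (src / "data.wt").write_text("table-v1")
        (src / "WiredTigerLog.0000000001").write_text("rec1")
        full_backup(src, dst, ["data.wt", "WiredTigerLog.0000000001"])
        # More commits arrive: the first log grows and a second log appears.
        (src / "WiredTigerLog.0000000001").write_text("rec1 rec2")
        (src / "WiredTigerLog.0000000002").write_text("rec3")
        incremental_backup(src, dst)
        return {p.name: p.read_text() for p in sorted(dst.iterdir())}
```

After `demo()` the backup directory holds the original data file plus both log files, with the first log overwritten by its newer contents. Restoring would then rely on recovery replaying those logs against the data file.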

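Checkpoint durability: Checkpoints are the foundation of database durability, and WiredTiger supports them by default. All updates made before a checkpoint is taken are guaranteed to be durable once it completes. This guarantee has an exception: If a crash occurs when a backup cursor is open, then the system will be restored to the most recent checkpoint prior to the opening of the backup cursor, even if later database checkpoints were completed.
While a checkpoint is being taken, operations that require exclusive access, including bulk load, verify, and salvage, return an EBUSY error. When a data source is first opened, it appears in the state of the file's most recent checkpoint; in other words, updates made after the latest checkpoint do not appear in the data source. If the data source has no checkpoint, it opens empty.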
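Automatic checkpoints: A checkpoint can be taken manually through the WT_SESSION::checkpoint interface, or automatic checkpoints can be configured. Automatic checkpoints are triggered by a time period and by log size (if logging is enabled); whichever condition is met first triggers the checkpoint, and both conditions are then reset.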
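
The "whichever comes first, then reset both" rule can be modeled in a few lines. This is a toy sketch, not WiredTiger code: the class `CheckpointTrigger` and its fields are hypothetical names. As far as I understand the documentation, the corresponding real setting is a `wiredtiger_open` configuration string of the form `checkpoint=(wait=60,log_size=2GB)`.

```python
class CheckpointTrigger:
    """Toy model of an automatic checkpoint trigger (hypothetical helper)."""

    def __init__(self, wait_seconds, log_size_bytes):
        self.wait_seconds = wait_seconds
        self.log_size_bytes = log_size_bytes
        self.elapsed = 0.0   # seconds since the last checkpoint
        self.log_bytes = 0   # log bytes written since the last checkpoint

    def advance(self, seconds=0.0, log_bytes=0):
        """Account for elapsed time and log growth.

        Returns "time" or "log_size" if a checkpoint fires, else None.
        """
        self.elapsed += seconds
        self.log_bytes += log_bytes
        reason = None
        if self.elapsed >= self.wait_seconds:
            reason = "time"
        elif self.log_bytes >= self.log_size_bytes:
            reason = "log_size"
        if reason is not None:
            # Whichever condition fired, both counters start over.
            self.elapsed = 0.0
            self.log_bytes = 0
        return reason
```

For example, with `CheckpointTrigger(60, 1000)`, writing 1100 bytes of log within 40 seconds fires a `"log_size"` checkpoint and also restarts the 60-second timer.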
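Checkpoint cursors: A checkpoint cursor opens a static, read-only view of the database.
Checkpoint naming: Checkpoints can be named (except on LSM). A named checkpoint persists until it is dropped or replaced; creating a checkpoint with the same name replaces it. If a checkpoint cannot be replaced, either because a cursor is open on it or because it is being backed up, the checkpoint operation fails. Internal checkpoints are named WiredTigerCheckpoint. Creating a new internal checkpoint deletes all previous internal checkpoints; any that cannot be deleted immediately are deleted once that becomes possible.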
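Checkpoints and file compaction: Checkpoints share file blocks, so dropping a checkpoint may or may not make a file block reusable, depending on whether the dropped checkpoint held the last reference to that block. Because named checkpoints are not reclaimed automatically, their existence keeps shared file blocks alive, so the WT_SESSION::compact function (which compacts data sources) may be unable to accomplish anything.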
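
The block-sharing argument above can be illustrated with a small reference model. It is purely illustrative and assumes nothing about WiredTiger's actual block manager: `BlockStore`, `create`, and `drop` are hypothetical names, and each checkpoint is represented simply as a set of block ids.

```python
class BlockStore:
    """Toy model: each checkpoint references a set of file blocks."""

    def __init__(self):
        self.checkpoints = {}  # checkpoint name -> frozenset of block ids

    def create(self, name, blocks):
        self.checkpoints[name] = frozenset(blocks)

    def drop(self, name):
        """Drop a checkpoint and return the blocks that became reusable.

        A block is reusable only if no remaining checkpoint references it.
        """
        dropped = self.checkpoints.pop(name)
        still_used = set()
        for blocks in self.checkpoints.values():
            still_used |= blocks
        return set(dropped - still_used)
```

With checkpoint `a` holding blocks {1, 2, 3} and a named checkpoint `b` holding {2, 3, 4}, dropping `a` frees only block 1. Blocks 2 and 3 stay pinned for as long as `b` exists, which is why a lingering named checkpoint can leave compaction with nothing to reclaim.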

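Commit-level durability: This is built on checkpoint durability and implemented by enabling logging (presumably a redo log, that is, a WAL). When a transaction commits, its log records are written from memory to the log file. The implementation resembles RocksDB: there is a rich set of log-flush options for trading reliability against performance, and log files can be reclaimed automatically (through periodic background checkpoints). Group commit is also supported to improve performance.
It is worth noting: Recovery is required after the failure of any thread of control in the application, where the failed thread might have been executing inside of the WiredTiger library or open WiredTiger handles have been lost. In most applications, if any thread of control exits unexpectedly, the application will close and re-open the database.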
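
Group commit, mentioned above, amounts to making several commits durable with a single sync. The following is a minimal single-threaded sketch of that idea using only the standard library; it is not how WiredTiger implements its log, and `GroupCommitLog` and `demo` are hypothetical names. In WiredTiger itself, as far as I understand the documentation, logging and sync behavior are chosen through `wiredtiger_open` settings such as `log=(enabled=true)` and `transaction_sync=(enabled=true,method=fsync)`.

```python
import os
import tempfile


class GroupCommitLog:
    """Toy write-ahead log that batches commit records into one fsync."""

    def __init__(self, path):
        self._file = open(path, "ab")
        self._pending = []     # commit records not yet durable
        self.sync_calls = 0    # number of fsync calls issued

    def commit(self, record):
        """Queue a commit record; it is durable only after flush()."""
        self._pending.append(record)

    def flush(self):
        """Write all pending records, then sync once. Returns the batch size."""
        if not self._pending:
            return 0
        count = len(self._pending)
        self._file.write(b"".join(r + b"\n" for r in self._pending))
        self._file.flush()
        os.fsync(self._file.fileno())
        self.sync_calls += 1
        self._pending.clear()
        return count

    def close(self):
        self.flush()
        self._file.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.log")
        log = GroupCommitLog(path)
        for txn in (b"txn-1", b"txn-2", b"txn-3"):
            log.commit(txn)
        flushed = log.flush()
        log.close()
        with open(path, "rb") as f:
            lines = f.read().splitlines()
        return flushed, log.sync_calls, lines
```

Three commits cost one sync here instead of three. The trade-off is that a commit is not durable until the batch it belongs to has been flushed.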

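Cache and eviction tuning: These are the most important performance tuning options. Users can set the cache size and choose which objects (tables) stay resident in memory. Resident objects are faster to access than ordinarily cached objects and need no disk I/O, but if memory runs short, operations return an error or the system stalls.
Users can adjust eviction_target (default 80%; above it, eviction worker threads start working) and eviction_trigger (default 95%; above it, application threads start evicting as well, which usually raises latency and happens when the background eviction threads cannot keep up with cache pressure).
Users can also adjust eviction_dirty_target (default 5%) and eviction_dirty_trigger (default 20%), which work like the previous two settings but apply only to dirty pages. The dirty eviction settings control how much work checkpoints have to do in the worst case (fully understanding this requires understanding the implementation), and also limit how much memory fragmentation is likely for memory allocations related to the cache. Most memory fragmentation is caused by workloads that generate a mix of updates (small allocations) with cache misses (large allocations). Limiting the percentage of cache that can be dirty limits the worst case fragmentation to approximately the same level.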
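
The four thresholds above can be summarized as a small decision function. This is a simplified reading of the description, not WiredTiger's actual eviction logic: the function name `eviction_state` and its return labels are hypothetical, and treating each threshold as inclusive is an assumption.

```python
def eviction_state(used_pct, dirty_pct,
                   target=80, trigger=95,
                   dirty_target=5, dirty_trigger=20):
    """Classify who performs eviction for a given cache fill level.

    used_pct  -- percentage of the cache currently in use
    dirty_pct -- percentage of the cache holding dirty pages
    Thresholds are treated as inclusive (an assumption of this sketch).
    """
    if used_pct >= trigger or dirty_pct >= dirty_trigger:
        # Workers cannot keep up: application threads are pulled in,
        # which is where latency spikes come from.
        return "application"
    if used_pct >= target or dirty_pct >= dirty_target:
        # Background eviction worker threads handle the load.
        return "workers"
    return "idle"
```

For instance, a cache that is only 50% full but 25% dirty already lands in the `"application"` state, which matches the observation that the dirty thresholds act independently of the overall ones.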
By default there is one background eviction thread, and the number is configurable. The eviction=(threads_min) and eviction=(threads_max) configuration values can be used to configure the minimum and maximum number of additional threads WiredTiger will create to keep up with the application eviction load. Finally, if the WiredTiger eviction threads are unable to keep up with application demand for cache space, application threads will be tasked with eviction as well, potentially resulting in latency spikes.
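
Putting the settings from this section together, a configuration string might be assembled as below. The helper `build_open_config` is hypothetical, the example values (1GB cache, up to 4 threads) are arbitrary, and the ordering checks are sanity checks of this sketch rather than a statement of what WiredTiger validates. The key names are the ones discussed above.

```python
def build_open_config(cache_size="1GB",
                      eviction_target=80, eviction_trigger=95,
                      eviction_dirty_target=5, eviction_dirty_trigger=20,
                      threads_min=1, threads_max=4):
    """Assemble a cache/eviction configuration string (hypothetical helper)."""
    if not eviction_target < eviction_trigger:
        raise ValueError("eviction_target must be below eviction_trigger")
    if not eviction_dirty_target < eviction_dirty_trigger:
        raise ValueError("eviction_dirty_target must be below eviction_dirty_trigger")
    if not 1 <= threads_min <= threads_max:
        raise ValueError("need 1 <= threads_min <= threads_max")
    return ",".join([
        f"cache_size={cache_size}",
        f"eviction_target={eviction_target}",
        f"eviction_trigger={eviction_trigger}",
        f"eviction_dirty_target={eviction_dirty_target}",
        f"eviction_dirty_trigger={eviction_dirty_trigger}",
        f"eviction=(threads_min={threads_min},threads_max={threads_max})",
    ])


config = build_open_config()
# With the WiredTiger Python bindings installed (not exercised here), the
# string would be passed along the lines of:
#   conn = wiredtiger.wiredtiger_open(home_dir, "create," + config)
```

Keeping the thresholds in one place makes it easier to experiment with a lower `eviction_dirty_trigger` or more eviction threads while measuring latency.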

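Summary: Based on TokuDB's implementation and the documentation read so far, the following inferences are probably correct. Logically, a checkpoint persists the latest snapshot of the entire data set, and on the next startup the tree state is restored from the most recent checkpoint. Log-level durability is implemented by replaying the log. The key to the implementation is therefore how to take a checkpoint quickly.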
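
The two recovery modes in this summary can be expressed as one function: start from the checkpoint image, then replay any log records newer than it. This is a conceptual sketch only; the record layout `(lsn, key, value)`, the use of `None` as a delete marker, and the name `recover` are all assumptions made for illustration.

```python
def recover(checkpoint, log, checkpoint_lsn):
    """Rebuild state from a checkpoint image plus redo-log replay.

    checkpoint     -- dict holding the state as of the last checkpoint
    log            -- list of (lsn, key, value) records; value None means delete
    checkpoint_lsn -- records at or below this LSN are already in the checkpoint
    """
    state = dict(checkpoint)
    for lsn, key, value in sorted(log, key=lambda rec: rec[0]):
        if lsn <= checkpoint_lsn:
            continue  # already reflected in the checkpoint image
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state
```

With an empty log the result is just the checkpoint, which corresponds to checkpoint durability. With a log, every committed record after the checkpoint is reapplied, which corresponds to commit-level durability.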
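Speculation on how a checkpoint is implemented: the pages in cache that are dirty at the moment the checkpoint starts are persisted. (Implementation guess: first take a global lock, then use metadata to quickly mark the dirty pages as being checkpointed; in TokuDB this would be a quick marking pass over the block-mapping array. The lock is released once marking is done.) Subsequent user writes must copy-on-write these dirty pages. A background checkpoint thread asynchronously writes the marked dirty pages to the file one by one. (Implementation guess: the page-mapping information kept by the block manager is also copied when the snapshot is taken. In TokuDB, the current_ block mapping, that is, the latest one, is copied to _inprogress, and block allocation for the checkpoint is based on _inprogress. When the checkpoint finishes, _inprogress is serialized and persisted, then copied into _checkpointed. On the next startup, the latest _checkpointed is selected for recovery, restoring the latest page mapping, the B-tree root node, and so on.)
Therefore eviction_dirty_trigger should not be set very high; otherwise, in the worst case, copy-on-write may exhaust the cache and the checkpoint will interfere with user write threads. (Setting it very low obviously also affects user write threads, since they will take part in eviction.) Having background threads continuously evict dirty pages at some threshold helps shorten checkpoint time. (How are these already-persisted pages kept invisible the next time the database starts? The block manager's information belongs to a particular snapshot, and newly written blocks get new block numbers, so data newly written to disk is not visible through the block manager as far as the latest checkpoint is concerned.)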
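
The speculated mark-then-copy-on-write scheme can be made concrete with a toy cache. This follows the guess in the text above, not WiredTiger's or TokuDB's actual code, and it is single-threaded, so the global lock is only implied by `begin_checkpoint` being a short step. All names (`ToyCache`, `marked`, `frozen`) are hypothetical.

```python
class ToyCache:
    """Toy model of a checkpoint that marks dirty pages and relies on COW."""

    def __init__(self):
        self.pages = {}      # page id -> current in-memory content
        self.dirty = set()   # pages modified since the last checkpoint began
        self.marked = set()  # pages the running checkpoint still has to write
        self.frozen = {}     # page id -> version preserved for the checkpoint
        self.disk = {}       # page id -> content persisted by checkpoints

    def write(self, pid, content):
        # Copy-on-write: keep the version the running checkpoint must persist.
        if pid in self.marked and pid not in self.frozen:
            self.frozen[pid] = self.pages[pid]
        self.pages[pid] = content
        self.dirty.add(pid)

    def begin_checkpoint(self):
        # Short critical section: only record which pages are dirty right now.
        self.marked = set(self.dirty)
        self.dirty.clear()
        self.frozen = {}

    def run_checkpoint(self):
        # Background work: persist each marked page as of checkpoint start.
        while self.marked:
            pid = self.marked.pop()
            self.disk[pid] = self.frozen.pop(pid, self.pages[pid])
```

If page 1 is rewritten after `begin_checkpoint()`, the checkpoint still persists the old version while the new version stays dirty in cache. The extra copy held in `frozen` is the memory cost that the text warns about when `eviction_dirty_trigger` is set high.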

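Clearly, with large data volumes, writes cause substantial write amplification and random reads in a B-tree (random writes can in theory be turned into sequential writes through block cache management). When write performance falls short, application threads stall (depending on the B-tree's throughput). Reads, by contrast, place very little load on the CPU. Judging the strengths and weaknesses of WiredTiger's own implementation requires reading the code and running performance tests; a description of its basic characteristics can be found here.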