[翻译] SQLite中的原子提交

本翻译谢绝转载

SQLite中的原子提交

1. 介绍

如SQLite这样的事务型数据库的一个重要特性就是原子提交。原子提交意味着在一个事务中的所有针对数据库的变更,要们全部生效,要们全部不生效。原子提交犹如完成对数据库不同的区块瞬间、同时进行不同的写入。真实的硬件向大容量存储串行写,向单个扇区写需要花费固定的时间。所以不可能同时 且(或)瞬间向数据库所在的不同扇区进行写。但是SQLite的原子提交逻辑保证了一个事务的写犹如瞬间同时进行。

即使事务被操作系统崩溃或电源故障打断,事务仍然表现出原子性,这是SQLite的重要属性。

本文描述了SQLite为了创建原子提交的假象所使用的技术。

本文信息只适用于SQLite的回滚模式(“rollback mode”),换句话说SQLite不使用 write-ahead log. write-ahead logging有效的情况下SQLite仍然支持原子提交,但是SQLite使用与本文描述不同的原理来达到原子提交。SQLite如何在write-ahead情况下支持原子提交,请阅读write-ahead log documentation作为扩展信息。

2. Hardware Assumptions 硬件假设

尽管在本文中,我们称大容量存储设备为磁盘,但是大容量存储设备也有可能是闪存。

我们设定磁盘的写的块称为扇区(sector)。磁盘不可能更改比扇区(sector)更小的部分。为了修改比扇区(sector)更小的部分,你必须读取包含你想更改内容的完整的扇区(sector),然后修改,最后将整个扇区(sector)写回去。

对于传统的旋转磁盘,不管读还是写,扇区是传输的最小单元。然后对于闪存,读的最小量(size)通常比写的最小量(size)小的多。SQLite只考虑了写的最小量(size),本文也是如此,当我们说扇区(sector)时,意思就是说一次向大容量存储设备写所允许的最小数据大小。

SQLite在3.3.14版本之前,在所有场景下假定了一个扇区的大小是512字节。在编译时,有一个编译选项可以改变这个值,但是代码上从来没有测试过更大的值。设定扇区大小为512字节好像是合理的,因为最近所有的磁盘驱动内部都使用了512字节扇区。然而最近有人推动将磁盘扇区大小增加到4096字节。同时闪存的扇区大小通常比512字节大。因此,SQLite从3.3.14版本开始,在系统接口层(OS interface layer)有个方法,来询问底层文件系统,查找真正的扇区大小。在当前的实现(版本3.5.0)中,该方法始终返回512字节这个硬编码的值,因为在Unix或Windows上,没有一个标准的方法获取真正的扇区大小。但是这个方法可以让嵌入式制造商根据自己的需求进行调整。我们还为将来在Unix和Windows上填充一个更有意义的实现打开了大门。

SQLite传统上认为扇区写入不是原子的。然后SQLite总是假定扇区写入是线性的。“线性”是指SQLite假定在写入扇区时,硬件从数据的一端开始,然后逐字节写入,直到到达另一端。写入可能从头到尾或从尾到头。如果在写扇区中间发生电源故障,则该扇区可能一部分数据被修改,而另一部分保持不变。SQLite的关键假设是,扇区的任意一部分被改变,则第一个字节或最后一个自己将会被更改。所以硬件从来不会从中间开始往末尾写扇区。我们不知道这个假设是否总是正确的,但它似乎是合理的。

上一段指出SQLite不认为扇区写是原子的。默认情况下这是对的。但是从SQLite3.5.0版本起,有一个叫做虚拟文件系统(VFS)接口的新接口。VFS是SQLite和底层文件系统通信的唯一方法。源码中附带了VFS在Unix和windows的默认实现,并且提供在运行时创建新的自定义VFS实现的一种机制。又一个叫做xDeviceCharacteristics的方法在新的VFS接口中。该方法通过询问底层文件系统来发现文件系统可能展示或没有展示的各种属性和行为。xdeviceCharacteristics方法可能表明扇区写入是原子的,如果确实如此,则SQLite将尝试利用这一事实。但是,UNIX和Windows的默认xdevicecharacteristics方法并不表明扇区写入的原子性,因此通常忽略这些优化。

SQLite认为操作系统会缓存写操作,因此一个写请求在数据真实的存储到大容量存储设备之前就会返回。SQLite更是假定操作系统会对写操作进行重排序。因此,SQLite在关键点执行“flush”或“fsync”操作。SQLite假定flush或fsync在所有待写操作刷新到文件完成之前不会返回。我们被告知,在某些版本的Windows或Linux上,flush或fsync原语被破坏了。这是不幸的。这打开了在数据提交的过程中断电导致的数据损害的可能性。然而SQLite无法测试或修复这种场景。SQLite假设操作系统如它说明的那样工作。如果情况并非如此,那么希望你不会经常断电。

sqlite假定当一个文件的长度增加时,新的文件空间最初包含垃圾,然后用实际写入的数据填充。换句话说,SQLite假定文件的大小在文件内容写入前就被更新了。这是一个悲哀的假设,SQLite必须做一些额外的工作来保证在文件大小已经增加和新的内容在写入之间的这段时间断电 不会造成数据库损坏。VFS的xDeviceCharacteristics方法可能指出文件系统总是在更小文件大小之前写入数据。(查看代码的读者可以看SQLITE_IOCAP_SAFE_APPEND属性)当xDeviceCharacteristics方法指出,文件内容在文件大小增加之前写入,SQLite能够放弃一些多余的数据库保护步骤,从而减少执行提交(commit)所需的磁盘IO数量。然而,当前的实现并没有对Windows和Unix的默认VFS做出这样的假设。

从用户进程的视角来看,SQLite假设文件删除是原子的。针对这点,我们的意思是SQLite请求删除文件,在删除操作期间断电了,一旦电源回复,要么文件完整的存在,并且内容没有被更改,还是原来的内容,要么在文件系统将看不到任何该文件。如果它的某些数据已被更改或删除,或者文件已被截断但未完全删除,则可能导致数据库损坏。

SQLite假设检查和/或修订由于宇宙射线、热噪音、量子波动、设备驱动程序错误引起的位错误由底层硬件和操作系统负责。SQLite不会为了检查损坏检查或IO错误检查向数据库文件增加任何冗余。SQLite假设读取的数据和之前写的数据是完全一样的。

默认的,SQLite假设写入一段字节的系统调用,不会损坏或更改查过这段字节的范围,即使是在写的过程中发生断电或系统崩溃。我们称此为"powersafe overwrite documentation"特性。从版本version 3.7.9(2011-11-01)开始,SQLite不再假定powersafe overwrite。随着大部分磁盘驱动的标准扇区大小从512字节增长到4096字节,为了保持历史的性能水平,有必要假定powersafe overwrite。在最近的SQLite版本中,默认设定了powersafe overwrite。在编译时或运行时,如果需要,可以关闭powersafe overwrite特效的设定。详细情况可以查看powersafe overwrite documentation文档

3. Single File Commit 单文件提交

我们首先概观SQLite对单个数据库文件执行事务原子提交所需的步骤。后面的章节将讨论用于防止电源故障造成的损坏的文件格式的详细信息,以及跨多个数据库执行原子提交的技术。

3.1. Initial State 初始状态

第一次打开数据库连接时计算机的状态展示概念如右侧的图表中的所示。图中最右边的区域(标有“磁盘”)表示存储在大容量存储设备上的信息。每个矩形代表一个扇形。蓝色表示扇区包含原始数据。中间区域是操作系统的磁盘缓存。在我们的例子开始时,该缓存时冷的,这通过保持代表磁盘缓存矩型为空来表示。图中左侧的区域表示SQLite进程使用的内存中的内容。数据库链接刚刚被打开,还没有读取任何信息,所以用户空间时空的。

3.2. Acquiring A Read Lock 获取读锁

在SQLite能够向数据库写之前,必须先读取数据库,查看已经存在的内容。即使只是附加新数据,SQLite仍然需要从sqlite_master表读取数据库模式(schema),如此才能知道如何解析INSERT语句和发现新的信息应该存储在数据库文件的什么地方。

从数据库文件读取的第一步是获取一个数据库文件的共享锁。一个“共享”锁允许2个或多个数据库链接同时从数据库文件读取。但是共享锁会阻止另一个数据库连接在我们读取数据库文件时写入该文件。这个是有必要的,因为如果另一数据库链接正在向数据库文件写入,同时我们正在从数据库文件读取,我们可能读取一部分修改前的数据,另外一部分修改后的数据。这将使另一个进程所做的变更看起来不是原子的。

请注意,共享锁在操作系统的磁盘缓存上,不在磁盘上。通常,文件锁对于操作系统内核实际上只是一个标记。(具体细节取决于特点的操作系统接口层)所以,如果操作系统崩溃或断电,锁将立马消失。通常如果创建锁的进程退出,锁也会消失。

3.3. Reading Information Out Of The Database 从数据库中读取信息

我们能够在获取一个共享锁后开始从数据库文件读取信息。在这个场景下,我们假设一个冷的缓存,所以信息必须先从大容量存储中读取到操作系统缓存,然后从操作系统缓存传输到用户空间。在之后的读取操作中,一些或全部信息可能已经在操作系统缓存中被找到,所以只需要将数据传输到用户空间。

通常只读取数据库文件中页(pages)的一个子集。在这个例子中,我们将显示读取的8页中的3页。在一个典型的应用中,一个数据库会有上千页,一个请求通常只涉及到这些页的一小部分。

3.4. Obtaining A Reserved Lock 获取一个保留锁

对数据库进行更改前,SQLite首先获取一个数据库文件的“保留”锁。保留锁和共享锁类似,都允许其他进程从数据库文件读取。单个保留锁能够和来自其他进程的多个共享锁共存。然而数据库文件只能有一个保留锁。因此一次只有一个进程能够试图向数据库写。

保留锁背后的思想是,它表示一个进程打算在不久的将来修改数据库文件,但尚未开始进行修改。由于修改尚未开始,其他进程可以继续从数据库中读取。但是,其他进程也不应该开始尝试写入数据库。

3.5. Creating A Rollback Journal File 创建恢复日志

在对数据库文件进行任何更改之前,sqlite首先创建一个单独的回滚日志文件,并将要更改的数据库页的原始内容写入回滚日志。回滚日志背后的思想是,它包含将数据库恢复到原始状态所需的所有信息。

回滚日志包含一个小头部(在图中以绿色显示),它记录数据库文件的原始大小。因此,如果更改导致数据库文件增长,我们仍然可以知道数据库的原始大小。写入回滚日志的每个数据库页的页码被存储在一起。

创建新文件时,大多数桌面操作系统(Windows、Linux、Mac OS X)实际上不会向磁盘写入任何内容。新文件仅在操作系统磁盘缓存中创建。直到稍后,当操作系统有空闲时间时,才会在大容量存储中创建该文件。这给用户留下了这样的印象,即I/O的发生速度比实际磁盘I/O的速度要快得多。我们在右侧的图表中说明了这一想法,即新的回滚日志只出现在操作系统磁盘缓存中,而不出现在磁盘本身上。

3.6. Changing Database Pages In User Space 更改用户空间中的数据库页

原始页内容保存在回滚日志中后,可以在用户内存中修改页。每个数据库连接都有自己的用户空间私有副本,因此在用户空间中所做的更改仅对正在进行更改的数据库连接可见。其他数据库连接仍然可以在操作系统磁盘缓存缓冲区中看到尚未更改的信息。因此,即使一个进程忙于修改数据库,其他进程也可以继续读取自己的原始数据库内容副本。

3.7. Flushing The Rollback Journal File To Mass Storage 将回滚日志文件刷新到大容量存储

下一步是将回滚日志文件的内容刷新到非易失性存储。正如我们稍后将看到的,这是确保数据库能够在意外的断电情况下幸存的关键步骤。这一步也需要很多时间,因为写入非易失性存储器通常是一个缓慢的操作。

这一步通常比简单地将回滚日志刷新到磁盘要复杂得多。在大多数平台上,需要两个单独的刷新(或fsync())操作。第一次刷新会写出基本回滚日志内容。然后修改回滚日志的头,以显示回滚日志中的页数。然后将头刷新到磁盘。本文后面的部分将详细介绍我们为什么要修改头和进行额外的刷新。

3.8. Obtaining An Exclusive Lock 获取排他锁

在对数据库文件本身进行更改之前,我们必须获得对数据库文件的排他锁。获取排他锁实际上是一个两步过程。首先,SQLite获得一个“挂起”锁。然后它将挂起的锁升级为排他锁。

挂起锁允许已持有共享锁的其他进程继续读取数据库文件。但它阻止建立新的共享锁。挂起的锁背后的想法是为了防止大量读导致写饥饿。可能有几十个,甚至数百个其他进程试图读取数据库文件。每个进程在开始读取之前获取一个共享锁,读取它需要的内容,然后释放共享锁。但是,如果有许多不同的进程都是从同一数据库中读取的,则可能会发生这样的情况:新进程总是在前一个进程释放其共享锁之前获取其共享锁。因此,就不会有数据库文件上没有共享锁的那么瞬间,因此,写永远也没有机会获取排他锁。挂起锁被设计为通过允许现有的共享锁继续使用,但阻止创建新的共享锁来防止这种循环。最终,所有共享锁将被清除,挂起锁将能够升级为排他锁。

3.9. Writing Changes To The Database File 向数据库文件写变更

一旦持有排他锁,我们就知道没有其他进程正在从数据库文件中读取数据,并且可以安全地将更改写入数据库文件。通常,这些更改只到达操作系统磁盘缓存,而不会直达大容量存储。

3.10. 0 Flushing Changes To Mass Storage 向大容量存储刷入变更

必须进行另一次刷新,以确保所有数据库更改都写入非易失性存储。这是一个关键的步骤,可以确保数据库在断电后不会损坏。但是,由于写入磁盘或闪存本身的速度较慢,因此此步骤和上面第3.7节中的刷新回滚日志文件一起占用了在sqlite中完成事务提交所需的大部分时间。

3.11. 1 Deleting The Rollback Journal 删除回滚日志

在大容量存储设备上安全地完成数据库更改后,将删除回滚日志文件。这是事务提交的瞬间。如果在此之前发生电源故障或系统崩溃,那么稍后恢复过程会使其看起来好像从未对数据库文件进行过任何更改。如果在删除回滚日志后发生电源故障或系统崩溃,则表现的犹如所有更改都已写入磁盘。因此,根据回滚日志文件是否存在,sqlite提供了对数据库文件不做任何更改或对数据库文件进行完整更改的表现。

删除一个文件实际上不是一个原子操作,但用户进程的角度来看的是原子的。进程总是能够询问操作系统“这个文件存在吗?”,这个进程会得到一个是或否的答案。在事务提交期间发生电源故障后,SQLite将询问操作系统是否存在回滚日志文件。如果答案是“是”,那么事务将不完整并回滚。如果答案是“否”,则表示事务已提交。

事务的存在取决于回滚日志文件是否存在,从用户空间进程的角度来看,删除文件似乎是一个原子操作。因此,事务似乎是原子操作。

在许多系统中,删除文件的操作是昂贵的。作为优化,可以将sqlite配置为将日志文件的长度截断为零字节,或者用零覆盖日志文件头。在这两种情况下,生成的日志文件都不再能够回滚,因此事务仍然提交。从用户进程的角度来看,将文件截断为零长度(如删除文件)被认为是一个原子操作。用零覆盖日志的头不是原子的,但是如果头的任何部分格式不正确,日志将不会回滚。因此,可以说提交事务是在回滚日志头被充分更改以使其无效时发生的。通常,这种场景只要头的第一个字节归零就算发送。

3.12. 2 Releasing The Lock 释放锁

提交事务过程的最后一步是释放排他锁,以便其他进程可以再次开始访问数据库文件。

在右边的图表中,我们展示了当释放锁时,在用户空间中保存的信息被清除。对于旧版本的SQLite来说,这一点曾经是对的。但是,sqlite的最新版本将用户空间信息保存在内存中,以防在下一个事务开始时再次需要它。重新使用已经在本地内存中的信息要比从操作系统磁盘缓存中传输信息或再次从磁盘驱动器中读取信息便宜。在重用用户空间中的信息之前,我们必须首先重新获取共享锁,然后我们必须检查以确保在我们不持有锁的情况下没有其他进程修改数据库文件。数据库的第一页中有一个计数器,每次修改数据库文件时该计数器都递增。我们可以通过检查那个计数器来确定是否有另一个进程修改了数据库。如果修改了数据库,则必须清除用户空间缓存并重新读取。但通常情况下,没有进行任何更改,用户空间缓存可以重用,从而显著提升性能。

4. Rollback 回滚

原子提交应该是瞬间发生的。但是上面描述的处理过程显然需要一定的时间。假设通过上述提交事务操作时切断了计算机的电源。为了保持变更是即时完成的假象,我们必须“回滚”任何部分更改,并将数据库恢复到事务开始之前的状态。

4.1. When Something Goes Wrong... 当出问题时

假设在3.10章节期间发生断电,而数据库更改正在写入磁盘。恢复电源后,情况可能类似于右图所示。我们试图更改数据库文件中的三页,但只成功地写入了一页。另一页是部分写入,第三页根本没有写入。

恢复电源后,磁盘上的回滚日志完整无缺。这是一个关键点。在3.7章节中执行刷新操作的原因是,在对数据库文件本身进行任何更改之前,确保所有回滚日志都安全地保存在非易失性存储上。

4.2. Hot Rollback Journals 热回滚日志

任意的SQLite进程开始尝试访问数据库文件时,都会获取一个上面3.2章节描述的共享锁。但是随后注意到存在一个回滚日志文件。然后SQLite查看该回滚日志是否是热日志。热日志是一种回滚日志,需要回放日志,来恢复数据库到一个正常的状态。只有在一个更早的进程在提交事务过程中崩溃或断电的情况下,才会存在热日志。

如果以下条件都成立,则回滚日志是热日志:

  • 回滚日志存在。
  • 回滚日志不是空文件。
  • 在主数据库文件上没有保留锁(reserved lock)。
  • 回滚日志的头格式良好,特别是没有被清零。
  • 回滚日志不包含主日志文件的名称(查看下面章节5.5)或者包含主日志的名称,且主日志文件存在。

热日志的出现表明上一个进程正在视图提交事务,但是在完成提交事务之前,因此某些原因被终止了。热日志意味着数据库文件处于一个不一致的状态,被使用前需要被修复(通过回滚)

4.3. Obtaining An Exclusive Lock On The Database 在数据库持有一个排他锁

处理热日志的第一步就是持有数据库文件的排他锁。这样就避免了2个或多个进程同时回滚同一个的热日志。

4.4. Rolling Back Incomplete Changes 回滚未完成的变更

一旦进程持有了排他锁,它就可以向数据库文件进行写操作。然后它从回滚日志中读取页的原始内容,并将内容写回到数据库文件中内容原来的位置上。回想一下,回滚日志头记录了终止事务开始前的数据库文件大小。SQLite使用这个信息来截断数据库文件到原来的大小,如果不完整的事务导致数据库增长。在此步骤之后,数据库的大小和内容应该和被终止的事务开始前是一样的。

4.5. Deleting The Hot Journal 删除热日志

热日志中所有的信息被回放到数据库文件中后(并且刷新到磁盘,以避免我们遭遇另外一个电源故障),可以删除热日志。

章节3.11,因为删除文件是昂贵的,作为系统优化日志文件可能被截断成长度为0或它的头可能被0覆写。不管哪种,在这个步骤之后日志文件不在是热的

4.6. Continue As If The Uncompleted Writes Had Never Happened 犹如未完成的写重来没有发生过一样继续

恢复的最后一步是将排他锁恢复为共享锁。一旦这步发生,数据库回到了犹如终止的事务从来没有启动时的状态。由于所有的这些恢复活动都是完全自动、透明的发生,使用SQLite的程序看来,好像终止的事务从来没有开始过。

5. Multi-file Commit 多文件提交

SQLite允许一个数据库链接通过使用ATTACH DATABASE命令同时和2个或多个数据库文件通信。当多个数据库文件在一个事务中被修改时,所有文件被原子地更新。换句话说,要么所有数据库文件被更新,要么没有一个被更新。在跨越多个数据库文件中实现一个原子提交比在单独文件中实现要更加复杂。本节描述了SQLite如何完成这个工作的。

5.1. Separate Rollback Journals For Each Database 每个数据库单独的回滚日志

当一个事务涉及到多个数据库文件时,每个数据库有自己的回滚日志,每个数据库分别被锁定。右图展示了3个数据库文件被一个事务修改的场景。这一步的情况和步骤3.6的单文件事务场景相似。每个数据库持有一个保留锁。每个数据把被修改页的原始内容写入到各自的回滚日志中,但是日志的内容还没有被刷新到磁盘。数据库文件自身还没有任何变更,尽管可能在用户内存中保留了一些变更。

为了简洁起见,本节的图将从以前的图中简化。蓝色仍然代表原始内容,粉色仍然代表新内容。但是回滚日志和数据库文件中的各个页没有被展示,同时我们没有区分操作系统缓存信息和磁盘信息。所有的这些因素仍然适用于多文件提交场景。他们在图中占用了很多空间,没有增加任何新的信息,所以在这里他们被省略了。

5.2. The Master Journal File 主日志文件

The next step in a multi-file commit is the creation of a "master journal" file. The name of the master journal file is the same name as the original database filename (the database that was opened using the sqlite3_open() interface, not one of the ATTACHed auxiliary databases) with the text "-mjHHHHHHHH" appended where HHHHHHHH is a random 32-bit hexadecimal number. The random HHHHHHHH suffix changes for every new master journal.

(Nota bene: The formula for computing the master journal filename given in the previous paragraph corresponds to the implementation as of SQLite version 3.5.0. But this formula is not part of the SQLite specification and is subject to change in future releases.)

Unlike the rollback journals, the master journal does not contain any original database page content. Instead, the master journal contains the full pathnames for rollback journals for every database that is participating in the transaction.

After the master journal is constructed, its content is flushed to disk before any further actions are taken. On Unix, the directory that contains the master journal is also synced in order to make sure the master journal file will appear in the directory following a power failure.

The purpose of the master journal is to ensure that multi-file transactions are atomic across a power-loss. But if the database files have other settings that compromise integrity across a power-loss event (such as PRAGMA synchronous=OFF or PRAGMA journal_mode=MEMORY) then the creation of the master journal is omitted, as an optimization.

5.3. Updating Rollback Journal Headers

The next step is to record the full pathname of the master journal file in the header of every rollback journal. Space to hold the master journal filename was reserved at the beginning of each rollback journal as the rollback journals were created.

The content of each rollback journal is flushed to disk both before and after the master journal filename is written into the rollback journal header. It is important to do both of these flushes. Fortunately, the second flush is usually inexpensive since typically only a single page of the journal file (the first page) has changed.

This step is analogous to step 3.7 in the single-file commit scenario described above.

5.4. Updating The Database Files

Once all rollback journal files have been flushed to disk, it is safe to begin updating database files. We have to obtain an exclusive lock on all database files before writing the changes. After all the changes are written, it is important to flush the changes to disk so that they will be preserved in the event of a power failure or operating system crash.

This step corresponds to steps 3.83.9, and 3.10 in the single-file commit scenario described previously.

5.5. Delete The Master Journal File

The next step is to delete the master journal file. This is the point where the multi-file transaction commits. This step corresponds to step 3.11 in the single-file commit scenario where the rollback journal is deleted.

If a power failure or operating system crash occurs at this point, the transaction will not rollback when the system reboots even though there are rollback journals present. The difference is the master journal pathname in the header of the rollback journal. Upon restart, SQLite only considers a journal to be hot and will only playback the journal if there is no master journal filename in the header (which is the case for a single-file commit) or if the master journal file still exists on disk.

5.6. Clean Up The Rollback Journals

The final step in a multi-file commit is to delete the individual rollback journals and drop the exclusive locks on the database files so that other processes can see the changes. This corresponds to step 3.12 in the single-file commit sequence.

The transaction has already committed at this point so timing is not critical in the deletion of the rollback journals. The current implementation deletes a single rollback journal then unlocks the corresponding database file before proceeding to the next rollback journal. But in the future we might change this so that all rollback journals are deleted before any database files are unlocked. As long as the rollback journal is deleted before its corresponding database file is unlocked it does not matter in what order the rollback journals are deleted or the database files are unlocked.

6. Additional Details Of The Commit Process

Section 3.0 above provides an overview of how atomic commit works in SQLite. But it glosses over a number of important details. The following subsections will attempt to fill in the gaps.

6.1. Always Journal Complete Sectors

When the original content of a database page is written into the rollback journal (as shown in section 3.5), SQLite always writes a complete sector of data, even if the page size of the database is smaller than the sector size. Historically, the sector size in SQLite has been hard coded to 512 bytes and since the minimum page size is also 512 bytes, this has never been an issue. But beginning with SQLite version 3.3.14, it is possible for SQLite to use mass storage devices with a sector size larger than 512 bytes. So, beginning with version 3.3.14, whenever any page within a sector is written into the journal file, all pages in that same sector are stored with it.

It is important to store all pages of a sector in the rollback journal in order to prevent database corruption following a power loss while writing the sector. Suppose that pages 1, 2, 3, and 4 are all stored in sector 1 and that page 2 is modified. In order to write the changes to page 2, the underlying hardware must also rewrite the content of pages 1, 3, and 4 since the hardware must write the complete sector. If this write operation is interrupted by a power outage, one or more of the pages 1, 3, or 4 might be left with incorrect data. Hence, to avoid lasting corruption to the database, the original content of all of those pages must be contained in the rollback journal.

6.2. Dealing With Garbage Written Into Journal Files

When data is appended to the end of the rollback journal, SQLite normally makes the pessimistic assumption that the file is first extended with invalid "garbage" data and that afterwards the correct data replaces the garbage. In other words, SQLite assumes that the file size is increased first and then afterwards the content is written into the file. If a power failure occurs after the file size has been increased but before the file content has been written, the rollback journal can be left containing garbage data. If after power is restored, another SQLite process sees the rollback journal containing the garbage data and tries to roll it back into the original database file, it might copy some of the garbage into the database file and thus corrupt the database file.

SQLite uses two defenses against this problem. In the first place, SQLite records the number of pages in the rollback journal in the header of the rollback journal. This number is initially zero. So during an attempt to rollback an incomplete (and possibly corrupt) rollback journal, the process doing the rollback will see that the journal contains zero pages and will thus make no changes to the database. Prior to a commit, the rollback journal is flushed to disk to ensure that all content has been synced to disk and there is no "garbage" left in the file, and only then is the page count in the header changed from zero to true number of pages in the rollback journal. The rollback journal header is always kept in a separate sector from any page data so that it can be overwritten and flushed without risking damage to a data page if a power outage occurs. Notice that the rollback journal is flushed to disk twice: once to write the page content and a second time to write the page count in the header.

The previous paragraph describes what happens when the synchronous pragma setting is "full".

PRAGMA synchronous=FULL;

The default synchronous setting is full so the above is what usually happens. However, if the synchronous setting is lowered to "normal", SQLite only flushes the rollback journal once, after the page count has been written. This carries a risk of corruption because it might happen that the modified (non-zero) page count reaches the disk surface before all of the data does. The data will have been written first, but SQLite assumes that the underlying filesystem can reorder write requests and that the page count can be burned into oxide first even though its write request occurred last. So as a second line of defense, SQLite also uses a 32-bit checksum on every page of data in the rollback journal. This checksum is evaluated for each page during rollback while rolling back a journal as described in section 4.4. If an incorrect checksum is seen, the rollback is abandoned. Note that the checksum does not guarantee that the page data is correct since there is a small but finite probability that the checksum might be right even if the data is corrupt. But the checksum does at least make such an error unlikely.

Note that the checksums in the rollback journal are not necessary if the synchronous setting is FULL. We only depend on the checksums when synchronous is lowered to NORMAL. Nevertheless, the checksums never hurt and so they are included in the rollback journal regardless of the synchronous setting.

6.3. Cache Spill Prior To Commit

The commit process shown in section 3.0 assumes that all database changes fit in memory until it is time to commit. This is the common case. But sometimes a larger change will overflow the user-space cache prior to transaction commit. In those cases, the cache must spill to the database before the transaction is complete.

At the beginning of a cache spill, the status of the database connection is as shown in step 3.6. Original page content has been saved in the rollback journal and modifications of the pages exist in user memory. To spill the cache, SQLite executes steps 3.7 through 3.9. In other words, the rollback journal is flushed to disk, an exclusive lock is acquired, and changes are written into the database. But the remaining steps are deferred until the transaction really commits. A new journal header is appended to the end of the rollback journal (in its own sector) and the exclusive database lock is retained, but otherwise processing returns to step 3.6. When the transaction commits, or if another cache spill occurs, steps 3.7 and 3.9 are repeated. (Step 3.8 is omitted on second and subsequent passes since an exclusive database lock is already held due to the first pass.)

A cache spill causes the lock on the database file to escalate from reserved to exclusive. This reduces concurrency. A cache spill also causes extra disk flush or fsync operations to occur and these operations are slow, hence a cache spill can seriously reduce performance. For these reasons a cache spill is avoided whenever possible.

7. Optimizations

Profiling indicates that for most systems and in most circumstances SQLite spends most of its time doing disk I/O. It follows then that anything we can do to reduce the amount of disk I/O will likely have a large positive impact on the performance of SQLite. This section describes some of the techniques used by SQLite to try to reduce the amount of disk I/O to a minimum while still preserving atomic commit.

7.1. Cache Retained Between Transactions

Step 3.12 of the commit process shows that once the shared lock has been released, all user-space cache images of database content must be discarded. This is done because without a shared lock, other processes are free to modify the database file content and so any user-space image of that content might become obsolete. Consequently, each new transaction would begin by rereading data which had previously been read. This is not as bad as it sounds at first since the data being read is still likely in the operating systems file cache. So the "read" is really just a copy of data from kernel space into user space. But even so, it still takes time.

Beginning with SQLite version 3.3.14 a mechanism has been added to try to reduce the needless rereading of data. In newer versions of SQLite, the data in the user-space pager cache is retained when the lock on the database file is released. Later, after the shared lock is acquired at the beginning of the next transaction, SQLite checks to see if any other process has modified the database file. If the database has been changed in any way since the lock was last released, the user-space cache is erased at that point. But commonly the database file is unchanged and the user-space cache can be retained, and some unnecessary read operations can be avoided.

In order to determine whether or not the database file has changed, SQLite uses a counter in the database header (in bytes 24 through 27) which is incremented during every change operation. SQLite saves a copy of this counter prior to releasing its database lock. Then after acquiring the next database lock it compares the saved counter value against the current counter value and erases the cache if the values are different, or reuses the cache if they are the same.

7.2. Exclusive Access Mode

SQLite version 3.3.14 adds the concept of "Exclusive Access Mode". In exclusive access mode, SQLite retains the exclusive database lock at the conclusion of each transaction. This prevents other processes from accessing the database, but in many deployments only a single process is using a database so this is not a serious problem. The advantage of exclusive access mode is that disk I/O can be reduced in three ways:

  1. It is not necessary to increment the change counter in the database header for transactions after the first transaction. This will often save a write of page one to both the rollback journal and the main database file.

  2. No other processes can change the database so there is never a need to check the change counter and clear the user-space cache at the beginning of a transaction.

  3. Each transaction can be committed by overwriting the rollback journal header with zeros rather than deleting the journal file. This avoids having to modify the directory entry for the journal file and it avoids having to deallocate disk sectors associated with the journal. Furthermore, the next transaction will overwrite existing journal file content rather than append new content and on most systems overwriting is much faster than appending.

The third optimization, zeroing the journal file header rather than deleting the rollback journal file, does not depend on holding an exclusive lock at all times. This optimization can be set independently of exclusive lock mode using the journal_mode pragma as described in section 7.6 below.

7.3. Do Not Journal Freelist Pages

When information is deleted from an SQLite database, the pages used to hold the deleted information are added to a "freelist". Subsequent inserts will draw pages off of this freelist rather than expanding the database file.

Some freelist pages contain critical data; specifically the locations of other freelist pages. But most freelist pages contain nothing useful. These latter freelist pages are called "leaf" pages. We are free to modify the content of a leaf freelist page in the database without changing the meaning of the database in any way.

Because the content of leaf freelist pages is unimportant, SQLite avoids storing leaf freelist page content in the rollback journal in step 3.5 of the commit process. If a leaf freelist page is changed and that change does not get rolled back during a transaction recovery, the database is not harmed by the omission. Similarly, the content of a new freelist page is never written back into the database at step 3.9 nor read from the database at step 3.3. These optimizations can greatly reduce the amount of I/O that occurs when making changes to a database file that contains free space.

7.4. Single Page Updates And Atomic Sector Writes

Beginning in SQLite version 3.5.0, the new Virtual File System (VFS) interface contains a method named xDeviceCharacteristics which reports on special properties that the underlying mass storage device might have. Among the special properties that xDeviceCharacteristics might report is the ability of to do an atomic sector write.

Recall that by default SQLite assumes that sector writes are linear but not atomic. A linear write starts at one end of the sector and changes information byte by byte until it gets to the other end of the sector. If a power loss occurs in the middle of a linear write then part of the sector might be modified while the other end is unchanged. In an atomic sector write, either the entire sector is overwritten or else nothing in the sector is changed.

We believe that most modern disk drives implement atomic sector writes. When power is lost, the drive uses energy stored in capacitors and/or the angular momentum of the disk platter to provide power to complete any operation in progress. Nevertheless, there are so many layers in between the write system call and the on-board disk drive electronics that we take the safe approach in both Unix and w32 VFS implementations and assume that sector writes are not atomic. On the other hand, device manufacturers with more control over their filesystems might want to consider enabling the atomic write property of xDeviceCharacteristics if their hardware really does do atomic writes.

When sector writes are atomic and the page size of a database is the same as a sector size, and when there is a database change that only touches a single database page, then SQLite skips the whole journaling and syncing process and simply writes the modified page directly into the database file. The change counter in the first page of the database file is modified separately since no harm is done if power is lost before the change counter can be updated.

7.5. Filesystems With Safe Append Semantics

Another optimization introduced in SQLite version 3.5.0 makes use of "safe append" behavior of the underlying disk. Recall that SQLite assumes that when data is appended to a file (specifically to the rollback journal) that the size of the file is increased first and that the content is written second. So if power is lost after the file size is increased but before the content is written, the file is left containing invalid "garbage" data. The xDeviceCharacteristics method of the VFS might, however, indicate that the filesystem implements "safe append" semantics. This means that the content is written before the file size is increased so that it is impossible for garbage to be introduced into the rollback journal by a power loss or system crash.

When safe append semantics are indicated for a filesystem, SQLite always stores the special value of -1 for the page count in the header of the rollback journal. The -1 page count value tells any process attempting to rollback the journal that the number of pages in the journal should be computed from the journal size. This -1 value is never changed. So that when a commit occurs, we save a single flush operation and a sector write of the first page of the journal file. Furthermore, when a cache spill occurs we no longer need to append a new journal header to the end of the journal; we can simply continue appending new pages to the end of the existing journal.

7.6. Persistent Rollback Journals

Deleting a file is an expensive operation on many systems. So as an optimization, SQLite can be configured to avoid the delete operation of section 3.11. Instead of deleting the journal file in order to commit a transaction, the file is either truncated to zero bytes in length or its header is overwritten with zeros. Truncating the file to zero length saves having to make modifications to the directory containing the file since the file is not removed from the directory. Overwriting the header has the additional savings of not having to update the length of the file (in the "inode" on many systems) and not having to deal with newly freed disk sectors. Furthermore, at the next transaction the journal will be created by overwriting existing content rather than appending new content onto the end of a file, and overwriting is often much faster than appending.

SQLite can be configured to commit transactions by overwriting the journal header with zeros instead of deleting the journal file by setting the "PERSIST" journaling mode using the journal_mode PRAGMA. For example:

PRAGMA journal_mode=PERSIST;

The use of persistent journal mode provides a noticeable performance improvement on many systems. Of course, the drawback is that the journal files remain on the disk, using disk space and cluttering directories, long after the transaction commits. The only safe way to delete a persistent journal file is to commit a transaction with journaling mode set to DELETE:

PRAGMA journal_mode=DELETE; BEGIN EXCLUSIVE; COMMIT;

Beware of deleting persistent journal files by any other means since the journal file might be hot, in which case deleting it will corrupt the corresponding database file.

Beginning in SQLite version 3.6.4 (2008-10-15), the TRUNCATE journal mode is also supported:

PRAGMA journal_mode=TRUNCATE;

In truncate journal mode, the transaction is committed by truncating the journal file to zero length rather than deleting the journal file (as in DELETE mode) or by zeroing the header (as in PERSIST mode). TRUNCATE mode shares the advantage of PERSIST mode that the directory that contains the journal file and database does not need to be updated. Hence truncating a file is often faster than deleting it. TRUNCATE has the additional advantage that it is not followed by a system call (ex: fsync()) to synchronize the change to disk. It might be safer if it did. But on many modern filesystems, a truncate is an atomic and synchronous operation and so we think that TRUNCATE will usually be safe in the face of power failures. If you are uncertain about whether or not TRUNCATE will be synchronous and atomic on your filesystem and it is important to you that your database survive a power loss or operating system crash that occurs during the truncation operation, then you might consider using a different journaling mode.

On embedded systems with synchronous filesystems, TRUNCATE results in slower behavior than PERSIST. The commit operation is the same speed. But subsequent transactions are slower following a TRUNCATE because it is faster to overwrite existing content than to append to the end of a file. New journal file entries will always be appended following a TRUNCATE but will usually overwrite with PERSIST.

8. Testing Atomic Commit Behavior

The developers of SQLite are confident that it is robust in the face of power failures and system crashes because the automatic test procedures do extensive checks on the ability of SQLite to recover from simulated power loss. We call these the "crash tests".

Crash tests in SQLite use a modified VFS that can simulate the kinds of filesystem damage that occur during a power loss or operating system crash. The crash-test VFS can simulate incomplete sector writes, pages filled with garbage data because a write has not completed, and out of order writes, all occurring at varying points during a test scenario. Crash tests execute transactions over and over, varying the time at which a simulated power loss occurs and the properties of the damage inflicted. Each test then reopens the database after the simulated crash and verifies that the transaction either occurred completely or not at all and that the database is in a completely consistent state.

The crash tests in SQLite have discovered a number of very subtle bugs (now fixed) in the recovery mechanism. Some of these bugs were very obscure and unlikely to have been found using only code inspection and analysis techniques. From this experience, the developers of SQLite feel confident that any other database system that does not use a similar crash test system likely contains undetected bugs that will lead to database corruption following a system crash or power failure.

9. Things That Can Go Wrong

The atomic commit mechanism in SQLite has proven to be robust, but it can be circumvented by a sufficiently creative adversary or a sufficiently broken operating system implementation. This section describes a few of the ways in which an SQLite database might be corrupted by a power failure or system crash. (See also: How To Corrupt Your Database Files.)

9.1. Broken Locking Implementations

SQLite uses filesystem locks to make sure that only one process and database connection is trying to modify the database at a time. The filesystem locking mechanism is implemented in the VFS layer and is different for every operating system. SQLite depends on this implementation being correct. If something goes wrong and two or more processes are able to write the same database file at the same time, severe damage can result.

We have received reports of implementations of both Windows network filesystems and NFS in which locking was subtly broken. We can not verify these reports, but as locking is difficult to get right on a network filesystem we have no reason to doubt them. You are advised to avoid using SQLite on a network filesystem in the first place, since performance will be slow. But if you must use a network filesystem to store SQLite database files, consider using a secondary locking mechanism to prevent simultaneous writes to the same database even if the native filesystem locking mechanism malfunctions.

The versions of SQLite that come preinstalled on Apple Mac OS X computers contain a version of SQLite that has been extended to use alternative locking strategies that work on all network filesystems that Apple supports. These extensions used by Apple work great as long as all processes are accessing the database file in the same way. Unfortunately, the locking mechanisms do not exclude one another, so if one process is accessing a file using (for example) AFP locking and another process (perhaps on a different machine) is using dot-file locks, the two processes might collide because AFP locks do not exclude dot-file locks or vice versa.

9.2. Incomplete Disk Flushes

SQLite uses the fsync() system call on Unix and the FlushFileBuffers() system call on w32 in order to sync the file system buffers onto disk oxide as shown in step 3.7 and step 3.10. Unfortunately, we have received reports that neither of these interfaces works as advertised on many systems. We hear that FlushFileBuffers() can be completely disabled using registry settings on some Windows versions. Some historical versions of Linux contain versions of fsync() which are no-ops on some filesystems, we are told. Even on systems where FlushFileBuffers() and fsync() are said to be working, often the IDE disk control lies and says that data has reached oxide while it is still held only in the volatile control cache.

On the Mac, you can set this pragma:

PRAGMA fullfsync=ON;

Setting fullfsync on a Mac will guarantee that data really does get pushed out to the disk platter on a flush. But the implementation of fullfsync involves resetting the disk controller. And so not only is it profoundly slow, it also slows down other unrelated disk I/O. So its use is not recommended.

9.3. Partial File Deletions

SQLite assumes that file deletion is an atomic operation from the point of view of a user process. If power fails in the middle of a file deletion, then after power is restored SQLite expects to see either the entire file with all of its original data intact, or it expects not to find the file at all. Transactions may not be atomic on systems that do not work this way.

9.4. Garbage Written Into Files

SQLite database files are ordinary disk files that can be opened and written by ordinary user processes. A rogue process can open an SQLite database and fill it with corrupt data. Corrupt data might also be introduced into an SQLite database by bugs in the operating system or disk controller; especially bugs triggered by a power failure. There is nothing SQLite can do to defend against these kinds of problems.

9.5. Deleting Or Renaming A Hot Journal

If a crash or power loss does occur and a hot journal is left on the disk, it is essential that the original database file and the hot journal remain on disk with their original names until the database file is opened by another SQLite process and rolled back. During recovery at step 4.2 SQLite locates the hot journal by looking for a file in the same directory as the database being opened and whose name is derived from the name of the file being opened. If either the original database file or the hot journal have been moved or renamed, then the hot journal will not be seen and the database will not be rolled back.

We suspect that a common failure mode for SQLite recovery happens like this: A power failure occurs. After power is restored, a well-meaning user or system administrator begins looking around on the disk for damage. They see their database file named "important.data". This file is perhaps familiar to them. But after the crash, there is also a hot journal named "important.data-journal". The user then deletes the hot journal, thinking that they are helping to cleanup the system. We know of no way to prevent this other than user education.

If there are multiple (hard or symbolic) links to a database file, the journal will be created using the name of the link through which the file was opened. If a crash occurs and the database is opened again using a different link, the hot journal will not be located and no rollback will occur.

Sometimes a power failure will cause a filesystem to be corrupted such that recently changed filenames are forgotten and the file is moved into a "/lost+found" directory. When that happens, the hot journal will not be found and recovery will not occur. SQLite tries to prevent this by opening and syncing the directory containing the rollback journal at the same time it syncs the journal file itself. However, the movement of files into /lost+found can be caused by unrelated processes creating unrelated files in the same directory as the main database file. And since this is out from under the control of SQLite, there is nothing that SQLite can do to prevent it. If you are running on a system that is vulnerable to this kind of filesystem namespace corruption (most modern journalling filesystems are immune, we believe) then you might want to consider putting each SQLite database file in its own private subdirectory.

10. 10.0 Future Directions And Conclusion

Every now and then someone discovers a new failure mode for the atomic commit mechanism in SQLite and the developers have to put in a patch. This is happening less and less and the failure modes are becoming more and more obscure. But it would still be foolish to suppose that the atomic commit logic of SQLite is entirely bug-free. The developers are committed to fixing these bugs as quickly as they might be found.

The developers are also on the lookout for new ways to optimize the commit mechanism. The current VFS implementations for Unix (Linux and Mac OS X) and Windows make pessimistic assumptions about the behavior of those systems. After consultation with experts on how these systems work, we might be able to relax some of the assumptions on these systems and allow them to run faster. In particular, we suspect that most modern filesystems exhibit the safe append property and that many of them might support atomic sector writes. But until this is known for certain, SQLite will take the conservative approach and assume the worst.

转载于:https://my.oschina.net/purely/blog/3007078

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值