Table of Contents
12. Timestamp Ordering Concurrency Control [optimistic]
13.Multi-Version Concurrency Control
10.Concurrency Control Theory
- Guarantees of transaction correctness: ACID
- Atomicity: All actions in the txn happen, or none happen.
- Consistency: If each txn is consistent and the DB starts consistent, then it ends up consistent.
- Isolation: Execution of one txn is isolated from that of other txns.
- Durability: If a txn commits, its effects persist.
- Mechanisms for ensuring atomicity:
- Logging (undo log) [the log serves to: 1. roll back txns 2. support auditing 3. improve efficiency (record changes to the log first, then apply them to disk; sequential log writes are fast)]
- Shadow Paging: the DBMS first copies the pages a txn will modify; the txn modifies the copies, and only when it commits do those pages become visible to others (originally from System R; used by only a few databases: CouchDB, LMDB (OpenLDAP))
- Mechanisms for ensuring isolation: concurrency control, either pessimistic or optimistic
- Serializable Schedule
- "Conflicting" operations: two operations conflict if:
- → They are by different transactions,
- → They are on the same object and at least one of them is a write.
- Read-Write Conflicts (R-W): unrepeatable reads
- Write-Read Conflicts (W-R): reading uncommitted data ("dirty reads")
- Write-Write Conflicts (W-W)
- Schedule S is conflict serializable if you can transform S into a serial schedule by swapping consecutive non-conflicting operations of different transactions.
- Dependency Graphs
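The dependency-graph test can be sketched in a few lines: build an edge Ti → Tj for every pair of conflicting operations where Ti's operation comes first, then check for a cycle. A minimal sketch (the tuple layout `(txn, op, obj)` is an assumption for illustration):

```python
from collections import defaultdict

def precedence_graph(schedule):
    """Build edges Ti -> Tj for each pair of conflicting operations
    (different txns, same object, at least one write) where Ti's
    operation precedes Tj's. schedule: list of (txn, op, obj)."""
    edges = defaultdict(set)
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            if ti != tj and x == y and 'W' in (op_i, op_j):
                edges[ti].add(tj)
    return edges

def has_cycle(edges):
    """DFS cycle detection; a schedule is conflict serializable
    iff its precedence graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def visit(u):
        color[u] = GRAY
        for v in edges[u]:
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and visit(u) for u in list(edges))

# T1 writes A before T2 reads it, but T2 writes B before T1 reads it:
# edges T1->T2 and T2->T1 form a cycle, so not conflict serializable.
s = [('T1', 'W', 'A'), ('T2', 'R', 'A'), ('T2', 'W', 'B'), ('T1', 'R', 'B')]
```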
- Mechanisms for ensuring durability: all committed transactions must be durable:
- Logging
- Shadow Paging
11. Two-Phase Locking [pessimistic]
- Basic Lock Types
- S-LOCK: shared lock
- X-LOCK: exclusive lock
- Two-Phase Locking 2PL:
- Phase #1: Growing
- Each txn requests the locks that it needs from the DBMS’s lock manager.
- The lock manager grants/denies lock requests.
- Phase #2: Shrinking
- The txn is allowed to only release locks that it previously acquired. It cannot acquire new locks.
- 2PL on its own is sufficient to guarantee conflict serializability: it generates schedules whose precedence graph is acyclic. But it is subject to cascading aborts: if T2 reads T1's uncommitted writes (a dirty read) and T1 then rolls back, T2 must roll back as well [cascading abort].
- Problems caused by 2PL:
- Dirty reads (cascading aborts). Solution: strong strict 2PL: a txn releases its locks only after it commits.
- Deadlocks. Solutions:
- Approach #1: Deadlock Detection: when the DBMS detects a deadlock, it selects a "victim" txn to roll back to break the cycle. The victim txn either restarts or aborts (the more common case), depending on how it was invoked. There is a trade-off between how often the system checks for deadlocks and how long txns wait before a deadlock is broken.
- Approach #2: Deadlock Prevention
- Selecting the proper victim depends on many different variables. Reasonable criteria for choosing which txn to roll back:
- By age (lowest timestamp): prefer a txn that started recently; e.g., a large txn that has already run for a long time is a poor choice of victim
- By progress (least/most queries executed)
- By the # of items already locked
- By the # of txns that we have to rollback with it
- After selecting a victim txn to abort, the DBMS can also decide how far to roll back the txn's changes:
- Approach #1: Completely
- Approach #2: Minimally
- Deadlock Prevention:
- Assign priorities based on timestamps:→ Older Timestamp = Higher Priority (e.g., T1 > T2)
- Wait-Die ("Old Waits for Young")
- → If the requesting txn has higher priority than the holding txn, the requesting txn waits for the holding txn.
- → Otherwise the requesting txn aborts.
- Wound-Wait ("Young Waits for Old")
- → If the requesting txn has higher priority than the holding txn, the holding txn aborts and releases its lock.
- → Otherwise the requesting txn waits.
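The two prevention policies can be captured as pure decision functions. A minimal sketch, assuming a smaller timestamp means an older txn with higher priority:

```python
def wait_die(requester_ts, holder_ts):
    """Wait-Die ('old waits for young'): the requester may wait only
    if it is older (smaller timestamp) than the lock holder."""
    return 'wait' if requester_ts < holder_ts else 'abort-requester'

def wound_wait(requester_ts, holder_ts):
    """Wound-Wait ('young waits for old'): an older requester wounds
    (aborts) the younger holder; a younger requester waits."""
    return 'abort-holder' if requester_ts < holder_ts else 'wait'
```

Both policies always abort the younger txn, so the oldest txn in any conflict cycle is never killed and cannot starve.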
- Lock Granularities:
- Attribute? Tuple? Page? Table?
- Trade-off between parallelism and overhead: → fewer locks with larger granularity vs. more locks with smaller granularity.
- Intention Locks: an intention lock allows a higher-level node to be locked in shared or exclusive mode without examining all of its descendant nodes. If a node is locked in an intention mode, some txn is doing explicit locking at a lower level of the tree.
- Intention-Shared (IS)→ Indicates explicit locking at lower level with shared locks.
- Intention-Exclusive (IX)→ Indicates explicit locking at lower level with exclusive locks.
- Shared+Intention-Exclusive (SIX)→ The subtree rooted at that node is locked explicitly in shared mode, and explicit locking is being done at a lower level with exclusive-mode locks.
- Lock escalation: if a txn acquires too many row locks, the lock manager escalates them into a table lock.
- In practice, locks are usually acquired automatically by the database, e.g., while executing an UPDATE; some SQL dialects also support taking locks manually.
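The five modes above interact through the standard multiple-granularity compatibility matrix; a lock request on a node is granted only if it is compatible with every lock already held there. A minimal sketch:

```python
# Standard compatibility matrix for hierarchical lock modes.
# COMPATIBLE[(requested, held)] is True if they can coexist on a node.
COMPATIBLE = {
    ('IS', 'IS'): True,  ('IS', 'IX'): True,  ('IS', 'S'): True,
    ('IS', 'SIX'): True, ('IS', 'X'): False,
    ('IX', 'IS'): True,  ('IX', 'IX'): True,  ('IX', 'S'): False,
    ('IX', 'SIX'): False, ('IX', 'X'): False,
    ('S', 'IS'): True,   ('S', 'IX'): False,  ('S', 'S'): True,
    ('S', 'SIX'): False, ('S', 'X'): False,
    ('SIX', 'IS'): True, ('SIX', 'IX'): False, ('SIX', 'S'): False,
    ('SIX', 'SIX'): False, ('SIX', 'X'): False,
    ('X', 'IS'): False,  ('X', 'IX'): False,   ('X', 'S'): False,
    ('X', 'SIX'): False, ('X', 'X'): False,
}

def can_grant(requested, held_modes):
    """Grant a request only if the requested mode is compatible with
    every mode already held on that node."""
    return all(COMPATIBLE[(requested, h)] for h in held_modes)
```

For example, two txns can both hold IX on a table (they intend to write different tuples below it), but S on the table blocks IX, since a full-table reader conflicts with any lower-level writer.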
12. Timestamp Ordering Concurrency Control [optimistic]
- Basic Timestamp Ordering (T/O) Protocol
- Every tuple carries two timestamps: (1) W-TS(X), the write timestamp on X; (2) R-TS(X), the read timestamp on X
- Every operation is checked: if a txn tries to access an object "from the future", it aborts and restarts.
- Basic T/O Reads:If TS(Ti) < W-TS(X), Abort Ti and restart it with a new TS. Else:
- → Allow Ti to read X.
- → Update R-TS(X) to max(R-TS(X), TS(Ti))
- → Make a local copy of X to ensure repeatable reads for Ti.
- Basic T/O Writes: If TS(Ti) < R-TS(X) or TS(Ti) < W-TS(X), abort and restart Ti. Else:
- → Allow Ti to write X and update W-TS(X)
- → Also make a local copy of X to ensure repeatable reads
- The write check above can be relaxed with the Thomas Write Rule:
- If TS(Ti) < R-TS(X): Abort and restart Ti.
- If TS(Ti) < W-TS(X): → Thomas Write Rule: ignore the write and let the txn continue executing without aborting (this violates the timestamp order of Ti). Else: → allow Ti to write X and update W-TS(X). Intuition: a future txn has already overwritten the value, so the current txn can skip its write and keep running without aborting; having done so, it must still update its own local snapshot.
- Basic T/O (without the Thomas Write Rule) has the following properties and problems:
- No deadlocks, because txns never wait and there is no locking.
- Long txns can starve: everything they touch may already have been modified by short txns, turning it into data "from the future".
- A schedule is recoverable only if each txn commits after every txn whose changes it read has committed. Otherwise the DBMS cannot guarantee that the data txns read will still be there after crash recovery.
- Performance problems: high overhead from copying data into each txn's workspace and from updating timestamps.
- In Basic T/O, when a txn reads A it validates the access and, if valid, makes a local snapshot. On the txn's next access to A, does it check timestamps again?
It depends on the operation. When the txn touches A again:
- For a read: no timestamp check is needed; just read the local snapshot.
- For a write: check the read timestamp first:
- If A has already been read by a later txn, this txn aborts;
- Otherwise, check the write timestamp:
- If A has already been overwritten by a later txn, then by the Thomas Write Rule just update the local snapshot;
- Otherwise, update the local snapshot and write the result to the database.
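The Basic T/O read and write checks can be sketched directly from the rules above. A minimal sketch without the Thomas Write Rule or local snapshots (the `Abort` exception and class names are illustrative):

```python
class Abort(Exception):
    """Raised when a txn touches an object 'from the future'."""

class BasicTO:
    """Minimal Basic Timestamp Ordering sketch: each object carries
    R-TS and W-TS; violating txns abort (and would restart with a
    new, larger timestamp)."""
    def __init__(self):
        self.r_ts = {}   # object -> largest timestamp that read it
        self.w_ts = {}   # object -> timestamp of its last writer

    def read(self, ts, x):
        # Reject reads of objects written in this txn's future.
        if ts < self.w_ts.get(x, 0):
            raise Abort(f'T{ts}: {x} was written in the future')
        self.r_ts[x] = max(self.r_ts.get(x, 0), ts)

    def write(self, ts, x):
        # Reject writes that a future txn has already read or written.
        if ts < self.r_ts.get(x, 0) or ts < self.w_ts.get(x, 0):
            raise Abort(f'T{ts}: write to {x} conflicts with the future')
        self.w_ts[x] = ts
```

Adding the Thomas Write Rule would change only the second check in `write`: if `ts < self.w_ts[x]` but not `ts < self.r_ts[x]`, the write is silently skipped instead of aborting.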
- [OCC] Optimistic Concurrency Control
- If you assume conflicts between txns are rare and most txns are short-lived, then forcing txns to wait to acquire locks adds a lot of overhead. A better approach is to optimize for the no-conflict case, i.e., use OCC.
- OCC: The DBMS creates a private workspace for each txn.
- Any object read is copied into the workspace.
- Modifications are applied to the workspace. When a txn commits, the DBMS compares the workspace write set to see whether it conflicts with other txns [this is where OCC differs from Basic T/O].
- [OCC] Three phases:
- #1 – Read Phase: → Track the read/write sets of txns and store their writes in a private workspace.
- #2 – Validation Phase: → When a txn commits, check whether it conflicts with other txns. [The txn receives its timestamp in this phase.]
- #3 – Write Phase: → If validation succeeds, apply the private changes to the database. Otherwise abort and restart the txn.
- [OCC] Read Phase: track the read/write sets of txns and store their writes in a private workspace. The DBMS copies every tuple the txn accesses from the shared database into its workspace to ensure repeatable reads.
- [OCC] Validation Phase: when txn Ti invokes COMMIT, the DBMS checks whether it conflicts with other txns. The DBMS must allow only serializable schedules [i.e., smaller timestamps happen first]: it checks other txns for RW and WW conflicts and makes sure all conflicts go in one direction (e.g., older → younger) [i.e., the dependency graph must be acyclic]. Two approaches:
- → Backward Validation: validate against older txns [smaller timestamps], i.e., txns that have already committed; if validation fails, abort
- → Forward Validation: validate against younger txns [larger timestamps], i.e., the not-yet-committed txns that overlap with this one; validation passes if one of the following holds:
- Ti completes all three phases before Tj begins (no overlap; purely serial).
- Ti completes before Tj starts its Write phase, and Ti does not write to any object read by Tj: → WriteSet(Ti) ∩ ReadSet(Tj) = Ø
- Ti completes its Read phase before Tj completes its Read phase, and Ti does not write to any object that is either read or written by Tj: → WriteSet(Ti) ∩ ReadSet(Tj) = Ø and WriteSet(Ti) ∩ WriteSet(Tj) = Ø
- [OCC] Write Phase: the DBMS propagates the changes in the txn's write set to the database and makes them visible to other txns. Assume only one txn can be in the Write phase at a time. → Use write latches to support parallel validation/writes.
- [OCC] Drawbacks:
- High overhead of copying data locally.
- The validation/write phase is a bottleneck.
- Aborts are more wasteful than in 2PL, because they happen only after the txn has already executed.
- Neither 2PL nor OCC prevents phantoms, because T1 locks only the records that already exist, not records being inserted! Conflict serializability on reads and writes of individual items guarantees serializability only if the set of objects is fixed. Solutions:
- Approach #1: Re-Execute Scans: at commit, re-run the WHERE clause of every range query and verify the result is unchanged
- Approach #2: Predicate Locking: a locking scheme proposed by System R. → Shared lock on the predicate in the WHERE clause of a SELECT query. → Exclusive lock on the predicate in the WHERE clause of any UPDATE, INSERT, or DELETE query. Never implemented in any system except HyPer (precision locking).
- Approach #3: Index Locking: if the predicate involves an indexed column, lock the index pages; without an index, lock the table
- MySQL's implementation: gap locks
- Isolation Levels: anomalies under concurrent transactions:
- → Dirty Reads (reading uncommitted data)
- → Unrepeatable Reads (re-reading committed data yields a different value)
- → Phantom Reads (a second read returns more rows than the first). Example: txn A modifies every row of a table; meanwhile txn B inserts a new row and commits. When A reads the whole table again after committing its update, it finds one row left unmodified, as if it had hallucinated. That is a phantom read.
- Isolation Levels 隔离级别:
- SERIALIZABLE: No phantoms, all reads repeatable, no dirty reads.
- REPEATABLE READS: Phantoms may happen.
- READ COMMITTED: Phantoms and unrepeatable reads may happen.
- READ UNCOMMITTED: All of them may happen.
- Implementation:
- SERIALIZABLE: Obtain all locks first; plus index locks, plus strict 2PL.
- REPEATABLE READS: Same as above, but no index locks.
- READ COMMITTED: Same as above, but S locks are released immediately.
- READ UNCOMMITTED: Same as above but allows dirty reads (no S locks)
13.Multi-Version Concurrency Control
- The DBMS maintains multiple physical versions of a single logical object in the database:
- → When a txn writes to an object, the DBMS creates a new version of that object.
- → When a txn reads an object, it reads the newest version that existed when the txn started.
- Benefits of MVCC:
- Writers do not block readers. Readers do not block writers.
- Read-only txns can read a consistent snapshot without acquiring locks.→ Use timestamps to determine visibility.
- Easily support time-travel queries
- MVCC is more than a concurrency control protocol. It completely affects how the DBMS manages transactions and the database.
- Concurrency Control Protocol: ways to implement MVCC
- Approach #1: Timestamp Ordering→ Assign txns timestamps that determine serial order.
- Approach #2: Optimistic Concurrency Control→ Three-phase protocol from last class.→ Use private workspace for new versions.
- Approach #3: Two-Phase Locking→ Txns acquire appropriate lock on physical version before they can read/write a logical tuple.
- Version Storage: the DBMS uses the tuples' pointer field to create a version chain per logical tuple. This allows the DBMS to find the version that is visible to a particular txn at runtime.
- Approach #1: Append-Only Storage → New versions are appended to the same table space. All physical versions of a logical tuple are stored in the same table space, intermixed with one another; on every update, the new version of the tuple is appended to free space in the table.
- Approach #1: Oldest-to-Newest (O2N) → Append the new version to the end of the chain. → Must traverse the chain on look-ups.
- Approach #2: Newest-to-Oldest (N2O) → Must update index pointers for every new version. → Do not have to traverse the chain on look-ups: take the first visible version.
- Approach #2: Time-Travel Storage→ Old versions are copied to separate table space.
- Approach #3: Delta Storage→ The original values of the modified attributes are copied into a separate delta record space.
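The N2O chain and timestamp-based visibility can be sketched together: a reader scans the chain newest-first and takes the first version whose begin timestamp is at or before its own. A minimal sketch (class and method names are illustrative):

```python
class MVCCTable:
    """MVCC sketch with newest-to-oldest (N2O) version chains: each
    logical key maps to a list of (begin_ts, value) pairs, newest
    first; a reader takes the first version visible at its timestamp."""
    def __init__(self):
        self.chains = {}   # key -> [(begin_ts, value), ...]

    def write(self, ts, key, value):
        # Prepend: the chain head is always the newest version.
        self.chains.setdefault(key, []).insert(0, (ts, value))

    def read(self, ts, key):
        for begin_ts, value in self.chains.get(key, []):
            if begin_ts <= ts:
                return value
        return None  # no version was visible when this txn started

db = MVCCTable()
db.write(1, 'A', 'a1')
db.write(5, 'A', 'a2')
```

A txn with timestamp 3 still sees `'a1'` even after the write at timestamp 5, which is exactly why readers never block writers.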
- Garbage Collection: the DBMS needs to remove reclaimable physical versions from the database over time:
- Versions that no active txn in the DBMS can "see" (SI: snapshot isolation, e.g., Oracle's snapshot isolation level).
- Versions created by aborted txns.
- Two additional design decisions: → How do we find expired versions? → How do we decide when it is safe to reclaim memory?
- Implementation approaches:
- Approach #1: Tuple-level → Find old versions by examining tuples directly. → Background vacuuming vs. cooperative cleaning.
- Approach #2: Transaction-level → Txns keep track of their old versions, so the DBMS does not have to scan tuples to determine visibility.
- Index Management
- Primary key indexes point to the version chain head.
- → How often the DBMS must update the pkey index depends on whether the system creates new versions when a tuple is updated.
- → If a txn updates a tuple's pkey attribute(s), this is treated as a DELETE followed by an INSERT.
- Secondary indexes
- Approach #1: Logical Pointers (requires an extra look-up back into the table)
- → Use a fixed identifier per tuple that does not change.
- → Requires an extra indirection layer.
- → Primary Key vs. Tuple Id
- Approach #2: Physical Pointers
- → Use the physical address of the version chain head.
- MVCC must store multiple versions whether a key is unique or not; maintaining unique keys takes extra work: every index's underlying data structure must support storing non-unique keys, and conditional inserts into pkey/unique indexes need additional execution logic.
- Deletes
- The DBMS removes a tuple from the database only when every version of the logically deleted tuple is invisible.
- → If a tuple is deleted, there cannot be a new version of that tuple after the newest version.
- → No write-write conflicts / first-writer wins.
- We need a way to denote that a tuple has been logically deleted at some point in time.
- Approach #1: Deleted Flag
- → Maintain a flag to indicate that the logical tuple has been deleted after the newest physical version.
- → Can either be in the tuple header or a separate column.
- Approach #2: Tombstone Tuple
- → Create an empty physical version to indicate that the logical tuple is deleted.
- → Use a separate pool for tombstone tuples, with only a special bit pattern in the version chain pointer, to reduce the storage overhead.
14.Database Logging
- Failure Classification
- storage types
- Volatile Storage: → Data does not persist after power loss or program exit. → Examples: DRAM, SRAM
- Non-volatile Storage: → Data persists after power loss and program exit. → Examples: HDD, SSD
- Stable Storage: → A non-existent form of non-volatile storage that survives all possible failure scenarios.
- failure classification
- Type #1 – Transaction Failures
- Logical Errors:→ Transaction cannot complete due to some internal error condition (e.g., integrity constraint violation).
- Internal State Errors:→ DBMS must terminate an active transaction due to an error condition (e.g., deadlock).
- Type #2 – System Failures
- Software Failure:→ Problem with the OS or DBMS implementation (e.g., uncaught divide-by-zero exception).
- Hardware Failure:→ The computer hosting the DBMS crashes (e.g., power plug gets pulled).→ Fail-stop Assumption: Non-volatile storage contents are assumed to not be corrupted by system crash.
- Type #3 – Storage Media Failures: No DBMS can recover from this! The database must be restored from an archived version.
- Buffer Pool Policies
- Data is safest on non-volatile media, but non-volatile storage is much slower than volatile storage. So use volatile memory for faster access: → first copy the target record into memory, → perform writes in memory, → write dirty records back to disk.
- But the database must guarantee: if a txn commits, its changes are durable; if a txn aborts, all of its changes are undone.
- Undo: The process of removing the effects of an incomplete or aborted txn.
- Redo: The process of re-applying the effects of a committed txn for durability [if a committed txn's changes never reached disk, they must be redone after the DBMS restarts].
- Buffer Pool Policies: steal policy: whether the DBMS allows an uncommitted txn to overwrite the most recent committed value of an object in non-volatile storage. STEAL: allowed. NO-STEAL: not allowed. [Controls what may be flushed: may dirty pages containing uncommitted data reach disk?]
- Buffer Pool Policies: force policy: whether the DBMS requires that all updates made by a txn are reflected on non-volatile storage before the txn can commit. FORCE: required. NO-FORCE: not required. [Controls when flushing happens: must commit wait for the flush?]
- Shadow Paging: an implementation of NO-STEAL + FORCE:
- Maintain two separate copies of the database: Master, containing only changes from committed txns; Shadow, a temporary database with changes from uncommitted txns. Txns only make updates in the shadow copy. When a txn commits, atomically switch the shadow to become the new master.
- What this scheme improves over a naive NO-STEAL + FORCE design: pages modified during a txn can be partially flushed before commit.
- Drawbacks: commit is expensive: 1. flush every not-yet-written page to disk, 2. update the db root pointer, 3. garbage-collect the old pages. It also tends to fragment data on disk.
- SQLite (as of 2010) used a variant of shadow paging: before a txn runs, the original pages are saved to disk in a journal file; pages modified in memory are flushed to disk, and on commit the journal file is deleted. On recovery, the journal file is simply copied back over the database.
- Observation: shadow paging causes many random disk reads and writes. We need a way for the DBMS to convert random writes into sequential writes.
- Write-Ahead Log (WAL): keep a separate log file recording the changes txns make to the database; before a dirty page is flushed to disk, the corresponding log records must be flushed first. Buffer pool policy: STEAL + NO-FORCE.
- WAL Protocol: keep a dedicated in-memory buffer (the WAL buffer) for log records; a txn appends the WAL record before modifying the data in memory; before a txn commits, its WAL records must be flushed to disk.
- WAL Protocol:WAL log包含的内容:
- → Transaction Id
- → Object Id
- → Before Value (UNDO)
- → After Value (REDO)
- MySQL InnoDB has both an undo log and a redo log.
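Because each record carries both a before value (UNDO) and an after value (REDO), the same log supports rollback and crash recovery. A minimal in-memory sketch, with the database modeled as a dict (function names are illustrative):

```python
log = []  # the WAL: an ordered list of records

def wal_update(db, txn_id, obj, new_value):
    """Write-ahead: append the log record BEFORE applying the change."""
    log.append({'txn': txn_id, 'obj': obj,
                'before': db.get(obj), 'after': new_value})
    db[obj] = new_value

def undo(db, txn_id):
    """Roll back one txn by restoring before-images, newest first."""
    for rec in reversed(log):
        if rec['txn'] == txn_id:
            db[rec['obj']] = rec['before']

def redo(db):
    """Reapply all after-images, oldest first (e.g., after a crash)."""
    for rec in log:
        db[rec['obj']] = rec['after']
```

The key invariant is the order inside `wal_update`: the record reaches the log before the data changes, so a crash can never leave a change on disk that the log does not describe.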
- WAL Protocol: some implementation details:
- When should the DBMS write log entries to disk? [when to flush the WAL]
- → When the transaction commits.
- → Can use group commit to batch multiple log flushes together to amortize overhead. [Optimization: txns committing at the same time wait for one another, the WAL is flushed once for the whole group, and then all their commits succeed.]
- When should the DBMS write dirty records to disk? [when to flush dirty pages]
- → Every time the txn executes an update?
- → Once when the txn commits?
- Logging Schemes
- Physical Logging → Record the changes made to a specific location in the database. → Example: git diff. [Takes a lot of space.]
- Logical Logging → Record the high-level operations executed by txns. → Not necessarily restricted to a single page. → Example: the UPDATE, DELETE, and INSERT queries invoked by a txn. [Replaying non-deterministic SQL, e.g., time-dependent or LIMIT queries, can produce different results.]
- Physiological Logging: a hybrid approach where log records target a single page but do not specify the page's internal organization. → Identify tuples by their slot number. → Allows the DBMS to reorganize a page after a log record has been written to disk. This is the most popular approach.
- Checkpoints: The DBMS periodically takes a checkpoint where it flushes all buffers out to disk.
15.Database Recovery
- Log Sequence Numbers (LSNs)
- flushedLSN: the last LSN in the log on disk.
- pageLSN: the LSN of the most recent log record that modified the page. [Upper bound of the in-memory changes to the page.]
- recLSN: the LSN of the first log record that modified the page since it was last flushed. [Lower bound of the in-memory changes to the page.]
- lastLSN: the LSN of the txn's most recent log record.
- MasterRecord: the LSN of the latest checkpoint.
- Before page x can be written to disk, we must flush the log at least to the point where pageLSN(x) ≤ flushedLSN. A data page may not reach disk before every log record describing its changes has reached disk.
- All log records have an LSN.
- Update the pageLSN every time a txn modifies a record in the page.
- Update the flushedLSN in memory every time the DBMS writes out the WAL buffer to disk.
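The page-flush invariant is a one-line comparison, but it is the rule the buffer pool manager must check on every eviction. A minimal sketch:

```python
def can_flush_page(page_lsn, flushed_lsn):
    """WAL rule: before a page goes to disk, every log record
    describing its changes must already be on disk, i.e.
    pageLSN(x) <= flushedLSN. Otherwise the log must be flushed
    further first."""
    return page_lsn <= flushed_lsn
```

If the check fails, the DBMS does not skip the eviction; it first flushes the WAL up to (at least) the page's pageLSN and then writes the page out.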
- Normal Commit & Abort Operations
- Simplifying assumptions for this discussion:
- All log records fit within a single page.
- Disk writes are atomic.
- Single-versioned tuples with Strict 2PL.
- STEAL + NO-FORCE buffer management with WAL.
- Txn Commit: write a COMMIT record to the log.
- All log records up to the txn's COMMIT record are flushed to disk.
- → Log flushes are sequential, synchronous writes to disk (synchronous: the commit blocks until the flush completes).
- → Many log records fit in one log page.
- When the commit succeeds, write a special TXN-END record to the log. This does not need to be flushed immediately. [TXN-END marks the point where the txn is truly finished.]
- Txn Abort: we need to add another field to our log records:
- prevLSN: the previous LSN created by the same txn. [Under concurrency, log records of different txns interleave.]
- This maintains a linked list for each txn that makes it easy to walk through its records.
- CLRs (Compensation Log Records): a CLR describes the action taken to undo a previously logged update. It contains all the fields of an update log record plus an undoNext pointer (the next LSN to undo). CLRs are appended to the log, but the DBMS does not wait for them to be flushed before notifying the application that the txn has aborted.
- Checkpointing
- Non-Fuzzy Checkpointing
- The DBMS halts everything while it takes the checkpoint, to ensure a consistent snapshot: → stop the start of any new txns,
- → wait for all active txns to finish executing [improvable → slightly better checkpoints],
- → flush dirty pages to disk.
- This is bad for runtime performance but makes recovery easy.
- Slightly Better Checkpoints: pause modifying txns while the DBMS takes the checkpoint.
- → Prevent queries from acquiring write latch on table/index pages.
- → Don't have to wait until all txns finish before taking the checkpoint
- We must record internal state as of the beginning of the checkpoint.→ Active Transaction Table (ATT)→ Dirty Page Table (DPT)
- Active Transaction Table (ATT)
- One entry per currently active txn.
- → txnId: Unique txn identifier.
- → status: The current "mode" of the txn.
- → lastLSN: Most recent LSN created by txn.
- Entry removed after the TXN-END message.
- Txn Status Codes:
- → R → Running
- → C → Committing
- → U → Candidate for Undo
- Dirty Page Table (DPT)
- → recLSN: The LSN of the log record that first caused the page to be dirty.
- Fuzzy Checkpointing
- New log records to track checkpoint boundaries:→ CHECKPOINT-BEGIN: Indicates start of checkpoint→ CHECKPOINT-END: Contains ATT + DPT
- ARIES: a recovery algorithm with three phases:
- Phase #1 – Analysis → Read WAL from last MasterRecord to identify dirty pages in the buffer pool and active txns at the time of the crash.
- Phase #2 – Redo → Repeat all actions starting from an appropriate point in the log (even txns that will abort).
- Phase #3 – Undo → Reverse the actions of txns that did not commit before the crash
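The three phases can be illustrated with a toy in-memory recovery routine: "losers" are the txns with no COMMIT record, history is repeated for everyone, and then the losers are rolled back. This is only a sketch; it ignores LSNs, CLRs, checkpoints, and the dirty page table:

```python
def recover(log):
    """Toy ARIES-style recovery over an in-memory list of log records
    ({'op': 'UPDATE'/'COMMIT', 'txn', 'obj', 'before', 'after'})."""
    # Phase 1 - Analysis: losers are txns that never logged a COMMIT.
    committed = {r['txn'] for r in log if r['op'] == 'COMMIT'}
    losers = {r['txn'] for r in log if r['op'] == 'UPDATE'} - committed
    # Phase 2 - Redo: repeat history, even for txns that will abort.
    db = {}
    for r in log:
        if r['op'] == 'UPDATE':
            db[r['obj']] = r['after']
    # Phase 3 - Undo: reverse the losers' updates, newest first.
    for r in reversed(log):
        if r['op'] == 'UPDATE' and r['txn'] in losers:
            db[r['obj']] = r['before']
    return db
```

Redoing everything first and only then undoing losers may look wasteful, but it restores the exact pre-crash state, which is what makes the undo pass (and repeated crashes during recovery) correct.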
16. Introduction to Distributed Databases
- System Architectures
- shared everything
- shared memory: CPUs have access to a common memory address space via a fast interconnect. → Each processor has a global view of all the in-memory data structures. → Each DBMS instance on a processor must "know" about the other instances.
- shared disk: All CPUs can access a single logical disk directly via an interconnect, but each has its own private memory. → Can scale the execution layer independently from the storage layer [separation of compute and storage]. → Must send messages between CPUs to learn about their current state.
- shared nothing: Each DBMS instance has its own CPU, memory, and local disk. Nodes only communicate with each other via the network. → Harder to scale capacity. → Harder to ensure consistency. → Better performance & efficiency.
- Design Issues
- How does the application find data?
- Where does the application send queries?
- How to execute queries on distributed data? → Push query to data.→ Pull data to query.
- How does the DBMS ensure correctness? How do we divide the database across resources?
- Approaches to these design issues:
- Approach #1: Homogeneous Nodes
- → Every node in the cluster can perform the same set of tasks (albeit on potentially different partitions of data).
- → Makes provisioning and failover "easier".
- Approach #2: Heterogeneous Nodes
- → Nodes are assigned specific tasks.
- → Can allow a single physical node to host multiple "virtual" node types for dedicated tasks.
- Partitioning Schemes
- The DBMS can partition a database physically (shared nothing) or logically (shared disk). Partitioning schemes: → Hashing (in practice, consistent hashing) → Ranges → Predicates
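Hash partitioning in distributed systems usually means consistent hashing: nodes and keys are both hashed onto a ring, and each node owns the arc before it, so adding or removing a node only relocates the keys on one arc. A minimal sketch (class name and ring layout are illustrative):

```python
import bisect
import hashlib

def _h(value):
    """Deterministic hash onto the ring (an integer position)."""
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Sketch of consistent hashing: a key belongs to the first node
    clockwise from the key's position on the ring."""
    def __init__(self, nodes):
        self.ring = sorted((_h(n), n) for n in nodes)

    def node_for(self, key):
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect(positions, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

Real systems typically add many "virtual nodes" per physical node to even out arc sizes; that detail is omitted here.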
- Distributed Concurrency Control
- Approaches to distributed transaction management: → Centralized: a global "traffic cop". → Decentralized: nodes organize themselves.
17.OLTP
- OLTP vs. OLAP
- On-line Transaction Processing (OLTP):→ Short-lived read/write txns.→ Small footprint.→ Repetitive operations.
- On-line Analytical Processing (OLAP):→ Long-running, read-only queries.→ Complex joins.→ Exploratory queries.
- Atomic Commit Protocols
- When a multi-node txn finishes, the DBMS needs to ask all the nodes involved whether it is safe to commit.
- Examples:
- → Two-Phase Commit
- → Three-Phase Commit (not used)
- → Paxos
- → Raft
- → ZAB (Apache Zookeeper)
- → Viewstamped Replication
- Replication
- Design Decisions:
- → Replica Configuration: Approach #1: Primary-Replica; Approach #2: Multi-Primary
- → Propagation Scheme: → Synchronous (Strong Consistency) → Asynchronous (Eventual Consistency)
- → Propagation Timing: Approach #1: Continuous; Approach #2: On Commit
- → Update Method
- Consistency Issues (CAP / PACELC) → Consistent→ Always Available→ Network Partition Tolerant
- Google Spanner
18.OLAP
- Execution Models
- Query Planning
- Distributed Join Algorithms
- Cloud Systems
- Approach #1: Managed DBMSs
- → No significant modification to the DBMS to make it "aware" that it is running in a cloud environment.
- → Examples: most vendors
- Approach #2: Cloud-Native DBMS
- → The system is designed explicitly to run in a cloud environment.
- → Usually based on a shared-disk architecture.
- → Examples: Snowflake, Google BigQuery, Amazon Redshift, Microsoft SQL Azure