Table of Contents
12. Timestamp Ordering Concurrency Control [optimistic]
13.Multi-Version Concurrency Control
10.Concurrency Control Theory
- Guarantees of transaction correctness: ACID
- Atomicity: All actions in the txn happen, or none happen.
- Consistency: If each txn is consistent and the DB starts consistent, then it ends up consistent.
- Isolation: Execution of one txn is isolated from that of other txns.
- Durability: If a txn commits, its effects persist.
- Mechanisms for ensuring atomicity:
- Logging (undo log) [the log serves to: 1. roll back txns 2. support auditing 3. improve efficiency (record changes to the log first, then apply them to disk; sequential log writes are fast)]
- Shadow Paging: the DBMS first copies the pages a txn will modify; the txn modifies the copies, and only when it commits do those pages become visible to others (originally from System R; used by only a few databases: CouchDB, LMDB (OpenLDAP))
- Mechanisms for ensuring isolation: concurrency control, either pessimistic or optimistic
- Serializable Schedule
- "Conflicting" operations: two operations conflict if:
- → They are by different transactions,
- → They are on the same object and at least one of them is a write.
- Read-Write Conflicts (R-W): unrepeatable reads
- Write-Read Conflicts (W-R): reading uncommitted data ("dirty reads")
- Write-Write Conflicts (W-W)
- Schedule S is conflict serializable if you can transform S into a serial schedule by swapping consecutive non-conflicting operations of different transactions.
- Dependency Graphs
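The dependency-graph test can be sketched in a few lines: build an edge Ti → Tj for every pair of conflicting operations where Ti's operation comes first, then check for a cycle. A minimal sketch (the tuple layout `(txn, op, obj)` is an assumption for illustration):

```python
from collections import defaultdict

def precedence_graph(schedule):
    """Build edges Ti -> Tj for each pair of conflicting operations
    (different txns, same object, at least one write) where Ti's
    operation precedes Tj's. schedule: list of (txn, op, obj)."""
    edges = defaultdict(set)
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            if ti != tj and x == y and 'W' in (op_i, op_j):
                edges[ti].add(tj)
    return edges

def has_cycle(edges):
    """DFS cycle detection; a schedule is conflict serializable
    iff its precedence graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def visit(u):
        color[u] = GRAY
        for v in edges[u]:
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and visit(u) for u in list(edges))

# T1 writes A before T2 reads it, but T2 writes B before T1 reads it:
# edges T1->T2 and T2->T1 form a cycle, so not conflict serializable.
s = [('T1', 'W', 'A'), ('T2', 'R', 'A'), ('T2', 'W', 'B'), ('T1', 'R', 'B')]
```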
- Mechanisms for ensuring durability: all committed transactions must be durable:
- Logging
- Shadow Paging
11. Two-Phase Locking [pessimistic]
- Basic Lock Types
- S-LOCK: shared lock
- X-LOCK: exclusive lock
- Two-Phase Locking 2PL:
- Phase #1: Growing
- Each txn requests the locks that it needs from the DBMS’s lock manager.
- The lock manager grants/denies lock requests.
- Phase #2: Shrinking
- The txn is allowed to only release locks that it previously acquired. It cannot acquire new locks.
- 2PL on its own is sufficient to guarantee conflict serializability: it generates schedules whose precedence graph is acyclic. But it is subject to cascading aborts: if T2 reads T1's uncommitted writes (a dirty read) and T1 then rolls back, T2 must roll back as well [cascading abort].
- Problems caused by 2PL:
- Dirty reads (cascading aborts). Solution: strong strict 2PL: a txn releases its locks only after it commits.
- Deadlocks. Solutions:
- Approach #1: Deadlock Detection: when the DBMS detects a deadlock, it selects a "victim" txn to roll back to break the cycle. The victim txn either restarts or aborts (the more common case), depending on how it was invoked. There is a trade-off between how often the system checks for deadlocks and how long txns wait before a deadlock is broken.
- Approach #2: Deadlock Prevention
- Selecting the proper victim depends on many different variables. Reasonable criteria for choosing which txn to roll back:
- By age (lowest timestamp): prefer a txn that started recently; e.g., a large txn that has already run for a long time is a poor choice of victim
- By progress (least/most queries executed)
- By the # of items already locked
- By the # of txns that we have to rollback with it
- After selecting a victim txn to abort, the DBMS can also decide how far to roll back the txn's changes:
- Approach #1: Completely
- Approach #2: Minimally
- Deadlock Prevention:
- Assign priorities based on timestamps:→ Older Timestamp = Higher Priority (e.g., T1 > T2)
- Wait-Die ("Old Waits for Young")
- → If the requesting txn has higher priority than the holding txn, the requesting txn waits for the holding txn.
- → Otherwise the requesting txn aborts.
- Wound-Wait ("Young Waits for Old")
- → If the requesting txn has higher priority than the holding txn, the holding txn aborts and releases its lock.
- → Otherwise the requesting txn waits.
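The two prevention policies can be captured as pure decision functions. A minimal sketch, assuming a smaller timestamp means an older txn with higher priority:

```python
def wait_die(requester_ts, holder_ts):
    """Wait-Die ('old waits for young'): the requester may wait only
    if it is older (smaller timestamp) than the lock holder."""
    return 'wait' if requester_ts < holder_ts else 'abort-requester'

def wound_wait(requester_ts, holder_ts):
    """Wound-Wait ('young waits for old'): an older requester wounds
    (aborts) the younger holder; a younger requester waits."""
    return 'abort-holder' if requester_ts < holder_ts else 'wait'
```

Both policies always abort the younger txn, so the oldest txn in any conflict cycle is never killed and cannot starve.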
- Lock Granularities:
- Attribute? Tuple? Page? Table?
- Trade-off between parallelism and overhead: → fewer locks with larger granularity vs. more locks with smaller granularity.
- Intention Locks: an intention lock allows a higher-level node to be locked in shared or exclusive mode without examining all of its descendant nodes. If a node is locked in an intention mode, some txn is doing explicit locking at a lower level of the tree.
- Intention-Shared (IS)→ Indicates explicit locking at lower level with shared locks.
- Intention-Exclusive (IX)→ Indicates explicit locking at lower level with exclusive locks.
- Shared+Intention-Exclusive (SIX)→ The subtree rooted at that node is locked explicitly in shared mode, and explicit locking is being done at a lower level with exclusive-mode locks.
- Lock escalation: if a txn acquires too many row locks, the lock manager escalates them into a table lock.
- In practice, locks are usually acquired automatically by the database, e.g., while executing an UPDATE; some SQL dialects also support taking locks manually.
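The five modes above interact through the standard multiple-granularity compatibility matrix; a lock request on a node is granted only if it is compatible with every lock already held there. A minimal sketch:

```python
# Standard compatibility matrix for hierarchical lock modes.
# COMPATIBLE[(requested, held)] is True if they can coexist on a node.
COMPATIBLE = {
    ('IS', 'IS'): True,  ('IS', 'IX'): True,  ('IS', 'S'): True,
    ('IS', 'SIX'): True, ('IS', 'X'): False,
    ('IX', 'IS'): True,  ('IX', 'IX'): True,  ('IX', 'S'): False,
    ('IX', 'SIX'): False, ('IX', 'X'): False,
    ('S', 'IS'): True,   ('S', 'IX'): False,  ('S', 'S'): True,
    ('S', 'SIX'): False, ('S', 'X'): False,
    ('SIX', 'IS'): True, ('SIX', 'IX'): False, ('SIX', 'S'): False,
    ('SIX', 'SIX'): False, ('SIX', 'X'): False,
    ('X', 'IS'): False,  ('X', 'IX'): False,   ('X', 'S'): False,
    ('X', 'SIX'): False, ('X', 'X'): False,
}

def can_grant(requested, held_modes):
    """Grant a request only if the requested mode is compatible with
    every mode already held on that node."""
    return all(COMPATIBLE[(requested, h)] for h in held_modes)
```

For example, two txns can both hold IX on a table (they intend to write different tuples below it), but S on the table blocks IX, since a full-table reader conflicts with any lower-level writer.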
12. Timestamp Ordering Concurrency Control [optimistic]
- Basic Timestamp Ordering (T/O) Protocol
- Every tuple carries two timestamps: (1) W-TS(X), the write timestamp on X; (2) R-TS(X), the read timestamp on X
- Every operation is checked: if a txn tries to access an object "from the future", it aborts and restarts.
- Basic T/O Reads:If TS(Ti) < W-TS(X), Abort Ti and restart it with a new TS. Else:
- → Allow Ti to read X.
- → Update R-TS(X) to max(R-TS(X), TS(Ti))
- → Make a local copy of X to ensure repeatable reads for Ti.
- Basic T/O Writes: If TS(Ti) < R-TS(X) or TS(Ti) < W-TS(X), abort and restart Ti. Else:
- → Allow Ti to write X and update W-TS(X)
- → Also make a local copy of X to ensure repeatable reads
- The write check above can be relaxed with the Thomas Write Rule:
- If TS(Ti) < R-TS(X): Abort and restart Ti.
- If TS(Ti) < W-TS(X): → Thomas Write Rule: ignore the write and let the txn continue executing without aborting (this violates the timestamp order of Ti). Else: → allow Ti to write X and update W-TS(X). Intuition: a future txn has already overwritten the value, so the current txn can skip its write and keep running without aborting; having done so, it must still update its own local snapshot.
- Basic T/O (without the Thomas Write Rule) has the following properties and problems:
- No deadlocks, because txns never wait and there is no locking.
- Long txns can starve: everything they touch may already have been modified by short txns, turning it into data "from the future".
- A schedule is recoverable only if each txn commits after every txn whose changes it read has committed. Otherwise the DBMS cannot guarantee that the data txns read will still be there after crash recovery.
- Performance problems: high overhead from copying data into each txn's workspace and from updating timestamps.
- In Basic T/O, when a txn reads A it validates the access and, if valid, makes a local snapshot. On the txn's next access to A, does it check timestamps again?
It depends on the operation. When the txn touches A again:
- For a read: no timestamp check is needed; just read the local snapshot.
- For a write: check the read timestamp first:
- If A has already been read by a later txn, this txn aborts;
- Otherwise, check the write timestamp:
- If A has already been overwritten by a later txn, then by the Thomas Write Rule just update the local snapshot;
- Otherwise, update the local snapshot and write the result to the database.
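The Basic T/O read and write checks can be sketched directly from the rules above. A minimal sketch without the Thomas Write Rule or local snapshots (the `Abort` exception and class names are illustrative):

```python
class Abort(Exception):
    """Raised when a txn touches an object 'from the future'."""

class BasicTO:
    """Minimal Basic Timestamp Ordering sketch: each object carries
    R-TS and W-TS; violating txns abort (and would restart with a
    new, larger timestamp)."""
    def __init__(self):
        self.r_ts = {}   # object -> largest timestamp that read it
        self.w_ts = {}   # object -> timestamp of its last writer

    def read(self, ts, x):
        # Reject reads of objects written in this txn's future.
        if ts < self.w_ts.get(x, 0):
            raise Abort(f'T{ts}: {x} was written in the future')
        self.r_ts[x] = max(self.r_ts.get(x, 0), ts)

    def write(self, ts, x):
        # Reject writes that a future txn has already read or written.
        if ts < self.r_ts.get(x, 0) or ts < self.w_ts.get(x, 0):
            raise Abort(f'T{ts}: write to {x} conflicts with the future')
        self.w_ts[x] = ts
```

Adding the Thomas Write Rule would change only the second check in `write`: if `ts < self.w_ts[x]` but not `ts < self.r_ts[x]`, the write is silently skipped instead of aborting.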
- [OCC] Optimistic Concurrency Control
- If you assume conflicts between txns are rare and most txns are short-lived, then forcing txns to wait to acquire locks adds a lot of overhead. A better approach is to optimize for the no-conflict case, i.e., use OCC.
- OCC: The DBMS creates a private workspace for each txn.
- Any object read is copied into the workspace.
- Modifications are applied to the workspace. When a txn commits, the DBMS compares the workspace write set to see whether it conflicts with other txns [this is where OCC differs from Basic T/O].
- [OCC] Three phases:
- #1 – Read Phase: → Track the read/write sets of txns and store their writes in a private workspace.
- #2 – Validation Phase: → When a txn commits, check whether it conflicts with other txns. [The txn receives its timestamp in this phase.]
- #3 – Write Phase: → If validation succeeds, apply the private changes to the database. Otherwise abort and restart the txn.
- [OCC] Read Phase: track the read/write sets of txns and store their writes in a private workspace. The DBMS copies every tuple the txn accesses from the shared database into its workspace to ensure repeatable reads.
- [OCC] Validation Phase: when txn Ti invokes COMMIT, the DBMS checks whether it conflicts with other txns. The DBMS must allow only serializable schedules [i.e., smaller timestamps happen first]: it checks other txns for RW and WW conflicts and makes sure all conflicts go in one direction (e.g., older → younger) [i.e., the dependency graph must be acyclic]. Two approaches:
- → Backward Validation: validate against older txns [smaller timestamps], i.e., txns that have already committed; if validation fails, abort
- → Forward Validation: validate against younger txns [larger timestamps], i.e., the not-yet-committed txns that overlap with this one; validation passes if one of the following holds:
- Ti completes all three phases before Tj begins (no overlap; purely serial).
- Ti completes before Tj starts its Write phase, and Ti does not write to any object read by Tj: → WriteSet(Ti) ∩ ReadSet(Tj) = Ø
- Ti completes its Read phase before Tj completes its Read phase, and Ti does not write to any object that is either read or written by Tj: → WriteSet(Ti) ∩ ReadSet(Tj) = Ø and WriteSet(Ti) ∩ WriteSet(Tj) = Ø
- [OCC] Write Phase: the DBMS propagates the changes in the txn's write set to the database and makes them visible to other txns. Assume only one txn can be in the Write phase at a time. → Use write latches to support parallel validation/writes.
- [OCC] Drawbacks:
- High overhead of copying data locally.
- The validation/write phase is a bottleneck.
- Aborts are more wasteful than in 2PL, because they happen only after the txn has already executed.
- Neither 2PL nor OCC prevents phantoms, because T1 locks only the records that already exist, not records being inserted! Conflict serializability on reads and writes of individual items guarantees serializability only if the set of objects is fixed. Solutions:
- Approach #1: Re-Execute Scans: at commit, re-run the WHERE clause of every range query and verify the result is unchanged
- Approach #2: Predicate Locking: a locking scheme proposed by System R. → Shared lock on the predicate in the WHERE clause of a SELECT query. → Exclusive lock on the predicate in the WHERE clause of any UPDATE, INSERT, or DELETE query. Never implemented in any system except HyPer (precision locking).
- Approach #3: Index Locking: if the predicate involves an indexed column, lock the index pages; without an index, lock the table
- MySQL's implementation: gap locks
- Isolation Levels: anomalies under concurrent transactions:
- → Dirty Reads (reading uncommitted data)
- → Unrepeatable Reads (re-reading committed data yields a different value)
- → Phantom Reads (a second read returns more rows than the first). Example: txn A modifies every row of a table; meanwhile txn B inserts a new row and commits. When A reads the whole table again after committing its update, it finds one row left unmodified, as if it had hallucinated. That is a phantom read.
- Isolation Levels 隔离级别:
- SERIALIZABLE: No phantoms, all reads repeatable, no dirty reads.
- REPEATABLE READS: Phantoms may happen.
- READ COMMITTED: Phantoms and unrepeatable reads may happen.
- READ UNCOMMITTED: All of them may happen.
- Implementation:
- SERIALIZABLE: Obtain all locks first; plus index locks, plus strict 2PL.
- REPEATABLE READS: Same as above, but no index locks.
- READ COMMITTED: Same as above, but S locks are released immediately.
- READ UNCOMMITTED: Same as above but allows dirty reads (no S locks)
13.Multi-Version Concurrency Control
- The DBMS maintains multiple physical versions of a single logical object in the database:
- → When a txn writes to an object, the DBMS creates a new version of that object.
- → When a txn reads an object, it reads the newest version that existed when the txn started.
- Benefits of MVCC:
- Writers do not block readers. Readers do not block writers.
- Read-only txns can read a consistent snapshot without acquiring locks.→ Use timestamps to determine visibility.
- Easily support time-travel queries
- MVCC is more than a concurrency control protocol. It completely affects how the DBMS manages transactions and the database.
- Concurrency Control Protocol: ways to implement MVCC
- Approach #1: Timestamp Ordering→ Assign txns timestamps that determine serial order.
- Approach #2: Optimistic Concurrency Control→ Three-phase protocol from last class.→ Use private workspace for new versions.
- Approach #3: Two-Phase Locking→ Txns acquire appropriate lock on physical version before they can read/write a logical tuple.
- Version Storage: the DBMS uses the tuples' pointer field to create a version chain per logical tuple. This allows the DBMS to find the version that is visible to a particular txn at runtime.
- Approach #1: Append-Only Storage → New versions are appended to the same table space. All physical versions of a logical tuple are stored in the same table space, intermixed with one another; on every update, the new version of the tuple is appended to free space in the table.
- Approach #1: Oldest-to-Newest (O2N) → Append the new version to the end of the chain. → Must traverse the chain on look-ups.
- Approach #2: Newest-to-Oldest (N2O) → Must update index pointers for every new version. → Do not have to traverse the chain on look-ups: take the first visible version.
- Approach #2: Time-Travel Storage→ Old versions are copied to separate table space.
- Approach #3: Delta Storage→ The original values of the modified attributes are copied into a separate delta record space.
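The N2O chain and timestamp-based visibility can be sketched together: a reader scans the chain newest-first and takes the first version whose begin timestamp is at or before its own. A minimal sketch (class and method names are illustrative):

```python
class MVCCTable:
    """MVCC sketch with newest-to-oldest (N2O) version chains: each
    logical key maps to a list of (begin_ts, value) pairs, newest
    first; a reader takes the first version visible at its timestamp."""
    def __init__(self):
        self.chains = {}   # key -> [(begin_ts, value), ...]

    def write(self, ts, key, value):
        # Prepend: the chain head is always the newest version.
        self.chains.setdefault(key, []).insert(0, (ts, value))

    def read(self, ts, key):
        for begin_ts, value in self.chains.get(key, []):
            if begin_ts <= ts:
                return value
        return None  # no version was visible when this txn started

db = MVCCTable()
db.write(1, 'A', 'a1')
db.write(5, 'A', 'a2')
```

A txn with timestamp 3 still sees `'a1'` even after the write at timestamp 5, which is exactly why readers never block writers.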
- Garbage Collection: the DBMS needs to remove reclaimable physical versions from the database over time:
- Versions that no active txn in the DBMS can "see" (SI: snapshot isolation, e.g., Oracle's snapshot isolation level).
- Versions created by aborted txns.
- Two additional design decisions: → How do we find expired versions? → How do we decide when it is safe to reclaim memory?
- Implementation approaches:
- Approach #1: Tuple-level → Find old versions by examining tuples directly. → Background vacuuming vs. cooperative cleaning.
- Approach #2: Transaction-level → Txns keep track of their old versions, so the DBMS does not have to scan tuples to determine visibility.
- Index Management
- Primary key indexes point to the version chain head.
- → How often the DBMS must update the pkey index depends on whether the system creates new versions when a tuple is updated.
- → If a txn updates a tuple's pkey attribute(s), this is treated as a DELETE followed by an INSERT.
- Secondary indexes
- Approach #1: Logical Pointers (requires an extra look-up back into the table)
- → Use a fixed identifier per tuple that does not change.
- → Requires an extra indirection layer.
- → Primary Key vs. Tuple Id
- Approach #2: Physical Pointers
- → Use the physical address of the version chain head.
- MVCC must store multiple versions whether a key is unique or not; maintaining unique keys takes extra work: every index's underlying data structure must support storing non-unique keys, and conditional inserts into pkey/unique indexes need additional execution logic.
- Deletes
- The DBMS removes a tuple from the database only when every version of the logically deleted tuple is invisible.
- → If a tuple is deleted, there cannot be a new version of that tuple after the newest version.
- → No write-write conflicts / first-writer wins.
- We need a way to denote that a tuple has been logically deleted at some point in time.
- Approach #1: Deleted Flag
- → Maintain a flag to indicate that the logical tuple has been deleted after the newest physical version.
- → Can either be in the tuple header or a separate column.
- Approach #2: Tombstone Tuple
- → Create an empty physical version to indicate that the logical tuple is deleted.
- → Use a separate pool for tombstone tuples, with only a special bit pattern in the version chain pointer, to reduce the storage overhead.
14.Database Logging
- Failure Classification
- storage types
- Volatile Storage: → Data does not persist after power loss or program exit. → Examples: DRAM, SRAM
- Non-volatile Storage: → Data persists after power loss and program exit. → Examples: HDD, SSD
- Stable Storage: → A non-existent form of non-volatile storage that survives all possible failure scenarios.
- failure classification
- Type #1 – Transaction Failures
- Logical Errors:→ Transaction cannot complete due to some internal error condition (e.g., integrity constraint violation).
- Internal State Errors:→ DBMS must terminate an active transaction due to an error condition (e.g., deadlock).
- Type #2 – System Failures
- Software Failure:→ Problem with the OS or DBMS implementation (e.g., uncaught divide-by-zero exception).
- Hardware Failure:→ The computer hosting the DBMS crashes (e.g., power plug gets pulled).→ Fail-stop Assumption: Non-volatile storage contents are assumed to not be corrupted by system crash.
- Type #3 – Storage Media Failures: No DBMS can recover from this! The database must be restored from an archived version.
- Buffer Pool Policies
- Data is safest on non-volatile media, but non-volatile storage is much slower than volatile storage. So use volatile memory for faster access: → first copy the target record into memory, → perform writes in memory, → write dirty records back to disk.
- But the database must guarantee: if a txn commits, its changes are durable; if a txn aborts, all of its changes are undone.
- Undo: The process of removing the effects of an incomplete or aborted txn.
- Redo: The process of re-applying the effects of a committed txn for durability [if a committed txn's changes never reached disk, they must be redone after the DBMS restarts].
- Buffer Pool Policies: steal policy: whether the DBMS allows an uncommitted txn to overwrite the most recent committed value of an object in non-volatile storage. STEAL: allowed. NO-STEAL: not allowed. [Controls what may be flushed: may dirty pages containing uncommitted data reach disk?]
- Buffer Pool Policies: force policy: whether the DBMS requires that all updates made by a txn are reflected on non-volatile storage before the txn can commit. FORCE: required. NO-FORCE: not required. [Controls when flushing happens: must commit wait for the flush?]
- Shadow Paging: an implementation of NO-STEAL + FORCE:
- Maintain two separate copies of the database: Master, containing only changes from committed txns; Shadow, a temporary database with changes from uncommitted txns. Txns only make updates in the shadow copy. When a txn commits, atomically switch the shadow to become the new master.
- What this scheme improves over a naive NO-STEAL + FORCE design: pages modified during a txn can be partially flushed before commit.
- Drawbacks: commit is expensive: 1. flush every not-yet-written page to disk, 2. update the db root pointer, 3. garbage-collect the old pages. It also tends to fragment data on disk.
- SQLite (as of 2010) used a variant of shadow paging: before a txn runs, the original pages are saved to disk in a journal file; pages modified in memory are flushed to disk, and on commit the journal file is deleted. On recovery, the journal file is simply copied back over the database.
- Observation: shadow paging causes many random disk reads and writes. We need a way for the DBMS to convert random writes into sequential writes.
- Write-Ahead Log (WAL): keep a separate log file recording the changes txns make to the database; before a dirty page is flushed to disk, the corresponding log records must be flushed first. Buffer pool policy: STEAL + NO-FORCE.
- WAL Protocol: keep a dedicated in-memory buffer (the WAL buffer) for log records; a txn appends the WAL record before modifying the data in memory; before a txn commits, its WAL records must be flushed to disk.
- WAL Protocol:WAL log包含的内容:
- → Transaction Id
- → Object Id
- → Before Value (UNDO)
- → After Value (REDO)
- MySQL InnoDB has both an undo log and a redo log.
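Because each record carries both a before value (UNDO) and an after value (REDO), the same log supports rollback and crash recovery. A minimal in-memory sketch, with the database modeled as a dict (function names are illustrative):

```python
log = []  # the WAL: an ordered list of records

def wal_update(db, txn_id, obj, new_value):
    """Write-ahead: append the log record BEFORE applying the change."""
    log.append({'txn': txn_id, 'obj': obj,
                'before': db.get(obj), 'after': new_value})
    db[obj] = new_value

def undo(db, txn_id):
    """Roll back one txn by restoring before-images, newest first."""
    for rec in reversed(log):
        if rec['txn'] == txn_id:
            db[rec['obj']] = rec['before']

def redo(db):
    """Reapply all after-images, oldest first (e.g., after a crash)."""
    for rec in log:
        db[rec['obj']] = rec['after']
```

The key invariant is the order inside `wal_update`: the record reaches the log before the data changes, so a crash can never leave a change on disk that the log does not describe.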
- WAL Protocol: some implementation details:
- When should the DBMS write log entries to disk? [when to flush the WAL]
- → When the transaction commits.
- → Can use group commit to batch multiple log flushes together to amortize overhead. [Optimization: txns committing at the same time wait for one another, the WAL is flushed once for the whole group, and then all their commits succeed.]
- When should the DBMS write dirty records to disk? [when to flush dirty pages]
- → Every time the txn executes an update?
- → Once when the txn commits?
- Logging Schemes
- Physical Logging → Record the changes made to a specific location in the database. → Example: git diff. [Takes a lot of space.]
- Logical Logging → Record the high-level operations executed by txns. → Not necessarily restricted to a single page. → Example: the UPDATE, DELETE, and INSERT queries invoked by a txn. [Replaying non-deterministic SQL, e.g., time-dependent or LIMIT queries, can produce different results.]
- Physiological Logging: a hybrid approach where log records target a single page but do not specify the page's internal organization. → Identify tuples by their slot number. → Allows the DBMS to reorganize a page after a log record has been written to disk. This is the most popular approach.
- Checkpoints: The DBMS periodically takes a checkpoint where it flushes all buffers out to disk.
15.Database Recovery
- Log Sequence Numbers (LSNs)
- flushedLSN: the last LSN in the log on disk.
- pageLSN: the LSN of the most recent log record that modified the page. [Upper bound of the in-memory changes to the page.]
- recLSN: the LSN of the first log record that modified the page since it was last flushed. [Lower bound of the in-memory changes to the page.]
- lastLSN: the LSN of the txn's most recent log record.
- MasterRecord: the LSN of the latest checkpoint.
- Before page x can be written to disk, we must flush the log at least to the point where pageLSN(x) ≤ flushedLSN. A data page may not reach disk before every log record describing its changes has reached disk.
- All log records have an LSN.
- Update the pageLSN every time a txn modifies a record in the page.
- Update the flushedLSN in memory every time the DBMS writes out the WAL buffer to disk.
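The page-flush invariant is a one-line comparison, but it is the rule the buffer pool manager must check on every eviction. A minimal sketch:

```python
def can_flush_page(page_lsn, flushed_lsn):
    """WAL rule: before a page goes to disk, every log record
    describing its changes must already be on disk, i.e.
    pageLSN(x) <= flushedLSN. Otherwise the log must be flushed
    further first."""
    return page_lsn <= flushed_lsn
```

If the check fails, the DBMS does not skip the eviction; it first flushes the WAL up to (at least) the page's pageLSN and then writes the page out.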
- Normal Commit & Abort Operations
- Simplifying assumptions for this discussion:
- All log records fit within a single page.
- Disk writes are atomic.
- Single-versioned tuples with Strict 2PL.
- STEAL + NO-FORCE buffer management with WAL.
- Txn Commit: write a COMMIT record to the log.
- All log records up to the txn's COMMIT record are flushed to disk.
- → Log flushes are sequential, synchronous writes to disk (synchronous: the commit blocks until the flush completes).
- → Many log records fit in one log page.
- When the commit succeeds, write a special TXN-END record to the log. This does not need to be flushed immediately. [TXN-END marks the point where the txn is truly finished.]
- Txn Abort: we need to add another field to our log records:
- prevLSN: the previous LSN created by the same txn. [Under concurrency, log records of different txns interleave.]
- This maintains a linked list for each txn that makes it easy to walk through its records.
- CLRs (Compensation Log Records): a CLR describes the action taken to undo a previously logged update. It contains all the fields of an update log record plus an undoNext pointer (the next LSN to undo). CLRs are appended to the log, but the DBMS does not wait for them to be flushed before notifying the application that the txn has aborted.
- Checkpointing
- Non-Fuzzy Checkpointing
- The DBMS halts everything while it takes the checkpoint, to ensure a consistent snapshot: → stop the start of any new txns,
- → wait for all active txns to finish executing [improvable → slightly better checkpoints],
- → flush dirty pages to disk.
- This is bad for runtime performance but makes recovery easy.
- Slightly Better Checkpoints: pause modifying txns while the DBMS takes the checkpoint.
- → Prevent queries from acquiring write latch on table/index pages.
- → Don't have to wait until all txns finish before taking the checkpoint
- We must record internal state as of the beginning of the checkpoint.→ Active Transaction Table (ATT)→ Dirty Page Table (DPT)
- Active Transaction Table (ATT)
- One entry per currently active txn.
- → txnId: Unique txn identifier.
- → status: The current "mode" of the txn.
- → lastLSN: Most recent LSN created by txn.
- Entry removed after the TXN-END message.
- Txn Status Codes:
- → R → Running
- → C → Committing
- → U → Candidate for Undo
- Dirty Page Table (DPT)
- → recLSN: The LSN of the log record that first caused the page to be dirty.
- Fuzzy Checkpointing
- New log records to track checkpoint boundaries:→ CHECKPOINT-BEGIN: Indicates start of checkpoint→ CHECKPOINT-END: Contains ATT + DPT
- ARIES: a recovery algorithm with three phases:
- Phase #1 – Analysis → Read WAL from last MasterRecord to identify dirty pages in the buffer pool and active txns at the time of the crash.
- Phase #2 – Redo → Repeat all actions starting from an appropriate point in the log (even txns that will abort).
- Phase #3 – Undo → Reverse the actions of txns that did not commit before the crash
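The three phases can be illustrated with a toy in-memory recovery routine: "losers" are the txns with no COMMIT record, history is repeated for everyone, and then the losers are rolled back. This is only a sketch; it ignores LSNs, CLRs, checkpoints, and the dirty page table:

```python
def recover(log):
    """Toy ARIES-style recovery over an in-memory list of log records
    ({'op': 'UPDATE'/'COMMIT', 'txn', 'obj', 'before', 'after'})."""
    # Phase 1 - Analysis: losers are txns that never logged a COMMIT.
    committed = {r['txn'] for r in log if r['op'] == 'COMMIT'}
    losers = {r['txn'] for r in log if r['op'] == 'UPDATE'} - committed
    # Phase 2 - Redo: repeat history, even for txns that will abort.
    db = {}
    for r in log:
        if r['op'] == 'UPDATE':
            db[r['obj']] = r['after']
    # Phase 3 - Undo: reverse the losers' updates, newest first.
    for r in reversed(log):
        if r['op'] == 'UPDATE' and r['txn'] in losers:
            db[r['obj']] = r['before']
    return db
```

Redoing everything first and only then undoing losers may look wasteful, but it restores the exact pre-crash state, which is what makes the undo pass (and repeated crashes during recovery) correct.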
16. Introduction to Distributed Databases
- System Architectures
- shared everything
- shared memory: CPUs have access to a common memory address space via a fast interconnect. → Each processor has a global view of all the in-memory data structures. → Each DBMS instance on a processor must "know" about the other instances.
- shared disk: All CPUs can access a single logical disk directly via an interconnect, but each has its own private memory. → Can scale the execution layer independently from the storage layer [separation of compute and storage]. → Must send messages between CPUs to learn about their current state.
- shared nothing: Each DBMS instance has its own CPU, memory, and local disk. Nodes only communicate with each other via the network. → Harder to scale capacity. → Harder to ensure consistency. → Better performance & efficiency.
- Design Issues
- How does the application find data?
- Where does the application send queries?
- How to execute queries on distributed data? → Push query to data.→ Pull data to query.
- How does the DBMS ensure correctness? How do we divide the database across resources?
- Approaches to these design issues:
- Approach #1: Homogeneous Nodes
- → Every node in the cluster can perform the same set of tasks (albeit on potentially different partitions of data).
- → Makes provisioning and failover "easier".
- Approach #2: Heterogeneous Nodes
- → Nodes are assigned specific tasks.
- → Can allow a single physical node to host multiple "virtual" node types for dedicated tasks.
- Partitioning Schemes
- The DBMS can partition a database physically (shared nothing) or logically (shared disk). Partitioning schemes: → Hashing (in practice, consistent hashing) → Ranges → Predicates
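Hash partitioning in distributed systems usually means consistent hashing: nodes and keys are both hashed onto a ring, and each node owns the arc before it, so adding or removing a node only relocates the keys on one arc. A minimal sketch (class name and ring layout are illustrative):

```python
import bisect
import hashlib

def _h(value):
    """Deterministic hash onto the ring (an integer position)."""
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Sketch of consistent hashing: a key belongs to the first node
    clockwise from the key's position on the ring."""
    def __init__(self, nodes):
        self.ring = sorted((_h(n), n) for n in nodes)

    def node_for(self, key):
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect(positions, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

Real systems typically add many "virtual nodes" per physical node to even out arc sizes; that detail is omitted here.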
- Distributed Concurrency Control
- Approaches to distributed transaction management: → Centralized: a global "traffic cop". → Decentralized: nodes organize themselves.
17.OLTP
- OLTP vs. OLAP
- On-line Transaction Processing (OLTP):→ Short-lived read/write txns.→ Small footprint.→ Repetitive operations.
- On-line Analytical Processing (OLAP):→ Long-running, read-only queries.→ Complex joins.→ Exploratory queries.
- Atomic Commit Protocols
- When a multi-node txn finishes, the DBMS needs to ask all the nodes involved whether it is safe to commit.
- Examples:
- → Two-Phase Commit
- → Three-Phase Commit (not used)
- → Paxos
- → Raft
- → ZAB (Apache Zookeeper)
- → Viewstamped Replication
- Replication
- Design Decisions:
- → Replica Configuration: Approach #1: Primary-Replica; Approach #2: Multi-Primary
- → Propagation Scheme: → Synchronous (Strong Consistency) → Asynchronous (Eventual Consistency)
- → Propagation Timing: Approach #1: Continuous; Approach #2: On Commit
- → Update Method
- Consistency Issues (CAP / PACELC) → Consistent→ Always Available→ Network Partition Tolerant
- Google Spanner
18.OLAP
- Execution Models
- Query Planning
- Distributed Join Algorithms
- Cloud Systems
- Approach #1: Managed DBMSs
- → No significant modification to the DBMS to make it "aware" that it is running in a cloud environment.
- → Examples: most vendors
- Approach #2: Cloud-Native DBMS
- → The system is designed explicitly to run in a cloud environment.
- → Usually based on a shared-disk architecture.
- → Examples: Snowflake, Google BigQuery, Amazon Redshift, Microsoft SQL Azure