Berkeley DB Source Code Analysis (7): Transactions and Logging (2)
= Logging subsystem =
== Architecture ==
The log subsystem consists of one log buffer and several log files, each with a unique file number. The log buffer and the log files
contain log records. A log record has a header (loghdr) and a data part. The header holds administrative information about the record
and a checksum of the data part (the header itself is not checksummed). The data part contains full information about an event that
has just happened, so record sizes vary a lot: some log records are huge, some are tiny. Other parts of the db must explicitly
construct the required log record and log the event so that recovery is possible. Log records are appended to the last log file one
by one, and each log record has an LSN (log sequence number). When a log file is full (its size is configured with set_log_size; a
log record is never split across log files), a new file is created with the next file number. Logically, all log files together make
up one huge log.
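The append-and-roll-over rule above can be sketched as follows. This is a minimal illustration, not Berkeley DB's code: `lsn_t`, `log_tail_t` and `log_reserve` are hypothetical names, and the sketch assumes a file starts at offset 0 and that no record is larger than the file size limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types for illustration; not Berkeley DB's real structures. */
typedef struct { uint32_t file; uint32_t offset; } lsn_t;

typedef struct {
    uint32_t cur_file;    /* number of the log file being appended to */
    uint32_t cur_offset;  /* next free byte in that file */
    uint32_t max_size;    /* per-file size limit (cf. set_log_size) */
} log_tail_t;

/* Reserve space for a record of rec_len bytes (header + data) and return
 * its LSN. A record is never split: if it does not fit in the current
 * file, it goes to the start of the next file. */
lsn_t log_reserve(log_tail_t *t, uint32_t rec_len)
{
    lsn_t lsn;

    assert(rec_len <= t->max_size);   /* assumption of this sketch */
    if (t->cur_offset + rec_len > t->max_size) {
        t->cur_file++;                /* "another is created" */
        t->cur_offset = 0;
    }
    lsn.file = t->cur_file;
    lsn.offset = t->cur_offset;
    t->cur_offset += rec_len;
    return lsn;
}
```

With a 100-byte limit, two 60-byte records end up in two different files, and a following 40-byte record lands right after the second one.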
A log sequence number consists of two parts, (fileno, in-file offset). LSNs only ever increase: the file number never decreases, and
within one file the offset grows with every record. LSNs are used to identify each log record and its exact position, to express
relative time in the mvcc mpool algorithm, and to identify the objects related to the events recorded in the corresponding log
record. For example, when a page is created, the event is logged in a log record with LSN "lsn", and that "lsn" is also written into
the page's header. From the header we can then find out when (in relative time) the page was created, and look up more information
in the log record. This lsn also maintains the connection between the page and the log: when the page is later modified or deleted,
the "lsn" in its header identifies exactly which page those events happened to. During normal operation, log records are only
appended to the log files; they are read back only during recovery.
During recovery, log records are accessed through a log cursor. In bdb, the log records contain enough information to do both redo
and undo. Having undo information means dirty pages can be written back to the database files before the modifying txn commits,
rather than being pinned until then. Having redo information means modified pages don't have to be synced to disk when the txn
commits, which an undo-only log would require. This is not a huge storage burden because bdb uses logical logs: changes are recorded
by key/data pair rather than by page, and when a data item is modified the log record can hold the diff between the old and the new
item rather than the whole record or page. Whole pages are logged only when absolutely needed, for example in a btree page split,
where the split page must be logged to make recovery possible.
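To make the (fileno, offset) ordering concrete, here is a small sketch of an LSN comparison and of how a page's stored LSN can be checked against a log record's LSN. Berkeley DB exposes a comparison of this kind as `log_compare`; the `lsn_t` type and the `needs_redo` helper below are hypothetical, and the redo test is a deliberately simplified textbook rule, not the exact check bdb's recovery functions perform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN: (file number, offset in that file). */
typedef struct { uint32_t file; uint32_t offset; } lsn_t;

/* Order LSNs by file number first, then by offset within the file.
 * Returns -1, 0 or 1. */
int lsn_cmp(const lsn_t *a, const lsn_t *b)
{
    if (a->file != b->file)
        return a->file < b->file ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Simplified rule (an assumption of this sketch): if the LSN stored in
 * the page header is older than the record's LSN, the page has not seen
 * the change described by the record, so the change must be redone. */
int needs_redo(const lsn_t *page_lsn, const lsn_t *rec_lsn)
{
    return lsn_cmp(page_lsn, rec_lsn) < 0;
}
```

Note that an offset comparison only matters when both LSNs are in the same file: (1, 500) is older than (2, 10).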
The __log structure also tracks the opened db files, because some log records concern those files (file open, close, reopen,
preopen, etc.). These db files are therefore "registered" with the log subsystem, and dedicated functions in log/dbreg_util.c handle
them, either writing the special log records for these file events or manipulating the files themselves (open, close, etc.).
== Implementation of Major Functionalities ==
=== Log buffer flush ===
During normal db run time, log records are written to the log buffer (in __log). The log buffer holds the "hot" log records, and only
whole ones: when it cannot hold a complete log record, it counts as full. All or part of the log buffer is flushed to disk when:
* the buffer is full (or exceeds the trickle level);
* a txn explicitly requests that all log records up to a specific lsn be flushed (ordinarily the lsn of the txn's last log record
before committing);
* the mpool needs to sync some unused dirty pages to disk (the log records up to the latest lsn found in those pages are flushed).
We remember the last lsn flushed, so that a later request for log records earlier than this lsn needs no work. Once the log buffer is
fully synced to disk, its space is reused. The log buffer is not circular: its first byte always has the smallest lsn.
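The "remember the last flushed lsn" idea can be sketched like this. All names (`log_state_t`, `log_flush_to`) are hypothetical, the disk write is only counted rather than performed, and the sketch always flushes the whole buffer, which is a simplification of the "whole or part" behaviour described above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t file; uint32_t offset; } lsn_t;

typedef struct {
    lsn_t s_lsn;    /* records with an LSN below this are already on disk */
    lsn_t end_lsn;  /* LSN just past the last record in the buffer */
    int   writes;   /* number of simulated flushes, for illustration */
} log_state_t;

int lsn_cmp(const lsn_t *a, const lsn_t *b)
{
    if (a->file != b->file)
        return a->file < b->file ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Make sure the record at LSN `want` is on disk.
 * Returns 1 if a flush was performed, 0 if it was already durable. */
int log_flush_to(log_state_t *lg, const lsn_t *want)
{
    assert(lsn_cmp(want, &lg->end_lsn) < 0);  /* record must exist */
    if (lsn_cmp(want, &lg->s_lsn) < 0)
        return 0;                 /* flushed earlier: nothing to do */
    lg->writes++;                 /* stand-in for write + fsync */
    lg->s_lsn = lg->end_lsn;      /* remember how far the disk copy goes */
    return 1;
}
```

A request for a later record flushes everything up to the buffer's end, so a subsequent request for an earlier record returns without touching the disk.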
=== Txn commit ===
The only thing commit needs to do is flush all log records up to the txn's last lsn to disk, and then write and sync the
"txn-commit" log record. We first check whether those records have already been flushed by someone else; if so, we are done.
Otherwise, if no one else is flushing, we flush ourselves and record the new s_lsn (last synced lsn). If another flush is in
progress, we have to wait for it to finish: the current txn is put on a commit-waiting list and waits on a mutex associated with the
commit. Once it acquires this mutex, it gets its chance to commit.
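The three-way decision described above can be written down as a small pure function. This is only a sketch of the decision logic under assumed names (`log_sync_t`, `commit_action`, and the `COMMIT_*` values are hypothetical); the real code also has to manage the waiting list and its mutexes, which are left out here.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t file; uint32_t offset; } lsn_t;

typedef enum {
    COMMIT_DONE,   /* our records are already on disk */
    COMMIT_FLUSH,  /* nobody is flushing: we do it ourselves */
    COMMIT_WAIT    /* a flush is in progress: join the waiting list */
} commit_action_t;

typedef struct {
    lsn_t s_lsn;              /* records below this LSN are on disk */
    int   flush_in_progress;  /* nonzero while another thread flushes */
} log_sync_t;

int lsn_cmp(const lsn_t *a, const lsn_t *b)
{
    if (a->file != b->file)
        return a->file < b->file ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Decide what a committing txn whose last record is at `last_lsn` does. */
commit_action_t commit_action(const log_sync_t *lg, const lsn_t *last_lsn)
{
    if (lsn_cmp(last_lsn, &lg->s_lsn) < 0)
        return COMMIT_DONE;
    if (!lg->flush_in_progress)
        return COMMIT_FLUSH;
    return COMMIT_WAIT;
}
```

Letting waiters sleep while one thread flushes means a single disk sync can cover the log records of several committing txns.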
=== Txn recovery ===
Log records form a doubly linked list within the log file, because each log header contains the previous record's offset and the
current record's length. We can therefore use a cursor to traverse the whole list of log records, even though the elements are not
the same size. A log cursor contains a buffer that holds a chunk of log data, which may cover several complete or partial log
records; only the complete ones are used. A user gets log records through log cursor operations: DB_NEXT, DB_PREV, DB_FIRST,
DB_LAST. The __log_get_int function first moves an lsn to the wanted log record, then checks whether the current log cursor buffer
contains it; if not, it checks whether the log buffer contains it; if not, it reads the record from the log file.
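The "doubly linked list of variable-sized records" can be illustrated with a byte array standing in for one log file. The header layout `rec_hdr_t` and the helper names are hypothetical; the only idea taken from the text is that each header stores the previous record's offset and the current record's length. The sketch does no bounds checking and treats offset 0 as the first record.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: enough to walk in both directions. */
typedef struct {
    uint32_t prev;  /* offset of the previous record (0 for the first) */
    uint32_t len;   /* total record length: header + data */
} rec_hdr_t;

/* Append a record to `log`; *tail is the next free offset and *last the
 * offset of the most recent record. Returns the new record's offset. */
uint32_t log_append(uint8_t *log, uint32_t *tail, uint32_t *last,
                    const void *data, uint32_t dlen)
{
    rec_hdr_t h;
    uint32_t off = *tail;

    h.prev = *last;
    h.len = (uint32_t)sizeof(h) + dlen;
    memcpy(log + off, &h, sizeof(h));
    memcpy(log + off + sizeof(h), data, dlen);
    *last = off;
    *tail = off + h.len;
    return off;
}

/* Like DB_NEXT: step forward using the current record's length. */
uint32_t log_next(const uint8_t *log, uint32_t off)
{
    rec_hdr_t h;
    memcpy(&h, log + off, sizeof(h));
    return off + h.len;
}

/* Like DB_PREV: step backward using the stored previous offset. */
uint32_t log_prev(const uint8_t *log, uint32_t off)
{
    rec_hdr_t h;
    memcpy(&h, log + off, sizeof(h));
    return h.prev;
}
```

Headers are read with `memcpy` because a record can start at any byte offset, so the header is not guaranteed to be aligned.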
=== Log archive ===
Returns log file names or db file names according to the flag passed in:
* flag DB_ARCH_LOG: get the last (latest) log file number N from __log, and return the names of all log files from the first
through number N.
* flag 0: find the file N containing the newest checkpoint log record (the one with the largest lsn), and return the names of the
log files from the first through number N-1.
* flag DB_ARCH_DATA: find the database files associated with the log by scanning all log records for DB___dbreg_register records.
DB___dbreg_register is a special kind of log record that associates a db file with the log, so that we know the events on this db
file are logged in this log; __dbreg_register_args holds the contents of this kind of log record. Each DB___dbreg_register record
found identifies one database file associated with this log.
=== Txn checkpoint ===
First, check whether a checkpoint should be taken: DB_FORCE is set, or the specified period of time has passed, or the specified
amount of log data has been written since the last checkpoint. Checkpoint log records form a singly linked list in the log file, so
they can be found by walking backwards through the log files. If a checkpoint is due, first find the "begin" LSN of the oldest active
transaction in the txn region; we will sync all log records older than this. On a rep master, send checkpoint messages to the
replicas so that they sync all changes starting from this "begin" LSN. Then sync the mpool (which first flushes the log records), and
write a "chkpnt" log record into the log. This log record is not flushed at this point, so if the app crashes right after writing
it, we will have to redo the events since the last chkpnt.
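The "should we checkpoint" test at the start of this section can be sketched as a small predicate. The parameters mirror the `kbyte` and `min` arguments of `DB_ENV->txn_checkpoint`, but the function itself, its name, and the rule that zero thresholds mean "always checkpoint" are assumptions of this sketch rather than a transcription of bdb's code.

```c
#include <assert.h>
#include <stdint.h>

/* Decide whether a checkpoint is due.
 *   force              nonzero when the caller passed DB_FORCE
 *   kbyte, min         thresholds; 0 means "threshold not set"
 *   bytes_since_ckp    log bytes written since the last checkpoint
 *   minutes_since_ckp  minutes elapsed since the last checkpoint
 * Returns 1 if a checkpoint should be taken, 0 otherwise. */
int should_checkpoint(int force, uint32_t kbyte, uint32_t min,
                      uint64_t bytes_since_ckp, uint32_t minutes_since_ckp)
{
    if (force)
        return 1;
    if (kbyte != 0 && bytes_since_ckp / 1024 >= kbyte)
        return 1;
    if (min != 0 && minutes_since_ckp >= min)
        return 1;
    /* Assumption: with no threshold given at all, always checkpoint. */
    return kbyte == 0 && min == 0;
}
```

For example, with a 100 KB threshold and no time threshold, 50 KB of new log data is not enough, while 100 KB is.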
=== In-memory logging ===
With in-memory logging, the log buffer is circular (a ring buffer) and log records are written to fake files, which are really
memory chunks within the ring buffer. This cannot guarantee durability on its own. It is mostly used in replication scenarios, where
durability comes from replicating the data to other sites and having some of them guarantee it. The log buffer has to be large
enough to hold all log records of an entire txn.
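A copy into a ring buffer has to handle the case where the data wraps past the end of the buffer. The following is a minimal sketch of that wrap-around copy; `ring_t` and `ring_copyin` are hypothetical names, and the sketch does not track which old records get overwritten, which a real in-memory log must do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ring buffer used as an in-memory log. */
typedef struct {
    uint8_t *buf;   /* backing memory */
    size_t   size;  /* capacity in bytes */
    size_t   head;  /* next write position */
} ring_t;

/* Copy len bytes into the ring, wrapping to the start if needed. */
void ring_copyin(ring_t *r, const void *data, size_t len)
{
    const uint8_t *p = data;
    size_t first = r->size - r->head;  /* room before the wrap point */

    assert(len <= r->size);            /* data must fit in the ring */
    if (first > len)
        first = len;
    memcpy(r->buf + r->head, p, first);      /* part up to the end */
    memcpy(r->buf, p + first, len - first);  /* wrapped remainder */
    r->head = (r->head + len) % r->size;
}
```

Writing 4 bytes at position 6 of an 8-byte ring puts 2 bytes at the end and 2 at the beginning, leaving the write position at 2.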
== Specification of important functions ==
* __log_put: put a log record
* Precondition: a rep client does not put logs, and a rep master with no "send" function cannot put logs
* In __log and DB_LOGC, a "first byte lsn" is used: the number of the file containing that first byte, plus the byte's offset.
It is therefore not necessarily equal to the lsn of the containing log record; that is, any byte in a log file can have an lsn value.
* calculate the checksum of the log record data, and store it in the log record hdr
* write the log record to the log buffer or file (appending to the tail of the log file) in __log_put_next
* __log_put_next: write a log record to the log buffer or log file
* This function has to prepare for the rep master case: it saves the old lsn of __log to send to rep clients, so that they can tell
they have not missed log records;
* If the current log file was written by an older version, we cannot write into it; we create a new log file and write there
instead;
* If the space left in the current log file is not large enough to hold this log record, we get a new log file by calling
__log_new_file. Then we call __log_putr to actually write the log record into the log buffer, possibly flushing the log buffer.
* calculate the log record hdr checksum
* if doing in-memory logging, check that there is enough space in the log buffer
* swap the byte order of the loghdr if needed, then copy the hdr into the log, and swap the hdr's byte order back (why is the log
data not swapped? Because most log data will never be read; when it is used, by rep clients or in recovery, the consumer of the log
records knows how to swap them, with the correct hdr info already at hand)
* fill the log record data into the log buffer by calling __log_fill
* __log_fill: write information into the log. If this log record fills the log buffer, we flush the buffer; otherwise we only copy
the log record into the log buffer
* for in-memory logging, copy the chunk of data into the log buffer by calling __log_inmem_copyin
* copy the data chunk into the log buffer; if the buffer becomes full, flush it to the log file.
* If we're on a buffer boundary and the data is big enough, copy as many records as we can directly from the data, and handle
the rest of the log data in step 2.
* whenever we fill a buffer and have flushed it, we need to update __log->f_lsn
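The buffer-filling steps of __log_fill can be sketched as a loop. This is an illustration under assumptions, not the actual function: `logbuf_t`, `file_write` and `log_fill` are hypothetical names, the "log file" is a plain array, the buffer is tiny so the behaviour is easy to trace, and there is no bounds checking on the stand-in file. The field `f_off` plays a role similar to the offset part of f_lsn: it is the file position of the buffer's first byte and advances after every flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 8  /* tiny on purpose, for illustration */

typedef struct {
    uint8_t buf[BUF_SIZE];  /* the log buffer */
    size_t  b_off;          /* bytes currently held in buf */
    uint8_t file[64];       /* stand-in for the log file */
    size_t  f_off;          /* bytes written to the file so far */
    int     direct_writes;  /* writes that bypassed the buffer */
} logbuf_t;

static void file_write(logbuf_t *l, const uint8_t *p, size_t n)
{
    memcpy(l->file + l->f_off, p, n);
    l->f_off += n;
}

void log_fill(logbuf_t *l, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len > 0) {
        /* On a buffer boundary with enough data: write whole
         * buffer-sized chunks straight from the caller's data. */
        if (l->b_off == 0 && len >= BUF_SIZE) {
            size_t n = (len / BUF_SIZE) * BUF_SIZE;
            file_write(l, p, n);
            l->direct_writes++;
            p += n;
            len -= n;
            continue;
        }
        /* Otherwise copy as much as fits into the buffer. */
        size_t room = BUF_SIZE - l->b_off;
        size_t n = len < room ? len : room;
        memcpy(l->buf + l->b_off, p, n);
        l->b_off += n;
        p += n;
        len -= n;
        /* Buffer full: flush it and start again at its first byte. */
        if (l->b_off == BUF_SIZE) {
            file_write(l, l->buf, BUF_SIZE);
            l->b_off = 0;
        }
    }
}
```

Writing directly from the caller's data avoids one copy for large records, at the cost of only applying when the buffer happens to be empty.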