Berkeley DB Source Code Analysis (2) --- Btree Implementation (1)
II. Type Dictionary
The DB handle's DB->bt_internal structure stores per-process, per-dbhandle
btree info and function pointers.
The btree meta page structure, shared by all processes. It stores the contents
of the btree's meta page, including all btree-specific global (btree-db-wide)
info and the common AM db-wide global info.
It contains the common cursor fields defined in the macro __DBC_INTERNAL, which
are shared by all types of cursors, and mainly a page stack for btree searching.
In the __dbc type, we use a pointer to the __dbc_internal type, which is defined
as the "base class" for all cursor types. The memory we actually allocate,
however, is a BTREE_CURSOR or another specific cursor type, and we cast the
pointer to the specific cursor type before using it. We never directly use
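The "base class" arrangement described above can be sketched in plain C. This is only an illustration, not Berkeley DB's definitions: `cursor_base`, `btree_cursor`, `cursor`, and the helper functions are hypothetical names, and the fields are invented. Note that the sketch embeds the base struct as the first member instead of repeating the shared fields through a macro, because that variant is the one standard C guarantees to be safely castable.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical "base class": fields every cursor type shares. */
struct cursor_base {
    unsigned int pgno;   /* page the cursor sits on */
    unsigned int indx;   /* index within that page */
};

/* Hypothetical btree cursor: the base comes first, then btree-only state. */
struct btree_cursor {
    struct cursor_base base;
    int stack_depth;     /* e.g. depth of the search stack */
};

/* The generic cursor only knows the base type. */
struct cursor {
    struct cursor_base *internal;
};

/* Allocate the specific type, but store it through the base pointer. */
int cursor_init_btree(struct cursor *c) {
    struct btree_cursor *bc;

    assert(c != NULL);
    bc = calloc(1, sizeof(*bc));
    if (bc == NULL)
        return -1;
    c->internal = &bc->base;
    return 0;
}

/* Cast back to the specific type before touching btree-only fields. */
struct btree_cursor *cursor_as_btree(struct cursor *c) {
    return (struct btree_cursor *)c->internal;
}

void cursor_destroy(struct cursor *c) {
    free(c->internal);   /* same address as the allocated btree_cursor */
    c->internal = NULL;
}
```

Generic code reads `c->internal->pgno` without knowing the cursor type, while btree code goes through `cursor_as_btree` first.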
III. Macro Dictionary
P_INIT: init a non-meta page.
DBC_LOGGING: determines, via a cursor, whether logging is in use.
LCK_COUPLE: a parameter to __db_lget. If specified, in __db_lget we will first
release the lock then acquire the lock to the same lockobj with specified lock
DBC_DOWNREV: In replication, the master is allowed to run a lower version of the DB lib than the replicas.
So if the master uses a DB version older than the first version with
latching support, the replicas will notice this and set this flag on all their cursors, and they
will use traditional mutex locking rather than shared latches.
STD_LOCKING: Whether to use standard locking, that is, the locking subsystem is started, we are not using CDS, and the cursor is not sitting on an
DB_MPOOL_TRY: a flag for __memp_fget, telling it to try to get a latch on the target page's buffer.
If the latch is not granted, it returns DB_LOCK_NOTGRANTED immediately without waiting.
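The "try, and fail immediately if busy" behaviour described for DB_MPOOL_TRY can be sketched with a C11 atomic flag. This is a minimal stand-in, not how the mpool implements latches: `buffer_latch`, `latch_try`, `latch_release`, and `SKETCH_NOTGRANTED` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical return code, playing the role of DB_LOCK_NOTGRANTED. */
#define SKETCH_NOTGRANTED (-1)

/* Hypothetical latch on a page buffer; initialize with ATOMIC_FLAG_INIT. */
struct buffer_latch {
    atomic_flag held;
};

/* Try exactly once and never block.
 * Returns 0 if the latch was obtained, SKETCH_NOTGRANTED if it is busy. */
int latch_try(struct buffer_latch *l) {
    assert(l != NULL);
    if (atomic_flag_test_and_set(&l->held))
        return SKETCH_NOTGRANTED;   /* someone else holds it */
    return 0;
}

void latch_release(struct buffer_latch *l) {
    assert(l != NULL);
    atomic_flag_clear(&l->held);
}
```

A caller that gets `SKETCH_NOTGRANTED` can back off or take another path instead of waiting on the buffer.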
IV. Function Dictionary
Open a btree database. It simply calls __bam_read_root after some config
Gets the btree db's metadata page and uses the info in it to initialize the BTREE structure at DB->bt_internal. The meta data info was filled in by the general
DB->open call before __bam_open is called.
Checks the validity of a btree meta data page.
Initializes a meta data page's fields, i.e. the BTMETA structure's fields. Called
whenever a metadata page is created during btree db open procedures. For pages
other than meta pages, we use P_INIT to initialize them.
Routed from __db_new_file.
Create a btree db file by initializing its meta page and root page. Called
during db open process and routed from __db_new_file when db is a btree db.
The db may be in memory or not. For an inmem db, we create the page from the cache
and mark it dirty. We mark it in __memp_fget rather than after actually writing
to it, otherwise the page may get evicted before we have a chance to mark it.
For on-disk db files, we don't use the cache for now. Instead, we put the page in
private memory to initialize it, and write the pages directly into the db file using __fop_write.
When writing pages directly via __fop_write/__fop_read, we should call the
internal common page in/out functions after getting the page via __fop_read and
before writing the page via __fop_write. The __memp_fget/__memp_fput functions
call them too, as registered callbacks via __memp_pg. We have internal page in/out
callbacks for the 3 types of databases (btree, hash, queue). The internal page in/out functions mainly do
checksumming and page header byte swapping, so that database files created on
big-endian machines can be opened on little-endian machines. User data is never
swapped, though, so users need to make sure the bytes they get are correct.
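The split between swapped header fields and untouched user data can be sketched as follows. The page layout here is invented for illustration and is much simpler than a real Berkeley DB page; `sketch_page` and `sketch_page_swap` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified page: two header fields, then opaque user bytes. */
struct sketch_page {
    uint32_t pgno;
    uint16_t entries;
    uint8_t  user_data[4];
};

static uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

static uint16_t swap16(uint16_t v) {
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Page in/out hook: swap the header fields only when the file's byte order
 * differs from the host's. The same routine serves both directions. */
void sketch_page_swap(struct sketch_page *p, int need_swap) {
    assert(p != NULL);
    if (!need_swap)
        return;
    p->pgno = swap32(p->pgno);
    p->entries = swap16(p->entries);
    /* user_data is deliberately left untouched. */
}
```

Because the swap is its own inverse, calling it once on page-in and once on page-out restores the on-disk representation.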
There is AM-specific work to do in the internal page in/out functions, so we have
a __db_pgin/__db_pgout pair (placed in db/db_conv.c), which call AM-specific pgin/out functions
like __bam_pgin/__bam_pgout (placed in btree/btree_conv.c, note the file name
The reason we use __fop_write here, is that at this point, the db is not fully
opened, it's not registered in the mpool region yet.
__memp_fget/put functions do not do logging, so before putting a dirty page
back to the cache, we should log changes; __fop_write logs the action, so no
need to do it in __bam_new_file.
1. In this function we didn't lock the meta/root pages but used latches. Why? Don't we want txnal semantics?
2. Generally, how do we guarantee txnal semantics when we release metapage
non-txnal locks immediately after use? (This is good for performance, but how do we ensure
consistency?) Examples are __bam_read_root, __bam_new_subdb, and __db_new.
These locks are not txnal, so why can't they be replaced by latches?
In the DBMETA general meta info, only "last_pgno", "free",
"key_count" and "record_count" can be updated; the others are static fields. The AM-specific
parts have several more; for btree they are "root", "iv" and "chksum". So if
these fields don't require txnal locks, it's OK to release locks before txn
Routed from __db_init_subdb. It initializes the subdb's meta and root pages. It
locks the subdb's meta page during the entire function.
When this function is called, the db file is already registered in the mpool, so we
always use __memp_fget/put to read/write the page.
It calls __db_new to get a page.
Other than the above, it's quite like __bam_new_file.
__db_new prefers free pages in the db file, and
falls back to allocating a new page by extending the db file. __db_new is
seldom called because it writes the db file's metadata page, which becomes a
bottleneck and is expensive. As a result, there can be many free pages while we
are still extending the db file.
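The "free list first, extend the file otherwise" policy can be sketched with a tiny in-memory model. Everything here is hypothetical (`sketch_meta`, `sketch_file`, `sketch_page_new`, `sketch_page_free`); it only borrows the idea of a "free" list head and a "last_pgno" field from the meta info mentioned in these notes, and it ignores locking and logging entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PGNO_INVALID 0u   /* hypothetical "no page" marker */
#define SKETCH_MAX_PAGES    16u

/* Hypothetical meta info: head of the free list and the last page number. */
struct sketch_meta {
    uint32_t free;        /* first free page, or SKETCH_PGNO_INVALID */
    uint32_t last_pgno;   /* highest page number in the file */
};

struct sketch_file {
    struct sketch_meta meta;
    uint32_t next_free[SKETCH_MAX_PAGES];   /* free-list links per page */
};

/* Allocate a page: reuse a free one if any, else extend the file. */
uint32_t sketch_page_new(struct sketch_file *f) {
    uint32_t pgno;

    assert(f != NULL);
    if (f->meta.free != SKETCH_PGNO_INVALID) {
        pgno = f->meta.free;                 /* pop the free-list head */
        f->meta.free = f->next_free[pgno];
    } else {
        assert(f->meta.last_pgno + 1 < SKETCH_MAX_PAGES);
        pgno = ++f->meta.last_pgno;          /* grow the file by one page */
    }
    return pgno;
}

/* Free a page: push it onto the head of the free list. */
void sketch_page_free(struct sketch_file *f, uint32_t pgno) {
    assert(f != NULL);
    assert(pgno != SKETCH_PGNO_INVALID && pgno < SKETCH_MAX_PAGES);
    f->next_free[pgno] = f->meta.free;
    f->meta.free = pgno;
}
```

Both paths update the meta info, which is why every allocation in this model touches the same shared structure.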
Free a page and put it into the free list.
Mark a page dirty.
Marks the key/data pair with B_DELETE on the page containing it, and then
marks all cursors sitting on the key/data pair with C_DELETED via __bam_ca_delete.
It does not delete the pair yet or decrement the number of entries in the page; the k/d will
be deleted by the last cursor sitting on it if that cursor is closed at this position. I think we should delete it right away if we find via __bam_ca_delete that
no other cursors are sitting on it.
Whenever we modify a page, we first lock the page via __db_lget, then get the page from the cache via __memp_fget, then optionally mark the
page dirty via __memp_dirty, then log the action using one of the logging functions, and only then actually modify the page.
After that we call __memp_fput to return the page to the mpool, and finally we release the lock on the page.
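The ordering of these steps can be sketched with stubs that only record which step ran. This is not Berkeley DB code: `sketch_modify_page`, `trace`, and the `STEP_*` names are hypothetical, and the real calls (__db_lget, __memp_fget, __memp_dirty, the logging functions, __memp_fput) are represented only by comments.

```c
#include <assert.h>
#include <string.h>

enum sketch_step {
    STEP_LOCK, STEP_FGET, STEP_DIRTY, STEP_LOG,
    STEP_MODIFY, STEP_FPUT, STEP_UNLOCK
};

/* Records the order in which the steps were executed. */
struct trace {
    enum sketch_step steps[8];
    int n;
};

static void record(struct trace *t, enum sketch_step s) {
    assert(t->n < 8);
    t->steps[t->n++] = s;
}

/* Modify an in-memory "page" following the order described in the text. */
int sketch_modify_page(struct trace *t, char *page, size_t len,
                       const char *newval) {
    assert(t != NULL && page != NULL && newval != NULL && len > 0);
    record(t, STEP_LOCK);     /* stands in for __db_lget */
    record(t, STEP_FGET);     /* stands in for __memp_fget */
    record(t, STEP_DIRTY);    /* stands in for __memp_dirty */
    record(t, STEP_LOG);      /* log the change before applying it */
    strncpy(page, newval, len - 1);
    page[len - 1] = '\0';
    record(t, STEP_MODIFY);   /* the page content has now changed */
    record(t, STEP_FPUT);     /* stands in for __memp_fput */
    record(t, STEP_UNLOCK);   /* release the page lock last */
    return 0;
}
```

The point of the sketch is only the sequence: the log record is produced before the page bytes change, and the lock outlives the page's stay outside the cache.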
QUESTION: how are key/data pairs deleted? This function only marks a k/d
"deleted", but doesn't delete it from the db page. __bamc_close only deletes the
k/d the cursor sits on when it is closed, but other k/d pairs marked deleted by the cursor are not
deleted even when the cursor is closed. So when are all of them deleted from
When counting, take B_DELETE items into account: don't count them.
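Logical deletion and counting can be sketched together. The names below (`sketch_item`, `SKETCH_DELETED`, `sketch_mark_deleted`, `sketch_count_live`) are hypothetical; `SKETCH_DELETED` merely plays the role that B_DELETE plays in these notes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag, playing the role of B_DELETE. */
#define SKETCH_DELETED 0x1u

struct sketch_item {
    int key;
    unsigned flags;
};

/* Logical delete: set the flag only; the entry stays on the page. */
void sketch_mark_deleted(struct sketch_item *items, size_t n, size_t i) {
    assert(items != NULL && i < n);
    items[i].flags |= SKETCH_DELETED;
}

/* Count live entries, skipping the ones marked deleted. */
size_t sketch_count_live(const struct sketch_item *items, size_t n) {
    size_t live = 0;

    for (size_t i = 0; i < n; i++)
        if (!(items[i].flags & SKETCH_DELETED))
            live++;
    return live;
}
```

After marking one of three duplicates, the array still holds three entries, but the count reports two.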
12. __bamc_physdel & __bam_ditem
Physically deletes a key/data pair. Called when the last cursor sitting on the
deleted key/data pair is closed.
We call __bam_ditem twice to delete a key/data pair, and we log the op in
__bam_ditem. Following each __bam_ditem call, we call __bam_ca_di to adjust the
other cursors of this database. We don't have a function that deletes a k/d pair
from a btree leaf page in one step; I think we should have such a function.
Internal btree pages have only a single structure to store the key and pageno;
these don't exist in pairs. Actually, except on btree leaf pages (P_LBTREE), all
data items exist as single items.
__bam_ditem alters the btree page's index array according to the type of btree
page and decrements the number of entries in the page, then calls __db_ditem to remove
the item from the page and log the action, or calls __db_doff to delete an opd overflow item
from the overflow page and free the overflow page.
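Deleting a pair through two single-item deletions can be sketched on a toy index array. The layout is invented: `sketch_leaf` keeps key and data slots alternately, and `sketch_ditem` and `sketch_delete_pair` are hypothetical helpers. The sketch assumes that after the key slot is removed, the data slot slides into the same index, so the same index is deleted twice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_ITEMS 8

/* Hypothetical leaf page: the index array holds key, data, key, data, ... */
struct sketch_leaf {
    int inp[SKETCH_MAX_ITEMS];   /* item slots; plain values in this sketch */
    int entries;                 /* number of used slots in inp[] */
};

/* Remove one slot from the index array and decrement the entry count. */
void sketch_ditem(struct sketch_leaf *p, int indx) {
    assert(p != NULL && indx >= 0 && indx < p->entries);
    memmove(&p->inp[indx], &p->inp[indx + 1],
            (size_t)(p->entries - indx - 1) * sizeof(p->inp[0]));
    p->entries--;
}

/* Delete a key/data pair with two single-item deletions. */
void sketch_delete_pair(struct sketch_leaf *p, int key_indx) {
    assert(key_indx % 2 == 0);   /* keys sit at even indexes here */
    sketch_ditem(p, key_indx);   /* removes the key */
    sketch_ditem(p, key_indx);   /* removes the data that slid into place */
}
```

In the real code each deletion is also followed by a cursor adjustment step, which this sketch leaves out.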
When deleting the last key/data pair from a btree leaf page, the page itself,
and potentially the stack of pages leading from the root node to this leaf node,
need to be deleted. So we note down the last key K by calling __db_ret to get
the k/d pair, and then delete this last k/d
pair by calling __bam_ditem twice, each call followed by __bam_ca_di to adjust
cursors. Then we search for that last key K from the root. When the search completes,
we have in dbc->dbc_internal the stack/path of nodes leading to
this node, and we may have to delete several nodes in the stack. Imagine that the leaf page P2's parent page P1
also has only one item: when we delete P2, we also delete P1's last item, and thus
delete P1, and so on.
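The cascade up the search stack can be sketched as a function that decides how many pages at the bottom of the stack go away. `sketch_epg` and `sketch_pages_to_free` are hypothetical names, and keeping the root page even when it becomes empty is an assumption of this sketch, not something stated in these notes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stack entry: one page on the root-to-leaf search path. */
struct sketch_epg {
    unsigned pgno;
    int entries;   /* number of items currently on that page */
};

/*
 * stack[0] is the root and stack[depth - 1] is the leaf that just became
 * empty. Returns how many pages, counted from the leaf upward, should be
 * freed: a parent is freed only if the child pointer about to be removed
 * is its sole item. The root is never freed in this sketch.
 */
int sketch_pages_to_free(const struct sketch_epg *stack, int depth) {
    int n;

    assert(stack != NULL && depth >= 1);
    assert(stack[depth - 1].entries == 0);
    if (depth == 1)
        return 0;   /* the leaf is the root: keep it */
    n = 1;          /* the empty leaf itself */
    for (int i = depth - 2; i >= 1 && stack[i].entries == 1; i--)
        n++;        /* this parent would become empty too */
    return n;
}
```

With a path root(5 items) -> P0(1) -> P1(1) -> P2(0), the function reports three pages to free: P2, P1 and P0.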
Release pages in the search stack of the cursor, put each page back to mpool
and optionally unlock each page.
The effective part of DBC->get.
According to the flags, it dispatches calls to __bamc_prev, __bamc_next,
__bamc_search, or simply gets the page. The implementation of the _DUP flags is quite straightforward:
it simply compares adjacent keys. Similarly, for the NO_DUP flags, it simply
iterates over the k/d pairs with identical keys until it gets a different key.
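The two comparisons described above can be sketched over a sorted array of keys. This is only a model of the idea: `sketch_next_dup` and `sketch_next_nodup` are hypothetical functions over plain integers, with the array length `n` used as the "not found" result.

```c
#include <assert.h>
#include <stddef.h>

/* "Next duplicate": succeed only if the adjacent entry has the same key.
 * Returns i + 1 on success, or n if the next entry has a different key. */
size_t sketch_next_dup(const int *keys, size_t n, size_t i) {
    assert(keys != NULL && i < n);
    return (i + 1 < n && keys[i + 1] == keys[i]) ? i + 1 : n;
}

/* "Next non-duplicate": skip every entry whose key equals keys[i].
 * Returns the index of the first different key, or n if there is none. */
size_t sketch_next_nodup(const int *keys, size_t n, size_t i) {
    size_t j;

    assert(keys != NULL && i < n);
    j = i + 1;
    while (j < n && keys[j] == keys[i])
        j++;
    return j;
}
```

For the keys `{1, 1, 1, 2, 3, 3}`, starting at index 0, the duplicate step lands on index 1 and the non-duplicate step lands on index 3.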
15. __bamc_prev, __bamc_next
Get from the next/prev page, or from the current page, and alter DBC->dbc_internal's pgno
and indx. Note that in these 2 functions we may be on an opd page or an ordinary
btree leaf page.
The 2 functions plus __bamc_search only read data; they don't actually
modify the page. So by default, if we need to get another page, we read-lock
it, unless DB_RMW is set, in which case we write-lock it.
The 2 functions can skip empty pages, as well as deleted k/d pairs or key items in
btree internal pages.
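Skipping deleted entries and empty pages while moving forward can be sketched on a chain of toy leaf pages. The structures and `sketch_next` are hypothetical, pages are addressed by array index instead of page number, and locking is left out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_CAP 4
#define SKETCH_NO_PAGE  (-1)

/* Hypothetical leaf page in a singly linked chain of leaves. */
struct sketch_lpage {
    int next;                       /* index of the next leaf, or SKETCH_NO_PAGE */
    int entries;                    /* used slots; 0 means an empty page */
    int deleted[SKETCH_PAGE_CAP];   /* nonzero: entry is marked deleted */
};

/* Hypothetical cursor position: which page, and which index on it. */
struct sketch_pos {
    int page;
    int indx;
};

/* Advance pos to the next live entry. Returns 0 on success, or -1 at the
 * end of the chain, in which case pos is left unchanged. */
int sketch_next(const struct sketch_lpage *pages, struct sketch_pos *pos) {
    int page, indx;

    assert(pages != NULL && pos != NULL);
    page = pos->page;
    indx = pos->indx + 1;
    while (page != SKETCH_NO_PAGE) {
        for (; indx < pages[page].entries; indx++) {
            if (!pages[page].deleted[indx]) {
                pos->page = page;
                pos->indx = indx;
                return 0;
            }
        }
        page = pages[page].next;   /* current page exhausted or empty */
        indx = 0;
    }
    return -1;
}
```

An empty page simply fails the inner loop condition at once, so the same loop handles both empty pages and pages whose remaining entries are all marked deleted.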
QUESTION: Strangely enough, a k/d marked deleted is not physically deleted even when the
cursor moves away from it. So when is it deleted?