Berkeley DB Source Code Analysis (1) --- Code Characteristics and Cursor Implementation

Original post, 2012-03-25 15:52:06

I. General Notes
1. Internally, a database is always accessed through a cursor. The cursor ties together locking, transactions, logging, the access method (AM), etc.
To get a page, first create a cursor if you don't already have one, then call __db_lget to lock the page, then call __memp_fget to get the page from the cache;
you now have the page's pointer. After use, call __TLPUT or __LPUT to release the lock, then call __memp_fput to release the page.
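
The ordering above (lock, fetch, use, unlock, put) can be sketched with a small self-contained toy model. Everything prefixed toy_ is hypothetical and exists only for illustration; the real __db_lget, __memp_fget, __LPUT and __memp_fput take different arguments and do much more.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPAGES   4
#define TOY_PAGESIZE 16

/* Hypothetical stand-in for the cache plus the lock table; not a Berkeley DB type. */
struct toy_pool {
    unsigned char pages[TOY_NPAGES][TOY_PAGESIZE];
    int locked[TOY_NPAGES];  /* 1 while a lock is held on the page */
    int pinned[TOY_NPAGES];  /* pin count, like a buffer reference count */
};

/* Plays the role of __db_lget: fails if the page is invalid or already locked. */
static int toy_lock_get(struct toy_pool *p, int pgno)
{
    if (pgno < 0 || pgno >= TOY_NPAGES || p->locked[pgno])
        return -1;
    p->locked[pgno] = 1;
    return 0;
}

/* Plays the role of __memp_fget: only hands out a page whose lock is held. */
static unsigned char *toy_page_get(struct toy_pool *p, int pgno)
{
    if (!p->locked[pgno])
        return NULL;
    p->pinned[pgno]++;
    return p->pages[pgno];
}

/* Plays the role of __LPUT / __TLPUT. */
static void toy_lock_put(struct toy_pool *p, int pgno)
{
    assert(p->locked[pgno]);
    p->locked[pgno] = 0;
}

/* Plays the role of __memp_fput. */
static void toy_page_put(struct toy_pool *p, int pgno)
{
    assert(p->pinned[pgno] > 0);
    p->pinned[pgno]--;
}

/* Read one byte from a page, following the order described in the notes. */
int toy_read_byte(struct toy_pool *p, int pgno, size_t off, unsigned char *out)
{
    unsigned char *page;

    if (off >= TOY_PAGESIZE)
        return -1;
    if (toy_lock_get(p, pgno) != 0)       /* 1. lock the page */
        return -1;
    page = toy_page_get(p, pgno);         /* 2. get it from the cache */
    if (page == NULL) {
        toy_lock_put(p, pgno);
        return -1;
    }
    *out = page[off];                     /* 3. use the page pointer */
    toy_lock_put(p, pgno);                /* 4. release the lock */
    toy_page_put(p, pgno);                /* 5. release the page */
    return 0;
}
```

After toy_read_byte returns, the page is neither locked nor pinned, which is the invariant the real call sequence maintains as well.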

2. How to lock/unlock a page and get/put a page from mpool?
See __bam_read_lock.

3. Cursor Adjustment
Cursor adjustments are logged if they are for subtransactions.  This is
because it's possible for a subtransaction to adjust cursors which will
still be active after the subtransaction aborts, and so which must be
restored to their previous locations.  Cursors that can be both affected
by our cursor adjustments and active after our transaction aborts can
only be found in our parent transaction -- cursors in other transactions,
including other child transactions of our parent, must have conflicting
locker IDs, and so cannot be affected by adjustments in this transaction.

When a key/data pair is deleted, there can be other cursors pointing to it, so
we don't physically delete it immediately. Instead we mark the key/data pair deleted and mark
all cursors pointing at this k/d with C_DELETED (this is called a logical
delete). When the last cursor pointing at the 'deleted' k/d is closed, the key/data pair is physically deleted.
This is the only situation in which physical deletes happen; for example, when the last cursor merely moves away from
a k/d marked BI_DELETED, that k/d is not deleted.

When closing a cursor, if we find it has C_DELETED set, we walk all cursors of the
same database and mark those sitting on the same k/d with C_DELETED. Opd cursors, if any,
are checked and marked in the same way. This is done by __bam_ca_delete. This function
also tells us how many other cursors are sitting on this key/data pair. If there
are still other cursors sitting on the k/d, __bamc_close is done; otherwise we
physically delete the k/d, or even the opd btree/recno tree and its on-page
k/d items.
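
The logical-then-physical delete rule can be illustrated with a toy model. This is only a sketch of the rule as described in these notes: the toy_ names and structures are hypothetical, and toy_ca_delete merely imitates the role the notes give to __bam_ca_delete (mark the cursors on an item and report how many there are).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_C_DELETED 0x1u  /* models the C_DELETED cursor flag */

struct toy_item {
    int present;  /* 1 while the item physically exists on the page */
    int deleted;  /* 1 once the item has been logically deleted */
};

struct toy_cursor {
    int open;        /* 1 while the cursor is open */
    int indx;        /* index of the item the cursor sits on */
    unsigned flags;
};

/* Mark every open cursor sitting on indx and return how many there are. */
static int toy_ca_delete(struct toy_cursor *cs, size_t n, int indx)
{
    size_t i;
    int count = 0;

    for (i = 0; i < n; i++)
        if (cs[i].open && cs[i].indx == indx) {
            cs[i].flags |= TOY_C_DELETED;
            count++;
        }
    return count;
}

/* Logical delete through cursor c: the item stays on the page. */
void toy_cursor_del(struct toy_item *items, struct toy_cursor *cs, size_t n,
    struct toy_cursor *c)
{
    assert(c->open);
    items[c->indx].deleted = 1;
    (void)toy_ca_delete(cs, n, c->indx);
}

/* Moving away never triggers a physical delete. */
void toy_cursor_move(const struct toy_item *items, struct toy_cursor *c, int indx)
{
    c->indx = indx;
    c->flags = items[indx].deleted ? TOY_C_DELETED : 0;
}

/*
 * Close cursor c. If it sat on a logically deleted item and no other open
 * cursor sits there, remove the item physically. Returns 1 if it did.
 */
int toy_cursor_close(struct toy_item *items, struct toy_cursor *cs, size_t n,
    struct toy_cursor *c)
{
    int removed = 0;

    c->open = 0;  /* c no longer counts as a cursor on the item */
    if ((c->flags & TOY_C_DELETED) && toy_ca_delete(cs, n, c->indx) == 0) {
        items[c->indx].present = 0;
        removed = 1;
    }
    c->flags = 0;
    return removed;
}
```

With two cursors on the same item, closing the first leaves the item on the page and closing the second removes it. A cursor that moves away before closing leaves the logically deleted item in place, as the notes point out.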

When deleting a k/d on opd pages, we don't lock the opd tree, we only lock the
page containing the on-page key/opd-root-pgno key/data pair.

It's impossible for cursors from another process to sit on the
page where the key/data pair is logically deleted, because of transactional or
handle locking. So it's sufficient to mark or adjust only cursors in the current
process when deleting/inserting a key/data pair.

Suppose a cursor C has inserted or deleted an item, so that we want to adjust the cursors sitting
on the modified page. Each type of cursor adjustment calls __db_walk_cursors to
iterate over all cursors of all DB handles opened on the same database as C's, within C's environment, in the current process,
and passes a callback F to __db_walk_cursors for it to call against each
cursor. There is one F for each type of cursor adjustment. If the adjustment
modifies a page, it is logged, and there are undo operations to
be called when a child txn aborts. This is the only time such adjustment records are
meaningful: we want to restore cursors if a child txn aborts. In recovery
code no cursor adjustments are recovered, because we don't need to restore
cursor state; we only want to restore the data consistently.
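
The "walk all cursors and apply a callback" pattern can be sketched as follows. This is a simplified toy, not the real __db_walk_cursors: the structures, the fileid field used to recognize handles on the same database, and the two callbacks are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cursor {
    int pgno;  /* page the cursor sits on */
    int indx;  /* item index within that page */
};

struct toy_db {
    int fileid;                  /* identifies the underlying database */
    struct toy_cursor *cursors;  /* cursors opened through this handle */
    size_t ncursors;
};

struct toy_env {
    struct toy_db *dbs;  /* all DB handles opened in this process */
    size_t ndbs;
};

/* One callback per kind of adjustment; returns 1 if the cursor was changed. */
typedef int (*toy_adjust_fn)(struct toy_cursor *c, int pgno, int indx);

/*
 * Visit every cursor of every handle opened on the database fileid and
 * apply fn to it. Returns the number of cursors that were adjusted.
 */
int toy_walk_cursors(struct toy_env *env, int fileid, toy_adjust_fn fn,
    int pgno, int indx)
{
    size_t i, j;
    int adjusted = 0;

    assert(fn != NULL);
    for (i = 0; i < env->ndbs; i++) {
        if (env->dbs[i].fileid != fileid)
            continue;  /* handle on a different database */
        for (j = 0; j < env->dbs[i].ncursors; j++)
            adjusted += fn(&env->dbs[i].cursors[j], pgno, indx);
    }
    return adjusted;
}

/* An item was inserted at (pgno, indx): shift cursors at or after it. */
int toy_adjust_insert(struct toy_cursor *c, int pgno, int indx)
{
    if (c->pgno == pgno && c->indx >= indx) {
        c->indx++;
        return 1;
    }
    return 0;
}

/* The item at (pgno, indx) was physically removed: shift later cursors back. */
int toy_adjust_delete(struct toy_cursor *c, int pgno, int indx)
{
    if (c->pgno == pgno && c->indx > indx) {
        c->indx--;
        return 1;
    }
    return 0;
}
```

Keeping the walk separate from the callbacks means a new kind of adjustment only needs a new callback, which matches the "one F per adjustment type" structure described above.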

TODO: my idea: transfer ownership of the data item marked deleted when a
cursor goes away or closes, until a cursor can't find another cursor on the
item--then it will physically delete the data item.

See comments in __bamc_close for more details.

4. Code file naming conventions in btree/hash/queue
AM_auto.c: contains generated log read/log functions to log changes made by
this AM.
AM_autop.c: contains generated log print functions.
AM.src: contains log record definitions, used by dist/gen_rec.awk to generate
the log read/log/print functions.

AM_compact.c: contains functions to compact db file of this AM type.
AM_conv.c: contains functions to do AM-specific pgin/pgout processing, all of which
do page byte-swapping for this AM.
AM_curadj.c: contains functions to do cursor adjustments. See #3 above.

AM_method.c: contains simple functions and all functions to init db handle
function pointers.
AM_open.c: contains functions to open databases of AM type.
AM_rec.c: contains functions to do recovery for each type of logs.
AM_stat.c: contains functions to accumulate/print stats of this AM.
AM_upgrade.c: contains functions to upgrade db files of this AM to newer versions.
AM_verify.c: contains functions to verify this AM db file.

5. Function naming conventions
1. __AMc_ACTION for cursor manipulations:
__bamc_init, __bamc_close,
__bamc_destroy, __bamc_refresh (a ***_refresh function resets the structure
as if it were newly created, so that it can be reused), etc. There are also public
methods such as __bamc_del, __bamc_get, __bamc_put, __bamc_cmp, and __bamc_count.
These methods are assigned to the DBC handle's function pointers, so that calls through those
function pointers are routed to the AM-specific operations.
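
The routing through handle function pointers can be sketched with a toy cursor handle. The toy_dbc structure and its two fields are hypothetical; the real DBC handle has many more methods, and the names toy_bamc_* and toy_hamc_* only mimic the __AMc_ACTION naming pattern.

```c
#include <assert.h>
#include <string.h>

struct toy_dbc;

/* Hypothetical cursor handle: the AM fills in the function pointers. */
struct toy_dbc {
    const char *(*am_get)(struct toy_dbc *);
    int (*am_close)(struct toy_dbc *);
    int closed;
};

/* Btree-specific implementation (would be __bamc_get in the naming scheme). */
static const char *toy_bamc_get(struct toy_dbc *dbc)
{
    (void)dbc;
    return "btree get";
}

/* Hash-specific implementation (would be __hamc_get in the naming scheme). */
static const char *toy_hamc_get(struct toy_dbc *dbc)
{
    (void)dbc;
    return "hash get";
}

/* A close routine shared by both toy access methods. */
static int toy_amc_close(struct toy_dbc *dbc)
{
    dbc->closed = 1;
    return 0;
}

/* The AM's init function wires its own operations into the handle. */
void toy_bamc_init(struct toy_dbc *dbc)
{
    dbc->am_get = toy_bamc_get;
    dbc->am_close = toy_amc_close;
    dbc->closed = 0;
}

void toy_hamc_init(struct toy_dbc *dbc)
{
    dbc->am_get = toy_hamc_get;
    dbc->am_close = toy_amc_close;
    dbc->closed = 0;
}

/* Generic entry point: callers never need to know which AM is behind it. */
const char *toy_dbc_get(struct toy_dbc *dbc)
{
    assert(dbc->am_get != NULL);
    return dbc->am_get(dbc);
}
```

Calling toy_dbc_get on a handle initialized with toy_bamc_init reaches the btree routine, and the same call on a handle initialized with toy_hamc_init reaches the hash routine.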

6. Off Page Duplicates (opd)
Dup data items share the same key item, like (1, 2), (1, 3), and (1, 4), which
have the same key 1 and different data items 2, 3, 4. In the btree and hash AMs,
2, 3, 4 are normally stored in leaf pages (btree) just like other key/data
pairs, but if the dup data items consume more space than the overflow size, they
are put into an off-page-duplicate tree, which is a btree, and the on-page data
item stores the root page number of the opd tree.

An opd tree can also be a recno tree, when DB_DUPSORT is not set and no dup compare
function is specified.

We don't acquire any lock to access an opd tree/page, because we always lock
the page holding the on-page key/opd-pgno key/data pair before accessing the opd page.

In an opd btree there are no "data" items, only "key" items, even on the leaf
pages of the opd tree. The reason is that all of these "key"
items are dup data items of the same "key" in that db.

An overflow data item is stored on a chain of pages, leaving a B_OVERFLOW
data item on the leaf page, OR ON THE OPD TREE'S LEAF PAGE. That is to say,
a data item in a set of dup key/data pairs is always allowed to be an overflow item.

Duplicate key/data pairs storage:
1. DB_DUP is set, DB_DUPSORT not set
When the dup data items don't consume over a quarter of the page space, they
are put on btree leaf pages. Otherwise, they are put into an
off-page-duplicate recno tree. I think it would be better to put them on a chain of opd pages,
unsorted, because we never randomly access a dup data item (recall the flags DB_NEXT,
DB_NEXT_DUP, DB_NEXT_NODUP, and the PREV versions for DBC->get). (TODO: try a

2. both DB_DUP and DB_DUPSORT set
When the dup data items don't consume over a quarter of the page space, they
are put on btree leaf pages and sorted. Otherwise, they are put into an opd btree.
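
The two storage cases can be summarized as a small decision function. This is only a sketch of the rule as stated in these notes, not code from Berkeley DB: the enum, the function name and its parameters are hypothetical, and the choice of an opd btree for the sorted off-page case is inferred from the earlier statement that an opd tree is a btree unless DB_DUPSORT is not set.

```c
#include <assert.h>
#include <stddef.h>

/* Where a set of duplicate data items ends up, per the notes above. */
enum toy_dup_place {
    TOY_ON_PAGE_UNSORTED,  /* DB_DUP only, small set: on the leaf page */
    TOY_ON_PAGE_SORTED,    /* DB_DUP + DB_DUPSORT, small set: sorted on the leaf page */
    TOY_OPD_RECNO,         /* DB_DUP only, large set: off-page recno tree */
    TOY_OPD_BTREE          /* DB_DUP + DB_DUPSORT, large set: off-page btree (inferred) */
};

/*
 * page_size: size of a database page in bytes.
 * dup_bytes: total space consumed by the duplicate data items of one key.
 * dupsort:   nonzero if DB_DUPSORT is set in addition to DB_DUP.
 */
enum toy_dup_place toy_dup_placement(size_t page_size, size_t dup_bytes, int dupsort)
{
    int off_page;

    assert(page_size > 0);
    /* "over a quarter of the page space" moves the set off-page */
    off_page = dup_bytes > page_size / 4;
    if (dupsort)
        return off_page ? TOY_OPD_BTREE : TOY_ON_PAGE_SORTED;
    return off_page ? TOY_OPD_RECNO : TOY_ON_PAGE_UNSORTED;
}
```

For a 4096-byte page, this toy keeps a duplicate set of up to 1024 bytes on the leaf page and moves anything larger off-page.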


