Berkeley DB Source Code Analysis (1) --- Code Characteristics and Cursor Implementation

Original post, 2012-03-25 15:52:06

I. General Notes
1. Internally, the db is always accessed through a cursor. The cursor connects lock/txn/logging/AM, etc.
To get a page, first create a cursor if you don't have one, then call __db_lget to lock the page, then call __memp_fget to get the page from the cache.
Now you have the page's pointer. After use, call __TLPUT or __LPUT to release the lock, then call __memp_fput to release the page.
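
The ordering above can be sketched with a toy model. This is only an illustration: the toy_* types and functions are hypothetical stand-ins, not Berkeley DB APIs, and the real __db_lget, __memp_fget, __TLPUT/__LPUT and __memp_fput take different arguments.

```c
#include <assert.h>

/* Toy page: hypothetical stand-in for a locked, cached database page. */
typedef struct {
    int locked;     /* 1 while the page lock is held */
    int pinned;     /* pin count in the toy cache */
    char data[16];  /* page contents */
} toy_page;

/* Plays the role of __db_lget: take the page lock. */
static int toy_lock_page(toy_page *p)
{
    if (p->locked)
        return -1;
    p->locked = 1;
    return 0;
}

/* Plays the role of __memp_fget: pin the page and hand back its pointer. */
static toy_page *toy_fetch_page(toy_page *p)
{
    assert(p->locked);      /* the lock must be taken first */
    p->pinned++;
    return p;
}

/* Plays the role of __TLPUT / __LPUT: release the page lock. */
static void toy_unlock_page(toy_page *p)
{
    assert(p->locked);
    p->locked = 0;
}

/* Plays the role of __memp_fput: unpin the page. */
static void toy_release_page(toy_page *p)
{
    assert(p->pinned > 0);
    p->pinned--;
}

/* Read the first byte of a page, following the order given in the notes:
 * lock, fetch, use, release lock, release page. */
int toy_read_first_byte(toy_page *p, char *out)
{
    toy_page *h;

    if (toy_lock_page(p) != 0)
        return -1;
    h = toy_fetch_page(p);
    *out = h->data[0];
    toy_unlock_page(p);
    toy_release_page(h);
    return 0;
}
```

After toy_read_first_byte returns, the page is neither locked nor pinned, which is the invariant the real sequence is meant to preserve.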

2. How to lock/unlock a page and get/put a page from mpool?
See __bam_read_lock.

3. Cursor Adjustment
Cursor adjustments are logged if they are for subtransactions.  This is
because it's possible for a subtransaction to adjust cursors which will
still be active after the subtransaction aborts, and so which must be
restored to their previous locations.  Cursors that can be both affected
by our cursor adjustments and active after our transaction aborts can
only be found in our parent transaction -- cursors in other transactions,
including other child transactions of our parent, must have conflicting
locker IDs, and so cannot be affected by adjustments in this transaction.

When a key/data pair is deleted, there can be other cursors pointing to it, so
we don't physically delete it immediately. Instead we mark the key/data pair deleted, and mark
all cursors pointing at this k/d with C_DELETED (this is called a logical
delete). When the last cursor pointing at the 'deleted' k/d is closed, the key/data pair is physically deleted.
This is the only situation in which physical deletes happen. For example, when the last cursor merely moves away from
a k/d marked BI_DELETED, that k/d is not deleted.

When closing a cursor, if we find it has C_DELETED set, we walk all cursors of the
same database and mark those sitting on the same k/d with C_DELETED. Opd cursors, if any, are checked
and marked in the same way. This is done by __bam_ca_delete. This function
also tells us how many other cursors are sitting on this key/data pair. If we
find that there are still other cursors sitting on the k/d, __bamc_close is
done. Otherwise we physically delete the k/d, or even the opd btree/recno tree
and its on-page k/d items.
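
A minimal sketch of the logical/physical delete rule, assuming a flat array of cursors in one process. All toy_* names and TOY_C_DELETED are hypothetical; the real work is done by __bam_ca_delete and __bamc_close on Berkeley DB's own structures.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_C_DELETED 0x1   /* models the C_DELETED cursor flag */

/* Toy key/data pair. */
typedef struct {
    int present;   /* 1 while the pair still physically exists on the page */
    int deleted;   /* 1 once the pair has been logically deleted */
} toy_item;

/* Toy cursor: points at one item, or NULL once closed. */
typedef struct {
    toy_item *item;
    unsigned flags;
} toy_cursor;

/* Logical delete: mark the item and every cursor sitting on it.
 * Returns the number of cursors marked. */
int toy_logical_delete(toy_cursor *all, size_t n, toy_item *it)
{
    int marked = 0;
    size_t i;

    it->deleted = 1;
    for (i = 0; i < n; i++)
        if (all[i].item == it) {
            all[i].flags |= TOY_C_DELETED;
            marked++;
        }
    return marked;
}

/* Close cursor c. If it sat on a logically deleted item and no other
 * cursor still sits there, delete the item physically.
 * Returns 1 if a physical delete happened, 0 otherwise. */
int toy_cursor_close(toy_cursor *all, size_t n, toy_cursor *c)
{
    toy_item *it;
    int others = 0, removed = 0;
    size_t i;

    assert(c != NULL);
    it = c->item;
    if (it != NULL && (c->flags & TOY_C_DELETED)) {
        for (i = 0; i < n; i++)
            if (&all[i] != c && all[i].item == it)
                others++;
        if (others == 0) {
            it->present = 0;    /* last cursor gone: physical delete */
            removed = 1;
        }
    }
    c->item = NULL;
    c->flags = 0;
    return removed;
}
```

With two cursors on the same pair, closing the first leaves the pair on the page and closing the second removes it.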

When deleting a k/d on opd pages, we don't lock the opd tree, we only lock the
page containing the on-page key/opd-root-pgno key/data pair.

Cursors from another process cannot be sitting on the
page where the key/data pair is logically deleted, because of the transactional or
handle locking. So it's sufficient to mark or adjust only cursors in the current
process when deleting/inserting a key/data pair.

Given a cursor C that did a db insert/delete, we want to adjust the cursors sitting
on the modified page. Each type of cursor adjust operation calls __db_walk_cursors to
iterate over all cursors of all DB handles opened on the same db as C.db in C.env in the current process,
and passes a callback F to __db_walk_cursors for it to call against each
cursor. There is one F for each type of cursor adjust op. If the adjust
op modifies a page, it is logged, and there are undo ops to
be called when a child txn aborts. This is the only time such adjust ops are
meaningful --- we want to restore cursors if a child txn aborts. In recovery
code no cursor adjust ops are recovered, because we don't need to restore
cursor state, we only want to restore the data consistently.
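
The walk-with-callback pattern can be sketched as below. This is a simplified model, not the real interface: toy_walk_cursors only plays the role of __db_walk_cursors (whose actual signature is different), and the cursor struct, the callback type and the insert adjustment are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy cursor position: page number plus index on that page. */
typedef struct {
    unsigned pgno;
    unsigned indx;
} toy_cursor;

/* One callback per kind of adjustment; returns 1 if it changed the cursor. */
typedef int (*toy_adjust_fn)(toy_cursor *c, unsigned pgno, unsigned indx);

/* Plays the role of __db_walk_cursors: call fn against every cursor
 * and report how many cursors were adjusted. */
int toy_walk_cursors(toy_cursor *all, size_t n, toy_adjust_fn fn,
    unsigned pgno, unsigned indx)
{
    int changed = 0;
    size_t i;

    assert(fn != NULL);
    for (i = 0; i < n; i++)
        changed += fn(&all[i], pgno, indx);
    return changed;
}

/* Example callback F: after an insert at (pgno, indx), cursors on that
 * page at or past the slot move up by one so they keep pointing at the
 * same item. */
int toy_adjust_after_insert(toy_cursor *c, unsigned pgno, unsigned indx)
{
    if (c->pgno == pgno && c->indx >= indx) {
        c->indx++;
        return 1;
    }
    return 0;
}
```

Returning the number of adjusted cursors mirrors the point made above: the caller needs to know whether anything changed before deciding to log the adjustment for a child txn.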

TODO: my idea: transfer ownership of the data item marked deleted when a
cursor goes away or closes, until a cursor can't find another cursor on the
item--then it will physically delete the data item.

See comments in __bamc_close for more details.

4. Code file naming conventions in btree/hash/queue
AM_auto.c: contains generated log-read and log-write functions for the changes made by
this AM.
AM_autop.c: contains generated log print functions.
AM.src: contains log record definitions, used by dist/gen_rec.awk to generate
the log read/write/print functions.

AM_compact.c: contains functions to compact db files of this AM type.
AM_conv.c: contains functions to do AM-specific pgin/pgout processing, all of which
do byte swapping of pages for this AM.
AM_curadj.c: contains functions to do cursor adjustments. See #3 above.

AM_method.c: contains simple functions and all functions to init db handle
function pointers.
AM_open.c: contains functions to open databases of AM type.
AM_rec.c: contains functions to do recovery for each type of logs.
AM_stat.c: contains functions to accumulate/print stats of this AM.
AM_upgrade.c: contains functions to upgrade db files of this AM to newer versions.
AM_verify.c: contains functions to verify this AM db file.

5. Function naming conventions
1. __AMc_ACTION for cursor manipulations
__bamc_init, __bamc_close,
__bamc_destroy, __bamc_refresh (***_refresh refreshes the structure
as if it were newly created, so that it can be reused), etc., as well as public
methods such as __bamc_del, __bamc_get, __bamc_put, __bamc_cmp, __bamc_count.
Such methods are assigned to the DBC handle's function pointers, so that calls through those
function pointers are routed to AM-specific ops.
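
A minimal sketch of this routing, assuming a cut-down cursor handle. The struct layout, the field names am_get/am_close and all toy_* functions are hypothetical; they only mirror the idea that an init function such as __bamc_init installs AM-specific functions into the DBC handle.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_dbc toy_dbc;

/* Toy cursor handle with per-AM function pointers. */
struct toy_dbc {
    const char *am_name;            /* which access method owns the cursor */
    int (*am_get)(toy_dbc *);       /* AM-specific get */
    int (*am_close)(toy_dbc *);     /* AM-specific close */
    int is_open;
};

/* Btree-flavoured implementations, named after the __bamc_ACTION pattern. */
static int toy_bamc_get(toy_dbc *dbc)
{
    return dbc->is_open ? 0 : -1;
}

static int toy_bamc_close(toy_dbc *dbc)
{
    dbc->is_open = 0;
    return 0;
}

/* Plays the role of __bamc_init: wire the btree functions into the handle. */
void toy_bamc_init(toy_dbc *dbc)
{
    dbc->am_name = "btree";
    dbc->am_get = toy_bamc_get;
    dbc->am_close = toy_bamc_close;
    dbc->is_open = 1;
}

/* Generic layer: callers never name the AM, they go through the pointers. */
int toy_dbc_get(toy_dbc *dbc)
{
    assert(dbc->am_get != NULL);
    return dbc->am_get(dbc);
}

int toy_dbc_close(toy_dbc *dbc)
{
    assert(dbc->am_close != NULL);
    return dbc->am_close(dbc);
}
```

A hash or queue AM would supply its own init function that fills the same pointers, and the generic layer would stay unchanged.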

6. Off Page Duplicates (opd)
Dup data items share the same key item, like (1, 2), (1, 3), and (1, 4), which
have the same key 1 and different data items 2, 3, 4. In the btree and hash AMs,
2, 3, 4 are normally stored in leaf pages (btree) just like other key/data
pairs, but if the dup data items consume more space than the overflow size, they
are put into an off-page-duplicate tree, which is a btree, and the on-page data
item stores the root page number of the opd tree.

An opd tree can be a recno tree too, when DB_DUPSORT and the dup compare
function are not specified.
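
The choice between on-page storage and the two kinds of opd tree can be sketched as a small decision function. This is an assumption-laden illustration: the enum, the function name and the threshold parameter are hypothetical, and the threshold is passed in rather than computed the way Berkeley DB computes it.

```c
#include <assert.h>
#include <stddef.h>

/* Where a set of duplicate data items ends up in this toy model. */
typedef enum {
    TOY_ON_PAGE,     /* stored on the leaf page with other k/d pairs */
    TOY_OPD_BTREE,   /* off-page duplicate btree (sorted dups) */
    TOY_OPD_RECNO    /* off-page duplicate recno tree (unsorted dups) */
} toy_dup_store;

/* dup_bytes: total space used by the dup data items of one key.
 * threshold: size limit for keeping them on the page (assumed input).
 * dupsort:   nonzero when sorted duplicates are configured. */
toy_dup_store toy_choose_dup_store(size_t dup_bytes, size_t threshold,
    int dupsort)
{
    assert(threshold > 0);
    if (dup_bytes <= threshold)
        return TOY_ON_PAGE;
    return dupsort ? TOY_OPD_BTREE : TOY_OPD_RECNO;
}
```

For example, with a hypothetical threshold of 1024 bytes, 2000 bytes of sorted dups go to an opd btree and 2000 bytes of unsorted dups go to an opd recno tree.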

We don't acquire any lock to access an opd tree/page, because we always lock
the page of the on-page key/opd-pgno key/data pair before accessing the opd page.

In an opd btree, there are no "data" items, only "key" items, even on the leaf
pages of the opd tree. The reason is that all the "key"
items are dup data items of the same "key" in that db.

An overflow data item is stored on a chain of pages and leaves a B_OVERFLOW
data item on the leaf page, OR ON THE OPD TREE'S LEAF PAGE. That is to say,
a data item in a set of dup key/data pairs is always allowed to be an overflow item.

Duplicate key/data pairs storage:
1. DB_DUP is set, DB_DUPSORT not set
When the dup data items don't consume over a quarter of the page space, they
are put on btree leaf pages. Otherwise, they are put into an
off-page-duplicate recno tree. I think it's better to put them on a chain of opd pages
unsorted, because we never randomly access a dup data item (recall the flags DB_NEXT,
DB_NEXT_DUP, DB_NEXT_NODUP, and the PREV versions for DBC->get). (TODO: try a

2. both DB_DUP and DB_DUPSORT set
When the dup data items don't consume over a quarter of the page space, they
are put on btree leaf pages and sorted. Otherwise, they are put into an opd

