Berkeley DB Source Code Analysis (1) --- Code Characteristics and Cursor Implementation

Original post, 2012-03-25 15:52:06

I. General Notes
1. Internally, the db is always accessed through a cursor. The cursor connects locking, txn, logging, the access method (AM), etc.
To get a page, first create a cursor if you don't have one, then call __db_lget to lock the page, then call __memp_fget to get the page from the cache;
you then hold the page's pointer. After use, call __TLPUT or __LPUT to release the lock, then call __memp_fput to release the page.

2. How to lock/unlock a page and get/put a page from mpool?
See __bam_read_lock.
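
The lock, fetch, use, release sequence from notes 1 and 2 can be sketched with a toy model. Everything here (toy_page, toy_cursor, the toy_* functions) is hypothetical and only mimics the roles of __db_lget, __memp_fget, __TLPUT/__LPUT and __memp_fput; the real functions take different arguments and live behind Berkeley DB's internal headers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins only: these are NOT Berkeley DB's real types or signatures. */
typedef struct {
    int lock_count;  /* locks currently held on this page */
    int pin_count;   /* outstanding cache references to this page */
    int payload;     /* pretend page contents */
} toy_page;

typedef struct {
    toy_page *page;  /* page currently referenced by the cursor, or NULL */
} toy_cursor;

/* Lock the page (the role the notes give to __db_lget). */
void toy_lock_page(toy_page *p) { p->lock_count++; }

/* Pin the page in the cache and return its pointer (role of __memp_fget). */
toy_page *toy_fetch_page(toy_page *p) { p->pin_count++; return p; }

/* Release the lock (role of __TLPUT / __LPUT). */
void toy_unlock_page(toy_page *p) { assert(p->lock_count > 0); p->lock_count--; }

/* Unpin the page (role of __memp_fput). */
void toy_put_page(toy_page *p) { assert(p->pin_count > 0); p->pin_count--; }

/* Read the payload through a cursor: lock, fetch, use, then release both. */
int toy_read_through_cursor(toy_cursor *c, toy_page *target)
{
    int value;

    toy_lock_page(target);
    c->page = toy_fetch_page(target);
    value = c->page->payload;   /* use the page while it is locked and pinned */
    toy_unlock_page(target);
    toy_put_page(c->page);
    c->page = NULL;             /* the cursor no longer references the page */
    return value;
}
```

Calling toy_read_through_cursor on a page whose payload is 42 returns 42 and leaves both counters at zero, which is the invariant the real code has to maintain on every exit path.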

3. Cursor Adjustment
Cursor adjustments are logged if they are for subtransactions.  This is
because it's possible for a subtransaction to adjust cursors which will
still be active after the subtransaction aborts, and so which must be
restored to their previous locations.  Cursors that can be both affected
by our cursor adjustments and active after our transaction aborts can
only be found in our parent transaction -- cursors in other transactions,
including other child transactions of our parent, must have conflicting
locker IDs, and so cannot be affected by adjustments in this transaction.
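
A minimal sketch of why the old position has to be recorded, assuming a toy transaction that keeps a small in-memory list of adjustment records. toy_txn, adj_record, toy_adjust and toy_abort are hypothetical names for illustration; Berkeley DB writes real log records instead.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_RECS 8   /* arbitrary capacity for this sketch */

typedef struct { unsigned indx; } sub_cursor;   /* cursor position on a page */

typedef struct {
    sub_cursor *cursor;   /* which cursor was moved */
    unsigned old_indx;    /* where it was before the adjustment */
} adj_record;

typedef struct {
    int is_child;                    /* nonzero for a subtransaction */
    adj_record recs[TOY_MAX_RECS];   /* toy "log" of cursor adjustments */
    size_t nrecs;
} toy_txn;

/* Adjust a cursor; only a child txn records the previous position. */
int toy_adjust(toy_txn *txn, sub_cursor *c, unsigned new_indx)
{
    assert(txn != NULL && c != NULL);
    if (txn->is_child) {
        if (txn->nrecs == TOY_MAX_RECS)
            return -1;               /* toy log is full */
        txn->recs[txn->nrecs].cursor = c;
        txn->recs[txn->nrecs].old_indx = c->indx;
        txn->nrecs++;
    }
    c->indx = new_indx;
    return 0;
}

/* Abort: undo the recorded adjustments, newest first. */
void toy_abort(toy_txn *txn)
{
    while (txn->nrecs > 0) {
        adj_record *r = &txn->recs[--txn->nrecs];
        r->cursor->indx = r->old_indx;
    }
}
```

A cursor that starts at index 3 and is adjusted twice inside a child txn returns to 3 after toy_abort, while the same adjustment in a top-level txn leaves no record behind.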

When a key/data pair is deleted, there can be other cursors pointing to it, so
we don't physically delete it immediately. Instead we mark the key/data pair as deleted and mark
all cursors pointing at this k/d with C_DELETED (this is called a logical
delete). When the last cursor pointing at the 'deleted' k/d is closed, the key/data pair is physically deleted.
This is the only situation in which a physical delete happens; for example, when the last cursor merely moves away from
a k/d marked BI_DELETED, that k/d is not deleted.

When closing a cursor, if we find it has C_DELETED set, we walk all cursors of the
same database and mark those pointing at the same k/d with C_DELETED. Opd cursors, if any, are checked
and marked in the same way. This is done by __bam_ca_delete. This function
also tells us how many other cursors are sitting on this key/data pair. If there
are still other cursors sitting on the k/d, __bamc_close is finished. Otherwise
we physically delete the k/d, or even the opd btree/recno-tree and its
on-page k/d items.
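
The logical-then-physical delete rule can be sketched as follows. This is a toy model under my own assumptions: del_item, del_cursor and TOY_C_DELETED are hypothetical stand-ins, and a flat array of cursors replaces the per-database cursor lists.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_C_DELETED 0x1u   /* hypothetical flag playing the role of C_DELETED */

typedef struct {
    int present;          /* 1 while the item physically exists on the page */
    int marked_deleted;   /* 1 once the item has been logically deleted */
} del_item;

typedef struct {
    del_item *item;       /* item the cursor sits on, NULL once closed */
    unsigned flags;
} del_cursor;

/* Logical delete: mark the item and flag every cursor sitting on it.
 * Returns how many cursors were flagged. */
int toy_logical_delete(del_item *it, del_cursor *cursors, size_t n)
{
    int flagged = 0;

    assert(it != NULL);
    it->marked_deleted = 1;
    for (size_t i = 0; i < n; i++) {
        if (cursors[i].item == it) {
            cursors[i].flags |= TOY_C_DELETED;
            flagged++;
        }
    }
    return flagged;
}

/* Close a cursor. If it carried the deleted flag and no other cursor still
 * sits on the item, delete the item physically. Returns 1 in that case. */
int toy_cursor_close(del_cursor *c, del_cursor *cursors, size_t n)
{
    del_item *it = c->item;
    unsigned was_deleted = c->flags & TOY_C_DELETED;

    c->item = NULL;   /* detach first so this cursor is not counted below */
    c->flags = 0;
    if (it == NULL || !was_deleted)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (cursors[i].item == it)
            return 0;   /* someone else is still on the item */
    it->present = 0;    /* physical delete */
    return 1;
}
```

With two cursors on one item, the first close leaves the item in place and only the second close removes it.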

When deleting a k/d on opd pages, we don't lock the opd tree, we only lock the
page containing the on-page key/opd-root-pgno key/data pair.

Cursors from another process cannot be sitting on the page where the
key/data pair is logically deleted, because of transactional or
handle locking. So it's sufficient to mark or adjust only cursors in the current
process when deleting/inserting a key/data pair.

Suppose a cursor C performs an insert or delete, so that cursors sitting on the
modified page must be adjusted. For each type of cursor adjustment, C calls __db_walk_cursors to
iterate over all cursors of all DB handles in the current process that were opened on the same db as C.db in C.env,
and passes a callback F to __db_walk_cursors, which calls it on each
cursor. There is one F for each type of cursor adjustment. If the adjustment
modifies a page, it is logged, and the matching undo operations are
called when a child txn aborts. This is the only time such log records are
meaningful: we want to restore cursors if a child txn aborts. Recovery
code does not replay cursor adjustments, because there we don't need to restore
cursor state; we only want to restore the data consistently.
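
The walker-plus-callback pattern looks roughly like this. It is a simplified sketch with hypothetical names (adj_cursor, adj_fn, toy_walk_cursors, toy_adjust_for_insert), not the real __db_walk_cursors signature, and it walks a plain array instead of every DB handle's cursor queue.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned pgno;   /* page the cursor is on */
    unsigned indx;   /* index of the item within that page */
} adj_cursor;

/* One callback per kind of adjustment; returns 1 if it changed the cursor. */
typedef int (*adj_fn)(adj_cursor *c, unsigned pgno, unsigned indx, void *args);

/* Apply f to every cursor and return how many cursors were adjusted. */
int toy_walk_cursors(adj_cursor *cursors, size_t n, adj_fn f,
                     unsigned pgno, unsigned indx, void *args)
{
    int adjusted = 0;

    assert(f != NULL);
    for (size_t i = 0; i < n; i++)
        adjusted += f(&cursors[i], pgno, indx, args);
    return adjusted;
}

/* Callback for an insert: cursors on the same page at or after the
 * insertion index shift by *args (an unsigned delta). */
int toy_adjust_for_insert(adj_cursor *c, unsigned pgno, unsigned indx, void *args)
{
    unsigned delta = *(unsigned *)args;

    if (c->pgno != pgno || c->indx < indx)
        return 0;
    c->indx += delta;
    return 1;
}
```

Inserting one item at index 1 of page 5 moves a cursor at (5, 2) to (5, 3) and leaves cursors at (5, 0) and (7, 2) alone. A delete or split would plug a different callback into the same walker.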

TODO: my idea: transfer ownership of the data item marked deleted when a
cursor goes away or closes, until a cursor can't find another cursor on the
item--then it will physically delete the data item.

See comments in __bamc_close for more details.

4. Code file naming conventions in btree/hash/queue
AM_auto.c: contains generated log read/log functions to log changes made by
this AM.
AM_autop.c: contains generated log print functions.
AM.src: contains log record definitions, used by dist/gen_rec.awk to generate
the log read/log/print functions.

AM_compact.c: contains functions to compact db files of this AM type.
AM_conv.c: contains functions to do AM specific pgin/out processing, all of
which byte-swap pages for this AM.
AM_curadj.c: contains functions to do cursor adjustments. See #3 above for details.

AM_method.c: contains simple functions and all functions to init db handle
function pointers.
AM_open.c: contains functions to open databases of AM type.
AM_rec.c: contains functions to do recovery for each type of logs.
AM_stat.c: contains functions to accumulate/print stats of this AM.
AM_upgrade.c: contains functions to upgrade db files of this AM to newer versions.
AM_verify.c: contains functions to verify this AM db file.

5. Function naming conventions
1. __AMc_ACTION for cursor manipulations
For example __bamc_init, __bamc_close, __bamc_destroy, and __bamc_refresh
(a ***_refresh function resets the structure as if it were newly created, so
that it can be reused). The same pattern covers the public methods such as
__bamc_del, __bamc_get, __bamc_put, __bamc_cmp, and __bamc_count. These
methods are assigned to the DBC handle's function pointers, so that calls
through those pointers are routed to the AM specific ops.
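
The routing through handle function pointers can be illustrated with a toy cursor handle. The struct layout and all toy_* names are assumptions made for this sketch; the real DBC has many more members and its init code is more involved.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_BTREE = 1, TOY_HASH = 2 };   /* hypothetical access method ids */

typedef struct toy_dbc toy_dbc;
struct toy_dbc {
    int (*am_get)(toy_dbc *);   /* set by the AM's init function */
    int am_type;
};

/* AM specific implementations; each just reports which AM handled the call. */
int toy_bamc_get(toy_dbc *dbc) { (void)dbc; return TOY_BTREE; }
int toy_hamc_get(toy_dbc *dbc) { (void)dbc; return TOY_HASH; }

/* Init functions wire the handle to one access method. */
void toy_bamc_init(toy_dbc *dbc) { dbc->am_type = TOY_BTREE; dbc->am_get = toy_bamc_get; }
void toy_hamc_init(toy_dbc *dbc) { dbc->am_type = TOY_HASH;  dbc->am_get = toy_hamc_get; }

/* Generic entry point: the caller never names the AM, the pointer does. */
int toy_dbc_get(toy_dbc *dbc)
{
    assert(dbc != NULL && dbc->am_get != NULL);
    return dbc->am_get(dbc);
}
```

After toy_bamc_init, toy_dbc_get lands in toy_bamc_get; after toy_hamc_init, the same call lands in toy_hamc_get. That is the whole point of the naming convention: the generic layer stays AM agnostic.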

6. Off Page Duplicates (opd)
Dup data items share the same key item, like (1, 2), (1, 3), and (1, 4), which
have the same key 1 and different data items 2, 3, 4. In the btree and hash AMs,
2, 3, 4 are normally stored in leaf pages (btree) just like other key/data
pairs, but if the dup data items consume more space than the overflow size, they
are put into an off-page-duplicate tree, which is a btree, and the on-page data
item stores the root page number of the opd tree.

An opd tree can be a recno tree too, when neither DB_DUPSORT nor a dup compare
function is specified.
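
The storage decision described so far can be written as a small function. This is only a sketch of the rule as stated in these notes: dup_storage and toy_choose_dup_storage are hypothetical names, and the threshold is passed in as a parameter rather than computed the way Berkeley DB computes it.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    DUP_ON_PAGE,     /* duplicates stay on the leaf page */
    DUP_OPD_BTREE,   /* sorted duplicates moved to an off-page btree */
    DUP_OPD_RECNO    /* unsorted duplicates moved to an off-page recno tree */
} dup_storage;

/* Decide where a duplicate set lives.
 * dup_bytes: total space the duplicate data items consume
 * threshold: size above which duplicates move off page (caller supplied)
 * dupsort:   nonzero when sorted duplicates are configured */
dup_storage toy_choose_dup_storage(size_t dup_bytes, size_t threshold, int dupsort)
{
    assert(threshold > 0);
    if (dup_bytes <= threshold)
        return DUP_ON_PAGE;
    return dupsort ? DUP_OPD_BTREE : DUP_OPD_RECNO;
}
```

With a threshold of 1024 bytes, a 100-byte duplicate set stays on the page, while a 2000-byte set goes to an opd btree when sorted duplicates are configured and to an opd recno tree otherwise.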

We don't acquire any lock to access an opd tree/page, because we always lock
the on-page key/opd-pgno keydata pair's page before accessing the opd page.

In an opd btree there are no "data" items, only "key" items, even on the leaf
pages of the opd tree. The reason is that all of its "key"
items are dup data items of the same "key" in that db.

An overflow data item is stored on a chain of pages and leaves a B_OVERFLOW
data item on the leaf page, OR ON THE OPD-TREE'S LEAF PAGE. That is to say,
a data item in a set of dup key/data pairs is always allowed to be an overflow item.

Duplicate key/data pairs storage:
1. DB_DUP is set, DB_DUPSORT not set
When the dup data items don't consume more than a quarter of the page space, they
are put on btree leaf pages. Otherwise, they are put into an
off page duplicate recno tree. I think it would be better to put them on a chain of opd pages,
unsorted, because we never randomly access a dup data item (recall the flags DB_NEXT,
DB_NEXT_DUP, DB_NEXT_NODUP, and the PREV versions for DBC->get). (TODO: try a

2. both DB_DUP and DB_DUPSORT set
When the dup data items don't consume more than a quarter of the page space, they
are put on btree leaf pages and sorted. Otherwise, they are put into an opd

