Berkeley DB Source Code Analysis (3) --- Btree Implementation (2)

Original post, 2012-03-25 15:54:00


In a btree, on-page duplicate key/data pairs are stored this way:
1. The key is put onto the page only once, since storing identical keys
multiple times would serve no purpose. Each of the dup key's data items is
put onto the page.

2. In the index array there are multiple index elements for this dup key, all
pointing at the same key's offset. Since the index array is sorted by the keys
the elements point at, the index elements for the same dup key are contiguous:
for a key with 3 dup key/data pairs, indx[i], indx[i+1] and indx[i+2] all
point at the same key value on the page.

So when deleting the key at indx[i+1], we don't remove the key from the page,
since indx[i] and indx[i+2] still point at it. We simply move the elements
after indx[i+1] one element forward; afterwards indx[i] and indx[i+1] point at
that key, so two dup key/data pairs remain. Deleting a key/data pair from a
btree leaf page takes two steps: first delete the key, then delete the data
item. The order can't be reversed.
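
A minimal sketch of the index-array bookkeeping described above, assuming a toy model in which only the key slots are kept (the data slots are left out) and each slot is just the page offset of its key. The function name `delete_dup_slot` is made up for illustration and is not a Berkeley DB function.

```python
def delete_dup_slot(indx, pos):
    """Drop index entry `pos` and report whether the key bytes are still in use.

    indx is a list of key offsets in key order; duplicates of one key
    share the same offset because the key is stored on the page only once.
    """
    key_off = indx[pos]
    remaining = indx[:pos] + indx[pos + 1:]   # later entries move one slot forward
    key_still_used = key_off in remaining     # another dup still points at the key
    return remaining, key_still_used
```

With three dups of the key at offset 4000, removing the middle slot leaves two slots pointing at 4000 and the key bytes stay on the page; only when the last slot goes away can the key itself be removed.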

Deleting key/data pairs

1. In DBC->del, we only mark the key/data pair deleted (B_DELETE) and mark
the cursor as pointing to a deleted k/d pair (C_DELETED). We don't physically
remove the k/d from the page until the cursor is closed while still pointing
to the deleted k/d; in that special case we remove the single k/d pair it
points to. After a data item is marked deleted, it can still be located
internally by the search functions, but it is never returned to the user. The
space it takes can be overwritten when inserting a k/d that belongs at exactly
the same page and location.

Thus, if we use DB->del to delete a k/d, it's immediately deleted from the db;
if we use DBC->del to iterate over the db and delete each k/d, none except the
last one is removed from the db. This avoids frequent tree structure changes
(split/rsplit), which are expensive operations, but it also wastes a lot of
space.
I think we should add a DB_FORCE flag for DBC->close. When it's specified, we
know no other cursor is pointing at the k/d, so when our cursor is about to
move from the current page to another page, we can delete all k/d pairs marked
B_DELETE. We don't remove the pair on each DBC->del call because that would
make the cursor movement operations harder to implement.

Return a specified key/data pair, given the page pointer (already locked and
fetched from mpool), pgno and index.


works for the DB_GET_BOTH and DB_GET_BOTH_RANGE flags of DB/DBC->get.

If DB_DUP is set but DB_DUPSORT is not, in which case dbp->dup_compare is
null, we do a linear search and only look for an exact match even if RANGE is
specified, i.e. without DUPSORT, RANGE is identical to GET_BOTH, which is

Otherwise both DB_DUP and DB_DUPSORT are specified, and we do a binary search
on the leaf page. __bamc_search does the btree search in the opd, not this
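
As a rough illustration of the two search strategies for on-page dups, here is a sketch over a plain Python list of data items. The function `find_dup` and its parameters are hypothetical; byte-wise ordering is assumed in place of a user-supplied dup comparison function.

```python
from bisect import bisect_left

def find_dup(dups, datum, sorted_dups, want_range):
    """Return the index of the matching duplicate, or None if there is none."""
    if not sorted_dups:
        # No comparison function: linear scan, exact match only,
        # so a range request behaves like an exact-match request.
        for i, d in enumerate(dups):
            if d == datum:
                return i
        return None
    i = bisect_left(dups, datum)            # binary search over sorted dups
    if i < len(dups) and dups[i] == datum:
        return i
    if want_range and i < len(dups):
        return i                            # smallest datum >= the requested one
    return None
```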


In a btree we can't specify where to store a k/d because it's stored according
to k's value and d's value. The only exception to this rule is when the btree
allows dups (DB_DUP set) but not sorted dups (!DB_DUPSORT); in this case we can
ask to insert a data item before or after (DB_BEFORE/DB_AFTER) the key/data
pair the cursor currently points at, as a dup data item for the same

controls how dup data items are dealt with rather than the movement or
position of the cursor. The first two take effect when the db is a btree with
both DB_DUP and DB_DUPSORT set: DB_OVERWRITE_DUP asks DB/DBC->put to overwrite
the identical key/data pair in such a db, while DB_NODUPDATA asks DB/DBC->put
to return the DB_KEYEXIST error value. DB_NOOVERWRITE asks DB/DBC->put to
always return DB_KEYEXIST if the key to put already exists in the db, whether
or not DB_DUP is set.
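
The effect of the three flags as described above can be written down as a small decision function. This is only a sketch of the rules in these notes, not the library's code path: `apply_put_flag`, the string flag names and the "proceed" result (meaning "these flags make no decision, continue with the normal put") are all assumptions for illustration.

```python
KEYEXIST = "DB_KEYEXIST"   # stand-in for the real error code

def apply_put_flag(existing, data, flag, dupsort):
    """existing: list of data items already stored under the key."""
    if flag == "DB_NOOVERWRITE" and existing:
        return KEYEXIST                 # key already present, DB_DUP or not
    if dupsort and data in existing:    # identical key/data pair in a sorted-dup db
        if flag == "DB_NODUPDATA":
            return KEYEXIST
        if flag == "DB_OVERWRITE_DUP":
            return "overwrite"
    return "proceed"
```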

In this function we need to deal with four types of btree db:
a. no-dup db; b. dup but unsorted db; c. dup sorted on-page key/data pairs;
d. dup sorted opd tree.

We decide the flags parameter for __bam_iitem according to the type of db and
the flags parameter of __bamc_put. We may need to split if the page is full;
if so, we retry after the split. We also move the cursor to the correct
position for the put operation, so that __bam_iitem can operate on it.

If our cursor is currently in the main btree and we need to go into an opd
tree to search/put, we have to return to the caller so that it can create an
opd cursor for this key item, whose opd root pgno we have already found, and
retry the put op.

* __bam_iitem (dbc, key, data, op, flags)
Insert a key/data pair at the specified position. The flags parameter is not
used.

When this function is called, the cursor has already been positioned where we
want to insert/update the key/data pair. If op is DB_CURRENT, we update the
current k/d pair; if op is DB_BEFORE/DB_AFTER, we insert a new data item as a
dup data item of the current k/d pair's key, immediately before or after the
current position; if op is DB_KEYFIRST, we insert the key/data pair at its
sorted position in the btree.

The first part handles partial put and streaming put. For a streaming put
(identifiable from the dbt's doff/dlen and the existing data item's length)
we need to read the entire existing data item into the "tdbt" variable by
calling __db_goff, and set DB_DBT_STREAMING.
For a normal partial put, we also need to read it into the "tdbt" variable,
by calling __bam_build.

The 2nd part computes the bytes needed to store this data item, based on the
type of operation and the byte length computed at the beginning of this
function. If the free space on the page is not enough for this item, we will
split. We don't try to reuse items marked deleted. I think we should, by
making use of the space taken by deleted items that are not referenced by any
cursor. (TODO)

Consider this situation: we insert a thousand k/d pairs whose keys are always
"1" and whose data are arbitrary byte strings. How will the split handle it?
Dozens of pages are needed, but only one key is available. I think the only
solution is to store all dups in an opd, without forming an opd tree, and just
put these data items into the chain of opd pages.

If DUP is allowed and less than half of the page is free after this k/d put,
we call __bam_dup_check to check whether all dup data items for the key
should be moved into an opd btree/recno tree. If the dup data items consume
over a quarter of the page space, we call __bam_dup_convert to do so. Thus we
only need one such page to store all on-page dups. If DB_DUPSORT is set on the
db handle, we use a btree opd to store the sorted dup data items, otherwise we
use a recno opd tree to store the unsorted dup data items.
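
The two thresholds above fit in a few lines. A minimal sketch, assuming the sizes are plain byte counts; `dup_conversion` is a made-up name, the exact boundary comparisons in the real code may differ, and the returned strings only echo the page types named in these notes.

```python
def dup_conversion(page_size, free_bytes, dup_bytes, dupsort):
    """Return the opd leaf page type to convert to, or None to leave dups on-page."""
    if free_bytes >= page_size // 2:
        return None                  # page is at least half free: no check needed
    if dup_bytes <= page_size // 4:
        return None                  # dups are small: keep them on the leaf page
    # Sorted dups go to a btree opd, unsorted dups to a recno opd.
    return "LDUP" if dupsort else "LRECNO"
```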

Then we put the key, followed by the data item, onto the page.
Putting a key:
For DB_AFTER and DB_BEFORE, we only store another element in the index array,
whose value is the key offset. Then we adjust the cursor indexes.
For DB_CURRENT, we only need to modify the current data item; the key is not
modified, so this case is discussed below.
is the only case we need to store the key into the page. We will call
__bam_ovput if the key should be put into an overflow page, or __db_pitem

Putting a data item:
Call __bam_ovput, __db_pitem and __bam_ritem to put the data item onto the
page or overflow pages. Finally call __bam_dup_convert to move dup items that
have grown too large into an opd tree.

* Partial put:
__db_buildpartial and __db_partsize and __bam_partsize

For a DBT dbt passed to DB/DBC->put as a partial put, if dbt.doff > the
existing data item's total length L0 (L0 can be 0 when adding a new data item
for an existing key using DB_AFTER or DB_BEFORE), then after this partial put
there will be a hole of zeros between byte L0 and byte dbt.doff in the new
data item's byte array.
__db_partsize computes how many bytes the new data item will need, based on
the existing data item's total length and dbt's doff, dlen and size members.
The new data item may shrink or grow.
__db_buildpartial builds the new data item given its new total size, the
existing bytes and the new bytes.

__bam_partsize gets the btree-specific total length of the new data item. To
do so, it first gets the data item's existing length, then calls
__db_partsize with the old data item length and the new data item
(represented by a dbt) to get the total size in bytes of the new data item we
will insert.
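
The size arithmetic and the zero-filled hole can be made concrete with a short sketch. It follows the doff/dlen/size semantics described above, but it works on Python bytes instead of DBTs, and the names `part_size` and `build_partial` are illustrative stand-ins rather than the library functions themselves.

```python
def part_size(old_len, doff, dlen, size):
    """Length of the item after replacing dlen bytes at doff with size new bytes."""
    if old_len < doff + dlen:
        # The replaced range runs past the end of the old item:
        # the result ends right after the new bytes.
        return doff + size
    # The replaced range lies fully inside the old item.
    return old_len - dlen + size

def build_partial(old, doff, dlen, new):
    total = part_size(len(old), doff, dlen, len(new))
    head = old[:doff].ljust(doff, b"\0")    # zero-fill any hole before doff
    tail = old[doff + dlen:]                # bytes surviving after the replaced range
    out = head + new + tail
    assert len(out) == total
    return out
```

For example, writing 2 bytes at doff 5 into a 3-byte item yields a 7-byte item with two zero bytes in the middle, while replacing 3 bytes with 1 byte inside a 6-byte item shrinks it to 4 bytes.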

* __db_goff (dbt, bpp, bpsz, pgno, tlen)

Get an overflow data item, partially or entirely, whose first page number is
pgno and whose total length is tlen. The partial get settings are in dbt. It
starts from the first page and fetches each page on the chain from mpool until
it arrives at the page where the partial get starts (dbt.doff), then gets the
specified length. It has an optimization for streaming get: when getting a
huge chunk of data continuously, portion by portion with no overlaps or holes,
the next pgno and stream offset are stored in the cursor, so we can start
directly from that pgno's stream offset and get as much as we want without
walking the page chain from the beginning.
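
A rough sketch of the chain walk and the streaming shortcut, assuming a simplified model: `pages` is a dict mapping pgno to (payload, next pgno), and the resume hint is a (pgno, item offset of that page's first byte) pair kept by the caller. The function name and the hint layout are assumptions; the real cursor fields are not reproduced here.

```python
def get_overflow(pages, first_pgno, doff, dlen, hint=None):
    """Return (data, new_hint, pages_fetched) for a partial get of dlen bytes at doff."""
    # Use the hint only if it does not lie beyond the requested offset.
    pgno, start = hint if hint and hint[1] <= doff else (first_pgno, 0)
    out = bytearray()
    fetched = 0
    while pgno is not None:
        payload, nxt = pages[pgno]
        fetched += 1
        end = start + len(payload)
        if end > doff:                           # this page overlaps the request
            lo = max(doff, start) - start
            out += payload[lo:lo + (dlen - len(out))]
            if len(out) == dlen and doff + dlen < end:
                break                            # next read resumes inside this page
        pgno, start = nxt, end
        if len(out) == dlen:
            break
    return bytes(out), (pgno, start), fetched
```

Passing the returned hint into the next call lets a sequential reader skip the pages it has already consumed instead of starting from the first page each time.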

* __bam_build
Build a dbt object representing the new data item we want to insert. The data
item may or may not be an overflow item, and partial put is allowed. Streaming
put doesn't need this function.

* __bam_ovput(dbc, type, pgno, h, indx, item)
Store an overflow or dup data item into the overflow pages or the opd tree,
and put the B_OVERFLOW/B_DUPLICATE on-page item onto the leaf page.

type: overflow or dups;
pgno: opd root page;
h: page pointer of the leaf page;
indx: the index in the leaf page's index array where we will put the offset of
the leaf-page item. This item can be an overflow item (BOVERFLOW) or
item: data item to store.

__db_pitem(pagep, indx, nbytes, hdr, data)
Put a data item (all 3 types, BKEYDATA, BOVERFLOW and BDUPLICATE, are supported) onto the
pagep: the page where we will store the data item;
indx: the index of the element that stores the offset of this data item;
nbytes: the total number of bytes of the data item, including the hdr
(BKEYDATA or one of the other 2 header types), whether or not hdr is NULL;
hdr: the BKEYDATA or one of the other 2 hdrs; if NULL, it is constructed
internally;
data: the data item to store.
The remaining work is to shuffle the index array to make room for the offset
of the item, and to append the item to the page at the tail of the items
already there (items grow from the end of the page towards the beginning).
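
A minimal sketch of this slotted-page insert, assuming a toy page with no header and 2 bytes of space charged per index entry. The `Page` class, `hoffset` attribute and `put_item` method are illustrative names, not the real structures.

```python
class Page:
    def __init__(self, size):
        self.buf = bytearray(size)
        self.inp = []             # index array: item offsets, in slot order
        self.hoffset = size       # lowest used byte; items grow downward from the end

    def free_space(self):
        # Assumption: each index entry costs 2 bytes, and there is no page header.
        return self.hoffset - 2 * len(self.inp)

    def put_item(self, indx, item):
        if len(item) + 2 > self.free_space():
            raise ValueError("page full, caller would have to split")
        self.hoffset -= len(item)                       # append at the tail of the items
        self.buf[self.hoffset:self.hoffset + len(item)] = item
        self.inp.insert(indx, self.hoffset)             # later slots move up by one
```

Note that the physical order of items on the page follows insertion order, while the logical order is carried entirely by the index array.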

__db_ditem (pagep, indx, nbytes)
Remove the data item referred to by element 'indx' of the index array of page
'pagep'. This is done by shuffling the tail part of the item array towards the
end of the page (i.e. towards the beginning of the item array) and by moving
the index array's elements after 'indx' one element forward. Consequently the
index elements that point to items with a smaller offset than the original
inp[indx] must be adjusted.

We do not trust the page contents because this function can also be used by
recovery code, when the page may be inconsistent. Thus we need "nbytes" to
indicate the number of bytes of this data item, which can be computed during
normal operation and read from the log during recovery.
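
The removal can be sketched on a bare bytearray plus an offset list. This is a simplified model under stated assumptions: `delete_item` is a made-up name, each slot is assumed to point at a distinct item, and `nbytes` is passed in by the caller rather than read from the page, mirroring the point made above.

```python
def delete_item(buf, inp, hoffset, indx, nbytes):
    """Remove the item in slot indx; returns the new (inp, hoffset)."""
    off = inp[indx]
    # Slide everything stored below the victim up by nbytes to close the gap.
    buf[hoffset + nbytes:off + nbytes] = buf[hoffset:off]
    # Drop the slot, and fix up slots whose items were just moved.
    new_inp = [o + nbytes if o < off else o
               for i, o in enumerate(inp) if i != indx]
    return new_inp, hoffset + nbytes
```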

For each page-altering function, e.g. __db_ditem, there is a ***_nolog variant
such as __db_ditem_nolog. __db_ditem calls __db_ditem_nolog and then logs the
op, so that the *_nolog variant can be called on its own during recovery.

__bam_dup_convert (h, indx, cnt)
Move dup data items from a leaf page onto an opd tree leaf page. It can't set
up a tree of more than one page, because when it's called the dup items only
take up a quarter of the leaf page.
h: the leaf page containing the dup data items;
cnt: the number of key/data pairs, e.g. (1, 2) and (1, 3) give cnt 2;
indx: the index of the 1st key item of the dup key/data pairs.

First we move the dbc cursor to the 1st key/data pair of the dup k/d pairs.
Then we allocate a new page via __db_new, whose type is either LDUP(btree opd
for sorted dup items) or LRECNO(recno opd for unsorted dup items), and init
the page.

Then we iterate over all the k/d pairs and move each data item onto the opd
leaf page. The data items are used as keys in the opd tree, but those leaf
pages have no data items of their own, which is why another page type (LDUP)
is needed. For each pair we then delete the key index from h's index array by
calling __bam_adjindx, and call __db_ditem to remove the data item (it's
already on the opd leaf page) from h. Any of these items can be a BOVERFLOW
data item, in which case we delete its page chain via __db_doff before calling
__db_ditem.

Finally we call __bam_ovput to store the dup leaf page into the db and its
pgno onto the data item for the key as a B_DUPLICATE data item. This DUP data
item's index in h's index array is 1 greater than the 1st key's index, of

__bam_get_root (dbc, root_pgno, slevel, flags, stack)
Fetch the root of a tree and see if we want to keep it in the stack.

Determine the lock mode and buffer get mode:
    By default we read lock it, but under some conditions we want to write lock it, which means we will dirty the page, so we use DB_MPOOL_DIRTY in memp_fget.
    Sometimes we also want to try-latch the buffer.

Get the root page using the lock and get modes. We need to consider sub-db root pages, which may change position.
If we fail to try-latch the buffer, lock the page and retry the get-root-page action.



Context knowledge:
Btree leaf pages have level 1, the leaf pages' parent pages have level 2, etc.
Search path: when searching for a key in a tree we keep a stack of pages, which we also call the search path for the key.
Splitting a root page or any other page of a btree: we are given a key, the one whose insert failed because the leaf page is full. When we split a page we need to
promote the mid value to its parent page, so over time the parent page can fill up too. A leaf page split may therefore result in splitting the entire
search path from the root to the leaf, and we would need to lock all pages in the path to do such a split.

However, to enhance concurrency we don't want to lock the entire path all at once. Instead we try to lock as few pages, at as low a level, as possible.
We search down the tree for key K to a certain level L (L starts at the leaf level), lock the page at L and its parent page PL, and check whether
splitting L requires splitting PL. The check is made by calling the split functions __bam_root and __bam_page, which return DB_NEEDSPLIT if the parent
needs to be split and succeed if it does not. If PL does need a split, we go UP one level and repeat the search and the check, until we meet L0,
the page in our search path that does not require its parent to split. L0 is the starting point for the entire split process.

Starting at L0 we go downwards along the search path, locking a page and its parent, then splitting the page. Such a split can succeed because the parent
has enough space: either it has just been split, or (for L0's parent) it had enough space already. But it can fail too, because between our two splits of
internal pages the parent of the page we are splitting (PS) may have been filled in the meantime, so that we can't promote the mid value into it. In that case
we go UP again starting from PS to find the new L0, split pages and go down as described above, until we reach the leaf level.

Whenever we go UP or DOWN, we MUST start a new search from the root page: we always lock pages from the root downwards, never upwards. If our insert
operation fails because the leaf page needs a split, we must unlock all pages and then do the split.

__bam_root -- split a btree root page.
The special thing about the root page is that we don't have to worry that its parent page may need to split too, since the root
page has no parent.

Mark the root page dirty first, in case the page we hold was fetched read-only. Then allocate new pages via __db_new for the left and right pages and init them,
call __bam_psplit to do the page split, and log the split action. Since we are splitting the root, there is no risk of needing a further upper-level
split.

In this function we don't log a snapshot of the root page before the split; we only log the new root page after the split, as well as the
new left and right leaf pages and the two pointers to them in the new root page.

In this function the lock and mempool operations are in the wrong order: the pages are unlocked before the mempool puts; it should be done in the reverse order.

__bam_page -- split a btree page other than the root page.

Call __bam_psplit to move pg2split's left-half and right-half data to 2 pages allocated from malloc'ed memory rather than from the cache, and init them.

We must first test whether pg2split's parent page needs a split. If so, __bam_pinsert returns
DB_NEEDSPLIT but doesn't actually modify the parent page.

We malloc space for both the left and right pages, so we don't get
a new page from the underlying buffer pool until we know the split
is going to succeed.  The reason is that we can't release locks
acquired during the get-a-new-page process because metadata page
locks can't be discarded on failure since we may have modified the
free list.

If the __bam_pinsert test succeeds, we call __db_new to allocate pages, call __bam_pinsert to actually modify the parent page, and log the split.

__bam_pinsert -- promote the mid value to the parent page. May return DB_NEEDSPLIT if the parent page is full.
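
The test-first, commit-later ordering of a non-root split can be sketched as below. This is a toy model under stated assumptions: pages are sorted Python lists, the parent is a dict of separator keys and children, `capacity` is the maximum number of separator keys, and `split_page` is a hypothetical name. Logging, locking and page allocation are left out.

```python
NEEDSPLIT = "DB_NEEDSPLIT"   # stand-in for the real return code

def split_page(parent, child_pos, page, capacity):
    """Split `page` (parent["children"][child_pos]) and promote its mid key."""
    # Phase 1: test only. The parent is not touched if it has no room.
    if len(parent["keys"]) + 1 > capacity:
        return NEEDSPLIT
    # Build both halves in scratch memory before touching the tree.
    mid = len(page) // 2
    left, right = page[:mid], page[mid:]
    # Phase 2: the parent has room, so commit the split.
    parent["keys"].insert(child_pos, right[0])          # promoted separator
    parent["children"][child_pos:child_pos + 1] = [left, right]
    return None
```

When the function reports NEEDSPLIT the caller has to go up one level and split the parent first, which is the UP/DOWN walk described earlier.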

__bam_psplit: move half of the items from one page to another. Both __bam_root and __bam_page call it to do the split work.
It does not do any logging.
When we want to read/write a page, we should always first lock the page in the correct mode, then call the mempool methods to get the page, and modify it. After the
modification we should first put it back to the mempool, then unlock the page if needed (txnal locks are not released here). This way there will always be only one
AM executing in the mpool, although there may be background threads doing trickling, checkpointing, dirty-page syncing, etc.


