Berkeley DB Source Code Analysis (2) --- Btree Implementation (1)

Original post, 2012-03-25 15:52:53

II. Type Dictionary

BTREE: The structure behind the DB handle's DB->bt_internal pointer. It stores
per-process, per-DB-handle btree info and function pointers.

BTMETA: The btree meta page structure, shared by all processes. It stores what is
in the btree's meta page, including all btree-specific db-wide info and the
common AM db-wide info.

BTREE_CURSOR: The btree cursor structure. It contains the common cursor fields
defined in the macro __DBC_INTERNAL, which are shared by all types of cursors,
plus (mainly) a page stack for btree searching.
In the __dbc type we keep a pointer to the __dbc_internal type, which acts as the
"base class" for all cursor types. In practice we allocate memory for a
BTREE_CURSOR (or another AM-specific cursor type) and cast the pointer to the
specific cursor type before using it. We never use the __dbc_internal type
directly.
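
The "base class" arrangement described above can be sketched in C roughly as follows. This is a minimal illustration only: every name starting with sketch_ or SKETCH_ is hypothetical, and the field lists are not the real Berkeley DB definitions.

```c
#include <stdlib.h>

/* Hypothetical stand-in for __DBC_INTERNAL: fields shared by every cursor type. */
#define SKETCH_CURSOR_COMMON \
    unsigned int pgno;       \
    unsigned int indx;

/* "Base class": only the common fields. Never allocated directly. */
struct sketch_cursor_base { SKETCH_CURSOR_COMMON };

/* Btree-specific cursor: the common fields come first, then a page stack. */
struct sketch_btree_cursor {
    SKETCH_CURSOR_COMMON
    unsigned int stack[8];   /* pages visited during a search */
    int depth;               /* number of valid entries in stack */
};

/* Generic cursor handle: it only knows the base type. */
struct sketch_dbc { struct sketch_cursor_base *internal; };

/* Allocate the btree-specific structure, but store it as a base pointer. */
int sketch_btree_cursor_open(struct sketch_dbc *dbc)
{
    struct sketch_btree_cursor *cp = calloc(1, sizeof(*cp));

    if (cp == NULL)
        return -1;
    dbc->internal = (struct sketch_cursor_base *)cp;
    return 0;
}

/* Btree code casts back to the specific type before touching any field. */
void sketch_btree_push(struct sketch_dbc *dbc, unsigned int pgno)
{
    struct sketch_btree_cursor *cp = (struct sketch_btree_cursor *)dbc->internal;

    if (cp->depth < 8)
        cp->stack[cp->depth++] = pgno;
    cp->pgno = pgno;
}

void sketch_btree_cursor_close(struct sketch_dbc *dbc)
{
    free(dbc->internal);
    dbc->internal = NULL;
}
```

Because the common fields are pasted in by a macro at the start of every cursor structure, generic code and AM-specific code agree on their layout without any language-level inheritance.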

III. Macro Dictionary
P_INIT: init a non-meta page.

DBC_LOGGING: Determine, via a cursor, whether logging is in use.

LCK_COUPLE: a parameter to __db_lget. If specified, __db_lget first releases
the lock, then acquires a lock on the same lockobj with the specified lock mode.

DBC_DOWNREV: In replication, the master is allowed to run a lower version of the DB lib than the replicas.
So if the master uses a DB version older than the first version with latching
support, the replicas notice this and set this flag on all their cursors. The replicas
then use traditional mutex locking rather than shared latches.

STD_LOCKING: Whether to use standard locking. That is, the locking subsystem is started, we are not using CDS, and the cursor is not sitting on an
off-page-duplicate (opd) page/tree.

DB_MPOOL_TRY: a flag for __memp_fget, telling it to try to get a latch on the target page's buffer.
If the latch is not granted, __memp_fget returns DB_LOCK_NOTGRANTED immediately without waiting.
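
The "try, do not wait" behaviour of DB_MPOOL_TRY can be illustrated with a toy exclusive latch. This is only a sketch under assumptions: the sketch_ names and SKETCH_NOTGRANTED are hypothetical, the latch is a plain C11 atomic flag, and real buffer latches also support shared mode.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for DB_LOCK_NOTGRANTED. */
#define SKETCH_NOTGRANTED (-1)

/* Toy buffer header: one exclusive latch plus the page number it caches. */
struct sketch_buffer {
    atomic_flag latch;
    int pgno;
};

/* Try to latch the buffer. Never waits: fails at once if the latch is held. */
int sketch_try_latch(struct sketch_buffer *bp)
{
    if (atomic_flag_test_and_set(&bp->latch))
        return SKETCH_NOTGRANTED;   /* somebody else holds the latch */
    return 0;                       /* latch acquired */
}

/* Release the latch so that another caller can take it. */
void sketch_unlatch(struct sketch_buffer *bp)
{
    atomic_flag_clear(&bp->latch);
}
```

The caller decides what to do on SKETCH_NOTGRANTED, for example back off and retry later, instead of blocking inside the fetch call.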

IV. Function Dictionary

0. __bam_open
Open a btree database. It simply calls __bam_read_root after some configuration.

1. __bam_read_root  
Get the btree db's metadata page and use the info in it to initialize the BTREE structure at DB->bt_internal. The metadata was filled in by the general
DB->open call before __bam_open was called.

2. __bam_metachk
Check the validity of a btree metadata page.

3. __bam_init_meta
Initialize a metadata page's fields, i.e. the BTMETA structure's fields. Called
whenever a metadata page is created during the btree db open procedures. For
pages other than meta pages, we use P_INIT to initialize them.

4. __bam_new_file
Routed from __db_new_file.

Create a btree db file by initializing its meta page and root page. Called
during the db open process, routed from __db_new_file when the db is a btree db.
The db may be in memory or on disk. For an in-memory db, we create the pages from
the cache and mark them dirty. We mark them dirty in __memp_fget rather than after
actually writing to them, otherwise a page may get evicted before we have a chance
to mark it. For on-disk db files, we do not use the cache at this stage. Instead, we
initialize the pages in private memory and write them directly into the db file using __fop_write.

When reading/writing pages directly via __fop_read/__fop_write, we should call the
internal common page in/out functions after getting the page via __fop_read and
before writing the page via __fop_write. The __memp_fget/__memp_fput functions
call them too, as callbacks registered via __memp_pg. We have internal page in/out
callbacks for the 3 types of databases (btree, hash, queue). The internal page in/out functions mainly do
checksumming and page header byte swapping, so that database files created on
big-endian machines can be opened on little-endian machines. The user
data are never swapped, so users need to make sure the bytes they get are correct.
There is AM-specific work to do in the internal page in/out functions, so we have
a __db_pgin/__db_pgout pair (placed in db/db_conv.c), which calls AM-specific pgin/pgout functions
like __bam_pgin/__bam_pgout (placed in btree/btree_conv.c; note the file name).
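
A minimal sketch of the header-only byte swapping done at page in/out time, assuming a made-up page layout. The sketch_ names and the three-field page are hypothetical, checksumming is left out, and the real page headers have many more fields.

```c
#include <stdint.h>

/* Toy page: two header fields followed by opaque user bytes. */
struct sketch_page {
    uint32_t pgno;      /* header: page number */
    uint16_t entries;   /* header: number of entries on the page */
    uint8_t  data[8];   /* user data: never swapped */
};

static uint32_t sketch_swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

static uint16_t sketch_swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Swap the header fields only. Swapping twice restores the original page. */
static void sketch_swap_header(struct sketch_page *p)
{
    p->pgno = sketch_swap32(p->pgno);
    p->entries = sketch_swap16(p->entries);
    /* p->data is deliberately left untouched. */
}

/* Page-in: called after a page is read, if the file's byte order differs. */
void sketch_pgin(struct sketch_page *p, int needs_swap)
{
    if (needs_swap)
        sketch_swap_header(p);
}

/* Page-out: called before a page is written, same condition. */
void sketch_pgout(struct sketch_page *p, int needs_swap)
{
    if (needs_swap)
        sketch_swap_header(p);
}
```

The in and out directions are symmetric in this toy, which is why both wrappers share one helper.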

The reason we use __fop_write here is that at this point the db is not fully
opened: it is not registered in the mpool region yet.

The __memp_fget/__memp_fput functions do not do logging, so before putting a dirty
page back into the cache we should log the changes. __fop_write logs the action
itself, so there is no need to do it in __bam_new_file.

QUESTIONS:
1. In this function we do not lock the meta/root pages but use latches. Why? Don't we want txnal semantics?
2. In general, how do we guarantee txnal semantics when we release the non-txnal
metapage locks immediately after use? (This is good for performance, but how do we
ensure consistency?) Examples are __bam_read_root, __bam_new_subdb, and __db_new.
These locks are not txnal, so why can't they be replaced by latches?
In the general DBMETA meta info, only "last_pgno", "free",
"key_count" and "record_count" can be updated; the others are static fields. The
AM-specific parts have several more; for btree they are "root", "iv" and "chksum". So if
these fields do not require txnal locks, it is OK to release the locks before the txn ends.

5. __bam_new_subdb

Routed from __db_init_subdb. It initializes the subdb's meta and root pages. It
keeps the subdb's meta page locked during the entire function.
When this function is called, the db file is already registered in the mpool, so we
always use __memp_fget/__memp_fput to read/write the pages.
It calls __db_new to get a page.

Other than the above, it is quite like __bam_new_file.

6. __db_new

__db_new prefers free pages already in the db file, and
falls back to allocating a new page by extending the db file. __db_new is
seldom called because it writes the db file's metadata page, which becomes a
bottleneck and is expensive. As a result, there can be many free pages while we
are still extending the db file.

7. __db_free
Free a page and put it into the free list.
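
The allocation policy of __db_new and the free-list push of __db_free can be sketched as below. This is a toy model under assumptions: the sketch_ names are hypothetical, the free list links are kept in an array instead of on the free pages themselves, and locking and logging of the meta page are omitted.

```c
#include <stdint.h>

#define SKETCH_PGNO_INVALID 0u   /* page 0 is the meta page, so 0 can mean "none" */
#define SKETCH_MAX_PAGES    64u

/* Toy db file: the two meta fields that allocation updates, plus free-list links. */
struct sketch_file {
    uint32_t last_pgno;                    /* highest page number in the file */
    uint32_t free;                         /* head of the free list, or INVALID */
    uint32_t next_free[SKETCH_MAX_PAGES];  /* stands in for each free page's link */
};

/* Allocate a page: reuse a free page if there is one, else extend the file. */
uint32_t sketch_page_new(struct sketch_file *f)
{
    uint32_t pgno;

    if (f->free != SKETCH_PGNO_INVALID) {
        pgno = f->free;
        f->free = f->next_free[pgno];      /* pop the head of the free list */
        return pgno;
    }
    if (f->last_pgno + 1 >= SKETCH_MAX_PAGES)
        return SKETCH_PGNO_INVALID;        /* toy file is full */
    return ++f->last_pgno;                 /* extend the file by one page */
}

/* Free a page: push it onto the head of the free list. */
void sketch_page_free(struct sketch_file *f, uint32_t pgno)
{
    f->next_free[pgno] = f->free;
    f->free = pgno;
}
```

Both paths write a meta field ("free" or "last_pgno"), which is why every allocation has to touch the metadata page.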

8. __bamc_close
See I.3.

9. __memp_dirty
Mark a page dirty.

10. __bamc_del
Mark the key/data pair with B_DELETE on the page containing it, and then
mark all cursors sitting on the key/data pair with C_DELETED via __bam_ca_delete.
But do not delete the pair yet, and do not decrement the number of entries in the page. The k/d will
be deleted by the last cursor sitting on it, if that cursor is closed at this position. I think we should delete it right away if we find via __bam_ca_delete that
no other cursors are sitting on it.

Whenever we modify a page, we first lock the page via __db_lget, then get the page from the cache via __memp_fget, then optionally mark the
page dirty via __memp_dirty, then log the action using the various logging functions, and only then actually modify the page.
After that we call __memp_fput to return the page to the mpool, and finally we release the lock on the page.
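
The ordering of the steps above can be written down as a toy routine. This is a sketch only: the sketch_ names are hypothetical, locking and pinning are modelled as plain flags, the log is an in-memory array, and none of the real __db_lget/__memp_* signatures are reproduced.

```c
#define SKETCH_LOG_MAX 16

/* Toy page state: lock/pin/dirty flags plus a single value as its content. */
struct sketch_page { int locked, pinned, dirty, value; };

/* Toy log: each record keeps the old and the new value of one change. */
struct sketch_logrec { int old_value, new_value; };
struct sketch_log {
    unsigned count;
    struct sketch_logrec rec[SKETCH_LOG_MAX];
};

/* Modify a page following the order: lock, get, dirty, log, modify, put, unlock. */
int sketch_modify_page(struct sketch_page *pg, struct sketch_log *wal, int new_value)
{
    if (pg->locked || wal->count >= SKETCH_LOG_MAX)
        return -1;                            /* toy model: fail instead of waiting */

    pg->locked = 1;                           /* 1. lock the page */
    pg->pinned = 1;                           /* 2. get the page from the cache */
    pg->dirty = 1;                            /* 3. mark the page dirty */
    wal->rec[wal->count].old_value = pg->value;
    wal->rec[wal->count].new_value = new_value;
    wal->count++;                             /* 4. log the change before making it */
    pg->value = new_value;                    /* 5. actually modify the page */
    pg->pinned = 0;                           /* 6. return the page to the cache */
    pg->locked = 0;                           /* 7. release the lock */
    return 0;
}
```

The point of the ordering is that the log record exists before the page content changes, so the change can always be redone or undone from the log.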

QUESTION: How are key/data pairs deleted? This function only marks a k/d
"deleted", but does not delete it from the db pages. __bamc_close only deletes the
k/d the cursor sits on when it is closed, but other k/d pairs marked deleted by the cursor are not
deleted even when the cursor is closed. So when are all of them deleted from the db pages?
11. __bamc_count
When counting, take B_DELETE into account: items marked deleted are not counted.
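
The logical-delete scheme of __bamc_del and the counting rule of __bamc_count can be sketched together. This is an illustration under assumptions: the sketch_ names are hypothetical, SKETCH_DELETED only stands in for the B_DELETE bit, and items live in a flat array rather than on btree pages.

```c
#include <stddef.h>

/* Hypothetical stand-in for the B_DELETE flag bit. */
#define SKETCH_DELETED 0x1u

struct sketch_item { int key; int data; unsigned flags; };

/* Logical delete: flag the item but leave it in place. */
void sketch_mark_deleted(struct sketch_item *it)
{
    it->flags |= SKETCH_DELETED;
}

/* Count the items with the given key, skipping those marked deleted. */
size_t sketch_count(const struct sketch_item *items, size_t n, int key)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (items[i].key == key && !(items[i].flags & SKETCH_DELETED))
            count++;
    return count;
}
```

The deleted item still occupies its slot, so every reader has to check the flag; the slot is only reclaimed by the physical delete.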

12. __bamc_physdel & __bam_ditem

Physically delete a key/data pair, called when the last cursor sitting on the
deleted key/data pair is closed.

We call __bam_ditem twice to delete a key/data pair, and we log the op in
__bam_ditem. Following each __bam_ditem call, we call __bam_ca_di to adjust the
other cursors of this database. We do not have a function that deletes a k/d pair
from a btree leaf page in one step; I think we should have such a function.
Internal btree pages only have a single structure to store the key and pageno;
the items do not exist in pairs. Actually, except on btree leaf pages (P_LBTREE), all
other data items are stored singly.

__bam_ditem alters the btree page's index array according to the type of the btree
page and decrements the number of entries in the page, then calls __db_ditem to remove
the item from the page and log the action. Alternatively, it calls __db_doff to delete an opd overflow item
from the overflow page and free the overflow page.

When deleting the last key/data pair from a btree leaf page, the page itself,
and potentially the stack of pages leading from the root node to this leaf node,
need to be deleted. So we note down the last key K by calling __db_ret to get
the k/d pair, and then delete this last k/d
pair by calling __bam_ditem twice, each call followed by __bam_ca_di to adjust
cursors. Then we search for that last key K from the root. When the search completes,
we have in dbc->internal the stack/path of nodes leading to
this node, and we may need to delete several nodes in the stack. Imagine that the leaf page P2's parent page P1
also has only one item: when we delete P2, we also delete P1's last item and thus
delete P1, and so on.
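
The cascade described above (P2 goes, so P1 loses its only item and goes too) can be sketched as a walk up the search stack. This is a toy model: the sketch_ name is hypothetical, the stack is just an array of per-page item counts taken before the delete, and the assumption that the root page itself is never freed belongs to this sketch.

```c
#include <stddef.h>

/*
 * entries[0] is the item count of the root page, entries[depth - 1] that of
 * the leaf, both taken before the last key/data pair is removed (the pair
 * counts as one item here). Returns how many pages on the path become empty
 * and must be freed.
 */
size_t sketch_pages_to_free(const unsigned *entries, size_t depth)
{
    size_t level = depth, freed = 0;

    /* Walk from the leaf towards the root while each page holds one item. */
    while (level > 1 && entries[level - 1] == 1) {
        freed++;     /* this page loses its only item, so it is freed */
        level--;     /* and its parent loses the item that pointed to it */
    }
    return freed;
}
```

For a path with counts {5, 1, 1} both lower pages are freed and the root only loses one item; for {5, 3, 1} only the leaf is freed.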

13. __bam_stkrel
Release pages in the search stack of the cursor, put each page back to mpool
and optionally unlock each page.

14. __bamc_get
The effective part of DBC->get.
According to the flags, it dispatches to __bamc_prev, __bamc_next or
__bamc_search, or simply gets the page. The implementation of the _DUP flags
(e.g. DB_NEXT_DUP) is quite straightforward: it simply compares adjacent keys.
Similarly, for the _NODUP flags (e.g. DB_NEXT_NODUP), it simply
iterates over the k/d pairs with identical keys until it gets a different key.
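
The adjacent-key comparison behind the _DUP and _NODUP flags can be sketched on a sorted array of keys. This is a simplified model: the sketch_ names and SKETCH_NOTFOUND are hypothetical, positions are array indexes instead of (pgno, indx), and deleted items and page boundaries are ignored.

```c
#include <stddef.h>

/* Hypothetical "no such item" result. */
#define SKETCH_NOTFOUND ((size_t)-1)

/* Next duplicate: succeed only if the adjacent key equals the current key. */
size_t sketch_next_dup(const int *keys, size_t n, size_t cur)
{
    if (cur + 1 < n && keys[cur + 1] == keys[cur])
        return cur + 1;
    return SKETCH_NOTFOUND;
}

/* Next non-duplicate: skip items with the current key until the key changes. */
size_t sketch_next_nodup(const int *keys, size_t n, size_t cur)
{
    size_t i = cur + 1;

    while (i < n && keys[i] == keys[cur])
        i++;
    return i < n ? i : SKETCH_NOTFOUND;
}
```

Both functions assume cur is a valid index into keys and that keys is sorted, so equal keys are always adjacent.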

15. __bamc_prev, __bamc_next
Get the item from the next/prev page, or from the current page, and update the pgno
and indx in dbc->internal. Note that in these 2 functions we may be on an opd page or on an
ordinary btree leaf page.
These 2 functions plus __bamc_search only read data; they do not actually
modify the page. So by default, if we need to get another page, we read-lock
it; if DB_RMW is set, we write-lock it instead.

The 2 functions can skip empty pages, as well as deleted k/d pairs or key items in
btree internal pages.

QUESTION: Strangely enough, a k/d marked deleted is not physically deleted even when the
cursor moves away from it. So when is it deleted?


