Berkeley DB Source Code Analysis (2) --- Btree Implementation (1)

Original post, 2012-03-25 15:52:53

II. Type Dictionary

BTREE: the structure behind the DB handle's DB->bt_internal pointer. It stores
per-process, per-DB-handle btree info and function pointers.

BTMETA: the btree meta page structure, shared by all processes. It holds the contents of the
btree's meta page: all btree-specific, database-wide info plus the
database-wide info common to every access method (AM).

BTREE_CURSOR: contains the common cursor fields defined in the macro __DBC_INTERNAL, which are shared
by all types of cursors, plus btree-specific state, mainly a page stack for btree searching.
The __dbc type holds a pointer to the __dbc_internal type, which serves as the "base class" for all cursor types.
In practice we allocate memory for BTREE_CURSOR or another specific type and cast the pointer to
that type before using it. We never use the __dbc_internal type directly.
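The "base class" arrangement can be illustrated with a minimal sketch. None of the names below come from the Berkeley DB source: SKETCH_CURSOR_COMMON stands in for __DBC_INTERNAL, and the struct layouts are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for __DBC_INTERNAL: fields every cursor type shares. */
#define SKETCH_CURSOR_COMMON \
    unsigned int pgno;  /* page the cursor is on */ \
    unsigned int indx;  /* index within that page */

/* "Base class": the common fields only. Never allocated on its own. */
struct sketch_cursor_base {
    SKETCH_CURSOR_COMMON
};

/* Btree cursor: the same common fields first, then btree-only state. */
struct sketch_btree_cursor {
    SKETCH_CURSOR_COMMON
    unsigned int stack[8];  /* page numbers on the search path */
    int stack_depth;
};

/* Generic cursor handle: it only knows about the base type. */
struct sketch_dbc {
    struct sketch_cursor_base *internal;
};

/* Allocate the btree-specific struct but store it through the base pointer. */
static int sketch_btree_cursor_init(struct sketch_dbc *dbc)
{
    struct sketch_btree_cursor *cp = calloc(1, sizeof(*cp));

    if (cp == NULL)
        return -1;
    cp->pgno = 1;
    cp->indx = 0;
    dbc->internal = (struct sketch_cursor_base *)cp;
    return 0;
}
```

Code that needs the btree-only fields casts dbc->internal back to the btree cursor type; code that only needs pgno and indx can use the base pointer as-is.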

III. Macro Dictionary
P_INIT: init a non-meta page.

DBC_LOGGING: Determine, via a cursor, whether logging is in use.

LCK_COUPLE: a parameter to __db_lget. If specified, __db_lget first
releases the lock and then acquires a lock on the same lockobj with specified lock

DBC_DOWNREV: In replication, the master is allowed to run an older version of the DB library than the replicas.
If the master runs a DB version older than the first version with
latching support, replicas notice this and set this flag on all their cursors, and the replicas
use traditional mutex locking rather than shared latches.

STD_LOCKING: Whether to use standard locking, that is: the locking subsystem is started, we are not using CDS, and the cursor is not sitting on an
off-page-duplicate (OPD) page/tree.

DB_MPOOL_TRY: a flag for __memp_fget, telling it to try to get a latch on the target page's buffer.
If the latch is not granted, __memp_fget returns DB_LOCK_NOTGRANTED immediately without waiting.
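The try-latch behaviour can be illustrated with a minimal sketch. Nothing here is Berkeley DB code: sketch_buf, sketch_latch_try and SKETCH_NOTGRANTED are hypothetical names, and a C11 atomic_flag is assumed as a stand-in for the buffer latch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_OK          0
#define SKETCH_NOTGRANTED  (-1)  /* hypothetical stand-in for DB_LOCK_NOTGRANTED */

/* Hypothetical buffer header: one exclusive latch per cached page. */
struct sketch_buf {
    atomic_flag latch;
};

/* Try to take the latch exactly once; never block or spin. */
static int sketch_latch_try(struct sketch_buf *b)
{
    assert(b != NULL);
    /* test_and_set returns the previous state: true means already held. */
    if (atomic_flag_test_and_set(&b->latch))
        return SKETCH_NOTGRANTED;
    return SKETCH_OK;
}

static void sketch_latch_release(struct sketch_buf *b)
{
    assert(b != NULL);
    atomic_flag_clear(&b->latch);
}
```

The caller decides what to do on SKETCH_NOTGRANTED, for example backing off and retrying later, instead of waiting inside the get call.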

IV. Function Dictionary

0. __bam_open
Open a btree database. It simply calls __bam_read_root after some config

1. __bam_read_root  
Get the btree db's metadata page and use the info in it to initialize the BTREE structure at DB->bt_internal. The metadata was filled in by the general
DB->open call before __bam_open was called.

2. __bam_metachk
Check the validity of a btree metadata page.

3. __bam_init_meta
Initialize a metadata page's fields, i.e. the BTMETA structure's fields. Called
whenever a metadata page is created during the btree db open procedure. For
pages other than meta pages, we use P_INIT to initialize them.

4. __bam_new_file
Routed from __db_new_file.

Create a btree db file by initializing its meta page and root page. Called
during the db open process, routed from __db_new_file when the db is a btree db.
The db may be in memory or on disk. For an in-memory db, we create the page from the cache
and mark it dirty (we mark it in __memp_fget rather than after actually writing
to it, otherwise the page may get evicted before we have a chance to mark it).
For on-disk db files, we don't use the cache for now; instead, we initialize the page in
private memory and write the pages directly into the db file using __fop_write.

When reading or writing pages directly via __fop_read/__fop_write, we should call the
internal common page in/out functions after getting the page via __fop_read and
before writing the page via __fop_write. The __memp_fget/__memp_fput functions
call them too, as callbacks registered via __memp_pg. We have internal page in/out
callbacks for the 3 types of databases (btree, hash, queue). The internal page in/out functions mainly do
checksumming and page header byte swapping, so that database files created on
big-endian machines can be opened on little-endian machines. User
data is never swapped, so users need to make sure the bytes they get are correct.
There is AM-specific work to do in the internal page in/out functions, so we have
a __db_pgin/__db_pgout pair (placed in db/db_conv.c), which calls AM-specific pgin/pgout functions
like __bam_pgin/__bam_pgout (placed in btree/btree_conv.c, note the file name
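The header byte swapping done by the page in/out functions can be sketched as follows. This is only an illustration under assumptions: the header layout and the sketch_* names are hypothetical, checksumming is left out, and the real functions handle many more fields and page types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified page header; not the real on-disk layout. */
struct sketch_page_hdr {
    uint32_t pgno;
    uint32_t prev_pgno;
    uint32_t next_pgno;
    uint16_t entries;
    uint16_t hf_offset;
};

static uint32_t sketch_swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

static uint16_t sketch_swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Swap every header field; user key/data bytes are deliberately untouched. */
static void sketch_hdr_swap(struct sketch_page_hdr *h)
{
    h->pgno = sketch_swap32(h->pgno);
    h->prev_pgno = sketch_swap32(h->prev_pgno);
    h->next_pgno = sketch_swap32(h->next_pgno);
    h->entries = sketch_swap16(h->entries);
    h->hf_offset = sketch_swap16(h->hf_offset);
}

/* Page-in: run right after reading a page from the file. */
static void sketch_pgin(struct sketch_page_hdr *h, int needs_swap)
{
    assert(h != NULL);
    if (needs_swap)
        sketch_hdr_swap(h);
}

/* Page-out: run right before writing a page to the file. */
static void sketch_pgout(struct sketch_page_hdr *h, int needs_swap)
{
    assert(h != NULL);
    if (needs_swap)
        sketch_hdr_swap(h);
}
```

Here needs_swap would be true when the file's byte order differs from the host's. Because a byte swap is its own inverse, page-out followed by page-in restores the original header.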

We use __fop_write here because at this point the db is not fully
opened: it is not registered in the mpool region yet.

__memp_fget/put functions do not do logging, so before putting a dirty page
back to the cache, we should log changes; __fop_write logs the action, so no
need to do it in __bam_new_file.

QUESTION 1: In this function we don't lock the meta/root pages but use latches. Why? Don't we want transactional semantics?
QUESTION 2: In general, how do we guarantee transactional semantics when we release the metapage's
non-transactional locks immediately after use? (This is good for performance, but how do we ensure
consistency?) Examples are __bam_read_root, __bam_new_subdb, and __db_new.
These locks are not transactional, so why can't they be replaced by latches?
In the general DBMETA meta info, only "last_pgno", "free",
"key_count" and "record_count" can be updated; the others are static fields. The AM-specific
parts have several more; for btree they are "root", "iv" and "chksum". So if
these fields don't require transactional locks, it's OK to release locks before txn

5. __bam_new_subdb

Routed from __db_init_subdb. It initializes the subdb's meta and root pages, and
holds a lock on the subdb's meta page for the entire function.
When this function is called, the db file is already registered in the mpool, so we
always use __memp_fget/__memp_fput to read/write the pages.
It calls __db_new to get a page.

Other than the above, it is much like __bam_new_file.

6. __db_new

__db_new prefers free pages in the db file and
falls back to allocating a new page by extending the db file. __db_new is
seldom called because it writes the db file's metadata page, which becomes a
bottleneck and is expensive; thus there can be many free pages while we are
still extending the db file.

7. __db_free
Free a page and put it into the free list.
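The allocation policy described for __db_new, together with the free-list insertion done by __db_free, can be sketched roughly as follows. This is a toy model under assumptions: the struct layouts and sketch_* names are hypothetical, the free list is kept in an in-memory array instead of on the pages themselves, and locking and logging are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PGNO_INVALID 0u   /* page 0 is the meta page, so 0 means "none" */
#define SKETCH_MAX_PAGES    16u

/* Hypothetical subset of the meta page: free-list head and last page number. */
struct sketch_meta {
    unsigned int free;
    unsigned int last_pgno;
};

struct sketch_file {
    struct sketch_meta meta;
    unsigned int next_free[SKETCH_MAX_PAGES];  /* free-list links per page */
};

/* Allocate a page: reuse a free page if one exists, else extend the file. */
static unsigned int sketch_page_alloc(struct sketch_file *f)
{
    unsigned int pgno;

    assert(f != NULL);
    pgno = f->meta.free;
    if (pgno != SKETCH_PGNO_INVALID) {
        f->meta.free = f->next_free[pgno];      /* pop the list head */
        f->next_free[pgno] = SKETCH_PGNO_INVALID;
        return pgno;
    }
    if (f->meta.last_pgno + 1 >= SKETCH_MAX_PAGES)
        return SKETCH_PGNO_INVALID;             /* toy file is full */
    return ++f->meta.last_pgno;                 /* extend the file */
}

/* Free a page: push it onto the head of the free list. */
static void sketch_page_free(struct sketch_file *f, unsigned int pgno)
{
    assert(f != NULL && pgno != SKETCH_PGNO_INVALID && pgno < SKETCH_MAX_PAGES);
    f->next_free[pgno] = f->meta.free;
    f->meta.free = pgno;
}
```

Both paths update fields of the meta structure, which is why every allocation or free has to touch the metadata page.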

8. __bamc_close
See I.3.

9. __memp_dirty
Mark a page dirty.

10. __bamc_del
Mark the key/data pair with B_DELETE on the page containing it, and then
mark all cursors sitting on the key/data pair with C_DELETED via __bam_ca_delete.
We do not delete the pair yet or decrement the number of entries in the page; the k/d will
be deleted by the last cursor sitting on it, if that cursor is closed at this position. I think we should delete it immediately if __bam_ca_delete shows that
no other cursors are sitting on it.
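The two-phase delete (mark first, physically remove when the last cursor leaves) can be sketched as follows. The struct and function names are hypothetical, and a simple per-item cursor count is assumed in place of the real cursor-adjustment machinery.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_B_DELETE 0x1u  /* hypothetical flag bit, named after B_DELETE */

/* Hypothetical on-page item with a count of cursors positioned on it. */
struct sketch_item {
    unsigned int flags;
    int ncursors;   /* cursors currently sitting on this item */
    int present;    /* 1 while the item physically exists on the page */
};

/* Logical delete: only set the flag, leave the item on the page. */
static void sketch_cursor_del(struct sketch_item *it)
{
    assert(it != NULL);
    it->flags |= SKETCH_B_DELETE;
}

/* Close one cursor on the item; the last one out removes a marked item. */
static void sketch_cursor_close(struct sketch_item *it)
{
    assert(it != NULL && it->ncursors > 0);
    it->ncursors--;
    if (it->ncursors == 0 && (it->flags & SKETCH_B_DELETE))
        it->present = 0;   /* physical delete */
}
```

With two cursors on a marked item, closing the first leaves the item in place and closing the second removes it.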

Whenever we modify a page, we first lock the page via __db_lget, then get the page from the cache via __memp_fget, then optionally mark the
page dirty via __memp_dirty, then log the action using the various logging functions, and only then actually modify the page.
Then we call __memp_fput to return the page to the mpool, and finally we release the lock on the page.
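The ordering of these steps can be made concrete with a small sketch. Every function here is a hypothetical stub that just records its name; the comments note which real call each step corresponds to in the description above, and no real signatures are implied.

```c
#include <assert.h>
#include <string.h>

/* Records the order of steps as a comma-separated list. */
static char sketch_trace[128];

static void sketch_step(const char *name)
{
    if (sketch_trace[0] != '\0')
        strcat(sketch_trace, ",");
    strcat(sketch_trace, name);
}

/* Hypothetical page state touched by each step. */
struct sketch_page {
    int locked;
    int pinned;
    int dirty;
    int lsn;
    int value;
};

/* Write-ahead ordering: the log record is written before the page changes. */
static void sketch_modify_page(struct sketch_page *pg, int new_value)
{
    pg->locked = 1;        sketch_step("lock");    /* __db_lget */
    pg->pinned = 1;        sketch_step("get");     /* __memp_fget */
    pg->dirty = 1;         sketch_step("dirty");   /* __memp_dirty */
    pg->lsn += 1;          sketch_step("log");     /* logging function */
    pg->value = new_value; sketch_step("modify");  /* the actual change */
    pg->pinned = 0;        sketch_step("put");     /* __memp_fput */
    pg->locked = 0;        sketch_step("unlock");  /* release the lock */
}
```

The point of the sketch is only the sequence: the lock brackets everything, the cache pin brackets the change, and logging comes before the modification.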

QUESTION: how are key/data pairs deleted? This function only marks a k/d
"deleted", but doesn't delete it from the db pages. __bamc_close only deletes the
k/d the cursor sits on when closed, but other k/d pairs marked deleted by the cursor are not
deleted even when the cursor is closed. So when are all of them deleted from

11. __bamc_count
When counting, skip items marked B_DELETE; they are not counted.
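A minimal sketch of counting while skipping logically deleted items, assuming the count is over the duplicates of the cursor's current key and that duplicates sit next to each other on a sorted page. The struct and names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_B_DELETE 0x1u  /* hypothetical flag bit, named after B_DELETE */

struct sketch_entry {
    const char *key;
    unsigned int flags;
};

/* Count entries sharing the key at position cur, ignoring deleted ones. */
static int sketch_count(const struct sketch_entry *e, int n, int cur)
{
    int first = cur, i, count = 0;

    assert(e != NULL && cur >= 0 && cur < n);
    /* Walk back to the first entry with this key. */
    while (first > 0 && strcmp(e[first - 1].key, e[cur].key) == 0)
        first--;
    /* Walk forward over the run of equal keys. */
    for (i = first; i < n && strcmp(e[i].key, e[cur].key) == 0; i++)
        if (!(e[i].flags & SKETCH_B_DELETE))
            count++;
    return count;
}
```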

12. __bamc_physdel & __bam_ditem

Physically delete a key/data pair. Called when the last cursor sitting on the
deleted key/data pair is closed.

We call __bam_ditem twice to delete a key/data pair, and the op is logged in
__bam_ditem. Following each __bam_ditem call, we call __bam_ca_di to adjust
the other cursors of this database. We don't have a function to delete a k/d pair
from a btree leaf page in one step; I think we should have such a function.
Internal btree pages have only a single structure to store the key and page number;
those items don't exist in pairs. Actually, except for btree leaf pages (P_LBTREE), all
other data items exist singly.

__bam_ditem adjusts the btree page's index array according to the type of btree
page and decrements the number of entries in the page, then calls __db_ditem to remove
the item from the page and log the action, or calls __db_doff to delete an OPD overflow item
from the overflow page and free the overflow page.

When deleting the last key/data pair from a btree leaf page, the page itself,
and potentially the stack of pages leading from the root node to this leaf node,
need to be deleted. So we note down the last key K by calling __db_ret to get
the k/d pair, and then delete this last k/d
pair by calling __bam_ditem twice, each call followed by __bam_ca_di to adjust
cursors. Then we search for that last key K from the root. When the search completes,
we have in dbc->dbc_internal the stack/path of nodes leading to
this node, and we may need to delete several nodes in the stack: imagine the leaf page P2's parent page P1
also has only one item. When we delete P2, we also delete P1's last item and thus
delete P1, and so on.
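The cascade up the search stack can be sketched as a small calculation: given the number of items on each page along the path, find the highest page that must be freed. The function and its array representation are hypothetical, and the sketch assumes the root page is never freed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * entries[i] is the item count of the page at level i on the search path,
 * where level 0 is the root and level depth-1 is the (now empty) leaf.
 * Returns the lowest level number whose page must be freed; every page
 * from that level down to the leaf goes away.
 */
static int sketch_first_level_to_free(const int *entries, int depth)
{
    int level;

    assert(entries != NULL && depth >= 2);
    level = depth - 1;                 /* the emptied leaf is always freed */
    /* A parent holding a single item becomes empty once its child is freed. */
    while (level > 1 && entries[level - 1] == 1)
        level--;
    return level;
}
```

For a path with item counts 5, 1, 1, 0 from root to leaf, the function returns 1: the leaf and both single-item ancestors are freed, and the root just loses one item.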

13. __bam_stkrel
Release pages in the search stack of the cursor, put each page back to mpool
and optionally unlock each page.

14. __bamc_get
The effective part of DBC->get.
According to the flags, it dispatches calls to __bamc_prev, __bamc_next,
__bamc_search, or simply gets the page. The implementation of the _DUP flags is quite straightforward:
it simply compares adjacent keys. Similarly, for the NO_DUP flags, it simply
iterates over the k/d pairs with identical keys until it gets a different key.
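The adjacent-key comparison behind the duplicate-aware moves can be sketched on a sorted array of keys. The function names and the array representation are hypothetical; the real code works on page items and may cross page boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Next duplicate": advance only if the adjacent entry has the same key. */
static int sketch_next_dup(const char *const *keys, int n, int cur)
{
    assert(keys != NULL && cur >= 0 && cur < n);
    if (cur + 1 < n && strcmp(keys[cur + 1], keys[cur]) == 0)
        return cur + 1;
    return -1;   /* no more duplicates of this key */
}

/* "Next non-duplicate": skip entries with the same key until it changes. */
static int sketch_next_nodup(const char *const *keys, int n, int cur)
{
    int i;

    assert(keys != NULL && cur >= 0 && cur < n);
    for (i = cur + 1; i < n; i++)
        if (strcmp(keys[i], keys[cur]) != 0)
            return i;
    return -1;   /* ran off the end without finding a different key */
}
```

Both functions return an index into the array, or -1 when there is nothing to move to.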

15. __bamc_prev, __bamc_next
Get from the next/prev page, or from the current page. They alter DBC->dbc_internal's pgno
and indx. Note that in these 2 functions we may be on an OPD page or an ordinary btree
leaf page.
The 2 functions plus __bamc_search only read data; they don't
modify the page. So by default, if we need to get another page, we read-lock
it; if DB_RMW is set, we write-lock it instead.

The 2 functions can skip empty pages, as well as deleted k/d pairs or key items in
btree internal pages.

QUESTION: Strangely enough, a k/d marked deleted is not physically deleted even when the
cursor moves away from it. So when is it deleted?

