Berkeley DB Source Code Analysis (6): The Cache Module

Original post, 2012-03-25 15:56:57


= Memory Pool subsystem =

== Architecture ==

mpool is the underlying cache for all access methods (AMs). Its responsibilities are:
* It retrieves requested pages from the database file and caches them in the mpool region.
* It provides pages to AMs, doing the necessary synchronization and reference counting so that multiple threads of control can access pages consistently.
* It writes pages to disk when requested (for example, by a checkpoint or a DB->sync/DB->close call) or when free memory is required.
* When the cache is full, it evicts unused pages to make room for required pages.
* It implements MVCC concurrency control to allow non-blocking reads.

The mpool manages both the database files and the pages inside these files. It needs to be efficient because there can be millions of pages in the cache.

The mpool manages pages using an array of hash buckets. A page is hashed to a bucket using its page number (pgno) and the offset, within the cache region, of the structure describing its containing file; this offset is a dynamic, global property of an open database file. Each bucket has a doubly linked list of buffers for the pages that hash to that bucket. There is also a doubly linked list of MPOOL_FILE structures, which represent the database files opened in the pool.

mpool implements multi-version concurrency control (MVCC). MVCC allows Berkeley DB to ensure that a reader never blocks, even if the data has changed. This is accomplished by having writers update a new version of the data while the older version stays available to the reading transactions that still need it. This is done at the granularity of a page.

If the database environment is shared (not DB_PRIVATE), mpool lives in a shared memory region. Any process (or thread) that opens the environment gets an environment handle (dbenv) which references the same cache and the same hash table of pages and files. A shared memory region may be either system memory or a memory mapped file.

In some special scenarios (accessing a small db file read-only), mpool simply maps the whole db file into the cache. This avoids hashing the pages into a hash table, which improves performance.

The mpool supports multiple cache regions. The reason for this is twofold. First, some systems limit the size of a shared memory region (often to 2 gigabytes), and this may not be big enough on a large system. Second, mpool can support dynamic growth of the cache by adding (or removing) a shared memory region under application control.

Multiple threads of control (toc) can access a database simultaneously, and because mpool holds the data that end users ultimately want, it is a hot spot. mpool must protect its data from simultaneous updates and accesses while permitting as much concurrency as possible. Generally, an accessing thread only needs to lock a single hash bucket to get access to a page.

== Key functionality implementation ==

=== General region space management (all regions use the same methods) ===
A region consists of a structure containing information about the region, followed by a contiguous chunk of memory space. The information structure is identical across all regions and sits at the beginning of the region address space.

Different regions store different types of data in the memory space. There can be many space allocations and frees during the lifetime of a region, so it is important to manage the free space efficiently. Adjacent free blocks should be combined into a single larger block; otherwise we end up with fragmentation (a lot of tiny chunks). Region free space is managed in the following way:
* Each region has 12 lists: 11 size lists and one address list.
* At the head of each chunk of space there is a header, which is linked into a size list and the address list. It contains information about the size of the chunk.
* Among the 11 size lists, the i-th list holds the chunks that are smaller than 1024*2^(i-1) bytes.
* The last size list holds chunks larger than 512K.
* In each size list, the chunks are ordered by increasing size.
* The chunks are linked into the address list in increasing address order. Inserts into the address list need to maintain this order.
* When a chunk is freed and an adjacent chunk is also free, the two are combined into one larger chunk.
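
The list selection and the merging of adjacent free chunks described above can be illustrated with a small Python sketch. This is not the Berkeley DB code: the names `size_list_index` and `free_chunk` are hypothetical, the free list is modelled as a plain sorted list of (addr, size) pairs, and placing a chunk of exactly 512K in the last list is an assumption, since the notes do not state the boundary case.

```python
import bisect

NUM_SIZE_LISTS = 11  # ten bounded size lists plus one catch-all list


def size_list_index(size):
    """Return the 1-based index of the size list for a free chunk of `size` bytes."""
    for i in range(1, NUM_SIZE_LISTS):
        if size < 1024 * 2 ** (i - 1):
            return i
    # Assumption: anything of 512K or more lands in the last list.
    return NUM_SIZE_LISTS


def free_chunk(free_chunks, addr, size):
    """Add a freed chunk to an address-ordered list of (addr, size) pairs,
    merging it with any free neighbour that touches it."""
    pos = bisect.bisect_left(free_chunks, (addr, size))
    # Merge with the following free chunk if it starts where this one ends.
    if pos < len(free_chunks) and free_chunks[pos][0] == addr + size:
        size += free_chunks[pos][1]
        del free_chunks[pos]
    # Merge with the preceding free chunk if it ends where this one starts.
    if pos > 0 and sum(free_chunks[pos - 1]) == addr:
        prev_addr, prev_size = free_chunks[pos - 1]
        addr, size = prev_addr, prev_size + size
        del free_chunks[pos - 1]
        pos -= 1
    free_chunks.insert(pos, (addr, size))
    return addr, size
```

Keeping the address list ordered is what makes the merge cheap: the only candidates for merging are the immediate neighbours of the insertion point.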

=== mvcc (answer from Alex) ===

Q1: What do "frozen" and "thaw" mean in context of the mpool?
If the cache becomes full of page copies before old copies can be discarded, pages are written to temporary "freezer" files on disk. When the content of one of these files is required, it is "thawed" back into the cache.

Q2: How does the Berkeley DB mvcc implementation differ from a "time stamp" based mvcc?
Berkeley DB uses a transaction ID based mechanism. I guess that a time stamp based method would be similar, since 'time' would need to be measured
on an operation by operation scale (or you would run into multi-threading issues).

Q3: If creating a new write transaction, does it create a new page based on the last one in the mvcc chain? If so, what if the previous version is
I think the way this works is: Even with MVCC enabled transactions, there can only be one active write on a page at a time. When a transaction
performs a write, a copy of the page is made and a new entry added to the head of the list of pages. The older version is retained if and only
if there are currently transactions open and reading the page being written.
After the first write transaction has committed, two versions of the page will be retained (until the initial read transaction finishes). If you
have a new write operation acting on the page, it will create a new copy - generating a chain of three pages.
It is not possible to have multiple write transactions operating on the same page (even different versions of the same page) simultaneously.
So with MVCC you can ensure that
1) It will always be possible to get a read lock on a page.
2) You are guaranteed that the page will not change as long as the transaction containing the read operation is open.

A good starting point for reading about MVCC:

Some useful resources about the Berkeley DB MVCC implementation follow.

The Getting Started Guide, has an overview and some sample code:

The reference guide:


The original design discussion of MVCC is in SR 6770. A draft design doc is here:

Some of the implemented details will differ slightly to the design document.

=== Page management (hash table implementation and hash algorithm) ===
* Small read-only db files (<10M by default) can be mmapped into the cache region, so that the mapped pages can be accessed directly without being put into the hash table.
* There can be multiple cache regions. Only the first one holds the global info about the cache, but each has its own hash table. Every cache region has the same number of buckets, and bucket numbers increase across cache regions. The number of buckets per cache region is computed from the cache size when the region is opened: we assume each page is 1K and require fewer than 10 buffers per bucket on average, so num-of-bucket-in-a-hashtable=cachesize/(10K). The user can later resize the cache, and buckets can be added or removed.
* How is the hash key for a page computed? From (pgno, file-offset): KEY=((pgno) ^ ((mf_offset) * 509)).
Before a page is brought into the cache, its containing file must be opened if it is not already open. Opening sets up an info structure for the file and puts it in the cache region for shared use, so file-offset is the relative address of this structure in the region. For the shared information to be usable by all processes attached to the db region, the "addresses" and "pointers" stored in the global region can only be offset values rather than absolute addresses, because the region file is mapped to a different memory location in each process, and an absolute address only works in one process's own address space. So we use "the offset from the first byte of the region" as the pointer value in a shared region. To dereference such a "relative pointer", we obtain its absolute address by adding a base address, that is, the address of the first byte of the mapped-in region in this process's address space. All regions in bdb use this technique to maintain shared info and use it in each process's address space.
* mvcc support
Each page buffer contains one page, but if we are using mvcc, it may also carry a list of version buffers, which are the different versions currently held by different txns. With mvcc, when a txn wants to read a page, it never has to wait: it simply makes a new version buffer, copies the page in the page buffer into the version buffer, and inserts this buffer at the head of the version list. When this txn commits, the version buffer is no longer used, and it becomes the preferred buffer to evict when cache space has to be freed. When a txn wants to write a page, it has to wait until the previous writer commits; it then makes a new version as described above and writes the page, and before it commits it copies this version to the page buffer, so that the page is updated.
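
The bucket count, hash key, and relative pointer rules above can be written down as a short Python sketch. The function names are hypothetical. Masking the key to 32 bits is an assumption that mimics unsigned C arithmetic, and so is the floor of one bucket for very small caches.

```python
HASH_MULTIPLIER = 509  # the constant from the KEY formula in the notes


def buckets_for_cache(cache_bytes):
    """Assume 1K pages and about 10 buffers per bucket: cachesize / 10K.
    The minimum of one bucket is an assumption of this sketch."""
    return max(1, cache_bytes // (10 * 1024))


def page_hash(pgno, mf_offset):
    """KEY = pgno ^ (mf_offset * 509), truncated to 32 bits (assumed width)."""
    return (pgno ^ (mf_offset * HASH_MULTIPLIER)) & 0xFFFFFFFF


def to_offset(region_base, address):
    """Turn an absolute address into a region-relative 'pointer'."""
    return address - region_base


def to_address(region_base, offset):
    """Resolve a region-relative 'pointer' in this process's address space."""
    return region_base + offset
```

For example, a structure stored at offset 0x40 resolves to 0x5040 in a process that mapped the region at 0x5000 and to 0x7040 in a process that mapped it at 0x7000, while the value stored in the shared region stays 0x40 for both.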

=== Cache space management (space allocation, eliminating low priority pages) ===

* The region space management algorithms are used for normal allocation and free.
* When there is no free space in the cache at the time we want to allocate a page, we need to discard some low priority buffers in order to reuse their space. The preferred buffers to free are unused version buffers, then the lowest priority singleton buffers, followed by the lowest priority buffer of all. We scan the buffer pool to find buffers with low priorities. We consider a small set of hash buckets (2 buckets) each time to limit the amount of work needing to be done, and pick the better candidate of the two. We either find a buffer of the same size to use, or we free 3 times what we need in the hope that the freed space will coalesce into a contiguous chunk of the right size. Once we have freed 3 times the amount we need, we try the allocation; if it still fails, we continue the search-and-free process until we either allocate the space or fail. This approximates LRU, but not very well. If one round (scanning two buckets) is not enough, we go on freeing, and after 2 rounds we become aggressive. The failure mode is when there are too many buffers we can't write or there's not enough memory in the system to support the number of pinned buffers.

We get aggressive if we've reviewed the entire cache without freeing the needed space. (The code resets "aggressive" when we free any space.) Aggressive means:

* Set a flag to attempt to flush high priority buffers as well as other buffers.
* Sync the mpool to force out queue extent pages. While we might not have enough space for what we want and flushing is expensive, why not?
* Look at a buffer in every hash bucket rather than choosing the more preferable of two.
* Consider freeing mvcc version buffers that are not at the end of the version list; we have to freeze such a buffer, and thaw it the next time it is used.
* Start to think about giving up.
* If we get here twice, sleep for a second; hopefully someone else will run and free up some memory.

Always try to allocate memory too, in case some other thread returns its memory to the region.
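
The non-aggressive part of this search can be sketched in Python as follows. It is a simplified model, not the actual allocator: the `Buffer` class and the function names are hypothetical, buckets are plain lists, and the preference for version buffers, the aggressive mode, and the retry of the real allocation are all left out.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    pgno: int
    priority: int  # lower means a better eviction candidate
    size: int
    pinned: bool = False


def candidate(bucket):
    """Lowest priority unpinned buffer of one bucket, or None."""
    unpinned = [b for b in bucket if not b.pinned]
    return min(unpinned, key=lambda b: b.priority, default=None)


def free_for_alloc(buckets, need):
    """Look at two buckets per step and evict the better candidate of the pair.
    Stop on an exact size match or once 3 * need bytes have been freed.
    Returns (evicted page numbers, whether the allocation should be tried)."""
    freed = 0
    evicted = []
    for i in range(0, len(buckets), 2):
        pairs = [(candidate(b), b) for b in buckets[i:i + 2]]
        pairs = [(c, b) for c, b in pairs if c is not None]
        if not pairs:
            continue
        victim, home = min(pairs, key=lambda cb: cb[0].priority)
        home.remove(victim)
        evicted.append(victim.pgno)
        if victim.size == need:
            return evicted, True  # same-sized buffer: reuse it directly
        freed += victim.size
        if freed >= 3 * need:
            return evicted, True  # hope the freed space coalesces
    return evicted, False
```

Freeing three times the requested amount is a heuristic: the freed chunks are scattered, so over-freeing raises the chance that some of them are adjacent and merge into one chunk that is large enough.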

=== Providing a page to user (AMs) ===
* The db file is mapped into the cache region.
The offset in the cache region of the first byte of the file is stored in the __db_mpoolfile structure, so we directly return the in-process address of the wanted page by adding the offset.
* The db pages are loaded into the cache hash table.
    * Compute the hash key of the page and find the hash bucket containing this page, then go through the bucket to find the page.
    * If doing snapshot reads (mvcc), we need to find the version buffer that is visible to this txn. In bdb mvcc, each txn has a readlsn and a visiblelsn: a txn can only see changes made before its readlsn, and its own changes are only visible to other txns whose readlsns are larger than this txn's visiblelsn.
    * The buffer we found may be in IO: being read from disk, being synced to disk, or being frozen or thawed. If so, we wait until the IO finishes. When the IO finishes, if the buffer was frozen and has not been thawed, we thaw it so that it can be returned to the page's user (some AM).
    * In mvcc, if we can find an obsolete buffer, we use that one instead of allocating a new buffer to make a new version of the page.
    * If requested to create a new buffer, or if we have no obsolete buffer to use in the mvcc case, allocate a new buffer. If requested to free the found buffer, dereference it and free it if it is not in use any more.
    * The found buffer may not contain the page yet, in which case we need to read it from the db file.
    * In mvcc, if we need to make a copy of the found page in order to write, do so.
    * Record the pin of this page in the current txn's thread info.
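
The snapshot-read rule in the list above can be sketched in Python. The representation is an assumption: a version chain is modelled as a newest-first list of (visiblelsn of the creating txn, page data) pairs with LSNs as plain integers, and `visible_version` is a hypothetical name.

```python
def visible_version(version_chain, read_lsn):
    """Return the newest version whose creator's visiblelsn is smaller than
    the reader's readlsn, or None if no version is visible.
    `version_chain` is ordered newest first."""
    for visible_lsn, data in version_chain:
        if visible_lsn < read_lsn:
            return data
    return None
```

With the chain [(30, "v3"), (20, "v2"), (10, "v1")], a reader whose readlsn is 25 gets "v2": the version stamped 30 was made visible after the reader took its snapshot.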

=== Putting a page back ===
* Unpin the buffer from the threadinfo pinlist.
* Mark the page's containing file dirty (if the buffer is dirty).
* Decrease the ref count of the page. If the page is no longer used by anyone else and it is going to be flushed to disk, update the global lru value and the page's priority (lru value), and prepare for a wrap of the global lru value. If the global lru value stored in __mpool really wraps, decrease it and all pages' lru values by the same amount.
* Decrease this buffer's ref_sync count, that is, the number of threads waiting to sync that buffer.
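
The handling of the global lru counter wrap can be sketched in Python. The limit and the decrement are parameters here because the notes do not give the real constants; `next_lru` is a hypothetical name, and clamping priorities at zero is an assumption of this sketch.

```python
def next_lru(lru_count, priorities, lru_max, decrement):
    """Advance the global lru counter. When it reaches `lru_max`, pull the
    counter and every page priority down by the same `decrement`, so the
    relative order of the pages is mostly preserved."""
    lru_count += 1
    if lru_count >= lru_max:
        lru_count -= decrement
        for pgno, prio in priorities.items():
            # Assumption: priorities never go below zero.
            priorities[pgno] = max(0, prio - decrement)
    return lru_count
```

Shifting everything by the same amount keeps recently used pages ranked above older ones after the wrap, which would not be the case if the counter simply restarted at zero.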

=== Syncing pages and/or files ===
* Syncing changes that happened before a specified lsn.
Changes made after this lsn will actually be synced as well. See the specification of __memp_sync below.
* Syncing a specified dirty file.
Find the dirty pages belonging to this file and then sync them. See the specification of __memp_fsync below.
* Simply syncing all or part of the dirty pages in the cache. See the specification of __memp_sync_int below.

=== Resizing the cache ===
* Cache region architecture
    * There can be more than one cache region file in a dbenv. The first one contains the global info about the whole cache; the others only act as extended space to hold more buckets of the global hash table. That is, the hash table resides in multiple cache region files, each cache region file contains the same number of hash buckets, and the bucket number increases across the cache region files from 0 to __mpool->nbuckets - 1 as the cache region number increases from 0 to __mpool->nreg - 1.
    * The __db_mpool structure contains a cache region info array, allocated on startup, that holds the global cache region info. The region number is used as the index to retrieve the info of a cache region.
* How to resize the cache region?
    * If we are using multiple cache files, finding the bucket containing the buffer we want is a little complicated; see the specification of __memp_get_bucket.
    * We can add or remove a cache region, or add/remove buckets in a cache region file; by doing this we resize the whole cache. When a cache region is added or removed, the __mpool->regids array may no longer match the __db_mpool->reginfo array. In that case, we need to remap (detach and attach) the regions that do not match.
    * The cache regions form a virtual array. Its real representatives are the region IDs stored in the __mpool->regids array and the REGINFO structures stored in the per-process __db_mpool->reginfo array. So adding a region means appending an element to this virtual array, that is, appending the region ID to the __mpool->regids array and the REGINFO to the __db_mpool->reginfo array. Removing a region means removing the last element of the virtual array, that is, the region ID and REGINFO in __mpool->regids and __db_mpool->reginfo respectively. When adding a region, we add __mpool->htab_buckets buckets to it, and when removing a region, we remove the same number of buckets from it, so that the region contains no buckets at all.

    * When adding a hash bucket A, the total number of hash buckets is incremented, so some buffers previously put in hash bucket B will now hash to A. This is because the MP_HASH_BUCKET macro uses the total number of buckets as an argument when computing the bucket number for a page. So we need to move these buffers from B to A. Removing a hash bucket is similar.
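
One way a bucket function can behave like this, so that adding bucket A only takes buffers away from a single bucket B, is a mask-based mapping in the style of linear hashing. The Python sketch below shows the idea; it is an assumption about how such a mapping can work, not a transcription of MP_HASH_BUCKET, and both function names are hypothetical.

```python
def hash_mask(nbuckets):
    """Smallest mask of the form 2^k - 1 that is not below nbuckets."""
    mask = 1
    while mask < nbuckets:
        mask = (mask << 1) | 1
    return mask


def hash_bucket(hash_value, nbuckets):
    """Map a hash value to a bucket in [0, nbuckets). Values that land past
    the last existing bucket fall back to the half-sized mask."""
    mask = hash_mask(nbuckets)
    bucket = hash_value & mask
    if bucket >= nbuckets:
        bucket &= mask >> 1
    return bucket
```

Growing from 5 to 6 buckets under this mapping only moves hash values that used to fall back from slot 5 to bucket 1; every other hash value keeps its bucket, so only bucket 1 has to be scanned when bucket 5 is added.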

=== Managing db files (opening, extending/shrinking, etc) ===

=== Mvcc ===
mvcc is an important and special feature of mpool, so I single it out and discuss it here. I think mvcc can be described by the answers to the following questions:
* When writing, when to reuse an existing buffer and which to use?
* When writing, when to create a new buffer and which version to copy? When syncing, which version to sync?
* When to delete a useless buffer and what kind of buffer is useless?
* When reading, which version to read and which txns can read versions created by a specific txn?
* Why, when and how to freeze and thaw a page buffer?

== Specifications of key functions ==

* __memp_sync : syncing dirty pages changed before a specified lsn
    * __mpool->lsn stores the latest lsn that has been synced, that is, all changes before this lsn have been synced. If the required lsn is earlier than __mpool->lsn, we don't have to do anything because those changes are already on disk; otherwise, call __memp_sync_int to do the real syncing.
    * Update __mpool->lsn to the new lsn if it is larger than the current value of __mpool->lsn, which may have been updated by others in the meantime.
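
The lsn check can be sketched in Python. LSNs are modelled as plain integers, `state` is a hypothetical dictionary standing in for the shared __mpool structure, and treating an lsn equal to the stored one as already synced is an assumption.

```python
def memp_sync(state, lsn, sync_int):
    """Skip the work if everything up to `lsn` is already on disk; otherwise
    run the real sync and advance the stored lsn if ours is still newer."""
    if lsn <= state["lsn"]:  # assumption: an equal lsn counts as synced
        return False
    sync_int()
    # Another thread may have advanced the stored lsn while we were syncing.
    if lsn > state["lsn"]:
        state["lsn"] = lsn
    return True
```

The second comparison matters because the sync itself takes time: the stored lsn must never move backwards if a later sync finished first.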

* __memp_fsync : sync an already opened db mpool file
    * Preconditions: the file is not read-only, is not a temp file, has a backing file, and is dirty.
    * First check the preconditions; if they hold, call __memp_sync_int with the DB_SYNC_FILE flag.

* __memp_sync_int: sync all or part of dirty pages in all the cache regions

    * We do this by walking each cache's list of buffers and marking all dirty buffers to be written and all dirty buffers to be potentially written; the "potentiality" depends on our flags.
    * There are three nested loops: the first loops over all the caches; the second loops over all the buckets in each cache, only entering buckets whose number of dirty buffers is non-zero; the third loops over all the buffers in each bucket.
    * For each buffer, we decide whether or not to add it to the list of buffers to sync. This depends on the argument flags and on the attributes of the file containing the buffer's page (stored in a __mpoolfile structure). Specifically, pages belonging to a __mpoolfile which has no backing file on disk don't need to be synced. And if we are syncing a single file, temp files don't need syncing, and neither do pages that don't belong to this file.
    * When we have gone through all buffers of all hash buckets of all regions, we have marked all the buffers we want to sync. However, I have a question here: we mark a buffer by remembering its pgno, mf_offset and containing bucket pointer, but not the buffer's pointer, and I don't know why. When we are syncing later, we have to search this bucket again to find the buffer we want, using (pgno, mf_offset) as the key. If we simply noted down the buffer pointer, the performance would also
    * If we are trickling, we only sync the minimum number of buffers needed to achieve the requested percentage of free space. QUESTION: in this case part of our searching above is in vain because we never use the extra buffers, so why not quit the loops as soon as we have enough buffers for trickling?
    * Then we need to sync the buffers file by file and in increasing page number, so that we make a best effort to write to disk efficiently. This means we need to sort the page list by file (mf_offset) and page number. Before syncing any page, we first flush the logs: because we are using a redo log, log records must go to disk before their corresponding pages.
    * Then we sync the pages in the marked list one by one. Note that some pages may be being synced by another thread of control, so we have to temporarily skip them and retry later. We only sync a buffer if its ref_sync count is 0; otherwise it may be in use by others, and we wait a short while for it to become free before we decide to try the next buffer.
   We use the hash bucket pointer in each entry as the flag whether this buffer has been synced, and we only sync each buffer

    * After syncing, we clear the lock flag of the buffer and wake up the threads of control that are waiting to do io (sync or read) on this buffer. Finally, we may have to flush the files in their entirety if required to do so, even if many pages of a file are not dirty at all.
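
The ordering rules of the write phase (log first, then pages sorted by file and page number) can be sketched in Python. The marked list is modelled as (mf_offset, pgno) tuples, and `flush_log` and `write_page` are hypothetical callbacks. Skipping busy buffers and retrying them is not modelled.

```python
def sync_marked(marked, flush_log, write_page):
    """Write the marked pages file by file in increasing page number,
    after forcing the log to disk. Returns the order that was used."""
    if not marked:
        return []
    # Write-ahead rule: log records reach disk before the pages they describe.
    flush_log()
    ordered = sorted(marked)  # tuples sort by mf_offset first, then pgno
    for mf_offset, pgno in ordered:
        write_page(mf_offset, pgno)
    return ordered
```

Sorting by file and then by page number turns the writes into mostly sequential IO within each file, which is generally cheaper than writing in hash-bucket order.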

* __memp_get_bucket : find a bucket when there is more than one cache region file
    * We must make sure the cache region file is attached (mapped in). Once we know the bucket number we want, we can compute the region number we want: regno=bucketno/bucket-per-region.
    * Then we look in the __db_mpool->reginfo array for the slot of this region and check whether it is initialized. If so, we check whether the slot's region ID is the one currently mapped in (each mapped-in region has its region ID in the __mpool->regids array). If not, we map the region; otherwise, this region is the one we want.
    * Once we have the region, we get the bucket containing the buffer we want: array-index=bucketno-bucket-per-region*regno.
    * Then we acquire the hash mutex, and once we hold it we recheck that the cache has not been resized (no region added or removed). If it has been resized, we retry the above procedure. Note that it is common practice to recheck the shared state we rely on after acquiring a mutex, because it may have been changed by others while we were sleeping on the mutex.
    * Finally, return the bucket pointer.
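
The two index computations above amount to a division with remainder, shown here as a Python sketch. `locate_bucket` is a hypothetical name, and the region mapping and the mutex recheck are not modelled.

```python
def locate_bucket(bucketno, buckets_per_region):
    """Split a global bucket number into (region number, index in that region)."""
    regno = bucketno // buckets_per_region
    index = bucketno - buckets_per_region * regno
    return regno, index
```

With 4 buckets per region, global bucket 5 is bucket 1 of region 1, and global bucket 11 is bucket 3 of region 2.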

* __memp_merge_buckets
