Berkeley DB 列存储设计方案

原创 2012年03月25日 15:58:38

这是我根据列存储的需求以及Berkeley DB的技术特征做的一个列存储设计方案。有兴趣的朋友可以研究一下,并且在Berkeley DB的基础上面实现出来。有问题可以联系我,我尽量抽时间回答。

Column store improvements:
0. Store column metadata and table metadata. Column metadata including:
fix lengh or variable len? len or len range for fix/viariable len column;
value distribution characteristics (VDC) (mostly distinct values, majority dup values, etc);
whether need sort and search in the entire table; if no table-wide sort, whether need page level sort;

When storing a row R, store R's columns separately according to their column metadata. For those needing sort and search, store them into btrees as secondary dbs, otherwise store in heap pages (see below). design several types of pages to store specific type of column according to the metadata.

The row itself will be stored with short field values(like numeric types, etc) and field references, if a field F1 is stored in heap pages, we store the [pgno, indx] addr, otherwise we store the value itself.

1. For columns with no need of sort and search, do not store them into btree or hash, simply always
append new data items in a list of pages, never insert. We call such pages 'heap pages'.
If updating such a data item requires page split because it will have more bytes, delete old
one and append new one. In the table entry, store absolute addr(pgno, indx) of such a data item.
Such pages can have row locks.

2. In btree leaf page, if updating does not make a row's size change, use row locks. may use row locks
in hash buckets too.

3. Freespace mgmt
In heap pages, when a page becomes empty because all rows having fields in the page are deleted,
the page is put into the free page list; When we need an empty page, pick the head from the free list
unless it's empty.
We may also link half empty pages with fill rate less than a specified threshold into the free list for
them to be utilized.

4. row level locking:
In column store, if a column or the entire row is fixed length and we don't need in-page sorted, there will
be no need for a index array, because i'th item is located at (pgsz - i) * field-sz offset of the page.
The pages storing this column should use row locking, and insert/delete/update never need to acquire
upper level(page) locks, because update does not change field len, del simply log the del and there
will be a bit array to denote whether there is data in each slot, so that the slot can be used again.

According to the value distribution charisteristics (VDC), we may also compress data.
for example dup items can be stored only once, etc.

If we require in-page sort, we can't use above technique, we need to keep the index array and deleting
an data item means removing the index from the index array, and logging the op.
And we need to somehow identify a deleted slot. we will need a structure for each slot, in the
structure there is a "deleted" bit. And when inserting we look for such slot to store data.

In order to maximize row locking, separate vairable len columns and blobs from a table, making a fixed
len row plus variable len fields and blobs in other dbs.

The page header contains page-wide common info, and we use a dedicated page header txnal lock to
lock the header only to access the page header, rather than locking the entire page. In order to
avoid deadlock, we must stick to a locking order: lock the page header as late as possible in a txn.
Store as little info as possible in the page header, especially don't store info that can be computed
from the content within the page, such as numeric values like max/min/avg/sum of all data items within
the page, etc.

In the main db, each row not only has pointer to non-index field, but also has the field value;
and for indexed field, it stores the field value only. Other indexed field simply are created as
traditional sec db. Such a redundancy for non-indexed fields can make searching by main table faster.
When updating the row, the main row and the field in its column store must be locked;
when the column is dropped, we need to somehow keep the main table consistent.

When storing variable len strings, if the max len is small (e.g. 32), store them as fixed len fields;
if max len is large (e.g. larger then half of a page), or if we are storing blobs, do not use overflow pages,
simply store them as on to a list of pages. And any remaining space in the last page can be used to
store the next such field of another row.

Locking hiarachy:

We need to alter the locking subsystem for it to allow row locks, and to understand row-page relationships:
when row R is in page P, lock requests to either of them may be blocked by existing lock on the other.
The existing way to identify a page as lock object, is to use the 16byte file uid and 4byte pgno,
totally 20byte. The way to identify a row/data item in a page, is to append 2 more byte in-page
index after the 20 bytes. This way, we can use the first 20 byte to know which page it belongs to.

We need to design a lock object hiarachy: table > page > row, in order to do intention locking.
all levels of such objects are regarded as lock objects and fit into current architecture, but we will
add an upper owner link, and a kid object list pointer in them, and also a ref count in page object.
When we want to IWRITE/IREAD a page, we use row.owner to find the page object (we make sure it exists
when creating the row object); and we see page.refcount is 0 then remove the page from lock object bucket.
when deciding whether a lock should be granted, we see not only row locks but also its owner page's locks,
and even table locks.  row will be the finest level of lock, don't use field lock. page and row lock
objects will be removed from hash bucket, but not for table locks.

If scan a whole db or a lot of pages and hold record lock for every record, a huge amount of
lock is needed, so we need to provide flags in data access functions to
specify access type. For example, DB_SCAN_PAGES in DB/DBC->get will make DB hold page locks
rather than record locks.

A page and its records must be hashed to the same bucket otherwise we will need to acquire >1 mutices when locking a record;
When locking a record, we should use the in-page-offset rather than the in-page-index because the index can
change in order to make records sorted in page, but the offset does not change when the record is fixed length;
When a record is deleted we mark the slot unused in the slot header.
And we must make it very cheap to find a table(in bdb terms, database)'s lockobj. we may avoid using hashing functions, and always
keep such lockobjs in the lock region even if no locks.

add/drop a column:
link the column list into free list, then mark the column deleted in the table metadata row for this table,
but do not modify the main table to drop the field for this column. This task will be done by user cmd,
or when adding a column(a whole table scan would be inevitable then).

space preallocation:
In order to have a better chance of allocating continuous sectors on disks for data with locality, we should preallocate a btree or a column.
DB could not do this because the key/data are of arbitrary sizes, but when we have fixed rows, we can do this for both an entire btree or a column.

[转]Berkeley DB设计经验

关于Berkeley DB数据库的总体设计与实现的文章
  • heiyeshuwu
  • heiyeshuwu
  • 2016年05月27日 20:24
  • 3620

Berkeley DB数据库和日志文件归档

Berkeley DB数据库和日志文件归档 归档数据库和日志文件目的是提供面对数据库的灾难性故障的可恢复性,最大限度地减少发生物理硬件故障导致数据丢失。 首先,你可能需要定期创建数据库快照(也就是...
  • wilbertzhou
  • wilbertzhou
  • 2014年04月08日 09:46
  • 1524

Berkeley DB持久化和高速缓存

Berkeley DB之所以能实现如此高的性能,和它的持久化特点及高速缓存有莫大的关系,本文将偏向这方面进行解释,而不会或者很少提及到事务,复制等方面的问题。 任何的持久化数据库系统最终的数据存...
  • hit_hlj_sgy
  • hit_hlj_sgy
  • 2013年09月17日 17:36
  • 1634

Berkeley DB并发数据存储编程

在完全没有必要用上恢复机制和事件的相关语义时,通常我们会用上可并发存储的数据库.对于这类的程序,BDB提供了释放死锁(deadlock- free),数读/单写(multiple-reader/sin...
  • huangyongwang
  • huangyongwang
  • 2013年03月25日 10:41
  • 959

Berkeley DB Java Edition存储文件格式概述

Bdb je的底层存储格式是Log-Structured File System,即就像写日志那样append only,并且其日志和数据是存在一起不分开的。 因此bdb的存储格式是面向entry的...
  • xugangwen
  • xugangwen
  • 2017年02月13日 21:47
  • 784

基于hadoop MR+berkeley DB实现的十亿级数据的秒级部署和实时查询的解决方案

要解决的问题 1、有10亿级别的某视频网的注册用户和设备用户,需要T+1天的延时后,供前端实时查询任意uid或是设备id对应的用户画像数据。 2、分为计算周期+布署服务化+查询三部分,计算用时优化余地...
  • erliang20088
  • erliang20088
  • 2016年08月24日 16:06
  • 1409

NoSQL解决方案比较(MongoDB vs Redis, Tokyo Cabinet, and Berkeley DB)

NoSQL解决方案比较 NoSQL Solution: Evaluation and Comparison: MongoDB vs Redis, Tokyo Cabinet, and Ber...
  • wzhg0508
  • wzhg0508
  • 2014年10月17日 12:14
  • 1048

Berkeley DB Java 版 4.0.92

  • 2010年05月29日 18:20
  • 10.37MB
  • 下载

sqlite PK Berkeley DB

  • 2011年12月24日 18:22
  • 47KB
  • 下载

Berkeley DB Java Edition

  • 2011年08月23日 13:35
  • 12.89MB
  • 下载
您举报文章:Berkeley DB 列存储设计方案