Berkeley DB 列存储设计方案

原创 2012年03月25日 15:58:38

这是我根据列存储的需求以及Berkeley DB的技术特征做的一个列存储设计方案。有兴趣的朋友可以研究一下,并且在Berkeley DB的基础上面实现出来。有问题可以联系我,我尽量抽时间回答。


Column store improvements:
0. Store column metadata and table metadata. Column metadata including:
fix lengh or variable len? len or len range for fix/viariable len column;
value distribution characteristics (VDC) (mostly distinct values, majority dup values, etc);
whether need sort and search in the entire table; if no table-wide sort, whether need page level sort;

When storing a row R, store R's columns separately according to their column metadata. For those needing sort and search, store them into btrees as secondary dbs, otherwise store in heap pages (see below). design several types of pages to store specific type of column according to the metadata.

The row itself will be stored with short field values(like numeric types, etc) and field references, if a field F1 is stored in heap pages, we store the [pgno, indx] addr, otherwise we store the value itself.

1. For columns with no need of sort and search, do not store them into btree or hash, simply always
append new data items in a list of pages, never insert. We call such pages 'heap pages'.
If updating such a data item requires page split because it will have more bytes, delete old
one and append new one. In the table entry, store absolute addr(pgno, indx) of such a data item.
Such pages can have row locks.



2. In btree leaf page, if updating does not make a row's size change, use row locks. may use row locks
in hash buckets too.


3. Freespace mgmt
In heap pages, when a page becomes empty because all rows having fields in the page are deleted,
the page is put into the free page list; When we need an empty page, pick the head from the free list
unless it's empty.
We may also link half empty pages with fill rate less than a specified threshold into the free list for
them to be utilized.


4. row level locking:
In column store, if a column or the entire row is fixed length and we don't need in-page sorted, there will
be no need for a index array, because i'th item is located at (pgsz - i) * field-sz offset of the page.
The pages storing this column should use row locking, and insert/delete/update never need to acquire
upper level(page) locks, because update does not change field len, del simply log the del and there
will be a bit array to denote whether there is data in each slot, so that the slot can be used again.

According to the value distribution charisteristics (VDC), we may also compress data.
for example dup items can be stored only once, etc.

If we require in-page sort, we can't use above technique, we need to keep the index array and deleting
an data item means removing the index from the index array, and logging the op.
And we need to somehow identify a deleted slot. we will need a structure for each slot, in the
structure there is a "deleted" bit. And when inserting we look for such slot to store data.

In order to maximize row locking, separate vairable len columns and blobs from a table, making a fixed
len row plus variable len fields and blobs in other dbs.

The page header contains page-wide common info, and we use a dedicated page header txnal lock to
lock the header only to access the page header, rather than locking the entire page. In order to
avoid deadlock, we must stick to a locking order: lock the page header as late as possible in a txn.
Store as little info as possible in the page header, especially don't store info that can be computed
from the content within the page, such as numeric values like max/min/avg/sum of all data items within
the page, etc.


In the main db, each row not only has pointer to non-index field, but also has the field value;
and for indexed field, it stores the field value only. Other indexed field simply are created as
traditional sec db. Such a redundancy for non-indexed fields can make searching by main table faster.
When updating the row, the main row and the field in its column store must be locked;
when the column is dropped, we need to somehow keep the main table consistent.

When storing variable len strings, if the max len is small (e.g. 32), store them as fixed len fields;
if max len is large (e.g. larger then half of a page), or if we are storing blobs, do not use overflow pages,
simply store them as on to a list of pages. And any remaining space in the last page can be used to
store the next such field of another row.

Locking hiarachy:

We need to alter the locking subsystem for it to allow row locks, and to understand row-page relationships:
when row R is in page P, lock requests to either of them may be blocked by existing lock on the other.
The existing way to identify a page as lock object, is to use the 16byte file uid and 4byte pgno,
totally 20byte. The way to identify a row/data item in a page, is to append 2 more byte in-page
index after the 20 bytes. This way, we can use the first 20 byte to know which page it belongs to.

We need to design a lock object hiarachy: table > page > row, in order to do intention locking.
all levels of such objects are regarded as lock objects and fit into current architecture, but we will
add an upper owner link, and a kid object list pointer in them, and also a ref count in page object.
When we want to IWRITE/IREAD a page, we use row.owner to find the page object (we make sure it exists
when creating the row object); and we see page.refcount is 0 then remove the page from lock object bucket.
when deciding whether a lock should be granted, we see not only row locks but also its owner page's locks,
and even table locks.  row will be the finest level of lock, don't use field lock. page and row lock
objects will be removed from hash bucket, but not for table locks.

If scan a whole db or a lot of pages and hold record lock for every record, a huge amount of
lock is needed, so we need to provide flags in data access functions to
specify access type. For example, DB_SCAN_PAGES in DB/DBC->get will make DB hold page locks
rather than record locks.

A page and its records must be hashed to the same bucket otherwise we will need to acquire >1 mutices when locking a record;
When locking a record, we should use the in-page-offset rather than the in-page-index because the index can
change in order to make records sorted in page, but the offset does not change when the record is fixed length;
When a record is deleted we mark the slot unused in the slot header.
And we must make it very cheap to find a table(in bdb terms, database)'s lockobj. we may avoid using hashing functions, and always
keep such lockobjs in the lock region even if no locks.

add/drop a column:
link the column list into free list, then mark the column deleted in the table metadata row for this table,
but do not modify the main table to drop the field for this column. This task will be done by user cmd,
or when adding a column(a whole table scan would be inevitable then).

space preallocation:
In order to have a better chance of allocating continuous sectors on disks for data with locality, we should preallocate a btree or a column.
DB could not do this because the key/data are of arbitrary sizes, but when we have fixed rows, we can do this for both an entire btree or a column.
版权声明:本文为博主原创文章,未经博主允许不得转载。

相关文章推荐

列存储、行存储

一、定义   1.1定义 Sybase在2004年左右就推出了列存储的Sybase IQ数据库系统,主要用于在线分析、数据挖掘等查询密集型应用。列存储,缩写为DSM,相对于NSM(N...
  • klarclm
  • klarclm
  • 2013年02月26日 13:24
  • 26690

行存储和列存储的区别

列存储不同于传统的关系型数据库,其数据在表中是按行存储的,列方式所带来的重要好处之一就是,由于查询中的选择规则是通过列来定义的,因此整个数据库是自动索引化的。按列存储每个字段的数据聚集存储,在查询只需...

基于hadoop MR+berkeley DB实现的十亿级数据的秒级部署和实时查询的解决方案

要解决的问题 1、有10亿级别的某视频网的注册用户和设备用户,需要T+1天的延时后,供前端实时查询任意uid或是设备id对应的用户画像数据。 2、分为计算周期+布署服务化+查询三部分,计算用时优化余地...

Berkeley DB Java Edition存储文件格式概述

Bdb je的底层存储格式是Log-Structured File System,即就像写日志那样append only,并且其日志和数据是存在一起不分开的。 因此bdb的存储格式是面向entry的...

NoSQL解决方案比较(MongoDB vs Redis, Tokyo Cabinet, and Berkeley DB)

NoSQL解决方案比较 NoSQL Solution: Evaluation and Comparison: MongoDB vs Redis, Tokyo Cabinet, and Ber...

[转]Berkeley DB设计经验

关于Berkeley DB数据库的总体设计与实现的文章

Berkeley DB je3.3版

  • 2016年09月08日 20:03
  • 6.39MB
  • 下载

Berkeley DB基础教程

一、Berkeley DB的介绍 (1)Berkeley DB是一个嵌入式数据库,它适合于管理海量的、简单的数据。如Google使用其来保存账户信息,Heritrix用其来保存froniter. (...

Berkeley DB C API chm版

  • 2009年05月23日 21:53
  • 364KB
  • 下载
内容举报
返回顶部
收藏助手
不良信息举报
您举报文章:Berkeley DB 列存储设计方案
举报原因:
原因补充:

(最多只允许输入30个字)