Berkeley DB Column Store Design Proposal
This is a column store design proposal I drafted based on the requirements of column storage and the technical characteristics of Berkeley DB. Interested readers are welcome to study it and implement it on top of Berkeley DB. Feel free to contact me with questions; I will try to find time to answer them.
Column store improvements:
0. Store column metadata and table metadata. Column metadata includes:
fixed or variable length? the length (or length range) for a fixed/variable-length column;
value distribution characteristics (VDC) (mostly distinct values, mostly duplicate values, etc.);
whether the column needs sort and search across the entire table; if no table-wide sort, whether it needs page-level sort.
When storing a row R, store R's columns separately according to their column metadata. Columns that need sort and search go into btrees as secondary dbs; the rest go into heap pages (see below). Design several types of pages, each storing a specific type of column according to the metadata.
The row itself is stored with short field values (numeric types, etc.) inline and field references otherwise: if a field F1 is stored in heap pages, we store its [pgno, indx] address; otherwise we store the value itself.
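The inline-value-vs-reference row layout above can be sketched as a C struct. This is a minimal illustration under assumed names (`row_field`, `field_addr`, etc. are not Berkeley DB types):

```c
#include <stdint.h>

/* Hypothetical sketch of a main-table row entry: short fields (numeric
 * types, etc.) are stored inline; a field stored in a heap page is
 * represented by its absolute [pgno, indx] address instead. */

/* Address of a field stored out-of-row in a heap page. */
typedef struct {
    uint32_t pgno;   /* page number of the heap page            */
    uint16_t indx;   /* slot index of the item within that page */
} field_addr;

/* One field slot in the row: either an inline value or a reference. */
typedef struct {
    uint8_t is_ref;          /* 0 = inline value, 1 = heap-page reference */
    union {
        int64_t    num;      /* short inline value (e.g. numeric types) */
        field_addr addr;     /* [pgno, indx] of the heap-page item      */
    } u;
} row_field;

/* Build an inline numeric field. */
static row_field field_inline(int64_t v) {
    row_field f; f.is_ref = 0; f.u.num = v; return f;
}

/* Build a reference to a field stored at [pgno, indx] in a heap page. */
static row_field field_ref(uint32_t pgno, uint16_t indx) {
    row_field f;
    f.is_ref = 1; f.u.addr.pgno = pgno; f.u.addr.indx = indx;
    return f;
}
```

A row is then just an array of `row_field`, with the column metadata deciding which representation each field uses.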
1. For columns with no need for sort and search, do not store them in a btree or hash; simply always
append new data items to a list of pages, never insert. We call such pages 'heap pages'.
If an update would grow a data item beyond its current space, delete the old
item and append the new one. In the table entry, store the absolute address (pgno, indx) of the data item.
Such pages can have row locks.
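The append-only heap-page discipline above can be sketched as follows. This is an illustrative in-memory model (fixed slot arrays, no logging or locking), not the actual page format:

```c
#include <stdint.h>
#include <string.h>

#define PGSZ 4096
#define MAX_ITEMS 128

/* Minimal sketch of an append-only 'heap page'. */
typedef struct {
    uint16_t nitems;
    uint16_t free_off;            /* next free byte in data[]            */
    uint16_t off[MAX_ITEMS];      /* slot index -> byte offset           */
    uint16_t len[MAX_ITEMS];      /* slot index -> item length, 0 = deleted */
    uint8_t  data[PGSZ];
} heap_page;

/* Append an item; returns its slot index (the 'indx' part of the
 * [pgno, indx] address), or -1 if the page is full. Never inserts. */
static int heap_append(heap_page *p, const void *buf, uint16_t len) {
    if (p->nitems >= MAX_ITEMS || p->free_off + len > PGSZ) return -1;
    int indx = p->nitems++;
    p->off[indx] = p->free_off;
    p->len[indx] = len;
    memcpy(p->data + p->free_off, buf, len);
    p->free_off += len;
    return indx;
}

/* Delete just marks the slot empty; space is reclaimed elsewhere. */
static void heap_delete(heap_page *p, int indx) { p->len[indx] = 0; }

/* Update: if the new value fits in place, overwrite; otherwise delete
 * the old item and append the new one, returning the new slot index. */
static int heap_update(heap_page *p, int indx, const void *buf, uint16_t len) {
    if (len <= p->len[indx]) {
        memcpy(p->data + p->off[indx], buf, len);
        p->len[indx] = len;
        return indx;
    }
    heap_delete(p, indx);
    return heap_append(p, buf, len);
}
```

Note that a growing update changes the item's (pgno, indx) address, so the table entry holding the reference must be updated as well.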
2. In a btree leaf page, if an update does not change a row's size, use row locks. Row locks may be used
in hash buckets too.
3. Free-space management
In heap pages, when a page becomes empty because all rows with fields in it have been deleted,
the page is put on the free page list; when we need an empty page, we pick the head of the free list
unless the list is empty.
We may also link partially empty pages whose fill rate is below a specified threshold into the free list
so they can be reused.
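The free-list policy above can be sketched like this. The 50% fill threshold is an assumed example value, and the structure names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

#define PGSZ 4096
#define FILL_THRESHOLD (PGSZ / 2)   /* assumed threshold: below 50% full */

/* Sketch of a free-page list entry: fully empty heap pages, and pages
 * whose fill rate drops below the threshold, are linked in so that
 * allocation can reuse them. */
typedef struct free_page {
    uint32_t pgno;
    uint32_t used_bytes;      /* 0 for a fully empty page */
    struct free_page *next;
} free_page;

/* Push a page on the free-list head if empty or under the threshold;
 * returns 1 if linked, 0 if the page is too full to reuse. */
static int maybe_free(free_page **head, free_page *p) {
    if (p->used_bytes >= FILL_THRESHOLD) return 0;
    p->next = *head;
    *head = p;
    return 1;
}

/* Pick the head of the free list unless the list is empty. */
static free_page *alloc_from_free_list(free_page **head) {
    free_page *p = *head;
    if (p != NULL) *head = p->next;
    return p;
}
```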
4. Row-level locking:
In a column store, if a column (or the entire row) is fixed length and in-page sorting is not needed, there
is no need for an index array, because the i-th item sits at a computable offset within the page
(e.g. pgsz - (i+1) * field-sz when items grow from the end of the page).
The pages storing such a column should use row locking, and insert/delete/update never need to acquire
the upper-level (page) lock: an update does not change the field length, a delete simply logs the deletion, and
a bit array records whether each slot holds data, so deleted slots can be reused.
According to the value distribution characteristics (VDC), we may also compress data;
for example, duplicate items can be stored only once.
If we require in-page sorting, we cannot use the above technique: we must keep the index array, and deleting
a data item means removing its entry from the index array and logging the operation.
We also need some way to identify a deleted slot: each slot gets a small header structure containing a
"deleted" bit, and on insert we look for such a slot to store the new data.
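The fixed-length addressing and slot-presence bit array described above can be sketched as follows. The end-of-page growth direction is one plausible layout, assumed here for concreteness:

```c
#include <stdint.h>

/* Sketch of fixed-length slot addressing: with fixed-length fields and
 * no in-page sort, no index array is needed -- the i-th item's offset
 * is computed directly, and one bit per slot records whether the slot
 * holds data, so deleted slots can be reused. */

/* Items grow from the end of the page (assumed layout): slot i
 * occupies bytes [pgsz - (i+1)*field_sz, pgsz - i*field_sz). */
static uint32_t slot_offset(uint32_t pgsz, uint32_t field_sz, uint32_t i) {
    return pgsz - (i + 1) * field_sz;
}

/* Presence-bitmap helpers. */
static int  slot_in_use(const uint8_t *bits, uint32_t i) { return (bits[i / 8] >> (i % 8)) & 1; }
static void slot_set(uint8_t *bits, uint32_t i)          { bits[i / 8] |= (uint8_t)(1u << (i % 8)); }
static void slot_clear(uint8_t *bits, uint32_t i)        { bits[i / 8] &= (uint8_t)~(1u << (i % 8)); }

/* On insert, scan for a free (possibly previously deleted) slot. */
static int find_free_slot(const uint8_t *bits, uint32_t nslots) {
    for (uint32_t i = 0; i < nslots; i++)
        if (!slot_in_use(bits, i)) return (int)i;
    return -1;
}
```

Because a slot's offset never changes once the field length is fixed, a delete followed by a reuse of the same slot leaves all other rows' addresses intact, which is what makes pure row locking safe here.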
In order to maximize row locking, separate variable-length columns and blobs from a table, leaving a fixed-length
row plus variable-length fields and blobs stored in other dbs.
The page header contains page-wide common info, and we use a dedicated transactional page-header lock to
lock only the header when accessing it, rather than locking the entire page. In order to
avoid deadlock, we must stick to a locking order: lock the page header as late as possible in a txn.
Store as little info as possible in the page header; in particular, don't store info that can be computed
from the content of the page, such as numeric values like the max/min/avg/sum of all data items within
the page.
In the main db, each row has not only a pointer to each non-indexed field but also the field's value;
for an indexed field, it stores the field value only. Indexed fields are simply created as
traditional secondary dbs. This redundancy for non-indexed fields makes searching via the main table faster.
When updating a row, both the main row and the field in its column store must be locked;
when a column is dropped, we need to somehow keep the main table consistent.
When storing variable-length strings, if the max length is small (e.g. 32), store them as fixed-length fields;
if the max length is large (e.g. larger than half a page), or if we are storing blobs, do not use overflow pages;
simply store them on a list of pages, and any remaining space in the last page can be used to
store the next such field of another row.
We need to alter the locking subsystem so that it allows row locks and understands row-page relationships:
when row R is in page P, lock requests on either of them may be blocked by existing locks on the other.
The existing way to identify a page as a lock object is to use the 16-byte file uid and the 4-byte pgno,
20 bytes in total. The way to identify a row/data item in a page is to append a 2-byte in-page
index after those 20 bytes. This way, the first 20 bytes tell us which page the row belongs to.
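The 20-byte page identity plus 2-byte row suffix can be sketched as nested structs. The layout below is illustrative of the naming scheme, not the actual Berkeley DB lock-object encoding:

```c
#include <stdint.h>
#include <string.h>

#define FILE_UID_LEN 16

/* A page lock object: 16-byte file uid + 4-byte pgno = 20 bytes. */
typedef struct {
    uint8_t  fileuid[FILE_UID_LEN];
    uint32_t pgno;
} page_lockobj;

/* A row lock object: the page's 20 bytes plus a 2-byte in-page index,
 * so a row's lock object always carries its page's identity as a prefix. */
typedef struct {
    page_lockobj page;   /* first 20 bytes name the page */
    uint16_t     indx;   /* 2-byte in-page index/offset  */
} row_lockobj;

/* Recover the page identity from a row lock object: just the prefix. */
static page_lockobj row_to_page(const row_lockobj *r) {
    return r->page;
}

/* Two lock objects name the same page iff their 20-byte identities match. */
static int same_page(const page_lockobj *a, const page_lockobj *b) {
    return memcmp(a->fileuid, b->fileuid, FILE_UID_LEN) == 0 && a->pgno == b->pgno;
}
```

Keeping the page identity as a strict prefix is what lets the lock manager cheaply relate a row lock request to locks already held on its page.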
We need to design a lock object hierarchy, table > page > row, in order to do intention locking.
All levels of such objects are regarded as lock objects and fit into the current architecture, but we will
add an upper owner link and a kid object list pointer to them, and also a ref count in the page object.
When we want to IWRITE/IREAD a page, we use row.owner to find the page object (we make sure it exists
when creating the row object); and when we see page.refcount drop to 0, we remove the page from its lock object bucket.
When deciding whether a lock should be granted, we consider not only row locks but also the owner page's locks,
and even table locks. The row will be the finest level of lock; don't use field locks. Page and row lock
objects will be removed from hash buckets, but table lock objects will not.
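The owner link, kid list, and ref count described above can be sketched as follows. Field and function names are illustrative, and lock modes, latching, and hash-bucket bookkeeping are omitted:

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { OBJ_TABLE, OBJ_PAGE, OBJ_ROW } lockobj_level;

/* Sketch of a lock object in the table > page > row hierarchy. */
typedef struct lockobj {
    lockobj_level   level;
    struct lockobj *owner;     /* upward link: row -> page -> table  */
    struct lockobj *kids;      /* head of child object list          */
    struct lockobj *sibling;   /* next child under the same owner    */
    uint32_t        refcount;  /* on page objects: live row objects  */
} lockobj;

/* Create-time wiring: a row object records its page as owner and bumps
 * the page's refcount, so IWRITE/IREAD can follow row->owner later. */
static void attach_row(lockobj *row, lockobj *page) {
    row->level = OBJ_ROW;
    row->owner = page;
    row->sibling = page->kids;
    page->kids = row;
    page->refcount++;
}

/* On row-object teardown: unlink from the kid list and drop the
 * refcount; a page whose refcount reaches 0 is eligible for removal
 * from its hash bucket (table objects are never removed).
 * Returns 1 when the page can be removed. */
static int detach_row(lockobj *row) {
    lockobj *page = row->owner;
    lockobj **pp = &page->kids;
    while (*pp != row) pp = &(*pp)->sibling;
    *pp = row->sibling;
    return --page->refcount == 0;
}
```

A grant check would then walk `row->owner` (and `owner->owner` for the table) to test the proposed lock mode against locks held at each level.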
If we scan a whole db or many pages while holding a record lock for every record, a huge number of
locks is needed, so we need to provide flags in the data access functions to
specify the access type. For example, a DB_SCAN_PAGES flag in DB/DBC->get would make DB hold page locks
rather than record locks.
A page and its records must be hashed to the same bucket; otherwise we would need to acquire more than one mutex when locking a record.
When locking a record, we should use the in-page offset rather than the in-page index, because the index can
change in order to keep records sorted within the page, while the offset does not change when the record is fixed length.
When a record is deleted, we mark the slot unused in the slot header.
We must also make it very cheap to find a table's (in BDB terms, a database's) lock object: we may avoid hash functions and always
keep such lock objects in the lock region even when they hold no locks.
Add/drop a column:
Link the column's page list into the free list, then mark the column deleted in the table metadata row for this table,
but do not modify the main table to drop the field for this column. That task will be done by a user command,
or when adding a column (a whole-table scan would be inevitable then).
In order to have a better chance of allocating contiguous sectors on disk for data with locality, we should preallocate a btree or a column.
DB could not do this before because keys/data are of arbitrary sizes, but with fixed-length rows we can do it for both an entire btree and a column.
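With fixed-length rows, the extent to preallocate is directly computable. A small sketch, where the per-page header overhead is an assumed constant rather than Berkeley DB's actual value:

```c
#include <stdint.h>

#define PGSZ 4096
#define PAGE_HDR 32   /* assumed per-page header overhead, for illustration */

/* Pages needed to hold nrows fixed-length rows of row_sz bytes each;
 * preallocating this many contiguous pages up front improves the odds
 * of contiguous sectors on disk. */
static uint32_t prealloc_pages(uint32_t nrows, uint32_t row_sz) {
    uint32_t per_page = (PGSZ - PAGE_HDR) / row_sz;   /* rows per page  */
    return (nrows + per_page - 1) / per_page;         /* round up       */
}
```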