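2021SC@SDUSC The Concept of a Cell in the B-tree
I. What is a cell?
An SQLite database file is divided into fixed-size pages, all of which are managed by the B+tree module. Every page is either a tree page (an interior page or a leaf page), an overflow page, or a free page (free pages are organized into a singly linked list).
Tree pages (interior and leaf pages) are divided into cells, and each cell holds one payload or part of one. The cell is the unit in which space on a tree page is allocated and released. The page contents were covered in an earlier analysis in this series.
Each tree page is divided into 4 parts:
1. Page header
2. Cell pointer array
3. Unallocated space
4. Cell content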
The fields that the B-tree module parses out of a cell are held in the CellInfo structure in btreeInt.h:
struct CellInfo {
  i64 nKey;      /* The key for INTKEY tables, or nPayload otherwise */
  u8 *pPayload;  /* Pointer to the start of payload */
  u32 nPayload;  /* Bytes of payload */
  u16 nLocal;    /* Amount of payload held locally, not on overflow */
  u16 nSize;     /* Size of the cell content on the main b-tree page */
};
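The cell pointer array grows downward from the top of the page, while the cell content grows upward from the bottom. The cell pointer array acts as a directory within the page and keeps the cells organized.
The page header contains only management information for its own page and is always stored at the start of the page.
Cells are stored at the very bottom of the page and grow toward its start. The cell pointer array begins at the first byte after the page header and holds zero or more cell pointers. Each cell pointer is a 2-byte integer giving the offset of the actual cell content from the start of the page. Cell pointers are kept sorted by key, even though the cells themselves are not stored in order. The number of entries in the cell pointer array (that is, the number of cells on the page) is stored at offset 3 in the page header.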
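The directory role of the cell pointer array can be sketched in C. This is a minimal illustration, not SQLite's own code: the helper names read_be16, page_cell_count and page_cell_offset are hypothetical, and the sketch assumes big-endian 2-byte integers, a page whose header sits at byte 0 (so not page 1, which starts with the database file header), and a caller-supplied header length (8 bytes for a leaf page, 12 for an interior page in the SQLite file format).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 2-byte big-endian integer (assumed on-disk byte order). */
static unsigned read_be16(const uint8_t *p) {
    return ((unsigned)p[0] << 8) | p[1];
}

/* Number of cells on the page: the 2-byte field at header offset 3. */
static unsigned page_cell_count(const uint8_t *page) {
    return read_be16(page + 3);
}

/* Offset, from the start of the page, of the i-th cell in key order.
** hdr_size is the page header length (assumed 8 for a leaf page,
** 12 for an interior page). */
static unsigned page_cell_offset(const uint8_t *page, size_t hdr_size,
                                 unsigned i) {
    assert(i < page_cell_count(page));
    return read_be16(page + hdr_size + 2u * i);
}
```

Calling page_cell_offset(page, 8, 0) on a leaf page returns where the cell with the smallest key lives, wherever that cell happens to sit in the content area.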
A cell is a variable-length byte string that holds one payload. Its layout is described below; the SIZE column is in bytes.
The corresponding comment in btreeInt.h reads:
** The content of a cell looks like this:
**
**    SIZE    DESCRIPTION
**      4     Page number of the left child. Omitted if leaf flag is set.
**     var    Number of bytes of data. Omitted if the zerodata flag is set.
**     var    Number of bytes of key. Or the key itself if intkey flag is set.
**      *     Payload
**      4     First page of the overflow chain. Omitted if no overflow
**
** Overflow pages form a linked list. Each page except the last is completely
** filled with data (pagesize - 4 bytes). The last page can have as little
** as 1 byte of data.
**
**    SIZE    DESCRIPTION
**      4     Page number of next overflow page
**      *     Data
**
** Freelist pages come in two subtypes: trunk pages and leaf pages. The
** file header points to the first in a linked list of trunk page. Each trunk
** page points to multiple leaf pages. The content of a leaf page is
** unspecified. A trunk page looks like this:
**
**    SIZE    DESCRIPTION
**      4     Page number of next trunk page
**      4     Number of leaf pointers on this page
**      *     zero or more pages numbers of leaves
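For an interior node, each cell contains a 4-byte child pointer; a cell on a leaf node has no child pointer. Next come the size in bytes of the data stored in the cell and the size in bytes of its key. If intkey is set in the page header, the key-size field holds the integer key itself; if zerodata is set, the cell carries no data and the data-size field is omitted.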
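The cell layout quoted from btreeInt.h can be turned into a small parsing sketch. The names cell_header, get_varint and parse_cell_header are hypothetical, and the sketch follows the quoted comment literally (optional child pointer, optional data size, then key size or key); SQLite's own parsers are specialized per page type, so treat this as an illustration only. It assumes the "var" fields are SQLite varints: 1 to 9 bytes, big-endian, 7 bits per byte with the high bit as a continuation flag, and all 8 bits of a ninth byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t left_child;  /* 0 when the page is a leaf */
    uint64_t n_data;      /* 0 when zerodata is set */
    uint64_t n_key;       /* key length, or the key itself if intkey */
    size_t   header_len;  /* bytes consumed before the payload starts */
} cell_header;

static uint32_t read_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | p[3];
}

/* Decode one varint; returns the number of bytes it occupies. */
static size_t get_varint(const uint8_t *p, uint64_t *out) {
    uint64_t v = 0;
    for (size_t i = 0; i < 8; i++) {
        v = (v << 7) | (p[i] & 0x7f);
        if ((p[i] & 0x80) == 0) { *out = v; return i + 1; }
    }
    *out = (v << 8) | p[8];  /* ninth byte contributes all 8 bits */
    return 9;
}

/* Parse the fixed part of a cell according to the page flags. */
static cell_header parse_cell_header(const uint8_t *cell, int leaf,
                                     int zerodata) {
    cell_header h = {0, 0, 0, 0};
    size_t n = 0;
    assert(cell != NULL);
    if (!leaf) { h.left_child = read_be32(cell); n += 4; }
    if (!zerodata) n += get_varint(cell + n, &h.n_data);
    n += get_varint(cell + n, &h.n_key);
    h.header_len = n;
    return h;
}
```

The payload, and the 4-byte overflow page number if there is one, follow at cell + header_len.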
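II. Cells and the structure of overflow pages
1. Why overflow pages exist
SQLite limits how much of a payload may be stored on a tree page. A payload may therefore not be stored entirely on one page, even when that page has enough free space. The largest payload a page can hold for a single cell is set by the share of the page's total usable space that a single cell of an interior node is allowed to consume.
If the payload of a cell on an interior node exceeds the maximum payload limit, the excess is split off and stored in a linked list of overflow pages. Once an overflow page has been allocated, as many bytes as possible are moved to the overflow pages, as long as the part left in the cell does not fall below the minimum payload limit. For leaf nodes, the minimum payload limit is stored at offset 23 of the file header, while the maximum payload limit is always 100% and is not given in the file header.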
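How the split between the cell and its overflow pages works out in numbers can be sketched as follows. The formulas are an assumption brought in from the SQLite file format documentation for a table b-tree leaf page (they are not derived in the text above), the minimum fraction is taken to be the fixed value 32 out of 255, and the function name local_payload is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of an n_payload-byte payload that stay inside the cell on a
** table leaf page with `usable` usable bytes. The remainder goes to
** overflow pages, each of which carries (usable - 4) bytes. */
static uint32_t local_payload(uint32_t usable, uint32_t n_payload) {
    assert(usable >= 480);  /* smallest usable size the format allows */
    uint32_t max_local = usable - 35;
    uint32_t min_local = (usable - 12) * 32 / 255 - 23;
    if (n_payload <= max_local) return n_payload;  /* no overflow needed */
    /* Keep just enough locally that the overflow pages are filled. */
    uint32_t surplus = min_local + (n_payload - min_local) % (usable - 4);
    return surplus <= max_local ? surplus : min_local;
}
```

With 4096 usable bytes, a 5000-byte payload keeps 908 bytes in the cell and sends exactly 4092 bytes, one full overflow page, to the overflow chain.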
Following a chain of overflow pages relies on getOverflowPage in btree.c, which returns the page number that comes after a given overflow page:
static int getOverflowPage(
  BtShared *pBt,     /* The database file */
  Pgno ovfl,         /* Current overflow page number */
  MemPage **ppPage,  /* OUT: MemPage handle (may be NULL) */
  Pgno *pPgnoNext    /* OUT: Next overflow page number */
){
  Pgno next = 0;
  MemPage *pPage = 0;
  int rc = SQLITE_OK;

  assert( sqlite3_mutex_held(pBt->mutex) );
  assert(pPgnoNext);

#ifndef SQLITE_OMIT_AUTOVACUUM
  /* Try to find the next page in the overflow list using the
  ** autovacuum pointer-map pages. Guess that the next page in
  ** the overflow list is page number (ovfl+1). If that guess turns
  ** out to be wrong, fall back to loading the data of page
  ** number ovfl to determine the next page number.
  */
  if( pBt->autoVacuum ){
    Pgno pgno;
    Pgno iGuess = ovfl+1;
    u8 eType;

    while( PTRMAP_ISPAGE(pBt, iGuess) || iGuess==PENDING_BYTE_PAGE(pBt) ){
      iGuess++;
    }

    if( iGuess<=btreePagecount(pBt) ){
      rc = ptrmapGet(pBt, iGuess, &eType, &pgno);
      if( rc==SQLITE_OK && eType==PTRMAP_OVERFLOW2 && pgno==ovfl ){
        next = iGuess;
        rc = SQLITE_DONE;
      }
    }
  }
#endif

  assert( next==0 || rc==SQLITE_DONE );
  if( rc==SQLITE_OK ){
    rc = btreeGetPage(pBt, ovfl, &pPage, (ppPage==0) ? PAGER_GET_READONLY : 0);
    assert( rc==SQLITE_OK || pPage==0 );
    if( rc==SQLITE_OK ){
      next = get4byte(pPage->aData);
    }
  }

  *pPgnoNext = next;
  if( ppPage ){
    *ppPage = pPage;
  }else{
    releasePage(pPage);
  }
  return (rc==SQLITE_DONE ? SQLITE_OK : rc);
}
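2. Definition of an overflow page
The overflow pages of a cell form a singly linked list. Every overflow page except the last is completely filled with data, and the amount of data it holds is the usable space minus 4 bytes, because the first 4 bytes store the page number of the next overflow page. The last overflow page may hold as little as 1 byte of data. An overflow page never stores content from two different cells.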
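The linked-list layout just described can be illustrated with a toy in-memory database image. Both helper names are hypothetical, the page numbers are 1-based, and the sketch assumes that a next-page number of 0 marks the last page of a chain and that page numbers are stored big-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t read_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | p[3];
}

/* Overflow pages needed for n_bytes of spilled payload: every page
** carries (usable - 4) bytes after its 4-byte next-page number. */
static uint32_t overflow_pages_needed(uint32_t usable, uint32_t n_bytes) {
    uint32_t per_page = usable - 4;
    return (n_bytes + per_page - 1) / per_page;
}

/* Follow a chain starting at page `first` inside an in-memory image
** `db` of n_pages pages and return the number of pages in the chain. */
static uint32_t overflow_chain_length(const uint8_t *db, uint32_t page_size,
                                      uint32_t n_pages, uint32_t first) {
    uint32_t len = 0;
    uint32_t pgno = first;
    while (pgno != 0) {
        assert(pgno <= n_pages);  /* page number must exist */
        assert(len < n_pages);    /* guards against a cyclic chain */
        pgno = read_be32(db + (size_t)(pgno - 1) * page_size);
        len++;
    }
    return len;
}
```

For 4096 usable bytes, 4092 spilled bytes fit in one overflow page, while 4093 bytes need a second page that holds a single byte of data.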
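III. Cells and space management
The btree module receives requests from the virtual machine to insert or delete cells. An insert has to allocate space on tree pages (and overflow pages); conversely, a delete releases space.
Take deletion as an example: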
int sqlite3BtreeDelete(BtCursor *pCur, u8 flags){
  Btree *p = pCur->pBtree;
  BtShared *pBt = p->pBt;
  int rc;                /* Return code */
  MemPage *pPage;        /* Page to delete cell from */
  unsigned char *pCell;  /* Pointer to cell to delete */
  int iCellIdx;          /* Index of cell to delete */
  int iCellDepth;        /* Depth of node containing pCell */
  CellInfo info;         /* Size of the cell being deleted */
  int bSkipnext = 0;     /* Leaf cursor in SKIPNEXT state */
  u8 bPreserve = flags & BTREE_SAVEPOSITION;  /* Keep cursor valid */

  assert( cursorOwnsBtShared(pCur) );
  assert( pBt->inTransaction==TRANS_WRITE );
  assert( (pBt->btsFlags & BTS_READ_ONLY)==0 );
  assert( pCur->curFlags & BTCF_WriteFlag );
  assert( hasSharedCacheTableLock(p, pCur->pgnoRoot, pCur->pKeyInfo!=0, 2) );
  assert( !hasReadConflicts(p, pCur->pgnoRoot) );
  assert( (flags & ~(BTREE_SAVEPOSITION | BTREE_AUXDELETE))==0 );
  if( pCur->eState==CURSOR_REQUIRESEEK ){
    rc = btreeRestoreCursorPosition(pCur);
    assert( rc!=SQLITE_OK || CORRUPT_DB || pCur->eState==CURSOR_VALID );
    if( rc || pCur->eState!=CURSOR_VALID ) return rc;
  }
  assert( CORRUPT_DB || pCur->eState==CURSOR_VALID );

  iCellDepth = pCur->iPage;
  iCellIdx = pCur->ix;
  pPage = pCur->pPage;
  pCell = findCell(pPage, iCellIdx);
  if( pPage->nFree<0 && btreeComputeFreeSpace(pPage) ) return SQLITE_CORRUPT;

  /* If the bPreserve flag is set to true, then the cursor position must
  ** be preserved following this delete operation. If the current delete
  ** will cause a b-tree rebalance, then this is done by saving the cursor
  ** key and leaving the cursor in CURSOR_REQUIRESEEK state before
  ** returning.
  **
  ** Or, if the current delete will not cause a rebalance, then the cursor
  ** will be left in CURSOR_SKIPNEXT state pointing to the entry immediately
  ** before or after the deleted entry. In this case set bSkipnext to true. */
  if( bPreserve ){
    if( !pPage->leaf
     || (pPage->nFree+cellSizePtr(pPage,pCell)+2)>(int)(pBt->usableSize*2/3)
     || pPage->nCell==1  /* See dbfuzz001.test for a test case */
    ){
      /* A b-tree rebalance will be required after deleting this entry.
      ** Save the cursor key. */
      rc = saveCursorKey(pCur);
      if( rc ) return rc;
    }else{
      bSkipnext = 1;
    }
  }

  /* If the page containing the entry to delete is not a leaf page, move
  ** the cursor to the largest entry in the tree that is smaller than
  ** the entry being deleted. This cell will replace the cell being deleted
  ** from the internal node. The 'previous' entry is used for this instead
  ** of the 'next' entry, as the previous entry is always a part of the
  ** sub-tree headed by the child page of the cell being deleted. This makes
  ** balancing the tree following the delete operation easier. */
  if( !pPage->leaf ){
    rc = sqlite3BtreePrevious(pCur, 0);
    assert( rc!=SQLITE_DONE );
    if( rc ) return rc;
  }

  /* Save the positions of any other cursors open on this table before
  ** making any modifications. */
  if( pCur->curFlags & BTCF_Multiple ){
    rc = saveAllCursors(pBt, pCur->pgnoRoot, pCur);
    if( rc ) return rc;
  }

  /* If this is a delete operation to remove a row from a table b-tree,
  ** invalidate any incrblob cursors open on the row being deleted. */
  if( pCur->pKeyInfo==0 && p->hasIncrblobCur ){
    invalidateIncrblobCursors(p, pCur->pgnoRoot, pCur->info.nKey, 0);
  }

  /* Make the page containing the entry to be deleted writable. Then free any
  ** overflow pages associated with the entry and finally remove the cell
  ** itself from within the page. */
  rc = sqlite3PagerWrite(pPage->pDbPage);
  if( rc ) return rc;
  BTREE_CLEAR_CELL(rc, pPage, pCell, info);
  dropCell(pPage, iCellIdx, info.nSize, &rc);
  if( rc ) return rc;

  /* If the cell deleted was not located on a leaf page, then the cursor
  ** is currently pointing to the largest entry in the sub-tree headed
  ** by the child-page of the cell that was just deleted from an internal
  ** node. The cell from the leaf node needs to be moved to the internal
  ** node to replace the deleted cell. */
  if( !pPage->leaf ){
    MemPage *pLeaf = pCur->pPage;
    int nCell;
    Pgno n;
    unsigned char *pTmp;

    if( pLeaf->nFree<0 ){
      rc = btreeComputeFreeSpace(pLeaf);
      if( rc ) return rc;
    }
    if( iCellDepth<pCur->iPage-1 ){
      n = pCur->apPage[iCellDepth+1]->pgno;
    }else{
      n = pCur->pPage->pgno;
    }
    pCell = findCell(pLeaf, pLeaf->nCell-1);
    if( pCell<&pLeaf->aData[4] ) return SQLITE_CORRUPT_BKPT;
    nCell = pLeaf->xCellSize(pLeaf, pCell);
    assert( MX_CELL_SIZE(pBt) >= nCell );
    pTmp = pBt->pTmpSpace;
    assert( pTmp!=0 );
    rc = sqlite3PagerWrite(pLeaf->pDbPage);
    if( rc==SQLITE_OK ){
      insertCell(pPage, iCellIdx, pCell-4, nCell+4, pTmp, n, &rc);
    }
    dropCell(pLeaf, pLeaf->nCell-1, nCell, &rc);
    if( rc ) return rc;
  }

  /* Balance the tree. If the entry deleted was located on a leaf page,
  ** then the cursor still points to that page. In this case the first
  ** call to balance() repairs the tree, and the if(...) condition is
  ** never true.
  **
  ** Otherwise, if the entry deleted was on an internal node page, then
  ** pCur is pointing to the leaf page from which a cell was removed to
  ** replace the cell deleted from the internal node. This is slightly
  ** tricky as the leaf node may be underfull, and the internal node may
  ** be either under or overfull. In this case run the balancing algorithm
  ** on the leaf node first. If the balance proceeds far enough up the
  ** tree that we can be sure that any problem in the internal node has
  ** been corrected, so be it. Otherwise, after balancing the leaf node,
  ** walk the cursor up the tree to the internal node and balance it as
  ** well. */
  rc = balance(pCur);
  if( rc==SQLITE_OK && pCur->iPage>iCellDepth ){
    releasePageNotNull(pCur->pPage);
    pCur->iPage--;
    while( pCur->iPage>iCellDepth ){
      releasePage(pCur->apPage[pCur->iPage--]);
    }
    pCur->pPage = pCur->apPage[pCur->iPage];
    rc = balance(pCur);
  }

  if( rc==SQLITE_OK ){
    if( bSkipnext ){
      assert( bPreserve && (pCur->iPage==iCellDepth || CORRUPT_DB) );
      assert( pPage==pCur->pPage || CORRUPT_DB );
      assert( (pPage->nCell>0 || CORRUPT_DB) && iCellIdx<=pPage->nCell );
      pCur->eState = CURSOR_SKIPNEXT;
      if( iCellIdx>=pPage->nCell ){
        pCur->skipNext = -1;
        pCur->ix = pPage->nCell-1;
      }else{
        pCur->skipNext = 1;
      }
    }else{
      rc = moveToRoot(pCur);
      if( bPreserve ){
        btreeReleaseAllCursorPages(pCur);
        pCur->eState = CURSOR_REQUIRESEEK;
      }
      if( rc==SQLITE_EMPTY ) rc = SQLITE_OK;
    }
  }
  return rc;
}
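1. Managing free pages
When a page is removed from the tree, it is added to the free-page list for later reuse. When the tree needs to grow, a page is taken from the free-page list and added to the tree. If the free-page list is empty, a new page is obtained from the native file system. (A page obtained this way is always appended to the end of the database file.)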
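The reuse-before-grow policy can be shown with a toy model. This is a deliberate simplification: it keeps the free pages on a plain singly linked list held in memory, whereas the btreeInt.h comment quoted earlier describes trunk and leaf freelist pages on disk. The type toy_db and both functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PAGES 64

typedef struct {
    uint32_t n_pages;                   /* pages currently in the file */
    uint32_t free_head;                 /* first free page, 0 if none */
    uint32_t next_free[TOY_MAX_PAGES];  /* next_free[p]: page after p */
} toy_db;

/* Put a page that was removed from the tree onto the free-page list. */
static void toy_free_page(toy_db *db, uint32_t pgno) {
    assert(pgno >= 1 && pgno <= db->n_pages);
    db->next_free[pgno] = db->free_head;
    db->free_head = pgno;
}

/* Get a page for the tree: reuse a free page if there is one,
** otherwise grow the file by one page at its end. */
static uint32_t toy_alloc_page(toy_db *db) {
    if (db->free_head != 0) {
        uint32_t pgno = db->free_head;
        db->free_head = db->next_free[pgno];
        return pgno;
    }
    assert(db->n_pages + 1 < TOY_MAX_PAGES);
    return ++db->n_pages;
}
```

Starting from a 3-page file with an empty free list, the first allocation returns page 4; after page 2 is freed, the next allocation returns page 2 rather than growing the file.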
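2. Managing space within a page
A tree page has three kinds of free space:
1. The space between the cell pointer array and the cell content, called unallocated space (allocatable).
2. Free blocks between cells (allocatable), organized into a free-block linked list.
3. Fragments between cells (not allocatable).
(Figure source: 《基于Android手机SQLite的取证系统设计实现》, "Design and Implementation of a Forensics System Based on SQLite on Android Phones")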
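Each allocation or release changes the free space according to the following rules.
The allocator never hands out fewer than 4 bytes; a request for fewer than 4 bytes is rounded up to 4 bytes. Suppose a page has nFree bytes of free space and receives a request for nRequired bytes (nRequired >= 4, nRequired <= nFree). The allocator then proceeds as follows:
1. Walk the free-block list looking for a free block that is large enough. If one is found:
a. If the block is smaller than nRequired + 4 bytes, remove it from the free-block list and satisfy the request with the first nRequired bytes of the block; the remainder (<= 3 bytes) becomes fragment space.
b. Otherwise, satisfy the request with the last nRequired bytes of the block; the rest stays on the free-block list as a free block.
2. If no free block is large enough and the request exceeds the unallocated space, or if the page has too many fragments, the module defragments the page. A compaction pass gathers the scattered free space into one larger area placed between the cell pointer array and the cell content, moving the existing cells one by one to the bottom of the page.
3. Allocate nRequired bytes from the bottom of the unallocated space.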
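Steps 1 and 3 above can be sketched on a toy page. Everything here is an illustrative assumption rather than SQLite's code: the struct toy_page keeps the header fields as plain C members, each free block stores a 2-byte next offset followed by a 2-byte size at its start, and step 2 (defragmentation) is left out, so the hypothetical toy_alloc returns 0 when compaction would be required.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 512

typedef struct {
    unsigned first_free;     /* offset of first free block, 0 if none */
    unsigned n_frag;         /* total bytes of fragment space */
    unsigned gap_start;      /* first byte after the cell pointer array */
    unsigned content_start;  /* first byte of the cell content area */
    uint8_t data[TOY_PAGE_SIZE];
} toy_page;

static unsigned get16(const uint8_t *p) {
    return ((unsigned)p[0] << 8) | p[1];
}
static void put16(uint8_t *p, unsigned v) {
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

/* Returns the page offset of the allocated space, or 0 if the page
** would first have to be defragmented (not modelled here). */
static unsigned toy_alloc(toy_page *pg, unsigned n_required) {
    unsigned prev = 0;  /* 0 means "the header's first_free field" */
    unsigned pc;
    assert(n_required <= TOY_PAGE_SIZE);
    if (n_required < 4) n_required = 4;

    /* Step 1: look for a free block that is large enough. */
    for (pc = pg->first_free; pc != 0; prev = pc, pc = get16(pg->data + pc)) {
        unsigned size = get16(pg->data + pc + 2);
        if (size < n_required) continue;
        if (size < n_required + 4) {
            /* 1a: unlink the whole block; the leftover is a fragment. */
            unsigned next = get16(pg->data + pc);
            if (prev) put16(pg->data + prev, next);
            else pg->first_free = next;
            pg->n_frag += size - n_required;
            return pc;
        }
        /* 1b: carve the request off the end and shrink the block. */
        put16(pg->data + pc + 2, size - n_required);
        return pc + size - n_required;
    }

    /* Step 3: take the bytes from the bottom of the unallocated gap. */
    if (pg->content_start - pg->gap_start < n_required) return 0;
    pg->content_start -= n_required;
    return pg->content_start;
}
```

With a single 10-byte free block, a request for 8 bytes takes the whole block and leaves 2 bytes of fragment space; with a 20-byte block, the same request is carved from the end and a 12-byte block remains on the list.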
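Releasing the space occupied by a cell
Suppose there is a request to release nFree bytes (nFree >= 4) that the allocator handed out earlier. The allocator creates a new free block of nFree bytes and inserts it at the proper position in the free-block list. It then tries to merge the block with its neighbouring free blocks; any fragments lying between two such blocks are absorbed by the merge. If the released space borders the unallocated space between the cell pointer array and the cell content, it is merged into that unallocated space.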
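The release path can be sketched the same way. As a simplifying assumption, the free blocks are kept in a small array sorted by offset instead of an on-page linked list, and any gap of at most 3 bytes between two free blocks is treated as fragment space. The names free_list, try_merge and toy_free are hypothetical.

```c
#include <assert.h>

#define TOY_MAX_BLOCKS 16

typedef struct { unsigned off, size; } toy_block;

typedef struct {
    toy_block b[TOY_MAX_BLOCKS];  /* free blocks, sorted by offset */
    unsigned n;                   /* number of free blocks */
    unsigned n_frag;              /* total bytes of fragment space */
    unsigned content_start;       /* end of the unallocated space */
} free_list;

/* Merge block i with block i+1 if they touch or are separated only
** by a fragment (a gap of at most 3 bytes). */
static void try_merge(free_list *fl, unsigned i) {
    unsigned end = fl->b[i].off + fl->b[i].size;
    unsigned gap = fl->b[i + 1].off - end;
    if (gap > 3) return;
    assert(fl->n_frag >= gap);
    fl->n_frag -= gap;  /* the fragment is absorbed by the merge */
    fl->b[i].size = fl->b[i + 1].off + fl->b[i + 1].size - fl->b[i].off;
    for (unsigned j = i + 1; j + 1 < fl->n; j++) fl->b[j] = fl->b[j + 1];
    fl->n--;
}

/* Release `size` bytes at page offset `off`. */
static void toy_free(free_list *fl, unsigned off, unsigned size) {
    unsigned i = 0;
    assert(size >= 4 && fl->n < TOY_MAX_BLOCKS);
    while (i < fl->n && fl->b[i].off < off) i++;        /* find position */
    for (unsigned j = fl->n; j > i; j--) fl->b[j] = fl->b[j - 1];
    fl->b[i].off = off;
    fl->b[i].size = size;
    fl->n++;
    if (i + 1 < fl->n) try_merge(fl, i);                /* with successor */
    if (i > 0) try_merge(fl, i - 1);                    /* with predecessor */
    /* A block that borders the unallocated space is folded into it. */
    if (fl->n > 0 && fl->b[0].off == fl->content_start) {
        fl->content_start += fl->b[0].size;
        for (unsigned j = 0; j + 1 < fl->n; j++) fl->b[j] = fl->b[j + 1];
        fl->n--;
    }
}
```

Freeing 10 bytes at offset 200 and then 8 bytes at offset 212 yields one 20-byte free block, because the 2-byte fragment between them is absorbed.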