2021SC@SDUSC SQLite源码分析(四)————sqlite中的cell概念


一、cell是什么?

sqlite数据库文件被分为固定大小的页,所有的页由B+树模块管理。每个页要么是树中的页(内部页或者叶子页),要么是溢出页,或者是自由页(自由页通过单链表组织起来)。
树中的页(内部页和叶子页)被分成许多cell,一个cell包括一个(或者一部分)负载。Cell是已分配或者已释放的磁盘空间集合。页的内容见往期分析。

每个页被分为4个部分:
1.页头
2.Cell指针数组
3.未分配空间
4.Cell内容

struct CellInfo {
  i64 nKey;      /* The key for INTKEY tables, or nPayload otherwise */
  u8 *pPayload;  /* Pointer to the start of payload */
  u32 nPayload;  /* Bytes of payload */
  u16 nLocal;    /* Amount of payload held locally, not on overflow */
  u16 nSize;     /* Size of the cell content on the main b-tree page */
};

Cell指针数组从上向下增长,cell内容从下向上增长。Cell指针数组作为页内的一种目录,帮助把cell组织起来。
页头只包含本页的管理信息,并且总是存储在页的开头。
Cell被保存在页的最底部,向页开头的方向增长。Cell指针数组从页头之后的第一个byte开始,包含0个或者多个cell指针。每个cell指针是一个2byte整数,该整数指示了实际cell内容相对于页开头的偏移量。Cell指针按照相应的键值排序,尽管cell本身的存储不是按照顺序的。Cell指针数组的大小被存储在页头处偏移量为3的地方。
Cell是变长的字节字符串。一个cell保存了一个负载。Cell的结构如下图所示,size列的单位是byte。

在这里插入图片描述

对应btreeInt.h中注释如下:

** The content of a cell looks like this:
**
**    SIZE    DESCRIPTION
**      4     Page number of the left child. Omitted if leaf flag is set.
**     var    Number of bytes of data. Omitted if the zerodata flag is set.
**     var    Number of bytes of key. Or the key itself if intkey flag is set.
**      *     Payload
**      4     First page of the overflow chain.  Omitted if no overflow
**
** Overflow pages form a linked list.  Each page except the last is completely
** filled with data (pagesize - 4 bytes).  The last page can have as little
** as 1 byte of data.
**
**    SIZE    DESCRIPTION
**      4     Page number of next overflow page
**      *     Data
**
** Freelist pages come in two subtypes: trunk pages and leaf pages.  The
** file header points to the first in a linked list of trunk page.  Each trunk
** page points to multiple leaf pages.  The content of a leaf page is
** unspecified.  A trunk page looks like this:
**
**    SIZE    DESCRIPTION
**      4     Page number of next trunk page
**      4     Number of leaf pointers on this page
**      *     zero or more pages numbers of leaves

对于内部节点,每个cell包含一个4byte的子节点指针;对于叶子节点,cell没有子节点指针。接下来是该cell存储的数据的大小(bytes)和存储的键值的大小(bytes)
(如果页头中的intkey为真,那么存储键值大小的地方就直接存储键值本身的整型值,如果zerodata为真,那么数据部分不存在)。

二、cell与溢出页结构

1.溢出页的意义

SQLite限制了每页中负载的数量。负载有可能不会把自身的全部存储在同一页中,尽管可能该页有足够大的空间。每页能够存储的最大单个负载由该页可用空间(可以被内部节点的单个cell使用)的总大小决定。
如果内部节点中的cell的负载大于最大负载限制,超出的部分就被分割并存储到溢出页链表中。一但分配了一个溢出页,要把尽可能多的字节转移到溢出页中,只要不导致cell的大小低于最小负载限制就行。对于叶子节点,最小负载限制存储在文件头内偏移量23处,但是最大负载限制总是100%,并且不会在文件头给出。

static int getOverflowPage(
  BtShared *pBt,               /* The database file */
  Pgno ovfl,                   /* Current overflow page number */
  MemPage **ppPage,            /* OUT: MemPage handle (may be NULL) */
  Pgno *pPgnoNext              /* OUT: Next overflow page number */
){
  Pgno next = 0;
  MemPage *pPage = 0;
  int rc = SQLITE_OK;

  assert( sqlite3_mutex_held(pBt->mutex) );
  assert(pPgnoNext);

#ifndef SQLITE_OMIT_AUTOVACUUM
  /* Try to find the next page in the overflow list using the
  ** autovacuum pointer-map pages. Guess that the next page in 
  ** the overflow list is page number (ovfl+1). If that guess turns 
  ** out to be wrong, fall back to loading the data of page 
  ** number ovfl to determine the next page number.
  */
  if( pBt->autoVacuum ){
    Pgno pgno;
    Pgno iGuess = ovfl+1;
    u8 eType;

    while( PTRMAP_ISPAGE(pBt, iGuess) || iGuess==PENDING_BYTE_PAGE(pBt) ){
      iGuess++;
    }

    if( iGuess<=btreePagecount(pBt) ){
      rc = ptrmapGet(pBt, iGuess, &eType, &pgno);
      if( rc==SQLITE_OK && eType==PTRMAP_OVERFLOW2 && pgno==ovfl ){
        next = iGuess;
        rc = SQLITE_DONE;
      }
    }
  }
#endif

  assert( next==0 || rc==SQLITE_DONE );
  if( rc==SQLITE_OK ){
    rc = btreeGetPage(pBt, ovfl, &pPage, (ppPage==0) ? PAGER_GET_READONLY : 0);
    assert( rc==SQLITE_OK || pPage==0 );
    if( rc==SQLITE_OK ){
      next = get4byte(pPage->aData);
    }
  }

  *pPgnoNext = next;
  if( ppPage ){
    *ppPage = pPage;
  }else{
    releasePage(pPage);
  }
  return (rc==SQLITE_DONE ? SQLITE_OK : rc);
}

2.溢出页的定义

一个cell的溢出页是一个单链表,每个溢出页(除了最后一个页)被填满数据,数据长度等于可用空间除以4bytes:头4个bytes存储下一个溢出页的页号,最后一个溢出页可以小到只有1byte数据。不能在同一个溢出页存储来自两个cell的内容。

三、cell与空间管理

btree模块接受来自虚拟机的插入或者删除cell的请求。插入操作需要对树上的页(和溢出页)分配空间,相对的,删除操作释放空间。
以删除为例

int sqlite3BtreeDelete(BtCursor *pCur, u8 flags){
  Btree *p = pCur->pBtree;
  BtShared *pBt = p->pBt;              
  int rc;                              /* Return code */
  MemPage *pPage;                      /* Page to delete cell from */
  unsigned char *pCell;                /* Pointer to cell to delete */
  int iCellIdx;                        /* Index of cell to delete */
  int iCellDepth;                      /* Depth of node containing pCell */ 
  CellInfo info;                       /* Size of the cell being deleted */
  int bSkipnext = 0;                   /* Leaf cursor in SKIPNEXT state */
  u8 bPreserve = flags & BTREE_SAVEPOSITION;  /* Keep cursor valid */

  assert( cursorOwnsBtShared(pCur) );
  assert( pBt->inTransaction==TRANS_WRITE );
  assert( (pBt->btsFlags & BTS_READ_ONLY)==0 );
  assert( pCur->curFlags & BTCF_WriteFlag );
  assert( hasSharedCacheTableLock(p, pCur->pgnoRoot, pCur->pKeyInfo!=0, 2) );
  assert( !hasReadConflicts(p, pCur->pgnoRoot) );
  assert( (flags & ~(BTREE_SAVEPOSITION | BTREE_AUXDELETE))==0 );
  if( pCur->eState==CURSOR_REQUIRESEEK ){
    rc = btreeRestoreCursorPosition(pCur);
    assert( rc!=SQLITE_OK || CORRUPT_DB || pCur->eState==CURSOR_VALID );
    if( rc || pCur->eState!=CURSOR_VALID ) return rc;
  }
  assert( CORRUPT_DB || pCur->eState==CURSOR_VALID );

  iCellDepth = pCur->iPage;
  iCellIdx = pCur->ix;
  pPage = pCur->pPage;
  pCell = findCell(pPage, iCellIdx);
  if( pPage->nFree<0 && btreeComputeFreeSpace(pPage) ) return SQLITE_CORRUPT;

  /* If the bPreserve flag is set to true, then the cursor position must
  ** be preserved following this delete operation. If the current delete
  ** will cause a b-tree rebalance, then this is done by saving the cursor
  ** key and leaving the cursor in CURSOR_REQUIRESEEK state before 
  ** returning. 
  **
  ** Or, if the current delete will not cause a rebalance, then the cursor
  ** will be left in CURSOR_SKIPNEXT state pointing to the entry immediately
  ** before or after the deleted entry. In this case set bSkipnext to true.  */
  if( bPreserve ){
    if( !pPage->leaf 
     || (pPage->nFree+cellSizePtr(pPage,pCell)+2)>(int)(pBt->usableSize*2/3)
     || pPage->nCell==1  /* See dbfuzz001.test for a test case */
    ){
      /* A b-tree rebalance will be required after deleting this entry.
      ** Save the cursor key.  */
      rc = saveCursorKey(pCur);
      if( rc ) return rc;
    }else{
      bSkipnext = 1;
    }
  }

  /* If the page containing the entry to delete is not a leaf page, move
  ** the cursor to the largest entry in the tree that is smaller than
  ** the entry being deleted. This cell will replace the cell being deleted
  ** from the internal node. The 'previous' entry is used for this instead
  ** of the 'next' entry, as the previous entry is always a part of the
  ** sub-tree headed by the child page of the cell being deleted. This makes
  ** balancing the tree following the delete operation easier.  */
  if( !pPage->leaf ){
    rc = sqlite3BtreePrevious(pCur, 0);
    assert( rc!=SQLITE_DONE );
    if( rc ) return rc;
  }

  /* Save the positions of any other cursors open on this table before
  ** making any modifications.  */
  if( pCur->curFlags & BTCF_Multiple ){
    rc = saveAllCursors(pBt, pCur->pgnoRoot, pCur);
    if( rc ) return rc;
  }

  /* If this is a delete operation to remove a row from a table b-tree,
  ** invalidate any incrblob cursors open on the row being deleted.  */
  if( pCur->pKeyInfo==0 && p->hasIncrblobCur ){
    invalidateIncrblobCursors(p, pCur->pgnoRoot, pCur->info.nKey, 0);
  }

  /* Make the page containing the entry to be deleted writable. Then free any
  ** overflow pages associated with the entry and finally remove the cell
  ** itself from within the page.  */
  rc = sqlite3PagerWrite(pPage->pDbPage);
  if( rc ) return rc;
  BTREE_CLEAR_CELL(rc, pPage, pCell, info);
  dropCell(pPage, iCellIdx, info.nSize, &rc);
  if( rc ) return rc;

  /* If the cell deleted was not located on a leaf page, then the cursor
  ** is currently pointing to the largest entry in the sub-tree headed
  ** by the child-page of the cell that was just deleted from an internal
  ** node. The cell from the leaf node needs to be moved to the internal
  ** node to replace the deleted cell.  */
  if( !pPage->leaf ){
    MemPage *pLeaf = pCur->pPage;
    int nCell;
    Pgno n;
    unsigned char *pTmp;

    if( pLeaf->nFree<0 ){
      rc = btreeComputeFreeSpace(pLeaf);
      if( rc ) return rc;
    }
    if( iCellDepth<pCur->iPage-1 ){
      n = pCur->apPage[iCellDepth+1]->pgno;
    }else{
      n = pCur->pPage->pgno;
    }
    pCell = findCell(pLeaf, pLeaf->nCell-1);
    if( pCell<&pLeaf->aData[4] ) return SQLITE_CORRUPT_BKPT;
    nCell = pLeaf->xCellSize(pLeaf, pCell);
    assert( MX_CELL_SIZE(pBt) >= nCell );
    pTmp = pBt->pTmpSpace;
    assert( pTmp!=0 );
    rc = sqlite3PagerWrite(pLeaf->pDbPage);
    if( rc==SQLITE_OK ){
      insertCell(pPage, iCellIdx, pCell-4, nCell+4, pTmp, n, &rc);
    }
    dropCell(pLeaf, pLeaf->nCell-1, nCell, &rc);
    if( rc ) return rc;
  }

  /* Balance the tree. If the entry deleted was located on a leaf page,
  ** then the cursor still points to that page. In this case the first
  ** call to balance() repairs the tree, and the if(...) condition is
  ** never true.
  **
  ** Otherwise, if the entry deleted was on an internal node page, then
  ** pCur is pointing to the leaf page from which a cell was removed to
  ** replace the cell deleted from the internal node. This is slightly
  ** tricky as the leaf node may be underfull, and the internal node may
  ** be either under or overfull. In this case run the balancing algorithm
  ** on the leaf node first. If the balance proceeds far enough up the
  ** tree that we can be sure that any problem in the internal node has
  ** been corrected, so be it. Otherwise, after balancing the leaf node,
  ** walk the cursor up the tree to the internal node and balance it as 
  ** well.  */
  rc = balance(pCur);
  if( rc==SQLITE_OK && pCur->iPage>iCellDepth ){
    releasePageNotNull(pCur->pPage);
    pCur->iPage--;
    while( pCur->iPage>iCellDepth ){
      releasePage(pCur->apPage[pCur->iPage--]);
    }
    pCur->pPage = pCur->apPage[pCur->iPage];
    rc = balance(pCur);
  }

  if( rc==SQLITE_OK ){
    if( bSkipnext ){
      assert( bPreserve && (pCur->iPage==iCellDepth || CORRUPT_DB) );
      assert( pPage==pCur->pPage || CORRUPT_DB );
      assert( (pPage->nCell>0 || CORRUPT_DB) && iCellIdx<=pPage->nCell );
      pCur->eState = CURSOR_SKIPNEXT;
      if( iCellIdx>=pPage->nCell ){
        pCur->skipNext = -1;
        pCur->ix = pPage->nCell-1;
      }else{
        pCur->skipNext = 1;
      }
    }else{
      rc = moveToRoot(pCur);
      if( bPreserve ){
        btreeReleaseAllCursorPages(pCur);
        pCur->eState = CURSOR_REQUIRESEEK;
      }
      if( rc==SQLITE_EMPTY ) rc = SQLITE_OK;
    }
  }
  return rc;
}

1.自由页的管理

当一页从树中移除的时候,该页被添加到自由页链表中,为稍后的重复使用做准备。当树需要扩展的时候,从自由页链表中取出一页添加到树上。如果自由页链表为空,就从本地文件系统中取出一页。(从本地文件系统取出的页总是添加在数据库文件的末尾。)

2.页内空间管理

树上的页有三种类型的自由空间:
1.
Cell指针数组到cell实体之间的空间(称为未分配空间)。(可分配)
2.
cell实体间的自由空间块(可分配)(通过一个自由空间块链表组织起来)
3.
Cell间的碎片空间(不可分配)
Alt
(图源:《基于Android手机SQLite的取证系统设计实现》)

每次分配和释放空间后,造成的自由空间变化遵循以下的原则:

空间分配器不会分配小于4bytes的空间,如果有小于4bytes的申请,那么空间分配器分配4bytes的空间。假设某页有nFree bytes的空间,对于该页有nRequired bytes(nRequired>=4)的申请,nRequired <= nFree,那么空间分配器按照以下的步骤分配空间:
1.
遍历自由空间块链表,寻找是否有足够大的自由空间块,如果找到了,按照如下原则继续执行:
a.如果自由空间块的大小小于nRequired + 4,就把该空间块从自由空间块链表中取出来,用从空间块开头算起的nRequired bytes满足申请,剩下的空间(<=3 bytes)就成了碎片空间。
b.否则,用从空间块末尾算起的nRequired bytes满足申请,剩下的自由空间依旧作为自由空间块存储于自由空间块链表中。
2.
如果没找到足够大的自由空间块,而且需要分配的空间超过了未分配空间,或者该页有太多的碎片空间,那么本模块就对该页进行碎片整理。通过执行压缩算法把碎片集合成更大的自由空间,并放到Cell指针数组与cell实体之间。在压缩过程中,需要一个接一个地把已经存在的cell移动到页的底部。
3.
从未分配空间的底部开始,分配nRequired bytes的空间。
释放cell占有的空间
假设有一个释放nFree(>=4 bytes)(这nFree之前被分配器分配过)的请求。分配器创建一个新的大小为nFree bytes的自由空间块,把这块自由空间插入到自由空间块链表中合适的位置。然后尝试将该块和周围的空间块合并。如果两个空间块之间有碎片,这些碎片也会被合并。如果cell指针数组和cell实体之间有未分配空间,那么将释放的空间和未分配空间合并。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值