- Data Model: a sparse, distributed, persistent, multidimensional sorted map
(row:string, column:string, time:int64) -> string //both key and value are uninterpreted bytes
- Row
- a read or update of a single row is atomic.
- design the row key so that rows read together by range are stored close together. Note that keys sort as bytes, so integers written as strings sort as 1, 10, 100, 2, ...; zero-pad them to keep numeric order.
- Column Family: must be declared up front; it is the basic unit of access control. Columns in a column family usually hold the same type of data. Disk (compression and indexing) and memory statistics are tracked per column family.
- Timestamp (64-bit int): a cell can hold multiple versions. Versions are ordered by decreasing timestamp. Old versions can be garbage-collected by keeping only the last n versions or only new-enough versions (e.g., values written in the last several days).
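The data model above can be sketched with a small in-memory structure. This is only an illustration under assumptions, not HBase code: `VersionedTable`, `padded_key`, and `max_versions` are hypothetical names, and garbage collection here happens eagerly on every put rather than during compaction.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of (row, column, timestamp) -> value; not the HBase API."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        versions = self.cells[(row, column)]
        versions.append((timestamp, value))
        # Keep versions in decreasing timestamp order.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        # Garbage-collect everything beyond the newest max_versions entries.
        del versions[self.max_versions:]

    def get(self, row, column, timestamp=None):
        """Newest value at or before `timestamp` (newest overall if None)."""
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column) keys in byte order, start_row <= row < stop_row."""
        for key in sorted(self.cells):
            if start_row <= key[0] < stop_row:
                yield key


def padded_key(n, width=10):
    """Zero-pad an integer id so that byte order matches numeric order."""
    return str(n).zfill(width).encode()
```

Sorting the raw keys `b"1", b"10", b"100", b"2"` leaves them in exactly that order, which is why the padded form is the safer choice for range scans over numeric ids.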
- API: Get, Put, Scan, Delete
HBase doesn't modify data in place. A delete is handled by writing a tombstone. Tombstones, along with the dead values they mask, are cleaned up during major compaction.
rows can be fetched by:
- row key
- row range
- scan
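A minimal sketch of the "no in-place modification" idea: deletes append a tombstone, reads hide anything masked by it, and a major compaction physically drops tombstones and dead values. `AppendOnlyStore` and `TOMBSTONE` are hypothetical names, and the model keeps only one live version per row for brevity.

```python
TOMBSTONE = object()  # marker meaning "deleted"; hypothetical stand-in


class AppendOnlyStore:
    def __init__(self):
        self.log = []  # append-only list of (row, timestamp, value)

    def put(self, row, timestamp, value):
        self.log.append((row, timestamp, value))

    def delete(self, row, timestamp):
        # A delete is just another write: it appends a tombstone.
        self.log.append((row, timestamp, TOMBSTONE))

    def _newest(self):
        # For each row, the entry with the highest timestamp wins.
        newest = {}
        for row, ts, value in self.log:
            if row not in newest or ts >= newest[row][0]:
                newest[row] = (ts, value)
        return newest

    def get(self, row):
        entry = self._newest().get(row)
        if entry is None or entry[1] is TOMBSTONE:
            return None
        return entry[1]

    def scan(self, start_row, stop_row):
        items = sorted(self._newest().items(), key=lambda kv: kv[0])
        return [(r, v) for r, (_, v) in items
                if start_row <= r < stop_row and v is not TOMBSTONE]

    def major_compact(self):
        # Rewrite the data, dropping tombstones and the values they mask.
        items = sorted(self._newest().items(), key=lambda kv: kv[0])
        self.log = [(r, ts, v) for r, (ts, v) in items if v is not TOMBSTONE]
```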
- Building blocks
- region: a table is made up of regions, each covering a contiguous row range. Regions can be distributed and load-balanced across different machines (region servers).
- HFile: a persisted, ordered, immutable map from keys to values. The block index is loaded into memory for lookups; the matching data block is then loaded into memory.
- data blocks and meta blocks can be compressed with gzip or LZO
- FileInfo: HFile metadata and user-defined info
- data block index: each index key is the key of the first record in the data block
- trailer: at a fixed offset in the HFile. It is loaded first when the HFile is accessed.
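The block-index lookup described for HFile can be sketched as a binary search over the first key of each data block. `ToyHFile` and `block_size` are hypothetical; the real file format stores byte offsets, compression, and more, none of which is modeled here.

```python
import bisect


class ToyHFile:
    """Sorted, immutable key -> value file split into small data blocks."""

    def __init__(self, items, block_size=2):
        records = sorted(items)
        # Data blocks: consecutive runs of sorted records.
        self.blocks = [records[i:i + block_size]
                       for i in range(0, len(records), block_size)]
        # Block index: the first key of each block, kept in memory.
        self.index = [block[0][0] for block in self.blocks]

    def get(self, key):
        # Find the last block whose first key is <= key.
        pos = bisect.bisect_right(self.index, key) - 1
        if pos < 0:
            return None  # key sorts before the first block
        # "Load" that single block and search inside it.
        for k, v in self.blocks[pos]:
            if k == key:
                return v
        return None
```

Only one block is touched per lookup, which is the point of keeping the index in memory.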
- HLog: one HLog file per region server. It is a sequence file (HLogKey -> KeyValue).
HLogKey: table + region + sequence number + timestamp
KeyValue: the same KeyValue structure used in HFile
cons: after a region server failure, the log must be split by region and the splits sent to the region servers that take over those regions.
pros: appending to a single log file avoids seeks across multiple files.
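The log-splitting cost mentioned above comes from the shared log interleaving edits for many regions. A rough sketch of the split step, assuming the HLogKey fields listed in the note (`split_log` and the string edits are hypothetical):

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in carrying the HLogKey fields listed above.
HLogKey = namedtuple("HLogKey", "table region seq timestamp")


def split_log(entries):
    """Group one server's shared log into per-region edit lists.

    entries: iterable of (HLogKey, edit). Edits for each region are
    returned in sequence-number order so they can be replayed.
    """
    per_region = defaultdict(list)
    for key, edit in entries:
        per_region[(key.table, key.region)].append((key.seq, edit))
    return {
        region: [edit for _, edit in sorted(edits, key=lambda se: se[0])]
        for region, edits in per_region.items()
    }
```

Each resulting list would then be handed to whichever region server takes over that region.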
- Architecture view
- client: accesses HBase via the API and caches some info, such as region locations
- zookeeper
- only one master holds the master service lock. All other masters keep trying to acquire it in case the active master fails.
- a file pointing to the root region of the META table
- status of all region servers; notifies the master of server up and down events
- HBase schema info: tables and column families
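The master lock behaviour can be illustrated with an in-memory stand-in. `ToyLock` and `elect` are hypothetical and do not use any ZooKeeper API; they only show that one candidate wins and a standby takes over once the lock is released.

```python
class ToyLock:
    """In-memory stand-in for the master service lock."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        # Succeeds only if the lock is free or already held by `name`.
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        # Models the active master failing or shutting down.
        if self.holder == name:
            self.holder = None


def elect(lock, candidates):
    """Every candidate tries the lock; exactly one becomes active."""
    for name in candidates:
        lock.try_acquire(name)
    return lock.holder
```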
- Master
- assigns regions to region servers when HBase starts up and when a table is created. Existing regions are assigned to the region server that hosted them before HBase shut down, to keep data locality.
- load-balances region servers
- reassigns regions to other region servers after being notified that a region server is down
- schema updates
- reclaims HDFS files
- assigns the ROOT and META tables to region servers at startup
- region server
- maintains the regions assigned to it and handles I/O with clients
- splits large regions
- workflow
- locate region (a three-level structure similar to a B+ tree)
- non-splittable root region
- the META table contains the locations of regions. row key = table name + start key of region
- all records of the META table are kept in memory on the region server hosting the META region
- up to (128MB/1KB) * (128MB/1KB) = 2^34 regions, assuming 128 MB META regions and about 1 KB per META row
- up to 6 round trips between client and server to refresh a stale region cache in the client
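The capacity figure and the META lookup can be checked with a short sketch. The `locate` function and the tuple-shaped META rows are assumptions for illustration; the real META row key also encodes more than table name and start key.

```python
import bisect

# Capacity arithmetic from the note: 128 MB regions, about 1 KB per META row.
rows_per_meta_region = (128 * 1024 * 1024) // 1024  # 2**17 rows
addressable_regions = rows_per_meta_region ** 2     # 2**34 regions


def locate(meta, table, row):
    """Find the server hosting `row` of `table`.

    meta: list of ((table, start_key), server) sorted by key, standing in
    for META rows keyed by table name + region start key.
    """
    keys = [k for k, _ in meta]
    # The hosting region is the last one whose start key is <= row.
    pos = bisect.bisect_right(keys, (table, row)) - 1
    if pos < 0 or keys[pos][0] != table:
        return None
    return meta[pos][1]
```

A client would cache the result of `locate` and only repeat the lookup when the cached server no longer hosts the region.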
- read/write,
- write: goes first to the WAL and then to the memstore. The memstore is flushed to a store file when it exceeds a threshold. A redo point is sent to ZooKeeper for recovery.
- read: a merged view of the store files and the memstore. Minor and major compactions merge store files (a major compaction down to one) to improve read performance.
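The write and read paths above can be sketched as follows. `ToyRegion` is hypothetical: the flush threshold counts entries instead of bytes, store files are plain dicts, and the redo point is not modeled.

```python
class ToyRegion:
    def __init__(self, flush_threshold=2):
        self.wal = []         # write-ahead log, append-only
        self.memstore = {}    # in-memory buffer of recent writes
        self.storefiles = []  # flushed immutable snapshots, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log first, for recovery
        self.memstore[key] = value     # 2. then update the memstore
        # Flush once the memstore reaches the threshold (entry count here).
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        # Merged view: memstore first, then store files newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for storefile in reversed(self.storefiles):
            if key in storefile:
                return storefile[key]
        return None

    def compact(self):
        # Merge all store files into one; newer files overwrite older ones.
        merged = {}
        for storefile in self.storefiles:
            merged.update(storefile)
        self.storefiles = [dict(sorted(merged.items()))] if merged else []
```

After `compact()` a read checks at most one store file, which is the read-performance benefit the note refers to.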
- region assignment: done by the Master, which knows the region servers, each region's affiliation, and the unassigned regions.
- region server up/down,
- up: the region server puts a file under ZooKeeper's server folder and locks it. The Master is notified by ZooKeeper.
- down: ZooKeeper releases the lock on the server's file, and the Master can then lock the file itself. (The Master learns of this either by querying the lock state of all files under the server folder or by failing to reach the region server after several tries.) After that, the Master reassigns the server's regions to other region servers.
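The reassignment step after a region server goes down might look like the sketch below. The least-loaded policy and the `reassign` name are assumptions made for illustration; the note does not say how the Master picks target servers.

```python
def reassign(assignments, dead_server):
    """Move a dead server's regions to the surviving servers.

    assignments: {server_name: set(region_names)}; modified in place.
    Each orphaned region goes to the currently least-loaded survivor
    (ties broken by server name), which is an assumed policy.
    """
    orphans = sorted(assignments.pop(dead_server, set()))
    if orphans and not assignments:
        raise RuntimeError("no surviving region servers")
    for region in orphans:
        target = min(sorted(assignments), key=lambda s: len(assignments[s]))
        assignments[target].add(region)
    return assignments
```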
- master up/down
- up: the master tries to get the master service lock to become the primary master, gets the list of all servers from ZooKeeper, gets the regions hosted by each region server, then reads the META table and calculates the unassigned regions.
- down: tables cannot be created or deleted, schemas cannot be updated, and regions cannot be assigned or merged.
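The "calculate unassigned regions" step at master startup amounts to a set difference between the regions listed in META and the regions the live servers report. A small sketch with hypothetical names:

```python
def unassigned_regions(all_regions, assignments):
    """Regions listed in META that no live region server reports hosting.

    all_regions: set of region names read from the META table.
    assignments: {server_name: set(region_names)} reported by live servers.
    """
    assigned = set().union(*assignments.values())
    return all_regions - assigned
```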
reference: http://mvplee.iteye.com/blog/2247221, the bigtable paper