HBase and MapR-DB: Designed for Distribution, Scale, and Speed


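Preface

I recently read an article that I learned a lot from. The article is in English, and the existing Chinese translations read awkwardly, so I translated it myself in my spare time for my own review.

Main Text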

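Apache HBase is a database that runs on top of a Hadoop cluster. HBase is not a traditional relational database: to achieve better scalability, it relaxes the ACID (Atomicity, Consistency, Isolation, Durability) properties of a traditional RDBMS. Data stored in HBase also does not need to fit a rigid schema the way it does in a relational database, which makes HBase well suited to storing unstructured or semi-structured data.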

The MapR Converged Data Platform supports HBase, but also supports MapR-DB, a high-performance, enterprise-grade NoSQL DBMS that includes the HBase API to run HBase applications. For this blog, I’ll specifically refer to HBase, but understand that many of the advantages of using HBase in your data architecture apply to MapR-DB. MapR built MapR-DB to take HBase applications to the next level, so if the thought of higher-powered, more reliable HBase deployments sounds appealing to you, take a look at some of the MapR-DB content here.

HBase allows you to build big data applications that scale, but this brings some different ways of implementing applications compared to developing with traditional relational databases. In this blog post, I will provide an overview of HBase, touch on the limitations of relational databases, and dive into the specifics of the HBase data model.

Relational Databases vs. HBase – Data Storage Model

Why do we need NoSQL/HBase? First, let’s look at the pros of relational databases before we discuss their limitations:

* Relational databases have provided a standard persistence model
* SQL has become a de facto standard model of data manipulation
* Relational databases manage concurrency for transactions
* Relational databases have lots of tools

Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically, with a bigger server, but this can get expensive, and there is a limit to how large a single server can grow.

Relational Databases vs. HBase - Scaling

What changed to bring on NoSQL?

An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable.

To horizontally partition or shard an RDBMS, data is distributed on the basis of rows, with some rows residing on one machine and other rows residing on other machines. However, it’s complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.

Limitations of a Relational Model
Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema requires joins at query time in order to bring the data back together again. HBase does not support relationships and joins; instead, data that is accessed together is stored together, so it avoids these limitations of the relational model. See the difference in data storage models in the chart below:

Relational databases vs. HBase - data storage model
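The contrast can be sketched with plain Python dictionaries. This is only a toy model, not the HBase client API; the table contents, family names ("address", "order"), and function names are assumptions made up for illustration.

```python
# Toy data; all table, family, and column names are hypothetical.
customers = {1234: {"name": "Ann", "city": "Austin"}}
orders = [
    {"order_id": 1, "customer_id": 1234, "item": "disk"},
    {"order_id": 2, "customer_id": 1234, "item": "cable"},
]


def relational_lookup(customer_id):
    """Normalized model: reassemble a customer with a join at query time."""
    customer = customers[customer_id]
    joined = [o for o in orders if o["customer_id"] == customer_id]
    return {"customer": customer, "orders": joined}


# HBase-style model: one row per key; column families group related columns.
hbase_table = {
    b"1234": {
        "address": {"city": b"Austin"},
        "order": {"1:item": b"disk", "2:item": b"cable"},
    }
}


def hbase_lookup(row_key):
    """Denormalized model: one keyed read returns every family of the row."""
    return hbase_table[row_key]
```

The relational version has to filter a second table on every read, while the HBase-style version pays for that convenience by storing the order data inside the customer's row.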

HBase Designed for Distribution, Scale, and Speed

HBase was designed to scale because data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning or sharding, the row key range is used for sharding, which distributes different data across multiple servers. Each server is the source for a subset of the data. Data that is accessed together sits in the same shard, which keeps access fast as the cluster scales. HBase is actually an implementation of the BigTable storage architecture, a distributed storage system developed by Google to manage structured data and designed to scale to a very large size.

HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, look up a customer with the ID 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.

HBase is a column family-oriented database

HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and to sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of the data.

HBase is a distributed database

HBase Data Model

Data stored in HBase is located by its “rowkey.” This is like a primary key in a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical consideration in HBase schema design.

HBase data model – row keys
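Why sorted row keys matter for lookups and range scans can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not HBase itself: the class `SortedRows` and the `cust…` keys are hypothetical, and Python's `bisect` module stands in for the sorted storage.

```python
import bisect


class SortedRows:
    """Toy table that keeps its row keys (bytes) in sorted order."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)  # insert while keeping order
        self._rows[key] = value

    def get(self, key):
        """Point lookup by row key; returns None if the row is absent."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]
```

Row keys compare as raw bytes, so `b"cust10"` sorts before `b"cust2"`. Zero-padding numeric parts (for example `b"cust0002"`, `b"cust0010"`) keeps the byte order aligned with numeric order, which is one reason row key design is treated as part of schema design.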

Tables are divided by key range into sequences of rows called regions. These regions are then assigned to the data nodes in the cluster, called “RegionServers.” Spreading regions across the cluster scales read and write capacity. This is done automatically and is how HBase was designed for horizontal sharding.

Tables are split into regions = contiguous keys
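Routing a row key to its region can be sketched as a binary search over region start keys. The split points and server names below are invented for illustration, and the real assignment is managed by HBase itself rather than by a static list like this one.

```python
import bisect

# Hypothetical split points: region i covers [start_key[i], start_key[i + 1]).
REGION_START_KEYS = [b"", b"g", b"n", b"t"]
# Hypothetical assignment of region index -> RegionServer name.
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def locate_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def locate_server(row_key):
    """Return the (made-up) server name hosting the region for row_key."""
    return REGION_SERVERS[locate_region(row_key)]
```

Because each region holds a contiguous key range, a range scan touches only the regions whose ranges overlap the scan, and adding servers only requires moving whole regions.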

The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.

The data is stored in HBase table cells. The entire cell, with the added structural information, is called a KeyValue. For every cell in which you have set a value, the entire cell is stored: the row key, column family name, column name, timestamp, and value. The key consists of the row key, column family name, column name, and timestamp.

Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells, each containing all of its key information.

In the image below, the top left shows the logical layout of the data, while the lower right shows the physical storage in files. Column families are stored in separate files, and each stored cell carries its row key, column family name, column name, timestamp, and value.

Logical data model vs. physical data storage
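The mapping from the logical table to per-family lists of cells can be sketched as follows. This is a simplified model, not the actual file format: the sample data and the function name `to_key_values` are assumptions, and each cell is a plain tuple of (row key, family, column, timestamp, value).

```python
from collections import defaultdict

# Logical view: row key -> family -> column -> {timestamp: value}. Made-up data.
logical = {
    b"row1": {"cf1": {"a": {2: b"v2", 1: b"v1"}}, "cf2": {"x": {1: b"p"}}},
    b"row2": {"cf1": {"b": {1: b"q"}}},  # row2 has no cf2 cells, so none are stored
}


def to_key_values(table):
    """Flatten the logical table into one sorted list of cells per column family."""
    files = defaultdict(list)
    for row, families in table.items():
        for family, columns in families.items():
            for column, versions in columns.items():
                for ts, value in versions.items():
                    # Every cell repeats its full key next to the value.
                    files[family].append((row, family, column, ts, value))
    for cells in files.values():
        # Order by row key, then column, then newest timestamp first.
        cells.sort(key=lambda kv: (kv[0], kv[2], -kv[3]))
    return dict(files)
```

Two things are visible in the output: each family ends up in its own list (standing in for its own file), and empty logical cells simply produce no entry at all.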

As mentioned before, the complete coordinates of a cell’s value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated: if data doesn’t exist at a column, it’s not stored. Table cells are versioned, uninterpreted arrays of bytes. You can use the timestamp as the version or set up your own versioning system. For every coordinate row:family:column, there can be multiple versions of the value.

Sparse data with cell versions

Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. A delete writes a tombstone marker, which prevents the data from being returned in queries. Get requests return specific version(s) based on parameters; if you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep, and this is done per column family. The default given in the original article is to keep up to three versions (since HBase 0.96, the default is one). When the maximum number of versions is exceeded, the oldest records are eventually removed.

Versioned data
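The put, get, and delete semantics described above can be modeled for a single row:family:column coordinate. This is a minimal sketch, assuming a counter in place of real timestamps; the class `VersionedCell` is hypothetical, and it trims old versions immediately, whereas HBase removes them later.

```python
import itertools


class VersionedCell:
    """Toy model of the versions stored at one row:family:column coordinate."""

    _clock = itertools.count(1)  # stand-in for timestamps, shared by all cells

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []        # (timestamp, value) pairs, newest first
        self.tombstone_ts = None  # timestamp of the latest delete, if any

    def put(self, value):
        """Insert or update: every put becomes a new version."""
        ts = next(self._clock)
        self.versions.insert(0, (ts, value))
        del self.versions[self.max_versions:]  # drop versions beyond the limit
        return ts

    def delete(self):
        """Write a tombstone that masks every version written before it."""
        self.tombstone_ts = next(self._clock)

    def get(self, max_versions=1):
        """Return up to max_versions visible (timestamp, value) pairs, newest first."""
        visible = [
            v for v in self.versions
            if self.tombstone_ts is None or v[0] > self.tombstone_ts
        ]
        return visible[:max_versions]
```

In this model a delete does not erase anything right away: the old versions stay in the list and are only hidden from reads, which mirrors the role of the tombstone marker.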