Kudu Paper Walkthrough: Fast Analytics on Fast Data (Part 1)

Each section quotes the original paper text first, followed by my commentary.

Paper: https://kudu.apache.org/kudu.pdf

Part 1: Introduction

1 Introduction

In recent years, explosive growth in the amount of data being generated and captured by enterprises has resulted in the rapid adoption of open source technology which is able to store massive data sets at scale and at low cost. In particular, the Hadoop ecosystem has become a focal point for such “big data” workloads, because many traditional open source database systems have lagged in offering a scalable alternative.

Structured storage in the Hadoop ecosystem has typically been achieved in two ways: for static data sets, data is typically stored on HDFS using binary data formats such as Apache Avro[1] or Apache Parquet[3]. However, neither HDFS nor these formats has any provision for updating individual records, or for efficient random access. Mutable data sets are typically stored in semi-structured stores such as Apache HBase[2] or Apache Cassandra[21]. These systems allow for low-latency record-level reads and writes, but lag far behind the static file formats in terms of sequential read throughput for applications such as SQL-based analytics or machine learning.

The gap between the analytic performance offered by static data sets on HDFS and the low-latency row-level random access capabilities of HBase and Cassandra has required practitioners to develop complex architectures when the need for both access patterns arises in a single application. In particular, many of Cloudera’s customers have developed data pipelines which involve streaming ingest and updates in HBase, followed by periodic jobs to export tables to Parquet for later analysis. Such architectures suffer several downsides:

  1. Application architects must write complex code to manage the flow and synchronization of data between the two systems.
  2. Operators must manage consistent backups, security policies, and monitoring across multiple distinct systems.
  3. The resulting architecture may exhibit significant lag between the arrival of new data into the HBase “staging area” and the time when the new data is available for analytics.
  4. In the real world, systems often need to accommodate late-arriving data, corrections on past records, or privacy-related deletions on data that has already been migrated to the immutable store. Achieving this may involve expensive rewriting and swapping of partitions and manual intervention.

Kudu is a new storage system designed and implemented from the ground up to fill this gap between high-throughput sequential-access storage systems such as HDFS[27] and low-latency random-access systems such as HBase or Cassandra. While these existing systems continue to hold advantages in some situations, Kudu offers a “happy medium” alternative that can dramatically simplify the architecture of many common workloads. In particular, Kudu offers a simple API for row-level inserts, updates, and deletes, while providing table scans at throughputs similar to Parquet, a commonly-used columnar format for static data.

This paper introduces the architecture of Kudu. Section 2 describes the system from a user’s point of view, introducing the data model, APIs, and operator-visible constructs. Section 3 describes the architecture of Kudu, including how it partitions and replicates data across nodes, recovers from faults, and performs common operations. Section 4 explains how Kudu stores its data on disk in order to combine fast random access with efficient analytics. Section 5 discusses integrations between Kudu and other Hadoop ecosystem projects. Section 6 presents preliminary performance results in synthetic workloads.

Commentary

Kudu was born out of the big data world's growing demand for real-time analytics.

There are two keywords here: real-time and analytics.

Why do we need real-time analytics?

As the old martial-arts saying goes, speed is the one thing that cannot be beaten.

The value of data decays over time.

When big data was still young, analysis usually targeted historical data. For example, to produce daily or hourly reports we would analyze yesterday's (or the previous hour's) data in bulk and export the results to a real-time query system such as Elasticsearch or MySQL, where users could view them through a web page.

But as big data has matured, the requirements have gone further:

  • We need the freshest data. For an app crash reporting system, waiting an hour to discover that users are crashing on a large scale is unacceptable.
  • We need to analyze not just the latest data (say, the last 15 minutes) but the history plus the latest data together (in other words, stream processing alone cannot cover this scenario).
  • Analysis needs to be more flexible. Traditional SQL analysis increasingly struggles to serve the growing variety of analytical models, and demand for machine-learning-based approaches keeps rising.
  • Analysis needs to be faster: seconds, not hours or minutes.

What are the challenges?

The core problem is storage. Big data systems were never designed for real-time analytics in the first place.

Look at the three foundational big data papers: GFS, BigTable, and MapReduce.

  • MapReduce is the compute engine
  • GFS (HDFS) is massive file storage, oriented toward OLAP
  • BigTable (HBase) is massive key-value storage, oriented toward OLTP

What if you need OLAP + OLTP?

It is easy to imagine a scheme: HDFS + Parquet/ORC + HBase. Data is written to HBase in real time to cover the basic OLTP needs, and a periodic job exports historical data to HDFS for later OLAP analysis.

This scheme works, but several details need careful design:

  • What happens when data arrives late?
  • OLAP jobs depend on the HBase-to-HDFS export having finished; who guarantees that dependency?
  • If OLAP also needs fresh data, the HBase and HDFS data must be merged; how is that done?

Fortunately, each of these problems has a solution:

  • We can stamp every record with a server-side timestamp and partition by that timestamp. The backend must write data into HBase within a bounded delay; otherwise the corresponding HBase => HDFS job has to be re-run (detecting late writes is a separate problem).
  • An Airflow DAG dependency can guarantee the ordering, but this requires every downstream compute job to depend on the export task.
  • A query engine that supports multiple data sources, such as Spark or Presto, can union the results from both sources (see the sketch below).
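To make the last point concrete, here is a minimal Spark sketch that unions a historical Parquet table with a recent slice of data. The Parquet path, the recent_events view, and the column names are hypothetical; in a real pipeline the recent slice would be loaded through an HBase connector (or, once Kudu enters the picture, the kudu-spark data source).

    import org.apache.spark.sql.SparkSession

    object UnionHistoricalAndRecent {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("union-example").getOrCreate()

        // Historical data periodically exported to Parquet on HDFS (hypothetical path).
        val historical = spark.read.parquet("hdfs:///warehouse/events_parquet")

        // Recent data; registered here as a view for illustration. In practice it
        // would come from an HBase or Kudu connector.
        val recent = spark.table("recent_events")

        // Assuming both sides share a schema, a union merges history and fresh data.
        historical.unionByName(recent).createOrReplaceTempView("events")

        spark.sql(
          "SELECT host, count(*) AS crashes FROM events " +
          "WHERE event_type = 'crash' GROUP BY host ORDER BY crashes DESC"
        ).show()

        spark.stop()
      }
    }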

But the question is: can we make this simpler?

The core problem is now clear. We have HDFS and HBase ruling OLAP and OLTP respectively; why not close that gap and let the compute engine talk to a single storage system? Then we no longer have to deal with unioning data sources, compacting tables, and other tedious, error-prone chores.

And so Kudu was born to solve these problems: Fast Analytics on Fast Data.


Part 2: Kudu at a high level

2.1 Tables and schemas

From the perspective of a user, Kudu is a storage system for tables of structured data. A Kudu cluster may have any number of tables, each of which has a well-defined schema consisting of a finite number of columns. Each such column has a name, type (e.g. INT32 or STRING) and optional nullability. Some ordered subset of those columns are specified to be the table’s primary key. The primary key enforces a uniqueness constraint (at most one row may have a given primary key tuple) and acts as the sole index by which rows may be efficiently updated or deleted. This data model is familiar to users of relational databases, but differs from many other distributed datastores such as Cassandra, MongoDB[6], Riak[8], BigTable[12], etc.

As with a relational database, the user must define the schema of a table at the time of creation. Attempts to insert data into undefined columns result in errors, as do violations of the primary key uniqueness constraint. The user may at any time issue an alter table command to add or drop columns, with the restriction that primary key columns cannot be dropped.

Our decision to explicitly specify types for columns instead of using a NoSQL-style “everything is bytes” is motivated by two factors:

  1. Explicit types allow us to use type-specific columnar encodings such as bit-packing for integers.
  2. Explicit types allow us to expose SQL-like metadata to other systems such as commonly used business intelligence or data exploration tools.

Unlike most relational databases, Kudu does not currently offer secondary indexes or uniqueness constraints other than the primary key. Currently, Kudu requires that every table has a primary key defined, though we anticipate that a future version will add automatic generation of surrogate keys.

Commentary

Explicit table schemas let Kudu apply suitable encodings to each column automatically

HBase has no column types; every value is treated as opaque bytes.

Letting users declare column types brings several benefits:

  • Kudu can automatically pick a suitable encoding per column and improve compression (e.g. bit-packing for int columns)
  • Tables are self-describing; with the HBase model, the mapping between a column's real type and its binary encoding is the user's responsibility
  • BI and reporting tools can discover the table structure, which makes for a better interactive experience

A primary key must be defined; secondary indexes are not supported

The primary key plays a central role in LSM (log-structured merge-tree) style storage: with a key, data blocks can be kept sorted, and sorted blocks can be merged using very little memory. Kudu uses an LSM-like model as well.

The primary key also enables deduplication, which is very helpful for implementing exactly-once semantics.

Finally, the primary key is what makes random access possible in Kudu, and it can also improve scan performance (more on this later).
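As a concrete illustration of an explicitly typed schema with a composite primary key, here is a minimal sketch using the Kudu Java client from Scala. The master address, table name, columns, and partitioning below are made up for illustration, not taken from the paper.

    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import org.apache.kudu.client.{CreateTableOptions, KuduClient}
    import scala.collection.JavaConverters._

    object CreateMetricsTable {
      def main(args: Array[String]): Unit = {
        val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        try {
          // Explicitly typed columns; the columns marked key(true) form the primary key.
          val columns = List(
            new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("metric", Type.STRING).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("ts", Type.INT64).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).nullable(true).build()
          )
          val schema = new Schema(columns.asJava)

          // Hash-partition on the leading key columns; keep 3 replicas per tablet.
          val options = new CreateTableOptions()
            .addHashPartitions(List("host", "metric").asJava, 8)
            .setNumReplicas(3)

          client.createTable("metrics", schema, options)
        } finally {
          client.close()
        }
      }
    }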

2.2 Write operations

After creating a table, the user mutates the table using Insert, Update, and Delete APIs. In all cases, the user must fully specify a primary key; predicate-based deletions or updates must be handled by a higher-level access mechanism (see section 5).

Kudu offers APIs in Java and C++, with experimental support for Python. The APIs allow precise control over batching and asynchronous error handling to amortize the cost of round trips when performing bulk data operations (such as data loads or large updates). Currently, Kudu does not offer any multi-row transactional APIs: each mutation conceptually executes as its own transaction, despite being automatically batched with other mutations for better performance. Modifications within a single row are always executed atomically across columns.

Commentary

  • Writes go through the client libraries (Java/Python/C++)
  • A write must specify the full primary key
  • Bulk operations are supported to amortize the network cost of batch workloads (see the sketch below)
  • Multi-row transactions are not supported
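A minimal write sketch against the hypothetical metrics table above, buffering mutations with manual flush; the table name, columns, and master address are assumptions rather than anything prescribed by the paper.

    import org.apache.kudu.client.{KuduClient, SessionConfiguration}
    import scala.collection.JavaConverters._
    import scala.util.Random

    object WriteExample {
      def main(args: Array[String]): Unit = {
        val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        try {
          val table = client.openTable("metrics")
          val session = client.newSession()
          // Buffer mutations on the client and send them as a batch.
          session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH)

          for (i <- 0 until 100) {
            // Every mutation must carry the full primary key (host, metric, ts).
            val insert = table.newInsert()
            val row = insert.getRow
            row.addString("host", s"host-${i % 10}")
            row.addString("metric", "cpu.user")
            row.addLong("ts", System.currentTimeMillis() * 1000 + i)
            row.addDouble("value", Random.nextDouble())
            session.apply(insert)
          }

          // Flush the batch and check per-row results for errors.
          session.flush().asScala.filter(_.hasRowError).foreach(r => println(r.getRowError))
          session.close()
        } finally {
          client.close()
        }
      }
    }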

2.3 Read operations

Kudu offers only a Scan operation to retrieve data from a table. On a scan, the user may add any number of predicates to filter the results. Currently, we offer only two types of predicates: comparisons between a column and a constant value, and composite primary key ranges. These predicates are interpreted both by the client API and the server to efficiently cull the amount of data transferred from the disk and over the network.

In addition to applying predicates, the user may specify a projection for a scan. A projection consists of a subset of columns to be retrieved. Because Kudu’s on-disk storage is columnar, specifying such a subset can substantially improve performance for typical analytic workloads.

Commentary

  • Data is read through Scan operations
  • Scans support predicate pushdown
  • Two kinds of predicates are supported: 1. comparisons between a column and a constant 2. composite primary key ranges
  • Predicates are pushed down to the Kudu servers, cutting disk and network I/O
  • Projections are supported, i.e. reading only the needed columns, which is the core strength of columnar storage (see the sketch below)
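A minimal scan sketch showing a projection plus a column-versus-constant predicate, again using the hypothetical metrics table and master address introduced above.

    import org.apache.kudu.client.{KuduClient, KuduPredicate}
    import scala.collection.JavaConverters._

    object ScanExample {
      def main(args: Array[String]): Unit = {
        val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        try {
          val table = client.openTable("metrics")
          val schema = table.getSchema

          // Projection: fetch only two columns. Predicate: value > 0.9,
          // evaluated server-side so less data crosses disk and network.
          val scanner = client.newScannerBuilder(table)
            .setProjectedColumnNames(List("host", "value").asJava)
            .addPredicate(KuduPredicate.newComparisonPredicate(
              schema.getColumn("value"), KuduPredicate.ComparisonOp.GREATER, 0.9))
            .build()

          while (scanner.hasMoreRows) {
            val batch = scanner.nextRows()
            while (batch.hasNext) {
              val row = batch.next()
              println(s"${row.getString("host")} -> ${row.getDouble("value")}")
            }
          }
          scanner.close()
        } finally {
          client.close()
        }
      }
    }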

2.4 Other APIs

In addition to data path APIs, the Kudu client library offers other useful functionality. In particular, the Hadoop ecosystem gains much of its performance by scheduling for data locality. Kudu provides APIs for callers to determine the mapping of data ranges to particular servers to aid distributed execution frameworks such as Spark, MapReduce, or Impala in scheduling.

Commentary

Besides the core data-path APIs, Kudu also exposes some auxiliary APIs, for example around data locality. Locality information lets big data frameworks such as Spark schedule computation close to the data to improve performance.

For how the Kudu Spark RDD configures scan locality, see KuduRDD.scala.
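A rough sketch of how a caller might map data ranges to servers using the Java client's scan tokens (this mirrors what the kudu-spark integration does internally; the table name and master address are hypothetical, and the exact token API may differ between client versions).

    import org.apache.kudu.client.KuduClient
    import scala.collection.JavaConverters._

    object LocalityExample {
      def main(args: Array[String]): Unit = {
        val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        try {
          val table = client.openTable("metrics")

          // Each scan token describes one piece of a table scan, bound to a tablet.
          val tokens = client.newScanTokenBuilder(table).build().asScala

          tokens.zipWithIndex.foreach { case (token, i) =>
            // The replica hosts can serve as preferred locations when a framework
            // such as Spark schedules the task that will execute this token.
            val hosts = token.getTablet.getReplicas.asScala.map(_.getRpcHost)
            println(s"scan token $i -> ${hosts.mkString(", ")}")
          }
        } finally {
          client.close()
        }
      }
    }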

2.5 Consistency Model

Kudu provides clients the choice between two consistency modes. The default consistency mode is snapshot consistency. A scan is guaranteed to yield a snapshot with no anomalies in which causality would be violated. As such, it also guarantees read-your-writes consistency from a single client.

By default, Kudu does not provide an external consistency guarantee. That is to say, if a client performs a write, then communicates with a different client via an external mechanism (e.g. a message bus) and the other performs a write, the causal dependence between the two writes is not captured. A third reader may see a snapshot which contains the second write without the first.

Based on our experiences supporting other systems such as HBase that also do not offer external consistency guarantees, this is sufficient for many use cases. However, for users who require a stronger guarantee, Kudu offers the option to manually propagate timestamps between clients: after performing a write, the user may ask the client library for a timestamp token. This token may be propagated to another client through the external channel, and passed to the Kudu API on the other side, thus preserving the causal relationship between writes made across the two clients.

If propagating tokens is too complex, Kudu optionally uses commit-wait as in Spanner[14]. After performing a write with commit-wait enabled, the client may be delayed for a period of time to ensure that any later write will be causally ordered correctly. Absent specialized time-keeping hardware, this can introduce significant latencies in writes (100-1000ms with default NTP configurations), so we anticipate that a minority of users will take advantage of this option. We also note that, since the publication of Spanner, several data stores have started to take advantage of real-time clocks. Given this, it is plausible that within a few years, cloud providers will offer tight global time synchronization as a differentiating service.

The assignment of operation timestamps is based on a clock algorithm termed HybridTime[15]. Please refer to the cited article for details.

Commentary

Kudu provides two consistency levels: snapshot consistency (the default) and external consistency.

Snapshot consistency guarantees that a client can read its own writes (read-your-writes), but it does not guarantee immediately seeing writes made by other clients.

External consistency requires extra user code to propagate timestamp tokens between clients (client propagation), which is considerably more complex.

Borrowing from Spanner, Kudu additionally offers a commit-wait option for external consistency. This is much simpler than client-side token propagation, but it requires the servers to run NTP for clock synchronization.
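A sketch of the manual timestamp-propagation option, assuming the getLastPropagatedTimestamp / updateLastPropagatedTimestamp methods exposed by recent versions of the Java client; the "external channel" is just a local variable here, and the master address is hypothetical.

    import org.apache.kudu.client.KuduClient

    object TimestampPropagation {
      def main(args: Array[String]): Unit = {
        val writer = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        val reader = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        try {
          // ... the writer performs some writes through a session here ...

          // After writing, ask the writer's client library for a timestamp token.
          val token: Long = writer.getLastPropagatedTimestamp

          // Imagine `token` being shipped to the other client over a message bus.
          // Installing it before the second client's operations preserves the
          // causal ordering between the two clients' writes and reads.
          reader.updateLastPropagatedTimestamp(token)
        } finally {
          writer.close()
          reader.close()
        }
      }
    }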

2.6 Timestamps

Although Kudu uses timestamps internally to implement concurrency control, Kudu does not allow the user to manually set the timestamp of a write operation. This differs from systems such as Cassandra and HBase, which treat the timestamp of a cell as a first-class part of the data model. In our experiences supporting users of these other systems, we have found that, while advanced users can make effective use of the timestamp dimension, the vast majority of users find this aspect of the data model confusing and a source of user error, especially with regard to the semantics of back-dated insertions and deletions.

We do, however, allow the user to specify a timestamp for a read operation. This allows the user to perform point-in-time queries in the past, as well as to ensure that different distributed tasks that together make up a single “query” (e.g. as in Spark or Impala) read a consistent snapshot.

Commentary

Record timestamps are no longer exposed to the user (unlike HBase, where the timestamp is central and a cell with a newer timestamp overrides the older value). Kudu's view is that most users cannot reason about these semantics well, and Kudu does not expose a user-visible MVCC timestamp dimension; timestamps are used internally to implement concurrency control and consistency.

Kudu does, however, allow specifying a timestamp when reading, mainly so that the distributed tasks that together make up a single query can all read the same consistent snapshot. See also: Read Operations (Scans).

In the Spark RDD integration, when scan locality is enabled, the READ_AT_SNAPSHOT read mode is used to keep the partitioned scan consistent:

    // A scan is partitioned to multiple ones. If scan locality is enabled,
    // each will take place at the closest replica from the executor. In this
    // case, to ensure the consistency of such scan, we use READ_AT_SNAPSHOT
    // read mode without setting a timestamp.
    builder.replicaSelection(options.scanLocality)
    if (options.scanLocality == ReplicaSelection.CLOSEST_REPLICA) {
      builder.readMode(AsyncKuduScanner.ReadMode.READ_AT_SNAPSHOT)
    }
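Along the same lines, here is a sketch of a point-in-time read using the scanner builder's READ_AT_SNAPSHOT mode together with an explicit snapshot timestamp; it assumes the snapshotTimestampMicros setter available in recent Java client versions and reuses the hypothetical metrics table from earlier.

    import org.apache.kudu.client.{AsyncKuduScanner, KuduClient}
    import scala.collection.JavaConverters._

    object SnapshotRead {
      def main(args: Array[String]): Unit = {
        val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        try {
          val table = client.openTable("metrics")

          // Read the table as it looked five minutes ago (microsecond precision).
          val fiveMinutesAgoMicros = (System.currentTimeMillis() - 5 * 60 * 1000L) * 1000L
          val scanner = client.newScannerBuilder(table)
            .setProjectedColumnNames(List("host", "value").asJava)
            .readMode(AsyncKuduScanner.ReadMode.READ_AT_SNAPSHOT)
            .snapshotTimestampMicros(fiveMinutesAgoMicros)
            .build()

          while (scanner.hasMoreRows) {
            val batch = scanner.nextRows()
            while (batch.hasNext) println(batch.next().getString("host"))
          }
          scanner.close()
        } finally {
          client.close()
        }
      }
    }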

Part 3: Architecture

3.1 Cluster roles

Following the design of BigTable and GFS[18] (and their open-source analogues HBase and HDFS), Kudu relies on a single Master server, responsible for metadata, and an arbitrary number of Tablet Servers, responsible for data. The master server can be replicated for fault tolerance, supporting very fast failover of all responsibilities in the event of an outage. Typically, all roles are deployed on commodity hardware, with no extra requirements for master nodes.

Commentary

As in the famous BigTable and GFS papers, Kudu has a similar architecture consisting of a master and tablet servers. The correspondence between the roles is shown below:

[Figures: role correspondence between GFS/BigTable (HDFS/HBase) and Kudu, and the Kudu architecture diagram]

In production, 3 or 5 masters are typically deployed, while the number of tablet servers ranges from 3 up to several hundred depending on data volume.

The architecture diagram also shows replication-related concepts, which are covered in detail later.

3.2 Partitioning

As in most distributed database systems, tables in Kudu are horizontally partitioned. Kudu, like BigTable, calls these horizontal partitions tablets. Any row may be mapped to exactly one tablet based on the value of its primary key, thus ensuring that random access operations such as inserts or updates affect only a single tablet. For large tables where throughput is important, we recommend on the order of 10-100 tablets per machine. Each tablet can be tens of gigabytes.

Unlike BigTable, which offers only key-range-based partitioning, and unlike Cassandra, which is nearly always deployed with hash-based partitioning, Kudu supports a flexible array of partitioning schemes. When creating a table, the user specifies a partition schema for that table. The partition schema acts as a function which can map from a primary key tuple into a binary partition key. Each tablet covers a contiguous range of these partition keys. Thus, a client, when performing a read or write, can easily determine which tablet should hold the given key and route the request accordingly.

The partition schema is made up of zero or more hash-partitioning rules followed by an optional range-partitioning rule:

  • A hash-partitioning rule consists of a subset of the primary key columns and a number of buckets. For example, as expressed in our SQL dialect, DISTRIBUTE BY HASH(hostname, ts) INTO 16 BUCKETS. These rules convert tuples into binary keys by first concatenating the values of the specified columns, and then computing the hash code of the resulting string modulo the requested number of buckets. This resulting bucket number is encoded as a 32-bit big-endian integer in the resulting partition key.
  • A range-partitioning rule consists of an ordered subset of the primary key columns. This rule maps tuples into binary strings by concatenating the values of the specified columns using an order-preserving encoding.

By employing these partitioning rules, users can easily trade off between query parallelism and query concurrency based on their particular workload. For example, consider a time series application which stores rows of the form (host, metric, time, value) and in which inserts are almost always done with monotonically increasing time values. Choosing to hash-partition by timestamp optimally spreads the insert load across all servers; however, a query for a specific metric on a specific host during a short time range must scan all tablets, limiting concurrency. A user might instead choose to range-partition by timestamp while adding separate hash partitioning rules for the metric name and hostname, which would provide a good trade-off of parallelism on write and concurrency on read.

Though users must understand the concept of partitioning to optimally use Kudu, the details of partition key encoding are fully transparent to the user: encoded partition keys are not exposed in the API. Users always specify rows, partition split points, and key ranges using structured row objects or SQL tuple syntax. Although this flexibility in partitioning is relatively unique in the “NoSQL” space, it should be quite familiar to users and administrators of analytic MPP database management systems.

Commentary

Mixing range partitioning and hash partitioning to avoid hot spots

  • Hash partitioning is an effective way to eliminate data hot spots

Much data is time-bound, so partitioning by time range is a very natural requirement, and big data stores generally support range partitioning.

But range partitioning alone is not enough, because it creates hot spots (a classic HBase problem): newly inserted data always lands in the most recent partition, so a few machines carry much more load than the rest, and overall resource utilization and performance drop.

Combining range partitioning with hash partitioning improves query efficiency while avoiding hot spots.

One partition corresponds to one Kudu tablet.

For example:

HASH (app_id, device_id) PARTITIONS 16,
RANGE (stamp) (
    PARTITION 1587600000 <= VALUES < 1587686400,
    PARTITION 1587686400 <= VALUES < 1587772800,
    PARTITION 1587772800 <= VALUES < 1587859200
)

The partitioning above suits client-side data from a consumer internet product.

We hash-partition on app ID + device ID (16 buckets) and range-partition by the stamp column per day (3 range partitions).

This yields 16 * 3 = 48 tablets in total.
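For reference, here is a rough sketch of expressing the same layout through the Java client's CreateTableOptions; the schema is a hypothetical reduction of the example, and the epoch-second bounds mirror the DDL above.

    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import org.apache.kudu.client.{CreateTableOptions, KuduClient}
    import scala.collection.JavaConverters._

    object CreatePartitionedTable {
      def main(args: Array[String]): Unit = {
        val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
        try {
          // All partitioning columns must be part of the primary key.
          val columns = List(
            new ColumnSchema.ColumnSchemaBuilder("app_id", Type.STRING).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("device_id", Type.STRING).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("stamp", Type.INT64).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING).nullable(true).build()
          )
          val schema = new Schema(columns.asJava)

          val options = new CreateTableOptions()
            .addHashPartitions(List("app_id", "device_id").asJava, 16) // 16 hash buckets
            .setRangePartitionColumns(List("stamp").asJava)            // range on stamp

          // Three daily range partitions: 16 buckets x 3 ranges = 48 tablets.
          val bounds = Seq(1587600000L, 1587686400L, 1587772800L, 1587859200L)
          for ((lo, hi) <- bounds.zip(bounds.tail)) {
            val lower = schema.newPartialRow(); lower.addLong("stamp", lo)
            val upper = schema.newPartialRow(); upper.addLong("stamp", hi)
            options.addRangePartition(lower, upper) // lower inclusive, upper exclusive
          }

          client.createTable("app_events", schema, options)
        } finally {
          client.close()
        }
      }
    }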

3.3 Replication

In order to provide high availability and durability while running on large commodity clusters, Kudu replicates all of its table data across multiple machines. When creating a table, the user specifies a replication factor, typically 3 or 5, depending on the application’s availability SLAs. Kudu’s master strives to ensure that the requested number of replicas are maintained at all times (see Section 3.4.2).

Kudu employs the Raft[25] consensus algorithm to replicate its tablets. In particular, Kudu uses Raft to agree upon a logical log of operations (e.g. insert/update/delete) for each tablet. When a client wishes to perform a write, it first locates the leader replica (see Section 3.4.3) and sends a Write RPC to this replica. If the client’s information was stale and the replica is no longer the leader, it rejects the request, causing the client to invalidate and refresh its metadata cache and resend the request to the new leader. If the replica is in fact still acting as the leader, it employs a local lock manager to serialize the operation against other concurrent operations, picks an MVCC timestamp, and proposes the operation via Raft to its followers. If a majority of replicas accept the write and log it to their own local write-ahead logs, the write is considered durably replicated and thus can be committed on all replicas. Note that there is no restriction that the leader must write an operation to its local log before it may be committed: this provides good latency-smoothing properties even if the leader’s disk is performing poorly.

In the case of a failure of a minority of replicas, the leader can continue to propose and commit operations to the tablet’s replicated log. If the leader itself fails, the Raft algorithm quickly elects a new leader. By default, Kudu uses a 500-millisecond heartbeat interval and a 1500-millisecond election timeout; thus, after a leader fails, a new leader is typically elected within a few seconds.

Kudu implements some minor improvements on the Raft algorithm. In particular:

  1. As proposed in [19] we employ an exponential back-off algorithm after a failed leader election. We found that, as we typically commit Raft’s persistent metadata to contended hard disk drives, such an extension was necessary to ensure election convergence on busy clusters.
  2. When a new leader contacts a follower whose log diverges from its own, Raft proposes marching backward one operation at a time until discovering the point where they diverged. Kudu instead immediately jumps back to the last known committedIndex, which is always guaranteed to be present on any divergent follower. This minimizes the potential number of round trips at the cost of potentially sending redundant operations over the network. We found this simple to implement, and it ensures that divergent operations are aborted after a single round-trip.

Kudu does not replicate the on-disk storage of a tablet, but rather just its operation log. The physical storage of each replica of a tablet is fully decoupled. This yields several advantages:

  • When one replica is undergoing physical-layer background operations such as flushes or compactions (see Section 4), it is unlikely that other nodes are operating on the same tablet at the same time. Because Raft may commit after an acknowledgment by a majority of replicas, this reduces the impact of such physical-layer operations on the tail latencies experienced by clients for writes. In the future, we anticipate implementing techniques such as the speculative read requests described in [16] to further decrease tail latencies for reads in concurrent read/write workloads.
  • During development, we discovered some rare race conditions in the physical storage layer of the Kudu tablet. Because the storage layer is decoupled across replicas, none of these race conditions resulted in unrecoverable data loss: in all cases, we were able to detect that one replica had become corrupt (or silently diverged from the majority) and repair it.

Commentary

Replication via Raft for high availability and durability

At the tablet level, Kudu uses the Raft protocol for leader election and replication. Each tablet is served by several replicas: exactly one is the leader and the rest are followers. Mutating operations (insert/update/delete) are handled only by the leader and then replicated to the followers. A write only needs to be logged by a majority of replicas; the leader's own log write is not even strictly required before commit.

By default the Raft heartbeat interval is 500 ms and the election timeout is 1500 ms, so after a leader failure a new leader is typically elected within a few seconds.

Kudu replicates only the operation log, not the tablet's on-disk physical storage, which has the following advantages:

  • Background work (e.g. tablet compactions) happens at different times on different replicas, and since a write only needs acknowledgments from a majority of replicas, this smooths write latency
  • Each replica's storage is fully independent, so an error cannot propagate to other replicas through replication, and the chances of recovery are higher when something does go wrong

3.3.1 Configuration Change

Kudu implements Raft configuration change following the one-by-one algorithm proposed in [24]. In this approach, the number of voters in the Raft configuration may change by at most one in each configuration change. In order to grow a 3-replica configuration to 5 replicas, two separate configuration changes (3→4, 4→5) must be proposed and committed.

Kudu implements the addition of new servers through a process called remote bootstrap. In our design, in order to add a new replica, we first add it as a new member in the Raft configuration, even before notifying the destination server that a new replica will be copied to it. When this configuration change has been committed, the current Raft leader replica triggers a StartRemoteBootstrap RPC, which causes the destination server to pull a snapshot of the tablet data and log from the current leader. When the transfer is complete, the new server opens the tablet following the same process as after a server restart. When the tablet has opened the tablet data and replayed any necessary write-ahead logs, it has fully replicated the state of the leader at the time it began the transfer, and may begin responding to Raft RPCs as a fully-functional replica.

In our current implementation, new servers are added immediately as VOTER replicas. This has the disadvantage that, after moving from a 3-server configuration to a 4-server configuration, three out of the four servers must acknowledge each operation. Because the new server is in the process of copying, it is unable to acknowledge operations. If another server were to crash during the snapshot-transfer process, the tablet would become unavailable for writes until the remote bootstrap finished.

To address this issue, we plan to implement a PRE_VOTER replica state. In this state, the leader will send Raft updates and trigger remote bootstrap on the target replica, but not count it as a voter when calculating the size of the configuration’s majority. Upon detecting that the PRE_VOTER replica has fully caught up to the current logs, the leader will automatically propose and commit another configuration change to transition the new replica to a full VOTER.

When removing replicas from a tablet, we follow a similar approach: the current Raft leader proposes an operation to change the configuration to one that does not include the node to be evicted. If this is committed, then the remaining nodes will no longer send messages to the evicted node, though the evicted node will not know that it has been removed. When the configuration change is committed, the remaining nodes report the configuration change to the Master, which is responsible for cleaning up the orphaned replica (see Section 3.4.2).

Commentary

This part focuses on how Raft configuration changes are carried out.

For example, growing from 3 to 5 replicas must be decomposed into two steps: 3→4, then 4→5.

Adding a new replica

  • Add the new member to the Raft configuration
  • The leader triggers a StartRemoteBootstrap RPC, telling the target server to pull a snapshot of the tablet data from the leader
  • The new replica joins as a follower
  • Once the data transfer finishes and the write-ahead log has been replayed, it becomes a fully functional replica and starts answering Raft RPCs

Removing a replica

  • The leader proposes a configuration change that evicts the replica
  • Once the change is committed, the remaining replicas stop sending messages to the evicted node
  • The remaining nodes report the committed configuration change to the Master
  • The Master is responsible for cleaning up the orphaned replica

3.4 The Kudu Master

Kudu’s central master process has several key responsibilities:

  1. Act as a catalog manager, keeping track of which tables and tablets exist, as well as their schemas, desired replication levels, and other metadata. When tables are created, altered, or deleted, the Master coordinates these actions across the tablets and ensures their eventual completion.
  2. Act as a cluster coordinator, keeping track of which servers in the cluster are alive and coordinating redistribution of data after server failures.
  3. Act as a tablet directory, keeping track of which tablet servers are hosting replicas of each tablet.

We chose a centralized, replicated master design over a fully peer-to-peer design for simplicity of implementation, debugging, and operations.

Commentary

The Kudu Master takes on only lightweight responsibilities:

  • Catalog manager: manages tables, including their schemas and desired replication levels. When tables are created, altered, or deleted, the Master coordinates these actions across all of the affected tablets
  • Cluster coordinator: tracks whether tablet servers are alive, and when a server dies, coordinates re-replication to maintain the desired replica count
  • Tablet directory: tracks which tablet servers host replicas of each tablet, so clients know where to go for their data

Each responsibility is described in more detail below.

3.4.1 Catalog Manager

The master itself hosts a single-tablet table which is restricted from direct access by users. The master internally writes catalog information to this tablet, while keeping a full write-through cache of the catalog in memory at all times. Given the large amounts of memory available on current commodity hardware, and the small amount of metadata stored per tablet, we do not anticipate this becoming a scalability issue in the near term. If scalability becomes an issue, moving to a paged cache implementation would be a straightforward evolution of the architecture.

The catalog table maintains a small amount of state for each table in the system. In particular, it keeps the current version of the table schema, the state of the table (creating, running, deleting, etc), and the set of tablets which comprise the table. The master services a request to create a table by first writing a table record to the catalog table indicating a CREATING state. Asynchronously, it selects tablet servers to host tablet replicas, creates the Master-side tablet metadata, and sends asynchronous requests to create the replicas on the tablet servers. If the replica creation fails or times out on a majority of replicas, the tablet can be safely deleted and a new tablet created with a new set of replicas. If the Master fails in the middle of this operation, the table record indicates that a roll-forward is necessary and the master can resume where it left off. A similar approach is used for other operations such as schema changes and deletion, where the Master ensures that the change is propagated to the relevant tablet servers before writing the new state to its own storage. In all cases, the messages from the Master to the tablet servers are designed to be idempotent, such that on a crash and restart, they can be safely resent.

Because the catalog table is itself persisted in a Kudu tablet, the Master supports using Raft to replicate its persistent state to backup master processes. Currently, the backup masters act only as Raft followers and do not serve client requests. Upon becoming elected leader by the Raft algorithm, a backup master scans its catalog table, loads its in-memory cache, and begins acting as an active master following the same process as a master restart.

Commentary

Internally, the Kudu Master hosts a single-tablet catalog table that records table metadata (it cannot be accessed by users).

The catalog table is always cached in memory and keeps only the most basic state, such as each table's status (creating, running, deleting, ...) and the set of tablets that make up the table.

Like any other table, the catalog table is replicated via Raft, but unlike ordinary tablets, the follower masters do not serve client requests, so the Master leader can become a bottleneck in the system.

Kudu mitigates this with client-side caching, so that clients rarely need to interact with the Master.

3.4.2 Cluster Coordination

Each of the tablet servers in a Kudu cluster is statically configured with a list of host names for the Kudu masters. Upon startup, the tablet servers register with the Masters and proceed to send tablet reports indicating the total set of tablets which they are hosting. The first such tablet report contains information about all tablets. All future tablet reports are incremental, only containing reports for tablets that have been newly created, deleted, or modified (e.g. processed a schema change or Raft configuration change).

A critical design point of Kudu is that, while the Master is the source of truth about catalog information, it is only an observer of the dynamic cluster state. The tablet servers themselves are always authoritative about the location of tablet replicas, the current Raft configuration, the current schema version of a tablet, etc. Because tablet replicas agree on all state changes via Raft, every such change can be mapped to a specific Raft operation index in which it was committed. This allows the Master to ensure that all tablet state updates are idempotent and resilient to transmission delays: the Master simply compares the Raft operation index of a tablet state update and discards it if the index is not newer than the Master’s current view of the world.

This design choice leaves much responsibility in the hands of the tablet servers themselves. For example, rather than detecting tablet server crashes from the Master, Kudu instead delegates that responsibility to the Raft LEADER replicas of any tablets with replicas on the crashed machine. The leader keeps track of the last time it successfully communicated with each follower, and if it has failed to communicate for a significant period of time, it declares the follower dead and proposes a Raft configuration change to evict the follower from the Raft configuration. When this configuration change is successfully committed, the remaining tablet servers will issue a tablet report to the Master to advise it of the decision made by the leader.

In order to regain the desired replication count for the tablet, the Master selects a tablet server to host a new replica based on its global view of the cluster. After selecting a server, the Master suggests a configuration change to the current leader replica for the tablet. However, the Master itself is powerless to change a tablet configuration – it must wait for the leader replica to propose and commit the configuration change operation, at which point the Master is notified of the configuration change’s success via a tablet report. If the Master’s suggestion failed (e.g. because the message was lost) it will stubbornly retry periodically until successful. Because these operations are tagged with the unique index of the degraded configuration, they are fully idempotent and conflict-free, even if the Master issues several conflicting suggestions, as might happen soon after a master fail-over.

The master responds similarly to extra replicas of tablets. If the Master receives a tablet report which indicates that a replica has been removed from a tablet configuration, it stubbornly sends DeleteTablet RPCs to the removed node until the RPC succeeds. To ensure eventual cleanup even in the case of a master crash, the Master also sends such RPCs in response to a tablet report which identifies that a tablet server is hosting a replica which is not in the newest committed Raft configuration.

Commentary

As in HDFS, tablet servers report the set of tablets they host to all of the masters at startup (which requires the masters to be reachable when a tablet server starts).

Unlike most distributed systems, the Kudu Master behaves more like an observer. It does not actively probe replica health; a broken replica is reported to the Master through the following path:

  • The replicas of a tablet replicate state changes via the Raft protocol
  • When a replica goes down (no successful communication for a while), the tablet leader proposes a configuration change to evict it
  • Once the proposal is committed, the tablet servers report the eviction to the Master for further handling
  • Using its global view of the cluster, the Master selects a suitable server for a new replica and suggests the configuration change to the tablet leader
  • The tablet leader proposes adding the new replica and, once that is committed, reports back to the Master

When a tablet has more replicas than expected (i.e. a replica has been evicted from the configuration), the Master sends DeleteTablet RPCs to the corresponding node until the RPC succeeds.

Clearly the Kudu Master carries a much lighter load than the HDFS NameNode. Because the NameNode directly manages block replication, it becomes the system bottleneck once a cluster holds very many files (in practice HDFS can hardly exceed roughly 100 million files; beyond tens of millions, a restart already takes minutes, and if the cache was not saved at shutdown it can take hours).

3.4.3 Tablet Directory

In order to efficiently perform read and write operations without intermediate network hops, clients query the Master for tablet location information. Clients are “thick” and maintain a local metadata cache which includes their most recent information about each tablet they have previously accessed, including the tablet’s partition key range and its Raft configuration. At any point in time, the client’s cache may be stale; if the client attempts to send a write to a server which is no longer the leader for a tablet, the server will reject the request. The client then contacts the Master to learn about the new leader. In the case that the client receives a network error communicating with its presumed leader, it follows the same strategy, assuming that the tablet has likely elected a new leader.

In the future, we plan to piggy-back the current Raft configuration on the error response if a client contacts a non-leader replica. This will prevent extra round-trips to the master after leader elections, since typically the followers will have up-to-date information.

Because the Master maintains all tablet partition range information in memory, it scales to a high number of requests per second, and responds with very low latency. In a 270-node cluster running a benchmark workload with thousands of tablets, we measured the 99.99th percentile latency of tablet location lookup RPCs at 3.2ms, with the 95th percentile at 374 microseconds and 75th percentile at 91 microseconds. Thus, we do not anticipate that the tablet directory lookups will become a scalability bottleneck at current target cluster sizes. If they do become a bottleneck, we note that it is always safe to serve stale location information, and thus this portion of the Master can be trivially partitioned and replicated across any number of machines.

Commentary

As noted earlier, although Kudu runs multiple masters, only the leader serves client requests (presumably because serving from followers would complicate consistency guarantees), so the leader's work has to be very lightweight.

Correspondingly, the Kudu client is "thick": it caches information about recently accessed tablets, such as partition key ranges and Raft configurations.

If the cached information is stale (for example the client sends a write to a server that is no longer the tablet leader), the server simply rejects the request and the client refreshes the latest metadata from the Master.

Because the Master keeps all of its data cached in memory, even a single node can sustain a very high QPS. Moreover, as discussed above, serving slightly stale location information is always safe, since clients will retry and refresh; this part of the Master therefore does not require strong consistency, and in principle the other master replicas could serve such requests too (at some cost in service quality, which is normally unnecessary).

--- This part ends here; for the next part, please continue to ---

昼星: Kudu Paper Walkthrough: Fast Analytics on Fast Data (Part 2) (zhuanlan.zhihu.com)

Thanks for reading, and feel free to follow!
