Firebolt whitepaper - 3. Scalability

codealy

Last modified: 2022-12-06 10:24:56

Category: firebolt. Tags: data warehouse, database, data mining

First published: 2022-08-17 19:05:59

Source: the technical whitepaper in the official documentation
https://www.firebolt.io/resources/firebolt-cloud-data-warehouse-whitepaper

Scalability

There have been two major advancements in data warehouse scalability over the last two decades. The first was the shared-nothing architecture, which started with partitioning data and queries across nodes. It delivered linear, horizontal “scale-out” scalability for the first time. But you still had to have all data on each cluster.
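
As a rough illustration of partitioning data and queries across nodes, the sketch below hash-partitions rows by key and then runs the same aggregation on every shard before merging the partial results. The function names (`node_for`, `partition_rows`, `scatter_gather_sum`) and the choice of CRC32 as the hash are assumptions made for this example; they are not taken from the whitepaper or from any specific product.

```python
import zlib
from collections import defaultdict


def node_for(key: str, num_nodes: int) -> int:
    """Map a partition key to a node index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_rows(rows, num_nodes):
    """Split (customer, amount) rows into one shard per node."""
    shards = [[] for _ in range(num_nodes)]
    for customer, amount in rows:
        shards[node_for(customer, num_nodes)].append((customer, amount))
    return shards


def local_sum(shard):
    """Aggregation each node runs on its own shard, sharing nothing."""
    totals = defaultdict(int)
    for customer, amount in shard:
        totals[customer] += amount
    return dict(totals)


def scatter_gather_sum(rows, num_nodes):
    """Run local_sum per shard and merge the partial results."""
    merged = {}
    for shard in partition_rows(rows, num_nodes):
        # Keys are disjoint across shards because each key hashes to one node.
        merged.update(local_sum(shard))
    return merged
```

Because every key lands on exactly one node, each node can aggregate its shard independently, which is what allows adding nodes to add capacity.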

The second was a decoupled storage and compute architecture, which added the ability to have different compute clusters run different queries and retrieve data subsets “on demand” from remote storage. It improved scalability by allowing different workloads to run on different clusters. It also made scaling more elastic, since you could easily provision new compute clusters and resize existing ones.
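
The idea can be sketched with a toy model: one shared store, and several independent compute clusters that pull only the partitions they need and keep them in a local cache. `RemoteStorage` and `ComputeCluster` are hypothetical classes written for this illustration, not a real API.

```python
class RemoteStorage:
    """Stand-in for remote object storage shared by all clusters."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0  # counts fetches that cross the network

    def get(self, key):
        self.reads += 1
        return self._objects[key]


class ComputeCluster:
    """Stand-in for a compute cluster that owns no data of its own."""

    def __init__(self, name, storage, nodes=1):
        self.name = name
        self.storage = storage
        self.nodes = nodes
        self.cache = {}

    def resize(self, nodes):
        # Resizing changes compute only; the data stays in remote storage.
        self.nodes = nodes

    def scan(self, keys):
        """Fetch the requested partitions on demand, then read from cache."""
        values = []
        for key in keys:
            if key not in self.cache:
                self.cache[key] = self.storage.get(key)
            values.extend(self.cache[key])
        return values
```

In this model two clusters can serve different queries against the same stored data, and neither has to hold the full data set.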

Modern cloud data warehouses that combine decoupled storage and compute with a shared-nothing architecture can support petabyte-scale data by storing data of any size in remote storage, separate from any cluster. They have been able to support large, complex queries by scaling up node sizes, adding more nodes, and running different queries on different clusters. They have also been able to support high user concurrency by replicating clusters and partitioning users across the clusters.

But other scalability bottlenecks remain even in the most modern cloud data warehouses. These include:

Data ingestion: most data warehouses are still limited to batch-centric ingestion. They do not support low latency, streaming ingestion at scale. This is in large part because most data warehouse storage is columnar and requires rewriting entire partitions for a single row update.

Data access: most decoupled storage and compute architectures move entire partitions from storage into the compute cluster. This is a major problem because it makes the network the biggest bottleneck.

Query scalability: queries with large joins or complex nesting can require a lot of SSD to store partitions, a lot of RAM to hold data while processing, and a lot of compute for scanning and complex processing. In many cases the only solution is to use very large node types.

Semi-structured data: Most cloud data warehouses either store semi-structured data as flattened strings, or don’t support formats like JSON at all and require you to “flatten” or “unnest” it into columns in tables. But processing strings ends up taking a lot of compute and RAM to hold all the JSON for scanning, which limits scalability.
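
To make the semi-structured data point concrete, the sketch below contrasts the two approaches described above: querying events kept as JSON strings, which means parsing every row on every query, and “unnesting” the same events into flat rows once at load time. The event shape (`user`, `tags`) and all function names are assumptions invented for this example.

```python
import json


def count_tag_in_strings(json_rows, tag):
    """Count events carrying `tag` when each event is stored as a JSON string."""
    hits = 0
    for text in json_rows:
        event = json.loads(text)  # full parse of every row, on every query
        if tag in event.get("tags", []):
            hits += 1
    return hits


def unnest(json_rows):
    """Flatten events into (event_id, user, tag) rows, one per array element."""
    flat = []
    for event_id, text in enumerate(json_rows):
        event = json.loads(text)  # parsed once, at load time
        for tag in event.get("tags", []):
            flat.append((event_id, event["user"], tag))
    return flat


def count_tag_in_columns(flat_rows, tag):
    """Same count over the flattened rows; no JSON parsing at query time."""
    return len({event_id for event_id, _user, t in flat_rows if t == tag})
```

Both paths return the same answer; they differ in where the parsing cost is paid. Note that in this sketch an event with an empty array produces no flattened rows at all, which is one of the trade-offs of unnesting.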

Data ingestion

The first change was to make Firebolt act a little like a federated query engine. Any Firebolt engine can ingest data from external files by exposing them as a relational table. Anyone can then use SQL to select the relevant data from the external files or other data sources, perform any operations, and insert the results into target tables stored in F3. This lets data engineers do their own ELT and build dashboards without waiting for others to complete their work first.
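
The pattern of exposing a file as a table and then loading with `INSERT ... SELECT` can be sketched with Python's built-in `sqlite3` and `csv` modules. This is only a stand-in for the workflow: it does not use Firebolt's own syntax or storage format, and the file contents, table names (`ext_events`, `revenue_by_country`), and helper functions are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical contents of an external file.
EXTERNAL_FILE = """event_id,country,amount
1,US,10.5
2,DE,7.0
3,US,4.5
"""


def expose_as_table(conn, name, csv_text):
    """Make the file's rows queryable as a relational table of text columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{col}" TEXT' for col in header)
    conn.execute(f'CREATE TABLE "{name}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', reader)


def run_elt(csv_text):
    """Select from the exposed file, transform in SQL, insert into a target."""
    conn = sqlite3.connect(":memory:")
    expose_as_table(conn, "ext_events", csv_text)
    conn.execute(
        "CREATE TABLE revenue_by_country (country TEXT PRIMARY KEY, revenue REAL)"
    )
    conn.execute(
        """
        INSERT INTO revenue_by_country
        SELECT country, SUM(CAST(amount AS REAL))
        FROM ext_events
        GROUP BY country
        """
    )
    rows = conn.execute(
        "SELECT country, revenue FROM revenue_by_country ORDER BY country"
    ).fetchall()
    conn.close()
    return rows
```

The transformation (casting and aggregating) happens in SQL after the raw data is reachable as a table, which is the "T after L" ordering that ELT refers to.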
