Real-time data warehouse computing: Why you are throwing money away if your cloud data warehouse doesn’t separate storage and compute


weixin_26631359


Original article: https://towardsdatascience.com/why-you-are-throwing-money-away-if-your-cloud-data-warehouse-doesnt-separate-storage-and-compute-65d2dffd450f


Not so long ago, establishing an enterprise data warehouse was a project that took months or even years. These days, with cloud computing, you can sign up for a SaaS or PaaS offering from one of the cloud vendors and start building your schemas and tables shortly afterwards. In this article, I will discuss the key features to consider when migrating a data warehouse to the cloud and why it is a smart choice to pick one that separates storage from compute.

What does it mean to separate storage and compute?

From a single server to a data warehouse cluster

It all boils down to the difference between scaling out and in versus scaling up and down. In older database and data warehouse solutions, storage and compute reside within a single (often large and powerful) server instance. This works well until that single instance reaches its maximum compute or storage capacity. To accommodate the increased workload, you can then scale up, i.e. exchange the CPU, RAM, or storage disks for ones with a larger capacity; with cloud services, this means switching to a larger instance. Analogously, if your single instance is too large, you can save money by exchanging it for a smaller one, i.e. scale down. This process has two main disadvantages:

scaling up and down is time-consuming and often means that your data warehouse becomes unavailable for some time
there is a limit to how much you can scale up, due to the natural limitations of a single server instance.

MPP: Massively Parallel Processing

To mitigate this problem, data warehouse vendors started using the MPP (Massively Parallel Processing) paradigm, which allows your data warehouse to use an entire cluster of instances at once. This way, if you start reaching the maximum capacity limits, you can simply add another server instance with more storage and compute capacity to the cluster (i.e. scale out).

MPP can solve the initial scalability problem to a large extent. However, it also means that your storage and compute capacities are tightly coupled across the nodes in your cluster. If you want to shut down some compute capacity at night (i.e. scale in) because almost nobody queries data during that time, you can’t: terminating an instance would mean either losing its data or having to create a backup and restore from it in the morning. If your architecture doesn’t allow you to easily scale in idle compute resources, you are simply throwing your money away.
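
To put a rough number on idle compute, here is a minimal back-of-the-envelope sketch. The node counts, the working hours, and the price per node-hour are made-up assumptions for illustration, not figures from any vendor.

```python
def monthly_compute_cost(nodes_by_hour, price_per_node_hour, days=30):
    """Cost of running nodes_by_hour[h] nodes during hour h of every day."""
    if len(nodes_by_hour) != 24:
        raise ValueError("expected one node count per hour of the day")
    return sum(nodes_by_hour) * price_per_node_hour * days


PRICE = 3.0  # hypothetical USD per node-hour

# Coupled storage and compute: all 8 nodes must stay up around the clock.
always_on = [8] * 24

# Separated storage and compute: 8 nodes from 07:00 to 19:00, 2 nodes otherwise.
elastic = [8 if 7 <= hour < 19 else 2 for hour in range(24)]

always_on_cost = monthly_compute_cost(always_on, PRICE)
elastic_cost = monthly_compute_cost(elastic, PRICE)
savings_ratio = 1 - elastic_cost / always_on_cost

print(f"always on: {always_on_cost:.0f} USD, elastic: {elastic_cost:.0f} USD")
print(f"saved: {savings_ratio:.1%}")
```

Under these assumed numbers, the elastic schedule costs 10,800 USD per month instead of 17,280 USD, i.e. 37.5% less. The real saving depends entirely on your own workload profile and pricing.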

The paradigm discussed here is often called Shared-Nothing architecture. This is how Wikipedia [1] defines it:


A shared-nothing architecture (SN) is a distributed-computing architecture in which each update request is satisfied by a single node (processor/ memory/ storage unit). The intent is to eliminate contention among nodes.


How can we make MPP with SN architecture better?

We can clearly see that the bottleneck is that storage and compute are tightly coupled and can’t be scaled independently of each other. Ideally, we would like an architecture in which we can scale the compute capacity as needed, depending on our query workloads, while the storage is shared across all compute nodes. The storage should have practically unlimited capacity to make the architecture future-proof, and it should scale automatically as we store more and more data over time.

So instead of SN-MPP, we would like to achieve SD-MPP: a Shared Data Massively Parallel Processing cluster.

This is precisely what many cloud vendors did. Their implementations differ, but the goal is the same: an elastic compute layer on top of an independently and endlessly scalable shared storage layer.

Why does the separation of storage and compute work well for analytical queries?

Hadoop was originally designed to analyze data (i.e. to run queries that retrieve and process data) as close as possible to where it is stored. This means that if your Sales data is stored on node A, your query to retrieve Sales data will likely also be executed on node A for performance gains. Overall, the fastest ways to get data for processing are (in this order):

from RAM,
then from SSDs,
and then from HDDs and object storage.

So separating storage from compute may seem counterintuitive, since we move the data (storage) further away from where it is processed.

However, with high-performance columnar databases, cloud vendors apply many optimization techniques to both storage and compute (e.g. AWS Redshift applies AQUA [4]) so that separating storage from compute should not have a noticeable negative impact on performance. Those optimization techniques involve a combination of compression, encoding, caching, and internally moving data between object storage and SSDs.

When you run a query within a columnar in-memory database, under the hood you load only a small portion of the data from a shared storage layer (say, from object storage) into memory, while dictionary encoding and compression are applied to this data to reduce its size. This data can then be processed in the same way as if it were loaded from a local disk with block storage.
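
As an illustration of the dictionary encoding mentioned above, here is a minimal sketch in plain Python. It is a toy model of the idea, not how any particular warehouse implements it; the column name and values are invented for the example.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): distinct values in first-seen order,
    plus one small integer code per row pointing into that dictionary."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical "country" column of a sales table.
country = ["DE", "US", "DE", "DE", "FR", "US"]
dictionary, codes = dictionary_encode(country)

# A filter such as country = 'DE' can be evaluated on the integer codes
# without decoding the column first.
de_code = dictionary.index("DE")
rows_in_germany = sum(1 for code in codes if code == de_code)

print(dictionary, codes, rows_in_germany)
```

Repeated strings are stored once and each row keeps only a small integer, which is part of why a columnar engine has to move comparatively little data from shared storage into memory.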

Distributed compute engines such as Spark also support loading data directly from object storage [2], which is yet another example of separating compute and storage for analytical data processing.

What are the benefits of separating storage and compute in a data warehouse?

no idle compute resources: storage and compute can be scaled up and down independently of each other
if used with object storage (e.g. AWS S3) or with a network file system (e.g. AWS EFS), we get practically unlimited, highly available, and fault-tolerant storage at a low cost
no management of storage nodes (which you would usually have to maintain with, e.g., Hadoop cluster nodes or Amazon Redshift dense storage nodes): with SD-MPP you (usually) only need to monitor and scale your compute nodes
massive cost reduction: being able to shut down some of the compute nodes at night, after seasonal peaks, or simply when they are not needed can save a lot of money
making your architecture future-proof with respect to data growth: with the data growth we experience these days, it’s inevitable that our data volumes will increase over time. With SN-MPP it’s still possible to accommodate this growth, but at a price that many companies just can’t afford.
flexibility: being able to account for seasonality in your architecture, e.g. more compute during specific times of the year such as Black Friday, Christmas, or the release of a new product
higher performance: you can get more done within your time. For example, if you have some computationally expensive jobs, you can spin up an additional compute node with much more RAM and CPU capacity to finish them faster, and afterwards terminate that node without having to re-architect your entire data warehouse
fault tolerance: if, for some reason, all of your compute nodes go down, you won’t lose your data. You can simply launch a new compute instance and regain access to your schemas and tables
no need to redistribute, repartition, or reindex the data in your cluster when scaling out: with a tightly coupled SN-MPP architecture, repartitioning or reindexing is needed to avoid overburdening particular nodes, i.e. to prevent one node from taking all the storage or all the compute work while the other nodes remain idle. In short, it is needed to distribute storage and compute evenly across the nodes.
separation of compute for different teams while still keeping your data (shared storage) in one central place accessible to everyone. For instance, you could allocate separate “virtual” compute capacity to data scientists so that their computationally expensive ML queries don’t affect other users. Not all cloud vendors support this feature (I have only heard about it from Snowflake [10]).
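
The point about redistribution when scaling out can be illustrated with a small simulation. This is a simplified sketch that assumes rows are assigned to nodes by hashing a key modulo the node count; real SN-MPP systems use their own distribution schemes, and the key names and node counts below are invented for the example.

```python
import hashlib


def owner(key: str, num_nodes: int) -> int:
    """Node that stores a row in a toy shared-nothing cluster:
    a stable hash of the row key, modulo the number of nodes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def moved_fraction(keys, old_nodes: int, new_nodes: int) -> float:
    """Share of rows whose owning node changes when the cluster is resized."""
    moved = sum(1 for key in keys if owner(key, old_nodes) != owner(key, new_nodes))
    return moved / len(keys)


keys = [f"order-{i}" for i in range(10_000)]

# Scale out from 4 to 5 nodes: most rows now belong to a different node
# and would have to be physically copied across the cluster.
fraction = moved_fraction(keys, old_nodes=4, new_nodes=5)
print(f"rows to redistribute: {fraction:.0%}")
```

With this naive scheme, about 80% of the rows change owner when going from 4 to 5 nodes, because a row stays put only if its hash gives the same remainder for both node counts. With shared storage, compute nodes do not own rows, so adding a node does not require moving stored data at all.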

How cloud vendors approached the separation of storage and compute

In the following, I list only cloud data warehouse services that use the SD-MPP architecture, which separates storage from compute. Since these cloud offerings differ from one another, I provide a short description of how each incorporated the shared data paradigm into its service.

Snowflake

Snowflake pioneered and marketed (they seem to have a solid marketing budget!) the concept of multi-cluster shared data architecture (SD-MPP). They further divide it into [3]:


Database Storage layer: this is where data is persisted and optimized (turned into columnar form and compressed) after we load it into Snowflake. This storage is abstracted away from users; the data is only accessible by running queries.
Query Processing layer: determines how data is processed inside virtual warehouses. This is the compute layer that we can actively manage, e.g. we can create several warehouses for specific teams.
Cloud Services layer: includes metadata and infrastructure management, authentication and access control, as well as query optimization.
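
To make the idea of one virtual warehouse per team more tangible, here is a small sketch that only builds Snowflake `CREATE WAREHOUSE` statements as strings; it does not connect to anything. The team names, sizes, and the 60-second suspend timeout are hypothetical choices for illustration, and you should check the Snowflake documentation for the options that apply to your account.

```python
def create_warehouse_sql(name: str, size: str = "XSMALL", auto_suspend_seconds: int = 60) -> str:
    """Build a CREATE WAREHOUSE statement for one team's compute.

    AUTO_SUSPEND pauses the warehouse after the given idle time and
    AUTO_RESUME restarts it on the next query, so idle compute is not billed.
    """
    if not name.isidentifier():
        raise ValueError(f"unsafe warehouse name: {name!r}")
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WITH WAREHOUSE_SIZE = '{size}' "
        f"AUTO_SUSPEND = {auto_suspend_seconds} "
        f"AUTO_RESUME = TRUE"
    )


# Hypothetical teams: reporting gets a small warehouse, data science a large one.
teams = {"reporting_wh": "SMALL", "data_science_wh": "LARGE"}
statements = [create_warehouse_sql(name, size) for name, size in teams.items()]

for statement in statements:
    print(statement + ";")
```

Both warehouses would query the same stored data, while a heavy job on the data science warehouse would not take compute away from the reporting one.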

One of the biggest selling points of Snowflake is that their SD-MPP product is cloud-agnostic: you can set it up on Amazon Web Services, Azure, or Google Cloud Platform [9].

Amazon Redshift

Until December 2019, Redshift would have been considered a typical example of an SN-MPP architecture. Redshift is one of the first cloud data warehouse solutions; it has been on the market since October 2012.

AWS likely noticed that other cloud vendors were offering competing services with an SD-MPP architecture (allowing for massive cost reductions due to the separation of storage and compute), or perhaps they listened to their customers. At first, AWS implemented Redshift Spectrum, a service that provides an additional compute layer to query data directly from S3. This feature lets us create external tables (external because they don’t exist within the data warehouse; they are read from S3) and join them with existing tables in the data warehouse. It offers seamless integration of data between the data warehouse and the data lake, but it doesn’t solve the issue of the SN-MPP architecture that Redshift is based on.

In December 2019, AWS released:


Amazon Redshift Managed Storage, which gives compute nodes shared storage that scales automatically up to 8.2 PB. This storage is based on a combination of S3 and SSDs. AWS completely manages how data is stored and moved between S3 and SSD.
AQUA (Advanced Query Accelerator) for Redshift: includes a hardware-accelerated cache within the Managed Storage layer to speed up operations. According to AWS, this allows Redshift to be 10x faster than any other cloud data warehouse [4]. The slides from re:Invent [5] make this claim; however, they “forgot” to link the source of their performance benchmark, so I couldn’t validate it.
new RA3 instance type: compute nodes from this instance family work together with Redshift Managed Storage.

Amazon Athena

Amazon has another service that can be used for data warehousing and data lakes: Athena. In contrast to Redshift, Athena is a serverless option that combines the Presto engine (compute) with S3 (storage) to query data on demand from an S3-based data lake. You pay per query, plus the cost of the S3 storage.

IBM

IBM Db2 Warehouse on Cloud is a fully managed service with an SD-MPP architecture and some extras, such as AI-based query optimizers:

“Where normal query optimizers may continue to suggest the same query path even after it proves to be less effective than hoped, a machine learning query optimizer can learn from experience, mimicking neural network patterns. This helps it constantly improve as opposed to optimizing at intervals”. [6]


Side note: at the time of writing, IBM claims to offer USD 1,000 in IBM Cloud credit so that you can try out their cloud data warehouse.

SAP Data Warehouse Cloud

SAP took, in some ways, a similar approach to Snowflake, in the sense that they also offer virtual warehouses, which they call “Spaces”:

Spaces […] are isolated and can be assigned quotas for available disc space, CPU usage, runtime hours, and memory usage. [7]


They promise that within those Spaces, you can scale storage and compute independently of each other. However, the storage doesn’t seem to grow elastically, since you are asked to explicitly assign disk space quotas per Space.

Google BigQuery

BigQuery is completely serverless, so how storage and compute work is abstracted away from users. BigQuery scales storage and compute automatically without us having to do anything. Under the hood, it uses a separate distributed storage layer called Colossus and a compute engine called Dremel. Similarly to Amazon Athena, BigQuery uses a per-query pricing model.

The separation of storage and compute within modern data warehouse solutions blurs the lines between data lake and data warehouse

Companies tend to build data lakes to save costs. Data lakes offer practically unlimited storage, and many cloud providers offer additional services to retrieve data from a data lake quickly and efficiently, often through SQL interfaces built on top of it. This way, we obtain a storage layer (the data lake, e.g. on AWS S3) and SQL query engines serving as a (serverless) compute layer to query this data (e.g. Amazon Athena). From an architectural perspective, this is a similar concept to the shared data layer in a data warehouse. Also, using those SQL interfaces built for data lakes often resembles using a data warehouse. In a way, this blurs the line between a data lake and a data warehouse.

Common examples that support this hypothesis:

the open-source Presto and its AWS implementation, Amazon Athena
Upsolver provides a SQL interface for ingesting and transforming data stored in a data lake [8]
the “good old” Apache Hive has provided a SQL interface to data stored on Hadoop since 2010
Snowflake already calls itself a cloud data platform because it combines data warehouse and data lake functionality within one product
Amazon Redshift introduced Redshift Spectrum to make it possible to query the data warehouse and the data lake together within a single service.

Conclusion

In this article, we looked at why the separation of compute from storage is crucial to making your cloud data warehouse and data lake architecture future-proof in a cost-effective way.

We looked at the history of how we arrived at the Shared Data Massively Parallel Processing architecture and how it has been implemented by Snowflake, Amazon, Google, SAP, and IBM.

Finally, we listed the benefits of this approach and concluded that the separation of compute from storage in modern cloud data warehouse solutions blurs the lines between data lake and data warehouse.

Thank you for reading and feel free to follow me for the next articles.


[1] Wikipedia, Shared-nothing architecture: https://en.wikipedia.org/wiki/Shared-nothing_architecture

[2] Preetam Kumar, Cutting the cord: separating data from compute in your data lake with object storage: https://www.ibm.com/cloud/blog/cutting-cord-separating-data-compute-data-lake-object-storage

[3] Snowflake Docs: https://docs.snowflake.com/en/user-guide/intro-key-concepts.html#:~:text=installation%20and%20updates.-,Snowflake%20Architecture,nodes%20in%20the%20data%20warehouse.

[4] AWS Pages: https://pages.awscloud.com/AQUA_Preview.html#:~:text=AQUA%20is%20a%20new%20distributed,to%20compute%20clusters%20for%20processing.

[5] AWS slides from re:Invent, December 2019: https://d1.awsstatic.com/events/reinvent/2019/NEW_LAUNCH_Amazon_Redshift_reimagined_RA3_and_AQUA_ANT230.pdf

[6] IBM Blog: https://www.ibm.com/blogs/journey-to-ai/2020/03/the-technical-advancements-behind-db2/

[7] SAP Blog: https://saphanajourney.com/data-warehouse-cloud/resources/what-are-spaces/

[8] Upsolver: https://www.upsolver.com/data-lake-platform

[9] Snowflake, supported vendors: https://docs.snowflake.com/en/user-guide/intro-cloud-platforms.html

[10] Snowflake Virtual Warehouses: https://www.analytics.today/blog/snowflake-virtual-warehouse
