redshift 数据仓库_您如何使用Amazon Redshift Spectrum访问“暗数据”

redshift 数据仓库

by Lars Kamp

通过拉斯·坎普

您如何使用Amazon Redshift Spectrum访问“暗数据” (How you can access your “dark data” with Amazon Redshift Spectrum)

Amazon’s Simple Storage Service (S3) has been around since 2006. Enterprises have been pumping their data into this data lake at a furious rate. Within 10 years of its birth, S3 stored over 2 trillion objects, each up to 5 terabytes in size. These companies know their data is valuable and worth preserving. But much of this data lies inert, in “cold” data lakes, unavailable for analysis, as so-called “dark data.”

亚马逊的简单存储服务( S3 )自2006年以来一直存在。企业一直在以惊人的速度将数据泵入该数据湖。 S3诞生10年之内,存储了超过2万亿个对象 ,每个对象的大小高达5 TB。 这些公司知道他们的数据很有价值并且值得保留。 但是,这些数据中的大部分都是惰性的,即所谓的“暗数据”,位于“冷”数据湖中,无法进行分析。

分析“暗数据” (Analyzing “Dark Data”)

So what lies below the surface of data lakes? The first thing for organizations to do is to find out what dark data they have accumulated. Then then need to analyze it in search of valuable insights. That means analysts need solutions that allow them to access petabytes of dark data.

那么,数据湖表面之下是什么? 组织要做的第一件事是找出他们积累了哪些暗数据。 然后,需要对其进行分析以寻找有价值的见解。 这意味着分析人员需要允许他们访问PB级暗数据的解决方案。

With Amazon Redshift Spectrum, you can query data in Amazon S3 without first loading it into Amazon Redshift. For nomenclature purposes, I’ll use “Redshift” for “Amazon Redshift,” and “Spectrum” for “Amazon Redshift Spectrum.”

使用Amazon Redshift Spectrum ,您可以在Amazon S3中查询数据,而无需先将其加载到Amazon Redshift中。 出于命名目的,我将“ Redshift”用于“ Amazon Redshift”,将“ Spectrum”用于“ Amazon Redshift Spectrum”。

There are three major existing ways to access and analyze data in S3.

现有的三种主要方法可以访问和分析S3中的数据。

  • Amazon Elastic MapReduce (EMR). EMR uses Hadoop-style queries to access and process large data sets in S3.

    Amazon Elastic MapReduce (EMR)。 EMR使用Hadoop风格的查询来访问和处理S3中的大数据集。

  • Amazon Athena. Athena offers a console to query S3 data with standard SQL and no infrastructure to manage. Athena also has an API.

    亚马逊雅典娜。 Athena提供了一个控制台,可使用标准SQL查询S3数据,而无需管理基础结构。 雅典娜也有一个API

  • Amazon Redshift. You can load data from S3 into an Amazon Redshift cluster for analysis.

    亚马逊Redshift 。 您可以将数据从S3加载到Amazon Redshift集群中进行分析。

So why not use these existing options? For example, companies already use Amazon Redshift to analyze their “hot” data. So why not load that cold data from S3 into Redshift and call it a day?

那么为什么不使用这些现有选项呢? 例如,公司已经在使用Amazon Redshift分析其“热门”数据。 那么,为什么不将来自S3的冷数据加载到Redshift中并称之为一天呢?

There are two main reasons:

主要有两个原因:

  • Effort. Loading data into Amazon Redshift involves extract, transform, and load (ETL) steps. Those steps are necessary to convert and structure data for analysis. Amazon estimates that figuring out the right ETL consumes 70% of an analytics project.

    努力 。 将数据加载到Amazon Redshift中涉及提取,转换和加载(ETL)步骤。 这些步骤对于转换和构建用于分析的数据是必需的。 亚马逊估计找出正确的ETL会消耗70%的分析项目。

  • Cost. You may not even know what data to extract until you have analyzed it a bit. Uploading lots of cold S3 data for analysis requires growing your clusters. That translates to paying more, as Redshift pricing is based on the size of your cluster. Meanwhile, you continue to pay S3 storage charges for retaining your cold data.

    费用 。 您可能甚至不知道要分析什么数据,否则就不知道要提取什么数据。 上载大量冷S3数据进行分析需要扩展群集。 这意味着您需要支付更多,因为Redshift定价基于集群的大小。 同时,您将继续为保留冷数据而支付S3存储费用。

Redshift Spectrum offers the best of both worlds. With Spectrum, you can:

Redshift Spectrum提供了两全其美的优势。 使用Spectrum,您可以:

  • Continue using your analytics applications, with the same queries you’ve written for Redshift.

    继续使用您为Redshift编写的查询使用分析应用程序。
  • Leave cold data as-is in S3, and query it via Amazon Redshift, without ETL processing. That includes joining data from your data lake with data in Redshift, using a single query.

    将冷数据保留在S3中,并通过Amazon Redshift进行查询,而无需ETL处理。 这包括使用单个查询将数据湖中的数据与Redshift中的数据结合在一起。
  • Decouple processing from storage. Because there’s no need to increase cluster size, you can save on Redshift storage.

    使处理与存储脱钩。 由于无需增加群集大小,因此可以节省Redshift存储。
  • Pay only when you run queries against S3 data. Spectrum queries cost a reasonable $5 /terabyte of data processed.

    仅在对S3数据运行查询时付费。 频谱查询的处理数据成本为每TB合理的$ 5。

Spectrum is the “glue” or “bridge” layer that provides Redshift an interface to S3 data. Redshift becomes the access layer for your business applications. Spectrum is the query processing layer for data accessed from S3. The above picture illustrates the relationship between these services.

频谱是为Redshift提供S3数据接口的“胶”或“桥”层。 Redshift成为您的业务应用程序的访问层。 Spectrum是从S3访问的数据的查询处理层。 上图说明了这些服务之间的关系。

仔细查看Redshift Spectrum (A closer look at Redshift Spectrum)

From a deployment perspective, Spectrum is “under the hood.” It’s a group of managed nodes in your VPC, available to any of your Redshift clusters that are Spectrum-enabled. It pushes compute-intensive tasks down to the Redshift Spectrum layer. That layer is independent of your Amazon Redshift cluster.

从部署的角度来看,Spectrum处于“幕后”。 这是VPC中的一组受管节点,可用于任何启用了Spectrum的Redshift群集。 它将计算密集型任务下推到Redshift Spectrum层。 该层独立于您的Amazon Redshift集群。

There are three key concepts to understand how to run queries with Redshift Spectrum:

有三个关键概念来了解如何使用Redshift Spectrum运行查询:

  1. External data catalog

    外部数据目录
  2. External schemas

    外部架构
  3. External tables

    外部表

The external data catalog contains the schema definitions for the data you wish to access in S3. It’s a central metadata repository for your data assets.

外部数据目录包含您要在S3中访问的数据的架构定义。 它是数据资产的中央元数据存储库。

The external schema contains your tables. External tables allow you to query data in S3 using the same SELECT syntax as with other Amazon Redshift tables. External tables are read-only, that is, you can’t write to an external table.

外部架构包含您的表。 外部表允许您使用与其他Amazon Redshift表相同的SELECT语法查询S3中的数据。 外部表是只读的,也就是说,您不能写入外部表。

You can keep writing your usual Redshift queries. The main change with Spectrum is that the queries now also contain a reference to data stored in S3.

您可以继续编写常规的Redshift查询。 Spectrum的主要变化是查询现在还包含对存储在S3中的数据的引用。

连接内部和外部表 (Joining internal and external tables)

The Redshift query engine treats internal and external tables the same way. You can do the typical operations like queries and joins on either type of table or a combination of both. Query an external table and join its data with that from an internal one.

Redshift查询引擎以相同的方式对待内部和外部表。 您可以在表类型或两者的组合上执行诸如查询和联接之类的典型操作。 查询一个外部表,并将其数据与内部表的数据连接起来。

As an example, let’s say you are using Redshift to analyze data of your e-commerce site visitors. What pages they visit, how long they stay, what they buy (or not), and so on. You keep a year’s worth of data in your Redshift clusters. Older data you move to S3.

例如,假设您正在使用Redshift分析电子商务站点访问者的数据。 他们访问哪些页面,停留多长时间,购买(或不购买)等等。 您可以在Redshift群集中保留一年的数据量。 您将旧数据移至S3。

Then you notice an odd seasonal variation. You want to see if this was also true for past years, or if it was an aberration for this year. Luckily you have saved historic clickstream data in S3, going back many years. You can now access that historic data via an external table with Spectrum, and run the same queries you’re running in Amazon Redshift. Or you can create new insights by joining other past data with this year’s data.

然后您会发现一个奇怪的季节性变化。 您想看看过去几年是否也是这样,或者今年是否如此。 幸运的是,您已经在S3中保存了历史点击流数据,可以追溯到很多年前。 您现在可以使用Spectrum通过外部表访问该历史数据,并运行与在Amazon Redshift中运行的查询相同的查询。 或者,您可以通过将其他过去的数据与今年的数据结合起来来创建新见解。

Redshift parses, compiles, and distributes an SQL query to the nodes in a cluster the normal way. The part of the query that references an external data source gets sent to Spectrum. Spectrum processes the relevant data in S3, and sends the result back to Redshift. Redshift collects the partial results from its nodes and Spectrum, concatenates and joins them (and so on), and returns the complete result.

Redshift以正常方式解析,编译SQL查询并将SQL查询分发到群集中的节点。 查询的引用外部数据源的部分将发送到Spectrum。 Spectrum在S3中处理相关数据,并将结果发送回Redshift。 Redshift从其节点和Spectrum收集部分结果,进行串联和联接(依此类推),然后返回完整结果。

摘要 (Summary)

Here are a few points to keep in mind when working with Spectrum:

使用Spectrum时,请牢记以下几点:

  • Your business applications remain unchanged and don’t know how or where a query is running. The only change for the business analyst is when defining access to external tables.

    您的业​​务应用程序保持不变,并且不知道查询的运行方式或位置。 业务分析师的唯一更改是在定义对外部表的访问时。
  • External data remains in S3 — there is no ETL to load it into your Redshift cluster. That decouples your storage layer in S3 from your processing layer with Redshift and Spectrum.

    外部数据保留在S3中-没有ETL可以将其加载到Redshift集群中。 这使Redshift和Spectrum将S3中的存储层与处理层分离。
  • You don’t need to increase the size of your Redshift cluster to process data in S3. You only pay for the S3 data your queries actually access.

    您无需增加Redshift集群的大小即可处理S3中的数据。 您只需为查询实际访问的S3数据付费。
  • Redshift does all the hard work of minimizing the number of Spectrum nodes needed to access the S3 data. It also makes processing between Redshift and Spectrum efficient.

    Redshift进行了所有艰苦的工作,以尽量减少访问S3数据所需的Spectrum节点数。 它还使Redshift和Spectrum之间的处理效率更高。

You should also do the homework to ensure that processing of data in S3 is economical and efficient. You can save on costs and get better performance if you partition the data, compress it, or convert it to columnar formats such as Apache Parquet.

您还应该做功课,以确保S3中的数据处理既经济又高效。 如果对数据进行分区,压缩或将其转换为列格式(例如Apache Parquet),则可以节省成本并获得更好的性能。

In summary, Spectrum adds one more tool to your Redshift-based data warehouse investment. You can now use its power to probe and analyze your data lake on an as-needed basis for a very low per query price.

总之,Spectrum为基于Redshift的数据仓库投资增加了另一种工具。 现在,您可以按需使用它的功能来探查和分析数据湖,而每查询价格却非常低。

I’m the cofounder of intermix.io. If you want to check it out, you can do so here.

我是intermix.io的联合创始人。 如果您想签出,可以在这里进行

Originally published at www.intermix.io.

最初在www.intermix.io上发布。

翻译自: https://www.freecodecamp.org/news/amazon-redshift-spectrum-diving-into-the-data-lake-7532e7e11716/

redshift 数据仓库

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值